date:20180928

Re: [Patch net-next v3] net_sched: change tcf_del_walker() to take idrinfo->lock

2018-09-28 Thread Cong Wang

On Fri, Sep 28, 2018 at 11:29 AM Cong Wang  wrote:
>
> On Fri, Sep 28, 2018 at 11:11 AM Ido Schimmel  wrote:
> > I don't think this will work given the reference count already dropped
> > to 0, which is why the template deletion function was invoked. I didn't
> > test the patch, but I don't see what would prevent the chain from being
> > freed.
>
> Good catch! So I should just "move" tcf_chain_destroy() into cls_flower.
>
> A patch is being compiled now.

Just FYI, the crash is fixed, but another RCU warning pops up after
my fix. So I am still debugging it, it is related to the kfree_rcu() in
tcf_chain_destroy().

Re: [PATCH v2] net/ncsi: Add NCSI OEM command support

2018-09-28 Thread Vijay Khemka



> On 9/28/18, 6:07 PM, "Vijay Khemka"  wrote:

 >   This patch adds OEM commands and response handling. It also defines OEM
 >   command and response structure as per NCSI specification along with its
 >   handlers.
 >   
 >   ncsi_cmd_handler_oem: This is a generic command request handler for OEM
 >   commands
 >   ncsi_rsp_handler_oem: This is a generic response handler for OEM commands
   
  This is a generic patch for OEM command handling, There will be another patch 
  following this to handle specific OEM commands for each vendor. Currently I 
have
  defined 2 vendor/manufacturer id below in internal.h, more can be added here 
for
  other vendors. I have not defined ncsi_rsp_oem_handler in this patch as they 
are 
  NULL, but there will be a defined handlers for each vendor in next patch. 
 
Signed-off-by: Vijay Khemka 
---
 net/ncsi/internal.h |  4 
 net/ncsi/ncsi-cmd.c | 31 ---
 net/ncsi/ncsi-pkt.h | 16 
 net/ncsi/ncsi-rsp.c | 44 +++-
 4 files changed, 91 insertions(+), 4 deletions(-)

diff --git a/net/ncsi/internal.h b/net/ncsi/internal.h
index 8055e3965cef..c16cb7223064 100644
--- a/net/ncsi/internal.h
+++ b/net/ncsi/internal.h
@@ -68,6 +68,10 @@ enum {
NCSI_MODE_MAX
 };
 
+/* OEM Vendor Manufacture ID */
+#define NCSI_OEM_MFR_MLX_ID 0x8119
+#define NCSI_OEM_MFR_BCM_ID 0x113d
+
 struct ncsi_channel_version {
u32 version;/* Supported BCD encoded NCSI version */
u32 alpha2; /* Supported BCD encoded NCSI version */
diff --git a/net/ncsi/ncsi-cmd.c b/net/ncsi/ncsi-cmd.c
index 7567ca63aae2..2f98533eba46 100644
--- a/net/ncsi/ncsi-cmd.c
+++ b/net/ncsi/ncsi-cmd.c
@@ -211,6 +211,26 @@ static int ncsi_cmd_handler_snfc(struct sk_buff *skb,
return 0;
 }
 
+static int ncsi_cmd_handler_oem(struct sk_buff *skb,
+   struct ncsi_cmd_arg *nca)
+{
+   struct ncsi_cmd_oem_pkt *cmd;
+   unsigned int len;
+
+   len = sizeof(struct ncsi_cmd_pkt_hdr) + 4;
+   if (nca->payload < 26)
+   len += 26;
+   else
+   len += nca->payload;
+
+   cmd = skb_put_zero(skb, len);
+   cmd->mfr_id = nca->dwords[0];
+   memcpy(cmd->data, >dwords[1], nca->payload - 4);
+   ncsi_cmd_build_header(>cmd.common, nca);
+
+   return 0;
+}
+
 static struct ncsi_cmd_handler {
unsigned char type;
int   payload;
@@ -244,7 +264,7 @@ static struct ncsi_cmd_handler {
{ NCSI_PKT_CMD_GNS,0, ncsi_cmd_handler_default },
{ NCSI_PKT_CMD_GNPTS,  0, ncsi_cmd_handler_default },
{ NCSI_PKT_CMD_GPS,0, ncsi_cmd_handler_default },
-   { NCSI_PKT_CMD_OEM,0, NULL },
+   { NCSI_PKT_CMD_OEM,   -1, ncsi_cmd_handler_oem },
{ NCSI_PKT_CMD_PLDM,   0, NULL },
{ NCSI_PKT_CMD_GPUUID, 0, ncsi_cmd_handler_default }
 };
@@ -316,8 +336,13 @@ int ncsi_xmit_cmd(struct ncsi_cmd_arg *nca)
return -ENOENT;
}
 
-   /* Get packet payload length and allocate the request */
-   nca->payload = nch->payload;
+   /* Get packet payload length and allocate the request
+* It is expected that if length set as negative in
+* handler structure means caller is initializing it
+* and setting length in nca before calling xmit function
+*/
+   if (nch->payload >= 0)
+   nca->payload = nch->payload;
nr = ncsi_alloc_command(nca);
if (!nr)
return -ENOMEM;
diff --git a/net/ncsi/ncsi-pkt.h b/net/ncsi/ncsi-pkt.h
index 91b4b66438df..1f338386810d 100644
--- a/net/ncsi/ncsi-pkt.h
+++ b/net/ncsi/ncsi-pkt.h
@@ -151,6 +151,22 @@ struct ncsi_cmd_snfc_pkt {
unsigned char   pad[22];
 };
 
+/* OEM Request Command as per NCSI Specification */
+struct ncsi_cmd_oem_pkt {
+   struct ncsi_cmd_pkt_hdr cmd; /* Command header*/
+   __be32  mfr_id;  /* Manufacture ID*/
+   unsigned char   data[64];/* OEM Payload Data  */
+   __be32  checksum;/* Checksum  */
+};
+
+/* OEM Response Packet as per NCSI Specification */
+struct ncsi_rsp_oem_pkt {
+   struct ncsi_rsp_pkt_hdr rsp; /* Command header*/
+   __be32  mfr_id;  /* Manufacture ID*/
+   unsigned char   data[64];/* Payload data  */
+   __be32  checksum;/* Checksum  */
+};
+
 /* Get Link Status */
 struct ncsi_rsp_gls_pkt {
struct ncsi_rsp_pkt_hdr rsp;/* Response header   */
diff --git a/net/ncsi/ncsi-rsp.c b/net/ncsi/ncsi-rsp.c
index

[PATCH iproute2 net-next v1 5/6] tc: Add support for configuring the taprio scheduler

2018-09-28 Thread Vinicius Costa Gomes

This traffic scheduler allows traffic classes states (transmission
allowed/not allowed, in the simplest case) to be scheduled, according
to a pre-generated time sequence. This is the basis of the IEEE
802.1Qbv specification.

Example configuration:

tc qdisc replace dev enp3s0 parent root handle 100 taprio \
  num_tc 3 \
  map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
  queues 1@0 1@1 2@2 \
  base-time 1528743495910289987 \
  sched-entry S 01 30 \
  sched-entry S 02 30 \
  sched-entry S 04 30 \
  clockid CLOCK_TAI

The configuration format is similar to mqprio. The main difference is
the presence of a schedule, built by multiple "sched-entry"
definitions, each entry has the following format:

 sched-entry   

The only supported  is "S", which means "SetGateStates",
following the IEEE 802.1Qbv-2015 definition (Table 8-6). 
is a bitmask where each bit is a associated with a traffic class, so
bit 0 (the least significant bit) being "on" means that traffic class
0 is "active" for that schedule entry.  is a time duration
in nanoseconds that specifies for how long that state defined by 
and  should be held before moving to the next entry.

This schedule is circular, that is, after the last entry is executed
it starts from the first one, indefinitely.

The other parameters can be defined as follows:

 - base-time: specifies the instant when the schedule starts, if
  'base-time' is a time in the past, the schedule will start at

  base-time + (N * cycle-time)

   where N is the smallest integer so the resulting time is greater
   than "now", and "cycle-time" is the sum of all the intervals of the
   entries in the schedule;

 - clockid: specifies the reference clock to be used;

The parameters should be similar to what the IEEE 802.1Q family of
specification defines.

Signed-off-by: Vinicius Costa Gomes 
Signed-off-by: Jesus Sanchez-Palencia 
---
 tc/Makefile   |   1 +
 tc/q_taprio.c | 410 ++
 2 files changed, 411 insertions(+)
 create mode 100644 tc/q_taprio.c

diff --git a/tc/Makefile b/tc/Makefile
index 5a1a7ff9..25a28284 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -74,6 +74,7 @@ TCMODULES += e_bpf.o
 TCMODULES += f_matchall.o
 TCMODULES += q_cbs.o
 TCMODULES += q_etf.o
+TCMODULES += q_taprio.o
 
 TCSO :=
 ifeq ($(TC_CONFIG_ATM),y)
diff --git a/tc/q_taprio.c b/tc/q_taprio.c
new file mode 100644
index ..645a7613
--- /dev/null
+++ b/tc/q_taprio.c
@@ -0,0 +1,410 @@
+/*
+ * q_taprio.c  Time Aware Priority Scheduler
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:Vinicius Costa Gomes 
+ * Jesus Sanchez-Palencia 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "utils.h"
+#include "tc_util.h"
+#include "list.h"
+
+struct sched_entry {
+   struct list_head list;
+   uint32_t index;
+   uint32_t interval;
+   uint32_t gatemask;
+   uint8_t cmd;
+};
+
+#define CLOCKID_INVALID (-1)
+static const struct static_clockid {
+   const char *name;
+   clockid_t clockid;
+} clockids_sysv[] = {
+   { "REALTIME", CLOCK_REALTIME },
+   { "TAI", CLOCK_TAI },
+   { "BOOTTIME", CLOCK_BOOTTIME },
+   { "MONOTONIC", CLOCK_MONOTONIC },
+   { NULL }
+};
+
+static void explain(void)
+{
+   fprintf(stderr, "Usage: ... taprio clockid CLOCKID\n");
+   fprintf(stderr, "  [num_tc NUMBER] [map P0 P1 ...] ");
+   fprintf(stderr, "  [queues COUNT@OFFSET COUNT@OFFSET 
COUNT@OFFSET ...] ");
+   fprintf(stderr, "  [ [sched-entry index cmd gate-mask 
interval] ... ] ");
+   fprintf(stderr, "  [base-time time] ");
+   fprintf(stderr, "\nCLOCKID must be a valid SYS-V id (i.e. CLOCK_TAI)");
+   fprintf(stderr, "\n");
+}
+
+static void explain_clockid(const char *val)
+{
+   fprintf(stderr, "taprio: illegal value for \"clockid\": \"%s\".\n", 
val);
+   fprintf(stderr, "It must be a valid SYS-V id (i.e. CLOCK_TAI)\n");
+}
+
+static int get_clockid(__s32 *val, const char *arg)
+{
+   const struct static_clockid *c;
+
+   /* Drop the CLOCK_ prefix if that is being used. */
+   if (strcasestr(arg, "CLOCK_") != NULL)
+   arg += sizeof("CLOCK_") - 1;
+
+   for (c = clockids_sysv; c->name; c++) {
+   if (strcasecmp(c->name, arg) == 0) {
+   *val = c->clockid;
+
+   return 0;
+   }
+   }
+
+   return -1;
+}
+
+static const char* get_clock_name(clockid_t clockid)
+{
+   const struct static_clockid *c;
+
+

[PATCH iproute2 net-next v1 1/6] utils: Implement get_s64()

2018-09-28 Thread Vinicius Costa Gomes

Add this helper to read signed 64-bit integers from a string.

Signed-off-by: Vinicius Costa Gomes 
---
 include/utils.h |  1 +
 lib/utils.c | 21 +
 2 files changed, 22 insertions(+)

diff --git a/include/utils.h b/include/utils.h
index 8cb4349e..58574a05 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -139,6 +139,7 @@ int get_time_rtt(unsigned *val, const char *arg, int *raw);
 #define get_byte get_u8
 #define get_ushort get_u16
 #define get_short get_s16
+int get_s64(__s64 *val, const char *arg, int base);
 int get_u64(__u64 *val, const char *arg, int base);
 int get_u32(__u32 *val, const char *arg, int base);
 int get_s32(__s32 *val, const char *arg, int base);
diff --git a/lib/utils.c b/lib/utils.c
index e87ecf31..1b84b801 100644
--- a/lib/utils.c
+++ b/lib/utils.c
@@ -383,6 +383,27 @@ int get_u8(__u8 *val, const char *arg, int base)
return 0;
 }
 
+int get_s64(__s64 *val, const char *arg, int base)
+{
+   long res;
+   char *ptr;
+
+   errno = 0;
+
+   if (!arg || !*arg)
+   return -1;
+   res = strtoll(arg, , base);
+   if (!ptr || ptr == arg || *ptr)
+   return -1;
+   if ((res == LLONG_MIN || res == LLONG_MAX) && errno == ERANGE)
+   return -1;
+   if (res > INT64_MAX || res < INT64_MIN)
+   return -1;
+
+   *val = res;
+   return 0;
+}
+
 int get_s32(__s32 *val, const char *arg, int base)
 {
long res;
-- 
2.19.0

[PATCH iproute2 net-next v1 2/6] include: Add helper to retrieve a __s64 from a netlink msg

2018-09-28 Thread Vinicius Costa Gomes

This allows signed 64-bit integers to be retrieved from a netlink
message.
---
 include/libnetlink.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/include/libnetlink.h b/include/libnetlink.h
index 9d9249e6..88164975 100644
--- a/include/libnetlink.h
+++ b/include/libnetlink.h
@@ -185,6 +185,13 @@ static inline __u64 rta_getattr_u64(const struct rtattr 
*rta)
memcpy(, RTA_DATA(rta), sizeof(__u64));
return tmp;
 }
+static inline __s64 rta_getattr_s64(const struct rtattr *rta)
+{
+   __s64 tmp;
+
+   memcpy(, RTA_DATA(rta), sizeof(__s64));
+   return tmp;
+}
 static inline const char *rta_getattr_str(const struct rtattr *rta)
 {
return (const char *)RTA_DATA(rta);
-- 
2.19.0

[PATCH iproute2 net-next v1 0/6] introduce the taprio scheduler

2018-09-28 Thread Vinicius Costa Gomes

Hi,

Changes from RFC:
  - Removed support for the sched-file parameter, in favor of
supporting the batch mode feature;

This is the iproute2 side of the taprio v1 series.

Please see the kernel side cover letter for more information about how
to test this.

Cheers,
--
Vinicius


Jesus Sanchez-Palencia (1):
  libnetlink: Add helper for getting a __s32 from netlink msgs

Vinicius Costa Gomes (5):
  utils: Implement get_s64()
  include: Add helper to retrieve a __s64 from a netlink msg
  include: add definitions for taprio [DO NOT COMMIT]
  tc: Add support for configuring the taprio scheduler
  taprio: Add manpage for tc-taprio(8)

 include/libnetlink.h   |  11 +
 include/uapi/linux/pkt_sched.h |  52 -
 include/utils.h|   1 +
 lib/utils.c|  21 ++
 man/man8/tc-taprio.8   | 142 
 tc/Makefile|   1 +
 tc/q_taprio.c  | 410 +
 7 files changed, 635 insertions(+), 3 deletions(-)
 create mode 100644 man/man8/tc-taprio.8
 create mode 100644 tc/q_taprio.c

-- 
2.19.0

[PATCH iproute2 net-next v1 3/6] libnetlink: Add helper for getting a __s32 from netlink msgs

2018-09-28 Thread Vinicius Costa Gomes

From: Jesus Sanchez-Palencia 

This function retrieves a signed 32-bit integer from a netlink message
and returns it.

Signed-off-by: Jesus Sanchez-Palencia 
---
 include/libnetlink.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/libnetlink.h b/include/libnetlink.h
index 88164975..79ba793e 100644
--- a/include/libnetlink.h
+++ b/include/libnetlink.h
@@ -185,6 +185,10 @@ static inline __u64 rta_getattr_u64(const struct rtattr 
*rta)
memcpy(, RTA_DATA(rta), sizeof(__u64));
return tmp;
 }
+static inline __s32 rta_getattr_s32(const struct rtattr *rta)
+{
+   return *(__s32 *)RTA_DATA(rta);
+}
 static inline __s64 rta_getattr_s64(const struct rtattr *rta)
 {
__s64 tmp;
-- 
2.19.0

[PATCH iproute2 net-next v1 6/6] taprio: Add manpage for tc-taprio(8)

2018-09-28 Thread Vinicius Costa Gomes

This documents the parameters and provides an example of usage.

Signed-off-by: Vinicius Costa Gomes 
---
 man/man8/tc-taprio.8 | 142 +++
 1 file changed, 142 insertions(+)
 create mode 100644 man/man8/tc-taprio.8

diff --git a/man/man8/tc-taprio.8 b/man/man8/tc-taprio.8
new file mode 100644
index ..92055b43
--- /dev/null
+++ b/man/man8/tc-taprio.8
@@ -0,0 +1,142 @@
+.TH TAPRIO 8 "25 Sept 2018" "iproute2" "Linux"
+.SH NAME
+TAPRIO \- Time Aware Priority Shaper
+.SH SYNOPSIS
+.B tc qdisc ... dev
+dev
+.B parent
+classid
+.B [ handle
+major:
+.B ] taprio num_tc
+tcs
+.ti +8
+.B map
+P0 P1 P2 ...
+.B queues
+count1@offset1 count2@offset2 ...
+.ti +8
+.B base-time
+base-time
+.B clockid
+clockid
+.ti +8
+.B sched-entry
+  
+.ti +8
+.B sched-entry
+  
+.ti +8
+.B sched-entry
+  
+.ti +8
+.B sched-entry
+  
+
+.SH DESCRIPTION
+The TAPRIO qdisc implements a simplified version of the scheduling
+state machine defined by IEEE 802.1Q-2018 Section 8.6.9, which allows
+configuration of a sequence of gate states, where each gate state
+allows outgoing traffic for a subset (potentially empty) of traffic
+classes.
+
+How traffic is mapped to different hardware queues is similar to
+.BR mqprio(8)
+and so the
+.B map
+and
+.Q queues
+parameters have the same meaning.
+
+The other parameters specify the schedule, and at what point in time
+it should start (it can behave as the schedule started in the past).
+
+.SH PARAMETERS
+.TP
+num_tc
+.BR
+Number of traffic classes to use. Up to 16 classes supported.
+
+.TP
+map
+.br
+The priority to traffic class map. Maps priorities 0..15 to a specified
+traffic class. See
+.BR mqprio(8)
+for more details.
+
+.TP
+queues
+.br
+Provide count and offset of queue range for each traffic class. In the
+format,
+.B count@offset.
+Queue ranges for each traffic classes cannot overlap and must be a
+contiguous range of queues.
+
+.TP
+base-time
+.br
+Specifies the instant in nanoseconds, using the reference of
+.B clockid,
+defining the time when the schedule starts. If 'base-time' is a time
+in the past, the schedule will start at
+
+base-time + (N * cycle-time)
+
+where N is the smallest integer so the resulting time is greater than
+"now", and "cycle-time" is the sum of all the intervals of the entries
+in the schedule;
+
+.TP
+clockid
+.br
+Specifies the clock to be used by qdisc's internal timer for measuring
+time and scheduling events.
+
+.TP
+sched-entry
+.br
+There may multiple
+.B sched-entry
+parameters in a single schedule. Each one has the
+
+sched-entry   
+
+format. The only supported  is "S", which
+means "SetGateStates", following the IEEE 802.1Q-2018 definition
+(Table 8-7).  is a bitmask where each bit is a associated
+with a traffic class, so bit 0 (the least significant bit) being "on"
+means that traffic class 0 is "active" for that schedule entry.
+ is a time duration, in nanoseconds, that specifies for how
+long that state defined by  and  should be held
+before moving to the next entry.
+
+.SH EXAMPLES
+
+The following example shows how an traffic schedule with three traffic
+classes ("num_tc 3"), which are separated different traffic classes,
+we are going to call these TC 0, TC 1 and TC 2. We could read the
+"map" parameter below as: traffic with priority 3 is classified as TC
+0, priority 2 is classified as TC 1 and the rest is classified as TC
+2.
+
+The schedule will start at instant 1528743495910289987 using the
+reference CLOCK_TAI. The schedule is composed of three entries each of
+300us duration.
+
+.EX
+# tc qdisc replace dev eth0 parent root handle 100 taprio \\
+  num_tc 3 \\
+  map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \\
+  queues 1@0 1@1 2@2 \\
+  base-time 1528743495910289987 \\
+  sched-entry S 01 30 \\
+  sched-entry S 02 30 \\
+  sched-entry S 04 30 \\
+  clockid CLOCK_TAI
+.EE
+
+
+.SH AUTHORS
+Vinicius Costa Gomes 
-- 
2.19.0

[PATCH iproute2 net-next v1 4/6] include: add definitions for taprio [DO NOT COMMIT]

2018-09-28 Thread Vinicius Costa Gomes

DO NOT COMMIT

This patch exists only to ease the testing, until this header is
updated with the definitions from the kernel.

Signed-off-by: Vinicius Costa Gomes 
---
 include/uapi/linux/pkt_sched.h | 52 --
 1 file changed, 49 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 8975fd1a..89ee47c2 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -395,9 +395,9 @@ enum {
 struct tc_htb_xstats {
__u32 lends;
__u32 borrows;
-   __u32 giants;   /* too big packets (rate will not be accurate) */
-   __u32 tokens;
-   __u32 ctokens;
+   __u32 giants;   /* unused since 'Make HTB scheduler work with TSO.' */
+   __s32 tokens;
+   __s32 ctokens;
 };
 
 /* HFSC section */
@@ -1084,4 +1084,50 @@ enum {
CAKE_ATM_MAX
 };
 
+
+/* TAPRIO */
+enum {
+   TC_TAPRIO_CMD_SET_GATES = 0x00,
+   TC_TAPRIO_CMD_SET_AND_HOLD = 0x01,
+   TC_TAPRIO_CMD_SET_AND_RELEASE = 0x02,
+};
+
+enum {
+   TCA_TAPRIO_SCHED_ENTRY_UNSPEC,
+   TCA_TAPRIO_SCHED_ENTRY_INDEX, /* u32 */
+   TCA_TAPRIO_SCHED_ENTRY_CMD, /* u8 */
+   TCA_TAPRIO_SCHED_ENTRY_GATE_MASK, /* u32 */
+   TCA_TAPRIO_SCHED_ENTRY_INTERVAL, /* u32 */
+   __TCA_TAPRIO_SCHED_ENTRY_MAX,
+};
+#define TCA_TAPRIO_SCHED_ENTRY_MAX (__TCA_TAPRIO_SCHED_ENTRY_MAX - 1)
+
+/* The format for schedule entry list is:
+ * [TCA_TAPRIO_SCHED_ENTRY_LIST]
+ *   [TCA_TAPRIO_SCHED_ENTRY]
+ * [TCA_TAPRIO_SCHED_ENTRY_CMD]
+ * [TCA_TAPRIO_SCHED_ENTRY_GATES]
+ * [TCA_TAPRIO_SCHED_ENTRY_INTERVAL]
+ */
+enum {
+   TCA_TAPRIO_SCHED_UNSPEC,
+   TCA_TAPRIO_SCHED_ENTRY,
+   __TCA_TAPRIO_SCHED_MAX,
+};
+
+#define TCA_TAPRIO_SCHED_MAX (__TCA_TAPRIO_SCHED_MAX - 1)
+
+enum {
+   TCA_TAPRIO_ATTR_UNSPEC,
+   TCA_TAPRIO_ATTR_PRIOMAP, /* struct tc_mqprio_qopt */
+   TCA_TAPRIO_ATTR_SCHED_ENTRY_LIST, /* nested of entry */
+   TCA_TAPRIO_ATTR_SCHED_BASE_TIME, /* s64 */
+   TCA_TAPRIO_ATTR_SCHED_SINGLE_ENTRY, /* single entry */
+   TCA_TAPRIO_ATTR_SCHED_CLOCKID, /* s32 */
+   TCA_TAPRIO_PAD,
+   __TCA_TAPRIO_ATTR_MAX,
+};
+
+#define TCA_TAPRIO_ATTR_MAX (__TCA_TAPRIO_ATTR_MAX - 1)
+
 #endif
-- 
2.19.0

[PATCH net-next v1 1/1] tc: Add support for configuring the taprio scheduler

2018-09-28 Thread Vinicius Costa Gomes

This traffic scheduler allows traffic classes states (transmission
allowed/not allowed, in the simplest case) to be scheduled, according
to a pre-generated time sequence. This is the basis of the IEEE
802.1Qbv specification.

Example configuration:

tc qdisc replace dev enp3s0 parent root handle 100 taprio \
  num_tc 3 \
  map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
  queues 1@0 1@1 2@2 \
  base-time 1528743495910289987 \
  sched-entry S 01 30 \
  sched-entry S 02 30 \
  sched-entry S 04 30 \
  clockid CLOCK_TAI

The configuration format is similar to mqprio. The main difference is
the presence of a schedule, built by multiple "sched-entry"
definitions, each entry has the following format:

 sched-entry   

The only supported  is "S", which means "SetGateStates",
following the IEEE 802.1Qbv-2015 definition (Table 8-6). 
is a bitmask where each bit is a associated with a traffic class, so
bit 0 (the least significant bit) being "on" means that traffic class
0 is "active" for that schedule entry.  is a time duration
in nanoseconds that specifies for how long that state defined by 
and  should be held before moving to the next entry.

This schedule is circular, that is, after the last entry is executed
it starts from the first one, indefinitely.

The other parameters can be defined as follows:

 - base-time: specifies the instant when the schedule starts, if
  'base-time' is a time in the past, the schedule will start at

  base-time + (N * cycle-time)

   where N is the smallest integer so the resulting time is greater
   than "now", and "cycle-time" is the sum of all the intervals of the
   entries in the schedule;

 - clockid: specifies the reference clock to be used;

The parameters should be similar to what the IEEE 802.1Q family of
specification defines.

Signed-off-by: Vinicius Costa Gomes 
---
 include/uapi/linux/pkt_sched.h |  46 ++
 net/sched/Kconfig  |  11 +
 net/sched/Makefile |   1 +
 net/sched/sch_taprio.c | 962 +
 4 files changed, 1020 insertions(+)
 create mode 100644 net/sched/sch_taprio.c

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index e9b7244ac381..89ee47c2f17d 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1084,4 +1084,50 @@ enum {
CAKE_ATM_MAX
 };
 
+
+/* TAPRIO */
+enum {
+   TC_TAPRIO_CMD_SET_GATES = 0x00,
+   TC_TAPRIO_CMD_SET_AND_HOLD = 0x01,
+   TC_TAPRIO_CMD_SET_AND_RELEASE = 0x02,
+};
+
+enum {
+   TCA_TAPRIO_SCHED_ENTRY_UNSPEC,
+   TCA_TAPRIO_SCHED_ENTRY_INDEX, /* u32 */
+   TCA_TAPRIO_SCHED_ENTRY_CMD, /* u8 */
+   TCA_TAPRIO_SCHED_ENTRY_GATE_MASK, /* u32 */
+   TCA_TAPRIO_SCHED_ENTRY_INTERVAL, /* u32 */
+   __TCA_TAPRIO_SCHED_ENTRY_MAX,
+};
+#define TCA_TAPRIO_SCHED_ENTRY_MAX (__TCA_TAPRIO_SCHED_ENTRY_MAX - 1)
+
+/* The format for schedule entry list is:
+ * [TCA_TAPRIO_SCHED_ENTRY_LIST]
+ *   [TCA_TAPRIO_SCHED_ENTRY]
+ * [TCA_TAPRIO_SCHED_ENTRY_CMD]
+ * [TCA_TAPRIO_SCHED_ENTRY_GATES]
+ * [TCA_TAPRIO_SCHED_ENTRY_INTERVAL]
+ */
+enum {
+   TCA_TAPRIO_SCHED_UNSPEC,
+   TCA_TAPRIO_SCHED_ENTRY,
+   __TCA_TAPRIO_SCHED_MAX,
+};
+
+#define TCA_TAPRIO_SCHED_MAX (__TCA_TAPRIO_SCHED_MAX - 1)
+
+enum {
+   TCA_TAPRIO_ATTR_UNSPEC,
+   TCA_TAPRIO_ATTR_PRIOMAP, /* struct tc_mqprio_qopt */
+   TCA_TAPRIO_ATTR_SCHED_ENTRY_LIST, /* nested of entry */
+   TCA_TAPRIO_ATTR_SCHED_BASE_TIME, /* s64 */
+   TCA_TAPRIO_ATTR_SCHED_SINGLE_ENTRY, /* single entry */
+   TCA_TAPRIO_ATTR_SCHED_CLOCKID, /* s32 */
+   TCA_TAPRIO_PAD,
+   __TCA_TAPRIO_ATTR_MAX,
+};
+
+#define TCA_TAPRIO_ATTR_MAX (__TCA_TAPRIO_ATTR_MAX - 1)
+
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index e95741388311..1b9afdee5ba9 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -194,6 +194,17 @@ config NET_SCH_ETF
  To compile this code as a module, choose M here: the
  module will be called sch_etf.
 
+config NET_SCH_TAPRIO
+   tristate "Time Aware Priority (taprio) Scheduler"
+   help
+ Say Y here if you want to use the Time Aware Priority (taprio) packet
+ scheduling algorithm.
+
+ See the top of  for more details.
+
+ To compile this code as a module, choose M here: the
+ module will be called sch_taprio.
+
 config NET_SCH_GRED
tristate "Generic Random Early Detection (GRED)"
---help---
diff --git a/net/sched/Makefile b/net/sched/Makefile
index f0403f49edcb..8a40431d7b5c 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -57,6 +57,7 @@ obj-$(CONFIG_NET_SCH_HHF) += sch_hhf.o
 obj-$(CONFIG_NET_SCH_PIE)  += sch_pie.o
 obj-$(CONFIG_NET_SCH_CBS)  += sch_cbs.o
 obj-$(CONFIG_NET_SCH_ETF)  += sch_etf.o
+obj-$(CONFIG_NET_SCH_TAPRIO)   += sch_taprio.o
 
 obj-$(CONFIG_NET_CLS_U32)  += cls_u32.o

[PATCH net-next v1 0/1] net/sched: Introduce the taprio scheduler

2018-09-28 Thread Vinicius Costa Gomes

Hi,

Changes from the RFC:
  - Moved some fields from the per-qdisc data structure to the per
schedule entry one, mainly "expires" (now called "close_time",
when an entry ends) and "budget" (how many bytes can be sent
during an entry);

  - Removed support for the schedule file, in favour of using iproute2
batch mode (only affects the iproute2 patches) (Jiri Pirko,
Stephen Hemminger);

  - Removed support for manually setting a cycle-time (it will be
added in a later series);


Original cover letter
=
(lightly edited, updated references and usage)


This series provides a set of interfaces that can be used by
applications that require (time-based) Scheduled Transmission of
packets. It is comprised by 3 new components to the kernel:

  - etf: the per-queue TxTime-Based scheduling qdisc;
  - taprio: the per-port Time-Aware scheduler qdisc;
  - SO_TXTIME: a socket option + cmsg APIs.

ETF and SO_TXTIME are already applied[1] into the net-next tree. This
is the remaining piece.

Overview


The CBS qdisc proposal RFC [2] included some rough ideas about the
design and API of a "taprio" (Time Aware Priority) qdisc. The idea of
presenting the taprio ideas at that point (almost one year ago!) was
to show our vision of how things would fit together going forward.
>From that concept stage to this (almost) realised stage the main
differences are:

  - As of now, taprio is a software only implementation of a schedule
executor;
  - Instead of taprio centralising all the time based decisions, taprio
can work together with ETF (the Earliest TxTime First), a qdisc
meant to use the LaunchTime (or similar) feature of various network
controllers;

In a nutshell, taprio is a root qdisc that can execute a pre-defined
schedule, etf is a qdisc that provides time based admission control
and "earliest deadline first" dequeue mode, and SO_TXTIME is a socket
option that is used for enabling a socket to be used for time-based
Tx and configuring its parameters.

taprio
==

This scheduler allows the network administrator to configure schedules
for classes of traffic, the configuration interface is similar to what
IEEE 802.1Q-2018 defines.

Example configuration:

$ tc qdisc add dev enp2s0 parent root handle 100 taprio \
num_tc 3 \
map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
queues 1@0 1@1 2@2 \
sched-entry S 01 30 \
sched-entry S 02 30 \
sched-entry S 04 30 \
base-time 1528743495910289987 \
clockid CLOCK_TAI

This qdisc borrows a few concepts from mqprio and so, most the
parameters are similar to mqprio. The main difference is the sequence of 
'sched-entry' parameters, that constitute one schedule:

  sched-entry S 01 30
  sched-entry S 02 30
  sched-entry S 04 30

The format of each entry is:
sched-entry   

The only supported  is "S", which means "SetGateStates",
following the IEEE 802.1Q-2018 definition (Table 8-7). 
is a bit-mask where each bit is a associated with a traffic class, so
bit 0 (the least significant bit) being "on" means that traffic class
0 is "active" for that schedule entry.  is a time duration
in nanoseconds that specifies for how long that state defined by 
and  should be held before moving to the next entry.

This schedule is circular, that is, after the last entry is executed
it starts from the first one, indefinitely.

The other parameters can be defined as follows:
  - base-time: allows that multiple systems can have synchronised
schedules, it specifies the instant when the schedule starts;
  - clockid: specifies the reference clock to be used;

A more complete example can be found here, with instructions of how to
test it:

https://gist.github.com/jeez/bd3afeff081ba64a695008dd8215866f [3]

The basic design of the scheduler is simple, after we calculate the
first expiration of the hrtimer, we set the next expiration to be the
previous plus the current entry's interval. At each time the function
runs, we set the current_entry, which has a gate_mask (that controls
which traffic classes are allowed to "go out" during each interval),
and we reuse this callback to "kick" the qdisc (this is the reason
that the usual qdisc watchdog isn't used).


Future work
===

  - Add support for multiple schedules, so something like the Admin
and Oper schedules from IEEE 802.1Q-2018 can be implemented,
probably "cycle-time" will be re-implemented at this time;

  - Add support for HW offloading;

  - Add support for Frame Preemption related commands (formerly
802.1Qbu, now part of 802.1Q);

Known Issues


  - As taprio is a software only implementation, and there's another
layer of queuing in the network controller, packets can still
leave the controller outside their "correct" windows. This happens
mostly for low-priority classes, and only if they are

[PATCH v2] net/ncsi: Add NCSI OEM command support

2018-09-28 Thread Vijay Khemka

This patch adds OEM commands and response handling. It also defines OEM
command and response structure as per NCSI specification along with its
handlers.

ncsi_cmd_handler_oem: This is a generic command request handler for OEM
commands
ncsi_rsp_handler_oem: This is a generic response handler for OEM commands

Signed-off-by: Vijay Khemka 
---
 net/ncsi/internal.h |  4 
 net/ncsi/ncsi-cmd.c | 31 ---
 net/ncsi/ncsi-pkt.h | 16 
 net/ncsi/ncsi-rsp.c | 44 +++-
 4 files changed, 91 insertions(+), 4 deletions(-)

diff --git a/net/ncsi/internal.h b/net/ncsi/internal.h
index 8055e3965cef..c16cb7223064 100644
--- a/net/ncsi/internal.h
+++ b/net/ncsi/internal.h
@@ -68,6 +68,10 @@ enum {
NCSI_MODE_MAX
 };
 
+/* OEM Vendor Manufacture ID */
+#define NCSI_OEM_MFR_MLX_ID 0x8119
+#define NCSI_OEM_MFR_BCM_ID 0x113d
+
 struct ncsi_channel_version {
u32 version;/* Supported BCD encoded NCSI version */
u32 alpha2; /* Supported BCD encoded NCSI version */
diff --git a/net/ncsi/ncsi-cmd.c b/net/ncsi/ncsi-cmd.c
index 7567ca63aae2..2f98533eba46 100644
--- a/net/ncsi/ncsi-cmd.c
+++ b/net/ncsi/ncsi-cmd.c
@@ -211,6 +211,26 @@ static int ncsi_cmd_handler_snfc(struct sk_buff *skb,
return 0;
 }
 
+static int ncsi_cmd_handler_oem(struct sk_buff *skb,
+   struct ncsi_cmd_arg *nca)
+{
+   struct ncsi_cmd_oem_pkt *cmd;
+   unsigned int len;
+
+   len = sizeof(struct ncsi_cmd_pkt_hdr) + 4;
+   if (nca->payload < 26)
+   len += 26;
+   else
+   len += nca->payload;
+
+   cmd = skb_put_zero(skb, len);
+   cmd->mfr_id = nca->dwords[0];
+   memcpy(cmd->data, >dwords[1], nca->payload - 4);
+   ncsi_cmd_build_header(>cmd.common, nca);
+
+   return 0;
+}
+
 static struct ncsi_cmd_handler {
unsigned char type;
int   payload;
@@ -244,7 +264,7 @@ static struct ncsi_cmd_handler {
{ NCSI_PKT_CMD_GNS,0, ncsi_cmd_handler_default },
{ NCSI_PKT_CMD_GNPTS,  0, ncsi_cmd_handler_default },
{ NCSI_PKT_CMD_GPS,0, ncsi_cmd_handler_default },
-   { NCSI_PKT_CMD_OEM,0, NULL },
+   { NCSI_PKT_CMD_OEM,   -1, ncsi_cmd_handler_oem },
{ NCSI_PKT_CMD_PLDM,   0, NULL },
{ NCSI_PKT_CMD_GPUUID, 0, ncsi_cmd_handler_default }
 };
@@ -316,8 +336,13 @@ int ncsi_xmit_cmd(struct ncsi_cmd_arg *nca)
return -ENOENT;
}
 
-   /* Get packet payload length and allocate the request */
-   nca->payload = nch->payload;
+   /* Get packet payload length and allocate the request
+* It is expected that if length set as negative in
+* handler structure means caller is initializing it
+* and setting length in nca before calling xmit function
+*/
+   if (nch->payload >= 0)
+   nca->payload = nch->payload;
nr = ncsi_alloc_command(nca);
if (!nr)
return -ENOMEM;
diff --git a/net/ncsi/ncsi-pkt.h b/net/ncsi/ncsi-pkt.h
index 91b4b66438df..1f338386810d 100644
--- a/net/ncsi/ncsi-pkt.h
+++ b/net/ncsi/ncsi-pkt.h
@@ -151,6 +151,22 @@ struct ncsi_cmd_snfc_pkt {
unsigned char   pad[22];
 };
 
+/* OEM Request Command as per NCSI Specification */
+struct ncsi_cmd_oem_pkt {
+   struct ncsi_cmd_pkt_hdr cmd; /* Command header*/
+   __be32  mfr_id;  /* Manufacture ID*/
+   unsigned char   data[64];/* OEM Payload Data  */
+   __be32  checksum;/* Checksum  */
+};
+
+/* OEM Response Packet as per NCSI Specification */
+struct ncsi_rsp_oem_pkt {
+   struct ncsi_rsp_pkt_hdr rsp; /* Command header*/
+   __be32  mfr_id;  /* Manufacture ID*/
+   unsigned char   data[64];/* Payload data  */
+   __be32  checksum;/* Checksum  */
+};
+
 /* Get Link Status */
 struct ncsi_rsp_gls_pkt {
struct ncsi_rsp_pkt_hdr rsp;/* Response header   */
diff --git a/net/ncsi/ncsi-rsp.c b/net/ncsi/ncsi-rsp.c
index 930c1d3796f0..22664ebdc93a 100644
--- a/net/ncsi/ncsi-rsp.c
+++ b/net/ncsi/ncsi-rsp.c
@@ -596,6 +596,48 @@ static int ncsi_rsp_handler_snfc(struct ncsi_request *nr)
return 0;
 }
 
+static struct ncsi_rsp_oem_handler {
+   unsigned intmfr_id;
+   int (*handler)(struct ncsi_request *nr);
+} ncsi_rsp_oem_handlers[] = {
+   { NCSI_OEM_MFR_MLX_ID, NULL },
+   { NCSI_OEM_MFR_BCM_ID, NULL }
+};
+
+
+/* Response handler for OEM command */
+static int ncsi_rsp_handler_oem(struct ncsi_request *nr)
+{
+   struct ncsi_rsp_oem_pkt *rsp;
+   struct ncsi_rsp_oem_handler *nrh = NULL;
+   unsigned int mfr_id, i;
+
+   /* Get the response header */
+   rsp = (struct ncsi_rsp_oem_pkt

[PATCH linux-firmware] linux-firmware: liquidio: fix GPL compliance issue

2018-09-28 Thread Manlunas, Felix

Part of the code inside the lio_vsw_23xx.bin firmware image is under GPL,
but the LICENCE.cavium file neglects to indicate that.  However,
LICENCE.cavium does correctly specify the license that covers the other
Cavium firmware images that do not contain any GPL code.

Fix the GPL compliance issue by adding a new file, LICENCE.cavium_liquidio,
which correctly shows the GPL boilerplate.  This new file specifies the
licenses for all liquidio firmware, including the ones that do not have
GPL code.

Change the liquidio section of WHENCE to point to LICENCE.cavium_liquidio.

Reported-by: Florian Weimer 
Signed-off-by: Manish Awasthi 
Signed-off-by: Manoj Panicker 
Signed-off-by: Faisal Masood 
Signed-off-by: Felix Manlunas 
---
 LICENCE.cavium_liquidio | 429 
 WHENCE  |   2 +-
 2 files changed, 430 insertions(+), 1 deletion(-)
 create mode 100644 LICENCE.cavium_liquidio

diff --git a/LICENCE.cavium_liquidio b/LICENCE.cavium_liquidio
new file mode 100644
index 000..72e6ae7
--- /dev/null
+++ b/LICENCE.cavium_liquidio
@@ -0,0 +1,429 @@
+This file contains licences pertaining to the following firmwares for
+LiquidIO (c) adapters
+
+1. lio_nic_23xx.bin, lio_210nv_nic.bin, lio_410nv_nic.bin
+2. lio_vsw_23xx.bin
+
+###
+
+1. lio_nic_23xx.bin, lio_210nv_nic.bin, lio_410nv_nic.bin
+
+Copyright (c) 2018, Cavium, Inc. All rights reserved.
+
+Software License Agreement
+
+ANY USE, REPRODUCTION, OR DISTRIBUTION OF THE ACCOMPANYING BINARY SOFTWARE
+CONSTITUTES LICENSEEE'S ACCEPTANCE OF THE TERMS AND CONDITIONS OF THIS 
AGREEMENT.
+
+Licensed Software. Subject to the terms and conditions of this Agreement,
+Cavium, Inc. ("Cavium") grants to Licensee a worldwide, non-exclusive, and
+royalty-free license to use, reproduce, and distribute the binary software in
+its complete and unmodified form as provided by Cavium.
+
+Restrictions. Licensee must reproduce the Cavium copyright notice above with
+each binary software copy. Licensee must not reverse engineer, decompile,
+disassemble or modify in any way the binary software. Licensee must not use
+the binary software in violation of any applicable law or regulation. This
+Agreement shall automatically terminate upon Licensee's breach of any term or
+condition of this Agreement in which case, Licensee shall destroy all copies of
+the binary software.
+
+Warranty Disclaimer.  THE LICENSED SOFTWARE IS OFFERED "AS IS," AND CAVIUM
+GRANTS AND LICENSEE RECEIVES NO WARRANTIES OF ANY KIND, WHETHER EXPRESS,
+IMPLIED, STATUTORY, OR BY COURSE OF COMMUNICATION OR DEALING WITH LICENSEE, OR
+OTHERWISE.  CAVIUM AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY IMPLIED
+WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE, OR
+NONINFRINGEMENT OF THIRD PARTY RIGHTS, CONCERNING THE LICENSED SOFTWARE,
+DERIVATIVE WORKS, OR ANY DOCUMENTATION PROVIDED WITH THE FOREGOING.  WITHOUT
+LIMITING THE GENERALITY OF THE FOREGOING, CAVIUM DOES NOT WARRANT THAT THE
+LICENSED SOFTWARE IS ERROR-FREE OR WILL OPERATE WITHOUT INTERRUPTION, AND
+CAVIUM GRANTS NO WARRANTY REGARDING ITS USE OR THE RESULTS THEREFROM, INCLUDING
+ITS CORRECTNESS, ACCURACY, OR RELIABILITY.
+
+Limitation of Liability. IN NO EVENT WILL LICENSEE, CAVIUM, OR ANY OF CAVIUM'S
+LICENSORS HAVE ANY LIABILITY HEREUNDER FOR ANY INDIRECT, SPECIAL, OR
+CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
+FOR BREACH OF CONTRACT, TORT (INCLUDING NEGLIGENCE), OR OTHERWISE, ARISING OUT
+OF THIS AGREEMENT, INCLUDING DAMAGES FOR LOSS OF PROFITS, OR THE COST OF
+PROCUREMENT OF SUBSTITUTE GOODS, EVEN IF SUCH PARTY HAS BEEN ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGES.
+
+Export and Import Laws.  Licensee acknowledges and agrees that the Licensed
+Software (including technical data and related technology) may be controlled by
+the export control laws, rules, regulations, restrictions and national security
+controls of the United States and other applicable foreign agencies (the
+"Export Controls"), and agrees not export or re-export, or allow the export or
+re-export of export-controlled the Licensed Software (including technical data
+and related technology) or any copy, portion or direct product of the foregoing
+in violation of the Export Controls. Licensee hereby represents that
+(i) Licensee is not an entity or person to whom provision of the Licensed
+Software (including technical data and related technology) is restricted or
+prohibited by the Export Controls; and (ii) Licensee will not export, re-export
+or otherwise transfer the export-controlled Licensed Software (including
+technical data and related technology) in violation of U.S. sanction programs
+or export control regulations to (a) any country, or  national or resident of
+any country, subject to a United States trade embargo, (b) any person or entity
+to whom shipment is restricted or prohibited by the Export Controls,

Re: [PATCH net-next] geneve: fix ttl inherit type

2018-09-28 Thread Michal Kubecek

On Fri, Sep 28, 2018 at 09:09:58AM +0800, Hangbin Liu wrote:
> Phil pointed out that there is a mismatch between vxlan and geneve ttl
> inherit. We should define it as a flag and use nla_put_flag to export this
> opiton.
> 
> Fixes: 52d0d404d39dd ("geneve: add ttl inherit support")
> Reported-by: Phil Sutter 
> Signed-off-by: Hangbin Liu 
> ---
>  drivers/net/geneve.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
> index 6625fab..09ab2fd 100644
> --- a/drivers/net/geneve.c
> +++ b/drivers/net/geneve.c
> @@ -1100,7 +1100,7 @@ static const struct nla_policy 
> geneve_policy[IFLA_GENEVE_MAX + 1] = {
>   [IFLA_GENEVE_UDP_CSUM]  = { .type = NLA_U8 },
>   [IFLA_GENEVE_UDP_ZERO_CSUM6_TX] = { .type = NLA_U8 },
>   [IFLA_GENEVE_UDP_ZERO_CSUM6_RX] = { .type = NLA_U8 },
> - [IFLA_GENEVE_TTL_INHERIT]   = { .type = NLA_U8 },
> + [IFLA_GENEVE_TTL_INHERIT]   = { .type = NLA_FLAG },
>  };
>  
>  static int geneve_validate(struct nlattr *tb[], struct nlattr *data[],
> @@ -1582,7 +1582,7 @@ static size_t geneve_get_size(const struct net_device 
> *dev)
>   nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_UDP_CSUM */
>   nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_UDP_ZERO_CSUM6_TX 
> */
>   nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_UDP_ZERO_CSUM6_RX 
> */
> - nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_TTL_INHERIT */
> + nla_total_size(0) + /* IFLA_GENEVE_TTL_INHERIT */
>   0;
>  }
>  
> @@ -1636,7 +1636,7 @@ static int geneve_fill_info(struct sk_buff *skb, const 
> struct net_device *dev)
>   goto nla_put_failure;
>  #endif
>  
> - if (nla_put_u8(skb, IFLA_GENEVE_TTL_INHERIT, ttl_inherit))
> + if (ttl_inherit && nla_put_flag(skb, IFLA_GENEVE_TTL_INHERIT))
>   goto nla_put_failure;
>  
>   return 0;

Is it desirable to switch to a flag? If I read geneve_changelink() and
geneve_nl2info() correctly, it allows you to set the ttl_inherit flag
for an existing device but doesn't allow you to clear it. With NLA_U8,
you could distinguish three cases: set the flag (non-zero value), clear
the flag (zero value) and preserve current state (attribute not
present).

The same problem exists for vxlan but vxlan code intentionally disallows
changing the flag value for an existing device (I'm not sure if it's
because it's really impossible or just due to limits of the interface).
Unfortunately it has been already released with NLA_FLAG in 4.18,
AFAICS, so we have to live with it. But it's not too late for geneve.

Michal Kubecek

[PATCH net-next 2/3] ibmvnic: Introduce driver limits for ring sizes

2018-09-28 Thread Thomas Falcon

Introduce driver-defined maximums for queue ring sizes. Devices
available for IBM vNIC today will likely not allow this amount,
but this should give us some leeway for future devices that may
support larger ring sizes.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.h 
b/drivers/net/ethernet/ibm/ibmvnic.h
index 2c1787109f1c..f9a12e5843c4 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.h
+++ b/drivers/net/ethernet/ibm/ibmvnic.h
@@ -40,6 +40,7 @@
 /* when changing this, update IBMVNIC_IO_ENTITLEMENT_DEFAULT */
 #define IBMVNIC_BUFFS_PER_POOL 100
 #define IBMVNIC_MAX_QUEUES 16
+#define IBMVNIC_MAX_QUEUE_SZ   4096
 
 #define IBMVNIC_TSO_BUF_SZ 65536
 #define IBMVNIC_TSO_BUFS   64
-- 
2.12.3

[PATCH net-next 1/3] ibmvnic: Increase maximum queue size limit

2018-09-28 Thread Thomas Falcon

Increase queue size limit to 16. Devices available for IBM vNIC today
will not allow this amount, but this should give us some leeway for
future devices that may support more RX or TX queues.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.h 
b/drivers/net/ethernet/ibm/ibmvnic.h
index f06eec145ca6..2c1787109f1c 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.h
+++ b/drivers/net/ethernet/ibm/ibmvnic.h
@@ -39,7 +39,7 @@
 #define IBMVNIC_RX_WEIGHT  16
 /* when changing this, update IBMVNIC_IO_ENTITLEMENT_DEFAULT */
 #define IBMVNIC_BUFFS_PER_POOL 100
-#define IBMVNIC_MAX_QUEUES 10
+#define IBMVNIC_MAX_QUEUES 16
 
 #define IBMVNIC_TSO_BUF_SZ 65536
 #define IBMVNIC_TSO_BUFS   64
-- 
2.12.3

[PATCH net-next 0/3] ibmvnic: Implement driver-defined queue limits

2018-09-28 Thread Thomas Falcon

In this patch series, update the ibmvnic driver to use driver-defined
queue limits instead of limits imposed by the Virtual I/O server
management partition. For some deviced, initial max queue size and
amount limits, despite their definition, can actually be exceeded if
the client driver requests it. With this in mind, define a private
ethtool flag that toggles the use of driver-defined limits. These
limits are currently more than what supported hardware will likely
allow, so the driver will attempt to get as close as possible to
the user request but may not fully succeed.

Thomas Falcon (3):
  ibmvnic: Increase maximum queue size limit
  ibmvnic: Introduce driver limits for ring sizes
  ibmvnic: Add ethtool private flag for driver-defined queue limits

 drivers/net/ethernet/ibm/ibmvnic.c | 129 +++--
 drivers/net/ethernet/ibm/ibmvnic.h |   9 ++-
 2 files changed, 102 insertions(+), 36 deletions(-)

-- 
2.12.3

[PATCH net-next 3/3] ibmvnic: Add ethtool private flag for driver-defined queue limits

2018-09-28 Thread Thomas Falcon

When choosing channel amounts and ring sizes, the maximums in the
ibmvnic driver are defined by the virtual i/o server management
partition. Even though they are defined as maximums, the client
driver may in fact successfully request resources that exceed
these limits, which are mostly dependent on a user's hardware

With this in mind, provide an ethtool flag that when enabled will
allow the user to request resources limited by driver-defined
maximums instead of limits defined by the management partition.
The driver will try to honor the user's request but may not allowed
by the management partition. In this case, the driver requests
as close as it can get to the desired amount until it succeeds.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 129 +++--
 drivers/net/ethernet/ibm/ibmvnic.h |   6 ++
 2 files changed, 100 insertions(+), 35 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index a8369addfe68..ad898e8eaca1 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -2364,8 +2364,13 @@ static void ibmvnic_get_ringparam(struct net_device 
*netdev,
 {
struct ibmvnic_adapter *adapter = netdev_priv(netdev);
 
-   ring->rx_max_pending = adapter->max_rx_add_entries_per_subcrq;
-   ring->tx_max_pending = adapter->max_tx_entries_per_subcrq;
+   if (adapter->priv_flags & IBMVNIC_USE_SERVER_MAXES) {
+   ring->rx_max_pending = adapter->max_rx_add_entries_per_subcrq;
+   ring->tx_max_pending = adapter->max_tx_entries_per_subcrq;
+   } else {
+   ring->rx_max_pending = IBMVNIC_MAX_QUEUE_SZ;
+   ring->tx_max_pending = IBMVNIC_MAX_QUEUE_SZ;
+   }
ring->rx_mini_max_pending = 0;
ring->rx_jumbo_max_pending = 0;
ring->rx_pending = adapter->req_rx_add_entries_per_subcrq;
@@ -2378,21 +2383,23 @@ static int ibmvnic_set_ringparam(struct net_device 
*netdev,
 struct ethtool_ringparam *ring)
 {
struct ibmvnic_adapter *adapter = netdev_priv(netdev);
+   int ret;
 
-   if (ring->rx_pending > adapter->max_rx_add_entries_per_subcrq  ||
-   ring->tx_pending > adapter->max_tx_entries_per_subcrq) {
-   netdev_err(netdev, "Invalid request.\n");
-   netdev_err(netdev, "Max tx buffers = %llu\n",
-  adapter->max_rx_add_entries_per_subcrq);
-   netdev_err(netdev, "Max rx buffers = %llu\n",
-  adapter->max_tx_entries_per_subcrq);
-   return -EINVAL;
-   }
-
+   ret = 0;
adapter->desired.rx_entries = ring->rx_pending;
adapter->desired.tx_entries = ring->tx_pending;
 
-   return wait_for_reset(adapter);
+   ret = wait_for_reset(adapter);
+
+   if (!ret &&
+   (adapter->req_rx_add_entries_per_subcrq != ring->rx_pending ||
+adapter->req_tx_entries_per_subcrq != ring->tx_pending))
+   netdev_info(netdev,
+   "Could not match full ringsize request. Requested: 
RX %d, TX %d; Allowed: RX %llu, TX %llu\n",
+   ring->rx_pending, ring->tx_pending,
+   adapter->req_rx_add_entries_per_subcrq,
+   adapter->req_tx_entries_per_subcrq);
+   return ret;
 }
 
 static void ibmvnic_get_channels(struct net_device *netdev,
@@ -2400,8 +2407,14 @@ static void ibmvnic_get_channels(struct net_device 
*netdev,
 {
struct ibmvnic_adapter *adapter = netdev_priv(netdev);
 
-   channels->max_rx = adapter->max_rx_queues;
-   channels->max_tx = adapter->max_tx_queues;
+   if (adapter->priv_flags & IBMVNIC_USE_SERVER_MAXES) {
+   channels->max_rx = adapter->max_rx_queues;
+   channels->max_tx = adapter->max_tx_queues;
+   } else {
+   channels->max_rx = IBMVNIC_MAX_QUEUES;
+   channels->max_tx = IBMVNIC_MAX_QUEUES;
+   }
+
channels->max_other = 0;
channels->max_combined = 0;
channels->rx_count = adapter->req_rx_queues;
@@ -2414,11 +2427,23 @@ static int ibmvnic_set_channels(struct net_device 
*netdev,
struct ethtool_channels *channels)
 {
struct ibmvnic_adapter *adapter = netdev_priv(netdev);
+   int ret;
 
+   ret = 0;
adapter->desired.rx_queues = channels->rx_count;
adapter->desired.tx_queues = channels->tx_count;
 
-   return wait_for_reset(adapter);
+   ret = wait_for_reset(adapter);
+
+   if (!ret &&
+   (adapter->req_rx_queues != channels->rx_count ||
+adapter->req_tx_queues != channels->tx_count))
+   netdev_info(netdev,
+   "Could not match full channels request. Requested: 
RX %d, TX %d; Allowed: RX %llu, TX %llu\n",
+   channels->rx_count,

Re: [PATCH net v2] net/ncsi: Extend NC-SI Netlink interface to allow user space to send NC-SI command

2018-09-28 Thread Vijay Khemka



>On 9/28/18, 11:16 AM, "justin.l...@dell.com"  wrote:

  >  The new command (NCSI_CMD_SEND_CMD) is added to allow user space 
application 
  >  to send NC-SI command to the network card.
  >  Also, add a new attribute (NCSI_ATTR_DATA) for transferring request and 
response.
  > 
  >  The work flow is as below. 
  >  
  >  Request:
  >  User space application -> Netlink interface (msg)
  -> new Netlink handler - 
ncsi_send_cmd_nl()
  -> ncsi_xmit_cmd()
  >  Response:
  >  Response received - ncsi_rcv_rsp() -> internal response handler - 
ncsi_rsp_handler_xxx()
-> 
ncsi_rsp_handler_netlink()
-> 
ncsi_send_netlink_rsp ()
-> 
Netlink interface (msg)
-> 
user space application
  >  Command timeout - ncsi_request_timeout() -> ncsi_send_netlink_timeout ()

-> Netlink interface (msg with zero data length)

-> user space application
  >  Error:
  >  Error detected -> ncsi_send_netlink_err () -> Netlink interface (err msg)

   -> user space application


I will request to keep 2 patch, one for OEM handler and other one for 
netlink user space handling.

Re: [PATCH] net/ncsi: Add NCSI OEM command for FB Tiogapass

2018-09-28 Thread Vijay Khemka

On 9/26/18, 8:44 PM, "Samuel Mendoza-Jonas"  wrote:

>Hi Vijay,

>Having had a chance to take a closer look, there is probably room for
>   both this patchset and Justin's potential changes to coexist; while
>  Justin's is a more general solution for sending arbitrary commands, the
>approach of this patch is also useful for handling commands we want
>included in the configure process (such as get-mac-address).

Hi Sam,
I have created a more generic approach based patch, will send you after clean 
up. I agree we need both Justin patch as well as mine to handle OEM specific 
command. I am creating 2 patches one for generic OEM command and response 
handling and another is OEM vendor specific command handling. Please see my 
responses to your comments below.

Some comments below:

> ---
>  net/ncsi/Kconfig   |  3 ++
>  net/ncsi/internal.h| 11 +--
>  net/ncsi/ncsi-cmd.c| 24 +--
>  net/ncsi/ncsi-manage.c | 68 ++
>  net/ncsi/ncsi-pkt.h| 16 ++
>  net/ncsi/ncsi-rsp.c| 33 +++-
>  6 files changed, 149 insertions(+), 6 deletions(-)
> 
> diff --git a/net/ncsi/Kconfig b/net/ncsi/Kconfig
> index 08a8a6031fd7..b8bf89fea7c8 100644
> --- a/net/ncsi/Kconfig
> +++ b/net/ncsi/Kconfig
> @@ -10,3 +10,6 @@ config NET_NCSI
> support. Enable this only if your system connects to a network
> device via NCSI and the ethernet driver you're using supports
> the protocol explicitly.
> +config NCSI_OEM_CMD_GET_MAC
> + bool "Get NCSI OEM MAC Address"
> + depends on NET_NCSI

For the moment this isn't too bad but I wonder if in the future it would
be more flexible to have something like NCSI_OEM_CMD_MELLANOX etc, so we
could selectively enable a class of OEM commands based on vendor rather
than per-command.
Certainly, we can have like that and will be an addition to above config. 
Above config is to enable retrieving and setting mac address from device 
during channel configuration. And then we can have vendor selection like 
MELLANOX, Broadcom etc. But I will rather check GV ID under this config 
and see if there is available function for specific vendor. This can be 
revised
and made more generic based on vendor.

> diff --git a/net/ncsi/internal.h b/net/ncsi/internal.h
> index 8055e3965cef..da17958e6a4b 100644
> --- a/net/ncsi/internal.h
> +++ b/net/ncsi/internal.h
> @@ -68,6 +68,10 @@ enum {
>   NCSI_MODE_MAX
>  };
>  
> +#define NCSI_OEM_MFR_MLX_ID 0x8119
> +#define NCSI_OEM_MLX_CMD_GET_MAC0x1b00
> +#define NCSI_OEM_MLX_CMD_SET_AFFINITY   0x010700

I gather this is part of the OEM command but it would be good to 
describe
what these bits mean. Is this command documented anywhere by Mellanox?
I will add more description here, please see in next patch. Yes Mellanox 
specification describes these commands.

> +
>  struct ncsi_channel_version {
>   u32 version;/* Supported BCD encoded NCSI version */
>   u32 alpha2; /* Supported BCD encoded NCSI version */
> @@ -236,6 +240,7 @@ enum {
>   ncsi_dev_state_probe_dp,
>   ncsi_dev_state_config_sp= 0x0301,
>   ncsi_dev_state_config_cis,
> + ncsi_dev_state_config_oem_gma,
>   ncsi_dev_state_config_clear_vids,
>   ncsi_dev_state_config_svf,
>   ncsi_dev_state_config_ev,
> @@ -301,9 +306,9 @@ struct ncsi_cmd_arg {
>   unsigned short   payload; /* Command packet payload length */
>   unsigned int req_flags;   /* NCSI request properties   */
>   union {
> - unsigned char  bytes[16]; /* Command packet specific data  */
> - unsigned short words[8];
> - unsigned int   dwords[4];
> + unsigned char  bytes[64]; /* Command packet specific data  */
> + unsigned short words[32];
> + unsigned int   dwords[16];
>   };
>  };
>  
> diff --git a/net/ncsi/ncsi-cmd.c b/net/ncsi/ncsi-cmd.c
> index 7567ca63aae2..3205e22c1734 100644
> --- a/net/ncsi/ncsi-cmd.c
> +++ b/net/ncsi/ncsi-cmd.c
> @@ -211,6 +211,25 @@ static int ncsi_cmd_handler_snfc(struct sk_buff *skb,
>   return 0;
>  }
>  
> +static int ncsi_cmd_handler_oem(struct sk_buff *skb,
> + struct ncsi_cmd_arg *nca)
> +{
> + struct ncsi_cmd_oem_pkt *cmd;
> + unsigned int len;
> +
> + len = sizeof(struct ncsi_cmd_pkt_hdr) + 4;
> + if (nca->payload < 26)
> + len += 26;

This will have already happened in ncsi_alloc_command(), is this check
needed?
Yes it is needed to find length for skbuff data to initialize. 
ncsi_alloc_command() does the same and allocate skbuff.

Re: [PATCH net-next] tcp/fq: move back to CLOCK_MONOTONIC

2018-09-28 Thread Eric Dumazet




On 09/28/2018 02:27 PM, Leonard Crestez wrote:
> On Fri, 2018-09-28 at 10:28 -0700, Eric Dumazet wrote:
>> In the recent TCP/EDT patch series, I switched TCP and sch_fq
>> clocks from MONOTONIC to TAI, in order to meet the choice done
>> earlier for sch_etf packet scheduler.
>>
>> But sure enough, this broke some setups were the TAI clock
>> jumps forward (by almost 50 year...), as reported
>> by Leonard Crestez.
>>
>> If we want to converge later, we'll probably need to add
>> an skb field to differentiate the clock bases, or a socket option.
>>
>> In the meantime, an UDP application will need to use CLOCK_MONOTONIC
>> base for its SCM_TXTIME timestamps if using fq packet scheduler.
>>
>> Fixes: 72b0094f9182 ("tcp: switch tcp_clock_ns() to CLOCK_TAI base")
>> Fixes: 142537e41923 ("net_sched: sch_fq: switch to CLOCK_TAI")
>> Fixes: fd2bca2aa789 ("tcp: switch internal pacing timer to CLOCK_TAI")
>> Signed-off-by: Eric Dumazet 
>> Reported-by: Leonard Crestez 
> 
> Tested-by: Leonard Crestez 
> 
> Fixes the problem reported earlier when applied on top of next-20180928
> 

Thanks again ;)

Re: [net] r8169: fix network stalls due to missing bit TXCFG_AUTO_FIFO

2018-09-28 Thread Heiner Kallweit

On 28.09.2018 23:47, Maciej S. Szmigiero wrote:
> On 28.09.2018 22:19, Heiner Kallweit wrote:
>> Some of the chip-specific hw_start functions set bit TXCFG_AUTO_FIFO
>> in register TxConfig. The original patch changed the order of some
>> calls resulting in these changes being overwritten by
>> rtl_set_tx_config_registers() in rtl_hw_start(). This eventually
>> resulted in network stalls especially under high load.
>>
>> Analyzing the chip-specific hw_start functions all chip version from
>> 34, with the exception of version 39, need this bit set.
>> This patch moves setting this bit to rtl_set_tx_config_registers().
>>
>> Fixes: 4fd48c4ac0a0 ("r8169: move common initializations to tp->hw_start")
>> Reported-by: Ortwin Glück 
>> Reported-by: David Arendt 
>> Tested-by: Tony Atkinson 
>> Tested-by: David Arendt 
>> Tested-by: Ortwin Glück 
>> Signed-off-by: Heiner Kallweit 
> 
> Please add:
> Root-caused-by: Maciej S. Szmigiero 
> 
Sure, forgot you in the list.

Heiner

> Thanks,
> Maciej
>

[PATCH net 1/3] tun: remove unused parameters

2018-09-28 Thread Eric Dumazet

tun_napi_disable() and tun_napi_del() do not need
a pointer to the tun_struct

Signed-off-by: Eric Dumazet 
---
 drivers/net/tun.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 
e2648b5a3861e51dc6c40d19e1198a5f3f7ca7af..71d10fb59849bff091ee64b6f7e9cc8ae2e0cf6f
 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -324,13 +324,13 @@ static void tun_napi_init(struct tun_struct *tun, struct 
tun_file *tfile,
}
 }
 
-static void tun_napi_disable(struct tun_struct *tun, struct tun_file *tfile)
+static void tun_napi_disable(struct tun_file *tfile)
 {
if (tfile->napi_enabled)
napi_disable(>napi);
 }
 
-static void tun_napi_del(struct tun_struct *tun, struct tun_file *tfile)
+static void tun_napi_del(struct tun_file *tfile)
 {
if (tfile->napi_enabled)
netif_napi_del(>napi);
@@ -690,8 +690,8 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
tun = rtnl_dereference(tfile->tun);
 
if (tun && clean) {
-   tun_napi_disable(tun, tfile);
-   tun_napi_del(tun, tfile);
+   tun_napi_disable(tfile);
+   tun_napi_del(tfile);
}
 
if (tun && !tfile->detached) {
@@ -758,7 +758,7 @@ static void tun_detach_all(struct net_device *dev)
for (i = 0; i < n; i++) {
tfile = rtnl_dereference(tun->tfiles[i]);
BUG_ON(!tfile);
-   tun_napi_disable(tun, tfile);
+   tun_napi_disable(tfile);
tfile->socket.sk->sk_shutdown = RCV_SHUTDOWN;
tfile->socket.sk->sk_data_ready(tfile->socket.sk);
RCU_INIT_POINTER(tfile->tun, NULL);
@@ -774,7 +774,7 @@ static void tun_detach_all(struct net_device *dev)
synchronize_net();
for (i = 0; i < n; i++) {
tfile = rtnl_dereference(tun->tfiles[i]);
-   tun_napi_del(tun, tfile);
+   tun_napi_del(tfile);
/* Drop read queue */
tun_queue_purge(tfile);
xdp_rxq_info_unreg(>xdp_rxq);
-- 
2.19.0.605.g01d371f741-goog

[PATCH net 3/3] tun: napi flags belong to tfile

2018-09-28 Thread Eric Dumazet

Since tun->flags might be shared by multiple tfile structures,
it is better to make sure tun_get_user() is using the flags
for the current tfile.

Presence of the READ_ONCE() in tun_napi_frags_enabled() gave a hint
of what could happen, but we need something stronger to please
syzbot.

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault:  [#1] PREEMPT SMP KASAN
CPU: 0 PID: 13647 Comm: syz-executor5 Not tainted 4.19.0-rc5+ #59
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
RIP: 0010:dev_gro_receive+0x132/0x2720 net/core/dev.c:5427
Code: 48 c1 ea 03 80 3c 02 00 0f 85 6e 20 00 00 48 b8 00 00 00 00 00 fc ff df 
4d 8b 6e 10 49 8d bd d0 00 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 59 20 
00 00 4d 8b a5 d0 00 00 00 31 ff 41 81 e4
RSP: 0018:8801c400f410 EFLAGS: 00010202
RAX: dc00 RBX:  RCX: 8618d325
RDX: 001a RSI: 86189f97 RDI: 00d0
RBP: 8801c400f608 R08: 8801c8fb4300 R09: 
R10: ed0038801ed7 R11: 0003 R12: 8801d327d358
R13:  R14: 8801c16dd8c0 R15: 0004
FS:  7fe003615700() GS:8801dac0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7fe1f3c43db8 CR3: 0001bebb2000 CR4: 001406f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 napi_gro_frags+0x3f4/0xc90 net/core/dev.c:5715
 tun_get_user+0x31d5/0x42a0 drivers/net/tun.c:1922
 tun_chr_write_iter+0xb9/0x154 drivers/net/tun.c:1967
 call_write_iter include/linux/fs.h:1808 [inline]
 new_sync_write fs/read_write.c:474 [inline]
 __vfs_write+0x6b8/0x9f0 fs/read_write.c:487
 vfs_write+0x1fc/0x560 fs/read_write.c:549
 ksys_write+0x101/0x260 fs/read_write.c:598
 __do_sys_write fs/read_write.c:610 [inline]
 __se_sys_write fs/read_write.c:607 [inline]
 __x64_sys_write+0x73/0xb0 fs/read_write.c:607
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457579
Code: 1d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 
89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 
eb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:7fe003614c78 EFLAGS: 0246 ORIG_RAX: 0001
RAX: ffda RBX: 0003 RCX: 00457579
RDX: 0012 RSI: 2000 RDI: 000a
RBP: 0072c040 R08:  R09: 
R10:  R11: 0246 R12: 7fe0036156d4
R13: 004c5574 R14: 004d8e98 R15: 
Modules linked in:

RIP: 0010:dev_gro_receive+0x132/0x2720 net/core/dev.c:5427
Code: 48 c1 ea 03 80 3c 02 00 0f 85 6e 20 00 00 48 b8 00 00 00 00 00 fc ff df 
4d 8b 6e 10 49 8d bd d0 00 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 59 20 
00 00 4d 8b a5 d0 00 00 00 31 ff 41 81 e4
RSP: 0018:8801c400f410 EFLAGS: 00010202
RAX: dc00 RBX:  RCX: 8618d325
RDX: 001a RSI: 86189f97 RDI: 00d0
RBP: 8801c400f608 R08: 8801c8fb4300 R09: 
R10: ed0038801ed7 R11: 0003 R12: 8801d327d358
R13:  R14: 8801c16dd8c0 R15: 0004
FS:  7fe003615700() GS:8801dac0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7fe1f3c43db8 CR3: 0001bebb2000 CR4: 001406f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400

Fixes: 90e33d459407 ("tun: enable napi_gro_frags() for TUN/TAP driver")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
---
 drivers/net/tun.c | 23 ++-
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 
729686babbf3b7d2f76ce64a5ebf7676e45eb681..50e9cc19023a701bad861ac117665a024ba776b1
 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -181,6 +181,7 @@ struct tun_file {
};
struct napi_struct napi;
bool napi_enabled;
+   bool napi_frags_enabled;
struct mutex napi_mutex;/* Protects access to the above napi */
struct list_head next;
struct tun_struct *detached;
@@ -313,9 +314,10 @@ static int tun_napi_poll(struct napi_struct *napi, int 
budget)
 }
 
 static void tun_napi_init(struct tun_struct *tun, struct tun_file *tfile,
- bool napi_en)
+ bool napi_en, bool napi_frags)
 {
tfile->napi_enabled = napi_en;
+   tfile->napi_frags_enabled = napi_en && napi_frags;
if (napi_en) {
netif_napi_add(tun->dev, >napi, tun_napi_poll,

[PATCH net v2] r8169: fix network stalls due to missing bit TXCFG_AUTO_FIFO

2018-09-28 Thread Heiner Kallweit

Some of the chip-specific hw_start functions set bit TXCFG_AUTO_FIFO
in register TxConfig. The original patch changed the order of some
calls resulting in these changes being overwritten by
rtl_set_tx_config_registers() in rtl_hw_start(). This eventually
resulted in network stalls especially under high load.

Analyzing the chip-specific hw_start functions all chip version from
34, with the exception of version 39, need this bit set.
This patch moves setting this bit to rtl_set_tx_config_registers().

Fixes: 4fd48c4ac0a0 ("r8169: move common initializations to tp->hw_start")
Reported-by: Ortwin Glück 
Reported-by: David Arendt 
Root-caused-by: Maciej S. Szmigiero 
Tested-by: Tony Atkinson 
Tested-by: David Arendt 
Tested-by: Ortwin Glück 
Signed-off-by: Heiner Kallweit 
---
Hint for applying this change to stable:
It may collide with 05212ba8132b ("r8169: set RxConfig after tx/rx is
enabled for RTL8169sb/8110sb devices") which renamed
rtl_set_rx_tx_config_registers() to rtl_set_tx_config_registers().
---
v2:
- added Maciej as root-caused-by
---
 drivers/net/ethernet/realtek/r8169.c | 20 
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index f882be49f..ae8abe900 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -4514,9 +4514,14 @@ static void rtl8169_hw_reset(struct rtl8169_private *tp)
 
 static void rtl_set_tx_config_registers(struct rtl8169_private *tp)
 {
-   /* Set DMA burst size and Interframe Gap Time */
-   RTL_W32(tp, TxConfig, (TX_DMA_BURST << TxDMAShift) |
-   (InterFrameGap << TxInterFrameGapShift));
+   u32 val = TX_DMA_BURST << TxDMAShift |
+ InterFrameGap << TxInterFrameGapShift;
+
+   if (tp->mac_version >= RTL_GIGA_MAC_VER_34 &&
+   tp->mac_version != RTL_GIGA_MAC_VER_39)
+   val |= TXCFG_AUTO_FIFO;
+
+   RTL_W32(tp, TxConfig, val);
 }
 
 static void rtl_set_rx_max_size(struct rtl8169_private *tp)
@@ -5011,7 +5016,6 @@ static void rtl_hw_start_8168e_2(struct rtl8169_private 
*tp)
 
rtl_disable_clock_request(tp);
 
-   RTL_W32(tp, TxConfig, RTL_R32(tp, TxConfig) | TXCFG_AUTO_FIFO);
RTL_W8(tp, MCU, RTL_R8(tp, MCU) & ~NOW_IS_OOB);
 
/* Adjust EEE LED frequency */
@@ -5045,7 +5049,6 @@ static void rtl_hw_start_8168f(struct rtl8169_private *tp)
 
rtl_disable_clock_request(tp);
 
-   RTL_W32(tp, TxConfig, RTL_R32(tp, TxConfig) | TXCFG_AUTO_FIFO);
RTL_W8(tp, MCU, RTL_R8(tp, MCU) & ~NOW_IS_OOB);
RTL_W8(tp, DLLPR, RTL_R8(tp, DLLPR) | PFM_EN);
RTL_W32(tp, MISC, RTL_R32(tp, MISC) | PWM_EN);
@@ -5090,8 +5093,6 @@ static void rtl_hw_start_8411(struct rtl8169_private *tp)
 
 static void rtl_hw_start_8168g(struct rtl8169_private *tp)
 {
-   RTL_W32(tp, TxConfig, RTL_R32(tp, TxConfig) | TXCFG_AUTO_FIFO);
-
rtl_eri_write(tp, 0xc8, ERIAR_MASK_0101, 0x080002, ERIAR_EXGMAC);
rtl_eri_write(tp, 0xcc, ERIAR_MASK_0001, 0x38, ERIAR_EXGMAC);
rtl_eri_write(tp, 0xd0, ERIAR_MASK_0001, 0x48, ERIAR_EXGMAC);
@@ -5189,8 +5190,6 @@ static void rtl_hw_start_8168h_1(struct rtl8169_private 
*tp)
rtl_hw_aspm_clkreq_enable(tp, false);
rtl_ephy_init(tp, e_info_8168h_1, ARRAY_SIZE(e_info_8168h_1));
 
-   RTL_W32(tp, TxConfig, RTL_R32(tp, TxConfig) | TXCFG_AUTO_FIFO);
-
rtl_eri_write(tp, 0xc8, ERIAR_MASK_0101, 0x00080002, ERIAR_EXGMAC);
rtl_eri_write(tp, 0xcc, ERIAR_MASK_0001, 0x38, ERIAR_EXGMAC);
rtl_eri_write(tp, 0xd0, ERIAR_MASK_0001, 0x48, ERIAR_EXGMAC);
@@ -5273,8 +5272,6 @@ static void rtl_hw_start_8168ep(struct rtl8169_private 
*tp)
 {
rtl8168ep_stop_cmac(tp);
 
-   RTL_W32(tp, TxConfig, RTL_R32(tp, TxConfig) | TXCFG_AUTO_FIFO);
-
rtl_eri_write(tp, 0xc8, ERIAR_MASK_0101, 0x00080002, ERIAR_EXGMAC);
rtl_eri_write(tp, 0xcc, ERIAR_MASK_0001, 0x2f, ERIAR_EXGMAC);
rtl_eri_write(tp, 0xd0, ERIAR_MASK_0001, 0x5f, ERIAR_EXGMAC);
@@ -5596,7 +5593,6 @@ static void rtl_hw_start_8402(struct rtl8169_private *tp)
/* Force LAN exit from ASPM if Rx/Tx are not idle */
RTL_W32(tp, FuncEvent, RTL_R32(tp, FuncEvent) | 0x002800);
 
-   RTL_W32(tp, TxConfig, RTL_R32(tp, TxConfig) | TXCFG_AUTO_FIFO);
RTL_W8(tp, MCU, RTL_R8(tp, MCU) & ~NOW_IS_OOB);
 
rtl_ephy_init(tp, e_info_8402, ARRAY_SIZE(e_info_8402));
-- 
2.19.0

[PATCH net 0/3] tun: address two syzbot reports

2018-09-28 Thread Eric Dumazet

Small changes addressing races discovered by syzbot.

First patch is a cleanup.
Second patch moves a mutex init sooner.
Third patch makes sure each tfile gets its own napi enable flags.

Eric Dumazet (3):
  tun: remove unused parameters
  tun: initialize napi_mutex unconditionally
  tun: napi flags belong to tfile

 drivers/net/tun.c | 37 +
 1 file changed, 21 insertions(+), 16 deletions(-)

-- 
2.19.0.605.g01d371f741-goog

[PATCH net 2/3] tun: initialize napi_mutex unconditionally

2018-09-28 Thread Eric Dumazet

This is the first part to fix following syzbot report :

console output: https://syzkaller.appspot.com/x/log.txt?x=145378e640
kernel config:  https://syzkaller.appspot.com/x/.config?x=443816db871edd66
dashboard link: https://syzkaller.appspot.com/bug?extid=e662df0ac1d753b57e80

Following patch is fixing the race condition, but it seems safer
to initialize this mutex at tfile creation anyway.

Fixes: 90e33d459407 ("tun: enable napi_gro_frags() for TUN/TAP driver")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot+e662df0ac1d753b57...@syzkaller.appspotmail.com
---
 drivers/net/tun.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 
71d10fb59849bff091ee64b6f7e9cc8ae2e0cf6f..729686babbf3b7d2f76ce64a5ebf7676e45eb681
 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -320,7 +320,6 @@ static void tun_napi_init(struct tun_struct *tun, struct 
tun_file *tfile,
netif_napi_add(tun->dev, >napi, tun_napi_poll,
   NAPI_POLL_WEIGHT);
napi_enable(>napi);
-   mutex_init(>napi_mutex);
}
 }
 
@@ -3199,6 +3198,7 @@ static int tun_chr_open(struct inode *inode, struct file 
* file)
return -ENOMEM;
}
 
+   mutex_init(>napi_mutex);
RCU_INIT_POINTER(tfile->tun, NULL);
tfile->flags = 0;
tfile->ifindex = 0;
-- 
2.19.0.605.g01d371f741-goog

Re: [net] r8169: fix network stalls due to missing bit TXCFG_AUTO_FIFO

2018-09-28 Thread Maciej S. Szmigiero

On 28.09.2018 22:19, Heiner Kallweit wrote:
> Some of the chip-specific hw_start functions set bit TXCFG_AUTO_FIFO
> in register TxConfig. The original patch changed the order of some
> calls resulting in these changes being overwritten by
> rtl_set_tx_config_registers() in rtl_hw_start(). This eventually
> resulted in network stalls especially under high load.
> 
> Analyzing the chip-specific hw_start functions all chip version from
> 34, with the exception of version 39, need this bit set.
> This patch moves setting this bit to rtl_set_tx_config_registers().
> 
> Fixes: 4fd48c4ac0a0 ("r8169: move common initializations to tp->hw_start")
> Reported-by: Ortwin Glück 
> Reported-by: David Arendt 
> Tested-by: Tony Atkinson 
> Tested-by: David Arendt 
> Tested-by: Ortwin Glück 
> Signed-off-by: Heiner Kallweit 

Please add:
Root-caused-by: Maciej S. Szmigiero 

Thanks,
Maciej

Re: [PATCH net-next] tcp/fq: move back to CLOCK_MONOTONIC

2018-09-28 Thread Leonard Crestez

On Fri, 2018-09-28 at 10:28 -0700, Eric Dumazet wrote:
> In the recent TCP/EDT patch series, I switched TCP and sch_fq
> clocks from MONOTONIC to TAI, in order to meet the choice done
> earlier for sch_etf packet scheduler.
> 
> But sure enough, this broke some setups were the TAI clock
> jumps forward (by almost 50 year...), as reported
> by Leonard Crestez.
> 
> If we want to converge later, we'll probably need to add
> an skb field to differentiate the clock bases, or a socket option.
> 
> In the meantime, an UDP application will need to use CLOCK_MONOTONIC
> base for its SCM_TXTIME timestamps if using fq packet scheduler.
> 
> Fixes: 72b0094f9182 ("tcp: switch tcp_clock_ns() to CLOCK_TAI base")
> Fixes: 142537e41923 ("net_sched: sch_fq: switch to CLOCK_TAI")
> Fixes: fd2bca2aa789 ("tcp: switch internal pacing timer to CLOCK_TAI")
> Signed-off-by: Eric Dumazet 
> Reported-by: Leonard Crestez 

Tested-by: Leonard Crestez 

Fixes the problem reported earlier when applied on top of next-20180928

bond: take rcu lock in netpoll_send_skb_on_dev

2018-09-28 Thread Dave Jones

The bonding driver lacks the rcu lock when it calls down into
netdev_lower_get_next_private_rcu from bond_poll_controller, which
results in a trace like:

WARNING: CPU: 2 PID: 179 at net/core/dev.c:6567 
netdev_lower_get_next_private_rcu+0x34/0x40
CPU: 2 PID: 179 Comm: kworker/u16:15 Not tainted 4.19.0-rc5-backup+ #1
Workqueue: bond0 bond_mii_monitor
RIP: 0010:netdev_lower_get_next_private_rcu+0x34/0x40
Code: 48 89 fb e8 fe 29 63 ff 85 c0 74 1e 48 8b 45 00 48 81 c3 c0 00 00 00 48 
8b 00 48 39 d8 74 0f 48 89 45 00 48 8b 40 f8 5b 5d c3 <0f> 0b eb de 31 c0 eb f5 
0f 1f 40 00 0f 1f 44 00 00 48 8>
RSP: 0018:c987fa68 EFLAGS: 00010046
RAX:  RBX: 880429614560 RCX: 
RDX: 0001 RSI:  RDI: a184ada0
RBP: c987fa80 R08: 0001 R09: 
R10: c987f9f0 R11: 880429798040 R12: 8804289d5980
R13: a1511f60 R14: 00c8 R15: 
FS:  () GS:88042f88() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f4b78fce180 CR3: 00018180f006 CR4: 001606e0
Call Trace:
 bond_poll_controller+0x52/0x170
 netpoll_poll_dev+0x79/0x290
 netpoll_send_skb_on_dev+0x158/0x2c0
 netpoll_send_udp+0x2d5/0x430
 write_ext_msg+0x1e0/0x210
 console_unlock+0x3c4/0x630
 vprintk_emit+0xfa/0x2f0
 printk+0x52/0x6e
 ? __netdev_printk+0x12b/0x220
 netdev_info+0x64/0x80
 ? bond_3ad_set_carrier+0xe9/0x180
 bond_select_active_slave+0x1fc/0x310
 bond_mii_monitor+0x709/0x9b0
 process_one_work+0x221/0x5e0
 worker_thread+0x4f/0x3b0
 kthread+0x100/0x140
 ? process_one_work+0x5e0/0x5e0
 ? kthread_delayed_work_timer_fn+0x90/0x90
 ret_from_fork+0x24/0x30

We're also doing rcu dereferences a layer up in netpoll_send_skb_on_dev
before we call down into netpoll_poll_dev, so just take the lock there.

Suggested-by: Cong Wang 
Signed-off-by: Dave Jones 

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 3219a2932463..692367d7c280 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -330,6 +330,7 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct 
sk_buff *skb,
/* It is up to the caller to keep npinfo alive. */
struct netpoll_info *npinfo;
 
+   rcu_read_lock_bh();
lockdep_assert_irqs_disabled();
 
npinfo = rcu_dereference_bh(np->dev->npinfo);
@@ -374,6 +375,7 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct 
sk_buff *skb,
skb_queue_tail(>txq, skb);
schedule_delayed_work(>tx_work,0);
}
+   rcu_read_unlock_bh();
 }
 EXPORT_SYMBOL(netpoll_send_skb_on_dev);

[PATCH net] r8169: fix network stalls due to missing bit TXCFG_AUTO_FIFO

2018-09-28 Thread Heiner Kallweit

Some of the chip-specific hw_start functions set bit TXCFG_AUTO_FIFO
in register TxConfig. The original patch changed the order of some
calls resulting in these changes being overwritten by
rtl_set_tx_config_registers() in rtl_hw_start(). This eventually
resulted in network stalls especially under high load.

Analyzing the chip-specific hw_start functions all chip version from
34, with the exception of version 39, need this bit set.
This patch moves setting this bit to rtl_set_tx_config_registers().

Fixes: 4fd48c4ac0a0 ("r8169: move common initializations to tp->hw_start")
Reported-by: Ortwin Glück 
Reported-by: David Arendt 
Tested-by: Tony Atkinson 
Tested-by: David Arendt 
Tested-by: Ortwin Glück 
Signed-off-by: Heiner Kallweit 
---
Hint for applying this change to stable:
It may collide with 05212ba8132b ("r8169: set RxConfig after tx/rx is
enabled for RTL8169sb/8110sb devices") which renamed
rtl_set_rx_tx_config_registers() to rtl_set_tx_config_registers().
---
 drivers/net/ethernet/realtek/r8169.c | 20 
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index f882be49f..ae8abe900 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -4514,9 +4514,14 @@ static void rtl8169_hw_reset(struct rtl8169_private *tp)
 
 static void rtl_set_tx_config_registers(struct rtl8169_private *tp)
 {
-   /* Set DMA burst size and Interframe Gap Time */
-   RTL_W32(tp, TxConfig, (TX_DMA_BURST << TxDMAShift) |
-   (InterFrameGap << TxInterFrameGapShift));
+   u32 val = TX_DMA_BURST << TxDMAShift |
+ InterFrameGap << TxInterFrameGapShift;
+
+   if (tp->mac_version >= RTL_GIGA_MAC_VER_34 &&
+   tp->mac_version != RTL_GIGA_MAC_VER_39)
+   val |= TXCFG_AUTO_FIFO;
+
+   RTL_W32(tp, TxConfig, val);
 }
 
 static void rtl_set_rx_max_size(struct rtl8169_private *tp)
@@ -5011,7 +5016,6 @@ static void rtl_hw_start_8168e_2(struct rtl8169_private 
*tp)
 
rtl_disable_clock_request(tp);
 
-   RTL_W32(tp, TxConfig, RTL_R32(tp, TxConfig) | TXCFG_AUTO_FIFO);
RTL_W8(tp, MCU, RTL_R8(tp, MCU) & ~NOW_IS_OOB);
 
/* Adjust EEE LED frequency */
@@ -5045,7 +5049,6 @@ static void rtl_hw_start_8168f(struct rtl8169_private *tp)
 
rtl_disable_clock_request(tp);
 
-   RTL_W32(tp, TxConfig, RTL_R32(tp, TxConfig) | TXCFG_AUTO_FIFO);
RTL_W8(tp, MCU, RTL_R8(tp, MCU) & ~NOW_IS_OOB);
RTL_W8(tp, DLLPR, RTL_R8(tp, DLLPR) | PFM_EN);
RTL_W32(tp, MISC, RTL_R32(tp, MISC) | PWM_EN);
@@ -5090,8 +5093,6 @@ static void rtl_hw_start_8411(struct rtl8169_private *tp)
 
 static void rtl_hw_start_8168g(struct rtl8169_private *tp)
 {
-   RTL_W32(tp, TxConfig, RTL_R32(tp, TxConfig) | TXCFG_AUTO_FIFO);
-
rtl_eri_write(tp, 0xc8, ERIAR_MASK_0101, 0x080002, ERIAR_EXGMAC);
rtl_eri_write(tp, 0xcc, ERIAR_MASK_0001, 0x38, ERIAR_EXGMAC);
rtl_eri_write(tp, 0xd0, ERIAR_MASK_0001, 0x48, ERIAR_EXGMAC);
@@ -5189,8 +5190,6 @@ static void rtl_hw_start_8168h_1(struct rtl8169_private 
*tp)
rtl_hw_aspm_clkreq_enable(tp, false);
rtl_ephy_init(tp, e_info_8168h_1, ARRAY_SIZE(e_info_8168h_1));
 
-   RTL_W32(tp, TxConfig, RTL_R32(tp, TxConfig) | TXCFG_AUTO_FIFO);
-
rtl_eri_write(tp, 0xc8, ERIAR_MASK_0101, 0x00080002, ERIAR_EXGMAC);
rtl_eri_write(tp, 0xcc, ERIAR_MASK_0001, 0x38, ERIAR_EXGMAC);
rtl_eri_write(tp, 0xd0, ERIAR_MASK_0001, 0x48, ERIAR_EXGMAC);
@@ -5273,8 +5272,6 @@ static void rtl_hw_start_8168ep(struct rtl8169_private 
*tp)
 {
rtl8168ep_stop_cmac(tp);
 
-   RTL_W32(tp, TxConfig, RTL_R32(tp, TxConfig) | TXCFG_AUTO_FIFO);
-
rtl_eri_write(tp, 0xc8, ERIAR_MASK_0101, 0x00080002, ERIAR_EXGMAC);
rtl_eri_write(tp, 0xcc, ERIAR_MASK_0001, 0x2f, ERIAR_EXGMAC);
rtl_eri_write(tp, 0xd0, ERIAR_MASK_0001, 0x5f, ERIAR_EXGMAC);
@@ -5596,7 +5593,6 @@ static void rtl_hw_start_8402(struct rtl8169_private *tp)
/* Force LAN exit from ASPM if Rx/Tx are not idle */
RTL_W32(tp, FuncEvent, RTL_R32(tp, FuncEvent) | 0x002800);
 
-   RTL_W32(tp, TxConfig, RTL_R32(tp, TxConfig) | TXCFG_AUTO_FIFO);
RTL_W8(tp, MCU, RTL_R8(tp, MCU) & ~NOW_IS_OOB);
 
rtl_ephy_init(tp, e_info_8402, ARRAY_SIZE(e_info_8402));
-- 
2.19.0

[PATCH net-next v2] tcp: up initial rmem to 128KB and SYN rwin to around 64KB

2018-09-28 Thread Yuchung Cheng

Previously TCP initial receive buffer is ~87KB by default and
the initial receive window is ~29KB (20 MSS). This patch changes
the two numbers to 128KB and ~64KB (rounding down to the multiples
of MSS) respectively. The patch also simplifies the calculations s.t.
the two numbers are directly controlled by sysctl tcp_rmem[1]:

  1) Initial receiver buffer budget (sk_rcvbuf): while this should
 be configured via sysctl tcp_rmem[1], previously tcp_fixup_rcvbuf()
 always override and set a larger size when a new connection
 establishes.

  2) Initial receive window in SYN: previously it is set to 20
 packets if MSS <= 1460. The number 20 was based on the initial
 congestion window of 10: the receiver needs twice amount to
 avoid being limited by the receive window upon out-of-order
 delivery in the first window burst. But since this only
 applies if the receiving MSS <= 1460, connection using large MTU
 (e.g. to utilize receiver zero-copy) may be limited by the
 receive window.

This patch also lowers the initial bytes expected to receive in
the receiver buffer autotuning algorithm - otherwise the receiver
may take two to three rounds to increase the buffer to the
appropriate level (2x sender congestion window).

With this patch TCP memory configuration is more straight-forward and
more properly sized to modern high-speed networks by default. Several
popular stacks have been announcing 64KB rwin in SYNs as well.

Signed-off-by: Yuchung Cheng 
Signed-off-by: Wei Wang 
Signed-off-by: Neal Cardwell 
Signed-off-by: Eric Dumazet 
Reviewed-by: Soheil Hassas Yeganeh 
---
 net/ipv4/tcp.c|  4 ++--
 net/ipv4/tcp_input.c  | 30 +-
 net/ipv4/tcp_output.c | 25 -
 3 files changed, 11 insertions(+), 48 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 69c236943f56..dcf51fbf5ec7 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3896,8 +3896,8 @@ void __init tcp_init(void)
init_net.ipv4.sysctl_tcp_wmem[2] = max(64*1024, max_wshare);
 
init_net.ipv4.sysctl_tcp_rmem[0] = SK_MEM_QUANTUM;
-   init_net.ipv4.sysctl_tcp_rmem[1] = 87380;
-   init_net.ipv4.sysctl_tcp_rmem[2] = max(87380, max_rshare);
+   init_net.ipv4.sysctl_tcp_rmem[1] = 131072;
+   init_net.ipv4.sysctl_tcp_rmem[2] = max(131072, max_rshare);
 
pr_info("Hash tables configured (established %u bind %u)\n",
tcp_hashinfo.ehash_mask + 1, tcp_hashinfo.bhash_size);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d703a0b3b6a2..4f714a031618 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -426,27 +426,9 @@ static void tcp_grow_window(struct sock *sk, const struct 
sk_buff *skb)
}
 }
 
-/* 3. Tuning rcvbuf, when connection enters established state. */
-static void tcp_fixup_rcvbuf(struct sock *sk)
-{
-   u32 mss = tcp_sk(sk)->advmss;
-   int rcvmem;
-
-   rcvmem = 2 * SKB_TRUESIZE(mss + MAX_TCP_HEADER) *
-tcp_default_init_rwnd(mss);
-
-   /* Dynamic Right Sizing (DRS) has 2 to 3 RTT latency
-* Allow enough cushion so that sender is not limited by our window
-*/
-   if (sock_net(sk)->ipv4.sysctl_tcp_moderate_rcvbuf)
-   rcvmem <<= 2;
-
-   if (sk->sk_rcvbuf < rcvmem)
-   sk->sk_rcvbuf = min(rcvmem, 
sock_net(sk)->ipv4.sysctl_tcp_rmem[2]);
-}
-
-/* 4. Try to fixup all. It is made immediately after connection enters
- *established state.
+/* 3. Try to fixup all. It is made immediately after connection enters
+ *established state. Budget the space to the expected initial window
+ *of burst to auto-tune the receive buffer right after the first round.
  */
 void tcp_init_buffer_space(struct sock *sk)
 {
@@ -454,12 +436,10 @@ void tcp_init_buffer_space(struct sock *sk)
struct tcp_sock *tp = tcp_sk(sk);
int maxwin;
 
-   if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK))
-   tcp_fixup_rcvbuf(sk);
if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
tcp_sndbuf_expand(sk);
 
-   tp->rcvq_space.space = tp->rcv_wnd;
+   tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * 
tp->advmss);
tcp_mstamp_refresh(tp);
tp->rcvq_space.time = tp->tcp_mstamp;
tp->rcvq_space.seq = tp->copied_seq;
@@ -485,7 +465,7 @@ void tcp_init_buffer_space(struct sock *sk)
tp->snd_cwnd_stamp = tcp_jiffies32;
 }
 
-/* 5. Recalculate window clamp after socket hit its memory bounds. */
+/* 4. Recalculate window clamp after socket hit its memory bounds. */
 static void tcp_clamp_window(struct sock *sk)
 {
struct tcp_sock *tp = tcp_sk(sk);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index fe7855b090e4..059b67af28b1 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -195,21 +195,6 @@ static inline void tcp_event_ack_sent(struct sock *sk, 
unsigned int pkts,
inet_csk_clear_xmit_timer(sk,

Re: bond: take rcu lock in bond_poll_controller

2018-09-28 Thread Dave Jones

On Fri, Sep 28, 2018 at 12:03:22PM -0700, Cong Wang wrote:
 > On Fri, Sep 28, 2018 at 12:02 PM Cong Wang  wrote:
 > >
 > > On Fri, Sep 28, 2018 at 11:26 AM Dave Jones  
 > > wrote:
 > > > diff --git a/net/core/netpoll.c b/net/core/netpoll.c
 > > > index 3219a2932463..4f9494381635 100644
 > > > --- a/net/core/netpoll.c
 > > > +++ b/net/core/netpoll.c
 > > > @@ -330,6 +330,7 @@ void netpoll_send_skb_on_dev(struct netpoll *np, 
 > > > struct sk_buff *skb,
 > > > /* It is up to the caller to keep npinfo alive. */
 > > > struct netpoll_info *npinfo;
 > > >
 > > > +   rcu_read_lock();
 > > > lockdep_assert_irqs_disabled();
 > > >
 > > > npinfo = rcu_dereference_bh(np->dev->npinfo);
 > >
 > > I think you probably need rcu_read_lock_bh() to satisfy
 > > rcu_deference_bh()...
 > 
 > But irq is disabled here, so not sure if rcu_read_lock_bh()
 > could cause trouble... Interesting...

I was wondering for a moment why I never got a warning, then I
remembered I disabled lockdep for that machine because nfs spews stuff.

I'll doublecheck, and post v4. lol, this looked like a 2 minute fix at first.

Dave

[net-next:master 729/738] drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c:496:31: error: 'HNAE3_REVISION_ID_21' undeclared; did you mean 'FADT2_REVISION_ID'?

2018-09-28 Thread kbuild test robot

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 
master
head:   5362700c942b2cc4bab328361545a6d6fe649534
commit: 4dc13b9668d8ba7a5d1a26b88fa30baa8a214dcc [729/738] net: hns3: Add 
serdes parallel inner loopback support
config: i386-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
git checkout 4dc13b9668d8ba7a5d1a26b88fa30baa8a214dcc
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c: In function 
'hclge_get_sset_count':
>> drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c:496:31: error: 
>> 'HNAE3_REVISION_ID_21' undeclared (first use in this function); did you mean 
>> 'FADT2_REVISION_ID'?
  if (hdev->pdev->revision >= HNAE3_REVISION_ID_21 ||
  ^~~~
  FADT2_REVISION_ID
   drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c:496:31: note: each 
undeclared identifier is reported only once for each function it appears in

vim +496 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c

   476  
   477  static int hclge_get_sset_count(struct hnae3_handle *handle, int 
stringset)
   478  {
   479  #define HCLGE_LOOPBACK_TEST_FLAGS (HNAE3_SUPPORT_APP_LOOPBACK |\
   480  HNAE3_SUPPORT_PHY_LOOPBACK |\
   481  HNAE3_SUPPORT_SERDES_SERIAL_LOOPBACK |\
   482  HNAE3_SUPPORT_SERDES_PARALLEL_LOOPBACK)
   483  
   484  struct hclge_vport *vport = hclge_get_vport(handle);
   485  struct hclge_dev *hdev = vport->back;
   486  int count = 0;
   487  
   488  /* Loopback test support rules:
   489   * mac: only GE mode support
   490   * serdes: all mac mode will support include GE/XGE/LGE/CGE
   491   * phy: only support when phy device exist on board
   492   */
   493  if (stringset == ETH_SS_TEST) {
   494  /* clear loopback bit flags at first */
   495  handle->flags = (handle->flags & 
(~HCLGE_LOOPBACK_TEST_FLAGS));
 > 496  if (hdev->pdev->revision >= HNAE3_REVISION_ID_21 ||
   497  hdev->hw.mac.speed == HCLGE_MAC_SPEED_10M ||
   498  hdev->hw.mac.speed == HCLGE_MAC_SPEED_100M ||
   499  hdev->hw.mac.speed == HCLGE_MAC_SPEED_1G) {
   500  count += 1;
   501  handle->flags |= HNAE3_SUPPORT_APP_LOOPBACK;
   502  }
   503  
   504  count += 2;
   505  handle->flags |= HNAE3_SUPPORT_SERDES_SERIAL_LOOPBACK;
   506  handle->flags |= HNAE3_SUPPORT_SERDES_PARALLEL_LOOPBACK;
   507  } else if (stringset == ETH_SS_STATS) {
   508  count = ARRAY_SIZE(g_mac_stats_string) +
   509  hclge_tqps_get_sset_count(handle, stringset);
   510  }
   511  
   512  return count;
   513  }
   514  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip

Re: [PATCH net v2 0/2] net: phy: fix WoL handling when suspending the PHY

2018-09-28 Thread Heiner Kallweit

On 27.09.2018 05:04, David Miller wrote:
> From: Heiner Kallweit 
> Date: Mon, 24 Sep 2018 21:58:04 +0200
> 
>> phy_suspend doesn't always recognize that WoL is enabled and therefore
>> suspends the PHY when it should not. First idea to address the issue
>> was to reuse checks used in mdio_bus_phy_may_suspend which check
>> whether relevant devices are wakeup-enabled.
>> Florian raised some concerns because drivers may enable wakeup even if
>> WoL isn't enabled (e.g. certain USB network drivers).
>>
>> The new approach focuses on reducing the risk to break existing stuff.
>> We add a flag wol_enabled to struct net_device which is set in
>> ethtool_set_wol(). Then this flag is checked in phy_suspend().
>> This doesn't cover 100% of the cases yet (e.g. if WoL is enabled w/o
>> explicit configuration), but it covers the most relevant cases with
>> very little risk of regressions.
>>
>> v2:
>> - Fix a typo
> 
> Series applied, thanks.
> 
> Please deal with the extra 4 byte size of net_device in net-next.
> 
Sure. Next week I'm on travel, will do it the week after.

> Thanks.
>

[PATCH net-next] rtnetlink: fix rtnl_fdb_dump() for shorter family headers

2018-09-28 Thread Mauricio Faria de Oliveira

Currently, rtnl_fdb_dump() assumes the family header is 'struct ifinfomsg',
which is not always true.  For example, 'struct ndmsg' is used by iproute2
as well (in the 'ip neigh' command).

The problem is, the function bails out early if nlmsg_parse() fails, which
does occur for iproute2 usage of 'struct ndmsg' because the payload length
is shorter than the family header alone (as 'struct ifinfomsg' is assumed).

This breaks backward compatibility with userspace (different response) and
is a regression due to commit 0ff50e83b512 ("net: rtnetlink: bail out from 
 rtnl_fdb_dump() on parse error").

Some examples with iproute2 and netlink library for go [1]:

 1) $ bridge fdb show
33:33:00:00:00:01 dev ens3 self permanent
01:00:5e:00:00:01 dev ens3 self permanent
33:33:ff:15:98:30 dev ens3 self permanent

  This one works, as it uses 'struct ifinfomsg'.

  fdb_show() @ iproute2/bridge/fdb.c
"""
.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
...
if (rtnl_dump_request(, RTM_GETNEIGH, [...]
"""

 2) $ ip --family bridge neigh
RTNETLINK answers: Invalid argument
Dump terminated

  This one fails, as it uses 'struct ndmsg'.

  do_show_or_flush() @ iproute2/ip/ipneigh.c
"""
.n.nlmsg_type = RTM_GETNEIGH,
.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ndmsg)),
"""

 3) $ ./neighlist
< no output >

  This one fails, as it uses 'struct ndmsg'-based.

  neighList() @ netlink/neigh_linux.go
"""
req := h.newNetlinkRequest(unix.RTM_GETNEIGH, [...]
msg := Ndmsg{
"""

The actual breakage was introduced by commit 0ff50e83b512 ("net: rtnetlink:
bail out from rtnl_fdb_dump() on parse error"), because nlmsg_parse() fails
if the payload length (with the _actual_ family header) is less than the
family header length alone (which is assumed, in parameter 'hdrlen').
This is true in the examples above with struct ndmsg, with size and payload
length shorter than struct ifinfomsg.

However, that commit just intends to fix something under the assumption the
family header is indeed an 'struct ifinfomsg' - by preventing access to the
payload as such (via 'ifm' pointer) if the payload length is not sufficient
to actually contain it.

The assumption was introduced by commit 5e6d24358799 ("bridge: netlink dump
interface at par with brctl"), to support iproute2's 'bridge fdb' command
(not 'ip neigh') which indeed uses 'struct ifinfomsg', thus is not broken.

So, in order to unbreak shorter family headers as 'struct ndmsg' and still
prevent access to invalid payload data, let's revert that former fix, and
move ifinfomsg payload access (via ifm) into nlmsg_parse() successful case
(i.e., the payload length is sufficient to be accessed as struct ifinfomsg)
which also works/returns 0 even in case there are no attributes in payload.

Same examples with this patch applied (or revert/before the original fix):

$ bridge fdb show
33:33:00:00:00:01 dev ens3 self permanent
01:00:5e:00:00:01 dev ens3 self permanent
33:33:ff:15:98:30 dev ens3 self permanent

$ ip --family bridge neigh
dev ens3 lladdr 33:33:00:00:00:01 PERMANENT
dev ens3 lladdr 01:00:5e:00:00:01 PERMANENT
dev ens3 lladdr 33:33:ff:15:98:30 PERMANENT

$ ./neighlist
netlink.Neigh{LinkIndex:2, Family:7, State:128, Type:0, Flags:2, 
IP:net.IP(nil), HardwareAddr:net.HardwareAddr{0x33, 0x33, 0x0, 0x0, 0x0, 0x1}, 
LLIPAddr:net.IP(nil), Vlan:0, VNI:0}
netlink.Neigh{LinkIndex:2, Family:7, State:128, Type:0, Flags:2, 
IP:net.IP(nil), HardwareAddr:net.HardwareAddr{0x1, 0x0, 0x5e, 0x0, 0x0, 0x1}, 
LLIPAddr:net.IP(nil), Vlan:0, VNI:0}
netlink.Neigh{LinkIndex:2, Family:7, State:128, Type:0, Flags:2, 
IP:net.IP(nil), HardwareAddr:net.HardwareAddr{0x33, 0x33, 0xff, 0x15, 0x98, 
0x30}, LLIPAddr:net.IP(nil), Vlan:0, VNI:0}

Tested on mainline (ad0371482b1e) and net-next (a804e5e21875) on Sep. 28.

References:

[1] netlink library for go (test-case)
https://github.com/vishvananda/netlink

$ cat ~/go/src/neighlist/main.go
package main
import ("fmt"; "syscall"; "github.com/vishvananda/netlink")
func main() {
neighs, _ := netlink.NeighList(0, syscall.AF_BRIDGE)
for _, neigh := range neighs { fmt.Printf("%#v\n", neigh) }
}

$ export GOPATH=~/go
$ go get github.com/vishvananda/netlink
$ go build neighlist
$ ~/go/src/neighlist/neighlist

Fixes: 0ff50e83b512 ("net: rtnetlink: bail out from rtnl_fdb_dump() on parse 
error")
Fixes: 5e6d24358799 ("bridge: netlink dump interface at par with brctl")
Reported-by: Aidan Obley 
Signed-off-by: Mauricio Faria de Oliveira 
---
P.S.: this may be 'net', but labeling as 'net-next' for possible relation to 
recent thread
[PATCH RFC net-next 0/5] rtnetlink: Add support for rigid checking of data in 
dump request

 net/core/rtnetlink.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git

[PATCH net] rtnetlink: Fail dump if target netnsid is invalid

2018-09-28 Thread David Ahern

From: David Ahern 

Link dumps can return results from a target namespace. If the namespace id
is invalid, then the dump request should fail if get_target_net fails
rather than continuing with a dump of the current namespace.

Fixes: 79e1ad148c844 ("rtnetlink: use netnsid to query interface")
Signed-off-by: David Ahern 
---
 net/core/rtnetlink.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 63ce2283a456..7f37fe9c65a5 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1898,10 +1898,8 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, struct 
netlink_callback *cb)
if (tb[IFLA_IF_NETNSID]) {
netnsid = nla_get_s32(tb[IFLA_IF_NETNSID]);
tgt_net = get_target_net(skb->sk, netnsid);
-   if (IS_ERR(tgt_net)) {
-   tgt_net = net;
-   netnsid = -1;
-   }
+   if (IS_ERR(tgt_net))
+   return PTR_ERR(tgt_net);
}
 
if (tb[IFLA_EXT_MASK])
-- 
2.11.0

Re: bond: take rcu lock in bond_poll_controller

2018-09-28 Thread Cong Wang

On Fri, Sep 28, 2018 at 12:02 PM Cong Wang  wrote:
>
> On Fri, Sep 28, 2018 at 11:26 AM Dave Jones  wrote:
> > diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> > index 3219a2932463..4f9494381635 100644
> > --- a/net/core/netpoll.c
> > +++ b/net/core/netpoll.c
> > @@ -330,6 +330,7 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct 
> > sk_buff *skb,
> > /* It is up to the caller to keep npinfo alive. */
> > struct netpoll_info *npinfo;
> >
> > +   rcu_read_lock();
> > lockdep_assert_irqs_disabled();
> >
> > npinfo = rcu_dereference_bh(np->dev->npinfo);
>
> I think you probably need rcu_read_lock_bh() to satisfy
> rcu_deference_bh()...

But irq is disabled here, so not sure if rcu_read_lock_bh()
could cause trouble... Interesting...

Re: bond: take rcu lock in bond_poll_controller

2018-09-28 Thread Cong Wang

On Fri, Sep 28, 2018 at 11:26 AM Dave Jones  wrote:
> diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> index 3219a2932463..4f9494381635 100644
> --- a/net/core/netpoll.c
> +++ b/net/core/netpoll.c
> @@ -330,6 +330,7 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct 
> sk_buff *skb,
> /* It is up to the caller to keep npinfo alive. */
> struct netpoll_info *npinfo;
>
> +   rcu_read_lock();
> lockdep_assert_irqs_disabled();
>
> npinfo = rcu_dereference_bh(np->dev->npinfo);

I think you probably need rcu_read_lock_bh() to satisfy
rcu_deference_bh()...

Re: [PATCH RFC net-next 0/5] rtnetlink: Add support for rigid checking of data in dump request

2018-09-28 Thread Christian Brauner

On Fri, Sep 28, 2018 at 08:44:57AM -0700, dsah...@kernel.org wrote:
> From: David Ahern 
> 
> There are many use cases where a user wants to influence what is
> returned in a dump for some rtnetlink command: one is wanting data
> for a different namespace than the one the request is received and
> another is limiting the amount of data returned in the dump to a
> specific set of interest to userspace, reducing the cpu overhead of
> both kernel and userspace. Unfortunately, the kernel has historically
> not been strict with checking for the proper header or checking the
> values passed in the header. This lenient implementation has allowed
> iproute2 and other packages to pass any struct or data in the dump
> request as long as the family is the first byte. For example, ifinfomsg
> struct is used by iproute2 for all generic dump requests - links,
> addresses, routes and rules when it is really only valid for link
> requests.
> 
> There is 1 is example where the kernel deals with the wrong struct: link
> dumps after VF support was added. Older iproute2 was sending rtgenmsg as
> the header instead of ifinfomsg so a patch was added to try and detect
> old userspace vs new:
> e5eca6d41f53 ("rtnetlink: fix userspace API breakage for iproute2 < v3.9.0")
> 
> The latest example is Christian's patch set wanting to return addresses for
> a target namespace. It guesses the header struct is an ifaddrmsg and if it
> guesses wrong a netlink warning is generated in the kernel log on every
> address dump which is unacceptable.
> 
> Another example where the kernel is a bit lenient is route dumps: iproute2
> can send either a request with either ifinfomsg or a rtmsg as the header
> struct, yet the kernel always treats the header as an rtmsg (see
> inet_dump_fib and rtm_flags check).
> 
> How to resolve the problem of not breaking old userspace yet be able to
> move forward with new features such as kernel side filtering which are
> crucial for efficient operation at high scale?
> 
> This patch set addresses the problem by adding a new netlink flag,
> NLM_F_DUMP_PROPER_HDR, that userspace can set to say "I have a clue, and
> I am sending the right header struct" and that the struct fields and any
> attributes after it should be used for filtering the data returned in the
> dump.
> 
> Kernel side, the dump handlers are updated to check every field in the
> header struct and all attributes passed. Only ones where filtering is
> implemented are allowed to be set. Any other values cause the dump to fail
> with EINVAL. If the new flag is honored by the kernel and the dump contents
> adjusted by any data passed in the request, the dump handler sets the
> NLM_F_DUMP_FILTERED flag in the netlink message header.
> 
> This is an RFC set with the address handlers updated. If the approach is
> acceptable, then I will do the same to the other rtnetlink dump handlers.

I like the idea and I think this might be a good solution to this
problem. If we can agree on this in favor of mine I'm all for it!
Thanks, David!

Christian

> 
> 
> David Ahern (5):
>   net/netlink: Pass extack to dump callbacks
>   net/ipv6: Refactor address dump to push inet6_fill_args to
> in6_dump_addrs
>   netlink: introduce NLM_F_DUMP_PROPER_HDR flag
>   net/ipv4: Update inet_dump_ifaddr to support NLM_F_DUMP_PROPER_HDR
>   net/ipv6: Update inet6_dump_addr to support NLM_F_DUMP_PROPER_HDR
> 
>  include/linux/netlink.h  |   2 +
>  include/uapi/linux/netlink.h |   1 +
>  net/core/rtnetlink.c |   1 +
>  net/ipv4/devinet.c   |  52 +-
>  net/ipv6/addrconf.c  | 101 
> +--
>  net/netlink/af_netlink.c |   1 +
>  6 files changed, 114 insertions(+), 44 deletions(-)
> 
> -- 
> 2.11.0
>

Re: [PATCH RFC net-next 2/5] net/ipv6: Refactor address dump to push inet6_fill_args to in6_dump_addrs

2018-09-28 Thread Christian Brauner

On Fri, Sep 28, 2018 at 08:44:59AM -0700, dsah...@kernel.org wrote:
> From: David Ahern 
> 
> Pull the inet6_fill_args arg up to in6_dump_addrs and move netnsid
> into it. Since IFA_TARGET_NETNSID is a kernel side filter add the
> NLM_F_DUMP_FILTERED flag so userspace knows the request was honored.
> 
> Signed-off-by: David Ahern 

Acked-by: Christian Brauner 

> ---
>  net/ipv6/addrconf.c | 59 
> +
>  1 file changed, 32 insertions(+), 27 deletions(-)
> 
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index a9a317322388..375ea9d9869b 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -4793,12 +4793,19 @@ static inline int inet6_ifaddr_msgsize(void)
>  + nla_total_size(4)  /* IFA_RT_PRIORITY */;
>  }
>  
> +enum addr_type_t {
> + UNICAST_ADDR,
> + MULTICAST_ADDR,
> + ANYCAST_ADDR,
> +};
> +
>  struct inet6_fill_args {
>   u32 portid;
>   u32 seq;
>   int event;
>   unsigned int flags;
>   int netnsid;
> + enum addr_type_t type;
>  };
>  
>  static int inet6_fill_ifaddr(struct sk_buff *skb, struct inet6_ifaddr *ifa,
> @@ -4930,39 +4937,28 @@ static int inet6_fill_ifacaddr(struct sk_buff *skb, 
> struct ifacaddr6 *ifaca,
>   return 0;
>  }
>  
> -enum addr_type_t {
> - UNICAST_ADDR,
> - MULTICAST_ADDR,
> - ANYCAST_ADDR,
> -};
> -
>  /* called with rcu_read_lock() */
>  static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb,
> -   struct netlink_callback *cb, enum addr_type_t type,
> -   int s_ip_idx, int *p_ip_idx, int netnsid)
> +   struct netlink_callback *cb,
> +   int s_ip_idx, int *p_ip_idx,
> +   struct inet6_fill_args *fillargs)
>  {
> - struct inet6_fill_args fillargs = {
> - .portid = NETLINK_CB(cb->skb).portid,
> - .seq = cb->nlh->nlmsg_seq,
> - .flags = NLM_F_MULTI,
> - .netnsid = netnsid,
> - };
>   struct ifmcaddr6 *ifmca;
>   struct ifacaddr6 *ifaca;
>   int err = 1;
>   int ip_idx = *p_ip_idx;
>  
>   read_lock_bh(>lock);
> - switch (type) {
> + switch (fillargs->type) {
>   case UNICAST_ADDR: {
>   struct inet6_ifaddr *ifa;
> - fillargs.event = RTM_NEWADDR;
> + fillargs->event = RTM_NEWADDR;
>  
>   /* unicast address incl. temp addr */
>   list_for_each_entry(ifa, >addr_list, if_list) {
>   if (++ip_idx < s_ip_idx)
>   continue;
> - err = inet6_fill_ifaddr(skb, ifa, );
> + err = inet6_fill_ifaddr(skb, ifa, fillargs);
>   if (err < 0)
>   break;
>   nl_dump_check_consistent(cb, nlmsg_hdr(skb));
> @@ -4970,26 +4966,26 @@ static int in6_dump_addrs(struct inet6_dev *idev, 
> struct sk_buff *skb,
>   break;
>   }
>   case MULTICAST_ADDR:
> - fillargs.event = RTM_GETMULTICAST;
> + fillargs->event = RTM_GETMULTICAST;
>  
>   /* multicast address */
>   for (ifmca = idev->mc_list; ifmca;
>ifmca = ifmca->next, ip_idx++) {
>   if (ip_idx < s_ip_idx)
>   continue;
> - err = inet6_fill_ifmcaddr(skb, ifmca, );
> + err = inet6_fill_ifmcaddr(skb, ifmca, fillargs);
>   if (err < 0)
>   break;
>   }
>   break;
>   case ANYCAST_ADDR:
> - fillargs.event = RTM_GETANYCAST;
> + fillargs->event = RTM_GETANYCAST;
>   /* anycast address */
>   for (ifaca = idev->ac_list; ifaca;
>ifaca = ifaca->aca_next, ip_idx++) {
>   if (ip_idx < s_ip_idx)
>   continue;
> - err = inet6_fill_ifacaddr(skb, ifaca, );
> + err = inet6_fill_ifacaddr(skb, ifaca, fillargs);
>   if (err < 0)
>   break;
>   }
> @@ -5005,10 +5001,16 @@ static int in6_dump_addrs(struct inet6_dev *idev, 
> struct sk_buff *skb,
>  static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
>  enum addr_type_t type)
>  {
> + struct inet6_fill_args fillargs = {
> + .portid = NETLINK_CB(cb->skb).portid,
> + .seq = cb->nlh->nlmsg_seq,
> + .flags = NLM_F_MULTI,
> + .netnsid = -1,
> + .type = type,
> + };
>   struct net *net = sock_net(skb->sk);
>   struct nlattr *tb[IFA_MAX+1];
>   struct net *tgt_net = net;
> - int netnsid = -1;
>   int h, s_h;
>   int idx, ip_idx;
>   int s_idx, s_ip_idx;
> @@ -5023,11 +5025,14 @@

Re: [PATCH RFC net-next 1/5] net/netlink: Pass extack to dump callbacks

2018-09-28 Thread Christian Brauner

On Fri, Sep 28, 2018 at 08:44:58AM -0700, dsah...@kernel.org wrote:
> From: David Ahern 
> 
> Pass extack to dump callbacks by adding extack to netlink_dump_control
> and transferring to netlink_callback. Update rtnetlink as the first
> user.
> 
> Signed-off-by: David Ahern 

I like the idea of passing down extack.

Acked-by: Christian Brauner 

> ---
>  include/linux/netlink.h  | 2 ++
>  net/core/rtnetlink.c | 1 +
>  net/netlink/af_netlink.c | 1 +
>  3 files changed, 4 insertions(+)
> 
> diff --git a/include/linux/netlink.h b/include/linux/netlink.h
> index 71f121b66ca8..8fc90308a653 100644
> --- a/include/linux/netlink.h
> +++ b/include/linux/netlink.h
> @@ -176,6 +176,7 @@ struct netlink_callback {
>   void*data;
>   /* the module that dump function belong to */
>   struct module   *module;
> + struct netlink_ext_ack  *extack;
>   u16 family;
>   u16 min_dump_alloc;
>   unsigned intprev_seq, seq;
> @@ -197,6 +198,7 @@ struct netlink_dump_control {
>   int (*done)(struct netlink_callback *);
>   void *data;
>   struct module *module;
> + struct netlink_ext_ack *extack;
>   u16 min_dump_alloc;
>  };
>  
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index 35162e1b06ad..da91b38297d3 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -4689,6 +4689,7 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, 
> struct nlmsghdr *nlh,
>   .dump   = dumpit,
>   .min_dump_alloc = min_dump_alloc,
>   .module = owner,
> + .extack = extack
>   };
>   err = netlink_dump_start(rtnl, skb, nlh, );
>   /* netlink_dump_start() will keep a reference on
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index e3a0538ec0be..7d9e735b32c4 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -2307,6 +2307,7 @@ int __netlink_dump_start(struct sock *ssk, struct 
> sk_buff *skb,
>   cb->module = control->module;
>   cb->min_dump_alloc = control->min_dump_alloc;
>   cb->skb = skb;
> + cb->extack = control->extack;
>  
>   if (control->start) {
>   ret = control->start(cb);
> -- 
> 2.11.0
>

Re: [PATCH RFC net-next 4/5] net/ipv4: Update inet_dump_ifaddr to support NLM_F_DUMP_PROPER_HDR

2018-09-28 Thread David Ahern

On 9/28/18 12:41 PM, Christian Brauner wrote:
>> @@ -1683,15 +1683,45 @@ static int inet_dump_ifaddr(struct sk_buff *skb, 
>> struct netlink_callback *cb)
>>  s_idx = idx = cb->args[1];
>>  s_ip_idx = ip_idx = cb->args[2];
>>  
>> -if (nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb, IFA_MAX,
>> -ifa_ipv4_policy, NULL) >= 0) {
>> -if (tb[IFA_TARGET_NETNSID]) {
>> -fillargs.netnsid = nla_get_s32(tb[IFA_TARGET_NETNSID]);
>> +if (nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR) {
>> +struct nlattr *tb[IFA_MAX+1];
>> +struct ifaddrmsg *ifm;
>> +int err, i;
>> +
>> +if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*ifm))) {
>> +NL_SET_ERR_MSG(extack, "Invalid header");
>> +return -EINVAL;
>> +}
>> +
>> +ifm = (struct ifaddrmsg *) nlmsg_data(cb->nlh);
>> +if (ifm->ifa_prefixlen || ifm->ifa_flags || ifm->ifa_scope) {
>> +NL_SET_ERR_MSG(extack, "Invalid values in header for 
>> dump request");
>> +return -EINVAL;
>> +}
>> +if (ifm->ifa_index) {
>> +NL_SET_ERR_MSG(extack, "Filter by device index not 
>> supported");
>> +return -EINVAL;
>> +}
>> +err = nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb, 
>> IFA_MAX,
>> +ifa_ipv4_policy, NULL);
>> +if (err < 0)
>> +return err;
>>  
>> -tgt_net = rtnl_get_net_ns_capable(skb->sk,
>> -  fillargs.netnsid);
>> -if (IS_ERR(tgt_net))
>> -return PTR_ERR(tgt_net);
>> +for (i = 0; i < IFA_MAX; ++i) {
>> +if (i == IFA_TARGET_NETNSID) {
>> +fillargs.netnsid = nla_get_s32(tb[i]);
>> +
>> +tgt_net = rtnl_get_net_ns_capable(skb->sk,
>> +  
>> fillargs.netnsid);
>> +if (IS_ERR(tgt_net))
>> +return PTR_ERR(tgt_net);
>> +
>> +fillargs.flags |= NLM_F_DUMP_FILTERED;
>> +}
>> +if (tb[i]) {
>> +NL_SET_ERR_MSG(extack, "Unsupported attribute 
>> in dump request");
>> +return -EINVAL;
>> +}
> 
> That loop doesn't do what it promises, no? Shouldn't it be:

your right, that should be:
} else if (tb[i]) {

Re: [PATCH RFC net-next 5/5] net/ipv6: Update inet6_dump_addr to support NLM_F_DUMP_PROPER_HDR

2018-09-28 Thread Christian Brauner

On Fri, Sep 28, 2018 at 08:45:02AM -0700, dsah...@kernel.org wrote:
> From: David Ahern 
> 
> Update inet6_dump_addr to check for NLM_F_DUMP_PROPER_HDR in the netlink
> message header. If the flag is set, the dump request is expected to have
> an ifaddrmsg struct as the header potentially followed by one or more
> attributes. Any data passed in the header or as an attribute is taken as
> a request to influence the data returned. Only values suppored by the
> dump handler are allowed to be non-0 or set in the request. At the moment
> only the IFA_TARGET_NETNSID attribute is supported. Follow on patches
> will support for other fields (e.g., honor ifa_index and only return data
> for the given device index).
> 
> Signed-off-by: David Ahern 
> ---
>  net/ipv6/addrconf.c | 50 --
>  1 file changed, 40 insertions(+), 10 deletions(-)
> 
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 375ea9d9869b..888e5a4b8dd2 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -5001,6 +5001,8 @@ static int in6_dump_addrs(struct inet6_dev *idev, 
> struct sk_buff *skb,
>  static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
>  enum addr_type_t type)
>  {
> + struct netlink_ext_ack *extack = cb->extack;
> + const struct nlmsghdr *nlh = cb->nlh;
>   struct inet6_fill_args fillargs = {
>   .portid = NETLINK_CB(cb->skb).portid,
>   .seq = cb->nlh->nlmsg_seq,
> @@ -5009,7 +5011,6 @@ static int inet6_dump_addr(struct sk_buff *skb, struct 
> netlink_callback *cb,
>   .type = type,
>   };
>   struct net *net = sock_net(skb->sk);
> - struct nlattr *tb[IFA_MAX+1];
>   struct net *tgt_net = net;
>   int h, s_h;
>   int idx, ip_idx;
> @@ -5022,17 +5023,46 @@ static int inet6_dump_addr(struct sk_buff *skb, 
> struct netlink_callback *cb,
>   s_idx = idx = cb->args[1];
>   s_ip_idx = ip_idx = cb->args[2];
>  
> - if (nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb, IFA_MAX,
> - ifa_ipv6_policy, NULL) >= 0) {
> - if (tb[IFA_TARGET_NETNSID]) {
> - fillargs.netnsid = nla_get_s32(tb[IFA_TARGET_NETNSID]);
> + if (nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR) {
> + struct nlattr *tb[IFA_MAX+1];
> + struct ifaddrmsg *ifm;
> + int err, i;
> +
> + if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*ifm))) {
> + NL_SET_ERR_MSG(extack, "Invalid header");
> + return -EINVAL;
> + }
>  
> - tgt_net = rtnl_get_net_ns_capable(skb->sk,
> -   fillargs.netnsid);
> - if (IS_ERR(tgt_net))
> - return PTR_ERR(tgt_net);
> + ifm = (struct ifaddrmsg *)nlmsg_data(cb->nlh);
> + if (ifm->ifa_prefixlen || ifm->ifa_flags || ifm->ifa_scope) {
> + NL_SET_ERR_MSG(extack, "Invalid values in header for 
> dump request");
> + return -EINVAL;
> + }
> + if (ifm->ifa_index) {
> + NL_SET_ERR_MSG(extack, "Filter by device index not 
> supported");
> + return -EINVAL;
> + }
>  
> - fillargs.flags |= NLM_F_DUMP_FILTERED;
> + err = nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb, 
> IFA_MAX,
> +   ifa_ipv6_policy, NULL);
> + if (err < 0)
> + return err;
> +
> + for (i = 0; i < IFA_MAX; ++i) {
> + if (i == IFA_TARGET_NETNSID) {
> + fillargs.netnsid = nla_get_s32(tb[i]);
> +
> + tgt_net = rtnl_get_net_ns_capable(skb->sk,
> +   
> fillargs.netnsid);
> + if (IS_ERR(tgt_net))
> + return PTR_ERR(tgt_net);
> +
> + fillargs.flags |= NLM_F_DUMP_FILTERED;
> + }
> + if (tb[i]) {
> + NL_SET_ERR_MSG(extack, "Unsupported attribute 
> in dump request");
> + return -EINVAL;
> + }

Same comment/question as for ipv4. Shouldn't it be:


for (i = 0; i < IFA_MAX; ++i) {
if (i == IFA_TARGET_NETNSID && tb[i]) {
fillargs.netnsid = nla_get_s32(tb[i]);

tgt_net = rtnl_get_net_ns_capable(skb->sk,
  fillargs.netnsid);
if (IS_ERR(tgt_net))
return PTR_ERR(tgt_net);

fillargs.flags |= NLM_F_DUMP_FILTERED;
continue;
}

Re: [PATCH RFC net-next 4/5] net/ipv4: Update inet_dump_ifaddr to support NLM_F_DUMP_PROPER_HDR

2018-09-28 Thread Christian Brauner

On Fri, Sep 28, 2018 at 08:45:01AM -0700, dsah...@kernel.org wrote:
> From: David Ahern 
> 
> Update inet_dump_ifaddr to check for NLM_F_DUMP_PROPER_HDR in the netlink
> message header. If the flag is set, the dump request is expected to have
> an ifaddrmsg struct as the header potentially followed by one or more
> attributes. Any data passed in the header or as an attribute is taken as
> a request to influence the data returned. Only values suppored by the
> dump handler are allowed to be non-0 or set in the request. At the moment
> only the IFA_TARGET_NETNSID attribute is supported. Follow on patches
> will support for other fields (e.g., honor ifa_index and only return data
> for the given device index).
> 
> Signed-off-by: David Ahern 
> ---
>  net/ipv4/devinet.c | 52 +---
>  1 file changed, 41 insertions(+), 11 deletions(-)
> 
> diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
> index 44d931a3cd50..1e06a21cd8f4 100644
> --- a/net/ipv4/devinet.c
> +++ b/net/ipv4/devinet.c
> @@ -1661,15 +1661,15 @@ static int inet_fill_ifaddr(struct sk_buff *skb, 
> struct in_ifaddr *ifa,
>  
>  static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
>  {
> + struct netlink_ext_ack *extack = cb->extack;
> + const struct nlmsghdr *nlh = cb->nlh;
>   struct inet_fill_args fillargs = {
>   .portid = NETLINK_CB(cb->skb).portid,
> - .seq = cb->nlh->nlmsg_seq,
> + .seq = nlh->nlmsg_seq,
>   .event = RTM_NEWADDR,
> - .flags = NLM_F_MULTI,
>   .netnsid = -1,
>   };
>   struct net *net = sock_net(skb->sk);
> - struct nlattr *tb[IFA_MAX+1];
>   struct net *tgt_net = net;
>   int h, s_h;
>   int idx, s_idx;
> @@ -1683,15 +1683,45 @@ static int inet_dump_ifaddr(struct sk_buff *skb, 
> struct netlink_callback *cb)
>   s_idx = idx = cb->args[1];
>   s_ip_idx = ip_idx = cb->args[2];
>  
> - if (nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb, IFA_MAX,
> - ifa_ipv4_policy, NULL) >= 0) {
> - if (tb[IFA_TARGET_NETNSID]) {
> - fillargs.netnsid = nla_get_s32(tb[IFA_TARGET_NETNSID]);
> + if (nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR) {
> + struct nlattr *tb[IFA_MAX+1];
> + struct ifaddrmsg *ifm;
> + int err, i;
> +
> + if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*ifm))) {
> + NL_SET_ERR_MSG(extack, "Invalid header");
> + return -EINVAL;
> + }
> +
> + ifm = (struct ifaddrmsg *) nlmsg_data(cb->nlh);
> + if (ifm->ifa_prefixlen || ifm->ifa_flags || ifm->ifa_scope) {
> + NL_SET_ERR_MSG(extack, "Invalid values in header for 
> dump request");
> + return -EINVAL;
> + }
> + if (ifm->ifa_index) {
> + NL_SET_ERR_MSG(extack, "Filter by device index not 
> supported");
> + return -EINVAL;
> + }
> + err = nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb, 
> IFA_MAX,
> + ifa_ipv4_policy, NULL);
> + if (err < 0)
> + return err;
>  
> - tgt_net = rtnl_get_net_ns_capable(skb->sk,
> -   fillargs.netnsid);
> - if (IS_ERR(tgt_net))
> - return PTR_ERR(tgt_net);
> + for (i = 0; i < IFA_MAX; ++i) {
> + if (i == IFA_TARGET_NETNSID) {
> + fillargs.netnsid = nla_get_s32(tb[i]);
> +
> + tgt_net = rtnl_get_net_ns_capable(skb->sk,
> +   
> fillargs.netnsid);
> + if (IS_ERR(tgt_net))
> + return PTR_ERR(tgt_net);
> +
> + fillargs.flags |= NLM_F_DUMP_FILTERED;
> + }
> + if (tb[i]) {
> + NL_SET_ERR_MSG(extack, "Unsupported attribute 
> in dump request");
> + return -EINVAL;
> + }

That loop doesn't do what it promises, no? Shouldn't it be:

for (i = 0; i < IFA_MAX; ++i) {
if (i == IFA_TARGET_NETNSID && tb[IFA_TARGET_NETNSID]) {
fillargs.netnsid = nla_get_s32(tb[i]);

tgt_net = rtnl_get_net_ns_capable(skb->sk,
  
fillargs.netnsid);
if (IS_ERR(tgt_net))
return PTR_ERR(tgt_net);

fillargs.flags |= NLM_F_DUMP_FILTERED;
continue;
}

Re: bond: take rcu lock in bond_poll_controller

2018-09-28 Thread Eric Dumazet




On 09/28/2018 11:24 AM, Dave Jones wrote:
> Callers of bond_for_each_slave_rcu are expected to hold the rcu lock,
> otherwise a trace like below is shown
> 
> WARNING: CPU: 2 PID: 179 at net/core/dev.c:6567 
> netdev_lower_get_next_private_rcu+0x34/0x40
> CPU: 2 PID: 179 Comm: kworker/u16:15 Not tainted 4.19.0-rc5-backup+ #1
>
> 
> Suggested-by: Cong Wang 
> Signed-off-by: Dave Jones 
> 


You forgot to change patch title.

Re: [Patch net-next v3] net_sched: change tcf_del_walker() to take idrinfo->lock

2018-09-28 Thread Cong Wang

On Fri, Sep 28, 2018 at 11:11 AM Ido Schimmel  wrote:
> I don't think this will work given the reference count already dropped
> to 0, which is why the template deletion function was invoked. I didn't
> test the patch, but I don't see what would prevent the chain from being
> freed.

Good catch! So I should just "move" tcf_chain_destroy() into cls_flower.

A patch is being compiled now.

bond: take rcu lock in bond_poll_controller

2018-09-28 Thread Dave Jones

Callers of bond_for_each_slave_rcu are expected to hold the rcu lock,
otherwise a trace like below is shown

WARNING: CPU: 2 PID: 179 at net/core/dev.c:6567 
netdev_lower_get_next_private_rcu+0x34/0x40
CPU: 2 PID: 179 Comm: kworker/u16:15 Not tainted 4.19.0-rc5-backup+ #1
Workqueue: bond0 bond_mii_monitor
RIP: 0010:netdev_lower_get_next_private_rcu+0x34/0x40
Code: 48 89 fb e8 fe 29 63 ff 85 c0 74 1e 48 8b 45 00 48 81 c3 c0 00 00 00 48 
8b 00 48 39 d8 74 0f 48 89 45 00 48 8b 40 f8 5b 5d c3 <0f> 0b eb de 31 c0 eb f5 
0f 1f 40 00 0f 1f 44 00 00 48 8>
RSP: 0018:c987fa68 EFLAGS: 00010046
RAX:  RBX: 880429614560 RCX: 
RDX: 0001 RSI:  RDI: a184ada0
RBP: c987fa80 R08: 0001 R09: 
R10: c987f9f0 R11: 880429798040 R12: 8804289d5980
R13: a1511f60 R14: 00c8 R15: 
FS:  () GS:88042f88() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f4b78fce180 CR3: 00018180f006 CR4: 001606e0
Call Trace:
 bond_poll_controller+0x52/0x170
 netpoll_poll_dev+0x79/0x290
 netpoll_send_skb_on_dev+0x158/0x2c0
 netpoll_send_udp+0x2d5/0x430
 write_ext_msg+0x1e0/0x210
 console_unlock+0x3c4/0x630
 vprintk_emit+0xfa/0x2f0
 printk+0x52/0x6e
 ? __netdev_printk+0x12b/0x220
 netdev_info+0x64/0x80
 ? bond_3ad_set_carrier+0xe9/0x180
 bond_select_active_slave+0x1fc/0x310
 bond_mii_monitor+0x709/0x9b0
 process_one_work+0x221/0x5e0
 worker_thread+0x4f/0x3b0
 kthread+0x100/0x140
 ? process_one_work+0x5e0/0x5e0
 ? kthread_delayed_work_timer_fn+0x90/0x90
 ret_from_fork+0x24/0x30

Suggested-by: Cong Wang 
Signed-off-by: Dave Jones 

-- 
v3: Do this in netpoll_send_skb_on_dev as Cong suggests.

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 3219a2932463..4f9494381635 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -330,6 +330,7 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct 
sk_buff *skb,
/* It is up to the caller to keep npinfo alive. */
struct netpoll_info *npinfo;
 
+   rcu_read_lock();
lockdep_assert_irqs_disabled();
 
npinfo = rcu_dereference_bh(np->dev->npinfo);
@@ -374,6 +375,7 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct 
sk_buff *skb,
skb_queue_tail(>txq, skb);
schedule_delayed_work(>tx_work,0);
}
+   rcu_read_unlock();
 }
 EXPORT_SYMBOL(netpoll_send_skb_on_dev);

Re: bond: take rcu lock in bond_poll_controller

2018-09-28 Thread Dave Jones

On Fri, Sep 28, 2018 at 10:31:39AM -0700, Cong Wang wrote:
 > On Fri, Sep 28, 2018 at 10:25 AM Dave Jones  wrote:
 > >
 > > On Fri, Sep 28, 2018 at 09:55:52AM -0700, Cong Wang wrote:
 > >  > On Fri, Sep 28, 2018 at 9:18 AM Dave Jones  
 > > wrote:
 > >  > >
 > >  > > Callers of bond_for_each_slave_rcu are expected to hold the rcu lock,
 > >  > > otherwise a trace like below is shown
 > >  >
 > >  > So why not take rcu read lock in netpoll_send_skb_on_dev() where
 > >  > RCU is also assumed?
 > >
 > > that does seem to solve the backtrace spew I saw too, so if that's
 > > preferable I can respin the patch.
 > 
 > 
 > >From my observations, netpoll_send_skb_on_dev() does not take
 > RCU read lock _and_ it relies on rcu read lock because it calls
 > rcu_dereference_bh().
 > 
 > If my observation is correct, you should catch a RCU warning like
 > this but within netpoll_send_skb_on_dev().
 >
 > >  > As I said, I can't explain why you didn't trigger the RCU warning in
 > >  > netpoll_send_skb_on_dev()...
 > >
 > > netpoll_send_skb_on_dev takes the rcu lock itself.
 > 
 > Could you please point me where exactly is the rcu lock here?
 > 
 > I am too stupid to see it. :)

No, I'm the stupid one. I looked at the tree I had just edited to try your
proposed change. 

Now that I've untangled myself, I'll repost with your suggested change.

Dave

[PATCH net v2] net/ncsi: Extend NC-SI Netlink interface to allow user space to send NC-SI command

2018-09-28 Thread Justin.Lee1

The new command (NCSI_CMD_SEND_CMD) is added to allow user space application 
to send NC-SI command to the network card.
Also, add a new attribute (NCSI_ATTR_DATA) for transferring request and 
response.

The work flow is as below. 

Request:
User space application -> Netlink interface (msg)
  -> new Netlink handler - 
ncsi_send_cmd_nl()
  -> ncsi_xmit_cmd()
Response:
Response received - ncsi_rcv_rsp() -> internal response handler - 
ncsi_rsp_handler_xxx()
-> 
ncsi_rsp_handler_netlink()
-> 
ncsi_send_netlink_rsp ()
-> 
Netlink interface (msg)
-> user 
space application
Command timeout - ncsi_request_timeout() -> ncsi_send_netlink_timeout ()

-> Netlink interface (msg with zero data length)

-> user space application
Error:
Error detected -> ncsi_send_netlink_err () -> Netlink interface (err msg)

   -> user space application


Signed-off-by: Justin Lee 

---
 include/uapi/linux/ncsi.h |   3 +
 net/ncsi/internal.h   |  12 ++-
 net/ncsi/ncsi-cmd.c   |  47 ++-
 net/ncsi/ncsi-manage.c|  22 +
 net/ncsi/ncsi-netlink.c   | 205 ++
 net/ncsi/ncsi-netlink.h   |  12 +++
 net/ncsi/ncsi-rsp.c   |  71 ++--
 7 files changed, 363 insertions(+), 9 deletions(-)

diff --git a/include/uapi/linux/ncsi.h b/include/uapi/linux/ncsi.h
index 4c292ec..4992bfc 100644
--- a/include/uapi/linux/ncsi.h
+++ b/include/uapi/linux/ncsi.h
@@ -30,6 +30,7 @@ enum ncsi_nl_commands {
NCSI_CMD_PKG_INFO,
NCSI_CMD_SET_INTERFACE,
NCSI_CMD_CLEAR_INTERFACE,
+   NCSI_CMD_SEND_CMD,
 
__NCSI_CMD_AFTER_LAST,
NCSI_CMD_MAX = __NCSI_CMD_AFTER_LAST - 1
@@ -43,6 +44,7 @@ enum ncsi_nl_commands {
  * @NCSI_ATTR_PACKAGE_LIST: nested array of NCSI_PKG_ATTR attributes
  * @NCSI_ATTR_PACKAGE_ID: package ID
  * @NCSI_ATTR_CHANNEL_ID: channel ID
+ * @NCSI_ATTR_DATA: command payload
  * @NCSI_ATTR_MAX: highest attribute number
  */
 enum ncsi_nl_attrs {
@@ -51,6 +53,7 @@ enum ncsi_nl_attrs {
NCSI_ATTR_PACKAGE_LIST,
NCSI_ATTR_PACKAGE_ID,
NCSI_ATTR_CHANNEL_ID,
+   NCSI_ATTR_DATA,
 
__NCSI_ATTR_AFTER_LAST,
NCSI_ATTR_MAX = __NCSI_ATTR_AFTER_LAST - 1
diff --git a/net/ncsi/internal.h b/net/ncsi/internal.h
index 8055e39..1a3ef9e 100644
--- a/net/ncsi/internal.h
+++ b/net/ncsi/internal.h
@@ -171,6 +171,8 @@ struct ncsi_package;
 #define NCSI_RESERVED_CHANNEL  0x1f
 #define NCSI_CHANNEL_INDEX(c)  ((c) & ((1 << NCSI_PACKAGE_SHIFT) - 1))
 #define NCSI_TO_CHANNEL(p, c)  (((p) << NCSI_PACKAGE_SHIFT) | (c))
+#define NCSI_MAX_PACKAGE   8
+#define NCSI_MAX_CHANNEL   32
 
 struct ncsi_channel {
unsigned char   id;
@@ -215,12 +217,17 @@ struct ncsi_request {
unsigned charid;  /* Request ID - 0 to 255   */
bool used;/* Request that has been assigned  */
unsigned int flags;   /* NCSI request property   */
-#define NCSI_REQ_FLAG_EVENT_DRIVEN 1
+#define NCSI_REQ_FLAG_EVENT_DRIVEN 1
+#define NCSI_REQ_FLAG_NETLINK_DRIVEN   2
struct ncsi_dev_priv *ndp;/* Associated NCSI device  */
struct sk_buff   *cmd;/* Associated NCSI command packet  */
struct sk_buff   *rsp;/* Associated NCSI response packet */
struct timer_listtimer;   /* Timer on waiting for response   */
bool enabled; /* Time has been enabled or not*/
+
+   u32  snd_seq; /* netlink sending sequence number */
+   u32  snd_portid;  /* netlink portid of sender*/
+   struct nlmsghdr  nlhdr;   /* netlink message header  */
 };
 
 enum {
@@ -305,6 +312,9 @@ struct ncsi_cmd_arg {
unsigned short words[8];
unsigned int   dwords[4];
};
+
+   unsigned char*data;   /* Netlink data  */
+   struct genl_info *info;   /* Netlink information   */
 };
 
 extern struct list_head ncsi_dev_list;
diff --git a/net/ncsi/ncsi-cmd.c b/net/ncsi/ncsi-cmd.c
index 7567ca63..43b544c 100644
--- a/net/ncsi/ncsi-cmd.c
+++ b/net/ncsi/ncsi-cmd.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 #include "ncsi-pkt.h"
@@ -211,6 +212,39 @@ static int

Re: [PATCH net 00/11] netpoll: second round of fixes.

2018-09-28 Thread David Miller

From: Eric Dumazet 
Date: Thu, 27 Sep 2018 09:31:50 -0700

> As diagnosed by Song Liu, ndo_poll_controller() can
> be very dangerous on loaded hosts, since the cpu
> calling ndo_poll_controller() might steal all NAPI
> contexts (for all RX/TX queues of the NIC).
> 
> This capture, showing one ksoftirqd eating all cycles
> can last for unlimited amount of time, since one
> cpu is generally not able to drain all the queues under load.
> 
> It seems that all networking drivers that do use NAPI
> for their TX completions, should not provide a ndo_poll_controller() :
> 
> Most NAPI drivers have netpoll support already handled
> in core networking stack, since netpoll_poll_dev()
> uses poll_napi(dev) to iterate through registered
> NAPI contexts for a device.
> 
> First patch is a fix in poll_one_napi().
> 
> Then following patches take care of ten drivers.

Series applied, thanks Eric.

Re: [net-next 0/8][pull request] 100GbE Intel Wired LAN Driver Updates 2018-09-27

2018-09-28 Thread David Miller

From: Jeff Kirsher 
Date: Thu, 27 Sep 2018 09:21:53 -0700

> This series contains fixes to the ice driver only.
 ...
> The following are changes since commit 
> 1042caa79e9351b81ed19dc8d2d7fd6ff51a4422:
>   net-ipv4: remove 2 always zero parameters from ipv4_redirect()
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 100GbE

Pulled, thanks Jeff.

Re: [Patch net-next v3] net_sched: change tcf_del_walker() to take idrinfo->lock

2018-09-28 Thread Ido Schimmel

On Fri, Sep 28, 2018 at 10:56:47AM -0700, Cong Wang wrote:
> On Fri, Sep 28, 2018 at 7:59 AM Ido Schimmel  wrote:
> >
> > On Wed, Sep 19, 2018 at 04:37:29PM -0700, Cong Wang wrote:
> > > From: Vlad Buslov 
> > >
> > > From: Vlad Buslov 
> > >
> > > Action API was changed to work with actions and action_idr in concurrency
> > > safe manner, however tcf_del_walker() still uses actions without taking a
> > > reference or idrinfo->lock first, and deletes them directly, disregarding
> > > possible concurrent delete.
> > >
> > > Change tcf_del_walker() to take idrinfo->lock while iterating over actions
> > > and use new tcf_idr_release_unsafe() to release them while holding the
> > > lock.
> > >
> > > And the blocking function fl_hw_destroy_tmplt() could be called when we
> > > put a filter chain, so defer it to a work queue.
> >
> > I'm getting a use-after-free when running tc_chains.sh selftest and I
> > believe it's caused by this patch.
> >
> > To reproduce:
> > # cd tools/testing/selftests/net/forwarding
> > # export TESTS="template_filter_fits"; ./tc_chains.sh veth0 veth1
> >
> > __tcf_chain_put()
> > tc_chain_tmplt_del()
> > fl_tmplt_destroy()
> > tcf_queue_work(>rwork, fl_tmplt_destroy_work)
> > tcf_chain_destroy()
> > kfree(chain)
> >
> > Some time later fl_tmplt_destroy_work() starts executing and
> > dereferencing 'chain'.
> 
> Oops, forgot to hold the chain... I will test this:
> 
> diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
> index 92dd5071a708..cbb68d5515d6 100644
> --- a/net/sched/cls_flower.c
> +++ b/net/sched/cls_flower.c
> @@ -1444,6 +1444,7 @@ static void fl_tmplt_destroy_work(struct
> work_struct *work)
>  struct fl_flow_tmplt, rwork);
> 
> fl_hw_destroy_tmplt(tmplt->chain, tmplt);
> +   tcf_chain_put(tmplt->chain);
> kfree(tmplt);
>  }
> 
> @@ -1451,6 +1452,7 @@ static void fl_tmplt_destroy(void *tmplt_priv)
>  {
> struct fl_flow_tmplt *tmplt = tmplt_priv;
> 
> +   tcf_chain_hold(tmplt->chain);
> tcf_queue_work(>rwork, fl_tmplt_destroy_work);
>  }

I don't think this will work given the reference count already dropped
to 0, which is why the template deletion function was invoked. I didn't
test the patch, but I don't see what would prevent the chain from being
freed.

Thanks for looking into this.

Re: [PATCH net-next] net: sched: make function qdisc_free_cb() static

2018-09-28 Thread David Miller

From: Wei Yongjun 
Date: Thu, 27 Sep 2018 14:47:56 +

> Fixes the following sparse warning:
> 
> net/sched/sch_generic.c:944:6: warning:
>  symbol 'qdisc_free_cb' was not declared. Should it be static?
> 
> Fixes: 3a7d0d07a386 ("net: sched: extend Qdisc with rcu")
> Signed-off-by: Wei Yongjun 

Applied.

Re: [PATCH net-next] geneve: fix ttl inherit type

2018-09-28 Thread David Ahern

On 9/27/18 7:09 PM, Hangbin Liu wrote:
> Phil pointed out that there is a mismatch between vxlan and geneve ttl
> inherit. We should define it as a flag and use nla_put_flag to export this
> opiton.
> 
> Fixes: 52d0d404d39dd ("geneve: add ttl inherit support")

same here .. getting an unknown commit id.


> Reported-by: Phil Sutter 
> Signed-off-by: Hangbin Liu 
> ---
>  drivers/net/geneve.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
> index 6625fab..09ab2fd 100644
> --- a/drivers/net/geneve.c
> +++ b/drivers/net/geneve.c
> @@ -1100,7 +1100,7 @@ static const struct nla_policy 
> geneve_policy[IFLA_GENEVE_MAX + 1] = {
>   [IFLA_GENEVE_UDP_CSUM]  = { .type = NLA_U8 },
>   [IFLA_GENEVE_UDP_ZERO_CSUM6_TX] = { .type = NLA_U8 },
>   [IFLA_GENEVE_UDP_ZERO_CSUM6_RX] = { .type = NLA_U8 },
> - [IFLA_GENEVE_TTL_INHERIT]   = { .type = NLA_U8 },
> + [IFLA_GENEVE_TTL_INHERIT]   = { .type = NLA_FLAG },
>  };
>  
>  static int geneve_validate(struct nlattr *tb[], struct nlattr *data[],
> @@ -1582,7 +1582,7 @@ static size_t geneve_get_size(const struct net_device 
> *dev)
>   nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_UDP_CSUM */
>   nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_UDP_ZERO_CSUM6_TX 
> */
>   nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_UDP_ZERO_CSUM6_RX 
> */
> - nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_TTL_INHERIT */
> + nla_total_size(0) + /* IFLA_GENEVE_TTL_INHERIT */
>   0;
>  }
>  
> @@ -1636,7 +1636,7 @@ static int geneve_fill_info(struct sk_buff *skb, const 
> struct net_device *dev)
>   goto nla_put_failure;
>  #endif
>  
> - if (nla_put_u8(skb, IFLA_GENEVE_TTL_INHERIT, ttl_inherit))
> + if (ttl_inherit && nla_put_flag(skb, IFLA_GENEVE_TTL_INHERIT))
>   goto nla_put_failure;
>  
>   return 0;
>

Re: [PATCH net] vxlan: use nla_put_flag for ttl inherit

2018-09-28 Thread David Ahern

On 9/27/18 7:08 PM, Hangbin Liu wrote:
> Phil pointed out that there is a mismatch between vxlan and geneve ttl 
> inherit.
> We should define it as a flag and use nla_put_flag to export this opiton.
> 
> Fixes: 8fd780698745b ("vxlan: fill ttl inherit info")

Wrong Fixes tag:

kenny:mgmt:iproute2-next.git$ git show 8fd780698745b
fatal: ambiguous argument '8fd780698745b': unknown revision or path not
in the working tree.
Use '--' to separate paths from revisions, like this:
'git  [...] -- [...]'

> Reported-by: Phil Sutter 
> Signed-off-by: Hangbin Liu 
> ---
>  drivers/net/vxlan.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> index 2b8da2b..479dda4 100644
> --- a/drivers/net/vxlan.c
> +++ b/drivers/net/vxlan.c
> @@ -3539,7 +3539,7 @@ static size_t vxlan_get_size(const struct net_device 
> *dev)
>   nla_total_size(sizeof(__u32)) + /* IFLA_VXLAN_LINK */
>   nla_total_size(sizeof(struct in6_addr)) + /* 
> IFLA_VXLAN_LOCAL{6} */
>   nla_total_size(sizeof(__u8)) +  /* IFLA_VXLAN_TTL */
> - nla_total_size(sizeof(__u8)) +  /* IFLA_VXLAN_TTL_INHERIT */
> + nla_total_size(0) + /* IFLA_VXLAN_TTL_INHERIT */
>   nla_total_size(sizeof(__u8)) +  /* IFLA_VXLAN_TOS */
>   nla_total_size(sizeof(__be32)) + /* IFLA_VXLAN_LABEL */
>   nla_total_size(sizeof(__u8)) +  /* IFLA_VXLAN_LEARNING */
> @@ -3604,8 +3604,6 @@ static int vxlan_fill_info(struct sk_buff *skb, const 
> struct net_device *dev)
>   }
>  
>   if (nla_put_u8(skb, IFLA_VXLAN_TTL, vxlan->cfg.ttl) ||
> - nla_put_u8(skb, IFLA_VXLAN_TTL_INHERIT,
> -!!(vxlan->cfg.flags & VXLAN_F_TTL_INHERIT)) ||
>   nla_put_u8(skb, IFLA_VXLAN_TOS, vxlan->cfg.tos) ||
>   nla_put_be32(skb, IFLA_VXLAN_LABEL, vxlan->cfg.label) ||
>   nla_put_u8(skb, IFLA_VXLAN_LEARNING,
> @@ -3650,6 +3648,10 @@ static int vxlan_fill_info(struct sk_buff *skb, const 
> struct net_device *dev)
>   nla_put_flag(skb, IFLA_VXLAN_REMCSUM_NOPARTIAL))
>   goto nla_put_failure;
>  
> + if (vxlan->cfg.flags & VXLAN_F_TTL_INHERIT &&
> + nla_put_flag(skb, IFLA_VXLAN_TTL_INHERIT))
> + goto nla_put_failure;
> +
>   return 0;
>  
>  nla_put_failure:
>

Re: [Patch net-next v3] net_sched: change tcf_del_walker() to take idrinfo->lock

2018-09-28 Thread Cong Wang

On Fri, Sep 28, 2018 at 7:59 AM Ido Schimmel  wrote:
>
> On Wed, Sep 19, 2018 at 04:37:29PM -0700, Cong Wang wrote:
> > From: Vlad Buslov 
> >
> > From: Vlad Buslov 
> >
> > Action API was changed to work with actions and action_idr in concurrency
> > safe manner, however tcf_del_walker() still uses actions without taking a
> > reference or idrinfo->lock first, and deletes them directly, disregarding
> > possible concurrent delete.
> >
> > Change tcf_del_walker() to take idrinfo->lock while iterating over actions
> > and use new tcf_idr_release_unsafe() to release them while holding the
> > lock.
> >
> > And the blocking function fl_hw_destroy_tmplt() could be called when we
> > put a filter chain, so defer it to a work queue.
>
> I'm getting a use-after-free when running tc_chains.sh selftest and I
> believe it's caused by this patch.
>
> To reproduce:
> # cd tools/testing/selftests/net/forwarding
> # export TESTS="template_filter_fits"; ./tc_chains.sh veth0 veth1
>
> __tcf_chain_put()
> tc_chain_tmplt_del()
> fl_tmplt_destroy()
> tcf_queue_work(>rwork, fl_tmplt_destroy_work)
> tcf_chain_destroy()
> kfree(chain)
>
> Some time later fl_tmplt_destroy_work() starts executing and
> dereferencing 'chain'.

Oops, forgot to hold the chain... I will test this:

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 92dd5071a708..cbb68d5515d6 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -1444,6 +1444,7 @@ static void fl_tmplt_destroy_work(struct
work_struct *work)
 struct fl_flow_tmplt, rwork);

fl_hw_destroy_tmplt(tmplt->chain, tmplt);
+   tcf_chain_put(tmplt->chain);
kfree(tmplt);
 }

@@ -1451,6 +1452,7 @@ static void fl_tmplt_destroy(void *tmplt_priv)
 {
struct fl_flow_tmplt *tmplt = tmplt_priv;

+   tcf_chain_hold(tmplt->chain);
tcf_queue_work(>rwork, fl_tmplt_destroy_work);
 }

Re: [PATCH net] vxlan: use nla_put_flag for ttl inherit

2018-09-28 Thread David Ahern

On 9/28/18 6:38 AM, Hangbin Liu wrote:
> On Fri, Sep 28, 2018 at 12:37:00PM +0200, Phil Sutter wrote:
>> On Fri, Sep 28, 2018 at 09:08:26AM +0800, Hangbin Liu wrote:
>>> Phil pointed out that there is a mismatch between vxlan and geneve ttl 
>>> inherit.
>>> We should define it as a flag and use nla_put_flag to export this opiton.
>>
>> s/opiton/option/
>>
>> Apart from that, LGTM!
>>
>> Thanks, Phil
> 
> Opps, sorry...
> 
> Hi David,
> 
> Should I re-send a patch or will you help fix it directly?
> 
> Thanks
> Hangbin
> 

you have this targeted at net; is it a bug in current iproute2 or an
update to a new feature in -next?

[PATCH] Revert "openvswitch: Fix template leak in error cases."

2018-09-28 Thread Flavio Leitner

This reverts commit 90c7afc96cbbd77f44094b5b651261968e97de67.

When the commit was merged, the code used nf_ct_put() to free
the entry, but later on commit 76644232e612 ("openvswitch: Free
tmpl with tmpl_free.") replaced that with nf_ct_tmpl_free which
is a more appropriate. Now the original problem is removed.

Then 44d6e2f27328 ("net: Replace NF_CT_ASSERT() with WARN_ON().")
replaced a debug assert with a WARN_ON() which is trigged now.

Signed-off-by: Flavio Leitner 
---
 net/openvswitch/conntrack.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 86a75105af1a..0aeb34c6389d 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -1624,10 +1624,6 @@ int ovs_ct_copy_action(struct net *net, const struct 
nlattr *attr,
OVS_NLERR(log, "Failed to allocate conntrack template");
return -ENOMEM;
}
-
-   __set_bit(IPS_CONFIRMED_BIT, _info.ct->status);
-   nf_conntrack_get(_info.ct->ct_general);
-
if (helper) {
err = ovs_ct_add_helper(_info, helper, key, log);
if (err)
@@ -1639,6 +1635,8 @@ int ovs_ct_copy_action(struct net *net, const struct 
nlattr *attr,
if (err)
goto err_free_ct;
 
+   __set_bit(IPS_CONFIRMED_BIT, _info.ct->status);
+   nf_conntrack_get(_info.ct->ct_general);
return 0;
 err_free_ct:
__ovs_ct_free_action(_info);
-- 
2.14.4

Re: [PATCH iproute2 net-next] bridge: fdb: add support for sticky flag

2018-09-28 Thread David Ahern

On 9/27/18 7:35 AM, Nikolay Aleksandrov wrote:
> Add support for the new sticky flag that can be set on fdbs and update the
> man page.
> 
> CC: David Ahern 
> Signed-off-by: Nikolay Aleksandrov 
> ---
>  bridge/fdb.c  | 9 +++--
>  man/man8/bridge.8 | 6 +-
>  2 files changed, 12 insertions(+), 3 deletions(-)
> 

applied to iproute2-next. Thanks

Re: [PATCH 04/11] net: ip6_multipath_l3_keys() - use new style struct initializer instead of memset

2018-09-28 Thread David Ahern

On 9/27/18 5:00 PM, Maciej Żenczykowski wrote:
> From: Maciej Żenczykowski 
> 
> Signed-off-by: Maciej Żenczykowski 
> ---
>  net/ipv6/route.c | 6 +-
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index d28f83e01593..9cb024451fc5 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1981,12 +1981,11 @@ static void ip6_multipath_l3_keys(const struct 
> sk_buff *skb,
>  u32 rt6_multipath_hash(const struct net *net, const struct flowi6 *fl6,
>  const struct sk_buff *skb, struct flow_keys *flkeys)
>  {
> - struct flow_keys hash_keys;
> + struct flow_keys hash_keys = {};
>   u32 mhash;
>  
>   switch (ip6_multipath_hash_policy(net)) {
>   case 0:
> - memset(_keys, 0, sizeof(hash_keys));
>   hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;
>   if (skb) {
>   ip6_multipath_l3_keys(skb, _keys, flkeys);
> @@ -2006,8 +2005,6 @@ u32 rt6_multipath_hash(const struct net *net, const 
> struct flowi6 *fl6,
>   if (skb->l4_hash)
>   return skb_get_hash_raw(skb) >> 1;
>  
> - memset(_keys, 0, sizeof(hash_keys));
> -
>  if (!flkeys) {
>   skb_flow_dissect_flow_keys(skb, , flag);
>   flkeys = 
> @@ -2019,7 +2016,6 @@ u32 rt6_multipath_hash(const struct net *net, const 
> struct flowi6 *fl6,
>   hash_keys.ports.dst = flkeys->ports.dst;
>   hash_keys.basic.ip_proto = flkeys->basic.ip_proto;
>   } else {
> - memset(_keys, 0, sizeof(hash_keys));
>   hash_keys.control.addr_type = 
> FLOW_DISSECTOR_KEY_IPV6_ADDRS;
>   hash_keys.addrs.v6addrs.src = fl6->saddr;
>   hash_keys.addrs.v6addrs.dst = fl6->daddr;
> 

ditto for this one.

Re: [PATCH 03/11] net: fib_multipath_hash() - use new style struct initializer instead of memset

2018-09-28 Thread David Ahern

On 9/27/18 5:00 PM, Maciej Żenczykowski wrote:
> From: Maciej Żenczykowski 
> 
> Signed-off-by: Maciej Żenczykowski 
> ---
>  net/ipv4/route.c | 6 +-
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 048919713f4e..17953a52fbd0 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1821,12 +1821,11 @@ static void ip_multipath_l3_keys(const struct sk_buff 
> *skb,
>  int fib_multipath_hash(const struct net *net, const struct flowi4 *fl4,
>  const struct sk_buff *skb, struct flow_keys *flkeys)
>  {
> - struct flow_keys hash_keys;
> + struct flow_keys hash_keys = {};
>   u32 mhash;
>  
>   switch (net->ipv4.sysctl_fib_multipath_hash_policy) {
>   case 0:
> - memset(_keys, 0, sizeof(hash_keys));
>   hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
>   if (skb) {
>   ip_multipath_l3_keys(skb, _keys);
> @@ -1845,8 +1844,6 @@ int fib_multipath_hash(const struct net *net, const 
> struct flowi4 *fl4,
>   if (skb->l4_hash)
>   return skb_get_hash_raw(skb) >> 1;
>  
> - memset(_keys, 0, sizeof(hash_keys));
> -
>   if (!flkeys) {
>   skb_flow_dissect_flow_keys(skb, , flag);
>   flkeys = 
> @@ -1859,7 +1856,6 @@ int fib_multipath_hash(const struct net *net, const 
> struct flowi4 *fl4,
>   hash_keys.ports.dst = flkeys->ports.dst;
>   hash_keys.basic.ip_proto = flkeys->basic.ip_proto;
>   } else {
> - memset(_keys, 0, sizeof(hash_keys));
>   hash_keys.control.addr_type = 
> FLOW_DISSECTOR_KEY_IPV4_ADDRS;
>   hash_keys.addrs.v4addrs.src = fl4->saddr;
>   hash_keys.addrs.v4addrs.dst = fl4->daddr;
> 

NACK on this one.

This is the hot path and the memset was done right before use for least
overhead.

Re: [PATCH net-next] selftests: forwarding: test for bridge sticky flag

2018-09-28 Thread David Miller

From: Nikolay Aleksandrov 
Date: Thu, 27 Sep 2018 16:35:13 +0300

> This test adds an fdb entry with the sticky flag and sends traffic from
> a different port with the same mac as a source address expecting the entry
> to not change ports if the flag is operating correctly.
> 
> Signed-off-by: Nikolay Aleksandrov 

Applied, thanks.

Re: [PATCH net 1/1] qed: Fix shmem structure inconsistency between driver and the mfw.

2018-09-28 Thread David Miller

From: Sudarsana Reddy Kalluru 
Date: Thu, 27 Sep 2018 04:12:10 -0700

> The structure shared between driver and the management FW (mfw) differ in
> sizes. This would lead to issues when driver try to access the structure
> members which are not-aligned with the mfw copy e.g., data_ptr usage in the
> case of mfw_tlv request.
> Align the driver structure with mfw copy, add reserved field(s) to driver
> structure for the members not used by the driver.
> 
> Fixes: dd006921d ("qed: Add MFW interfaces for TLV request support.)
> Signed-off-by: Sudarsana Reddy Kalluru 
> Signed-off-by: Michal Kalderon 

Applied and queued up for -stable.

In the future please use 12 digits of SHA1_ID for Fixes tags.  I fixed
it up for you this time.

Thanks.

Re: [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order

2018-09-28 Thread Pravin Shelar

On Wed, Sep 26, 2018 at 2:58 AM Stefano Brivio  wrote:
>
> Hi Pravin,
>
> On Wed, 15 Aug 2018 00:19:39 -0700
> Pravin Shelar  wrote:
>
> > I understand fairness has cost, but we need to find right balance
> > between performance and fairness. Current fairness scheme is a
> > lockless algorithm without much computational overhead, did you try to
> > improve current algorithm so that it uses less number of ports.
>
> In the end, we tried harder as you suggested, and found out a way to
> avoid the need for per-thread sets of per-vport sockets: with a few
> changes, a process-wide set of per-vport sockets is actually enough to
> maintain the current fairness behaviour.
>
> Further, we can get rid of duplicate fd events if the EPOLLEXCLUSIVE
> epoll() flag is available, which improves performance noticeably. This
> is explained in more detail in the commit message of the ovs-vswitchd
> patch Matteo wrote [1], now merged.
>
> This approach solves the biggest issue (i.e. huge amount of netlink
> sockets) in ovs-vswitchd itself, without the need for kernel changes.
>

Thanks for working on this issue.

> It doesn't address some proposed improvements though, that is, it does
> nothing to improve the current fairness scheme, it won't allow neither
> the configurable fairness criteria proposed by Ben, nor usage of BPF
> maps for extensibility as suggested by William.
>
> I would however say that those topics bear definitely lower priority,
> and perhaps addressing them at all becomes overkill now that we don't
> any longer have a serious issue.
>
> [1] https://patchwork.ozlabs.org/patch/974304/
Nice!

Thanks,
Pravin.

Re: [PATCH 1/1] Update maintainers for bnx2/bnx2x/qlge/qlcnic drivers.

2018-09-28 Thread David Miller

From: Sudarsana Reddy Kalluru 
Date: Wed, 26 Sep 2018 21:57:03 -0700

> Signed-off-by: Sudarsana Reddy Kalluru 
> Signed-off-by: Ameen Rahman 

Applied, thank you.

Re: [PATCH] [PATCH net-next] openvswitch: Use correct reply values in datapath and vport ops

2018-09-28 Thread Pravin Shelar

On Wed, Sep 26, 2018 at 11:40 AM Yifeng Sun  wrote:
>
> This patch fixes the bug that all datapath and vport ops are returning
> wrong values (OVS_FLOW_CMD_NEW or OVS_DP_CMD_NEW) in their replies.
>
> Signed-off-by: Yifeng Sun 
I am surprised this was not found earlier.

Acked-by: Pravin B Shelar 

Thanks.

Re: bond: take rcu lock in bond_poll_controller

2018-09-28 Thread Cong Wang

On Fri, Sep 28, 2018 at 10:25 AM Dave Jones  wrote:
>
> On Fri, Sep 28, 2018 at 09:55:52AM -0700, Cong Wang wrote:
>  > On Fri, Sep 28, 2018 at 9:18 AM Dave Jones  wrote:
>  > >
>  > > Callers of bond_for_each_slave_rcu are expected to hold the rcu lock,
>  > > otherwise a trace like below is shown
>  >
>  > So why not take rcu read lock in netpoll_send_skb_on_dev() where
>  > RCU is also assumed?
>
> that does seem to solve the backtrace spew I saw too, so if that's
> preferable I can respin the patch.


>From my observations, netpoll_send_skb_on_dev() does not take
RCU read lock _and_ it relies on rcu read lock because it calls
rcu_dereference_bh().

If my observation is correct, you should catch a RCU warning like
this but within netpoll_send_skb_on_dev().


>
>  > As I said, I can't explain why you didn't trigger the RCU warning in
>  > netpoll_send_skb_on_dev()...
>
> netpoll_send_skb_on_dev takes the rcu lock itself.

Could you please point me where exactly is the rcu lock here?

I am too stupid to see it. :)

Re: [PATCH ethtool] ethtool: support combinations of FEC modes

2018-09-28 Thread Edward Cree

On 28/09/18 17:45, Andrew Lunn wrote:
> Now is a good time to change the API, since we are moving to a netlink
> socket. Which is why these questions were asked in the first place...
OK, well, I've posted sfc's semantics and view-from-the-hardware*; now
 patiently waiting for other NIC vendors to chime in so we can try to
 converge on something consistent.
Then again, since they've been CCed since the original patch three weeks
 ago, we might be waiting a while :-(

Regarding Ariel Almog's suggested semantics, it seems like they have the
 'auto' bit just encoding 'more than one non-auto bit', which is
 redundant (i.e. off|rs is always off|rs|auto, whereas rs is never
 rs|auto).  I don't see how that would be useful.

-Ed

* One complication I left out: we actually have _three_ pairs of sup/req
  bits, because we separate 'BaseR for 10G/40G/100G' from 'BaseR for
  25G/50G'.  I don't know the details of why our HW does this (or why
  100G isn't lumped in with the other 25ers) but I think it has to do
  with Horrific Ethernet Spec Arcana Man Was Not Meant To Know™.

Re: [PATCH] MAINTAINERS: change bridge maintainers

2018-09-28 Thread David Miller

From: Stephen Hemminger 
Date: Thu, 27 Sep 2018 10:47:01 +0200

> I haven't been doing reviews only but not active development on bridge
> code for several years. Roopa and Nikolay have been doing most of
> the new features and have agreed to take over as new co-maintainers.
> 
> Signed-off-by: Stephen Hemminger 

Thanks for all of the years of watching over the bridge code Stephen.

Applied.

[PATCH net-next] tcp/fq: move back to CLOCK_MONOTONIC

2018-09-28 Thread Eric Dumazet

In the recent TCP/EDT patch series, I switched TCP and sch_fq
clocks from MONOTONIC to TAI, in order to meet the choice done
earlier for sch_etf packet scheduler.

But sure enough, this broke some setups were the TAI clock
jumps forward (by almost 50 year...), as reported
by Leonard Crestez.

If we want to converge later, we'll probably need to add
an skb field to differentiate the clock bases, or a socket option.

In the meantime, an UDP application will need to use CLOCK_MONOTONIC
base for its SCM_TXTIME timestamps if using fq packet scheduler.

Fixes: 72b0094f9182 ("tcp: switch tcp_clock_ns() to CLOCK_TAI base")
Fixes: 142537e41923 ("net_sched: sch_fq: switch to CLOCK_TAI")
Fixes: fd2bca2aa789 ("tcp: switch internal pacing timer to CLOCK_TAI")
Signed-off-by: Eric Dumazet 
Reported-by: Leonard Crestez 
---
 include/net/tcp.h| 2 +-
 net/ipv4/tcp_timer.c | 2 +-
 net/sched/sch_fq.c   | 6 +++---
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 
ff15d8e0d525715b17671e64f6abdead9df0a8f3..0d2929223c703c0aaabef0d485aabf7dc707aae1
 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -732,7 +732,7 @@ void tcp_send_window_probe(struct sock *sk);
 
 static inline u64 tcp_clock_ns(void)
 {
-   return ktime_get_tai_ns();
+   return ktime_get_ns();
 }
 
 static inline u64 tcp_clock_us(void)
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 
4f661e178da8465203266ff4dfa3e8743e60ff82..61023d50cd604d5e19464a32c33b65d29c75c81e
 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -758,7 +758,7 @@ void tcp_init_xmit_timers(struct sock *sk)
 {
inet_csk_init_xmit_timers(sk, _write_timer, _delack_timer,
  _keepalive_timer);
-   hrtimer_init(_sk(sk)->pacing_timer, CLOCK_TAI,
+   hrtimer_init(_sk(sk)->pacing_timer, CLOCK_MONOTONIC,
 HRTIMER_MODE_ABS_PINNED_SOFT);
tcp_sk(sk)->pacing_timer.function = tcp_pace_kick;
 
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 
628a2cdcfc6f2fa69d9402f06881949d2e1423d9..338222a6c664b1825aaada4355e2fc0a01db9c73
 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -412,7 +412,7 @@ static void fq_check_throttled(struct fq_sched_data *q, u64 
now)
 static struct sk_buff *fq_dequeue(struct Qdisc *sch)
 {
struct fq_sched_data *q = qdisc_priv(sch);
-   u64 now = ktime_get_tai_ns();
+   u64 now = ktime_get_ns();
struct fq_flow_head *head;
struct sk_buff *skb;
struct fq_flow *f;
@@ -776,7 +776,7 @@ static int fq_init(struct Qdisc *sch, struct nlattr *opt,
q->fq_trees_log = ilog2(1024);
q->orphan_mask  = 1024 - 1;
q->low_rate_threshold   = 55 / 8;
-   qdisc_watchdog_init_clockid(>watchdog, sch, CLOCK_TAI);
+   qdisc_watchdog_init_clockid(>watchdog, sch, CLOCK_MONOTONIC);
 
if (opt)
err = fq_change(sch, opt, extack);
@@ -831,7 +831,7 @@ static int fq_dump_stats(struct Qdisc *sch, struct 
gnet_dump *d)
st.flows_plimit   = q->stat_flows_plimit;
st.pkts_too_long  = q->stat_pkts_too_long;
st.allocation_errors  = q->stat_allocation_errors;
-   st.time_next_delayed_flow = q->time_next_delayed_flow - 
ktime_get_tai_ns();
+   st.time_next_delayed_flow = q->time_next_delayed_flow - ktime_get_ns();
st.flows  = q->flows;
st.inactive_flows = q->inactive_flows;
st.throttled_flows= q->throttled_flows;
-- 
2.19.0.605.g01d371f741-goog

Re: [PATCH v3 0/5] netlink: nested policy validation

2018-09-28 Thread David Miller

From: Johannes Berg 
Date: Thu, 27 Sep 2018 10:22:42 +0200

> On Wed, 2018-09-26 at 10:21 -0700, David Miller wrote:
>> From: Johannes Berg 
>> Date: Wed, 26 Sep 2018 11:15:29 +0200
>> 
>> > This adds nested policy validation, which lets you specify the
>> > nested attribute type, e.g. NLA_NESTED with sub-policy, or the
>> > new NLA_NESTED_ARRAY with sub-sub-policy.
>> > 
>> > 
>> > Changes in v2:
>> >  * move setting the bad attr pointer/message into validate_nla()
>> >  * remove the recursion patch since that's no longer needed
>> >  * simply skip the generic bad attr pointer/message setting in
>> >case of nested nla_validate() failing since that could fail
>> >only due to validate_nla() failing inside, which already sets
>> >the extack information
>> > 
>> > Changes in v3:
>> >  * fix NLA_REJECT to have an error message if none is in policy
>> 
>> Looks great Johannes, series applied.
> 
> Sorry to nag, but I see patches that you replied to later than this in
> the tree, but not this.
> 
> Or did you see something wrong with this later and dropped it?

Ugh, the perils of working on multiple machines :-/

This should be fixed now and your netlink changes pushed out to net-next.

Sorry about that.

Re: bond: take rcu lock in bond_poll_controller

2018-09-28 Thread Dave Jones

On Fri, Sep 28, 2018 at 09:55:52AM -0700, Cong Wang wrote:
 > On Fri, Sep 28, 2018 at 9:18 AM Dave Jones  wrote:
 > >
 > > Callers of bond_for_each_slave_rcu are expected to hold the rcu lock,
 > > otherwise a trace like below is shown
 > 
 > So why not take rcu read lock in netpoll_send_skb_on_dev() where
 > RCU is also assumed?

that does seem to solve the backtrace spew I saw too, so if that's
preferable I can respin the patch.

 > As I said, I can't explain why you didn't trigger the RCU warning in
 > netpoll_send_skb_on_dev()...

netpoll_send_skb_on_dev takes the rcu lock itself.

Dave

rcu_read_lock() in tcf_block_find()

2018-09-28 Thread Cong Wang

(Changing $subject as the discussion is going to a completely different topic)

On Thu, Sep 27, 2018 at 3:19 PM Eric Dumazet  wrote:
>
>
>
> On 09/27/2018 02:36 PM, Cong Wang wrote:
>
> > I don't understand what you mean by changing ip command, you must
> > mean tc command, but still, I have no idea about how restarting failed
> > syscall could be related to my patch and why we need to restart anything
> > here. If the refcnt goes to 0, it will never come back, retrying won't help
> > anything.
> >
>
> Yep, tc command it is.
>
> I was not especially commenting your patch (replacing an english message by 
> another does
> not seem very big deal), but the fact that the code right there seems to be 
> prepared
> for parallel changes.
>
> But using RCU lookups in control path will lead to occasional failures
> that most user space tools would not expect.
>

I already discussed this with Vlad in the beginning of his RTNL
removal patches, we both agreed some lock is still needed, it is not
completely lockless. Take a look at tc action code now, two spinlocks
are still needed even after we will remove the RTNL there.

> Lets assume two tasks are launching "tc qdisc replace dev eth0 root XXX" in 
> whatever order/parallelism.
>
> Both should succeed, after/before major RTNL->other_locking_mechanism

Yes, it is never going to be completely lockless.

>
> Control paths are usually using a mutex or a spinlock so that they never hit 
> a 0-refcount at all.

For dev->qdisc, sure, we should already hold a refcnt, it can't be gone.

For qdisc_lookup_rcu(), it could be that refcnt goes to 0 before we
remove it from hashtable, right? call_rcu() is only called after
refcnt==0, so rcu read lock can't help here.

Thanks.

Re: [PATCH net-next] qed: Remove set but not used variable 'p_archipelago'

2018-09-28 Thread David Miller

From: YueHaibing 
Date: Thu, 27 Sep 2018 06:45:06 +

> Fixes gcc '-Wunused-but-set-variable' warning:
> 
> drivers/net/ethernet/qlogic/qed/qed_ooo.c: In function 'qed_ooo_delete_isles':
> drivers/net/ethernet/qlogic/qed/qed_ooo.c:354:30: warning:
>  variable 'p_archipelago' set but not used [-Wunused-but-set-variable]
>  
> drivers/net/ethernet/qlogic/qed/qed_ooo.c: In function 'qed_ooo_join_isles':
> drivers/net/ethernet/qlogic/qed/qed_ooo.c:463:30: warning:
>  variable 'p_archipelago' set but not used [-Wunused-but-set-variable]
> 
> Since commit 1eec2437d14c ("qed: Make OOO archipelagos into an array"),
> 'p_archipelago' is no longer in use.
> 
> Signed-off-by: YueHaibing 

Applied, thanks.

[PATCH iproute2] lib/libnetlink: fix response seq check

2018-09-28 Thread vlad

From: Vlad Dumitrescu 

Taking a one-iovec example, with rtnl->seq at 42. iovlen == 1, seq
becomes 43 on line 604, and a message is sent with nlmsg_seq == 43. If
a response with nlmsg_seq of 42 is received, the condition being fixed
in this patch would incorrectly accept it.

Fixes: 72a2ff3916e5 ("lib/libnetlink: Add a new function rtnl_talk_iov")
Signed-off-by: Vlad Dumitrescu 
---
 lib/libnetlink.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/libnetlink.c b/lib/libnetlink.c
index f18dceac..4d2416bf 100644
--- a/lib/libnetlink.c
+++ b/lib/libnetlink.c
@@ -647,7 +647,7 @@ static int __rtnl_talk_iov(struct rtnl_handle *rtnl, struct 
iovec *iov,
 
if (nladdr.nl_pid != 0 ||
h->nlmsg_pid != rtnl->local.nl_pid ||
-   h->nlmsg_seq > seq || h->nlmsg_seq < seq - iovlen) {
+   h->nlmsg_seq > seq || h->nlmsg_seq < seq - iovlen + 
1) {
/* Don't forget to skip that message. */
status -= NLMSG_ALIGN(len);
h = (struct nlmsghdr *)((char *)h + 
NLMSG_ALIGN(len));
-- 
2.19.0.605.g01d371f741-goog

Re: [Patch net-next] net_sched: fix an extack message in tcf_block_find()

2018-09-28 Thread Cong Wang

On Thu, Sep 27, 2018 at 1:42 PM Cong Wang  wrote:
>
> It is clearly a copy-n-paste.
>
> Signed-off-by: Cong Wang 

I regret again for wasting my time, so:

NAcked-by: Cong Wang 

I really don't care, do you?

Re: [Patch net-next] net_sched: fix an extack message in tcf_block_find()

2018-09-28 Thread Cong Wang

On Fri, Sep 28, 2018 at 4:36 AM Vlad Buslov  wrote:
>
> On Thu 27 Sep 2018 at 20:42, Cong Wang  wrote:
> > It is clearly a copy-n-paste.
> >
> > Signed-off-by: Cong Wang 
> > ---
> >  net/sched/cls_api.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> > index 3de47e99b788..8dd7f8af6d54 100644
> > --- a/net/sched/cls_api.c
> > +++ b/net/sched/cls_api.c
> > @@ -655,7 +655,7 @@ static struct tcf_block *tcf_block_find(struct net 
> > *net, struct Qdisc **q,
> >
> >   *q = qdisc_refcount_inc_nz(*q);
> >   if (!*q) {
> > - NL_SET_ERR_MSG(extack, "Parent Qdisc doesn't exists");
> > + NL_SET_ERR_MSG(extack, "Can't increase Qdisc 
> > refcount");
> >   err = -EINVAL;
> >   goto errout_rcu;
> >   }
>
> Is there a benefit in exposing this info to user?

Depends on what user you mean here. For kernel developers, yes,
this is useful. For others, no.

> For all intents and purposes Qdisc is gone at that point.

I don't want to be a language lawyer, but there is a difference between
"it doesn't exist" and "it exists but being removed". The errno EINVAL
betrays what you said too, it must be ENOENT to mach "Qdisc is gone".

I don't want to waste my time on this any more. Let's just drop it.

I really don't care, do you?

Re: [PATCH iproute2 net-next] ipneigh: support setting of NTF_ROUTER on neigh entries

2018-09-28 Thread David Ahern

On 9/25/18 3:15 PM, Roopa Prabhu wrote:
> From: Roopa Prabhu 
> 
> Signed-off-by: Roopa Prabhu 
> ---
>  ip/ipneigh.c | 2 ++
>  1 file changed, 2 insertions(+)
> 

applied to iproute2-next.

And then I noticed you did not update the help or the man page. Please
send a follow up.

Re: bond: take rcu lock in bond_poll_controller

2018-09-28 Thread Cong Wang

On Fri, Sep 28, 2018 at 9:18 AM Dave Jones  wrote:
>
> Callers of bond_for_each_slave_rcu are expected to hold the rcu lock,
> otherwise a trace like below is shown

So why not take rcu read lock in netpoll_send_skb_on_dev() where
RCU is also assumed?

As I said, I can't explain why you didn't trigger the RCU warning in
netpoll_send_skb_on_dev()...

Re: [PATCH ethtool] ethtool: support combinations of FEC modes

2018-09-28 Thread Andrew Lunn

> I see where you're coming from, but if people start needing to manually
>  configure FEC on their consumer devices, possibly we have bigger
>  problems.

Yes, i agree with that. For the consumer market, SFPs needs to grow up
and start doing full and reliable auto-neg, just like copper Ethernet.

However, there is often an intermediate step after the really niche
market like TOR routers. Industrial applications start using this
stuff. There are a lot of planes flying today using SFPs for the
inflight entertainment systems. Fibre weights less than copper. It is
a somewhat specialist market, so you probably can still force them to
read the hardware manual, but i think they would prefer not to. And
i'm sure they are not the only industrial users. There are likely to
be more industrial users than TOR users.

In general, it is hard to know which APIs are going to remain Unix
Wizard level, and which are going to be used by mere mortals. So
ideally, we want consistency everywhere.

> I think the alternative, of finding a set of semantics that fits
>  everyone's hardware and covers everyone's requirements, is likely to
>  be difficult (and probably require changing the ethtool API).

Now is a good time to change the API, since we are moving to a netlink
socket. Which is why these questions were asked in the first place...

Andrew

Re: WARN_ON in TLP causing RT throttling

2018-09-28 Thread stranche


On 2018-09-27 18:25, Eric Dumazet wrote:

On 09/27/2018 05:16 PM, stran...@codeaurora.org wrote:


Hi Yuchung,

Based on the dumps we were able to get, it appears that TFO was not 
used in this case.
We also tried some local experiments where we dropped incoming SYN 
packets after already
successful TFO connections on the receive side to see if TFO would 
trigger this scenario, but

have not been able to reproduce it.

One other interesting thing we found is that the socket never sent or 
received any data. It only
sent/received the packets for the initial handshake and the outgoing 
FIN.


Just to make sure : Was this some sort of syzkaller (or other fuzzer) 
run ?


No, this was a normal run.

bond: take rcu lock in bond_poll_controller

2018-09-28 Thread Dave Jones

Callers of bond_for_each_slave_rcu are expected to hold the rcu lock,
otherwise a trace like below is shown

WARNING: CPU: 2 PID: 179 at net/core/dev.c:6567 
netdev_lower_get_next_private_rcu+0x34/0x40
CPU: 2 PID: 179 Comm: kworker/u16:15 Not tainted 4.19.0-rc5-backup+ #1
Workqueue: bond0 bond_mii_monitor
RIP: 0010:netdev_lower_get_next_private_rcu+0x34/0x40
Code: 48 89 fb e8 fe 29 63 ff 85 c0 74 1e 48 8b 45 00 48 81 c3 c0 00 00 00 48 
8b 00 48 39 d8 74 0f 48 89 45 00 48 8b 40 f8 5b 5d c3 <0f> 0b eb de 31 c0 eb f5 
0f 1f 40 00 0f 1f 44 00 00 48 8>
RSP: 0018:c987fa68 EFLAGS: 00010046
RAX:  RBX: 880429614560 RCX: 
RDX: 0001 RSI:  RDI: a184ada0
RBP: c987fa80 R08: 0001 R09: 
R10: c987f9f0 R11: 880429798040 R12: 8804289d5980
R13: a1511f60 R14: 00c8 R15: 
FS:  () GS:88042f88() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f4b78fce180 CR3: 00018180f006 CR4: 001606e0
Call Trace:
 bond_poll_controller+0x52/0x170
 netpoll_poll_dev+0x79/0x290
 netpoll_send_skb_on_dev+0x158/0x2c0
 netpoll_send_udp+0x2d5/0x430
 write_ext_msg+0x1e0/0x210
 console_unlock+0x3c4/0x630
 vprintk_emit+0xfa/0x2f0
 printk+0x52/0x6e
 ? __netdev_printk+0x12b/0x220
 netdev_info+0x64/0x80
 ? bond_3ad_set_carrier+0xe9/0x180
 bond_select_active_slave+0x1fc/0x310
 bond_mii_monitor+0x709/0x9b0
 process_one_work+0x221/0x5e0
 worker_thread+0x4f/0x3b0
 kthread+0x100/0x140
 ? process_one_work+0x5e0/0x5e0
 ? kthread_delayed_work_timer_fn+0x90/0x90
 ret_from_fork+0x24/0x30

Signed-off-by: Dave Jones 

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index c05c01a00755..77a3607a7099 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -977,6 +977,7 @@ static void bond_poll_controller(struct net_device 
*bond_dev)
if (bond_3ad_get_active_agg_info(bond, _info))
return;
 
+   rcu_read_lock();
bond_for_each_slave_rcu(bond, slave, iter) {
if (!bond_slave_is_up(slave))
continue;
@@ -992,6 +993,7 @@ static void bond_poll_controller(struct net_device 
*bond_dev)
 
netpoll_poll_dev(slave->dev);
}
+   rcu_read_unlock();
 }
 
 static void bond_netpoll_cleanup(struct net_device *bond_dev)

Re: [PATCH ethtool] ethtool: support combinations of FEC modes

2018-09-28 Thread Edward Cree

On 28/09/18 16:39, Andrew Lunn wrote:
> I wonder how true that will be in 5 years time, about reading the
> manual? SFP sockets are starting to appear in consumer devices. There
> are some Marvell SoC reference boards with SFP and SFP+. Broadcom also
> have some boards with SFP. With time, SFP will move out of the data
> centre and comms rack and into more everyday systems. In such context,
> reading the manual becomes less likely. It would be nice to avoid a
> future inconsistent mess caused be this sentiment now.
>
>   Andrew
I see where you're coming from, but if people start needing to manually
 configure FEC on their consumer devices, possibly we have bigger
 problems.
Ethtool FEC control is for those situations where autoneg, autodetect,
 autoconfigure etc. don't work (e.g. owing to out-of-spec switches, or
 just a user wanting to disable FEC to save a few hundred nanos).  I
 would hope that FEC won't show up in consumer gear until these kinds
 of problems have settled down somewhat.

Perhaps we can add something to the man page saying that not only can
 the semantics vary from NIC to NIC, but that the semantics for a given
 NIC might change in the future?  Then if in five years' time we know
 what the Right Thing™ is, we can move everyone over to that (with
 appropriately *loud* release-notes).

I think the alternative, of finding a set of semantics that fits
 everyone's hardware and covers everyone's requirements, is likely to
 be difficult (and probably require changing the ethtool API).

-Ed

RE: [PATCH net] net/ncsi: Extend NC-SI Netlink interface to allow user space to send NC-SI command

2018-09-28 Thread Justin.Lee1

Hi Samuel,

Please see my comment below.

Thanks,
Justin


> On Thu, 2018-09-27 at 21:08 +, justin.l...@dell.com wrote:
> > The new command (NCSI_CMD_SEND_CMD) is added to allow user space 
> > application 
> > to send NC-SI command to the network card.
> > Also, add a new attribute (NCSI_ATTR_DATA) for transferring request and 
> > response.
> > 
> > The work flow is as below. 
> > 
> > Request:
> > User space application -> Netlink interface (msg)
> >   -> new Netlink handler - 
> > ncsi_send_cmd_nl()
> >   -> ncsi_xmit_cmd()
> > Response:
> > Response received - ncsi_rcv_rsp() -> internal response handler - 
> > ncsi_rsp_handler_xxx()
> > -> 
> > ncsi_rsp_handler_netlink()
> > -> 
> > ncsi_send_netlink_rsp ()
> > -> 
> > Netlink interface (msg)
> > -> 
> > user space application
> > Command timeout - ncsi_request_timeout() -> ncsi_send_netlink_timeout ()
> > 
> > -> Netlink interface (msg with zero data length)
> > 
> > -> user space application
> > Error:
> > Error detected -> ncsi_send_netlink_err () -> Netlink interface (err msg)
> > 
> >-> user space application
> > 
> > 
> > Signed-off-by: Justin Lee 
> > 
> 
> Hi Justin,
> 
> Thanks for posting this on the list! The overall design looks good and so
> far looks like it should fit relatively well with the other OEM command
> patch. I'll try and run some OEM commands against my machine.
> Some comments below:
> 
> > 
> > ---
> >  include/uapi/linux/ncsi.h |   3 +
> >  net/ncsi/internal.h   |  12 ++-
> >  net/ncsi/ncsi-aen.c   |  10 ++-
> >  net/ncsi/ncsi-cmd.c   | 106 
> >  net/ncsi/ncsi-manage.c|  74 ++---
> >  net/ncsi/ncsi-netlink.c   | 199 
> > +-
> >  net/ncsi/ncsi-netlink.h   |   4 +
> >  net/ncsi/ncsi-rsp.c   |  70 ++--
> >  8 files changed, 420 insertions(+), 58 deletions(-)
> > 
> > diff --git a/include/uapi/linux/ncsi.h b/include/uapi/linux/ncsi.h
> > index 4c292ec..4992bfc 100644
> > --- a/include/uapi/linux/ncsi.h
> > +++ b/include/uapi/linux/ncsi.h
> > @@ -30,6 +30,7 @@ enum ncsi_nl_commands {
> > NCSI_CMD_PKG_INFO,
> > NCSI_CMD_SET_INTERFACE,
> > NCSI_CMD_CLEAR_INTERFACE,
> > +   NCSI_CMD_SEND_CMD,
> >  
> > __NCSI_CMD_AFTER_LAST,
> > NCSI_CMD_MAX = __NCSI_CMD_AFTER_LAST - 1
> > @@ -43,6 +44,7 @@ enum ncsi_nl_commands {
> >   * @NCSI_ATTR_PACKAGE_LIST: nested array of NCSI_PKG_ATTR attributes
> >   * @NCSI_ATTR_PACKAGE_ID: package ID
> >   * @NCSI_ATTR_CHANNEL_ID: channel ID
> > + * @NCSI_ATTR_DATA: command payload
> >   * @NCSI_ATTR_MAX: highest attribute number
> >   */
> >  enum ncsi_nl_attrs {
> > @@ -51,6 +53,7 @@ enum ncsi_nl_attrs {
> > NCSI_ATTR_PACKAGE_LIST,
> > NCSI_ATTR_PACKAGE_ID,
> > NCSI_ATTR_CHANNEL_ID,
> > +   NCSI_ATTR_DATA,
> >  
> > __NCSI_ATTR_AFTER_LAST,
> > NCSI_ATTR_MAX = __NCSI_ATTR_AFTER_LAST - 1
> > diff --git a/net/ncsi/internal.h b/net/ncsi/internal.h
> > index 8055e39..20ce735 100644
> > --- a/net/ncsi/internal.h
> > +++ b/net/ncsi/internal.h
> > @@ -215,12 +215,17 @@ struct ncsi_request {
> > unsigned charid;  /* Request ID - 0 to 255   */
> > bool used;/* Request that has been assigned  */
> > unsigned int flags;   /* NCSI request property   */
> > -#define NCSI_REQ_FLAG_EVENT_DRIVEN 1
> > +#define NCSI_REQ_FLAG_EVENT_DRIVEN 1
> > +#define NCSI_REQ_FLAG_NETLINK_DRIVEN   2
> > struct ncsi_dev_priv *ndp;/* Associated NCSI device  */
> > struct sk_buff   *cmd;/* Associated NCSI command packet  */
> > struct sk_buff   *rsp;/* Associated NCSI response packet */
> > struct timer_listtimer;   /* Timer on waiting for response   */
> > bool enabled; /* Time has been enabled or not*/
> > +
> > +   u32  snd_seq; /* netlink sending sequence number */
> > +   u32  snd_portid;  /* netlink portid of sender*/
> > +   struct nlmsghdr  nlhdr;   /* netlink message header  */
> >  };
> >  
> >  enum {
> > @@ -301,10 +306,13 @@ struct ncsi_cmd_arg {
> > unsigned short   payload; /* Command packet payload length */
> > unsigned int req_flags;   /* NCSI request properties   */
> > union {
> > -

[PATCH RFC net-next 4/5] net/ipv4: Update inet_dump_ifaddr to support NLM_F_DUMP_PROPER_HDR

2018-09-28 Thread dsahern

From: David Ahern 

Update inet_dump_ifaddr to check for NLM_F_DUMP_PROPER_HDR in the netlink
message header. If the flag is set, the dump request is expected to have
an ifaddrmsg struct as the header potentially followed by one or more
attributes. Any data passed in the header or as an attribute is taken as
a request to influence the data returned. Only values suppored by the
dump handler are allowed to be non-0 or set in the request. At the moment
only the IFA_TARGET_NETNSID attribute is supported. Follow on patches
will support for other fields (e.g., honor ifa_index and only return data
for the given device index).

Signed-off-by: David Ahern 
---
 net/ipv4/devinet.c | 52 +---
 1 file changed, 41 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 44d931a3cd50..1e06a21cd8f4 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -1661,15 +1661,15 @@ static int inet_fill_ifaddr(struct sk_buff *skb, struct 
in_ifaddr *ifa,
 
 static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 {
+   struct netlink_ext_ack *extack = cb->extack;
+   const struct nlmsghdr *nlh = cb->nlh;
struct inet_fill_args fillargs = {
.portid = NETLINK_CB(cb->skb).portid,
-   .seq = cb->nlh->nlmsg_seq,
+   .seq = nlh->nlmsg_seq,
.event = RTM_NEWADDR,
-   .flags = NLM_F_MULTI,
.netnsid = -1,
};
struct net *net = sock_net(skb->sk);
-   struct nlattr *tb[IFA_MAX+1];
struct net *tgt_net = net;
int h, s_h;
int idx, s_idx;
@@ -1683,15 +1683,45 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct 
netlink_callback *cb)
s_idx = idx = cb->args[1];
s_ip_idx = ip_idx = cb->args[2];
 
-   if (nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb, IFA_MAX,
-   ifa_ipv4_policy, NULL) >= 0) {
-   if (tb[IFA_TARGET_NETNSID]) {
-   fillargs.netnsid = nla_get_s32(tb[IFA_TARGET_NETNSID]);
+   if (nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR) {
+   struct nlattr *tb[IFA_MAX+1];
+   struct ifaddrmsg *ifm;
+   int err, i;
+
+   if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*ifm))) {
+   NL_SET_ERR_MSG(extack, "Invalid header");
+   return -EINVAL;
+   }
+
+   ifm = (struct ifaddrmsg *) nlmsg_data(cb->nlh);
+   if (ifm->ifa_prefixlen || ifm->ifa_flags || ifm->ifa_scope) {
+   NL_SET_ERR_MSG(extack, "Invalid values in header for 
dump request");
+   return -EINVAL;
+   }
+   if (ifm->ifa_index) {
+   NL_SET_ERR_MSG(extack, "Filter by device index not 
supported");
+   return -EINVAL;
+   }
+   err = nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb, 
IFA_MAX,
+   ifa_ipv4_policy, NULL);
+   if (err < 0)
+   return err;
 
-   tgt_net = rtnl_get_net_ns_capable(skb->sk,
- fillargs.netnsid);
-   if (IS_ERR(tgt_net))
-   return PTR_ERR(tgt_net);
+   for (i = 0; i < IFA_MAX; ++i) {
+   if (i == IFA_TARGET_NETNSID) {
+   fillargs.netnsid = nla_get_s32(tb[i]);
+
+   tgt_net = rtnl_get_net_ns_capable(skb->sk,
+ 
fillargs.netnsid);
+   if (IS_ERR(tgt_net))
+   return PTR_ERR(tgt_net);
+
+   fillargs.flags |= NLM_F_DUMP_FILTERED;
+   }
+   if (tb[i]) {
+   NL_SET_ERR_MSG(extack, "Unsupported attribute 
in dump request");
+   return -EINVAL;
+   }
}
}
 
-- 
2.11.0

[PATCH RFC net-next 2/5] net/ipv6: Refactor address dump to push inet6_fill_args to in6_dump_addrs

2018-09-28 Thread dsahern

From: David Ahern 

Pull the inet6_fill_args arg up to in6_dump_addrs and move netnsid
into it. Since IFA_TARGET_NETNSID is a kernel side filter add the
NLM_F_DUMP_FILTERED flag so userspace knows the request was honored.

Signed-off-by: David Ahern 
---
 net/ipv6/addrconf.c | 59 +
 1 file changed, 32 insertions(+), 27 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index a9a317322388..375ea9d9869b 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4793,12 +4793,19 @@ static inline int inet6_ifaddr_msgsize(void)
   + nla_total_size(4)  /* IFA_RT_PRIORITY */;
 }
 
+enum addr_type_t {
+   UNICAST_ADDR,
+   MULTICAST_ADDR,
+   ANYCAST_ADDR,
+};
+
 struct inet6_fill_args {
u32 portid;
u32 seq;
int event;
unsigned int flags;
int netnsid;
+   enum addr_type_t type;
 };
 
 static int inet6_fill_ifaddr(struct sk_buff *skb, struct inet6_ifaddr *ifa,
@@ -4930,39 +4937,28 @@ static int inet6_fill_ifacaddr(struct sk_buff *skb, 
struct ifacaddr6 *ifaca,
return 0;
 }
 
-enum addr_type_t {
-   UNICAST_ADDR,
-   MULTICAST_ADDR,
-   ANYCAST_ADDR,
-};
-
 /* called with rcu_read_lock() */
 static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb,
- struct netlink_callback *cb, enum addr_type_t type,
- int s_ip_idx, int *p_ip_idx, int netnsid)
+ struct netlink_callback *cb,
+ int s_ip_idx, int *p_ip_idx,
+ struct inet6_fill_args *fillargs)
 {
-   struct inet6_fill_args fillargs = {
-   .portid = NETLINK_CB(cb->skb).portid,
-   .seq = cb->nlh->nlmsg_seq,
-   .flags = NLM_F_MULTI,
-   .netnsid = netnsid,
-   };
struct ifmcaddr6 *ifmca;
struct ifacaddr6 *ifaca;
int err = 1;
int ip_idx = *p_ip_idx;
 
read_lock_bh(>lock);
-   switch (type) {
+   switch (fillargs->type) {
case UNICAST_ADDR: {
struct inet6_ifaddr *ifa;
-   fillargs.event = RTM_NEWADDR;
+   fillargs->event = RTM_NEWADDR;
 
/* unicast address incl. temp addr */
list_for_each_entry(ifa, >addr_list, if_list) {
if (++ip_idx < s_ip_idx)
continue;
-   err = inet6_fill_ifaddr(skb, ifa, );
+   err = inet6_fill_ifaddr(skb, ifa, fillargs);
if (err < 0)
break;
nl_dump_check_consistent(cb, nlmsg_hdr(skb));
@@ -4970,26 +4966,26 @@ static int in6_dump_addrs(struct inet6_dev *idev, 
struct sk_buff *skb,
break;
}
case MULTICAST_ADDR:
-   fillargs.event = RTM_GETMULTICAST;
+   fillargs->event = RTM_GETMULTICAST;
 
/* multicast address */
for (ifmca = idev->mc_list; ifmca;
 ifmca = ifmca->next, ip_idx++) {
if (ip_idx < s_ip_idx)
continue;
-   err = inet6_fill_ifmcaddr(skb, ifmca, );
+   err = inet6_fill_ifmcaddr(skb, ifmca, fillargs);
if (err < 0)
break;
}
break;
case ANYCAST_ADDR:
-   fillargs.event = RTM_GETANYCAST;
+   fillargs->event = RTM_GETANYCAST;
/* anycast address */
for (ifaca = idev->ac_list; ifaca;
 ifaca = ifaca->aca_next, ip_idx++) {
if (ip_idx < s_ip_idx)
continue;
-   err = inet6_fill_ifacaddr(skb, ifaca, );
+   err = inet6_fill_ifacaddr(skb, ifaca, fillargs);
if (err < 0)
break;
}
@@ -5005,10 +5001,16 @@ static int in6_dump_addrs(struct inet6_dev *idev, 
struct sk_buff *skb,
 static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
   enum addr_type_t type)
 {
+   struct inet6_fill_args fillargs = {
+   .portid = NETLINK_CB(cb->skb).portid,
+   .seq = cb->nlh->nlmsg_seq,
+   .flags = NLM_F_MULTI,
+   .netnsid = -1,
+   .type = type,
+   };
struct net *net = sock_net(skb->sk);
struct nlattr *tb[IFA_MAX+1];
struct net *tgt_net = net;
-   int netnsid = -1;
int h, s_h;
int idx, ip_idx;
int s_idx, s_ip_idx;
@@ -5023,11 +5025,14 @@ static int inet6_dump_addr(struct sk_buff *skb, struct 
netlink_callback *cb,
if (nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb, IFA_MAX,
ifa_ipv6_policy, NULL)

[PATCH RFC net-next 5/5] net/ipv6: Update inet6_dump_addr to support NLM_F_DUMP_PROPER_HDR

2018-09-28 Thread dsahern

From: David Ahern 

Update inet6_dump_addr to check for NLM_F_DUMP_PROPER_HDR in the netlink
message header. If the flag is set, the dump request is expected to have
an ifaddrmsg struct as the header potentially followed by one or more
attributes. Any data passed in the header or as an attribute is taken as
a request to influence the data returned. Only values suppored by the
dump handler are allowed to be non-0 or set in the request. At the moment
only the IFA_TARGET_NETNSID attribute is supported. Follow on patches
will support for other fields (e.g., honor ifa_index and only return data
for the given device index).

Signed-off-by: David Ahern 
---
 net/ipv6/addrconf.c | 50 --
 1 file changed, 40 insertions(+), 10 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 375ea9d9869b..888e5a4b8dd2 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -5001,6 +5001,8 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct 
sk_buff *skb,
 static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
   enum addr_type_t type)
 {
+   struct netlink_ext_ack *extack = cb->extack;
+   const struct nlmsghdr *nlh = cb->nlh;
struct inet6_fill_args fillargs = {
.portid = NETLINK_CB(cb->skb).portid,
.seq = cb->nlh->nlmsg_seq,
@@ -5009,7 +5011,6 @@ static int inet6_dump_addr(struct sk_buff *skb, struct 
netlink_callback *cb,
.type = type,
};
struct net *net = sock_net(skb->sk);
-   struct nlattr *tb[IFA_MAX+1];
struct net *tgt_net = net;
int h, s_h;
int idx, ip_idx;
@@ -5022,17 +5023,46 @@ static int inet6_dump_addr(struct sk_buff *skb, struct 
netlink_callback *cb,
s_idx = idx = cb->args[1];
s_ip_idx = ip_idx = cb->args[2];
 
-   if (nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb, IFA_MAX,
-   ifa_ipv6_policy, NULL) >= 0) {
-   if (tb[IFA_TARGET_NETNSID]) {
-   fillargs.netnsid = nla_get_s32(tb[IFA_TARGET_NETNSID]);
+   if (nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR) {
+   struct nlattr *tb[IFA_MAX+1];
+   struct ifaddrmsg *ifm;
+   int err, i;
+
+   if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*ifm))) {
+   NL_SET_ERR_MSG(extack, "Invalid header");
+   return -EINVAL;
+   }
 
-   tgt_net = rtnl_get_net_ns_capable(skb->sk,
- fillargs.netnsid);
-   if (IS_ERR(tgt_net))
-   return PTR_ERR(tgt_net);
+   ifm = (struct ifaddrmsg *)nlmsg_data(cb->nlh);
+   if (ifm->ifa_prefixlen || ifm->ifa_flags || ifm->ifa_scope) {
+   NL_SET_ERR_MSG(extack, "Invalid values in header for 
dump request");
+   return -EINVAL;
+   }
+   if (ifm->ifa_index) {
+   NL_SET_ERR_MSG(extack, "Filter by device index not 
supported");
+   return -EINVAL;
+   }
 
-   fillargs.flags |= NLM_F_DUMP_FILTERED;
+   err = nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb, 
IFA_MAX,
+ ifa_ipv6_policy, NULL);
+   if (err < 0)
+   return err;
+
+   for (i = 0; i < IFA_MAX; ++i) {
+   if (i == IFA_TARGET_NETNSID) {
+   fillargs.netnsid = nla_get_s32(tb[i]);
+
+   tgt_net = rtnl_get_net_ns_capable(skb->sk,
+ 
fillargs.netnsid);
+   if (IS_ERR(tgt_net))
+   return PTR_ERR(tgt_net);
+
+   fillargs.flags |= NLM_F_DUMP_FILTERED;
+   }
+   if (tb[i]) {
+   NL_SET_ERR_MSG(extack, "Unsupported attribute 
in dump request");
+   return -EINVAL;
+   }
}
}
 
-- 
2.11.0

[PATCH RFC net-next 3/5] netlink: introduce NLM_F_DUMP_PROPER_HDR flag

2018-09-28 Thread dsahern

From: David Ahern 

Add a new flag, NLM_F_DUMP_PROPER_HDR, for userspace to indicate to the
kernel that it believes it is sending the right header struct for the
dump message type (ifinfomsg, ifaddrmsg, rtmsg, fib_rule_hdr, ...).

Setting the flag in the netlink message header indicates to the kernel
it should do rigid checking on all data passed in the dump request and
filter the data returned based on data passed in.

Signed-off-by: David Ahern 
---
 include/uapi/linux/netlink.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/netlink.h b/include/uapi/linux/netlink.h
index 776bc92e9118..e722bed88dee 100644
--- a/include/uapi/linux/netlink.h
+++ b/include/uapi/linux/netlink.h
@@ -57,6 +57,7 @@ struct nlmsghdr {
 #define NLM_F_ECHO 0x08/* Echo this request*/
 #define NLM_F_DUMP_INTR0x10/* Dump was inconsistent due to 
sequence change */
 #define NLM_F_DUMP_FILTERED0x20/* Dump was filtered as requested */
+#define NLM_F_DUMP_PROPER_HDR  0x40/* Dump request has the proper header 
for type */
 
 /* Modifiers to GET request */
 #define NLM_F_ROOT 0x100   /* specify tree root*/
-- 
2.11.0

[PATCH RFC net-next 1/5] net/netlink: Pass extack to dump callbacks

2018-09-28 Thread dsahern

From: David Ahern 

Pass extack to dump callbacks by adding extack to netlink_dump_control
and transferring to netlink_callback. Update rtnetlink as the first
user.

Signed-off-by: David Ahern 
---
 include/linux/netlink.h  | 2 ++
 net/core/rtnetlink.c | 1 +
 net/netlink/af_netlink.c | 1 +
 3 files changed, 4 insertions(+)

diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index 71f121b66ca8..8fc90308a653 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -176,6 +176,7 @@ struct netlink_callback {
void*data;
/* the module that dump function belong to */
struct module   *module;
+   struct netlink_ext_ack  *extack;
u16 family;
u16 min_dump_alloc;
unsigned intprev_seq, seq;
@@ -197,6 +198,7 @@ struct netlink_dump_control {
int (*done)(struct netlink_callback *);
void *data;
struct module *module;
+   struct netlink_ext_ack *extack;
u16 min_dump_alloc;
 };
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 35162e1b06ad..da91b38297d3 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -4689,6 +4689,7 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, struct 
nlmsghdr *nlh,
.dump   = dumpit,
.min_dump_alloc = min_dump_alloc,
.module = owner,
+   .extack = extack
};
err = netlink_dump_start(rtnl, skb, nlh, );
/* netlink_dump_start() will keep a reference on
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index e3a0538ec0be..7d9e735b32c4 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2307,6 +2307,7 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff 
*skb,
cb->module = control->module;
cb->min_dump_alloc = control->min_dump_alloc;
cb->skb = skb;
+   cb->extack = control->extack;
 
if (control->start) {
ret = control->start(cb);
-- 
2.11.0

[PATCH RFC net-next 0/5] rtnetlink: Add support for rigid checking of data in dump request

2018-09-28 Thread dsahern

From: David Ahern 

There are many use cases where a user wants to influence what is
returned in a dump for some rtnetlink command: one is wanting data
for a different namespace than the one the request is received and
another is limiting the amount of data returned in the dump to a
specific set of interest to userspace, reducing the cpu overhead of
both kernel and userspace. Unfortunately, the kernel has historically
not been strict with checking for the proper header or checking the
values passed in the header. This lenient implementation has allowed
iproute2 and other packages to pass any struct or data in the dump
request as long as the family is the first byte. For example, ifinfomsg
struct is used by iproute2 for all generic dump requests - links,
addresses, routes and rules when it is really only valid for link
requests.

There is 1 is example where the kernel deals with the wrong struct: link
dumps after VF support was added. Older iproute2 was sending rtgenmsg as
the header instead of ifinfomsg so a patch was added to try and detect
old userspace vs new:
e5eca6d41f53 ("rtnetlink: fix userspace API breakage for iproute2 < v3.9.0")

The latest example is Christian's patch set wanting to return addresses for
a target namespace. It guesses the header struct is an ifaddrmsg and if it
guesses wrong a netlink warning is generated in the kernel log on every
address dump which is unacceptable.

Another example where the kernel is a bit lenient is route dumps: iproute2
can send either a request with either ifinfomsg or a rtmsg as the header
struct, yet the kernel always treats the header as an rtmsg (see
inet_dump_fib and rtm_flags check).

How to resolve the problem of not breaking old userspace yet be able to
move forward with new features such as kernel side filtering which are
crucial for efficient operation at high scale?

This patch set addresses the problem by adding a new netlink flag,
NLM_F_DUMP_PROPER_HDR, that userspace can set to say "I have a clue, and
I am sending the right header struct" and that the struct fields and any
attributes after it should be used for filtering the data returned in the
dump.

Kernel side, the dump handlers are updated to check every field in the
header struct and all attributes passed. Only ones where filtering is
implemented are allowed to be set. Any other values cause the dump to fail
with EINVAL. If the new flag is honored by the kernel and the dump contents
adjusted by any data passed in the request, the dump handler sets the
NLM_F_DUMP_FILTERED flag in the netlink message header.

This is an RFC set with the address handlers updated. If the approach is
acceptable, then I will do the same to the other rtnetlink dump handlers.


David Ahern (5):
  net/netlink: Pass extack to dump callbacks
  net/ipv6: Refactor address dump to push inet6_fill_args to
in6_dump_addrs
  netlink: introduce NLM_F_DUMP_PROPER_HDR flag
  net/ipv4: Update inet_dump_ifaddr to support NLM_F_DUMP_PROPER_HDR
  net/ipv6: Update inet6_dump_addr to support NLM_F_DUMP_PROPER_HDR

 include/linux/netlink.h  |   2 +
 include/uapi/linux/netlink.h |   1 +
 net/core/rtnetlink.c |   1 +
 net/ipv4/devinet.c   |  52 +-
 net/ipv6/addrconf.c  | 101 +--
 net/netlink/af_netlink.c |   1 +
 6 files changed, 114 insertions(+), 44 deletions(-)

-- 
2.11.0

Re: [PATCH ethtool] ethtool: support combinations of FEC modes

2018-09-28 Thread Andrew Lunn

> For us, those semantics make sense (our HW has a notion of 'supported'
>  and 'requested' bits for each FEC type for each of local-device, cable
>  and link-partner, and uses the strongest FEC mode that's supported by
>  everyone and requested by anyone); but if something else is a better fit
>  for your hardware I wouldn't worry too much about the inconsistency —
>  people using this functionality will hopefully have read the hardware's
>  user manual...

I wonder how true that will be in 5 years time, about reading the
manual? SFP sockets are starting to appear in consumer devices. There
are some Marvell SoC reference boards with SFP and SFP+. Broadcom also
have some boards with SFP. With time, SFP will move out of the data
centre and comms rack and into more everyday systems. In such context,
reading the manual becomes less likely. It would be nice to avoid a
future inconsistent mess caused be this sentiment now.

  Andrew

Re: [Patch net-next v3] net_sched: change tcf_del_walker() to take idrinfo->lock

2018-09-28 Thread Ido Schimmel

On Fri, Sep 28, 2018 at 05:59:00PM +0300, Ido Schimmel wrote:
> I'm getting a use-after-free when running tc_chains.sh selftest and I
> believe it's caused by this patch.

BTW, I can't reproduce the issue after reverting the patch.

Re: [Patch net-next v3] net_sched: change tcf_del_walker() to take idrinfo->lock

2018-09-28 Thread Ido Schimmel

On Wed, Sep 19, 2018 at 04:37:29PM -0700, Cong Wang wrote:
> From: Vlad Buslov 
> 
> From: Vlad Buslov 
> 
> Action API was changed to work with actions and action_idr in concurrency
> safe manner, however tcf_del_walker() still uses actions without taking a
> reference or idrinfo->lock first, and deletes them directly, disregarding
> possible concurrent delete.
> 
> Change tcf_del_walker() to take idrinfo->lock while iterating over actions
> and use new tcf_idr_release_unsafe() to release them while holding the
> lock.
> 
> And the blocking function fl_hw_destroy_tmplt() could be called when we
> put a filter chain, so defer it to a work queue.

I'm getting a use-after-free when running tc_chains.sh selftest and I
believe it's caused by this patch.

To reproduce:
# cd tools/testing/selftests/net/forwarding
# export TESTS="template_filter_fits"; ./tc_chains.sh veth0 veth1

__tcf_chain_put()
tc_chain_tmplt_del()
fl_tmplt_destroy()
tcf_queue_work(>rwork, fl_tmplt_destroy_work)
tcf_chain_destroy()
kfree(chain)

Some time later fl_tmplt_destroy_work() starts executing and
dereferencing 'chain'.

Splat:
[   48.269074] 
==
[   48.270186] BUG: KASAN: use-after-free in fl_tmplt_destroy_work+0x289/0x2d0
[   48.271199] Read of size 8 at addr 880067f0d498 by task kworker/u2:1/18
[   48.272270]
[   48.272520] CPU: 0 PID: 18 Comm: kworker/u2:1 Not tainted 4.19.0-rc5-custom+ 
#917  
[   48.273683] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014
[   48.275495] Workqueue: tc_filter_workqueue fl_tmplt_destroy_work 
   
[   48.276387] Call Trace:  
   
[   48.276766]  dump_stack+0x10f/0x1d8
[   48.277302]  ? dump_stack_print_info.cold.0+0x20/0x20
[   48.278080]  ? printk+0xac/0xd4   
[   48.278585]  ? cpumask_weight.constprop.23+0x39/0x39 
 
[   48.279360]  print_address_description.cold.3+0x9/0x258
[   48.280186]  kasan_report.cold.4+0x6a/0x9d
[   48.280856]  __asan_report_load8_noabort+0x19/0x20
[   48.281642]  fl_tmplt_destroy_work+0x289/0x2d0   
 
[   48.282331]  ? fl_tmplt_destroy+0x30/0x30
 
[   48.282966]  process_one_work+0xb5d/0x1a10   
 
[   48.283642]  ? kasan_check_write+0x14/0x20  
[   48.284301]  ? pwq_dec_nr_in_flight+0x4e0/0x4e0
[   48.285003]  ? lock_repin_lock+0x360/0x360   

[   48.285652]  ? __schedule+0x86d/0x2310
[   48.286256]  ? lockdep_hardirqs_on+0x39f/0x580
[   48.286995]  ? __sched_text_start+0x8/0x8
[   48.287654]  ? lock_downgrade+0x740/0x740
[   48.288292]  ? sched_clock_local+0xe0/0x150
[   48.288969]  ? save_trace+0x340/0x340
[   48.289526]  ? save_trace+0x340/0x340
[   48.290156]  ? check_preemption_disabled+0x3b/0x210
[   48.290937]  ? find_held_lock+0x40/0x1d0
[   48.291598]  ? debug_smp_processor_id+0x1c/0x20
[   48.292290]  ? worker_thread+0x3c3/0x12b0
[   48.292952]  ? lock_contended+0x1260/0x1260
[   48.293606]  ? _raw_spin_unlock_irq+0x2c/0x50
[   48.294311]  ? do_raw_spin_trylock+0x121/0x1c0
[   48.294982]  ? do_raw_spin_lock+0x1e0/0x1e0
[   48.295658]  ? trace_hardirqs_on+0x290/0x290
[   48.296346]  worker_thread+0x18e/0x12b0
[   48.296972]  ? lock_repin_lock+0x360/0x360
[   48.297624]  ? process_one_work+0x1a10/0x1a10
[   48.298332]  ? save_trace+0x340/0x340
[   48.298937]  ? kasan_check_write+0x14/0x20
[   48.299639]  ? sched_clock_local+0xe0/0x150
[   48.300310]  ? check_preemption_disabled+0x3b/0x210
[   48.301064]  ? check_preemption_disabled+0x3b/0x210
[   48.301848]  ? find_held_lock+0x40/0x1d0
[   48.302472]  ? debug_smp_processor_id+0x1c/0x20
[   48.303196]  ? _raw_spin_unlock_irqrestore+0x57/0x70
[   48.304037]  ? _raw_spin_unlock_irqrestore+0x57/0x70
[   48.304862]  ? lockdep_hardirqs_on+0x39f/0x580
[   48.305595]  ? trace_hardirqs_on+0xa1/0x290
[   48.306237]  ? do_raw_spin_trylock+0x121/0x1c0
[   48.306924]  ? __kthread_parkme+0xdc/0x1a0
[   48.307607]  ? preempt_schedule_common+0x1f/0xd0
[   48.308359]  ? __kthread_parkme+0x5d/0x1a0
[   48.309014]  ? __kthread_parkme+0xfa/0x1a0
[   48.309690]  ? _raw_spin_unlock_irqrestore+0x60/0x70
[   48.310461]  kthread+0x34d/0x410
[   48.310970]  ? process_one_work+0x1a10/0x1a10
[   48.311672]  ? kthread_delayed_work_timer_fn+0x450/0x450
[   48.312548]  ret_from_fork+0x24/0x30
[   48.313114]
[   48.313390] Allocated by task 1239:
[   48.313983]  save_stack+0x43/0xd0
[   48.314533]  kasan_kmalloc+0xc4/0xe0
[   48.315119]  kmem_cache_alloc_trace+0x124/0x280
[   48.315873]  tcf_chain_create+0xa4/0x370
[   48.316506]

[PATCH iproute2-next] tc: f_flower: add geneve option match support to flower

2018-09-28 Thread Simon Horman

From: Pieter Jansen van Vuuren 

Allow matching on options in Geneve tunnel headers.

The options can be described in the form
CLASS:TYPE:DATA/CLASS_MASK:TYPE_MASK:DATA_MASK, where CLASS is
represented as a 16bit hexadecimal value, TYPE as an 8bit
hexadecimal value and DATA as a variable length hexadecimal value.

e.g.
 # ip link add name geneve0 type geneve dstport 0 external
 # tc qdisc add dev geneve0 ingress
 # tc filter add dev geneve0 protocol ip parent : \
 flower \
   enc_src_ip 10.0.99.192 \
   enc_dst_ip 10.0.99.193 \
   enc_key_id 11 \
   geneve_opts 0102:80:1122334421314151/:ff: \
   ip_proto udp \
   action mirred egress redirect dev eth1

Signed-off-by: Pieter Jansen van Vuuren 
Signed-off-by: Simon Horman 
---
 man/man8/tc-flower.8 |  13 ++-
 tc/f_flower.c| 282 +++
 2 files changed, 294 insertions(+), 1 deletion(-)

diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8
index 305d7efe046a..8be8882592ea 100644
--- a/man/man8/tc-flower.8
+++ b/man/man8/tc-flower.8
@@ -80,6 +80,8 @@ flower \- flow based traffic control filter
 .IR TOS " | "
 .B enc_ttl
 .IR TTL " | "
+.B geneve_opts
+.IR OPTIONS " | "
 .BR ip_flags
 .IR IP_FLAGS
 .SH DESCRIPTION
@@ -283,6 +285,8 @@ bits is assumed.
 .BI enc_tos " NUMBER"
 .TQ
 .BI enc_ttl " NUMBER"
+.TQ
+.BI geneve_opts " OPTIONS"
 Match on IP tunnel metadata. Key id
 .I NUMBER
 is a 32 bit tunnel key id (e.g. VNI for VXLAN tunnel).
@@ -295,7 +299,14 @@ is a 16 bit UDP dst port. Tos
 .I NUMBER
 is an 8 bit tos (dscp+ecn) value, ttl
 .I NUMBER
-is an 8 bit time-to-live value.
+is an 8 bit time-to-live value. geneve_opts
+.I OPTIONS
+must be a valid list of comma-separated geneve options where each option
+consists of a key optionally followed by a slash and corresponding mask. If
+the masks is missing, \fBtc\fR assumes a full-length match. The options can
+be described in the form CLASS:TYPE:DATA/CLASS_MASK:TYPE_MASK:DATA_MASK,
+where CLASS is represented as a 16bit hexadecimal value, TYPE as an 8bit
+hexadecimal value and DATA as a variable length hexadecimal value.
 .TP
 .BI ip_flags " IP_FLAGS"
 .I IP_FLAGS
diff --git a/tc/f_flower.c b/tc/f_flower.c
index 59e5f572c542..4a8fb984a35a 100644
--- a/tc/f_flower.c
+++ b/tc/f_flower.c
@@ -79,6 +79,7 @@ static void explain(void)
"   enc_key_id [ KEY-ID ] |\n"
"   enc_tos MASKED-IP_TOS |\n"
"   enc_ttl MASKED-IP_TTL |\n"
+   "   geneve_opts MASKED-OPTIONS |\n"
"   ip_flags IP-FLAGS | \n"
"   enc_dst_port [ port_number ] }\n"
"   FILTERID := X:Y:Z\n"
@@ -589,6 +590,179 @@ static int flower_parse_enc_port(char *str, int type, 
struct nlmsghdr *n)
return 0;
 }
 
+static int flower_parse_geneve_opts(char *str, struct nlmsghdr *n)
+{
+   struct rtattr *nest;
+   char *token;
+   int i, err;
+
+   nest = addattr_nest(n, MAX_MSG, TCA_FLOWER_KEY_ENC_OPTS_GENEVE);
+
+   i = 1;
+   token = strsep(, ":");
+   while (token) {
+   switch (i) {
+   case TCA_FLOWER_KEY_ENC_OPT_GENEVE_CLASS:
+   {
+   __be16 opt_class;
+
+   if (!strlen(token))
+   break;
+   err = get_be16(_class, token, 16);
+   if (err)
+   return err;
+
+   addattr16(n, MAX_MSG, i, opt_class);
+   break;
+   }
+   case TCA_FLOWER_KEY_ENC_OPT_GENEVE_TYPE:
+   {
+   __u8 opt_type;
+
+   if (!strlen(token))
+   break;
+   err = get_u8(_type, token, 16);
+   if (err)
+   return err;
+
+   addattr8(n, MAX_MSG, i, opt_type);
+   break;
+   }
+   case TCA_FLOWER_KEY_ENC_OPT_GENEVE_DATA:
+   {
+   size_t token_len = strlen(token);
+   __u8 *opts;
+
+   if (!token_len)
+   break;
+   opts = malloc(token_len / 2);
+   if (!opts)
+   return -1;
+   if (hex2mem(token, opts, token_len / 2) < 0) {
+   free(opts);
+   return -1;
+   }
+   addattr_l(n, MAX_MSG, i, opts, token_len / 2);
+   free(opts);
+
+   break;
+   }
+   default:
+   fprintf(stderr, "Unknown \"geneve_opts\" type\n");
+

Re: [PATCHv3 bpf-next 01/12] bpf: Add iterator for spilled registers

2018-09-28 Thread Daniel Borkmann

On 09/28/2018 01:26 AM, Joe Stringer wrote:
> Add this iterator for spilled registers, it concentrates the details of
> how to get the current frame's spilled registers into a single macro
> while clarifying the intention of the code which is calling the macro.
> 
> Signed-off-by: Joe Stringer 
> Acked-by: Alexei Starovoitov 
> ---
>  include/linux/bpf_verifier.h | 11 +++
>  kernel/bpf/verifier.c| 16 +++-
>  2 files changed, 18 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index b42b60a83e19..af262b97f586 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -131,6 +131,17 @@ struct bpf_verifier_state {
>   u32 curframe;
>  };
>  
> +#define __get_spilled_reg(slot, frame)   
> \
> + (((slot < frame->allocated_stack / BPF_REG_SIZE) && \
> +   (frame->stack[slot].slot_type[0] == STACK_SPILL)) \
> +  ? >stack[slot].spilled_ptr : NULL)
> +
> +/* Iterate over 'frame', setting 'reg' to either NULL or a spilled register. 
> */
> +#define for_each_spilled_reg(iter, frame, reg)   
> \
> + for (iter = 0, reg = __get_spilled_reg(iter, frame);\
> +  iter < frame->allocated_stack / BPF_REG_SIZE;  \
> +  iter++, reg = __get_spilled_reg(iter, frame))
> +

(Just a very small nit: please make sure this has a bpf_ prefix given this
is a global kernel header.)

Re: [PATCHv3 bpf-next 04/12] bpf: Add PTR_TO_SOCKET verifier type

2018-09-28 Thread Daniel Borkmann

On 09/28/2018 01:26 AM, Joe Stringer wrote:
> Teach the verifier a little bit about a new type of pointer, a
> PTR_TO_SOCKET. This pointer type is accessed from BPF through the
> 'struct bpf_sock' structure.
> 
> Signed-off-by: Joe Stringer 
[...]
> +/* Return true if it's OK to have the same insn return a different type. */
> +static bool reg_type_mismatch_ok(enum bpf_reg_type type)
> +{
> + switch (type) {
> + case PTR_TO_CTX:
> + case PTR_TO_SOCKET:
> + case PTR_TO_SOCKET_OR_NULL:
> + return false;
> + default:
> + return true;
> + }
> +}
> +
> +/* If an instruction was previously used with particular pointer types, then 
> we
> + * need to be careful to avoid cases such as the below, where it may be ok
> + * for one branch accessing the pointer, but not ok for the other branch:
> + *
> + * R1 = sock_ptr
> + * goto X;
> + * ...
> + * R1 = some_other_valid_ptr;
> + * goto X;
> + * ...
> + * R2 = *(u32 *)(R1 + 0);
> + */
> +static bool reg_type_mismatch(enum bpf_reg_type src, enum bpf_reg_type prev)
> +{
> + return src != prev && (!reg_type_mismatch_ok(src) ||
> +!reg_type_mismatch_ok(prev));
> +}
> +
>  static int do_check(struct bpf_verifier_env *env)
>  {
>   struct bpf_verifier_state *state;
> @@ -4812,9 +4894,7 @@ static int do_check(struct bpf_verifier_env *env)
>*/
>   *prev_src_type = src_reg_type;
>  
> - } else if (src_reg_type != *prev_src_type &&
> -(src_reg_type == PTR_TO_CTX ||
> - *prev_src_type == PTR_TO_CTX)) {
> + } else if (reg_type_mismatch(src_reg_type, 
> *prev_src_type)) {
>   /* ABuser program is trying to use the same insn
>* dst_reg = *(u32*) (src_reg + off)
>* with different pointer types:
> @@ -4859,9 +4939,7 @@ static int do_check(struct bpf_verifier_env *env)
>  
>   if (*prev_dst_type == NOT_INIT) {
>   *prev_dst_type = dst_reg_type;
> - } else if (dst_reg_type != *prev_dst_type &&
> -(dst_reg_type == PTR_TO_CTX ||
> - *prev_dst_type == PTR_TO_CTX)) {
> + } else if (reg_type_mismatch(dst_reg_type, 
> *prev_dst_type)) {
>   verbose(env, "same insn cannot be used with 
> different pointers\n");
>   return -EINVAL;

Can also be as follow-up later on, but it would be crucial to also have
test_verifier tests on this logic here with mixing these pointer types
from different branches (right now we only cover ctx there).

Thanks,
Daniel

Re: [RFC PATCH v2 bpf-next 0/2] verifier liveness simplification

2018-09-28 Thread Edward Cree

On 26/09/18 23:16, Jiong Wang wrote:
> On 22/08/2018 20:00, Edward Cree wrote:
>> In the future this idea may be extended to form use-def chains.
>
>   1. instruction level use->def chain
>
>  - new use->def chains for each instruction. one eBPF insn could have two
>    uses at maximum.
I was thinking of something a lot weaker/simpler, just making
    ld rX, rY
 copy rY.parent into rX.parent and not read-mark rY (whereas actual
 arithmetic, pointer deref etc. would still create read marks).
But what you've described sounds interesting; perhaps it would also
 help later with loop-variable handling?

-Ed

Re: [PATCHv3 bpf-next 04/12] bpf: Add PTR_TO_SOCKET verifier type

2018-09-28 Thread Daniel Borkmann

On 09/28/2018 01:26 AM, Joe Stringer wrote:
> Teach the verifier a little bit about a new type of pointer, a
> PTR_TO_SOCKET. This pointer type is accessed from BPF through the
> 'struct bpf_sock' structure.
> 
> Signed-off-by: Joe Stringer 
[...]
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 72db8afb7cb6..057af3dc9f08 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5394,23 +5394,29 @@ static bool __sock_filter_check_size(int off, int 
> size,
>   return size == size_default;
>  }
>  
> -static bool sock_filter_is_valid_access(int off, int size,
> - enum bpf_access_type type,
> - const struct bpf_prog *prog,
> - struct bpf_insn_access_aux *info)
> +bool bpf_sock_is_valid_access(int off, int size, enum bpf_access_type type,
> +   struct bpf_insn_access_aux *info)
>  {
>   if (off < 0 || off >= sizeof(struct bpf_sock))
>   return false;
>   if (off % size != 0)
>   return false;
> - if (!__sock_filter_check_attach_type(off, type,
> -  prog->expected_attach_type))
> - return false;
>   if (!__sock_filter_check_size(off, size, info))
>   return false;
>   return true;
>  }
>  
> +static bool sock_filter_is_valid_access(int off, int size,
> + enum bpf_access_type type,
> + const struct bpf_prog *prog,
> + struct bpf_insn_access_aux *info)
> +{
> + if (!__sock_filter_check_attach_type(off, type,
> +  prog->expected_attach_type))
> + return false;
> + return bpf_sock_is_valid_access(off, size, type, info);
> +}

This one here should also be swapped to make it more robust, meaning the
__sock_filter_check_attach_type() should come in a second step after basic
sanity checks have been completed, not before them. E.g. out of bounds read
access would then indicate a "good" access in the first one.

Re: [PATCHv3 bpf-next 04/12] bpf: Add PTR_TO_SOCKET verifier type

2018-09-28 Thread Daniel Borkmann

Hi Joe,

On 09/28/2018 01:26 AM, Joe Stringer wrote:
> Teach the verifier a little bit about a new type of pointer, a
> PTR_TO_SOCKET. This pointer type is accessed from BPF through the
> 'struct bpf_sock' structure.
> 
> Signed-off-by: Joe Stringer 
[...]
>   }
> @@ -1726,6 +1755,14 @@ static int check_mem_access(struct bpf_verifier_env 
> *env, int insn_idx, u32 regn
>   err = check_flow_keys_access(env, off, size);
>   if (!err && t == BPF_READ && value_regno >= 0)
>   mark_reg_unknown(env, regs, value_regno);
> + } else if (reg->type == PTR_TO_SOCKET) {
> + if (t == BPF_WRITE) {
> + verbose(env, "cannot write into socket\n");
> + return -EACCES;
> + }
> + err = check_sock_access(env, regno, off, size, t);
> + if (!err && value_regno >= 0)
> + mark_reg_unknown(env, regs, value_regno);

Not an issue today, but this is quite fragile and very easy to miss, if we
allow to enable writes into ptr_to_socket in future e.g. mark or others,
then lifting above will not be enough. E.g. see check_xadd() and friends,
this rejects writes to ctx via f37a8cb84cce ("bpf: reject stores into ctx
via st and xadd") as otherwise this would bypass the context rewriter. So
I think we should add PTR_TO_SOCKET to is_ctx_reg() as well to have a full
guarantee this won't happen.

>   } else {
>   verbose(env, "R%d invalid mem access '%s'\n", regno,
>   reg_type_str[reg->type]);

Thanks,
Daniel

1 2 >

1 - 100 of 151 matches

Mail list logo