Re: [RFC net-next 0/6] offload linux bonding tc ingress rules

2018-03-05 Thread Ido Schimmel
On Mon, Mar 05, 2018 at 01:28:28PM +, John Hurley wrote:
> To prevent sync issues between the kernel and offload device, the linux
> bond driver is effectively locked when it has offloaded rules, i.e. no new
> ports can be enslaved and no slaves can be released until the offload
> rules are removed. Similarly, if a port on a bond is deleted, the bond is
> destroyed, forcing a flush of all offloaded rules.

Hi John,

I understand where this is coming from, but I don't think these
semantics are acceptable. The part about not adding new slaves might
make sense, but one needs to be able to remove slaves at any time.

Anyway, it would be much better to handle this in a generic way that
team and other stacked devices can later re-use. There's a similar sync
issue with VLAN filtering, which is handled by bond/team by calling
vlan_vids_add_by_dev() and vlan_vids_del_by_dev() in their
ndo_add_slave() and ndo_del_slave(), respectively. You can do something
similar and call into TC to replay the necessary information to the
newly added slave?
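For illustration, that replay pattern can be modeled in a few lines (a toy user-space Python sketch, not kernel code; `Bond`, `Slave`, and their methods are invented names):

```python
class Slave:
    """Stand-in for a slave netdev with its own offload table."""
    def __init__(self, name):
        self.name = name
        self.rules = set()

    def install(self, rule):
        self.rules.add(rule)

    def remove(self, rule):
        self.rules.discard(rule)


class Bond:
    """Replays already-offloaded rules as membership changes, the way
    vlan_vids_add_by_dev()/vlan_vids_del_by_dev() replay VLAN filters."""
    def __init__(self):
        self.slaves = []
        self.offloaded_rules = []

    def add_rule(self, rule):
        self.offloaded_rules.append(rule)
        for slave in self.slaves:            # push to current members
            slave.install(rule)

    def ndo_add_slave(self, slave):
        self.slaves.append(slave)
        for rule in self.offloaded_rules:    # replay existing rules
            slave.install(rule)

    def ndo_del_slave(self, slave):
        for rule in self.offloaded_rules:    # flush before release
            slave.remove(rule)
        self.slaves.remove(slave)
```

With this shape the bond never needs to be "locked": enslaving replays existing state to the new member, and releasing flushes it.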


Flaw in RFC793 (Fwd: New Version Notification for draft-gont-tcpm-tcp-seq-validation-03.txt)

2018-03-05 Thread Fernando Gont
Folks,

Dave Borman and I are trying to get this flaw fixed in the TCP spec --
this is of particular interest since the IETF finally agreed to revise
the old spec. The working copy of our document is:


I'm wondering if any Linux TCP expert could help with this:

* Would you mind taking a look at our doc, and check if our description
of the Linux behavior is correct?

* If you do something different or better, we'd also like to know.

Thanks!

Cheers,
Fernando




 Forwarded Message 
Subject: New Version Notification for
draft-gont-tcpm-tcp-seq-validation-03.txt
Date: Mon, 05 Mar 2018 15:43:15 -0800
From: internet-dra...@ietf.org
To: Fernando Gont , David Borman



A new version of I-D, draft-gont-tcpm-tcp-seq-validation-03.txt
has been successfully submitted by Fernando Gont and posted to the
IETF repository.

Name:   draft-gont-tcpm-tcp-seq-validation
Revision:   03
Title:  On the Validation of TCP Sequence Numbers
Document date:  2018-03-05
Group:  Individual Submission
Pages:  16
URL:
https://www.ietf.org/internet-drafts/draft-gont-tcpm-tcp-seq-validation-03.txt
Status:
https://datatracker.ietf.org/doc/draft-gont-tcpm-tcp-seq-validation/
Htmlized:
https://tools.ietf.org/html/draft-gont-tcpm-tcp-seq-validation-03
Htmlized:
https://datatracker.ietf.org/doc/html/draft-gont-tcpm-tcp-seq-validation-03
Diff:
https://www.ietf.org/rfcdiff?url2=draft-gont-tcpm-tcp-seq-validation-03

Abstract:
   When TCP receives packets that lie outside of the receive window, the
   corresponding packets are dropped and either an ACK, RST or no
   response is generated due to the out-of-window packet, with no
   further processing of the packet.  Most of the time, this works just
   fine and TCP remains stable, especially when a TCP connection has
   unidirectional data flow.  However, there are three scenarios in
   which packets that are outside of the receive window should still
   have their ACK field processed, or else a packet war will take place.
   The aforementioned issues have affected a number of popular TCP
   implementations, typically leading to connection failures, system
   crashes, or other undesirable behaviors.  This document describes the
   three scenarios in which the aforementioned issues might arise, and
   formally updates RFC 793 such that these potential problems are
   mitigated.
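For readers following along, the plain RFC 793 acceptability check that the draft revisits can be sketched like this (a simplified model of the spec's four cases, with wraparound handled modulo 2^32; this is not a claim about what Linux actually does, which is the question above):

```python
MOD = 2 ** 32  # TCP sequence numbers wrap at 2^32

def seg_acceptable(seg_seq, seg_len, rcv_nxt, rcv_wnd):
    """RFC 793 segment acceptability test (simplified model).

    Returns True if the segment overlaps the receive window,
    using modular arithmetic so sequence wraparound is handled.
    """
    def in_window(s):
        # RCV.NXT <= s < RCV.NXT + RCV.WND, in sequence space
        return (s - rcv_nxt) % MOD < rcv_wnd

    if seg_len == 0:
        if rcv_wnd == 0:
            return seg_seq == rcv_nxt
        return in_window(seg_seq)
    if rcv_wnd == 0:
        return False            # data but zero window: not acceptable
    # Either the first or the last byte must fall inside the window.
    return in_window(seg_seq) or in_window((seg_seq + seg_len - 1) % MOD)
```

The draft's point is precisely that segments failing this test may still carry an ACK worth processing in three corner cases.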




Please note that it may take a couple of minutes from the time of submission
until the htmlized version and diff are available at tools.ietf.org.

The IETF Secretariat





Re: [PATCH iproute2-next 3/3] macsec: support JSON

2018-03-05 Thread Stephen Hemminger
On Mon,  5 Mar 2018 22:58:30 -0800
Stephen Hemminger  wrote:

> From: Stephen Hemminger 
> 
> The JSON support in macsec code was mostly missing and what was
> there was broken. This uses new json_print utilities to complete
> output.
> 
> Compile tested only.
> 
> Signed-off-by: Stephen Hemminger 

Did some basic macsec testing and it works correctly.


Re: [bpf-next PATCH 05/16] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-05 Thread John Fastabend
On 03/05/2018 10:42 PM, David Miller wrote:
> From: John Fastabend 
> Date: Mon, 5 Mar 2018 22:22:21 -0800
> 
>> All I meant by this is if an application uses sendfile() call
>> there is no good way to know when/if the kernel side will copy or
> xmit the data. So a reliable user space application will need to
>> only modify the data if it "knows" there are no outstanding sends
>> in-flight. So if we assume applications follow this then it
>> is OK to avoid the copy. Of course this is not good enough for
> security, but for monitoring/statistics (my use case 1) it works.
> 
> For an application implementing a networking file system, it's pretty
> legitimate for file contents to change before the page gets DMA's to
> the networking card.
> 

Still, there are useful BPF programs that can tolerate this, so I
would prefer to allow BPF programs to operate in the no-copy mode
if wanted. It doesn't have to be the default though, as it currently
is. An L7 load balancer is a good example of this.

> And that's perfectly fine, and we arranged everything such that this
> will work properly.
> 
> The card checksums what ends up being DMA'd so nothing from the
> networking side is broken.

Assuming the card has checksum support, correct? That is why we have
the SKBTX_SHARED_FRAG flag checked in skb_has_shared_frag() and the
checksum helpers called by the drivers when they do not support the
protocol being used. So it is probably an OK assumption when using
supported protocols and hardware? Perhaps in general folks just use
normal protocols and hardware, so it works.

> 
> So this assumption you mention really does not hold.
> 

OK.

> There needs to be some feedback from the BPF program that parses the
> packet.  This way it can say, "I need at least X more bytes before I
> can generate a verdict".  And you keep copying more and more bytes
> into a linear buffer and calling the parser over and over until it can
> generate a full verdict or you run out of networking data.
> 

So the "I need at least X more bytes" is the msg_cork_bytes() in patch
7. I could handle the sendpage case the same as I handle the sendmsg
case and copy the data into the buffer until N bytes are received. I
had planned to add this mode in a follow up series but could add it in
this series so we have all the pieces in one submission.

Although I used a scatterlist instead of a linear buffer. I was
planning to add a helper to pull in next sg list item if needed
rather than try to allocate a large linear block up front.
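The "keep copying until the parser can give a verdict" loop can be sketched as follows (an abstract model, not the sockmap code or API; the `parse` callback protocol here is invented for illustration):

```python
def run_parser(chunks, parse):
    """Model of the 'copy until the parser can give a verdict' loop.

    parse(buf) returns ("verdict", v) once it has enough data, or
    ("need", n) meaning "call me again once at least n more bytes
    have been linearized".  Purely illustrative.
    """
    buf = b""
    need = 0
    for chunk in chunks:          # data as it arrives from the socket
        buf += chunk
        if len(buf) < need:
            continue              # parser asked for more; keep copying
        kind, val = parse(buf)
        if kind == "verdict":
            return val
        need = len(buf) + val     # "I need at least val more bytes"
    return None                   # ran out of networking data


def length_prefixed(buf):
    """Example parser: 4-byte big-endian length, then the payload."""
    if len(buf) < 4:
        return ("need", 4 - len(buf))
    n = int.from_bytes(buf[:4], "big")
    if len(buf) < 4 + n:
        return ("need", 4 + n - len(buf))
    return ("verdict", buf[4:4 + n])
```

A scatterlist-based version would replace the `buf += chunk` concatenation with pulling in the next sg entry, as described above.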


Re: [PATCH net-next 2/3] sctp: add support for SCTP_DSTADDRV4/6 Information for sendmsg

2018-03-05 Thread Xin Long
On Tue, Mar 6, 2018 at 7:39 AM, Marcelo Ricardo Leitner
 wrote:
> On Mon, Mar 05, 2018 at 08:44:19PM +0800, Xin Long wrote:
>> This patch is to add support for Destination IPv4/6 Address options
>> for sendmsg, as described in section 5.3.9/10 of RFC6458.
>>
>> With this option, you can provide more than one destination addrs
>> to sendmsg when creating asoc, like sctp_connectx.
>>
>> It's also a necessary send info for sctp_sendv.
>>
>> Signed-off-by: Xin Long 
>> ---
>>  include/net/sctp/structs.h |  1 +
>>  include/uapi/linux/sctp.h  |  6 
>>  net/sctp/socket.c  | 77 ++
>>  3 files changed, 84 insertions(+)
>>
>> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
>> index d40a2a3..ec6e46b 100644
>> --- a/include/net/sctp/structs.h
>> +++ b/include/net/sctp/structs.h
>> @@ -2113,6 +2113,7 @@ struct sctp_cmsgs {
>>   struct sctp_sndrcvinfo *srinfo;
>>   struct sctp_sndinfo *sinfo;
>>   struct sctp_prinfo *prinfo;
>> + struct msghdr *addrs_msg;
>>  };
>>
>>  /* Structure for tracking memory objects */
>> diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
>> index 0dd1f82..a1bc350 100644
>> --- a/include/uapi/linux/sctp.h
>> +++ b/include/uapi/linux/sctp.h
>> @@ -308,6 +308,12 @@ typedef enum sctp_cmsg_type {
>>  #define SCTP_NXTINFO SCTP_NXTINFO
>>   SCTP_PRINFO,/* 5.3.7 SCTP PR-SCTP Information Structure */
>>  #define SCTP_PRINFO  SCTP_PRINFO
>> + SCTP_AUTHINFO,  /* 5.3.8 SCTP AUTH Information Structure (RESERVED) */
>> +#define SCTP_AUTHINFO    SCTP_AUTHINFO
>> + SCTP_DSTADDRV4, /* 5.3.9 SCTP Destination IPv4 Address Structure */
>> +#define SCTP_DSTADDRV4   SCTP_DSTADDRV4
>> + SCTP_DSTADDRV6, /* 5.3.10 SCTP Destination IPv6 Address Structure */
>> +#define SCTP_DSTADDRV6   SCTP_DSTADDRV6
>>  } sctp_cmsg_t;
>>
>>  /*
>> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
>> index fdde697..067b57a 100644
>> --- a/net/sctp/socket.c
>> +++ b/net/sctp/socket.c
>> @@ -1676,6 +1676,7 @@ static int sctp_sendmsg_new_asoc(struct sock *sk, __u16 sflags,
>>   struct net *net = sock_net(sk);
>>   struct sctp_association *asoc;
>>   enum sctp_scope scope;
>> + struct cmsghdr *cmsg;
>>   int err = -EINVAL;
>>
>>   *tp = NULL;
>> @@ -1741,6 +1742,67 @@ static int sctp_sendmsg_new_asoc(struct sock *sk, __u16 sflags,
>>   goto free;
>>   }
>>
>> + if (!cmsgs->addrs_msg)
>> + return 0;
>> +
>> + /* sendv addr list parse */
>> + for_each_cmsghdr(cmsg, cmsgs->addrs_msg) {
>> + struct sctp_transport *transport;
>> + struct sctp_association *old;
>> + union sctp_addr _daddr;
>> + int dlen;
>> +
>> + if (cmsg->cmsg_level != IPPROTO_SCTP ||
>> + (cmsg->cmsg_type != SCTP_DSTADDRV4 &&
>> +  cmsg->cmsg_type != SCTP_DSTADDRV6))
>> + continue;
>> +
>> + daddr = &_daddr;
>> + memset(daddr, 0, sizeof(*daddr));
>> + dlen = cmsg->cmsg_len - sizeof(struct cmsghdr);
>> + if (cmsg->cmsg_type == SCTP_DSTADDRV4) {
>> + if (dlen < sizeof(struct in_addr))
>> + goto free;
>> +
>> + dlen = sizeof(struct in_addr);
>> + daddr->v4.sin_family = AF_INET;
>> + daddr->v4.sin_port = htons(asoc->peer.port);
>> + memcpy(&daddr->v4.sin_addr, CMSG_DATA(cmsg), dlen);
>> + } else {
>> + if (dlen < sizeof(struct in6_addr))
>> + goto free;
>> +
>> + dlen = sizeof(struct in6_addr);
>> + daddr->v6.sin6_family = AF_INET6;
>> + daddr->v6.sin6_port = htons(asoc->peer.port);
>> + memcpy(&daddr->v6.sin6_addr, CMSG_DATA(cmsg), dlen);
>> + }
>> + err = sctp_verify_addr(sk, daddr, sizeof(*daddr));
>> + if (err)
>> + goto free;
>> +
>> + old = sctp_endpoint_lookup_assoc(ep, daddr, &transport);
>> + if (old && old != asoc) {
>> + if (old->state >= SCTP_STATE_ESTABLISHED)
>> + err = -EISCONN;
>> + else
>> + err = -EALREADY;
>> + goto free;
>> + }
>> +
>> + if (sctp_endpoint_is_peeled_off(ep, daddr)) {
>> + err = -EADDRNOTAVAIL;
>> + goto free;
>> + }
>> +
>> + transport = sctp_assoc_add_peer(asoc, daddr, GFP_KERNEL,
>> + SCTP_UNKNOWN);
>> + if (!transport) {
>> + err = -ENOMEM;
>> + goto free;
>> +  
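To make the new API concrete, here is roughly how a user-space sender could build these cmsgs (a sketch; the numeric values of SCTP_DSTADDRV4/V6 are inferred from the enum's position in the patch and must be checked against the final uapi header):

```python
import socket

IPPROTO_SCTP = 132
# Inferred from the enum order in the patch (after SCTP_PRINFO and
# SCTP_AUTHINFO); verify against include/uapi/linux/sctp.h.
SCTP_DSTADDRV4 = 7
SCTP_DSTADDRV6 = 8

def dstaddr_cmsgs(v4_addrs, v6_addrs):
    """Build (level, type, data) ancillary items for socket.sendmsg().

    Each SCTP_DSTADDRV4 cmsg carries a bare struct in_addr (4 bytes) and
    each SCTP_DSTADDRV6 a bare struct in6_addr (16 bytes), matching the
    dlen checks added to sctp_sendmsg_new_asoc().
    """
    cmsgs = [(IPPROTO_SCTP, SCTP_DSTADDRV4, socket.inet_pton(socket.AF_INET, a))
             for a in v4_addrs]
    cmsgs += [(IPPROTO_SCTP, SCTP_DSTADDRV6, socket.inet_pton(socket.AF_INET6, a))
              for a in v6_addrs]
    return cmsgs

# With an SCTP socket sk on a patched kernel, something like:
#   sk.sendmsg([b"payload"], dstaddr_cmsgs(["192.0.2.1", "192.0.2.2"], []))
```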

[PATCH iproute2-next 3/3] macsec: support JSON

2018-03-05 Thread Stephen Hemminger
From: Stephen Hemminger 

The JSON support in macsec code was mostly missing and what was
there was broken. This uses new json_print utilities to complete
output.

Compile tested only.

Signed-off-by: Stephen Hemminger 
---
 ip/ipmacsec.c | 249 +++---
 1 file changed, 168 insertions(+), 81 deletions(-)

diff --git a/ip/ipmacsec.c b/ip/ipmacsec.c
index b9976da80603..38ec7136e555 100644
--- a/ip/ipmacsec.c
+++ b/ip/ipmacsec.c
@@ -559,19 +559,33 @@ static int validate_secy_dump(struct rtattr **attrs)
   attrs[MACSEC_SECY_ATTR_SCB];
 }
 
-static void print_flag(FILE *f, struct rtattr *attrs[], const char *desc,
+static void print_flag(struct rtattr *attrs[], const char *desc,
   int field)
 {
-   if (attrs[field]) {
-   const char *v = values_on_off[!!rta_getattr_u8(attrs[field])];
+   __u8 flag;
 
-   if (is_json_context())
-   print_string(PRINT_JSON, desc, NULL, v);
-   else
-   fprintf(f, "%s %s ", desc, v);
+   if (!attrs[field])
+   return;
+
+   flag = rta_getattr_u8(attrs[field]);
+   if (is_json_context())
+   print_bool(PRINT_JSON, desc, NULL, flag);
+   else {
+   print_string(PRINT_FP, NULL, "%s ", desc);
+   print_string(PRINT_FP, NULL, "%s ",
+flag ? "on" : "off");
}
 }
 
+static void print_key(struct rtattr *key)
+{
+   SPRINT_BUF(keyid);
+
+   print_string(PRINT_ANY, "key", " key %s\n",
+hexstring_n2a(RTA_DATA(key), RTA_PAYLOAD(key),
+  keyid, sizeof(keyid)));
+}
+
 #define DEFAULT_CIPHER_NAME "GCM-AES-128"
 
 static const char *cs_id_to_name(__u64 cid)
@@ -585,43 +599,45 @@ static const char *cs_id_to_name(__u64 cid)
}
 }
 
-static void print_cipher_suite(const char *prefix, __u64 cid, __u8 icv_len)
-{
-   printf("%scipher suite: %s, using ICV length %d\n", prefix,
-  cs_id_to_name(cid), icv_len);
-}
-
-static void print_attrs(const char *prefix, struct rtattr *attrs[])
+static void print_attrs(struct rtattr *attrs[])
 {
-   print_flag(stdout, attrs, "protect", MACSEC_SECY_ATTR_PROTECT);
+   print_flag(attrs, "protect", MACSEC_SECY_ATTR_PROTECT);
 
if (attrs[MACSEC_SECY_ATTR_VALIDATE]) {
__u8 val = rta_getattr_u8(attrs[MACSEC_SECY_ATTR_VALIDATE]);
 
-   printf("validate %s ", validate_str[val]);
+   print_string(PRINT_ANY, "validate",
+"validate %s ", validate_str[val]);
}
 
-   print_flag(stdout, attrs, "sc", MACSEC_RXSC_ATTR_ACTIVE);
-   print_flag(stdout, attrs, "sa", MACSEC_SA_ATTR_ACTIVE);
-   print_flag(stdout, attrs, "encrypt", MACSEC_SECY_ATTR_ENCRYPT);
-   print_flag(stdout, attrs, "send_sci", MACSEC_SECY_ATTR_INC_SCI);
-   print_flag(stdout, attrs, "end_station", MACSEC_SECY_ATTR_ES);
-   print_flag(stdout, attrs, "scb", MACSEC_SECY_ATTR_SCB);
+   print_flag(attrs, "sc", MACSEC_RXSC_ATTR_ACTIVE);
+   print_flag(attrs, "sa", MACSEC_SA_ATTR_ACTIVE);
+   print_flag(attrs, "encrypt", MACSEC_SECY_ATTR_ENCRYPT);
+   print_flag(attrs, "send_sci", MACSEC_SECY_ATTR_INC_SCI);
+   print_flag(attrs, "end_station", MACSEC_SECY_ATTR_ES);
+   print_flag(attrs, "scb", MACSEC_SECY_ATTR_SCB);
+   print_flag(attrs, "replay", MACSEC_SECY_ATTR_REPLAY);
 
-   print_flag(stdout, attrs, "replay", MACSEC_SECY_ATTR_REPLAY);
if (attrs[MACSEC_SECY_ATTR_WINDOW]) {
-   printf("window %d ",
-  rta_getattr_u32(attrs[MACSEC_SECY_ATTR_WINDOW]));
+   __u32 win = rta_getattr_u32(attrs[MACSEC_SECY_ATTR_WINDOW]);
+
+   print_uint(PRINT_ANY, "window", "window %u ", win);
}
 
-   if (attrs[MACSEC_SECY_ATTR_CIPHER_SUITE] &&
-   attrs[MACSEC_SECY_ATTR_ICV_LEN]) {
-   printf("\n");
-   print_cipher_suite(prefix,
-   rta_getattr_u64(attrs[MACSEC_SECY_ATTR_CIPHER_SUITE]),
-   rta_getattr_u8(attrs[MACSEC_SECY_ATTR_ICV_LEN]));
+   if (attrs[MACSEC_SECY_ATTR_CIPHER_SUITE]) {
+   __u64 cid = rta_getattr_u64(attrs[MACSEC_SECY_ATTR_CIPHER_SUITE]);
+
+   print_string(PRINT_FP, NULL, "%s", _SL_);
+   print_string(PRINT_ANY, "cipher_suite",
+"cipher suite: %s,", cs_id_to_name(cid));
}
 
+   if (attrs[MACSEC_SECY_ATTR_ICV_LEN]) {
+   __u8 icv_len = rta_getattr_u8(attrs[MACSEC_SECY_ATTR_ICV_LEN]);
+
+   print_uint(PRINT_ANY, "icv_length",
+" using ICV length %u\n", icv_len);
+   }
 }
 
 static __u64 getattr_uint(struct rtattr *stat)
@@ -642,9 +658,9 @@ static __u64 getattr_uint(struct rtattr *stat)
}
 }
 
-static 

[PATCH iproute2-next 1/3] ip: macsec cleanup

2018-03-05 Thread Stephen Hemminger
From: Stephen Hemminger 

Break long lines and use const as recommended by checkpatch.

Signed-off-by: Stephen Hemminger 
---
 ip/ipmacsec.c | 46 --
 1 file changed, 24 insertions(+), 22 deletions(-)

diff --git a/ip/ipmacsec.c b/ip/ipmacsec.c
index c0b45f5a12d5..37faed821c10 100644
--- a/ip/ipmacsec.c
+++ b/ip/ipmacsec.c
@@ -23,9 +23,9 @@
 #include "ll_map.h"
 #include "libgenl.h"
 
-static const char *values_on_off[] = { "off", "on" };
+static const char * const values_on_off[] = { "off", "on" };
 
-static const char *VALIDATE_STR[] = {
+static const char * const validate_str[] = {
[MACSEC_VALIDATE_DISABLED] = "disabled",
[MACSEC_VALIDATE_CHECK] = "check",
[MACSEC_VALIDATE_STRICT] = "strict",
@@ -81,26 +81,27 @@ static int genl_family = -1;
 
 static void ipmacsec_usage(void)
 {
-   fprintf(stderr, "Usage: ip macsec add DEV tx sa { 0..3 } [ OPTS ] key ID KEY\n");
-   fprintf(stderr, "   ip macsec set DEV tx sa { 0..3 } [ OPTS ]\n");
-   fprintf(stderr, "   ip macsec del DEV tx sa { 0..3 }\n");
-   fprintf(stderr, "   ip macsec add DEV rx SCI [ on | off ]\n");
-   fprintf(stderr, "   ip macsec set DEV rx SCI [ on | off ]\n");
-   fprintf(stderr, "   ip macsec del DEV rx SCI\n");
-   fprintf(stderr, "   ip macsec add DEV rx SCI sa { 0..3 } [ OPTS ] key ID KEY\n");
-   fprintf(stderr, "   ip macsec set DEV rx SCI sa { 0..3 } [ OPTS ]\n");
-   fprintf(stderr, "   ip macsec del DEV rx SCI sa { 0..3 }\n");
-   fprintf(stderr, "   ip macsec show\n");
-   fprintf(stderr, "   ip macsec show DEV\n");
-   fprintf(stderr, "where  OPTS := [ pn  ] [ on | off ]\n");
-   fprintf(stderr, "   ID   := 128-bit hex string\n");
-   fprintf(stderr, "   KEY  := 128-bit hex string\n");
-   fprintf(stderr, "   SCI  := { sci  | port { 1..2^16-1 } address  }\n");
+   fprintf(stderr,
+   "Usage: ip macsec add DEV tx sa { 0..3 } [ OPTS ] key ID KEY\n"
+   "   ip macsec set DEV tx sa { 0..3 } [ OPTS ]\n"
+   "   ip macsec del DEV tx sa { 0..3 }\n"
+   "   ip macsec add DEV rx SCI [ on | off ]\n"
+   "   ip macsec set DEV rx SCI [ on | off ]\n"
+   "   ip macsec del DEV rx SCI\n"
+   "   ip macsec add DEV rx SCI sa { 0..3 } [ OPTS ] key ID KEY\n"
+   "   ip macsec set DEV rx SCI sa { 0..3 } [ OPTS ]\n"
+   "   ip macsec del DEV rx SCI sa { 0..3 }\n"
+   "   ip macsec show\n"
+   "   ip macsec show DEV\n"
+   "where  OPTS := [ pn  ] [ on | off ]\n"
+   "   ID   := 128-bit hex string\n"
+   "   KEY  := 128-bit hex string\n"
+   "   SCI  := { sci  | port { 1..2^16-1 } address  }\n");
 
exit(-1);
 }
 
-static int one_of(const char *msg, const char *realval, const char **list,
+static int one_of(const char *msg, const char *realval, const char * const *list,
  size_t len, int *index)
 {
int i;
@@ -597,7 +598,7 @@ static void print_attrs(const char *prefix, struct rtattr *attrs[])
if (attrs[MACSEC_SECY_ATTR_VALIDATE]) {
__u8 val = rta_getattr_u8(attrs[MACSEC_SECY_ATTR_VALIDATE]);
 
-   printf("validate %s ", VALIDATE_STR[val]);
+   printf("validate %s ", validate_str[val]);
}
 
print_flag(stdout, attrs, "sc", MACSEC_RXSC_ATTR_ACTIVE);
@@ -1077,7 +1078,7 @@ static void macsec_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
print_string(PRINT_ANY,
 "validation",
 "validate %s ",
-VALIDATE_STR[val]);
+validate_str[val]);
}
 
const char *inc_sci, *es, *replay;
@@ -1241,7 +1242,7 @@ static int macsec_parse_opt(struct link_util *lu, int argc, char **argv,
} else if (strcmp(*argv, "validate") == 0) {
NEXT_ARG();
ret = one_of("validate", *argv,
-VALIDATE_STR, ARRAY_SIZE(VALIDATE_STR),
+validate_str, ARRAY_SIZE(validate_str),
 (int *)&validate);
if (ret != 0)
return ret;
@@ -1265,7 +1266,8 @@ static int macsec_parse_opt(struct link_util *lu, int argc, char **argv,
}
 
if (!check_txsc_flags(es, scb, send_sci)) {
-   fprintf(stderr, "invalid combination of send_sci/end_station/scb\n");
+   fprintf(stderr,
+   "invalid combination of send_sci/end_station/scb\n");
return -1;
}
 
-- 
2.16.1



[PATCH iproute2-next 2/3] ipmacsec: collapse common code

2018-03-05 Thread Stephen Hemminger
From: Stephen Hemminger 

Several places copy/paste same code for printing array of statistics.

Signed-off-by: Stephen Hemminger 
---
 ip/ipmacsec.c | 137 ++
 1 file changed, 51 insertions(+), 86 deletions(-)

diff --git a/ip/ipmacsec.c b/ip/ipmacsec.c
index 37faed821c10..b9976da80603 100644
--- a/ip/ipmacsec.c
+++ b/ip/ipmacsec.c
@@ -624,19 +624,52 @@ static void print_attrs(const char *prefix, struct rtattr *attrs[])
 
 }
 
-static void print_one_stat(const char **names, struct rtattr **attr, int idx,
-  bool long_stat)
+static __u64 getattr_uint(struct rtattr *stat)
 {
-   int pad = strlen(names[idx]) + 1;
+   switch (RTA_PAYLOAD(stat)) {
+   case sizeof(__u64):
+   return rta_getattr_u64(stat);
+   case sizeof(__u32):
+   return rta_getattr_u32(stat);
+   case sizeof(__u16):
+   return rta_getattr_u16(stat);
+   case sizeof(__u8):
+   return rta_getattr_u8(stat);
+   default:
+   fprintf(stderr, "invalid attribute length %lu\n",
+   RTA_PAYLOAD(stat));
+   exit(-1);
+   }
+}
+
+static void print_stats(const char *prefix,
+   const char *names[], unsigned int num,
+   struct rtattr *stats[])
+{
+   unsigned int i;
+   int pad;
+
+   printf("%sstats:", prefix);
 
-   if (attr[idx]) {
-   if (long_stat)
-   printf("%*llu", pad, rta_getattr_u64(attr[idx]));
+   for (i = 1; i < num; i++) {
+   if (!names[i])
+   continue;
+   printf(" %s", names[i]);
+   }
+
+   printf("\n%s  ", prefix);
+
+   for (i = 1; i < num; i++) {
+   if (!names[i])
+   continue;
+
+   pad = strlen(names[i]) + 1;
+   if (stats[i])
+   printf("%*llu", pad, getattr_uint(stats[i]));
else
-   printf("%*u", pad, rta_getattr_u32(attr[idx]));
-   } else {
-   printf("%*c", pad, '-');
+   printf("%*c", pad, '-');
}
+   printf("\n");
 }
 
 static const char *txsc_stats_names[NUM_MACSEC_TXSC_STATS_ATTR] = {
@@ -649,29 +682,14 @@ static const char *txsc_stats_names[NUM_MACSEC_TXSC_STATS_ATTR] = {
 static void print_txsc_stats(const char *prefix, struct rtattr *attr)
 {
struct rtattr *stats[MACSEC_TXSC_STATS_ATTR_MAX + 1];
-   int i;
 
if (!attr || show_stats == 0)
return;
 
parse_rtattr_nested(stats, MACSEC_TXSC_STATS_ATTR_MAX + 1, attr);
-   printf("%sstats:", prefix);
-
-   for (i = 1; i < NUM_MACSEC_TXSC_STATS_ATTR; i++) {
-   if (!txsc_stats_names[i])
-   continue;
-   printf(" %s", txsc_stats_names[i]);
-   }
-
-   printf("\n%s  ", prefix);
 
-   for (i = 1; i < NUM_MACSEC_TXSC_STATS_ATTR; i++) {
-   if (!txsc_stats_names[i])
-   continue;
-   print_one_stat(txsc_stats_names, stats, i, true);
-   }
-
-   printf("\n");
+   print_stats(prefix, txsc_stats_names, NUM_MACSEC_TXSC_STATS_ATTR,
+   stats);
 }
 
 static const char *secy_stats_names[NUM_MACSEC_SECY_STATS_ATTR] = {
@@ -688,29 +706,14 @@ static const char *secy_stats_names[NUM_MACSEC_SECY_STATS_ATTR] = {
 static void print_secy_stats(const char *prefix, struct rtattr *attr)
 {
struct rtattr *stats[MACSEC_SECY_STATS_ATTR_MAX + 1];
-   int i;
 
if (!attr || show_stats == 0)
return;
 
parse_rtattr_nested(stats, MACSEC_SECY_STATS_ATTR_MAX + 1, attr);
-   printf("%sstats:", prefix);
 
-   for (i = 1; i < NUM_MACSEC_SECY_STATS_ATTR; i++) {
-   if (!secy_stats_names[i])
-   continue;
-   printf(" %s", secy_stats_names[i]);
-   }
-
-   printf("\n%s  ", prefix);
-
-   for (i = 1; i < NUM_MACSEC_SECY_STATS_ATTR; i++) {
-   if (!secy_stats_names[i])
-   continue;
-   print_one_stat(secy_stats_names, stats, i, true);
-   }
-
-   printf("\n");
+   print_stats(prefix, secy_stats_names,
+   NUM_MACSEC_SECY_STATS_ATTR, stats);
 }
 
 static const char *rxsa_stats_names[NUM_MACSEC_SA_STATS_ATTR] = {
@@ -724,29 +727,13 @@ static const char *rxsa_stats_names[NUM_MACSEC_SA_STATS_ATTR] = {
 static void print_rxsa_stats(const char *prefix, struct rtattr *attr)
 {
struct rtattr *stats[MACSEC_SA_STATS_ATTR_MAX + 1];
-   int i;
 
if (!attr || show_stats == 0)
return;
 
parse_rtattr_nested(stats, MACSEC_SA_STATS_ATTR_MAX + 1, attr);
-   printf("%s%s  ", prefix, prefix);
-
-   for (i = 1; i < NUM_MACSEC_SA_STATS_ATTR; i++) {
-

[PATCH iproute2-next 0/3] macsec cleanup and JSON

2018-03-05 Thread Stephen Hemminger
From: Stephen Hemminger 

The macsec code didn't really support JSON and had several
pieces of copy/pasted code.

Stephen Hemminger (3):
  ip: macsec cleanup
  ipmacsec: collapse common code
  macsec: support JSON

 ip/ipmacsec.c | 424 +-
 1 file changed, 239 insertions(+), 185 deletions(-)

-- 
2.16.1



[GIT] net merged into net-next

2018-03-05 Thread David Miller

If I botched up any part of the merge, please send me fix up
patches.

Thank you.


Re: [bpf-next PATCH 05/16] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-05 Thread David Miller
From: John Fastabend 
Date: Mon, 5 Mar 2018 22:22:21 -0800

> All I meant by this is if an application uses sendfile() call
> there is no good way to know when/if the kernel side will copy or
> xmit the data. So a reliable user space application will need to
> only modify the data if it "knows" there are no outstanding sends
> in-flight. So if we assume applications follow this then it
> is OK to avoid the copy. Of course this is not good enough for
> security, but for monitoring/statistics (my use case 1) it works.

For an application implementing a networking file system, it's pretty
legitimate for file contents to change before the page gets DMA's to
the networking card.

And that's perfectly fine, and we arranged everything such that this
will work properly.

The card checksums what ends up being DMA'd so nothing from the
networking side is broken.

So this assumption you mention really does not hold.

There needs to be some feedback from the BPF program that parses the
packet.  This way it can say, "I need at least X more bytes before I
can generate a verdict".  And you keep copying more and more bytes
into a linear buffer and calling the parser over and over until it can
generate a full verdict or you run out of networking data.


Re: [bpf-next PATCH 05/16] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-05 Thread John Fastabend
On 03/05/2018 09:42 PM, David Miller wrote:
> From: John Fastabend 
> Date: Mon, 5 Mar 2018 14:53:08 -0800
> 
>> I decided to make the default no-copy to mirror the existing
>> sendpage() semantics and then to add the flag later. The flag
>> support is not in this series simply because I wanted to get the
>> base support in first.
> 
> What existing sendpage semantics are you referring to?
> 

All I meant by this is if an application uses sendfile() call
there is no good way to know when/if the kernel side will copy or
xmit the data. So a reliable user space application will need to
only modify the data if it "knows" there are no outstanding sends
in-flight. So if we assume applications follow this then it
is OK to avoid the copy. Of course this is not good enough for
security, but for monitoring/statistics (my use case 1) it works.

By keep existing sendpage semantics I just meant applications
should already follow the above.


[PATCH] cxgb3: remove VLA

2018-03-05 Thread Gustavo A. R. Silva
In preparation to enabling -Wvla, remove VLA and replace it
with dynamic memory allocation.

Signed-off-by: Gustavo A. R. Silva 
---
 drivers/net/ethernet/chelsio/cxgb3/t3_hw.c | 25 +
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb3/t3_hw.c b/drivers/net/ethernet/chelsio/cxgb3/t3_hw.c
index a89721f..ad6a280 100644
--- a/drivers/net/ethernet/chelsio/cxgb3/t3_hw.c
+++ b/drivers/net/ethernet/chelsio/cxgb3/t3_hw.c
@@ -683,20 +683,37 @@ int t3_seeprom_wp(struct adapter *adapter, int enable)
 
 static int vpdstrtouint(char *s, int len, unsigned int base, unsigned int *val)
 {
-   char tok[len + 1];
+   char *tok;
+   int ret;
+
+   tok = kcalloc(len + 1, sizeof(*tok), GFP_KERNEL);
+   if (!tok)
+   return -ENOMEM;
 
memcpy(tok, s, len);
tok[len] = 0;
-   return kstrtouint(strim(tok), base, val);
+   ret = kstrtouint(strim(tok), base, val);
+
+   kfree(tok);
+   return ret;
 }
 
 static int vpdstrtou16(char *s, int len, unsigned int base, u16 *val)
 {
-   char tok[len + 1];
+   char *tok;
+   int ret;
+
+   tok = kcalloc(len + 1, sizeof(*tok), GFP_KERNEL);
+   if (!tok)
+   return -ENOMEM;
 
memcpy(tok, s, len);
tok[len] = 0;
-   return kstrtou16(strim(tok), base, val);
+
+   ret = kstrtou16(strim(tok), base, val);
+
+   kfree(tok);
+   return ret;
 }
 
 /**
-- 
2.7.4
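The same VLA-to-heap transformation in a self-contained user-space form (illustrative only: calloc()/strtoul() stand in for the kernel's kcalloc()/kstrtouint(), and strim() is approximated by skipping leading whitespace):

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* User-space sketch of the change above: the VLA 'char tok[len + 1]'
 * becomes a heap allocation, since 'len' is only known at runtime and
 * -Wvla forbids stack arrays of runtime size. */
static int vpdstrtouint(const char *s, int len, unsigned int base,
                        unsigned int *val)
{
    char *tok = calloc(len + 1, 1);
    const char *p;

    if (!tok)
        return -1;              /* the kernel version returns -ENOMEM */

    memcpy(tok, s, len);
    tok[len] = '\0';

    p = tok;
    while (isspace((unsigned char)*p))  /* crude stand-in for strim() */
        p++;
    *val = (unsigned int)strtoul(p, NULL, base);

    free(tok);
    return 0;
}
```

The new error path is the one behavioral change: callers must now be prepared for an allocation failure, which the stack version could not report.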



Re: [bpf-next PATCH 05/16] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-05 Thread David Miller
From: John Fastabend 
Date: Mon, 5 Mar 2018 14:53:08 -0800

> I decided to make the default no-copy to mirror the existing
> sendpage() semantics and then to add the flag later. The flag
> support is not in this series simply because I wanted to get the
> base support in first.

What existing sendpage semantics are you referring to?


Re: [PATCH net-next] selftests: net: Introduce first PMTU test

2018-03-05 Thread David Ahern
On 3/5/18 3:45 PM, Stefano Brivio wrote:
> diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh
> new file mode 100755
> index ..eb186ca3e5e4
> --- /dev/null
> +++ b/tools/testing/selftests/net/pmtu.sh
> @@ -0,0 +1,159 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# Check that route PMTU values match expectations
> +#
> +# Tests currently implemented:
> +#
> +# - test_pmtu_vti6_exception
> +#Set up vti6 tunnel on top of veth, with xfrm states and policies, in two
> +#namespaces with matching endpoints. Check that route exception is
> +#created by exceeding link layer MTU with ping to other endpoint. Then
> +#decrease and increase MTU of tunnel, checking that route exception PMTU
> +#changes accordingly
> +
> +NS_A="ns-$(mktemp -u XX)"
> +NS_B="ns-$(mktemp -u XX)"
> +ns_a="ip netns exec ${NS_A}"
> +ns_b="ip netns exec ${NS_B}"
> +
> +veth6_a_addr="fd00:1::a"
> +veth6_b_addr="fd00:1::b"
> +veth6_mask="64"
> +
> +vti6_a_addr="fd00:2::a"
> +vti6_b_addr="fd00:2::b"
> +vti6_mask="64"
> +
> +setup_namespaces() {
> + ip netns add ${NS_A} || return 1
> + ip netns add ${NS_B}

For basic config commands that should always work, enclosing them in
set -e
...
set +e

simplifies error handling. IMO, it is relevant for the netns and veth
config commands. Not so much for the xfrm commands, which need to load
modules or depend on features that need config options.

> +
> + return 0
> +}
> +
> +setup_veth() {
> + ${ns_a} ip link add veth_a type veth peer name veth_b || return 1
> + ${ns_a} ip link set veth_b netns ${NS_B}
> + 
> + ${ns_a} ip link set veth_a up
> + ${ns_b} ip link set veth_b up
> +
> + ${ns_a} ip addr add ${veth6_a_addr}/${veth6_mask} dev veth_a
> + ${ns_b} ip addr add ${veth6_b_addr}/${veth6_mask} dev veth_b
> +
> + return 0
> +}
> +
> +setup_vti6() {
> + ${ns_a} ip link add vti_a type vti6 local ${veth6_a_addr} remote ${veth6_b_addr} key 10 || return 1
> + ${ns_b} ip link add vti_b type vti6 local ${veth6_b_addr} remote ${veth6_a_addr} key 10
> +
> + ${ns_a} ip link set vti_a up
> + ${ns_b} ip link set vti_b up
> +
> + ${ns_a} ip addr add ${vti6_a_addr}/${vti6_mask} dev vti_a
> + ${ns_b} ip addr add ${vti6_b_addr}/${vti6_mask} dev vti_b
> +
> + return 0
> +}
> +
> +setup_xfrm() {
> + ${ns_a} ip -6 xfrm state add src ${veth6_a_addr} dst ${veth6_b_addr} spi 0x1000 proto esp aead "rfc4106(gcm(aes))" 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f 128 mode tunnel || return 1
> + ${ns_a} ip -6 xfrm state add src ${veth6_b_addr} dst ${veth6_a_addr} spi 0x1001 proto esp aead "rfc4106(gcm(aes))" 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f 128 mode tunnel
> + ${ns_a} ip -6 xfrm policy add dir out mark 10 tmpl src ${veth6_a_addr} dst ${veth6_b_addr} proto esp mode tunnel
> + ${ns_a} ip -6 xfrm policy add dir in mark 10 tmpl src ${veth6_b_addr} dst ${veth6_a_addr} proto esp mode tunnel
> +
> + ${ns_b} ip -6 xfrm state add src ${veth6_a_addr} dst ${veth6_b_addr} spi 0x1000 proto esp aead "rfc4106(gcm(aes))" 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f 128 mode tunnel
> + ${ns_b} ip -6 xfrm state add src ${veth6_b_addr} dst ${veth6_a_addr} spi 0x1001 proto esp aead "rfc4106(gcm(aes))" 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f 128 mode tunnel
> + ${ns_b} ip -6 xfrm policy add dir out mark 10 tmpl src ${veth6_b_addr} dst ${veth6_a_addr} proto esp mode tunnel
> + ${ns_b} ip -6 xfrm policy add dir in mark 10 tmpl src ${veth6_a_addr} dst ${veth6_b_addr} proto esp mode tunnel
> +
> + return 0
> +}
> +
> +setup() {
> + tunnel_type="$1"
> +
> + [ "$(id -u)" -ne 0 ] && (echo "SKIP: need to run as root" && exit 0)
> +
> + setup_namespaces || (echo "SKIP: namespaces not supported" && exit 0)
> + setup_veth || (echo "SKIP: veth not supported" && exit 0)

You use this style (' || (...)' or ' && (...)') a lot
and it does not actually exit the script. You can verify by adding
'return 1' to either function.
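
The subshell pitfall being pointed out here can be demonstrated in isolation (function names below are illustrative, not from the patch):

```shell
#!/bin/sh
# '(...)' runs its commands in a subshell, so 'exit 0' only leaves the
# subshell and the caller keeps going after the guard.
skip_subshell() {
	false || (echo "SKIP" && exit 0)
	echo "still running"		# this line IS reached
}

# '{ ...; }' runs in the current shell, so the guard really stops the
# caller (here via 'return'; at top level 'exit' would work the same way).
skip_group() {
	false || { echo "SKIP"; return 0; }
	echo "still running"		# never reached
}
```

A brace group (or an explicit `if ! cmd; then ...; fi`) would make the SKIP paths in setup() actually short-circuit the script.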

> +
> + case ${tunnel_type} in
> + "vti6")
> + setup_vti6 && (echo "SKIP: vti6 not supported" && exit 0)
> + setup_xfrm && (echo "SKIP: xfrm not supported" && exit 0)
> + ;;
> + *)
> + ;;
> + esac
> +}
> +
> +cleanup() {
> + ip netns del ${NS_A} 2> /dev/null
> + ip netns del ${NS_B} 2> /dev/null
> +}
> +
> +mtu() {
> + ns_cmd="${1}"
> + dev="${2}"
> + mtu="${3}"
> +
> + ${ns_cmd} ip link set dev ${dev} mtu ${mtu}
> +}
> +
> +route_get_dst_exception() {
> + dst="${1}"
> +
> + ${ns_a} ip -6 route get "${dst}" | tail -n1 | tr -s ' '
> +}
> +
> +route_get_dst_pmtu_from_exception() {
> + dst="${1}"
> +
> + exception="$(route_get_dst_exception ${dst})"
> + next=0
> + for i in ${exception}; do
> + [ ${next} -eq 1 ] && echo "${i}" && return
> + [ "${i}" = "mtu" ] && 

[RFC PATCH linux-next] net: mvpp2: mvpp2_check_hw_buf_num() can be static

2018-03-05 Thread kbuild test robot

Fixes: effbf5f58d64 ("net: mvpp2: update the BM buffer free/destroy logic")
Signed-off-by: Fengguang Wu 
---
 mvpp2.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2.c b/drivers/net/ethernet/marvell/mvpp2.c
index c7b8093..c360430 100644
--- a/drivers/net/ethernet/marvell/mvpp2.c
+++ b/drivers/net/ethernet/marvell/mvpp2.c
@@ -4285,7 +4285,7 @@ static void mvpp2_bm_bufs_free(struct device *dev, struct mvpp2 *priv,
 }
 
 /* Check number of buffers in BM pool */
-int mvpp2_check_hw_buf_num(struct mvpp2 *priv, struct mvpp2_bm_pool *bm_pool)
+static int mvpp2_check_hw_buf_num(struct mvpp2 *priv, struct mvpp2_bm_pool *bm_pool)
 {
int buf_num = 0;
 


[linux-next:master 5332/5518] drivers/net/ethernet/marvell/mvpp2.c:4288:5: sparse: symbol 'mvpp2_check_hw_buf_num' was not declared. Should it be static?

2018-03-05 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
head:   9c142d8a6556f069be6278ccab701039da81ad6f
commit: effbf5f58d64b1d1f93cb687d9797b42f291d5fd [5332/5518] net: mvpp2: update the BM buffer free/destroy logic
reproduce:
# apt-get install sparse
git checkout effbf5f58d64b1d1f93cb687d9797b42f291d5fd
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> drivers/net/ethernet/marvell/mvpp2.c:4288:5: sparse: symbol 'mvpp2_check_hw_buf_num' was not declared. Should it be static?
   drivers/net/ethernet/marvell/mvpp2.c:6620:36: sparse: incorrect type in argument 2 (different base types) @@ expected int [signed] l3_proto @@ got restricted __be16 [usertype] protocol @@
   drivers/net/ethernet/marvell/mvpp2.c:6620:36:    expected int [signed] l3_proto
   drivers/net/ethernet/marvell/mvpp2.c:6620:36:    got restricted __be16 [usertype] protocol

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


Re: [PATCH 2/3] vfio: Add support for unmanaged or userspace managed SR-IOV

2018-03-05 Thread kbuild test robot
Hi Alexander,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on pci/next]
[also build test ERROR on v4.16-rc4 next-20180305]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:
https://github.com/0day-ci/linux/commits/Alexander-Duyck/pci-iov-Add-support-for-unmanaged-SR-IOV/20180306-063954
base:   https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git next
config: s390-default_defconfig (attached as .config)
compiler: s390x-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=s390 

All errors (new ones prefixed by >>):

   drivers/vfio/pci/vfio_pci.c: In function 'vfio_pci_sriov_configure':
>> drivers/vfio/pci/vfio_pci.c:1291:8: error: implicit declaration of function 'pci_sriov_configure_unmanaged'; did you mean 'pci_write_config_dword'? [-Werror=implicit-function-declaration]
 err = pci_sriov_configure_unmanaged(pdev, nr_virtfn);
   ^
   pci_write_config_dword
   At top level:
   drivers/vfio/pci/vfio_pci.c:1265:12: warning: 'vfio_pci_sriov_configure' defined but not used [-Wunused-function]
static int vfio_pci_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
   ^~~~
   cc1: some warnings being treated as errors

vim +1291 drivers/vfio/pci/vfio_pci.c

  1264  
  1265  static int vfio_pci_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
  1266  {
  1267  struct vfio_pci_device *vdev;
  1268  struct vfio_device *device;
  1269  int err;
  1270  
  1271  device = vfio_device_get_from_dev(&pdev->dev);
  1272  if (device == NULL)
  1273  return -ENODEV;
  1274  
  1275  vdev = vfio_device_data(device);
  1276  if (vdev == NULL) {
  1277  vfio_device_put(device);
  1278  return -ENODEV;
  1279  }
  1280  
  1281  /*
  1282   * If a userspace process is already using this device just return
  1283   * busy and don't allow for any changes.
  1284   */
  1285  if (vdev->refcnt) {
  1286  pci_warn(pdev,
  1287   "PF is currently in use, blocked until released by user\n");
  1288  return -EBUSY;
  1289  }
  1290  
> 1291  err = pci_sriov_configure_unmanaged(pdev, nr_virtfn);
  1292  if (err <= 0)
  1293  return err;
  1294  
  1295  /*
  1296   * We are now leaving VFs in the control of some unknown PF entity.
  1297   *
  1298   * Best case is a well behaved userspace PF is expected and any VMs
  1299   * that the VFs will be assigned to are dependent on the userspace
  1300   * entity anyway. An example being NFV where maybe the PF is acting
  1301   * as an accelerated interface for a firewall or switch.
  1302   *
  1303   * Worst case is somebody really messed up and just enabled SR-IOV
  1304   * on a device they were planning to assign to a VM somewhere.
  1305   *
  1306   * In either case it is probably best for us to set the taint flag
  1307   * and warn the user since this could get really ugly really quick
  1308   * if this wasn't what they were planning to do.
  1309   */
  1310  add_taint(TAINT_USER, LOCKDEP_STILL_OK);
  1311  pci_warn(pdev,
  1312   "Adding kernel taint for vfio-pci now managing SR-IOV PF device\n");
  1313  
  1314  return nr_virtfn;
  1315  }
  1316  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip




Re: [PATCH] tipc: bcast: use true and false for boolean values

2018-03-05 Thread Ying Xue
On 03/06/2018 05:56 AM, Gustavo A. R. Silva wrote:
> Assign true or false to boolean variables instead of an integer value.
> 
> This issue was detected with the help of Coccinelle.
> 
> Signed-off-by: Gustavo A. R. Silva 

Acked-by: Ying Xue 

> ---
>  net/tipc/bcast.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
> index 37892b3..f371117 100644
> --- a/net/tipc/bcast.c
> +++ b/net/tipc/bcast.c
> @@ -574,5 +574,5 @@ void tipc_nlist_purge(struct tipc_nlist *nl)
>  {
>   tipc_dest_list_purge(&nl->list);
>   nl->remote = 0;
> - nl->local = 0;
> + nl->local = false;
>  }
> 


Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-05 Thread Stephen Hemminger
On Mon, 5 Mar 2018 14:47:20 -0800
Alexander Duyck  wrote:

> On Mon, Mar 5, 2018 at 2:30 PM, Jiri Pirko  wrote:
> > Mon, Mar 05, 2018 at 05:11:32PM CET, step...@networkplumber.org wrote:  
> >>On Mon, 5 Mar 2018 10:21:18 +0100
> >>Jiri Pirko  wrote:
> >>  
> >>> Sun, Mar 04, 2018 at 10:58:34PM CET, alexander.du...@gmail.com wrote:  
> >>> >On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko  wrote:  
> >>> >> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.du...@gmail.com wrote:  
> >>> >>>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko  wrote:  
> >>>
> >>> [...]
> >>>  
> >>> >  
> >>> >>>Currently we only have agreement from Michael on taking this code, as
> >>> >>>such we are working with virtio only for now. When the time comes that 
> >>> >>> 
> >>> >>
> >>> >> If you do duplication of netvsc in-driver bonding in virtio_net, it 
> >>> >> will
> >>> >> stay there forever. So what you say is: "We will do it halfway now
> >>> >> and promise to fix it later". That later will never happen, I'm pretty
> >>> >> sure. That is why I push for in-driver bonding shared code as a part of
> >>> >> this patchset.  
> >>> >
> >>> >You want this new approach and a copy of netvsc moved into either core
> >>> >or some module of its own. I say pick an architecture. We are looking
> >>> >at either 2 netdevs or 3. We are not going to support both because
> >>> >that will ultimately lead to a terrible user experience and make
> >>> >things quite confusing.
> >>> >  
> >>> >> + if you would be pushing first driver to do this, I would understand.
> >>> >> But the first driver is already in. You are pushing second. This is the
> >>> >> time to do the sharing, unification of behaviour. Next time is too 
> >>> >> late.  
> >>> >
> >>> >That is great, if we want to share then lets share. But what you are
> >>> >essentially telling us is that we need to fork this solution and
> >>> >maintain two code paths, one for 2 netdevs, and another for 3. At that
> >>> >point what is the point in merging them together?  
> >>>
> >>> Of course, I vote for the same behaviour for netvsc and virtio_net. That
> >>> is my point from the very beginning.
> >>>
> >>> Stephen, what do you think? Could we please make virtio_net and netvsc
> >>> behave the same and to use a single code with well-defined checks and
> >>> restrictions for this feature?  
> >>
> >>Eventually, yes both could share common code routines. In reality,
> >>the failover stuff is only a very small part of either driver so
> >>it is not worth stretching to try and cover too much. If you look,
> >>the failover code is just using routines that already exist for
> >>use by bonding, teaming, etc.  
> >
> > Yeah, our concern was also about the code that processes the netdev
> > notifications and does auto-enslave and all related stuff.  
> 
> The concern was the driver model. If we expose 3 netdevs or 2 with the
> VF driver present. Somehow this is turning into a "merge netvsc into
> virtio" think and that isn't the subject that was being asked.
> 
> Ideally we want one model for this. Either 3 netdevs or 2. The problem
> is 2 causes issues in terms of performance and will limit features of
> virtio, but 2 is the precedent set by netvsc. We need to figure out
> the path forward for this. There is talk about "sharing" but it is
> hard to make these two approaches share code when they are doing two
> very different setups and end up presenting themselves as two very
> different driver models.

I appreciate this discussion, and it has helped a lot.

Netvsc is stuck with 2 netdev model for the foreseeable future.
We already failed once with the bonding model, and that created a lot of
pain. The current model is working well, and we have convinced the major distros
to support the two netdev model and don't want to go back.

Very open to optimizations and ways to smooth out the rough edges.



Re: [PATCH] net: qcom/emac: Use proper free methods during TX

2018-03-05 Thread Timur Tabi

On 3/5/18 8:48 PM, Hemanth Puranik wrote:

This patch fixes the warning messages/call traces seen if DMA debug is
enabled, In case of fragmented skb's memory was allocated using
dma_map_page but freed using dma_unmap_single. This patch modifies buffer
allocations in TX path to use dma_map_page in all the places and
dma_unmap_page while freeing the buffers.

Signed-off-by: Hemanth Puranik


Acked-by: Timur Tabi 

--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.


[PATCH iproute2-next 2/7] ip: add json support to addrlabel

2018-03-05 Thread Stephen Hemminger
From: Stephen Hemminger 

Add missing json and color support to addrlabel display

Signed-off-by: Stephen Hemminger 
---
 ip/ipaddrlabel.c | 40 +++-
 1 file changed, 27 insertions(+), 13 deletions(-)

diff --git a/ip/ipaddrlabel.c b/ip/ipaddrlabel.c
index 7200bf542929..2f79c56dcead 100644
--- a/ip/ipaddrlabel.c
+++ b/ip/ipaddrlabel.c
@@ -38,6 +38,7 @@
 #include "rt_names.h"
 #include "utils.h"
 #include "ip_common.h"
+#include "json_print.h"
 
 #define IFAL_RTA(r)	((struct rtattr *)(((char *)(r)) + NLMSG_ALIGN(sizeof(struct ifaddrlblmsg))))
 #define IFAL_PAYLOAD(n)	NLMSG_PAYLOAD(n, sizeof(struct ifaddrlblmsg))
@@ -55,7 +56,6 @@ static void usage(void)
 
int print_addrlabel(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
 {
-   FILE *fp = (FILE *)arg;
struct ifaddrlblmsg *ifal = NLMSG_DATA(n);
int len = n->nlmsg_len;
struct rtattr *tb[IFAL_MAX+1];
@@ -69,28 +69,40 @@ int print_addrlabel(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg
 
parse_rtattr(tb, IFAL_MAX, IFAL_RTA(ifal), len);
 
+   open_json_object(NULL);
if (n->nlmsg_type == RTM_DELADDRLABEL)
-   fprintf(fp, "Deleted ");
+   print_bool(PRINT_ANY, "deleted", "Deleted ", true);
 
if (tb[IFAL_ADDRESS]) {
-   fprintf(fp, "prefix %s/%u ",
-   format_host_rta(ifal->ifal_family,
-   tb[IFAL_ADDRESS]),
-   ifal->ifal_prefixlen);
+   const char *host
+   = format_host_rta(ifal->ifal_family,
+ tb[IFAL_ADDRESS]);
+
+   print_string(PRINT_FP, NULL, "prefix ", NULL);
+   print_color_string(PRINT_ANY,
+  ifa_family_color(ifal->ifal_family),
+  "address", "%s", host);
+
+   print_uint(PRINT_ANY, "prefixlen", "/%u ",
+  ifal->ifal_prefixlen);
}
 
-   if (ifal->ifal_index)
-   fprintf(fp, "dev %s ", ll_index_to_name(ifal->ifal_index));
+   if (ifal->ifal_index) {
+   print_string(PRINT_FP, NULL, "dev ", NULL);
+   print_color_string(PRINT_ANY, COLOR_IFNAME,
+  "ifname", "%s ",
+  ll_index_to_name(ifal->ifal_index));
+   }
 
if (tb[IFAL_LABEL] && RTA_PAYLOAD(tb[IFAL_LABEL]) == sizeof(uint32_t)) {
-   uint32_t label;
+   uint32_t label = rta_getattr_u32(RTA_DATA(tb[IFAL_LABEL]));
 
-   memcpy(&label, RTA_DATA(tb[IFAL_LABEL]), sizeof(label));
-   fprintf(fp, "label %u ", label);
+   print_uint(PRINT_ANY,
+  "label", "label %u ", label);
}
+   print_string(PRINT_FP, NULL, "\n", "");
+   close_json_object();
 
-   fprintf(fp, "\n");
-   fflush(fp);
return 0;
 }
 
@@ -111,10 +123,12 @@ static int ipaddrlabel_list(int argc, char **argv)
return 1;
}
 
+   new_json_obj(json);
if (rtnl_dump_filter(&rth, print_addrlabel, stdout) < 0) {
fprintf(stderr, "Dump terminated\n");
return 1;
}
+   delete_json_obj();
 
return 0;
 }
-- 
2.16.1



[PATCH iproute2-next 6/7] tcp_metrics: make tables const

2018-03-05 Thread Stephen Hemminger
From: Stephen Hemminger 

Signed-off-by: Stephen Hemminger 
---
 ip/tcp_metrics.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/ip/tcp_metrics.c b/ip/tcp_metrics.c
index 7e2d9eb34b79..5f394765c62b 100644
--- a/ip/tcp_metrics.c
+++ b/ip/tcp_metrics.c
@@ -47,8 +47,8 @@ static int genl_family = -1;
 #define CMD_DEL	0x0002	/* delete, remove */
 #define CMD_FLUSH	0x0004	/* flush */
 
-static struct {
-   char*name;
+static const struct {
+   const char *name;
int code;
 } cmds[] = {
{   "list", CMD_LIST},
@@ -59,7 +59,7 @@ static struct {
{   "flush",CMD_FLUSH   },
 };
 
-static char *metric_name[TCP_METRIC_MAX + 1] = {
+static const char *metric_name[TCP_METRIC_MAX + 1] = {
[TCP_METRIC_RTT]= "rtt",
[TCP_METRIC_RTTVAR] = "rttvar",
[TCP_METRIC_SSTHRESH]   = "ssthresh",
@@ -67,8 +67,7 @@ static char *metric_name[TCP_METRIC_MAX + 1] = {
[TCP_METRIC_REORDERING] = "reordering",
 };
 
-static struct
-{
+static struct {
int flushed;
char *flushb;
int flushp;
-- 
2.16.1



[PATCH iproute2-next 7/7] ip: jsonify tcp_metrics

2018-03-05 Thread Stephen Hemminger
From: Stephen Hemminger 

Add JSON support to the ip tcp_metrics output.

$ ip -j -p tcp_metrics show
[ {
"dst": "192.18.1.11",
"age": 23617.8,
"ssthresh": 7,
"cwnd": 3,
"rtt": 0.039176,
"rttvar": 0.039176,
"source": "192.18.1.2"
}
...

The JSON output does scale values differently since there is no good
way to indicate units. The rtt values are displayed in seconds in
JSON and microseconds in the original (non JSON) mode. In the example
above, without the -j flag, the output would be
 ... rtt 39176us rttvar 39176us

I did this since all the other values in the JSON record are also in
floating point seconds.

Signed-off-by: Stephen Hemminger 
---
 ip/tcp_metrics.c | 179 ++-
 1 file changed, 110 insertions(+), 69 deletions(-)

diff --git a/ip/tcp_metrics.c b/ip/tcp_metrics.c
index 5f394765c62b..72dc980c92a6 100644
--- a/ip/tcp_metrics.c
+++ b/ip/tcp_metrics.c
@@ -38,6 +38,7 @@ static void usage(void)
 /* netlink socket */
 static struct rtnl_handle grth = { .fd = -1 };
 static int genl_family = -1;
+static const double usec_per_sec = 1000000.;
 
 #define TCPM_REQUEST(_req, _bufsiz, _cmd, _flags) \
GENL_REQUEST(_req, _bufsiz, genl_family, 0, \
@@ -87,15 +88,84 @@ static int flush_update(void)
return 0;
 }
 
+static void print_tcp_metrics(struct rtattr *a)
+{
+   struct rtattr *m[TCP_METRIC_MAX + 1 + 1];
+   unsigned long rtt = 0, rttvar = 0;
+   int i;
+
+   parse_rtattr_nested(m, TCP_METRIC_MAX + 1, a);
+
+   for (i = 0; i < TCP_METRIC_MAX + 1; i++) {
+   const char *name;
+   __u32 val;
+   SPRINT_BUF(b1);
+
+   a = m[i + 1];
+   if (!a)
+   continue;
+
+   val = rta_getattr_u32(a);
+
+   switch (i) {
+   case TCP_METRIC_RTT:
+   if (!rtt)
+   rtt = (val * 1000UL) >> 3;
+   continue;
+   case TCP_METRIC_RTTVAR:
+   if (!rttvar)
+   rttvar = (val * 1000UL) >> 2;
+   continue;
+   case TCP_METRIC_RTT_US:
+   rtt = val >> 3;
+   continue;
+
+   case TCP_METRIC_RTTVAR_US:
+   rttvar = val >> 2;
+   continue;
+
+   case TCP_METRIC_SSTHRESH:
+   case TCP_METRIC_CWND:
+   case TCP_METRIC_REORDERING:
+   name = metric_name[i];
+   break;
+
+   default:
+   snprintf(b1, sizeof(b1),
+" metric_%d ", i);
+   name = b1;
+   }
+
+
+   print_uint(PRINT_JSON, name, NULL, val);
+   print_string(PRINT_FP, NULL, " %s ", name);
+   print_uint(PRINT_FP, NULL, "%lu", val);
+   }
+
+   if (rtt) {
+   print_float(PRINT_JSON, "rtt", NULL,
+   (double)rtt / usec_per_sec);
+   print_uint(PRINT_FP, NULL,
+  " rtt %luus", rtt);
+   }
+   if (rttvar) {
+   print_float(PRINT_JSON, "rttvar", NULL,
+   (double) rttvar / usec_per_sec);
+   print_uint(PRINT_FP, NULL,
+  " rttvar %luus", rttvar);
+   }
+}
+
 static int process_msg(const struct sockaddr_nl *who, struct nlmsghdr *n,
   void *arg)
 {
FILE *fp = (FILE *) arg;
struct genlmsghdr *ghdr;
struct rtattr *attrs[TCP_METRICS_ATTR_MAX + 1], *a;
+   const char *h;
int len = n->nlmsg_len;
inet_prefix daddr, saddr;
-   int i, atype, stype;
+   int atype, stype;
 
if (n->nlmsg_type != genl_family)
return -1;
@@ -185,96 +255,60 @@ static int process_msg(const struct sockaddr_nl *who, struct nlmsghdr *n,
return 0;
}
 
+   open_json_object(NULL);
if (f.cmd & (CMD_DEL | CMD_FLUSH))
-   fprintf(fp, "Deleted ");
+   print_bool(PRINT_ANY, "deleted", "Deleted ", true);
 
-   fprintf(fp, "%s",
-   format_host(daddr.family, daddr.bytelen, daddr.data));
+   h = format_host(daddr.family, daddr.bytelen, daddr.data);
+   print_color_string(PRINT_ANY,
+  ifa_family_color(daddr.family),
+  "dst", "%s", h);
 
a = attrs[TCP_METRICS_ATTR_AGE];
if (a) {
-   unsigned long long val = rta_getattr_u64(a);
+   __u64 val = rta_getattr_u64(a);
+   double age = val / 1000.;
 
-   fprintf(fp, " age %llu.%03llusec",
-   val / 1000, val % 1000);
+   

[PATCH iproute2-next 0/7] ip: more JSON

2018-03-05 Thread Stephen Hemminger
From: Stephen Hemminger 

The ip command implementation of JSON was very spotty. Only address
and link were originally implemented. After doing route for next,
went ahead and implemented it for a bunch of the other sub commands.

Hopefully will reach full coverage soon.

Stephen Hemminger (7):
  ip: add color and json support to neigh
  ip: add json support to addrlabel
  ip: add json support to ip rule
  ip: add json support to ntable
  ip: add JSON support to netconf
  tcp_metrics: make tables const
  ip: jsonify tcp_metrics

 ip/ipaddrlabel.c |  40 --
 ip/ipneigh.c | 143 +--
 ip/ipnetconf.c   |  69 +
 ip/ipntable.c| 415 ++-
 ip/iprule.c  | 203 +--
 ip/tcp_metrics.c | 188 +++--
 6 files changed, 633 insertions(+), 425 deletions(-)

-- 
2.16.1



[PATCH iproute2-next 1/7] ip: add color and json support to neigh

2018-03-05 Thread Stephen Hemminger
From: Stephen Hemminger 

Use json_print to provide json (and color) support to
ip neigh command.

Signed-off-by: Stephen Hemminger 
---
 ip/ipneigh.c | 143 ---
 1 file changed, 97 insertions(+), 46 deletions(-)

diff --git a/ip/ipneigh.c b/ip/ipneigh.c
index 0735424900f6..1f550e98e003 100644
--- a/ip/ipneigh.c
+++ b/ip/ipneigh.c
@@ -23,6 +23,7 @@
 #include "rt_names.h"
 #include "utils.h"
 #include "ip_common.h"
+#include "json_print.h"
 
 #define NUD_VALID	(NUD_PERMANENT|NUD_NOARP|NUD_REACHABLE|NUD_PROBE|NUD_STALE|NUD_DELAY)
 #define MAX_ROUNDS 10
@@ -189,6 +190,48 @@ static int ipneigh_modify(int cmd, int flags, int argc, char **argv)
return 0;
 }
 
+static void print_cacheinfo(const struct nda_cacheinfo *ci)
+{
+   static int hz;
+
+   if (!hz)
+   hz = get_user_hz();
+
+   if (ci->ndm_refcnt)
+   print_uint(PRINT_ANY, "refcnt",
+   " ref %u", ci->ndm_refcnt);
+
+   print_uint(PRINT_ANY,
+"used", " used %u", ci->ndm_used / hz);
+   print_uint(PRINT_ANY,
+"confirmed", "/%u", ci->ndm_confirmed / hz);
+   print_uint(PRINT_ANY,
+"updated", "/%u", ci->ndm_updated / hz);
+}
+
+static void print_neigh_state(unsigned int nud)
+{
+
+   open_json_array(PRINT_JSON,
+   is_json_context() ?  "state" : "");
+
+#define PRINT_FLAG(f)  \
+   if (nud & NUD_##f) {\
+   nud &= ~NUD_##f;\
+   print_string(PRINT_ANY, NULL, " %s", #f);   \
+   }
+
+   PRINT_FLAG(INCOMPLETE);
+   PRINT_FLAG(REACHABLE);
+   PRINT_FLAG(STALE);
+   PRINT_FLAG(DELAY);
+   PRINT_FLAG(PROBE);
+   PRINT_FLAG(FAILED);
+   PRINT_FLAG(NOARP);
+   PRINT_FLAG(PERMANENT);
+#undef PRINT_FLAG
+   close_json_array(PRINT_JSON, NULL);
+}
 
 int print_neigh(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
 {
@@ -262,65 +305,71 @@ int print_neigh(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
return 0;
}
 
+   open_json_object(NULL);
if (n->nlmsg_type == RTM_DELNEIGH)
-   fprintf(fp, "Deleted ");
+   print_bool(PRINT_ANY, "deleted", "Deleted ", true);
else if (n->nlmsg_type == RTM_GETNEIGH)
-   fprintf(fp, "miss ");
+   print_null(PRINT_ANY, "miss", "%s ", "miss");
+
if (tb[NDA_DST]) {
-   fprintf(fp, "%s ",
-   format_host_rta(r->ndm_family, tb[NDA_DST]));
+   const char *dst;
+
+   dst = format_host_rta(r->ndm_family, tb[NDA_DST]);
+   print_color_string(PRINT_ANY,
+  ifa_family_color(r->ndm_family),
+  "dst", "%s ", dst);
}
-   if (!filter.index && r->ndm_ifindex)
-   fprintf(fp, "dev %s ", ll_index_to_name(r->ndm_ifindex));
+
+   if (!filter.index && r->ndm_ifindex) {
+   if (!is_json_context())
+   fprintf(fp, "dev ");
+
+   print_color_string(PRINT_ANY, COLOR_IFNAME,
+  "dev", "%s ",
+  ll_index_to_name(r->ndm_ifindex));
+   }
+
if (tb[NDA_LLADDR]) {
+   const char *lladdr;
SPRINT_BUF(b1);
-   fprintf(fp, "lladdr %s", ll_addr_n2a(RTA_DATA(tb[NDA_LLADDR]),
- RTA_PAYLOAD(tb[NDA_LLADDR]),
- ll_index_to_type(r->ndm_ifindex),
- b1, sizeof(b1)));
-   }
-   if (r->ndm_flags & NTF_ROUTER) {
-   fprintf(fp, " router");
-   }
-   if (r->ndm_flags & NTF_PROXY) {
-   fprintf(fp, " proxy");
-   }
-   if (tb[NDA_CACHEINFO] && show_stats) {
-   struct nda_cacheinfo *ci = RTA_DATA(tb[NDA_CACHEINFO]);
-   int hz = get_user_hz();
 
-   if (ci->ndm_refcnt)
-   printf(" ref %d", ci->ndm_refcnt);
-   fprintf(fp, " used %d/%d/%d", ci->ndm_used/hz,
-  ci->ndm_confirmed/hz, ci->ndm_updated/hz);
-   }
+   lladdr = ll_addr_n2a(RTA_DATA(tb[NDA_LLADDR]),
+RTA_PAYLOAD(tb[NDA_LLADDR]),
+ll_index_to_type(r->ndm_ifindex),
+b1, sizeof(b1));
 
-   if (tb[NDA_PROBES] && show_stats) {
-   __u32 p = rta_getattr_u32(tb[NDA_PROBES]);
+   if (!is_json_context())
+   fprintf(fp, "lladdr ");
 
-   fprintf(fp, " probes %u", p);
+ 

[PATCH iproute2-next 3/7] ip: add json support to ip rule

2018-03-05 Thread Stephen Hemminger
From: Stephen Hemminger 

More JSON and colorizing.

Signed-off-by: Stephen Hemminger 
---
 ip/iprule.c | 203 ++--
 1 file changed, 130 insertions(+), 73 deletions(-)

diff --git a/ip/iprule.c b/ip/iprule.c
index 6fdc9b5efa00..ab1e0c15f877 100644
--- a/ip/iprule.c
+++ b/ip/iprule.c
@@ -26,6 +26,7 @@
 #include "rt_names.h"
 #include "utils.h"
 #include "ip_common.h"
+#include "json_print.h"
 
 enum list_action {
IPRULE_LIST,
@@ -179,13 +180,12 @@ static bool filter_nlmsg(struct nlmsghdr *n, struct rtattr **tb, int host_len)
 
 int print_rule(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
 {
-   FILE *fp = (FILE *)arg;
+   FILE *fp = arg;
struct fib_rule_hdr *frh = NLMSG_DATA(n);
int len = n->nlmsg_len;
int host_len = -1;
-   __u32 table;
+   __u32 table, prio = 0;
struct rtattr *tb[FRA_MAX+1];
-
SPRINT_BUF(b1);
 
if (n->nlmsg_type != RTM_NEWRULE && n->nlmsg_type != RTM_DELRULE)
@@ -202,50 +202,66 @@ int print_rule(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
if (!filter_nlmsg(n, tb, host_len))
return 0;
 
+   open_json_object(NULL);
if (n->nlmsg_type == RTM_DELRULE)
-   fprintf(fp, "Deleted ");
+   print_bool(PRINT_ANY, "deleted", "Deleted ", true);
 
if (tb[FRA_PRIORITY])
-   fprintf(fp, "%u:\t",
-   rta_getattr_u32(tb[FRA_PRIORITY]));
-   else
-   fprintf(fp, "0:\t");
+   prio = rta_getattr_u32(tb[FRA_PRIORITY]);
+
+   print_uint(PRINT_ANY, "priority",
+  "%u:\t", prio);
 
if (frh->flags & FIB_RULE_INVERT)
-   fprintf(fp, "not ");
+   print_null(PRINT_ANY, "not", "not ", NULL);
+
+   if (!is_json_context())
+   fprintf(fp, "from ");
 
if (tb[FRA_SRC]) {
-   if (frh->src_len != host_len) {
-   fprintf(fp, "from %s/%u ",
-   rt_addr_n2a_rta(frh->family, tb[FRA_SRC]),
-   frh->src_len);
-   } else {
-   fprintf(fp, "from %s ",
-   format_host_rta(frh->family, tb[FRA_SRC]));
-   }
+   const char *src
+   = rt_addr_n2a_rta(frh->family, tb[FRA_SRC]);
+
+   print_color_string(PRINT_ANY,
+  ifa_family_color(frh->family),
+  "src", "%s", src);
+   if (frh->src_len != host_len)
+   print_uint(PRINT_ANY, "srclen",
+  "/%u", frh->src_len);
} else if (frh->src_len) {
-   fprintf(fp, "from 0/%d ", frh->src_len);
+   print_string(PRINT_ANY,
+"src", "%s", "0");
+   print_uint(PRINT_ANY,
+  "srclen", "/%u", frh->src_len);
} else {
-   fprintf(fp, "from all ");
+   print_string(PRINT_ANY,
+"src", "%s", "all");
}
 
+   if (!is_json_context())
+   fprintf(fp, " to ");
+
if (tb[FRA_DST]) {
-   if (frh->dst_len != host_len) {
-   fprintf(fp, "to %s/%u ",
-   rt_addr_n2a_rta(frh->family, tb[FRA_DST]),
-   frh->dst_len);
-   } else {
-   fprintf(fp, "to %s ",
-   format_host_rta(frh->family, tb[FRA_DST]));
-   }
+   const char *dst
+   = rt_addr_n2a_rta(frh->family, tb[FRA_DST]);
+
+   print_color_string(PRINT_ANY,
+  ifa_family_color(frh->family),
+  "dst", "%s", dst);
+   if (frh->dst_len != host_len)
+   print_uint(PRINT_ANY, "dstlen",
+  "/%u ", frh->dst_len);
} else if (frh->dst_len) {
-   fprintf(fp, "to 0/%d ", frh->dst_len);
+   print_string(PRINT_ANY,
+"dst", "%s", "0");
+   print_uint(PRINT_ANY,
+  "dstlen", "/%u ", frh->dst_len);
}
 
if (frh->tos) {
-   SPRINT_BUF(b1);
-   fprintf(fp, "tos %s ",
-   rtnl_dsfield_n2a(frh->tos, b1, sizeof(b1)));
+   print_string(PRINT_ANY, "tos",
+"tos %s ",
+rtnl_dsfield_n2a(frh->tos, b1, sizeof(b1)));
}
 
if (tb[FRA_FWMARK] || tb[FRA_FWMASK]) {
@@ -255,53 +271,82 @@ int print_rule(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
mark = 

[PATCH iproute2-next 5/7] ip: add JSON support to netconf

2018-03-05 Thread Stephen Hemminger
From: Stephen Hemminger 

Basic JSON support for ip netconf command.
Also cleanup some checkpatch warnings about long lines.

Signed-off-by: Stephen Hemminger 
---
 ip/ipnetconf.c | 69 +-
 1 file changed, 39 insertions(+), 30 deletions(-)

diff --git a/ip/ipnetconf.c b/ip/ipnetconf.c
index 76e639a813e4..03f98ace9145 100644
--- a/ip/ipnetconf.c
+++ b/ip/ipnetconf.c
@@ -29,6 +29,10 @@ static struct {
int ifindex;
 } filter;
 
+static const char * const rp_filter_names[] = {
+   "off", "strict", "loose"
+};
+
 static void usage(void) __attribute__((noreturn));
 
 static void usage(void)
@@ -37,9 +41,12 @@ static void usage(void)
exit(-1);
 }
 
-static void print_onoff(FILE *f, const char *flag, __u32 val)
+static void print_onoff(FILE *fp, const char *flag, __u32 val)
 {
-   fprintf(f, "%s %s ", flag, val ? "on" : "off");
+   if (is_json_context())
+   print_bool(PRINT_JSON, flag, NULL, val);
+   else
+   fprintf(fp, "%s %s ", flag, val ? "on" : "off");
 }
 
 static struct rtattr *netconf_rta(struct netconfmsg *ncm)
@@ -83,50 +90,44 @@ int print_netconf(const struct sockaddr_nl *who, struct rtnl_ctrl_data *ctrl,
if (filter.ifindex && filter.ifindex != ifindex)
return 0;
 
-   switch (ncm->ncm_family) {
-   case AF_INET:
-   fprintf(fp, "ipv4 ");
-   break;
-   case AF_INET6:
-   fprintf(fp, "ipv6 ");
-   break;
-   case AF_MPLS:
-   fprintf(fp, "mpls ");
-   break;
-   default:
-   fprintf(fp, "unknown ");
-   break;
-   }
+   open_json_object(NULL);
+   print_string(PRINT_ANY, "family",
+"%s ", family_name(ncm->ncm_family));
 
if (tb[NETCONFA_IFINDEX]) {
+   const char *dev;
+
switch (ifindex) {
case NETCONFA_IFINDEX_ALL:
-   fprintf(fp, "all ");
+   dev = "all";
break;
case NETCONFA_IFINDEX_DEFAULT:
-   fprintf(fp, "default ");
+   dev = "default";
break;
default:
-   fprintf(fp, "dev %s ", ll_index_to_name(ifindex));
+   dev = ll_index_to_name(ifindex);
break;
}
+   print_color_string(PRINT_ANY, COLOR_IFNAME,
+  "interface", "%s ", dev);
}
 
if (tb[NETCONFA_FORWARDING])
print_onoff(fp, "forwarding",
rta_getattr_u32(tb[NETCONFA_FORWARDING]));
+
if (tb[NETCONFA_RP_FILTER]) {
__u32 rp_filter = rta_getattr_u32(tb[NETCONFA_RP_FILTER]);
 
-   if (rp_filter == 0)
-   fprintf(fp, "rp_filter off ");
-   else if (rp_filter == 1)
-   fprintf(fp, "rp_filter strict ");
-   else if (rp_filter == 2)
-   fprintf(fp, "rp_filter loose ");
+   if (rp_filter < ARRAY_SIZE(rp_filter_names))
+   print_string(PRINT_ANY, "rp_filter",
+"rp_filter %s ",
+rp_filter_names[rp_filter]);
else
-   fprintf(fp, "rp_filter unknown mode ");
+   print_uint(PRINT_ANY, "rp_filter",
+  "rp_filter %u ", rp_filter);
}
+
if (tb[NETCONFA_MC_FORWARDING])
print_onoff(fp, "mc_forwarding",
rta_getattr_u32(tb[NETCONFA_MC_FORWARDING]));
@@ -142,7 +143,8 @@ int print_netconf(const struct sockaddr_nl *who, struct rtnl_ctrl_data *ctrl,
if (tb[NETCONFA_INPUT])
print_onoff(fp, "input", rta_getattr_u32(tb[NETCONFA_INPUT]));
 
-   fprintf(fp, "\n");
+   close_json_object();
+   print_string(PRINT_FP, NULL, "\n", NULL);
fflush(fp);
return 0;
 }
@@ -179,7 +181,8 @@ static int do_show(int argc, char **argv)
NEXT_ARG();
filter.ifindex = ll_name_to_index(*argv);
if (filter.ifindex <= 0) {
-   fprintf(stderr, "Device \"%s\" does not exist.\n",
+   fprintf(stderr,
+   "Device \"%s\" does not exist.\n",
*argv);
return -1;
}
@@ -202,10 +205,13 @@ static int do_show(int argc, char **argv)
} else {
rth.flags = RTNL_HANDLE_F_SUPPRESS_NLERR;
 dump:
-   if (rtnl_wilddump_request(&rth, filter.family, RTM_GETNETCONF) < 0) {
+   if 

[PATCH iproute2-next 4/7] ip: add json support to ntable

2018-03-05 Thread Stephen Hemminger
From: Stephen Hemminger 

Add JSON (and limited color) to ip neighbor table parameter output.

Signed-off-by: Stephen Hemminger 
---
 ip/ipntable.c | 415 --
 1 file changed, 226 insertions(+), 189 deletions(-)

diff --git a/ip/ipntable.c b/ip/ipntable.c
index 2f72c989f35d..f6dff28ecbb2 100644
--- a/ip/ipntable.c
+++ b/ip/ipntable.c
@@ -31,6 +31,7 @@
 
 #include "utils.h"
 #include "ip_common.h"
+#include "json_print.h"
 
 static struct
 {
@@ -338,274 +339,308 @@ static const char *ntable_strtime_delta(__u32 msec)
return str;
 }
 
-static int print_ntable(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
+static void print_ndtconfig(const struct ndt_config *ndtc)
 {
-   FILE *fp = (FILE *)arg;
-   struct ndtmsg *ndtm = NLMSG_DATA(n);
-   int len = n->nlmsg_len;
-   struct rtattr *tb[NDTA_MAX+1];
-   struct rtattr *tpb[NDTPA_MAX+1];
-   int ret;
 
-   if (n->nlmsg_type != RTM_NEWNEIGHTBL) {
-   fprintf(stderr, "Not NEIGHTBL: %08x %08x %08x\n",
-   n->nlmsg_len, n->nlmsg_type, n->nlmsg_flags);
-   return 0;
-   }
-   len -= NLMSG_LENGTH(sizeof(*ndtm));
-   if (len < 0) {
-   fprintf(stderr, "BUG: wrong nlmsg len %d\n", len);
-   return -1;
-   }
+   print_uint(PRINT_ANY, "key_length",
+"config key_len %u ", ndtc->ndtc_key_len);
+   print_uint(PRINT_ANY, "entry_size",
+"entry_size %u ", ndtc->ndtc_entry_size);
+   print_uint(PRINT_ANY, "entries",
+  "entries %u ", ndtc->ndtc_entries);
 
-   if (preferred_family && preferred_family != ndtm->ndtm_family)
-   return 0;
+   print_string(PRINT_FP, NULL, "%s", _SL_);
 
-   parse_rtattr(tb, NDTA_MAX, NDTA_RTA(ndtm),
-n->nlmsg_len - NLMSG_LENGTH(sizeof(*ndtm)));
+   print_string(PRINT_ANY, "last_flush",
+"last_flush %s ",
+ntable_strtime_delta(ndtc->ndtc_last_flush));
+   print_string(PRINT_ANY, "last_rand",
+"last_rand %s ",
+ntable_strtime_delta(ndtc->ndtc_last_rand));
 
-   if (tb[NDTA_NAME]) {
-   const char *name = rta_getattr_str(tb[NDTA_NAME]);
+   print_string(PRINT_FP, NULL, "%s", _SL_);
 
-   if (filter.name && strcmp(filter.name, name))
-   return 0;
-   }
-   if (tb[NDTA_PARMS]) {
-   parse_rtattr(tpb, NDTPA_MAX, RTA_DATA(tb[NDTA_PARMS]),
-RTA_PAYLOAD(tb[NDTA_PARMS]));
+   print_uint(PRINT_ANY, "hash_rnd",
+  "hash_rnd %u ", ndtc->ndtc_hash_rnd);
+   print_0xhex(PRINT_ANY, "hash_mask",
+   "hash_mask %08x ", ndtc->ndtc_hash_mask);
 
-   if (tpb[NDTPA_IFINDEX]) {
-   __u32 ifindex = rta_getattr_u32(tpb[NDTPA_IFINDEX]);
+   print_uint(PRINT_ANY, "hash_chain_gc",
+  "hash_chain_gc %u ", ndtc->ndtc_hash_chain_gc);
+   print_uint(PRINT_ANY, "proxy_qlen",
+  "proxy_qlen %u ", ndtc->ndtc_proxy_qlen);
 
-   if (filter.index && filter.index != ifindex)
-   return 0;
-   } else {
-   if (filter.index && filter.index != NONE_DEV)
-   return 0;
-   }
+   print_string(PRINT_FP, NULL, "%s", _SL_);
+}
+
+static void print_ndtparams(struct rtattr *tpb[])
+{
+
+   if (tpb[NDTPA_IFINDEX]) {
+   __u32 ifindex = rta_getattr_u32(tpb[NDTPA_IFINDEX]);
+
+   print_string(PRINT_FP, NULL, "dev ", NULL);
+   print_color_string(PRINT_ANY, COLOR_IFNAME,
+  "dev", "%s ", ll_index_to_name(ifindex));
+   print_string(PRINT_FP, NULL, "%s", _SL_);
}
 
-   if (ndtm->ndtm_family == AF_INET)
-   fprintf(fp, "inet ");
-   else if (ndtm->ndtm_family == AF_INET6)
-   fprintf(fp, "inet6 ");
-   else if (ndtm->ndtm_family == AF_DECnet)
-   fprintf(fp, "dnet ");
-   else
-   fprintf(fp, "(%d) ", ndtm->ndtm_family);
+   print_string(PRINT_FP, NULL, "", NULL);
+   if (tpb[NDTPA_REFCNT]) {
+   __u32 refcnt = rta_getattr_u32(tpb[NDTPA_REFCNT]);
 
-   if (tb[NDTA_NAME]) {
-   const char *name = rta_getattr_str(tb[NDTA_NAME]);
+   print_uint(PRINT_ANY, "refcnt", "refcnt %u ", refcnt);
+   }
 
-   fprintf(fp, "%s ", name);
+   if (tpb[NDTPA_REACHABLE_TIME]) {
+   __u64 reachable = rta_getattr_u64(tpb[NDTPA_REACHABLE_TIME]);
+
+   print_uint(PRINT_ANY, "reachable",
+"reachable %llu ", reachable);
}
 
-   fprintf(fp, "%s", _SL_);
+

Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-05 Thread Alexei Starovoitov

On 3/5/18 6:13 PM, Randy Dunlap wrote:

Hi,

On 03/05/2018 05:34 PM, Alexei Starovoitov wrote:


diff --git a/kernel/module.c b/kernel/module.c
index ad2d420024f6..6cfa35795741 100644
--- a/kernel/module.c
+++ b/kernel/module.c



@@ -3669,6 +3683,17 @@ static int load_module(struct load_info *info, const char __user *uargs,
if (err)
goto free_copy;

+   if (info->hdr->e_type == ET_EXEC) {
+#ifdef CONFIG_MODULE_SIG
+   if (!info->sig_ok) {
+   pr_notice_once("umh %s verification failed: signature and/or required key missing - tainting kernel\n",


That's not a very friendly message to tell a user.  "umh" eh?


umh is an abbreviation known to kernel newbies:
https://kernelnewbies.org/KernelProjects/usermode-helper-enhancements

The rest of the message is copy paste of existing one.


+  info->file->f_path.dentry->d_name.name);
+   add_taint(TAINT_UNSIGNED_MODULE, LOCKDEP_STILL_OK);
+   }


And since the signature failed, why is it being loaded at all?


because this is how regular kernel modules deal with it.
sig_enforce is handled earlier.


Is this in the "--force" load path?


--force forces modver and modmagic. These things don't apply here.



[PATCH] net: qcom/emac: Use proper free methods during TX

2018-03-05 Thread Hemanth Puranik
This patch fixes the warning messages/call traces seen when DMA debug is
enabled: in the case of fragmented skbs, memory was allocated using
dma_map_page but freed using dma_unmap_single. This patch modifies buffer
allocations in the TX path to use dma_map_page in all places and
dma_unmap_page while freeing the buffers.

Signed-off-by: Hemanth Puranik 
---
 drivers/net/ethernet/qualcomm/emac/emac-mac.c | 23 ++-
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/emac/emac-mac.c b/drivers/net/ethernet/qualcomm/emac/emac-mac.c
index 9cbb2726..d5a32b7 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac-mac.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac-mac.c
@@ -1194,9 +1194,9 @@ void emac_mac_tx_process(struct emac_adapter *adpt, struct emac_tx_queue *tx_q)
while (tx_q->tpd.consume_idx != hw_consume_idx) {
tpbuf = GET_TPD_BUFFER(tx_q, tx_q->tpd.consume_idx);
if (tpbuf->dma_addr) {
-   dma_unmap_single(adpt->netdev->dev.parent,
-tpbuf->dma_addr, tpbuf->length,
-DMA_TO_DEVICE);
+   dma_unmap_page(adpt->netdev->dev.parent,
+  tpbuf->dma_addr, tpbuf->length,
+  DMA_TO_DEVICE);
tpbuf->dma_addr = 0;
}
 
@@ -1353,9 +1353,11 @@ static void emac_tx_fill_tpd(struct emac_adapter *adpt,
 
tpbuf = GET_TPD_BUFFER(tx_q, tx_q->tpd.produce_idx);
tpbuf->length = mapped_len;
-   tpbuf->dma_addr = dma_map_single(adpt->netdev->dev.parent,
-skb->data, tpbuf->length,
-DMA_TO_DEVICE);
+   tpbuf->dma_addr = dma_map_page(adpt->netdev->dev.parent,
+  virt_to_page(skb->data),
+  offset_in_page(skb->data),
+  tpbuf->length,
+  DMA_TO_DEVICE);
ret = dma_mapping_error(adpt->netdev->dev.parent,
tpbuf->dma_addr);
if (ret)
@@ -1371,9 +1373,12 @@ static void emac_tx_fill_tpd(struct emac_adapter *adpt,
if (mapped_len < len) {
tpbuf = GET_TPD_BUFFER(tx_q, tx_q->tpd.produce_idx);
tpbuf->length = len - mapped_len;
-   tpbuf->dma_addr = dma_map_single(adpt->netdev->dev.parent,
-skb->data + mapped_len,
-tpbuf->length, DMA_TO_DEVICE);
+   tpbuf->dma_addr = dma_map_page(adpt->netdev->dev.parent,
+  virt_to_page(skb->data +
+   mapped_len),
+  offset_in_page(skb->data +
+ mapped_len),
+  tpbuf->length, DMA_TO_DEVICE);
ret = dma_mapping_error(adpt->netdev->dev.parent,
tpbuf->dma_addr);
if (ret)
-- 
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.



Re: [RFC net-next 0/6] offload linux bonding tc ingress rules

2018-03-05 Thread Roopa Prabhu
On Mon, Mar 5, 2018 at 2:08 PM, Jakub Kicinski
 wrote:
> On Mon,  5 Mar 2018 13:28:28 +, John Hurley wrote:
>> The linux bond itself registers a cb for offloading tc rules. Potential
>> slave netdevs on offload devices can then register with the bond for a
>> further callback - this code is basically the same as registering for an
>> egress dev offload in TC. Then when a rule is offloaded to the bond, it
>> can be relayed to each netdev that has registered with the bond code and
>> which is a slave of the given bond.
>
> As you know I would much rather see this handled in the TC core,
> similarly to how blocks are shared.  We can add a new .ndo_setup_tc
> notification like TC_MASTER_BLOCK_BIND and reuse the existing offload
> tracking.  It would also fix the problem of freezing the bond and allow
> better code reuse with team etc.

+1 to handling this in the tc core. We will soon find that many other
devices need to propagate rules down the netdev stack, and keeping it in
the tc core allows re-use, as you state above. The switchdev APIs, before
they moved to notifiers, in many cases had bond and other netdev-stack
offload traversal inside the switchdev layer. (In the notifier world, I
think a driver can still register and track rules and other offload on
bonds with its own ports as slaves.)


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-05 Thread Randy Dunlap
Hi,

On 03/05/2018 05:34 PM, Alexei Starovoitov wrote:

> diff --git a/kernel/module.c b/kernel/module.c
> index ad2d420024f6..6cfa35795741 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c

> @@ -3669,6 +3683,17 @@ static int load_module(struct load_info *info, const char __user *uargs,
>   if (err)
>   goto free_copy;
>  
> + if (info->hdr->e_type == ET_EXEC) {
> +#ifdef CONFIG_MODULE_SIG
> + if (!info->sig_ok) {
> + pr_notice_once("umh %s verification failed: signature and/or required key missing - tainting kernel\n",

That's not a very friendly message to tell a user.  "umh" eh?

> +info->file->f_path.dentry->d_name.name);
> + add_taint(TAINT_UNSIGNED_MODULE, LOCKDEP_STILL_OK);
> + }

And since the signature failed, why is it being loaded at all?
Is this in the "--force" load path?

> +#endif
> + return 0;
> + }
> +
>   /* Figure out module layout, and allocate all the memory. */
>   mod = layout_and_allocate(info, flags);
>   if (IS_ERR(mod)) {

thanks,
-- 
~Randy


[PATCH 2/2] e1000e: Fix link check race condition

2018-03-05 Thread Benjamin Poirier
Alex reported the following race condition:

/* link goes up... interrupt... schedule watchdog */
\ e1000_watchdog_task
    \ e1000e_has_link
        \ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
            \ e1000e_phy_has_link_generic(..., &link)
                link = true

                                /* link goes down... interrupt */
                                \ e1000_msix_other
                                    hw->mac.get_link_status = true

            /* link is up */
            mac->get_link_status = false

    link_active = true
    /* link_active is true, wrongly, and stays so because
     * get_link_status is false */

Avoid this problem by making sure that we don't set get_link_status = false
after having checked the link.

It seems this problem has been present since the introduction of e1000e.

Link: https://lkml.org/lkml/2018/1/29/338
Reported-by: Alexander Duyck 
Signed-off-by: Benjamin Poirier 
---
 drivers/net/ethernet/intel/e1000e/ich8lan.c | 31 -
 drivers/net/ethernet/intel/e1000e/mac.c | 14 ++---
 2 files changed, 24 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.c b/drivers/net/ethernet/intel/e1000e/ich8lan.c
index d6d4ed7acf03..1dddfb7b2de6 100644
--- a/drivers/net/ethernet/intel/e1000e/ich8lan.c
+++ b/drivers/net/ethernet/intel/e1000e/ich8lan.c
@@ -1383,6 +1383,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
 */
if (!mac->get_link_status)
return 0;
+   mac->get_link_status = false;
 
/* First we want to see if the MII Status Register reports
 * link.  If so, then we want to get the current speed/duplex
@@ -1390,12 +1391,12 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
 */
ret_val = e1000e_phy_has_link_generic(hw, 1, 0, );
if (ret_val)
-   return ret_val;
+   goto out;
 
if (hw->mac.type == e1000_pchlan) {
ret_val = e1000_k1_gig_workaround_hv(hw, link);
if (ret_val)
-   return ret_val;
+   goto out;
}
 
/* When connected at 10Mbps half-duplex, some parts are excessively
@@ -1428,7 +1429,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
 
ret_val = hw->phy.ops.acquire(hw);
if (ret_val)
-   return ret_val;
+   goto out;
 
if (hw->mac.type == e1000_pch2lan)
emi_addr = I82579_RX_CONFIG;
@@ -1450,7 +1451,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
hw->phy.ops.release(hw);
 
if (ret_val)
-   return ret_val;
+   goto out;
 
if (hw->mac.type >= e1000_pch_spt) {
u16 data;
@@ -1459,14 +1460,14 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
if (speed == SPEED_1000) {
ret_val = hw->phy.ops.acquire(hw);
if (ret_val)
-   return ret_val;
+   goto out;
 
ret_val = e1e_rphy_locked(hw,
  PHY_REG(776, 20),
  );
if (ret_val) {
hw->phy.ops.release(hw);
-   return ret_val;
+   goto out;
}
 
ptr_gap = (data & (0x3FF << 2)) >> 2;
@@ -1480,18 +1481,18 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
}
hw->phy.ops.release(hw);
if (ret_val)
-   return ret_val;
+   goto out;
} else {
ret_val = hw->phy.ops.acquire(hw);
if (ret_val)
-   return ret_val;
+   goto out;
 
ret_val = e1e_wphy_locked(hw,
  PHY_REG(776, 20),
  0xC023);
hw->phy.ops.release(hw);
if (ret_val)
-   return ret_val;
+  

[PATCH 1/2] Revert "e1000e: Separate signaling for link check/link up"

2018-03-05 Thread Benjamin Poirier
This reverts commit 19110cfbb34d4af0cdfe14cd243f3b09dc95b013.
This reverts commit 4110e02eb45ea447ec6f5459c9934de0a273fb91.
This reverts commit d3604515c9eda464a92e8e67aae82dfe07fe3c98.

Commit 19110cfbb34d ("e1000e: Separate signaling for link check/link up")
changed what happens to the link status when there is an error which
happens after "get_link_status = false" in the copper check_for_link
callbacks. Previously, such an error would be ignored and the link
considered up. After that commit, any error implies that the link is down.

Revert commit 19110cfbb34d ("e1000e: Separate signaling for link check/link
up") and its followups. After reverting, the race condition described in
the log of commit 19110cfbb34d is reintroduced. It may still be triggered
by LSC events but this should keep the link down in case the link is
electrically unstable, as discussed. The race may no longer be
triggered by RXO events because commit 4aea7a5c5e94 ("e1000e: Avoid
receiver overrun interrupt bursts") restored reading icr in the Other
handler.

Link: https://lkml.org/lkml/2018/3/1/789
Signed-off-by: Benjamin Poirier 
---
 drivers/net/ethernet/intel/e1000e/ich8lan.c | 13 -
 drivers/net/ethernet/intel/e1000e/mac.c | 13 -
 drivers/net/ethernet/intel/e1000e/netdev.c  |  2 +-
 3 files changed, 9 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.c b/drivers/net/ethernet/intel/e1000e/ich8lan.c
index ff308b05d68c..d6d4ed7acf03 100644
--- a/drivers/net/ethernet/intel/e1000e/ich8lan.c
+++ b/drivers/net/ethernet/intel/e1000e/ich8lan.c
@@ -1367,9 +1367,6 @@ static s32 e1000_disable_ulp_lpt_lp(struct e1000_hw *hw, 
bool force)
  *  Checks to see of the link status of the hardware has changed.  If a
  *  change in link status has been detected, then we read the PHY registers
  *  to get the current speed/duplex if link exists.
- *
- *  Returns a negative error code (-E1000_ERR_*) or 0 (link down) or 1 (link
- *  up).
  **/
 static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
 {
@@ -1385,7 +1382,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
 * Change or Rx Sequence Error interrupt.
 */
if (!mac->get_link_status)
-   return 1;
+   return 0;
 
/* First we want to see if the MII Status Register reports
 * link.  If so, then we want to get the current speed/duplex
@@ -1602,7 +1599,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
 * we have already determined whether we have link or not.
 */
if (!mac->autoneg)
-   return 1;
+   return -E1000_ERR_CONFIG;
 
/* Auto-Neg is enabled.  Auto Speed Detection takes care
 * of MAC speed/duplex configuration.  So we only need to
@@ -1616,12 +1613,10 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
 * different link partner.
 */
ret_val = e1000e_config_fc_after_link_up(hw);
-   if (ret_val) {
+   if (ret_val)
e_dbg("Error configuring flow control\n");
-   return ret_val;
-   }
 
-   return 1;
+   return ret_val;
 }
 
 static s32 e1000_get_variants_ich8lan(struct e1000_adapter *adapter)
diff --git a/drivers/net/ethernet/intel/e1000e/mac.c b/drivers/net/ethernet/intel/e1000e/mac.c
index db735644b312..b322011ec282 100644
--- a/drivers/net/ethernet/intel/e1000e/mac.c
+++ b/drivers/net/ethernet/intel/e1000e/mac.c
@@ -410,9 +410,6 @@ void e1000e_clear_hw_cntrs_base(struct e1000_hw *hw)
  *  Checks to see of the link status of the hardware has changed.  If a
  *  change in link status has been detected, then we read the PHY registers
  *  to get the current speed/duplex if link exists.
- *
- *  Returns a negative error code (-E1000_ERR_*) or 0 (link down) or 1 (link
- *  up).
  **/
 s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
 {
@@ -426,7 +423,7 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
 * Change or Rx Sequence Error interrupt.
 */
if (!mac->get_link_status)
-   return 1;
+   return 0;
 
/* First we want to see if the MII Status Register reports
 * link.  If so, then we want to get the current speed/duplex
@@ -450,7 +447,7 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
 * we have already determined whether we have link or not.
 */
if (!mac->autoneg)
-   return 1;
+   return -E1000_ERR_CONFIG;
 
/* Auto-Neg is enabled.  Auto Speed Detection takes care
 * of MAC speed/duplex configuration.  So we only need to
@@ -464,12 +461,10 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
 * different link partner.
 */
ret_val = e1000e_config_fc_after_link_up(hw);
-   if (ret_val) {
+   if (ret_val)
e_dbg("Error configuring flow 

[PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-05 Thread Alexei Starovoitov
As the first step in development of bpfilter project [1] the request_module()
code is extended to allow user mode helpers to be invoked. Idea is that
user mode helpers are built as part of the kernel build and installed as
traditional kernel modules with .ko file extension into distro specified
location, such that from a distribution point of view, they are no different
than regular kernel modules. Thus, allow request_module() logic to load such
user mode helper (umh) modules via:

  request_module("foo") ->
call_umh("modprobe foo") ->
  sys_finit_module(FD of /lib/modules/.../foo.ko) ->
call_umh(struct file)

Such an approach enables the kernel to delegate functionality traditionally
done by kernel modules to user space processes (either root or !root) and
reduces the security attack surface of such new code: in case of potential
bugs, only the umh would crash, not the kernel. Another
advantage coming with that would be that bpfilter.ko can be debugged and
tested out of user space as well (e.g. opening the possibility to run
all clang sanitizers, fuzzers or test suites for checking translation).
Also, such architecture makes the kernel/user boundary very precise:
control plane is done by the user space while data plane stays in the kernel.

It's easy to distinguish "umh module" from traditional kernel module:

$ readelf -h bld/net/bpfilter/bpfilter.ko|grep Type
  Type:  EXEC (Executable file)
$ readelf -h bld/net/ipv4/netfilter/iptable_filter.ko |grep Type
  Type:  REL (Relocatable file)

Since a umh can crash, be OOM-killed by the kernel, or be killed by the admin,
the subsystem that uses it (like bpfilter) needs to manage the lifetime of
the umh on its own, so the module infra doesn't do any accounting of them.
They don't appear in "lsmod" and cannot be "rmmod"-ed.
Multiple request_module("umh") will load multiple umh.ko processes.

As with regular kernel modules, the kernel will be tainted if a "umh module"
has an invalid signature.

[1] https://lwn.net/Articles/747551/

Signed-off-by: Alexei Starovoitov 
---
 fs/exec.c   | 40 +++-
 include/linux/binfmts.h |  1 +
 include/linux/umh.h |  3 +++
 kernel/module.c | 43 ++-
 kernel/umh.c| 24 +---
 5 files changed, 94 insertions(+), 17 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 7eb8d21bcab9..0483c438de7d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1691,14 +1691,13 @@ static int exec_binprm(struct linux_binprm *bprm)
 /*
  * sys_execve() executes a new program.
  */
-static int do_execveat_common(int fd, struct filename *filename,
- struct user_arg_ptr argv,
- struct user_arg_ptr envp,
- int flags)
+static int __do_execve_file(int fd, struct filename *filename,
+   struct user_arg_ptr argv,
+   struct user_arg_ptr envp,
+   int flags, struct file *file)
 {
char *pathbuf = NULL;
struct linux_binprm *bprm;
-   struct file *file;
struct files_struct *displaced;
int retval;
 
@@ -1737,7 +1736,8 @@ static int do_execveat_common(int fd, struct filename *filename,
check_unsafe_exec(bprm);
current->in_execve = 1;
 
-   file = do_open_execat(fd, filename, flags);
+   if (!file)
+   file = do_open_execat(fd, filename, flags);
retval = PTR_ERR(file);
if (IS_ERR(file))
goto out_unmark;
@@ -1745,7 +1745,9 @@ static int do_execveat_common(int fd, struct filename *filename,
sched_exec();
 
bprm->file = file;
-   if (fd == AT_FDCWD || filename->name[0] == '/') {
+   if (!filename) {
+   bprm->filename = "/dev/null";
+   } else if (fd == AT_FDCWD || filename->name[0] == '/') {
bprm->filename = filename->name;
} else {
if (filename->name[0] == '\0')
@@ -1811,7 +1813,8 @@ static int do_execveat_common(int fd, struct filename *filename,
task_numa_free(current);
free_bprm(bprm);
kfree(pathbuf);
-   putname(filename);
+   if (filename)
+   putname(filename);
if (displaced)
put_files_struct(displaced);
return retval;
@@ -1834,10 +1837,29 @@ static int do_execveat_common(int fd, struct filename *filename,
if (displaced)
reset_files_struct(displaced);
 out_ret:
-   putname(filename);
+   if (filename)
+   putname(filename);
return retval;
 }
 
+static int do_execveat_common(int fd, struct filename *filename,
+ struct user_arg_ptr argv,
+ struct user_arg_ptr envp,
+ int flags)
+{
+   struct file *file = NULL;
+
+   return 

Re: [PATCH bpf-next 3/5] bpf: introduce BPF_RAW_TRACEPOINT

2018-03-05 Thread Alexei Starovoitov

On 3/5/18 3:56 PM, Daniel Borkmann wrote:

On 03/01/2018 05:19 AM, Alexei Starovoitov wrote:

Introduce BPF_PROG_TYPE_RAW_TRACEPOINT bpf program type to access
kernel internal arguments of the tracepoints in their raw form.

From the bpf program's point of view, access to the arguments looks like:
struct bpf_raw_tracepoint_args {
   __u64 args[0];
};

int bpf_prog(struct bpf_raw_tracepoint_args *ctx)
{
  // program can read args[N] where N depends on tracepoint
  // and statically verified at program load+attach time
}

kprobe+bpf infrastructure allows programs access function arguments.
This feature allows programs access raw tracepoint arguments.

Similar to proposed 'dynamic ftrace events' there are no abi guarantees
to what the tracepoints arguments are and what their meaning is.
The program needs to type cast args properly and use bpf_probe_read()
helper to access struct fields when argument is a pointer.

For every tracepoint __bpf_trace_##call function is prepared.
In assembler it looks like:
(gdb) disassemble __bpf_trace_xdp_exception
Dump of assembler code for function __bpf_trace_xdp_exception:
   0x81132080 <+0>: mov%ecx,%ecx
   0x81132082 <+2>: jmpq   0x811231f0 

where

TRACE_EVENT(xdp_exception,
TP_PROTO(const struct net_device *dev,
 const struct bpf_prog *xdp, u32 act),

The above assembler snippet is casting 32-bit 'act' field into 'u64'
to pass into bpf_trace_run3(), while 'dev' and 'xdp' args are passed as-is.
All of ~500 of __bpf_trace_*() functions are only 5-10 byte long
and in total this approach adds 7k bytes to .text and 8k bytes
to .rodata since the probe funcs need to appear in kallsyms.
The alternative of having __bpf_trace_##call being global in kallsyms
could have been to keep them static and add another pointer to these
static functions to 'struct trace_event_class' and 'struct trace_event_call',
but keeping them global simplifies implementation and keeps it independent
from the tracing side.

Also such approach gives the lowest possible overhead
while calling trace_xdp_exception() from kernel C code and
transitioning into bpf land.


Awesome work! Just a few comments below.


Since tracepoint+bpf are used at speeds of 1M+ events per second
this is very valuable optimization.

Since ftrace and perf side are not involved the new
BPF_RAW_TRACEPOINT_OPEN sys_bpf command is introduced
that returns anon_inode FD of 'bpf-raw-tracepoint' object.

The user space looks like:
// load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
prog_fd = bpf_prog_load(...);
// receive anon_inode fd for given bpf_raw_tracepoint
raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception");
// attach bpf program to given tracepoint
bpf_prog_attach(prog_fd, raw_tp_fd, BPF_RAW_TRACEPOINT);

Ctrl-C of tracing daemon or cmdline tool that uses this feature
will automatically detach bpf program, unload it and
unregister tracepoint probe.

On the kernel side for_each_kernel_tracepoint() is used
to find a tracepoint with "xdp_exception" name
(that would be __tracepoint_xdp_exception record)

Then kallsyms_lookup_name() is used to find the addr
of __bpf_trace_xdp_exception() probe function.

And finally tracepoint_probe_register() is used to connect probe
with tracepoint.

Addition of bpf_raw_tracepoint doesn't interfere with ftrace and perf
tracepoint mechanisms. perf_event_open() can be used in parallel
on the same tracepoint.
Also multiple bpf_raw_tracepoint_open("foo") are permitted.
Each raw_tp_fd allows to attach one bpf program, so multiple
user space processes can open their own raw_tp_fd with their own
bpf program. The kernel will execute all tracepoint probes
and all attached bpf programs.

In the future bpf_raw_tracepoints can be extended with
query/introspection logic.

Signed-off-by: Alexei Starovoitov 

...

+static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
+{
+   struct bpf_raw_tracepoint *raw_tp;
+   struct tracepoint *tp;
+   char tp_name[128];
+
+   if (strncpy_from_user(tp_name, u64_to_user_ptr(attr->raw_tracepoint.name),
+ sizeof(tp_name) - 1) < 0)
+   return -EFAULT;
+   tp_name[sizeof(tp_name) - 1] = 0;
+
+   tp = for_each_kernel_tracepoint(__find_tp, tp_name);
+   if (!tp)
+   return -ENOENT;
+
+   raw_tp = kmalloc(sizeof(*raw_tp), GFP_USER | __GFP_ZERO);
+   if (!raw_tp)
+   return -ENOMEM;
+   raw_tp->tp = tp;
+
+   return anon_inode_getfd("bpf-raw-tracepoint", _raw_tp_fops, raw_tp,
+   O_CLOEXEC);


When anon_inode_getfd() fails to get you an fd, then you leak raw_tp here.


good catch. will fix.


break;
+   case BPF_RAW_TRACEPOINT_OPEN:
+   err = bpf_raw_tracepoint_open();


With regards to the above attach_raw_tp() comment, why not have a single
BPF_RAW_TRACEPOINT_OPEN command that already passes the BPF fd along with the
tp name? Is there a 

Add NETIF_F_HW_VLAN_CTAG_TX to hw_enc_features

2018-03-05 Thread Seiichi Ishitsuka
Hi all,

I have a problem with Ethernet frame corruption when using the mv88e6xxx DSA
driver. If it is a known problem, could you please point me to a fix or patch?
Attached is a workaround patch, which seems to work fine.

Environment:

kernel: linux-yocto-4.4
Hardware: Marvell 88E6182 L2SW with mv88e6xxx DSA driver

  +-----------------+
  | L2SW (88E6182)  |
  | with DSA Driver |
  +--+----------+---+
     |          | VLAN 100 untagged port
    CPU      external
     |          |
  br.100      HUB -- PC


The Ethernet frames sent from the external port were corrupted by data from
the VLAN tag and IPv4 header. There is no problem with frames received on
the external port.

Analysis of the corrupted data, captured on the external port and traced
with debug prints:

Original packet:
 | Dst MAC |Src MAC  | type | IPv4 hdr |
 | 8000273c8b6d|005043000201 | 0800 | 456cc7514000.. |

VLAN packet:
 | Dst MAC |Src MAC  | VLAN |type | IPv4 hdr   |
 | 8000273c8b6d|005043000201 | 8164 |0800 | 456cc7514000.. |
   ^
   < from this correct data>
Corrupted packet:
 | Dst MAC |Src MAC  | VLAN |type | IPv4 hdr   |
 | 8000273c0064|0800456c | c7514000 |0800 | 456cc7514000.. |
   
   < to this corrupted data   >


Best regards,
Seiichi Ishitsuka


0001-Add-NETIF_F_HW_VLAN_CTAG_TX-to-hw_enc_features.patch


Re: [RFC net-next 0/6] offload linux bonding tc ingress rules

2018-03-05 Thread Jakub Kicinski
On Mon, 5 Mar 2018 23:57:18 +, John Hurley wrote:
> > what is your approach for rules whose bond is their egress device
> > (ingress = vf port
> > representor)?  
> 
> Egress offload will be handled entirely by the NFP driver.
> Basically, notifiers will track the slaves/masters and update the card
> with any groups that consist entirely of reprs.
> We then offload the TC rule outputting to the given group - because it
> is an ingress match we can access the egress netdev in the block
> callback.

And you handle egdev call too?  Or are we hoping to get rid of that
before? :)


Re: [PATCH bpf-next 3/5] bpf: introduce BPF_RAW_TRACEPOINT

2018-03-05 Thread Daniel Borkmann
On 03/01/2018 05:19 AM, Alexei Starovoitov wrote:
> Introduce BPF_PROG_TYPE_RAW_TRACEPOINT bpf program type to access
> kernel internal arguments of the tracepoints in their raw form.
> 
> From the bpf program's point of view, access to the arguments looks like:
> struct bpf_raw_tracepoint_args {
>__u64 args[0];
> };
> 
> int bpf_prog(struct bpf_raw_tracepoint_args *ctx)
> {
>   // program can read args[N] where N depends on tracepoint
>   // and statically verified at program load+attach time
> }
> 
> kprobe+bpf infrastructure allows programs access function arguments.
> This feature allows programs access raw tracepoint arguments.
> 
> Similar to proposed 'dynamic ftrace events' there are no abi guarantees
> to what the tracepoints arguments are and what their meaning is.
> The program needs to type cast args properly and use bpf_probe_read()
> helper to access struct fields when argument is a pointer.
> 
> For every tracepoint __bpf_trace_##call function is prepared.
> In assembler it looks like:
> (gdb) disassemble __bpf_trace_xdp_exception
> Dump of assembler code for function __bpf_trace_xdp_exception:
>0x81132080 <+0>: mov%ecx,%ecx
>0x81132082 <+2>: jmpq   0x811231f0 
> 
> where
> 
> TRACE_EVENT(xdp_exception,
> TP_PROTO(const struct net_device *dev,
>  const struct bpf_prog *xdp, u32 act),
> 
> The above assembler snippet is casting 32-bit 'act' field into 'u64'
> to pass into bpf_trace_run3(), while 'dev' and 'xdp' args are passed as-is.
> All of ~500 of __bpf_trace_*() functions are only 5-10 byte long
> and in total this approach adds 7k bytes to .text and 8k bytes
> to .rodata since the probe funcs need to appear in kallsyms.
> The alternative of having __bpf_trace_##call being global in kallsyms
> could have been to keep them static and add another pointer to these
> static functions to 'struct trace_event_class' and 'struct trace_event_call',
> but keeping them global simplifies implementation and keeps it independent
> from the tracing side.
> 
> Also such approach gives the lowest possible overhead
> while calling trace_xdp_exception() from kernel C code and
> transitioning into bpf land.

Awesome work! Just a few comments below.

> Since tracepoint+bpf are used at speeds of 1M+ events per second
> this is very valuable optimization.
> 
> Since ftrace and perf side are not involved the new
> BPF_RAW_TRACEPOINT_OPEN sys_bpf command is introduced
> that returns anon_inode FD of 'bpf-raw-tracepoint' object.
> 
> The user space looks like:
> // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
> prog_fd = bpf_prog_load(...);
> // receive anon_inode fd for given bpf_raw_tracepoint
> raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception");
> // attach bpf program to given tracepoint
> bpf_prog_attach(prog_fd, raw_tp_fd, BPF_RAW_TRACEPOINT);
> 
> Ctrl-C of tracing daemon or cmdline tool that uses this feature
> will automatically detach bpf program, unload it and
> unregister tracepoint probe.
> 
> On the kernel side for_each_kernel_tracepoint() is used
> to find a tracepoint with "xdp_exception" name
> (that would be __tracepoint_xdp_exception record)
> 
> Then kallsyms_lookup_name() is used to find the addr
> of __bpf_trace_xdp_exception() probe function.
> 
> And finally tracepoint_probe_register() is used to connect probe
> with tracepoint.
> 
> Addition of bpf_raw_tracepoint doesn't interfere with ftrace and perf
> tracepoint mechanisms. perf_event_open() can be used in parallel
> on the same tracepoint.
> Also multiple bpf_raw_tracepoint_open("foo") are permitted.
> Each raw_tp_fd allows attaching one bpf program, so multiple
> user space processes can open their own raw_tp_fd with their own
> bpf program. The kernel will execute all tracepoint probes
> and all attached bpf programs.
> 
> In the future bpf_raw_tracepoints can be extended with
> query/introspection logic.
> 
> Signed-off-by: Alexei Starovoitov 
> ---
>  include/linux/bpf_types.h|   1 +
>  include/linux/trace_events.h |  57 
>  include/trace/bpf_probe.h|  87 ++
>  include/trace/define_trace.h |   1 +
>  include/uapi/linux/bpf.h |  11 +++
>  kernel/bpf/syscall.c | 108 ++
>  kernel/trace/bpf_trace.c | 211 +++
>  7 files changed, 476 insertions(+)
>  create mode 100644 include/trace/bpf_probe.h
> 
[...]
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index e24aa3241387..b5c33dda1a1c 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1311,6 +1311,109 @@ static int bpf_obj_get(const union bpf_attr *attr)
>   attr->file_flags);
>  }
>  
> +struct bpf_raw_tracepoint {
> + struct tracepoint *tp;
> + struct bpf_prog *prog;
> +};
> +
> +static int bpf_raw_tracepoint_release(struct inode *inode, struct file *filp)
> +{
> + struct bpf_raw_tracepoint *raw_tp = 

Re: [RFC net-next 0/6] offload linux bonding tc ingress rules

2018-03-05 Thread John Hurley
On Mon, Mar 5, 2018 at 9:43 PM, Or Gerlitz  wrote:
> On Mon, Mar 5, 2018 at 3:28 PM, John Hurley  wrote:
>> This RFC patchset adds support for offloading tc ingress rules applied to
>> linux bonds. The premise of these patches is that if a rule is applied to
>> a bond port then the rule should be applied to each slave of the bond.
>>
>> The linux bond itself registers a cb for offloading tc rules. Potential
>> slave netdevs on offload devices can then register with the bond for a
>> further callback - this code is basically the same as registering for an
>> egress dev offload in TC. Then when a rule is offloaded to the bond, it
>> can be relayed to each netdev that has registered with the bond code and
>> which is a slave of the given bond.
>>
>> To prevent sync issues between the kernel and offload device, the linux
>> bond driver is effectively locked when it has offloaded rules, i.e. no new
>> ports can be enslaved and no slaves can be released until the offload
>> rules are removed. Similarly, if a port on a bond is deleted, the bond is
>> destroyed, forcing a flush of all offloaded rules.
>>
>> Also included in the RFC are changes to the NFP driver to utilise the new
>> code by registering NFP port representors for bond offload rules and
>> modifying cookie handling to allow the relaying of a rule to multiple ports.
>
> what is your approach for rules whose bond is their egress device
> (ingress = vf port
> representor)?

Egress offload will be handled entirely by the NFP driver.
Basically, notifiers will track the slaves/masters and update the card
with any groups that consist entirely of reprs.
We then offload the TC rule outputting to the given group - because it
is an ingress match we can access the egress netdev in the block
callback.


Re: [PATCH net-next 0/3] sctp: add support for some msg_control options from RFC6458

2018-03-05 Thread Marcelo Ricardo Leitner
On Mon, Mar 05, 2018 at 08:44:17PM +0800, Xin Long wrote:
> This patchset is to add support for 3 msg_control options described
> in RFC6458:
> 
> 5.3.7.  SCTP PR-SCTP Information Structure (SCTP_PRINFO)
> 5.3.9.  SCTP Destination IPv4 Address Structure (SCTP_DSTADDRV4)
> 5.3.10. SCTP Destination IPv6 Address Structure (SCTP_DSTADDRV6)
> 
> one send flag described in RFC6458:
> 
> SCTP_SENDALL:  This flag, if set, will cause a one-to-many
> style socket to send the message to all associations that
> are currently established on this socket.  For the one-to-
> one style socket, this flag has no effect.

Other patches (than the 2nd one) LGTM.

  Marcelo


Re: [PATCH net v3] sch_netem: fix skb leak in netem_enqueue()

2018-03-05 Thread Neil Horman
On Mon, Mar 05, 2018 at 08:52:54PM +0300, Alexey Kodanev wrote:
> When we exceed the current packet limit and we have more than one
> segment in the list returned by skb_gso_segment(), netem drops
> only the first one, skipping the rest, hence kmemleak reports:
> 
> unreferenced object 0x880b5d23b600 (size 1024):
>   comm "softirq", pid 0, jiffies 4384527763 (age 2770.629s)
>   hex dump (first 32 bytes):
> 00 80 23 5d 0b 88 ff ff 00 00 00 00 00 00 00 00  ..#]
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>   backtrace:
> [] __alloc_skb+0xc9/0x520
> [<1709b32f>] skb_segment+0x8c8/0x3710
> [] tcp_gso_segment+0x331/0x1830
> [] inet_gso_segment+0x476/0x1370
> [<8b762dd4>] skb_mac_gso_segment+0x1f9/0x510
> [<2182660a>] __skb_gso_segment+0x1dd/0x620
> [<412651b9>] netem_enqueue+0x1536/0x2590 [sch_netem]
> [<05d3b2a9>] __dev_queue_xmit+0x1167/0x2120
> [] ip_finish_output2+0x998/0xf00
> [] ip_output+0x1aa/0x2c0
> [<7ecbd3a4>] tcp_transmit_skb+0x18db/0x3670
> [<42d2a45f>] tcp_write_xmit+0x4d4/0x58c0
> [<56a44199>] tcp_tasklet_func+0x3d9/0x540
> [<13d06d02>] tasklet_action+0x1ca/0x250
> [] __do_softirq+0x1b4/0x5a3
> [] irq_exit+0x1e2/0x210
> 
> Fix it by adding the rest of the segments, if any, to skb 'to_free'
> list. Add new __qdisc_drop_all() and qdisc_drop_all() functions
> because they can be useful in the future if we need to drop segmented
> GSO packets in other places.
> 
> Fixes: 6071bd1aa13e ("netem: Segment GSO packets on enqueue")
> Signed-off-by: Alexey Kodanev 
> ---
> 
> v3: use skb->prev to find the tail of the list. skb->prev can be NULL
> for not segmented skbs, so check it too.
> 
> v2: add new __qdisc_drop_all() and qdisc_drop_all(), and use
> qdisc_drop_all() in sch_netem.
> 
> 
>  include/net/sch_generic.h | 19 +++
>  net/sched/sch_netem.c |  2 +-
>  2 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index e2ab136..2092d33 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -824,6 +824,16 @@ static inline void __qdisc_drop(struct sk_buff *skb, struct sk_buff **to_free)
>   *to_free = skb;
>  }
>  
> +static inline void __qdisc_drop_all(struct sk_buff *skb,
> + struct sk_buff **to_free)
> +{
> + if (skb->prev)
> + skb->prev->next = *to_free;
> + else
> + skb->next = *to_free;
> + *to_free = skb;
> +}
> +
>  static inline unsigned int __qdisc_queue_drop_head(struct Qdisc *sch,
>  struct qdisc_skb_head *qh,
>  struct sk_buff **to_free)
> @@ -956,6 +966,15 @@ static inline int qdisc_drop(struct sk_buff *skb, struct Qdisc *sch,
>   return NET_XMIT_DROP;
>  }
>  
> +static inline int qdisc_drop_all(struct sk_buff *skb, struct Qdisc *sch,
> +  struct sk_buff **to_free)
> +{
> + __qdisc_drop_all(skb, to_free);
> + qdisc_qstats_drop(sch);
> +
> + return NET_XMIT_DROP;
> +}
> +
>  /* Length to Time (L2T) lookup in a qdisc_rate_table, to determine how
> long it will take to send a packet given its size.
>   */
> diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
> index 7c179ad..7d6801f 100644
> --- a/net/sched/sch_netem.c
> +++ b/net/sched/sch_netem.c
> @@ -509,7 +509,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>   }
>  
>   if (unlikely(sch->q.qlen >= sch->limit))
> - return qdisc_drop(skb, sch, to_free);
> + return qdisc_drop_all(skb, sch, to_free);
>  
>   qdisc_qstats_backlog_inc(sch, skb);
>  
> -- 
> 1.8.3.1
> 
> 
Acked-by: Neil Horman 



Re: [PATCH net-next 2/3] sctp: add support for SCTP_DSTADDRV4/6 Information for sendmsg

2018-03-05 Thread Marcelo Ricardo Leitner
On Mon, Mar 05, 2018 at 08:44:19PM +0800, Xin Long wrote:
> This patch is to add support for Destination IPv4/6 Address options
> for sendmsg, as described in section 5.3.9/10 of RFC6458.
> 
> With this option, you can provide more than one destination addrs
> to sendmsg when creating asoc, like sctp_connectx.
> 
> It's also a necessary send info for sctp_sendv.
> 
> Signed-off-by: Xin Long 
> ---
>  include/net/sctp/structs.h |  1 +
>  include/uapi/linux/sctp.h  |  6 
>  net/sctp/socket.c  | 77 ++
>  3 files changed, 84 insertions(+)
> 
> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> index d40a2a3..ec6e46b 100644
> --- a/include/net/sctp/structs.h
> +++ b/include/net/sctp/structs.h
> @@ -2113,6 +2113,7 @@ struct sctp_cmsgs {
>   struct sctp_sndrcvinfo *srinfo;
>   struct sctp_sndinfo *sinfo;
>   struct sctp_prinfo *prinfo;
> + struct msghdr *addrs_msg;
>  };
>  
>  /* Structure for tracking memory objects */
> diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
> index 0dd1f82..a1bc350 100644
> --- a/include/uapi/linux/sctp.h
> +++ b/include/uapi/linux/sctp.h
> @@ -308,6 +308,12 @@ typedef enum sctp_cmsg_type {
>  #define SCTP_NXTINFO SCTP_NXTINFO
>   SCTP_PRINFO,/* 5.3.7 SCTP PR-SCTP Information Structure */
>  #define SCTP_PRINFO  SCTP_PRINFO
> + SCTP_AUTHINFO,  /* 5.3.8 SCTP AUTH Information Structure (RESERVED) */
> +#define SCTP_AUTHINFO SCTP_AUTHINFO
> + SCTP_DSTADDRV4, /* 5.3.9 SCTP Destination IPv4 Address Structure */
> +#define SCTP_DSTADDRV4   SCTP_DSTADDRV4
> + SCTP_DSTADDRV6, /* 5.3.10 SCTP Destination IPv6 Address Structure */
> +#define SCTP_DSTADDRV6   SCTP_DSTADDRV6
>  } sctp_cmsg_t;
>  
>  /*
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index fdde697..067b57a 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -1676,6 +1676,7 @@ static int sctp_sendmsg_new_asoc(struct sock *sk, __u16 sflags,
>   struct net *net = sock_net(sk);
>   struct sctp_association *asoc;
>   enum sctp_scope scope;
> + struct cmsghdr *cmsg;
>   int err = -EINVAL;
>  
>   *tp = NULL;
> @@ -1741,6 +1742,67 @@ static int sctp_sendmsg_new_asoc(struct sock *sk, __u16 sflags,
>   goto free;
>   }
>  
> + if (!cmsgs->addrs_msg)
> + return 0;
> +
> + /* sendv addr list parse */
> + for_each_cmsghdr(cmsg, cmsgs->addrs_msg) {
> + struct sctp_transport *transport;
> + struct sctp_association *old;
> + union sctp_addr _daddr;
> + int dlen;
> +
> + if (cmsg->cmsg_level != IPPROTO_SCTP ||
> + (cmsg->cmsg_type != SCTP_DSTADDRV4 &&
> +  cmsg->cmsg_type != SCTP_DSTADDRV6))
> + continue;
> +
> + daddr = &_daddr;
> + memset(daddr, 0, sizeof(*daddr));
> + dlen = cmsg->cmsg_len - sizeof(struct cmsghdr);
> + if (cmsg->cmsg_type == SCTP_DSTADDRV4) {
> + if (dlen < sizeof(struct in_addr))
> + goto free;
> +
> + dlen = sizeof(struct in_addr);
> + daddr->v4.sin_family = AF_INET;
> + daddr->v4.sin_port = htons(asoc->peer.port);
> + memcpy(&daddr->v4.sin_addr, CMSG_DATA(cmsg), dlen);
> + } else {
> + if (dlen < sizeof(struct in6_addr))
> + goto free;
> +
> + dlen = sizeof(struct in6_addr);
> + daddr->v6.sin6_family = AF_INET6;
> + daddr->v6.sin6_port = htons(asoc->peer.port);
> + memcpy(&daddr->v6.sin6_addr, CMSG_DATA(cmsg), dlen);
> + }
> + err = sctp_verify_addr(sk, daddr, sizeof(*daddr));
> + if (err)
> + goto free;
> +
> + old = sctp_endpoint_lookup_assoc(ep, daddr, &transport);
> + if (old && old != asoc) {
> + if (old->state >= SCTP_STATE_ESTABLISHED)
> + err = -EISCONN;
> + else
> + err = -EALREADY;
> + goto free;
> + }
> +
> + if (sctp_endpoint_is_peeled_off(ep, daddr)) {
> + err = -EADDRNOTAVAIL;
> + goto free;
> + }
> +
> + transport = sctp_assoc_add_peer(asoc, daddr, GFP_KERNEL,
> + SCTP_UNKNOWN);
> + if (!transport) {
> + err = -ENOMEM;
> + goto free;
> + }
> + }
> +
>   return 0;
>  
>  free:
> @@ -7778,6 +7840,21 @@ static int sctp_msghdr_parse(const struct msghdr *msg, struct sctp_cmsgs *cmsgs)
>   if 

Re: [RFC net-next 4/6] nfp: add ndo_set_mac_address for representors

2018-03-05 Thread John Hurley
On Mon, Mar 5, 2018 at 9:39 PM, Or Gerlitz  wrote:
> On Mon, Mar 5, 2018 at 3:28 PM, John Hurley  wrote:
>> A representor hardware address does not have any meaning outside of the
>> kernel netdev/networking stack. Thus there is no need for any app-specific
>> code for setting a representor's hardware address; the default eth_mac_addr
>> is sufficient.
>
> where did you need that? does libvirt attempts to change the mac address or
> it's for bonding to call, worth mentioning the use-case in the change log

Hi Or,
yes, setting the mac is required to add the repr to a linux bond.
I agree, I should add the use case here. Thanks


Re: [PATCH net] sch_netem: fix skb leak in netem_enqueue()

2018-03-05 Thread Neil Horman
On Mon, Mar 05, 2018 at 03:57:52PM +0300, Alexey Kodanev wrote:
> On 03/03/2018 03:20 PM, Neil Horman wrote:
> > On Fri, Mar 02, 2018 at 09:16:48PM +0300, Alexey Kodanev wrote:
> >> When we exceed the current packet limit and have more than one
> >> segment in the list returned by skb_gso_segment(), netem drops
> >> only the first one, skipping the rest, hence kmemleak reports:
> >>
> ...
> >>
> >> Fix it by adding the rest of the segments, if any, to skb
> >> 'to_free' list in that case.
> >>
> >> Fixes: 6071bd1aa13e ("netem: Segment GSO packets on enqueue")
> >> Signed-off-by: Alexey Kodanev 
> >> ---
> >>  net/sched/sch_netem.c | 8 +++-
> >>  1 file changed, 7 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
> >> index 7c179ad..a5023a2 100644
> >> --- a/net/sched/sch_netem.c
> >> +++ b/net/sched/sch_netem.c
> >> @@ -508,8 +508,14 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> >>1<<(prandom_u32() % 8);
> >>}
> >>  
> >> -  if (unlikely(sch->q.qlen >= sch->limit))
> >> +  if (unlikely(sch->q.qlen >= sch->limit)) {
> >> +  while (segs) {
> >> +  skb2 = segs->next;
> >> +  __qdisc_drop(segs, to_free);
> >> +  segs = skb2;
> >> +  }
> >>return qdisc_drop(skb, sch, to_free);
> >> +  }
> >>  
> > It seems like it might be nice to wrap up this drop loop into a
> > qdisc_drop_all inline function.  Then we can easily drop segments in other
> > locations if we should need to
> 
> 
> Agree, will prepare the patch. I guess we could just add 'segs' to 'to_free'
> list, then add qdisc_drop_all() with stats counter and returning status,
> something like this:
> 
> @@ -824,6 +824,18 @@ static inline void __qdisc_drop(struct sk_buff *skb, struct sk_buff **to_free)
> *to_free = skb;
>  }
> 
> +static inline void __qdisc_drop_all(struct sk_buff *skb,
> +   struct sk_buff **to_free)
> +{
> +   struct sk_buff *first = skb;
> +
> +   while (skb->next)
> +   skb = skb->next;
> +
> +   skb->next = *to_free;
> +   *to_free = first;
> +}
> 
 I agree

Thanks!
Neil

> Thanks,
> Alexey
> 


[PATCH] caif_dev: use true and false for boolean values

2018-03-05 Thread Gustavo A. R. Silva
Assign true or false to boolean variables instead of an integer value.

This issue was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva 
---
 net/caif/caif_dev.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/caif/caif_dev.c b/net/caif/caif_dev.c
index e0adcd1..f2848d6 100644
--- a/net/caif/caif_dev.c
+++ b/net/caif/caif_dev.c
@@ -139,7 +139,7 @@ static void caif_flow_cb(struct sk_buff *skb)
 
spin_lock_bh(&caifd->flow_lock);
send_xoff = caifd->xoff;
-   caifd->xoff = 0;
+   caifd->xoff = false;
dtor = caifd->xoff_skb_dtor;
 
if (WARN_ON(caifd->xoff_skb != skb))
@@ -213,7 +213,7 @@ static int transmit(struct cflayer *layer, struct cfpkt *pkt)
pr_debug("queue has stopped(%d) or is full (%d > %d)\n",
netif_queue_stopped(caifd->netdev),
qlen, high);
-   caifd->xoff = 1;
+   caifd->xoff = true;
caifd->xoff_skb = skb;
caifd->xoff_skb_dtor = skb->destructor;
skb->destructor = caif_flow_cb;
@@ -400,7 +400,7 @@ static int caif_device_notify(struct notifier_block *me, unsigned long what,
break;
}
 
-   caifd->xoff = 0;
+   caifd->xoff = false;
cfcnfg_set_phy_state(cfg, >layer, true);
rcu_read_unlock();
 
@@ -435,7 +435,7 @@ static int caif_device_notify(struct notifier_block *me, unsigned long what,
if (caifd->xoff_skb_dtor != NULL && caifd->xoff_skb != NULL)
caifd->xoff_skb->destructor = caifd->xoff_skb_dtor;
 
-   caifd->xoff = 0;
+   caifd->xoff = false;
caifd->xoff_skb_dtor = NULL;
caifd->xoff_skb = NULL;
 
-- 
2.7.4



Re: [bpf-next PATCH 05/16] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-05 Thread John Fastabend
On 03/05/2018 01:40 PM, David Miller wrote:
> From: John Fastabend 
> Date: Mon, 05 Mar 2018 11:51:22 -0800
> 
>> BPF_PROG_TYPE_SK_MSG supports only two return codes SK_PASS and
>> SK_DROP. Returning SK_DROP free's the copied data in the sendmsg
>> case and in the sendpage case leaves the data untouched. Both cases
>> return -EACESS to the user. Returning SK_PASS will allow the msg to
>> be sent.
>>
>> In the sendmsg case data is copied into kernel space buffers before
>> running the BPF program. In the sendpage case data is never copied.
>> The implication being users may change data after BPF programs run in
>> the sendpage case. (A flag will be added to always copy shortly
>> if the copy must always be performed).
> 
> I don't see how the sendpage case can be right.
> 
> The user can asynchronously change the page contents whenever they
> want, and if the BPF program runs on the old contents then the verdict
> is not for what actually ends up being sent on the socket> 
> There is really no way to cheaply freeze the page contents other than
> to make a copy.
> 

Right, so we have two cases. The first is we are not trying to protect
against malicious users but merely monitor the connection. This case
is primarily for L7 statistics, number of bytes sent to URL foo
for example. If users are changing data (for a real program, not something
malicious) mid-sendfile() this is really buggy anyway. There is no way to
know when/if the data is being copied lower in the stack. Even worse would
be if it changed a msg header, such as the http or kafka header; then
I don't see how such a program would work reliably at all. Some of my
L7 monitoring BPF programs fall into this category.

The second case is we want to implement a strict policy. For example
never allow user 'bar' to send to URL foo. In the current patches this
would be vulnerable to async data changes. I was planning to have a follow
up patch to this series to add a flag "always copy" which handles the
asynchronous case by always copying the data if the BPF policy can
not tolerate user changing data mid-send. Another class of BPF programs
I have fall into this bucket.

However, the performance cost of copy can be significant so allowing the
BPF policy to decide which mode they require seems best to me. I decided
to make the default no-copy to mirror the existing sendpage() semantics
and then to add the flag later. The flag support is not in this series
simply because I wanted to get the base support in first.

Make sense? The default could be to copy sendpage data and then a
flag could be made to allow it to skip the copy. But I prefer the
current defaults.

Thanks,
John



Re: [net-next v3 1/2] bpf, seccomp: Add eBPF filter capabilities

2018-03-05 Thread Sargun Dhillon
On Mon, Mar 5, 2018 at 8:10 AM, Tycho Andersen  wrote:
> Hi Andy,
>
> On Thu, Mar 01, 2018 at 10:05:47PM +, Andy Lutomirski wrote:
>> But Tycho: would hooking user notifiers in right here work for you?
>> As I see it, this would be the best justification for seccomp eBPF.
>
> Sorry for the delay; Sargun had declared on irc that he was going to
> implement it, so I didn't look into it. I think the basics will work,
> but I haven't invested the time to look into it given the above.
>
> Sargun, are you still planning to look at this? What's your timeline?
>
> Cheers,
>
> Tycho
Still working on this. I don't really have a timeline. I think I'll
get to share a prototype by the end of the week. I'm trying to come up
with a common mechanism to do this for multiple types of filters.


Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-05 Thread Alexander Duyck
On Mon, Mar 5, 2018 at 2:30 PM, Jiri Pirko  wrote:
> Mon, Mar 05, 2018 at 05:11:32PM CET, step...@networkplumber.org wrote:
>>On Mon, 5 Mar 2018 10:21:18 +0100
>>Jiri Pirko  wrote:
>>
>>> Sun, Mar 04, 2018 at 10:58:34PM CET, alexander.du...@gmail.com wrote:
>>> >On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko  wrote:
>>> >> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.du...@gmail.com wrote:
>>> >>>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko  wrote:
>>>
>>> [...]
>>>
>>> >
>>> >>>Currently we only have agreement from Michael on taking this code, as
>>> >>>such we are working with virtio only for now. When the time comes that
>>> >>
>>> >> If you do duplication of netvsc in-driver bonding in virtio_net, it will
>>> >> stay there forever. So what you say is: "We will do it halfway now
>>> >> and promise to fix it later". That later will never happen, I'm pretty
>>> >> sure. That is why I push for in-driver bonding shared code as a part of
>>> >> this patchset.
>>> >
>>> >You want this new approach and a copy of netvsc moved into either core
>>> >or some module of its own. I say pick an architecture. We are looking
>>> >at either 2 netdevs or 3. We are not going to support both because
>>> >that will ultimately lead to a terrible user experience and make
>>> >things quite confusing.
>>> >
>>> >> + if you would be pushing first driver to do this, I would understand.
>>> >> But the first driver is already in. You are pushing second. This is the
>>> >> time to do the sharing, unification of behaviour. Next time is too late.
>>> >
>>> >That is great, if we want to share then lets share. But what you are
>>> >essentially telling us is that we need to fork this solution and
>>> >maintain two code paths, one for 2 netdevs, and another for 3. At that
>>> >point what is the point in merging them together?
>>>
>>> Of course, I vote for the same behaviour for netvsc and virtio_net. That
>>> is my point from the very beginning.
>>>
>>> Stephen, what do you think? Could we please make virtio_net and netvsc
>>> behave the same and to use a single code with well-defined checks and
>>> restrictions for this feature?
>>
>>Eventually, yes both could share common code routines. In reality,
>>the failover stuff is only a very small part of either driver so
>>it is not worth stretching to try and cover too much. If you look,
>>the failover code is just using routines that already exist for
>>use by bonding, teaming, etc.
>
> Yeah, our concern was also about the code that processes the netdev
> notifications and does auto-enslave and all related stuff.

The concern was the driver model: do we expose 3 netdevs, or 2 with the
VF driver present? Somehow this is turning into a "merge netvsc into
virtio" thing, and that isn't the question that was being asked.

Ideally we want one model for this. Either 3 netdevs or 2. The problem
is 2 causes issues in terms of performance and will limit features of
virtio, but 2 is the precedent set by netvsc. We need to figure out
the path forward for this. There is talk about "sharing" but it is
hard to make these two approaches share code when they are doing two
very different setups and end up presenting themselves as two very
different driver models.

>>
>>There will always be two drivers, the ring buffers and buffering
>>are very different between vmbus and virtio. It would help to address
>>some of the awkward stuff like queue selection and offload handling
>>in a common way.
>
> Agreed.

There are going to end up being three drivers by the time we are done.
We will end up with netvsc, virtio, and some shared block of
functionality that is used between the two of them. At least that is
the assumption if the two are going to share code. I don't know if
everyone will want to take on the extra overhead for the code shared
between these two drivers being a part of the core net code.


[PATCH net-next] selftests: net: Introduce first PMTU test

2018-03-05 Thread Stefano Brivio
One single test implemented so far: test_pmtu_vti6_exception
checks that the PMTU of a route exception, caused by a tunnel
exceeding the link layer MTU, is affected by administrative
changes of the tunnel MTU. Creation of the route exception is
checked too.

Requested-by: David Ahern 
Signed-off-by: Stefano Brivio 
---
This test will currently fail without "[PATCH net v2] ipv6: Reflect
MTU changes on PMTU of exceptions for MTU-less routes"

 tools/testing/selftests/net/Makefile |   2 +-
 tools/testing/selftests/net/pmtu.sh  | 159 +++
 2 files changed, 160 insertions(+), 1 deletion(-)
 create mode 100755 tools/testing/selftests/net/pmtu.sh

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 229a038966e3..785fc18a16b4 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -5,7 +5,7 @@ CFLAGS =  -Wall -Wl,--no-as-needed -O2 -g
 CFLAGS += -I../../../../usr/include/
 
 TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh rtnetlink.sh
-TEST_PROGS += fib_tests.sh fib-onlink-tests.sh
+TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
 TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh
new file mode 100755
index ..eb186ca3e5e4
--- /dev/null
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -0,0 +1,159 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+#
+# Check that route PMTU values match expectations
+#
+# Tests currently implemented:
+#
+# - test_pmtu_vti6_exception
+#  Set up vti6 tunnel on top of veth, with xfrm states and policies, in two
+#  namespaces with matching endpoints. Check that route exception is
+#  created by exceeding link layer MTU with ping to other endpoint. Then
+#  decrease and increase MTU of tunnel, checking that route exception PMTU
+#  changes accordingly
+
+NS_A="ns-$(mktemp -u XX)"
+NS_B="ns-$(mktemp -u XX)"
+ns_a="ip netns exec ${NS_A}"
+ns_b="ip netns exec ${NS_B}"
+
+veth6_a_addr="fd00:1::a"
+veth6_b_addr="fd00:1::b"
+veth6_mask="64"
+
+vti6_a_addr="fd00:2::a"
+vti6_b_addr="fd00:2::b"
+vti6_mask="64"
+
+setup_namespaces() {
+   ip netns add ${NS_A} || return 1
+   ip netns add ${NS_B}
+
+   return 0
+}
+
+setup_veth() {
+   ${ns_a} ip link add veth_a type veth peer name veth_b || return 1
+   ${ns_a} ip link set veth_b netns ${NS_B}
+   
+   ${ns_a} ip link set veth_a up
+   ${ns_b} ip link set veth_b up
+
+   ${ns_a} ip addr add ${veth6_a_addr}/${veth6_mask} dev veth_a
+   ${ns_b} ip addr add ${veth6_b_addr}/${veth6_mask} dev veth_b
+
+   return 0
+}
+
+setup_vti6() {
+   ${ns_a} ip link add vti_a type vti6 local ${veth6_a_addr} remote ${veth6_b_addr} key 10 || return 1
+   ${ns_b} ip link add vti_b type vti6 local ${veth6_b_addr} remote ${veth6_a_addr} key 10
+
+   ${ns_a} ip link set vti_a up
+   ${ns_b} ip link set vti_b up
+
+   ${ns_a} ip addr add ${vti6_a_addr}/${vti6_mask} dev vti_a
+   ${ns_b} ip addr add ${vti6_b_addr}/${vti6_mask} dev vti_b
+
+   return 0
+}
+
+setup_xfrm() {
+   ${ns_a} ip -6 xfrm state add src ${veth6_a_addr} dst ${veth6_b_addr} spi 0x1000 proto esp aead "rfc4106(gcm(aes))" 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f 128 mode tunnel || return 1
+   ${ns_a} ip -6 xfrm state add src ${veth6_b_addr} dst ${veth6_a_addr} spi 0x1001 proto esp aead "rfc4106(gcm(aes))" 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f 128 mode tunnel
+   ${ns_a} ip -6 xfrm policy add dir out mark 10 tmpl src ${veth6_a_addr} dst ${veth6_b_addr} proto esp mode tunnel
+   ${ns_a} ip -6 xfrm policy add dir in mark 10 tmpl src ${veth6_b_addr} dst ${veth6_a_addr} proto esp mode tunnel
+
+   ${ns_b} ip -6 xfrm state add src ${veth6_a_addr} dst ${veth6_b_addr} spi 0x1000 proto esp aead "rfc4106(gcm(aes))" 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f 128 mode tunnel
+   ${ns_b} ip -6 xfrm state add src ${veth6_b_addr} dst ${veth6_a_addr} spi 0x1001 proto esp aead "rfc4106(gcm(aes))" 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f 128 mode tunnel
+   ${ns_b} ip -6 xfrm policy add dir out mark 10 tmpl src ${veth6_b_addr} dst ${veth6_a_addr} proto esp mode tunnel
+   ${ns_b} ip -6 xfrm policy add dir in mark 10 tmpl src ${veth6_a_addr} dst ${veth6_b_addr} proto esp mode tunnel
+
+   return 0
+}
+
+setup() {
+   tunnel_type="$1"
+
+   [ "$(id -u)" -ne 0 ] && (echo "SKIP: need to run as root" && exit 0)
+
+   setup_namespaces || (echo "SKIP: namespaces not supported" && exit 0)
+   setup_veth || (echo "SKIP: veth not supported" && exit 0)
+
+   case ${tunnel_type} in
+   "vti6")
+   setup_vti6 || (echo "SKIP: vti6 not supported" && exit 0)
+   

Re: [PATCH net-next] dt-bindings: net: dsa: marvell: describe compatibility string

2018-03-05 Thread Andrew Lunn
On Mon, Mar 05, 2018 at 04:05:22PM -0600, Brandon Streiff wrote:
> There are two compatibility strings for mv88e6xxx, but it isn't clear
> from the documentation why only those two exist when the mv88e6xxx driver
> supports more than the 6085 and 6190. Briefly describe how the compatible
> property is used, and provide guidance on which to use.
> 
> The model list comes from looking at port_base_addr values (0x0 vs 0x10)
> in drivers/net/dsa/mv88e6xxx/chip.c.
> 
> Signed-off-by: Brandon Streiff 

Reviewed-by: Andrew Lunn 

Andrew


Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-05 Thread Jiri Pirko
Mon, Mar 05, 2018 at 05:11:32PM CET, step...@networkplumber.org wrote:
>On Mon, 5 Mar 2018 10:21:18 +0100
>Jiri Pirko  wrote:
>
>> Sun, Mar 04, 2018 at 10:58:34PM CET, alexander.du...@gmail.com wrote:
>> >On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko  wrote:  
>> >> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.du...@gmail.com wrote:  
>> >>>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko  wrote:  
>> 
>> [...]
>> 
>> >  
>> >>>Currently we only have agreement from Michael on taking this code, as
>> >>>such we are working with virtio only for now. When the time comes that  
>> >>
>> >> If you do duplication of netvsc in-driver bonding in virtio_net, it will
>> >> stay there forever. So what you say is: "We will do it halfway now
>> >> and promise to fix it later". That later will never happen, I'm pretty
>> >> sure. That is why I push for in-driver bonding shared code as a part of
>> >> this patchset.  
>> >
>> >You want this new approach and a copy of netvsc moved into either core
>> >or some module of its own. I say pick an architecture. We are looking
>> >at either 2 netdevs or 3. We are not going to support both because
>> >that will ultimately lead to a terrible user experience and make
>> >things quite confusing.
>> >  
>> >> + if you would be pushing first driver to do this, I would understand.
>> >> But the first driver is already in. You are pushing second. This is the
>> >> time to do the sharing, unification of behaviour. Next time is too late.  
>> >
>> >That is great, if we want to share then lets share. But what you are
>> >essentially telling us is that we need to fork this solution and
>> >maintain two code paths, one for 2 netdevs, and another for 3. At that
>> >point what is the point in merging them together?  
>> 
>> Of course, I vote for the same behaviour for netvsc and virtio_net. That
>> is my point from the very beginning.
>> 
>> Stephen, what do you think? Could we please make virtio_net and netvsc
>> behave the same and to use a single code with well-defined checks and
>> restrictions for this feature?
>
>Eventually, yes both could share common code routines. In reality,
>the failover stuff is only a very small part of either driver so
>it is not worth stretching to try and cover too much. If you look,
>the failover code is just using routines that already exist for
>use by bonding, teaming, etc.

Yeah, our concern was also about the code that processes the netdev
notifications and does the auto-enslave and all related stuff.


>
>There will always be two drivers, the ring buffers and buffering
>are very different between vmbus and virtio. It would help to address
>some of the awkward stuff like queue selection and offload handling
>in a common way.

Agreed.


>
>Don't worry too much about backports. The backport can use the
>old code if necessary.


Re: [PATCH] netfilter: ipt_ah: return boolean instead of integer

2018-03-05 Thread Gustavo A. R. Silva



On 03/05/2018 04:10 PM, Pablo Neira Ayuso wrote:

On Tue, Feb 13, 2018 at 08:25:57AM -0600, Gustavo A. R. Silva wrote:

Return statements in functions returning bool should use
true/false instead of 1/0.

This issue was detected with the help of Coccinelle.

This one didn't get in time for the previous merge window.

Now applied, thanks.

Great.
Thanks, Pablo.
--
Gustavo


[PATCH] tipc: bcast: use true and false for boolean values

2018-03-05 Thread Gustavo A. R. Silva
Assign true or false to boolean variables instead of an integer value.

This issue was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva 
---
 net/tipc/bcast.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
index 37892b3..f371117 100644
--- a/net/tipc/bcast.c
+++ b/net/tipc/bcast.c
@@ -574,5 +574,5 @@ void tipc_nlist_purge(struct tipc_nlist *nl)
 {
	tipc_dest_list_purge(&nl->list);
nl->remote = 0;
-   nl->local = 0;
+   nl->local = false;
 }
-- 
2.7.4
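The conversions in this family of patches can be produced by a small
Coccinelle semantic patch along the following lines (a sketch for
illustration; the exact rule used is not shown in the mails):

```
@@
bool b;
@@
(
- b = 0
+ b = false
|
- b = 1
+ b = true
)
```

Running spatch with such a rule file over the tree (e.g. with --in-place)
would then emit diffs like the one above.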



Re: [PATCH trivial resend]] netfilter: xt_limit: Spelling s/maxmum/maximum/

2018-03-05 Thread Pablo Neira Ayuso
Applied this spelling fix, thanks.


Re: [Patch nf-next] netfilter: make xt_rateest hash table per net

2018-03-05 Thread Pablo Neira Ayuso
On Thu, Mar 01, 2018 at 08:21:52PM -0800, Eric Dumazet wrote:
> On Thu, 2018-03-01 at 18:58 -0800, Cong Wang wrote:
> > As suggested by Eric, we need to make the xt_rateest
> > hash table and its lock per netns to reduce lock
> > contentions.
> > 
> > Cc: Florian Westphal 
> > Cc: Eric Dumazet 
> > Cc: Pablo Neira Ayuso 
> > Signed-off-by: Cong Wang 
> > ---
> >  include/net/netfilter/xt_rateest.h |  4 +-
> >  net/netfilter/xt_RATEEST.c | 91 
> > +++---
> >  net/netfilter/xt_rateest.c | 10 ++---
> >  3 files changed, 72 insertions(+), 33 deletions(-)
> 
> Very nice, thanks !
> 
> Reviewed-by: Eric Dumazet 

Applied, thanks!


[PATCH] xfrm_policy: use true and false for boolean values

2018-03-05 Thread Gustavo A. R. Silva
Assign true or false to boolean variables instead of an integer value.

This issue was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva 
---
 net/xfrm/xfrm_policy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index eb88a7d..8a0ac6a 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1743,7 +1743,7 @@ static void xfrm_pcpu_work_fn(struct work_struct *work)
 void xfrm_policy_cache_flush(void)
 {
struct xfrm_dst *old;
-   bool found = 0;
+   bool found = false;
int cpu;
 
might_sleep();
-- 
2.7.4



[PATCH] ipv6: ndisc: use true and false for boolean values

2018-03-05 Thread Gustavo A. R. Silva
Assign true or false to boolean variables instead of an integer value.

This issue was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva 
---
 net/ipv6/ndisc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 0a19ce3..8af5eef 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -527,7 +527,7 @@ void ndisc_send_na(struct net_device *dev, const struct 
in6_addr *daddr,
}
 
if (!dev->addr_len)
-   inc_opt = 0;
+   inc_opt = false;
if (inc_opt)
optlen += ndisc_opt_addr_space(dev,
   NDISC_NEIGHBOUR_ADVERTISEMENT);
-- 
2.7.4



Re: [PATCH net] netfilter: unlock xt_table earlier in __do_replace

2018-03-05 Thread Pablo Neira Ayuso
On Fri, Feb 16, 2018 at 12:25:56PM +0100, Xin Long wrote:
> On Fri, Feb 16, 2018 at 12:02 PM, Florian Westphal  wrote:
> > Xin Long  wrote:
[...]
> >> Besides, all xt_target/match checkentry is called out of xt_table
> >> lock. It's better also to move all cleanup_entry calling out of
> >> xt_table lock, just as do_replace_finish does for ebtables.
> >
> > Agree but I don't see how this patch fixes a bug so I would prefer if
> > this could simmer in nf-next first.
>
> Sure. No bug fix, it's an improvement.

Applied to nf-next, thanks.


Re: [PATCH] netfilter: ipt_ah: return boolean instead of integer

2018-03-05 Thread Pablo Neira Ayuso
On Tue, Feb 13, 2018 at 08:25:57AM -0600, Gustavo A. R. Silva wrote:
> Return statements in functions returning bool should use
> true/false instead of 1/0.
> 
> This issue was detected with the help of Coccinelle.

This one didn't get in time for the previous merge window.

Now applied, thanks.


[PATCH net-next] dt-bindings: net: dsa: marvell: describe compatibility string

2018-03-05 Thread Brandon Streiff
There are two compatibility strings for mv88e6xxx, but it isn't clear
from the documentation why only those two exist when the mv88e6xxx driver
supports more than the 6085 and 6190. Briefly describe how the compatible
property is used, and provide guidance on which to use.

The model list comes from looking at port_base_addr values (0x0 vs 0x10)
in drivers/net/dsa/mv88e6xxx/chip.c.

Signed-off-by: Brandon Streiff 
---
 Documentation/devicetree/bindings/net/dsa/marvell.txt | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/net/dsa/marvell.txt 
b/Documentation/devicetree/bindings/net/dsa/marvell.txt
index 1d4d0f4..caf71e2 100644
--- a/Documentation/devicetree/bindings/net/dsa/marvell.txt
+++ b/Documentation/devicetree/bindings/net/dsa/marvell.txt
@@ -13,9 +13,18 @@ placed as a child node of an mdio device.
 The properties described here are those specific to Marvell devices.
 Additional required and optional properties can be found in dsa.txt.
 
+The compatibility string is used only to find an identification register,
+which is at a different MDIO base address in different switch families.
+- "marvell,mv88e6085"  : Switch has base address 0x10. Use with models:
+ 6085, 6095, 6097, 6123, 6131, 6141, 6161, 6165,
+ 6171, 6172, 6175, 6176, 6185, 6240, 6320, 6321,
+ 6341, 6350, 6351, 6352
+- "marvell,mv88e6190"  : Switch has base address 0x00. Use with models:
+ 6190, 6190X, 6191, 6290, 6390, 6390X
+
 Required properties:
 - compatible   : Should be one of "marvell,mv88e6085" or
- "marvell,mv88e6190"
+ "marvell,mv88e6190" as indicated above
 - reg  : Address on the MII bus for the switch.
 
 Optional properties:
-- 
2.1.4
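As an illustration, a board with a 6352-based switch would use the first
string, since the 6352 exposes its identification register at base address
0x10 (hypothetical snippet for illustration only; the required and optional
properties are described in dsa.txt):

```
switch@0 {
	compatible = "marvell,mv88e6085";
	reg = <0>;	/* address of the switch on the MII bus */
};
```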



Re: [RFC net-next 0/6] offload linux bonding tc ingress rules

2018-03-05 Thread Jakub Kicinski
On Mon,  5 Mar 2018 13:28:28 +, John Hurley wrote:
> The linux bond itself registers a cb for offloading tc rules. Potential
> slave netdevs on offload devices can then register with the bond for a
> further callback - this code is basically the same as registering for an
> egress dev offload in TC. Then when a rule is offloaded to the bond, it
> can be relayed to each netdev that has registered with the bond code and
> which is a slave of the given bond.

As you know I would much rather see this handled in the TC core,
similarly to how blocks are shared.  We can add a new .ndo_setup_tc
notification like TC_MASTER_BLOCK_BIND and reuse the existing offload
tracking.  It would also fix the problem of freezing the bond and allow
better code reuse with team etc.

For tunnel offloads we necessarily have to stick to the weak offload
model, where any offload success satisfies skip_sw, but in case of
bonds we should strive for the strong model (as you are doing AFAICT).

The only difficulty seems to be replaying the bind commands when a
port joins, i.e. finding all blocks on a bond.  But that should be
surmountable.


[PATCH] ipvs: use true and false for boolean values

2018-03-05 Thread Gustavo A. R. Silva
Assign true or false to boolean variables instead of an integer value.

This issue was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva 
---
 net/netfilter/ipvs/ip_vs_lblc.c  | 4 ++--
 net/netfilter/ipvs/ip_vs_lblcr.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_lblc.c b/net/netfilter/ipvs/ip_vs_lblc.c
index 6a340c9..942e835 100644
--- a/net/netfilter/ipvs/ip_vs_lblc.c
+++ b/net/netfilter/ipvs/ip_vs_lblc.c
@@ -238,7 +238,7 @@ static void ip_vs_lblc_flush(struct ip_vs_service *svc)
int i;
 
	spin_lock_bh(&svc->sched_lock);
-   tbl->dead = 1;
+   tbl->dead = true;
for (i = 0; i < IP_VS_LBLC_TAB_SIZE; i++) {
		hlist_for_each_entry_safe(en, next, &tbl->bucket[i], list) {
ip_vs_lblc_del(en);
@@ -369,7 +369,7 @@ static int ip_vs_lblc_init_svc(struct ip_vs_service *svc)
tbl->max_size = IP_VS_LBLC_TAB_SIZE*16;
tbl->rover = 0;
tbl->counter = 1;
-   tbl->dead = 0;
+   tbl->dead = false;
tbl->svc = svc;
 
/*
diff --git a/net/netfilter/ipvs/ip_vs_lblcr.c b/net/netfilter/ipvs/ip_vs_lblcr.c
index 0627881..a5acab2 100644
--- a/net/netfilter/ipvs/ip_vs_lblcr.c
+++ b/net/netfilter/ipvs/ip_vs_lblcr.c
@@ -404,7 +404,7 @@ static void ip_vs_lblcr_flush(struct ip_vs_service *svc)
struct hlist_node *next;
 
	spin_lock_bh(&svc->sched_lock);
-   tbl->dead = 1;
+   tbl->dead = true;
for (i = 0; i < IP_VS_LBLCR_TAB_SIZE; i++) {
		hlist_for_each_entry_safe(en, next, &tbl->bucket[i], list) {
ip_vs_lblcr_free(en);
@@ -532,7 +532,7 @@ static int ip_vs_lblcr_init_svc(struct ip_vs_service *svc)
tbl->max_size = IP_VS_LBLCR_TAB_SIZE*16;
tbl->rover = 0;
tbl->counter = 1;
-   tbl->dead = 0;
+   tbl->dead = false;
tbl->svc = svc;
 
/*
-- 
2.7.4



Re: [PATCH 14/36] aio: implement IOCB_CMD_POLL

2018-03-05 Thread Jeff Moyer
Christoph Hellwig  writes:

> Simple one-shot poll through the io_submit() interface.  To poll for
> a file descriptor the application should submit an iocb of type
> IOCB_CMD_POLL.  It will poll the fd for the events specified in the
> first 32 bits of the aio_buf field of the iocb.
>
> Unlike poll or epoll without EPOLLONESHOT this interface always works
> in one shot mode, that is once the iocb is completed, it will have to be
> resubmitted.
>
> Signed-off-by: Christoph Hellwig 

Also acked this one in the last posting.

Acked-by: Jeff Moyer 


> ---
>  fs/aio.c | 102 
> +++
>  include/uapi/linux/aio_abi.h |   6 +--
>  2 files changed, 104 insertions(+), 4 deletions(-)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index da87cbf7c67a..0bafc4975d51 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -5,6 +5,7 @@
>   *   Implements an efficient asynchronous io interface.
>   *
>   *   Copyright 2000, 2001, 2002 Red Hat, Inc.  All Rights Reserved.
> + *   Copyright 2018 Christoph Hellwig.
>   *
>   *   See ../COPYING for licensing terms.
>   */
> @@ -156,9 +157,17 @@ struct kioctx {
> 	unsigned		id;
>  };
>  
> +struct poll_iocb {
> + struct file *file;
> +	__poll_t		events;
> + struct wait_queue_head  *head;
> + struct wait_queue_entry wait;
> +};
> +
>  struct aio_kiocb {
>   union {
> 		struct kiocb		rw;
> +		struct poll_iocb	poll;
>   };
>  
>   struct kioctx   *ki_ctx;
> @@ -1565,6 +1574,96 @@ static ssize_t aio_write(struct kiocb *req, struct 
> iocb *iocb, bool vectored,
>   return ret;
>  }
>  
> +static void __aio_complete_poll(struct poll_iocb *req, __poll_t mask)
> +{
> + fput(req->file);
> + aio_complete(container_of(req, struct aio_kiocb, poll),
> + mangle_poll(mask), 0);
> +}
> +
> +static void aio_complete_poll(struct poll_iocb *req, __poll_t mask)
> +{
> + struct aio_kiocb *iocb = container_of(req, struct aio_kiocb, poll);
> +
> + if (!(iocb->flags & AIO_IOCB_CANCELLED))
> + __aio_complete_poll(req, mask);
> +}
> +
> +static int aio_poll_cancel(struct kiocb *rw)
> +{
> + struct aio_kiocb *iocb = container_of(rw, struct aio_kiocb, rw);
> +
> +	remove_wait_queue(iocb->poll.head, &iocb->poll.wait);
> +	__aio_complete_poll(&iocb->poll, 0); /* no events to report */
> + return 0;
> +}
> +
> +static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int 
> sync,
> + void *key)
> +{
> + struct poll_iocb *req = container_of(wait, struct poll_iocb, wait);
> + struct file *file = req->file;
> + __poll_t mask = key_to_poll(key);
> +
> +	assert_spin_locked(&req->head->lock);
> +
> + /* for instances that support it check for an event match first: */
> + if (mask && !(mask & req->events))
> + return 0;
> +
> + mask = vfs_poll_mask(file, req->events);
> + if (!mask)
> + return 0;
> +
> +	__remove_wait_queue(req->head, &req->wait);
> + aio_complete_poll(req, mask);
> + return 1;
> +}
> +
> +static ssize_t aio_poll(struct aio_kiocb *aiocb, struct iocb *iocb)
> +{
> +	struct poll_iocb *req = &aiocb->poll;
> + unsigned long flags;
> + __poll_t mask;
> +
> + /* reject any unknown events outside the normal event mask. */
> + if ((u16)iocb->aio_buf != iocb->aio_buf)
> + return -EINVAL;
> + /* reject fields that are not defined for poll */
> + if (iocb->aio_offset || iocb->aio_nbytes || iocb->aio_rw_flags)
> + return -EINVAL;
> +
> + req->events = demangle_poll(iocb->aio_buf) | POLLERR | POLLHUP;
> + req->file = fget(iocb->aio_fildes);
> + if (unlikely(!req->file))
> + return -EBADF;
> +
> + req->head = vfs_get_poll_head(req->file, req->events);
> + if (!req->head) {
> + fput(req->file);
> + return -EINVAL; /* same as no support for IOCB_CMD_POLL */
> + }
> + if (IS_ERR(req->head)) {
> + mask = PTR_TO_POLL(req->head);
> + goto done;
> + }
> +
> +	init_waitqueue_func_entry(&req->wait, aio_poll_wake);
> +
> +	spin_lock_irqsave(&req->head->lock, flags);
> + mask = vfs_poll_mask(req->file, req->events);
> + if (!mask) {
> + __kiocb_set_cancel_fn(aiocb, aio_poll_cancel,
> + AIO_IOCB_DELAYED_CANCEL);
> +		__add_wait_queue(req->head, &req->wait);
> + }
> +	spin_unlock_irqrestore(&req->head->lock, flags);
> +done:
> + if (mask)
> + aio_complete_poll(req, mask);
> + return -EIOCBQUEUED;
> +}
> +
>  static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
>struct iocb *iocb, bool compat)
>  {
> @@ -1628,6 +1727,9 @@ static int io_submit_one(struct kioctx *ctx, struct 
> iocb __user *user_iocb,
>   case 

Re: [PATCH 08/36] aio: implement io_pgetevents

2018-03-05 Thread Jeff Moyer
Christoph Hellwig  writes:

> This is the io_getevents equivalent of ppoll/pselect and allows one to
> properly mix signals and aio completions (especially with IOCB_CMD_POLL)
> and atomically executes the following sequence:
>
>   sigset_t origmask;
>
>   pthread_sigmask(SIG_SETMASK, , );
>   ret = io_getevents(ctx, min_nr, nr, events, timeout);
>   pthread_sigmask(SIG_SETMASK, , NULL);
>
> Note that unlike many other signal related calls we do not pass a sigmask
> size, as that would get us to 7 arguments, which aren't easily supported
> by the syscall infrastructure.  It seems a lot less painful to just add a
> new syscall variant in the unlikely case we're going to increase the
> sigset size.
>
> Signed-off-by: Christoph Hellwig 

I acked this in the last set, so...

Acked-by: Jeff Moyer 

> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  fs/aio.c   | 114 
> ++---
>  include/linux/compat.h |   7 ++
>  include/linux/syscalls.h   |   6 ++
>  include/uapi/asm-generic/unistd.h  |   4 +-
>  include/uapi/linux/aio_abi.h   |   6 ++
>  kernel/sys_ni.c|   2 +
>  8 files changed, 130 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
> b/arch/x86/entry/syscalls/syscall_32.tbl
> index 448ac2161112..5997c3e9ac3e 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -391,3 +391,4 @@
>  382	i386	pkey_free	sys_pkey_free
>  383	i386	statx		sys_statx
>  384	i386	arch_prctl	sys_arch_prctl		compat_sys_arch_prctl
> +385	i386	io_pgetevents	sys_io_pgetevents	compat_sys_io_pgetevents
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> b/arch/x86/entry/syscalls/syscall_64.tbl
> index 5aef183e2f85..e995cd2b4e65 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -339,6 +339,7 @@
>  330	common	pkey_alloc	sys_pkey_alloc
>  331	common	pkey_free	sys_pkey_free
>  332	common	statx		sys_statx
> +333	common	io_pgetevents	sys_io_pgetevents
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/aio.c b/fs/aio.c
> index 9d7d6e4cde87..da87cbf7c67a 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1291,10 +1291,6 @@ static long read_events(struct kioctx *ctx, long 
> min_nr, long nr,
>   wait_event_interruptible_hrtimeout(ctx->wait,
> 		aio_read_events(ctx, min_nr, nr, event, &ret),
>   until);
> -
> - if (!ret && signal_pending(current))
> - ret = -EINTR;
> -
>   return ret;
>  }
>  
> @@ -1874,13 +1870,60 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id,
>   struct timespec __user *, timeout)
>  {
>   struct timespec64   ts;
> + int ret;
> +
> +	if (timeout && unlikely(get_timespec64(&ts, timeout)))
> + return -EFAULT;
> +
> +	ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL);
> + if (!ret && signal_pending(current))
> + ret = -EINTR;
> + return ret;
> +}
> +
> +SYSCALL_DEFINE6(io_pgetevents,
> + aio_context_t, ctx_id,
> + long, min_nr,
> + long, nr,
> + struct io_event __user *, events,
> + struct timespec __user *, timeout,
> + const struct __aio_sigset __user *, usig)
> +{
> + struct __aio_sigset ksig = { NULL, };
> +	sigset_t		ksigmask, sigsaved;
> + struct timespec64   ts;
> + int ret;
> +
> +	if (timeout && unlikely(get_timespec64(&ts, timeout)))
> + return -EFAULT;
>  
> - if (timeout) {
> -		if (unlikely(get_timespec64(&ts, timeout)))
> +	if (usig && copy_from_user(&ksig, usig, sizeof(ksig)))
> + return -EFAULT;
> +
> + if (ksig.sigmask) {
> + if (ksig.sigsetsize != sizeof(sigset_t))
> + return -EINVAL;
> +		if (copy_from_user(&ksigmask, ksig.sigmask, sizeof(ksigmask)))
>   			return -EFAULT;
> +		sigdelsetmask(&ksigmask, sigmask(SIGKILL) | sigmask(SIGSTOP));
> +		sigprocmask(SIG_SETMASK, &ksigmask, &sigsaved);
> + }
> +
> +	ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL);
> + if (signal_pending(current)) {
> + if (ksig.sigmask) {
> + current->saved_sigmask = sigsaved;
> + set_restore_sigmask();
> + }
> +
> + if (!ret)
> + ret = -ERESTARTNOHAND;
> + } else {
> + if (ksig.sigmask)
> + 

[PATCH V3 net] qed: Free RoCE ILT Memory on rmmod qedr

2018-03-05 Thread Michal Kalderon
RDMA requires ILT memory to be allocated for its QPs.
Each ILT entry points to a page used by several RDMA QPs.
To avoid allocating all the memory in advance, the RDMA
implementation dynamically allocates memory as more QPs are
added; however, it does not dynamically free the memory.
The memory should have been freed on rmmod qedr, but isn't.
This patch adds the memory freeing on rmmod qedr (currently
it is freed only when qed is removed).

An outcome of this bug is that if qedr is unloaded and loaded
without unloading qed, there will be no more RoCE traffic.

The reason these are related is that the logic for detecting
the first QP ever opened is to ask whether ILT memory for RoCE
has been allocated.

In addition, this patch modifies freeing of the task context to
always use PROTOCOLID_ROCE and not the protocol passed; this is
because the task context for iWARP and RoCE both use the RoCE
protocol id, as opposed to the connection context.

Fixes: dbb799c39717 ("qed: Initialize hardware for new protocols")

Signed-off-by: Michal Kalderon 
Signed-off-by: Ariel Elior 
---
Difference from V2:

Fixed broken parenthesis in comment


---
 drivers/net/ethernet/qlogic/qed/qed_cxt.c  | 5 -
 drivers/net/ethernet/qlogic/qed/qed_rdma.c | 1 +
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_cxt.c 
b/drivers/net/ethernet/qlogic/qed/qed_cxt.c
index 6f546e8..b6f55bc 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_cxt.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_cxt.c
@@ -2480,7 +2480,10 @@ int qed_cxt_free_proto_ilt(struct qed_hwfn *p_hwfn, enum 
protocol_type proto)
if (rc)
return rc;
 
-   /* Free Task CXT */
+   /* Free Task CXT ( Intentionally RoCE as task-id is shared between
+* RoCE and iWARP )
+*/
+   proto = PROTOCOLID_ROCE;
rc = qed_cxt_free_ilt_range(p_hwfn, QED_ELEM_TASK, 0,
qed_cxt_get_proto_tid_count(p_hwfn, proto));
if (rc)
diff --git a/drivers/net/ethernet/qlogic/qed/qed_rdma.c 
b/drivers/net/ethernet/qlogic/qed/qed_rdma.c
index 5d040b8..f3ee653 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_rdma.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_rdma.c
@@ -380,6 +380,7 @@ static void qed_rdma_free(struct qed_hwfn *p_hwfn)
 
qed_rdma_free_reserved_lkey(p_hwfn);
qed_rdma_resc_free(p_hwfn);
+   qed_cxt_free_proto_ilt(p_hwfn, p_hwfn->p_rdma_info->proto);
 }
 
 static void qed_rdma_get_guid(struct qed_hwfn *p_hwfn, u8 *guid)
-- 
2.9.5



RE: [PATCH V2 net] qed: Free RoCE ILT Memory on rmmod qedr

2018-03-05 Thread Kalderon, Michal
> From: Yuval Mintz [mailto:yuv...@mellanox.com]
> Sent: Monday, March 05, 2018 11:24 PM
> To: Kalderon, Michal ;
> da...@davemloft.net
> Cc: netdev@vger.kernel.org; dledf...@redhat.com; Jason Gunthorpe
> ; linux-r...@vger.kernel.org; Elior, Ariel
> 
> Subject: RE: [PATCH V2 net] qed: Free RoCE ILT Memory on rmmod qedr
> 
> > -   /* Free Task CXT */
> > +   /* Free Task CXT ( Intentionally RoCE as task-id is shared between
> > +* RoCE and iWARP
> > +*/
> 
> Broken parenthesis In comment...
Thanks Yuval, V3 on its way


[PATCH 06/36] aio: delete iocbs from the active_reqs list in kiocb_cancel

2018-03-05 Thread Christoph Hellwig
Once we cancel an iocb there is no reason to keep it on the active_reqs
list, given that the list is only used to look for cancelation candidates.

Signed-off-by: Christoph Hellwig 
Acked-by: Jeff Moyer 
---
 fs/aio.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 2d40cf5dd4ec..0b6394b4e528 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -561,6 +561,8 @@ static int kiocb_cancel(struct aio_kiocb *kiocb)
 {
kiocb_cancel_fn *cancel = kiocb->ki_cancel;
 
+	list_del_init(&kiocb->ki_list);
+
if (!cancel)
return -EINVAL;
kiocb->ki_cancel = NULL;
@@ -607,8 +609,6 @@ static void free_ioctx_users(struct percpu_ref *ref)
	while (!list_empty(&ctx->active_reqs)) {
		req = list_first_entry(&ctx->active_reqs,
   struct aio_kiocb, ki_list);
-
-	list_del_init(&req->ki_list);
kiocb_cancel(req);
}
 
-- 
2.14.2



aio poll, io_pgetevents and a new in-kernel poll API V5

2018-03-05 Thread Christoph Hellwig
Hi all,

this series adds support for the IOCB_CMD_POLL operation to poll for the
readiness of file descriptors using the aio subsystem.  The API is based
on patches that existed in RHAS2.1 and RHEL3, which means it already is
supported by libaio.  To implement the poll support efficiently new
methods to poll are introduced in struct file_operations:  get_poll_head
and poll_mask.  The first one returns a wait_queue_head to wait on
(lifetime is bound by the file), and the second does a non-blocking
check for the POLL* events.  This allows aio poll to work without
any additional context switches, unlike epoll.

To make the interface fully useful a new io_pgetevents system call is
added, which atomically saves and restores the signal mask over the
io_getevents operation.  It is the logical equivalent of pselect and
ppoll for io_getevents.

The corresponding libaio changes for io_pgetevents support and
documentation, as well as a test case will be posted in a separate
series.

The changes were sponsored by ScyllaDB, and improve performance
of the seastar framework by up to 10%, while also removing the need
for a privileged SCHED_FIFO epoll listener thread.

git://git.infradead.org/users/hch/vfs.git aio-poll.5

Gitweb:

http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/aio-poll.5

Libaio changes:

https://pagure.io/libaio.git io-poll

Seastar changes (not updated for the new io_pgetevents ABI yet):

https://github.com/avikivity/seastar/commits/aio

Changes since V4:
 - rebased ontop of Linux 4.16-rc4

Changes since V3:
 - remove the pre-sleep ->poll_mask call in vfs_poll,
   allow ->get_poll_head to return POLL* values.

Changes since V2:
 - removed a double initialization
 - new vfs_get_poll_head helper
 - document that ->get_poll_head can return NULL
 - call ->poll_mask before sleeping
 - various ACKs
 - add conversion of random to ->poll_mask
 - add conversion of af_alg to ->poll_mask
 - lacking ->poll_mask support now returns -EINVAL for IOCB_CMD_POLL
 - reshuffled the series so that prep patches and everything not
   requiring the new in-kernel poll API is in the beginning

Changes since V1:
 - handle the NULL ->poll case in vfs_poll
 - dropped the file argument to the ->poll_mask socket operation
 - replace the ->pre_poll socket operation with ->get_poll_head as
   in the file operations


[PATCH 08/36] aio: implement io_pgetevents

2018-03-05 Thread Christoph Hellwig
This is the io_getevents equivalent of ppoll/pselect and allows one to
properly mix signals and aio completions (especially with IOCB_CMD_POLL)
and atomically executes the following sequence:

sigset_t origmask;

pthread_sigmask(SIG_SETMASK, , );
ret = io_getevents(ctx, min_nr, nr, events, timeout);
pthread_sigmask(SIG_SETMASK, , NULL);

Note that unlike many other signal related calls we do not pass a sigmask
size, as that would get us to 7 arguments, which aren't easily supported
by the syscall infrastructure.  It seems a lot less painful to just add a
new syscall variant in the unlikely case we're going to increase the
sigset size.

Signed-off-by: Christoph Hellwig 
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/aio.c   | 114 ++---
 include/linux/compat.h |   7 ++
 include/linux/syscalls.h   |   6 ++
 include/uapi/asm-generic/unistd.h  |   4 +-
 include/uapi/linux/aio_abi.h   |   6 ++
 kernel/sys_ni.c|   2 +
 8 files changed, 130 insertions(+), 11 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
b/arch/x86/entry/syscalls/syscall_32.tbl
index 448ac2161112..5997c3e9ac3e 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -391,3 +391,4 @@
 382	i386	pkey_free		sys_pkey_free
 383	i386	statx			sys_statx
 384	i386	arch_prctl		sys_arch_prctl			compat_sys_arch_prctl
+385	i386	io_pgetevents		sys_io_pgetevents		compat_sys_io_pgetevents
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index 5aef183e2f85..e995cd2b4e65 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -339,6 +339,7 @@
 330	common	pkey_alloc		sys_pkey_alloc
 331	common	pkey_free		sys_pkey_free
 332	common	statx			sys_statx
+333	common	io_pgetevents		sys_io_pgetevents
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/aio.c b/fs/aio.c
index 9d7d6e4cde87..da87cbf7c67a 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1291,10 +1291,6 @@ static long read_events(struct kioctx *ctx, long min_nr, 
long nr,
wait_event_interruptible_hrtimeout(ctx->wait,
		aio_read_events(ctx, min_nr, nr, event, &ret),
until);
-
-   if (!ret && signal_pending(current))
-   ret = -EINTR;
-
return ret;
 }
 
@@ -1874,13 +1870,60 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id,
struct timespec __user *, timeout)
 {
struct timespec64   ts;
+   int ret;
+
+	if (timeout && unlikely(get_timespec64(&ts, timeout)))
+   return -EFAULT;
+
+	ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL);
+   if (!ret && signal_pending(current))
+   ret = -EINTR;
+   return ret;
+}
+
+SYSCALL_DEFINE6(io_pgetevents,
+   aio_context_t, ctx_id,
+   long, min_nr,
+   long, nr,
+   struct io_event __user *, events,
+   struct timespec __user *, timeout,
+   const struct __aio_sigset __user *, usig)
+{
+   struct __aio_sigset ksig = { NULL, };
+	sigset_t		ksigmask, sigsaved;
+   struct timespec64   ts;
+   int ret;
+
+	if (timeout && unlikely(get_timespec64(&ts, timeout)))
+   return -EFAULT;
 
-   if (timeout) {
-	if (unlikely(get_timespec64(&ts, timeout)))
+	if (usig && copy_from_user(&ksig, usig, sizeof(ksig)))
+   return -EFAULT;
+
+   if (ksig.sigmask) {
+   if (ksig.sigsetsize != sizeof(sigset_t))
+   return -EINVAL;
+		if (copy_from_user(&ksigmask, ksig.sigmask, sizeof(ksigmask)))
 			return -EFAULT;
+		sigdelsetmask(&ksigmask, sigmask(SIGKILL) | sigmask(SIGSTOP));
+		sigprocmask(SIG_SETMASK, &ksigmask, &sigsaved);
+   }
+
+	ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL);
+   if (signal_pending(current)) {
+   if (ksig.sigmask) {
+   current->saved_sigmask = sigsaved;
+   set_restore_sigmask();
+   }
+
+   if (!ret)
+   ret = -ERESTARTNOHAND;
+   } else {
+   if (ksig.sigmask)
+			sigprocmask(SIG_SETMASK, &sigsaved, NULL);
}
 
-	return do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL);
+   return ret;
 }
 
 #ifdef CONFIG_COMPAT
@@ -1891,13 +1934,64 @@ COMPAT_SYSCALL_DEFINE5(io_getevents, 
compat_aio_context_t, ctx_id,
 

[PATCH 07/36] aio: add delayed cancel support

2018-03-05 Thread Christoph Hellwig
The upcoming aio poll support would like to be able to complete the
iocb inline from the cancellation context, but that would cause
a lock order reversal.  Add support for optionally moving the cancelation
outside the context lock to avoid this reversal.

Signed-off-by: Christoph Hellwig 
Acked-by: Jeff Moyer 
---
 fs/aio.c | 49 ++---
 1 file changed, 38 insertions(+), 11 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 0b6394b4e528..9d7d6e4cde87 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -170,6 +170,10 @@ struct aio_kiocb {
	struct list_head	ki_list;	/* the aio core uses this
						 * for cancellation */

+	unsigned int		flags;		/* protected by ctx->ctx_lock */
+#define AIO_IOCB_DELAYED_CANCEL	(1 << 0)
+#define AIO_IOCB_CANCELLED	(1 << 1)
+
/*
 * If the aio_resfd field of the userspace iocb is not zero,
 * this is the underlying eventfd context to deliver events to.
@@ -536,9 +540,9 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int 
nr_events)
 #define AIO_EVENTS_FIRST_PAGE  ((PAGE_SIZE - sizeof(struct aio_ring)) / 
sizeof(struct io_event))
 #define AIO_EVENTS_OFFSET  (AIO_EVENTS_PER_PAGE - AIO_EVENTS_FIRST_PAGE)
 
-void kiocb_set_cancel_fn(struct kiocb *iocb, kiocb_cancel_fn *cancel)
+static void __kiocb_set_cancel_fn(struct aio_kiocb *req,
+   kiocb_cancel_fn *cancel, unsigned int iocb_flags)
 {
-   struct aio_kiocb *req = container_of(iocb, struct aio_kiocb, rw);
struct kioctx *ctx = req->ki_ctx;
unsigned long flags;
 
@@ -548,8 +552,15 @@ void kiocb_set_cancel_fn(struct kiocb *iocb, 
kiocb_cancel_fn *cancel)
	spin_lock_irqsave(&ctx->ctx_lock, flags);
	list_add_tail(&req->ki_list, &ctx->active_reqs);
req->ki_cancel = cancel;
+   req->flags |= iocb_flags;
	spin_unlock_irqrestore(&ctx->ctx_lock, flags);
 }
+
+void kiocb_set_cancel_fn(struct kiocb *iocb, kiocb_cancel_fn *cancel)
+{
+   return __kiocb_set_cancel_fn(container_of(iocb, struct aio_kiocb, rw),
+   cancel, 0);
+}
 EXPORT_SYMBOL(kiocb_set_cancel_fn);
 
 /*
@@ -603,17 +614,27 @@ static void free_ioctx_users(struct percpu_ref *ref)
 {
struct kioctx *ctx = container_of(ref, struct kioctx, users);
struct aio_kiocb *req;
+   LIST_HEAD(list);
 
	spin_lock_irq(&ctx->ctx_lock);
-
	while (!list_empty(&ctx->active_reqs)) {
		req = list_first_entry(&ctx->active_reqs,
   struct aio_kiocb, ki_list);
-   kiocb_cancel(req);
-   }
 
+   if (req->flags & AIO_IOCB_DELAYED_CANCEL) {
+   req->flags |= AIO_IOCB_CANCELLED;
+			list_move_tail(&req->ki_list, &list);
+   } else {
+   kiocb_cancel(req);
+   }
+   }
	spin_unlock_irq(&ctx->ctx_lock);
 
+	while (!list_empty(&list)) {
+		req = list_first_entry(&list, struct aio_kiocb, ki_list);
+   kiocb_cancel(req);
+   }
+
	percpu_ref_kill(&ctx->reqs);
	percpu_ref_put(&ctx->reqs);
 }
@@ -1785,15 +1806,22 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, 
struct iocb __user *, iocb,
if (unlikely(!ctx))
return -EINVAL;
 
-	spin_lock_irq(&ctx->ctx_lock);
+	ret = -EINVAL;
 
+	spin_lock_irq(&ctx->ctx_lock);
kiocb = lookup_kiocb(ctx, iocb, key);
+   if (kiocb) {
+   if (kiocb->flags & AIO_IOCB_DELAYED_CANCEL) {
+   kiocb->flags |= AIO_IOCB_CANCELLED;
+   } else {
+   ret = kiocb_cancel(kiocb);
+   kiocb = NULL;
+   }
+   }
+	spin_unlock_irq(&ctx->ctx_lock);
+
	if (kiocb)
		ret = kiocb_cancel(kiocb);
-	else
-		ret = -EINVAL;
-
-	spin_unlock_irq(&ctx->ctx_lock);
 
if (!ret) {
/*
@@ -1805,7 +1833,6 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct 
iocb __user *, iocb,
}
 
	percpu_ref_put(&ctx->users);
-
return ret;
 }
 
-- 
2.14.2



Re: [RFC net-next 0/6] offload linux bonding tc ingress rules

2018-03-05 Thread Or Gerlitz
On Mon, Mar 5, 2018 at 3:28 PM, John Hurley  wrote:
> This RFC patchset adds support for offloading tc ingress rules applied to
> linux bonds. The premise of these patches is that if a rule is applied to
> a bond port then the rule should be applied to each slave of the bond.
>
> The linux bond itself registers a cb for offloading tc rules. Potential
> slave netdevs on offload devices can then register with the bond for a
> further callback - this code is basically the same as registering for an
> egress dev offload in TC. Then when a rule is offloaded to the bond, it
> can be relayed to each netdev that has registered with the bond code and
> which is a slave of the given bond.
>
> To prevent sync issues between the kernel and offload device, the linux
> bond driver is effectively locked when it has offloaded rules, i.e. no new
> ports can be enslaved and no slaves can be released until the offload
> rules are removed. Similarly, if a port on a bond is deleted, the bond is
> destroyed, forcing a flush of all offloaded rules.
>
> Also included in the RFC are changes to the NFP driver to utilise the new
> code by registering NFP port representors for bond offload rules and
> modifying cookie handling to allow the relaying of a rule to multiple ports.

what is your approach for rules whose bond is their egress device
(ingress = vf port
representor)?
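The relay scheme John describes above -- offload-capable slaves register a callback with the bond, and a rule offloaded to the bond is replayed to every registered slave, with the bond refusing enslave/release while rules are live -- can be sketched in plain C. Everything below (struct layout, function names, error convention) is a hypothetical userspace model for illustration, not the kernel TC API:

```c
#include <stddef.h>

/* Hypothetical model of the bond offload relay described in the RFC. */
#define MAX_SLAVES 8

typedef int (*offload_cb)(int rule_id, void *priv);

struct bond_offload {
	offload_cb cb[MAX_SLAVES];	/* one callback per registered slave */
	void *priv[MAX_SLAVES];
	int nr_slaves;
	int nr_rules;	/* while > 0, the bond is "locked" */
};

static int bond_register_slave_cb(struct bond_offload *bond,
				  offload_cb cb, void *priv)
{
	/* mirror the RFC's rule: no new slaves while rules are offloaded */
	if (bond->nr_rules > 0 || bond->nr_slaves == MAX_SLAVES)
		return -1;
	bond->cb[bond->nr_slaves] = cb;
	bond->priv[bond->nr_slaves] = priv;
	bond->nr_slaves++;
	return 0;
}

static int bond_offload_rule(struct bond_offload *bond, int rule_id)
{
	int i;

	/* relay the rule to each registered slave */
	for (i = 0; i < bond->nr_slaves; i++)
		if (bond->cb[i](rule_id, bond->priv[i]))
			return -1;
	bond->nr_rules++;
	return 0;
}

static int count_cb(int rule_id, void *priv)
{
	(void)rule_id;
	(*(int *)priv)++;	/* count how often the rule reached a slave */
	return 0;
}

static int demo(void)
{
	struct bond_offload bond = {0};
	int hits = 0;

	bond_register_slave_cb(&bond, count_cb, &hits);
	bond_register_slave_cb(&bond, count_cb, &hits);
	bond_offload_rule(&bond, 42);
	return hits;	/* one relay per slave */
}

static int locked_after_offload(void)
{
	struct bond_offload bond = {0};
	int hits = 0;

	bond_register_slave_cb(&bond, count_cb, &hits);
	bond_offload_rule(&bond, 1);
	/* with a rule offloaded, further enslaving is refused */
	return bond_register_slave_cb(&bond, count_cb, &hits);
}
```

The "locked while rules exist" rule is exactly the semantic Ido objects to above; a replay-on-enslave design (as with vlan_vids_add_by_dev) would drop that restriction.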


[PATCH 04/36] aio: sanitize ki_list handling

2018-03-05 Thread Christoph Hellwig
Instead of handcoded non-null checks always initialize ki_list to an
empty list and use list_empty / list_empty_careful on it.  While we're
at it also error out on a double call to kiocb_set_cancel_fn instead
of ignoring it.

Signed-off-by: Christoph Hellwig 
Acked-by: Jeff Moyer 
---
 fs/aio.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 6295fc00f104..c32c315f05b5 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -555,13 +555,12 @@ void kiocb_set_cancel_fn(struct kiocb *iocb, 
kiocb_cancel_fn *cancel)
struct kioctx *ctx = req->ki_ctx;
unsigned long flags;
 
-	spin_lock_irqsave(&ctx->ctx_lock, flags);
-
-	if (!req->ki_list.next)
-		list_add(&req->ki_list, &ctx->active_reqs);
+	if (WARN_ON_ONCE(!list_empty(&req->ki_list)))
+		return;
 
+	spin_lock_irqsave(&ctx->ctx_lock, flags);
+	list_add_tail(&req->ki_list, &ctx->active_reqs);
	req->ki_cancel = cancel;
-
	spin_unlock_irqrestore(&ctx->ctx_lock, flags);
 }
 EXPORT_SYMBOL(kiocb_set_cancel_fn);
@@ -1034,7 +1033,7 @@ static inline struct aio_kiocb *aio_get_req(struct kioctx 
*ctx)
goto out_put;
 
	percpu_ref_get(&ctx->reqs);
-
+	INIT_LIST_HEAD(&req->ki_list);
req->ki_ctx = ctx;
return req;
 out_put:
@@ -1080,7 +1079,7 @@ static void aio_complete(struct aio_kiocb *iocb, long 
res, long res2)
unsigned tail, pos, head;
unsigned long   flags;
 
-   if (iocb->ki_list.next) {
+	if (!list_empty_careful(&iocb->ki_list)) {
		unsigned long flags;
 
		spin_lock_irqsave(&ctx->ctx_lock, flags);
-- 
2.14.2
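The point of the ki_list change in this patch -- initialize the node and test emptiness, rather than relying on an uninitialized NULL `.next` -- can be mocked in userspace with the kernel's circular list idiom. A minimal sketch (re-implemented here for the demo, not the kernel's include/linux/list.h):

```c
#include <stdbool.h>

/*
 * Kernel-style circular doubly-linked list: an initialized but
 * unlinked node points at itself, so "empty" is a well-defined
 * test instead of a convention about NULL next pointers.
 */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static bool list_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_add_tail(struct list_head *new, struct list_head *head)
{
	/* link new just before head, i.e. at the tail of the ring */
	new->prev = head->prev;
	new->next = head;
	head->prev->next = new;
	head->prev = new;
}
```

With this idiom, aio_get_req() can unconditionally INIT_LIST_HEAD() the request and every later "is it queued?" check becomes list_empty(), which is what the patch does.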



[PATCH 01/36] aio: don't print the page size at boot time

2018-03-05 Thread Christoph Hellwig
The page size is in no way related to the aio code, and printing it in
the (debug) dmesg at every boot serves no purpose.

Signed-off-by: Christoph Hellwig 
Acked-by: Jeff Moyer 
---
 fs/aio.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index a062d75109cb..03d59593912d 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -264,9 +264,6 @@ static int __init aio_setup(void)
 
kiocb_cachep = KMEM_CACHE(aio_kiocb, SLAB_HWCACHE_ALIGN|SLAB_PANIC);
kioctx_cachep = KMEM_CACHE(kioctx,SLAB_HWCACHE_ALIGN|SLAB_PANIC);
-
-   pr_debug("sizeof(struct page) = %zu\n", sizeof(struct page));
-
return 0;
 }
 __initcall(aio_setup);
-- 
2.14.2



[PATCH 03/36] aio: refactor read/write iocb setup

2018-03-05 Thread Christoph Hellwig
Don't reference the kiocb structure from the common aio code, and move
any use of it into helper specific to the read/write path.  This is in
preparation for aio_poll support that wants to use the space for different
fields.

Signed-off-by: Christoph Hellwig 
Acked-by: Jeff Moyer 
---
 fs/aio.c | 171 ---
 1 file changed, 97 insertions(+), 74 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 41fc8ce6bc7f..6295fc00f104 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -170,7 +170,9 @@ struct kioctx {
 #define KIOCB_CANCELLED		((void *) (~0ULL))
 
 struct aio_kiocb {
-	struct kiocb		common;
+	union {
+		struct kiocb		rw;
+	};
 
struct kioctx   *ki_ctx;
kiocb_cancel_fn *ki_cancel;
@@ -549,7 +551,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int 
nr_events)
 
 void kiocb_set_cancel_fn(struct kiocb *iocb, kiocb_cancel_fn *cancel)
 {
-   struct aio_kiocb *req = container_of(iocb, struct aio_kiocb, common);
+   struct aio_kiocb *req = container_of(iocb, struct aio_kiocb, rw);
struct kioctx *ctx = req->ki_ctx;
unsigned long flags;
 
@@ -582,7 +584,7 @@ static int kiocb_cancel(struct aio_kiocb *kiocb)
		cancel = cmpxchg(&kiocb->ki_cancel, old, KIOCB_CANCELLED);
	} while (cancel != old);
 
-	return cancel(&kiocb->common);
+	return cancel(&kiocb->rw);
 }
 
 static void free_ioctx(struct work_struct *work)
@@ -1040,15 +1042,6 @@ static inline struct aio_kiocb *aio_get_req(struct 
kioctx *ctx)
return NULL;
 }
 
-static void kiocb_free(struct aio_kiocb *req)
-{
-   if (req->common.ki_filp)
-   fput(req->common.ki_filp);
-   if (req->ki_eventfd != NULL)
-   eventfd_ctx_put(req->ki_eventfd);
-   kmem_cache_free(kiocb_cachep, req);
-}
-
 static struct kioctx *lookup_ioctx(unsigned long ctx_id)
 {
struct aio_ring __user *ring  = (void __user *)ctx_id;
@@ -1079,29 +1072,14 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
 /* aio_complete
  * Called when the io request on the given iocb is complete.
  */
-static void aio_complete(struct kiocb *kiocb, long res, long res2)
+static void aio_complete(struct aio_kiocb *iocb, long res, long res2)
 {
-   struct aio_kiocb *iocb = container_of(kiocb, struct aio_kiocb, common);
struct kioctx   *ctx = iocb->ki_ctx;
struct aio_ring *ring;
struct io_event *ev_page, *event;
unsigned tail, pos, head;
unsigned long   flags;
 
-   BUG_ON(is_sync_kiocb(kiocb));
-
-   if (kiocb->ki_flags & IOCB_WRITE) {
-   struct file *file = kiocb->ki_filp;
-
-   /*
-* Tell lockdep we inherited freeze protection from submission
-* thread.
-*/
-   if (S_ISREG(file_inode(file)->i_mode))
-   __sb_writers_acquired(file_inode(file)->i_sb, 
SB_FREEZE_WRITE);
-   file_end_write(file);
-   }
-
if (iocb->ki_list.next) {
unsigned long flags;
 
@@ -1163,11 +1141,12 @@ static void aio_complete(struct kiocb *kiocb, long res, 
long res2)
 * eventfd. The eventfd_signal() function is safe to be called
 * from IRQ context.
 */
-   if (iocb->ki_eventfd != NULL)
+   if (iocb->ki_eventfd) {
eventfd_signal(iocb->ki_eventfd, 1);
+   eventfd_ctx_put(iocb->ki_eventfd);
+   }
 
-   /* everything turned out well, dispose of the aiocb. */
-   kiocb_free(iocb);
+   kmem_cache_free(kiocb_cachep, iocb);
 
/*
 * We have to order our ring_info tail store above and test
@@ -1430,6 +1409,47 @@ SYSCALL_DEFINE1(io_destroy, aio_context_t, ctx)
return -EINVAL;
 }
 
+static void aio_complete_rw(struct kiocb *kiocb, long res, long res2)
+{
+   struct aio_kiocb *iocb = container_of(kiocb, struct aio_kiocb, rw);
+
+   WARN_ON_ONCE(is_sync_kiocb(kiocb));
+
+   if (kiocb->ki_flags & IOCB_WRITE) {
+   struct inode *inode = file_inode(kiocb->ki_filp);
+
+   /*
+* Tell lockdep we inherited freeze protection from submission
+* thread.
+*/
+   if (S_ISREG(inode->i_mode))
+   __sb_writers_acquired(inode->i_sb, SB_FREEZE_WRITE);
+   file_end_write(kiocb->ki_filp);
+   }
+
+   fput(kiocb->ki_filp);
+   aio_complete(iocb, res, res2);
+}
+
+static int aio_prep_rw(struct kiocb *req, struct iocb *iocb)
+{
+   int ret;
+
+   req->ki_filp = fget(iocb->aio_fildes);
+   if (unlikely(!req->ki_filp))
+   return -EBADF;
+   req->ki_complete = aio_complete_rw;
+   req->ki_pos = iocb->aio_offset;
+   req->ki_flags = iocb_flags(req->ki_filp);
+   if (iocb->aio_flags & IOCB_FLAG_RESFD)
+   req->ki_flags 

Re: [PATCH] pci-iov: Add support for unmanaged SR-IOV

2018-03-05 Thread Alexander Duyck
On Mon, Mar 5, 2018 at 12:57 PM, Don Dutile  wrote:
> On 03/01/2018 03:22 PM, Alex Williamson wrote:
>>
>> On Wed, 28 Feb 2018 16:36:38 -0800
>> Alexander Duyck  wrote:
>>
>>> On Wed, Feb 28, 2018 at 2:59 PM, Alex Williamson
>>>  wrote:

 On Wed, 28 Feb 2018 09:49:21 -0800
 Alexander Duyck  wrote:

>
> On Tue, Feb 27, 2018 at 2:25 PM, Alexander Duyck
>  wrote:
>>
>> On Tue, Feb 27, 2018 at 1:40 PM, Alex Williamson
>>  wrote:
>>>
>>> On Tue, 27 Feb 2018 11:06:54 -0800
>>> Alexander Duyck  wrote:
>>>

 From: Alexander Duyck 

 This patch is meant to add support for SR-IOV on devices when the
 VFs are
 not managed by the kernel. Examples of recent patches attempting to
 do this
 include:
>>>
>>>
>>> It appears to enable sriov when the _pf_ is not managed by the
>>> kernel, but by "managed" we mean that either there is no pf driver or
>>> the pf driver doesn't provide an sriov_configure callback,
>>> intentionally or otherwise.
>>>

 virto - https://patchwork.kernel.org/patch/10241225/
 pci-stub - https://patchwork.kernel.org/patch/10109935/
 vfio - https://patchwork.kernel.org/patch/10103353/
 uio - https://patchwork.kernel.org/patch/9974031/
>>>
>>>
>>> So is the goal to get around the issues with enabling sriov on each
>>> of
>>> the above drivers by doing it under the covers or are you really just
>>> trying to enable sriov for a truly unmanage (no pf driver) case?  For
>>> example, should a driver explicitly not wanting sriov enabled
>>> implement
>>> a dummy sriov_configure function?
>>>

 Since this is quickly blowing up into a multi-driver problem it is
 probably
 best to implement this solution in one spot.

 This patch is an attempt to do that. What we do with this patch is
 provide
 a generic call to enable SR-IOV in the case that the PF driver is
 either
 not present, or the PF driver doesn't support configuring SR-IOV.

 A new sysfs value called sriov_unmanaged_autoprobe has been added.
 This
 value is used as the drivers_autoprobe setting of the VFs when they
 are
 being managed by an external entity such as userspace or device
 firmware
 instead of being managed by the kernel.
>>>
>>>
>>> Documentation/ABI/testing/sysfs-bus-pci update is missing.
>>
>>
>> I can make sure to update that in the next version.
>>

 One side effect of this change is that the sriov_drivers_autoprobe
 and
 sriov_unmanaged_autoprobe will only apply their updates when SR-IOV
 is
 disabled. Attempts to update them when SR-IOV is in use will only
 update
 the local value and will not update sriov->autoprobe.
>>>
>>>
>>> And we expect users to understand when sriov_drivers_autoprobe
>>> applies
>>> vs sriov_unmanaged_autoprobe, even though they're using the same
>>> interfaces to enable sriov?  Are all combinations expected to work,
>>> ex.
>>> unmanaged sriov is enabled, a native pf driver loads, vfs work?  Not
>>> only does it seems like there's opportunity to use this incorrectly,
>>> I
>>> think maybe it might be difficult to use correctly.
>>>

 I based my patch set originally on the patch by Mark Rustad but
 there isn't
 much left after going through and cleaning out the bits that were no
 longer
 needed, and after incorporating the feedback from David Miller.

 I have included the authors of the original 4 patches above in the
 Cc here.
 My hope is to get feedback and/or review on if this works for their
 use
 cases.

 Cc: Mark Rustad 
 Cc: Maximilian Heyne 
 Cc: Liang-Min Wang 
 Cc: David Woodhouse 
 Signed-off-by: Alexander Duyck 
 ---
   drivers/pci/iov.c|   27 +++-
   drivers/pci/pci-driver.c |2 +
   drivers/pci/pci-sysfs.c  |   62
 +-
   drivers/pci/pci.h|4 ++-
   include/linux/pci.h  |1 +
   5 files changed, 86 insertions(+), 10 deletions(-)

 diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
 index 677924ae0350..7b8858bd8d03 100644
 --- a/drivers/pci/iov.c

Re: [net-next 4/4] e1000e: allocate ring descriptors with dma_zalloc_coherent

2018-03-05 Thread Jeff Kirsher
On Mon, 2018-03-05 at 16:20 -0500, David Miller wrote:
> From: Jeff Kirsher 
> Date: Mon, 05 Mar 2018 11:09:29 -0800
> 
> > On Mon, 2018-03-05 at 10:23 -0800, Eric Dumazet wrote:
> > > On Mon, 2018-03-05 at 10:16 -0800, Jeff Kirsher wrote:
> > > > From: Pierre-Yves Kerbrat 
> > > > 
> > > > Descriptor rings were not initialized at zero when allocated
> > > > When area contained garbage data, it caused skb_over_panic in
> > > > e1000_clean_rx_irq (if data had E1000_RXD_STAT_DD bit set)
> > > > 
> > > > This patch makes use of dma_zalloc_coherent to make sure the
> > > > ring is memset at 0 to prevent the area from containing
> > > > garbage.
> > > > 
> > > 
> > > This looks like a net candidate, fixing a bug, with 0 chance
> > > adding a
> > > regression IMO.
> > 
> > I am fine with that.  Dave, let me know if you want me to re-submit
> > this change for net/stable.
> 
> Yes, please add this patch to the net-queue pull request you also
> sent today.
> 
> Thanks.

Done.



[PATCH 05/36] aio: simplify cancellation

2018-03-05 Thread Christoph Hellwig
With the current aio code there is no need for the magic KIOCB_CANCELLED
value, as a cancelation just kicks the driver to queue the completion
ASAP, with all actual completion handling done in another thread. Given
that both the completion path and cancelation take the context lock there
is no need for magic cmpxchg loops either.

Signed-off-by: Christoph Hellwig 
Acked-by: Jeff Moyer 
---
 fs/aio.c | 37 +
 1 file changed, 9 insertions(+), 28 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index c32c315f05b5..2d40cf5dd4ec 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -156,19 +156,6 @@ struct kioctx {
	unsigned		id;
 };
 
-/*
- * We use ki_cancel == KIOCB_CANCELLED to indicate that a kiocb has been either
- * cancelled or completed (this makes a certain amount of sense because
- * successful cancellation - io_cancel() - does deliver the completion to
- * userspace).
- *
- * And since most things don't implement kiocb cancellation and we'd really 
like
- * kiocb completion to be lockless when possible, we use ki_cancel to
- * synchronize cancellation and completion - we only set it to KIOCB_CANCELLED
- * with xchg() or cmpxchg(), see batch_complete_aio() and kiocb_cancel().
- */
-#define KIOCB_CANCELLED		((void *) (~0ULL))
-
 struct aio_kiocb {
union {
		struct kiocb		rw;
@@ -565,24 +552,18 @@ void kiocb_set_cancel_fn(struct kiocb *iocb, 
kiocb_cancel_fn *cancel)
 }
 EXPORT_SYMBOL(kiocb_set_cancel_fn);
 
+/*
+ * Only cancel if there was a ki_cancel function to start with, and we
+ * are the one who managed to clear it (to protect against simultaneous
+ * cancel calls).
+ */
 static int kiocb_cancel(struct aio_kiocb *kiocb)
 {
-   kiocb_cancel_fn *old, *cancel;
-
-   /*
-* Don't want to set kiocb->ki_cancel = KIOCB_CANCELLED unless it
-* actually has a cancel function, hence the cmpxchg()
-*/
-
-   cancel = READ_ONCE(kiocb->ki_cancel);
-   do {
-   if (!cancel || cancel == KIOCB_CANCELLED)
-   return -EINVAL;
-
-   old = cancel;
-		cancel = cmpxchg(&kiocb->ki_cancel, old, KIOCB_CANCELLED);
-   } while (cancel != old);
+   kiocb_cancel_fn *cancel = kiocb->ki_cancel;
 
+   if (!cancel)
+   return -EINVAL;
+   kiocb->ki_cancel = NULL;
	return cancel(&kiocb->rw);
 }
 
-- 
2.14.2
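The simplified rule this patch relies on -- completion and cancellation both run under ctx_lock, so ki_cancel can be read and cleared with plain loads and stores, and whoever NULLs the pointer first wins -- can be sketched in userspace. Types and names below are illustrative, not the kernel's, and the "lock held" precondition is reduced to a comment:

```c
#include <stddef.h>

typedef int (*cancel_fn)(void *req);

struct mock_kiocb {
	cancel_fn ki_cancel;
};

static int noop_cancel(void *req)
{
	(void)req;
	return 0;	/* pretend the driver kicked the completion */
}

/* caller must hold the (here imaginary) ctx_lock */
static int mock_kiocb_cancel(struct mock_kiocb *kiocb, void *req)
{
	cancel_fn cancel = kiocb->ki_cancel;

	if (!cancel)
		return -1;	/* -EINVAL: nothing to cancel, or already done */
	kiocb->ki_cancel = NULL;	/* only one caller can win */
	return cancel(req);
}
```

Because the lock already serializes the two paths, neither the KIOCB_CANCELLED sentinel nor the cmpxchg() loop buys anything, which is exactly what the patch deletes.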



[net v2 0/6][pull request] Intel Wired LAN Driver Updates 2018-03-05

2018-03-05 Thread Jeff Kirsher
This series contains fixes to e1000e only.

Benjamin Poirier provides all but one fix in this series, starting with a
workaround for a VMWare e1000e emulation issue where ICR reads 0x0 on
the emulated device.  Partially reverted a previous commit dealing with
the "Other" interrupt throttling to avoid unforeseen fallout from these
changes that are not strictly necessary.  Restored the ICS write for
receive and transmit queue interrupts in the case that txq or rxq bits
were set in ICR and the Other interrupt handler read and cleared ICR
before the queue interrupt was raised.  Fixed a bug where interrupts
may be missed if ICR is read while INT_ASSERTED is not set, avoiding the
problem by setting all bits related to events that can trigger the Other
interrupt in IMS.  Fixed the return value for check_for_link() when
auto-negotiation is off.

Pierre-Yves Kerbrat fixes e1000e to use dma_zalloc_coherent() to make
sure the ring is memset to 0 to prevent the area from containing
garbage.

v2: added an additional e1000e fix to the series

The following are changes since commit a7f0fb1bfb66ded5d556d6723d691b77a7146b6f:
  Merge branch 'hv_netvsc-minor-fixes'
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue 1GbE

Benjamin Poirier (5):
  e1000e: Remove Other from EIAC
  Partial revert "e1000e: Avoid receiver overrun interrupt bursts"
  e1000e: Fix queue interrupt re-raising in Other interrupt
  e1000e: Avoid missed interrupts following ICR read
  e1000e: Fix check_for_link return value with autoneg off

Pierre-Yves Kerbrat (1):
  e1000e: allocate ring descriptors with dma_zalloc_coherent

 drivers/net/ethernet/intel/e1000e/defines.h | 21 -
 drivers/net/ethernet/intel/e1000e/ich8lan.c |  2 +-
 drivers/net/ethernet/intel/e1000e/mac.c |  2 +-
 drivers/net/ethernet/intel/e1000e/netdev.c  | 35 ++---
 4 files changed, 34 insertions(+), 26 deletions(-)

-- 
2.14.3



[PATCH 09/36] fs: unexport poll_schedule_timeout

2018-03-05 Thread Christoph Hellwig
No users outside of select.c.

Signed-off-by: Christoph Hellwig 
---
 fs/select.c  | 3 +--
 include/linux/poll.h | 2 --
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index b6c36254028a..686de7b3a1db 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -233,7 +233,7 @@ static void __pollwait(struct file *filp, wait_queue_head_t 
*wait_address,
	add_wait_queue(wait_address, &entry->wait);
 }
 
-int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
+static int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
  ktime_t *expires, unsigned long slack)
 {
int rc = -EINTR;
@@ -258,7 +258,6 @@ int poll_schedule_timeout(struct poll_wqueues *pwq, int 
state,
 
return rc;
 }
-EXPORT_SYMBOL(poll_schedule_timeout);
 
 /**
  * poll_select_set_timeout - helper function to setup the timeout value
diff --git a/include/linux/poll.h b/include/linux/poll.h
index f45ebd017eaa..a3576da63377 100644
--- a/include/linux/poll.h
+++ b/include/linux/poll.h
@@ -96,8 +96,6 @@ struct poll_wqueues {
 
 extern void poll_initwait(struct poll_wqueues *pwq);
 extern void poll_freewait(struct poll_wqueues *pwq);
-extern int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
-ktime_t *expires, unsigned long slack);
 extern u64 select_estimate_accuracy(struct timespec64 *tv);
 
 #define MAX_INT64_SECONDS (((s64)(~((u64)0)>>1)/HZ)-1)
-- 
2.14.2



Re: [bpf-next PATCH 05/16] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-05 Thread David Miller
From: John Fastabend 
Date: Mon, 05 Mar 2018 11:51:22 -0800

> BPF_PROG_TYPE_SK_MSG supports only two return codes SK_PASS and
> SK_DROP. Returning SK_DROP free's the copied data in the sendmsg
> case and in the sendpage case leaves the data untouched. Both cases
> return -EACESS to the user. Returning SK_PASS will allow the msg to
> be sent.
> 
> In the sendmsg case data is copied into kernel space buffers before
> running the BPF program. In the sendpage case data is never copied.
> The implication being users may change data after BPF programs run in
> the sendpage case. (A flag will be added to always copy shortly
> if the copy must always be performed).

I don't see how the sendpage case can be right.

The user can asynchronously change the page contents whenever they
want, and if the BPF program runs on the old contents then the verdict
is not for what actually ends up being sent on the socket.

There is really no way to cheaply freeze the page contents other than
to make a copy.
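The race described above can be demonstrated in miniature: compute a verdict over a shared buffer, let the "user" mutate the buffer afterwards, and the verdict no longer describes the data that goes out. The names below are invented for the demo, not the BPF sockmap API:

```c
#include <string.h>

enum verdict { SK_PASS, SK_DROP };

/* toy stand-in for a BPF verdict program: drop anything "blocked" */
static enum verdict bpf_verdict(const char *data)
{
	return strncmp(data, "blocked", 7) == 0 ? SK_DROP : SK_PASS;
}

/*
 * Zero-copy sendpage semantics: the verdict is computed over the
 * page, but the user still owns the page and can rewrite it before
 * transmission, so the returned verdict can be stale.
 */
static enum verdict sendpage_no_copy(char *page)
{
	enum verdict v = bpf_verdict(page);

	/* user asynchronously mutates the shared page... */
	memcpy(page, "blocked", 7);

	/* ...so what is "sent" no longer matches the verdict */
	return v;
}
```

This is the check-then-use gap that only a copy (or freezing the page) closes, which is why the cover letter mentions a flag to force the copy.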


[net v2 3/6] e1000e: Fix queue interrupt re-raising in Other interrupt

2018-03-05 Thread Jeff Kirsher
From: Benjamin Poirier 

Restores the ICS write for Rx/Tx queue interrupts which was present before
commit 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1)
but was not restored in commit 4aea7a5c5e94
("e1000e: Avoid receiver overrun interrupt bursts", v4.15-rc1).

This re-raises the queue interrupts in case the txq or rxq bits were set in
ICR and the Other interrupt handler read and cleared ICR before the queue
interrupt was raised.

Fixes: 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
Signed-off-by: Benjamin Poirier 
Acked-by: Alexander Duyck 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index 3b36efa6228d..2c9609bee2ae 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1919,6 +1919,9 @@ static irqreturn_t e1000_msix_other(int __always_unused 
irq, void *data)
icr = er32(ICR);
ew32(ICR, E1000_ICR_OTHER);
 
+   if (icr & adapter->eiac_mask)
+   ew32(ICS, (icr & adapter->eiac_mask));
+
if (icr & E1000_ICR_LSC) {
ew32(ICR, E1000_ICR_LSC);
hw->mac.get_link_status = true;
-- 
2.14.3



[net v2 1/6] e1000e: Remove Other from EIAC

2018-03-05 Thread Jeff Kirsher
From: Benjamin Poirier 

It was reported that emulated e1000e devices in vmware esxi 6.5 Build
7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
overrun interrupt bursts", v4.15-rc1). Some tracing shows that after
e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
icr=0x80000004 (_INT_ASSERTED | _LSC) in the same situation.

Some experimentation showed that this flaw in vmware e1000e emulation can
be worked around by not setting Other in EIAC. This is how it was before
16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1).

Fixes: 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
Signed-off-by: Benjamin Poirier 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index 1298b69f990b..153ad406c65e 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1918,6 +1918,8 @@ static irqreturn_t e1000_msix_other(int __always_unused 
irq, void *data)
bool enable = true;
 
icr = er32(ICR);
+   ew32(ICR, E1000_ICR_OTHER);
+
if (icr & E1000_ICR_RXO) {
ew32(ICR, E1000_ICR_RXO);
enable = false;
@@ -2040,7 +2042,6 @@ static void e1000_configure_msix(struct e1000_adapter 
*adapter)
   hw->hw_addr + E1000_EITR_82574(vector));
else
writel(1, hw->hw_addr + E1000_EITR_82574(vector));
-   adapter->eiac_mask |= E1000_IMS_OTHER;
 
/* Cause Tx interrupts on every write back */
ivar |= BIT(31);
@@ -2265,7 +2266,7 @@ static void e1000_irq_enable(struct e1000_adapter 
*adapter)
 
if (adapter->msix_entries) {
ew32(EIAC_82574, adapter->eiac_mask & E1000_EIAC_MASK_82574);
-   ew32(IMS, adapter->eiac_mask | E1000_IMS_LSC);
+   ew32(IMS, adapter->eiac_mask | E1000_IMS_OTHER | E1000_IMS_LSC);
} else if (hw->mac.type >= e1000_pch_lpt) {
ew32(IMS, IMS_ENABLE_MASK | E1000_IMS_ECCER);
} else {
-- 
2.14.3



[net v2 6/6] e1000e: allocate ring descriptors with dma_zalloc_coherent

2018-03-05 Thread Jeff Kirsher
From: Pierre-Yves Kerbrat 

Descriptor rings were not initialized at zero when allocated
When area contained garbage data, it caused skb_over_panic in
e1000_clean_rx_irq (if data had E1000_RXD_STAT_DD bit set)

This patch makes use of dma_zalloc_coherent to make sure the
ring is memset at 0 to prevent the area from containing garbage.

Following is the signature of the panic:
IODDR0@0.0: skbuff: skb_over_panic: text:80407b20 len:64010 put:64010 
head:ab46d800 data:ab46d842 tail:0xab47d24c end:0xab46df40 dev:eth0
IODDR0@0.0: BUG: failure at net/core/skbuff.c:105/skb_panic()!
IODDR0@0.0: Kernel panic - not syncing: BUG!
IODDR0@0.0:
IODDR0@0.0: Process swapper/0 (pid: 0, threadinfo=81728000, task=8173cc00 ,cpu: 
0)
IODDR0@0.0: SP = <815a1c0c>
IODDR0@0.0: Stack:  0001
IODDR0@0.0: b2d89800 815e33ac
IODDR0@0.0: ea73c040 0001
IODDR0@0.0: 60040003 fa0a
IODDR0@0.0: 0002
IODDR0@0.0:
IODDR0@0.0: 804540c0 815a1c70
IODDR0@0.0: b2744000 602ac070
IODDR0@0.0: 815a1c44 b2d89800
IODDR0@0.0: 8173cc00 815a1c08
IODDR0@0.0:
IODDR0@0.0: 0006
IODDR0@0.0: 815a1b50 
IODDR0@0.0: 80079434 0001
IODDR0@0.0: ab46df40 b2744000
IODDR0@0.0: b2d89800
IODDR0@0.0:
IODDR0@0.0: fa0a 8045745c
IODDR0@0.0: 815a1c88 fa0a
IODDR0@0.0: 80407b20 b2789f80
IODDR0@0.0: 0005 80407b20
IODDR0@0.0:
IODDR0@0.0:
IODDR0@0.0: Call Trace:
IODDR0@0.0: [<804540bc>] skb_panic+0xa4/0xa8
IODDR0@0.0: [<80079430>] console_unlock+0x2f8/0x6d0
IODDR0@0.0: [<80457458>] skb_put+0xa0/0xc0
IODDR0@0.0: [<80407b1c>] e1000_clean_rx_irq+0x2dc/0x3e8
IODDR0@0.0: [<80407b1c>] e1000_clean_rx_irq+0x2dc/0x3e8
IODDR0@0.0: [<804079c8>] e1000_clean_rx_irq+0x188/0x3e8
IODDR0@0.0: [<80407b1c>] e1000_clean_rx_irq+0x2dc/0x3e8
IODDR0@0.0: [<80468b48>] __dev_kfree_skb_any+0x88/0xa8
IODDR0@0.0: [<804101ac>] e1000e_poll+0x94/0x288
IODDR0@0.0: [<8046e9d4>] net_rx_action+0x19c/0x4e8
IODDR0@0.0:   ...
IODDR0@0.0: Maximum depth to print reached. Use kstack= 
To specify a custom value (where 0 means to display the full backtrace)
IODDR0@0.0: ---[ end Kernel panic - not syncing: BUG!

Signed-off-by: Pierre-Yves Kerbrat 
Signed-off-by: Marius Gligor 
Tested-by: Aaron Brown 
Reviewed-by: Alexander Duyck 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index 9fd4050a91ca..c0f23446bf26 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -2323,8 +2323,8 @@ static int e1000_alloc_ring_dma(struct e1000_adapter 
*adapter,
 {
struct pci_dev *pdev = adapter->pdev;
 
-	ring->desc = dma_alloc_coherent(&pdev->dev, ring->size, &ring->dma,
-					GFP_KERNEL);
+	ring->desc = dma_zalloc_coherent(&pdev->dev, ring->size, &ring->dma,
+					 GFP_KERNEL);
if (!ring->desc)
return -ENOMEM;
 
-- 
2.14.3
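The bug class fixed by this patch is easy to reproduce in userspace: a ring allocated without zeroing may contain leftover bytes that look like a set "descriptor done" status, so the driver consumes descriptors the hardware never wrote. A hedged sketch, with calloc() standing in for dma_zalloc_coherent() and a status bit that mirrors E1000_RXD_STAT_DD only in spirit:

```c
#include <stdlib.h>

#define RXD_STAT_DD 0x01	/* "descriptor done", toy value */

struct rx_desc {
	unsigned char status;
};

/* how many descriptors the driver would believe are ready */
static int count_ready(const struct rx_desc *ring, int n)
{
	int i, ready = 0;

	for (i = 0; i < n; i++)
		if (ring[i].status & RXD_STAT_DD)
			ready++;
	return ready;
}

static int ready_after_zalloc(int n)
{
	/* calloc() plays the role of dma_zalloc_coherent(): zeroed ring */
	struct rx_desc *ring = calloc(n, sizeof(*ring));
	int ready;

	if (!ring)
		return -1;
	ready = count_ready(ring, n);
	free(ring);
	return ready;	/* zeroed ring: nothing spuriously "done" */
}
```

With an unzeroed allocation, any garbage byte with bit 0 set would be processed as a completed receive, which is how the skb_over_panic in the commit message arises.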



[PATCH 10/36] fs: cleanup do_pollfd

2018-03-05 Thread Christoph Hellwig
Use straight-line code with failure-handling gotos instead of a lot
of nested conditionals.

Signed-off-by: Christoph Hellwig 
---
 fs/select.c | 48 +++-
 1 file changed, 23 insertions(+), 25 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 686de7b3a1db..c6c504a814f9 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -806,34 +806,32 @@ static inline __poll_t do_pollfd(struct pollfd *pollfd, 
poll_table *pwait,
 bool *can_busy_poll,
 __poll_t busy_flag)
 {
-   __poll_t mask;
-   int fd;
-
-   mask = 0;
-   fd = pollfd->fd;
-   if (fd >= 0) {
-   struct fd f = fdget(fd);
-   mask = EPOLLNVAL;
-   if (f.file) {
-   /* userland u16 ->events contains POLL... bitmap */
-   __poll_t filter = demangle_poll(pollfd->events) |
-   EPOLLERR | EPOLLHUP;
-   mask = DEFAULT_POLLMASK;
-   if (f.file->f_op->poll) {
-   pwait->_key = filter;
-   pwait->_key |= busy_flag;
-   mask = f.file->f_op->poll(f.file, pwait);
-   if (mask & busy_flag)
-   *can_busy_poll = true;
-   }
-   /* Mask out unneeded events. */
-   mask &= filter;
-   fdput(f);
-   }
+   int fd = pollfd->fd;
+   __poll_t mask = 0, filter;
+   struct fd f;
+
+   if (fd < 0)
+   goto out;
+   mask = EPOLLNVAL;
+   f = fdget(fd);
+   if (!f.file)
+   goto out;
+
+   /* userland u16 ->events contains POLL... bitmap */
+   filter = demangle_poll(pollfd->events) | EPOLLERR | EPOLLHUP;
+   mask = DEFAULT_POLLMASK;
+   if (f.file->f_op->poll) {
+   pwait->_key = filter | busy_flag;
+   mask = f.file->f_op->poll(f.file, pwait);
+   if (mask & busy_flag)
+   *can_busy_poll = true;
}
+   mask &= filter; /* Mask out unneeded events. */
+   fdput(f);
+
+out:
/* ... and so does ->revents */
pollfd->revents = mangle_poll(mask);
-
return mask;
 }
 
-- 
2.14.2



[net v2 2/6] Partial revert "e1000e: Avoid receiver overrun interrupt bursts"

2018-03-05 Thread Jeff Kirsher
From: Benjamin Poirier 

This partially reverts commit 4aea7a5c5e940c1723add439f4088844cd26196d.

We keep the fix for the first part of the problem (1) described in the log
of that commit, that is to read ICR in the other interrupt handler. We
remove the fix for the second part of the problem (2), Other interrupt
throttling.

Bursts of "Other" interrupts may once again occur during rxo (receive
overflow) traffic conditions. This is deemed acceptable in the interest of
avoiding unforeseen fallout from changes that are not strictly necessary.
As discussed, the e1000e driver should be in "maintenance mode".

Link: https://www.spinics.net/lists/netdev/msg480675.html
Signed-off-by: Benjamin Poirier 
Acked-by: Alexander Duyck 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 16 ++--
 1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index 153ad406c65e..3b36efa6228d 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1915,21 +1915,10 @@ static irqreturn_t e1000_msix_other(int __always_unused 
irq, void *data)
struct e1000_adapter *adapter = netdev_priv(netdev);
	struct e1000_hw *hw = &adapter->hw;
u32 icr;
-   bool enable = true;
 
icr = er32(ICR);
ew32(ICR, E1000_ICR_OTHER);
 
-   if (icr & E1000_ICR_RXO) {
-   ew32(ICR, E1000_ICR_RXO);
-   enable = false;
-   /* napi poll will re-enable Other, make sure it runs */
-		if (napi_schedule_prep(&adapter->napi)) {
-			adapter->total_rx_bytes = 0;
-			adapter->total_rx_packets = 0;
-			__napi_schedule(&adapter->napi);
-   }
-   }
if (icr & E1000_ICR_LSC) {
ew32(ICR, E1000_ICR_LSC);
hw->mac.get_link_status = true;
@@ -1938,7 +1927,7 @@ static irqreturn_t e1000_msix_other(int __always_unused 
irq, void *data)
		mod_timer(&adapter->watchdog_timer, jiffies + 1);
}
 
-	if (enable && !test_bit(__E1000_DOWN, &adapter->state))
+	if (!test_bit(__E1000_DOWN, &adapter->state))
ew32(IMS, E1000_IMS_OTHER);
 
return IRQ_HANDLED;
@@ -2708,8 +2697,7 @@ static int e1000e_poll(struct napi_struct *napi, int 
weight)
napi_complete_done(napi, work_done);
	if (!test_bit(__E1000_DOWN, &adapter->state)) {
if (adapter->msix_entries)
-   ew32(IMS, adapter->rx_ring->ims_val |
-E1000_IMS_OTHER);
+   ew32(IMS, adapter->rx_ring->ims_val);
else
e1000_irq_enable(adapter);
}
-- 
2.14.3



[net v2 4/6] e1000e: Avoid missed interrupts following ICR read

2018-03-05 Thread Jeff Kirsher
From: Benjamin Poirier 

The 82574 specification update errata 12 states that interrupts may be
missed if ICR is read while INT_ASSERTED is not set. Avoid that problem by
setting all bits related to events that can trigger the Other interrupt in
IMS.

The Other interrupt is raised for such events regardless of whether or not
they are set in IMS. However, only when they are set is the INT_ASSERTED
bit also set in ICR.

By doing this, we ensure that INT_ASSERTED is always set when we read ICR
in e1000_msix_other() and steer clear of the errata. This also ensures that
ICR will automatically be cleared on read, therefore we no longer need to
clear bits explicitly.

Signed-off-by: Benjamin Poirier 
Acked-by: Alexander Duyck 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/e1000e/defines.h | 21 -
 drivers/net/ethernet/intel/e1000e/netdev.c  | 11 ---
 2 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/defines.h 
b/drivers/net/ethernet/intel/e1000e/defines.h
index afb7ebe20b24..824fd44e25f0 100644
--- a/drivers/net/ethernet/intel/e1000e/defines.h
+++ b/drivers/net/ethernet/intel/e1000e/defines.h
@@ -400,6 +400,10 @@
 #define E1000_ICR_RXDMT00x0010 /* Rx desc min. threshold (0) */
 #define E1000_ICR_RXO   0x0040 /* Receiver Overrun */
 #define E1000_ICR_RXT0  0x0080 /* Rx timer intr (ring 0) */
+#define E1000_ICR_MDAC  0x0200 /* MDIO Access Complete */
+#define E1000_ICR_SRPD  0x0001 /* Small Receive Packet Detected */
+#define E1000_ICR_ACK   0x0002 /* Receive ACK Frame Detected */
+#define E1000_ICR_MNG   0x0004 /* Manageability Event Detected */
 #define E1000_ICR_ECCER 0x0040 /* Uncorrectable ECC Error */
 /* If this bit asserted, the driver should claim the interrupt */
 #define E1000_ICR_INT_ASSERTED 0x8000
@@ -407,7 +411,7 @@
 #define E1000_ICR_RXQ1  0x0020 /* Rx Queue 1 Interrupt */
 #define E1000_ICR_TXQ0  0x0040 /* Tx Queue 0 Interrupt */
 #define E1000_ICR_TXQ1  0x0080 /* Tx Queue 1 Interrupt */
-#define E1000_ICR_OTHER 0x0100 /* Other Interrupts */
+#define E1000_ICR_OTHER 0x0100 /* Other Interrupt */
 
 /* PBA ECC Register */
 #define E1000_PBA_ECC_COUNTER_MASK  0xFFF0 /* ECC counter mask */
@@ -431,12 +435,27 @@
E1000_IMS_RXSEQ  |\
E1000_IMS_LSC)
 
+/* These are all of the events related to the OTHER interrupt.
+ */
+#define IMS_OTHER_MASK ( \
+   E1000_IMS_LSC  | \
+   E1000_IMS_RXO  | \
+   E1000_IMS_MDAC | \
+   E1000_IMS_SRPD | \
+   E1000_IMS_ACK  | \
+   E1000_IMS_MNG)
+
 /* Interrupt Mask Set */
 #define E1000_IMS_TXDW  E1000_ICR_TXDW  /* Transmit desc written back 
*/
 #define E1000_IMS_LSC   E1000_ICR_LSC   /* Link Status Change */
 #define E1000_IMS_RXSEQ E1000_ICR_RXSEQ /* Rx sequence error */
 #define E1000_IMS_RXDMT0E1000_ICR_RXDMT0/* Rx desc min. threshold */
+#define E1000_IMS_RXO   E1000_ICR_RXO   /* Receiver Overrun */
 #define E1000_IMS_RXT0  E1000_ICR_RXT0  /* Rx timer intr */
+#define E1000_IMS_MDAC  E1000_ICR_MDAC  /* MDIO Access Complete */
+#define E1000_IMS_SRPD  E1000_ICR_SRPD  /* Small Receive Packet */
+#define E1000_IMS_ACK   E1000_ICR_ACK   /* Receive ACK Frame Detected 
*/
+#define E1000_IMS_MNG   E1000_ICR_MNG   /* Manageability Event */
 #define E1000_IMS_ECCER E1000_ICR_ECCER /* Uncorrectable ECC Error */
 #define E1000_IMS_RXQ0  E1000_ICR_RXQ0  /* Rx Queue 0 Interrupt */
 #define E1000_IMS_RXQ1  E1000_ICR_RXQ1  /* Rx Queue 1 Interrupt */
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index 2c9609bee2ae..9fd4050a91ca 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1914,16 +1914,12 @@ static irqreturn_t e1000_msix_other(int __always_unused 
irq, void *data)
struct net_device *netdev = data;
struct e1000_adapter *adapter = netdev_priv(netdev);
struct e1000_hw *hw = &adapter->hw;
-   u32 icr;
-
-   icr = er32(ICR);
-   ew32(ICR, E1000_ICR_OTHER);
+   u32 icr = er32(ICR);
 
if (icr & adapter->eiac_mask)
ew32(ICS, (icr & adapter->eiac_mask));
 
if (icr & E1000_ICR_LSC) {
-   ew32(ICR, E1000_ICR_LSC);
hw->mac.get_link_status = true;
/* guard against interrupt when we're going down */
if (!test_bit(__E1000_DOWN, &adapter->state))
@@ -1931,7 +1927,7 @@ static irqreturn_t e1000_msix_other(int __always_unused 
irq, void *data)
}
 
if (!test_bit(__E1000_DOWN, &adapter->state))
-   ew32(IMS, 

[net v2 5/6] e1000e: Fix check_for_link return value with autoneg off

2018-03-05 Thread Jeff Kirsher
From: Benjamin Poirier 

When autoneg is off, the .check_for_link callback functions clear the
get_link_status flag and systematically return a "pseudo-error". This means
that the link is not detected as up until the next execution of the
e1000_watchdog_task() 2 seconds later.

CC: stable 
Fixes: 19110cfbb34d ("e1000e: Separate signaling for link check/link up")
Signed-off-by: Benjamin Poirier 
Acked-by: Sasha Neftin 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/e1000e/ich8lan.c | 2 +-
 drivers/net/ethernet/intel/e1000e/mac.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.c 
b/drivers/net/ethernet/intel/e1000e/ich8lan.c
index 31277d3bb7dc..ff308b05d68c 100644
--- a/drivers/net/ethernet/intel/e1000e/ich8lan.c
+++ b/drivers/net/ethernet/intel/e1000e/ich8lan.c
@@ -1602,7 +1602,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct 
e1000_hw *hw)
 * we have already determined whether we have link or not.
 */
if (!mac->autoneg)
-   return -E1000_ERR_CONFIG;
+   return 1;
 
/* Auto-Neg is enabled.  Auto Speed Detection takes care
 * of MAC speed/duplex configuration.  So we only need to
diff --git a/drivers/net/ethernet/intel/e1000e/mac.c 
b/drivers/net/ethernet/intel/e1000e/mac.c
index f457c5703d0c..db735644b312 100644
--- a/drivers/net/ethernet/intel/e1000e/mac.c
+++ b/drivers/net/ethernet/intel/e1000e/mac.c
@@ -450,7 +450,7 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
 * we have already determined whether we have link or not.
 */
if (!mac->autoneg)
-   return -E1000_ERR_CONFIG;
+   return 1;
 
/* Auto-Neg is enabled.  Auto Speed Detection takes care
 * of MAC speed/duplex configuration.  So we only need to
-- 
2.14.3



[PATCH 12/36] fs: add new vfs_poll and file_can_poll helpers

2018-03-05 Thread Christoph Hellwig
These abstract out calls to the poll method in preparation for changes
in how we poll.

Signed-off-by: Christoph Hellwig 
---
 drivers/staging/comedi/drivers/serial2002.c |  4 ++--
 drivers/vfio/virqfd.c   |  2 +-
 drivers/vhost/vhost.c   |  2 +-
 fs/eventpoll.c  |  5 ++---
 fs/select.c | 23 ---
 include/linux/poll.h| 12 
 mm/memcontrol.c |  2 +-
 net/9p/trans_fd.c   | 18 --
 virt/kvm/eventfd.c  |  2 +-
 9 files changed, 32 insertions(+), 38 deletions(-)

diff --git a/drivers/staging/comedi/drivers/serial2002.c 
b/drivers/staging/comedi/drivers/serial2002.c
index b3f3b4a201af..5471b2212a62 100644
--- a/drivers/staging/comedi/drivers/serial2002.c
+++ b/drivers/staging/comedi/drivers/serial2002.c
@@ -113,7 +113,7 @@ static void serial2002_tty_read_poll_wait(struct file *f, 
int timeout)
long elapsed;
__poll_t mask;
 
-   mask = f->f_op->poll(f, &table.pt);
+   mask = vfs_poll(f, &table.pt);
if (mask & (EPOLLRDNORM | EPOLLRDBAND | EPOLLIN |
EPOLLHUP | EPOLLERR)) {
break;
@@ -136,7 +136,7 @@ static int serial2002_tty_read(struct file *f, int timeout)
 
result = -1;
if (!IS_ERR(f)) {
-   if (f->f_op->poll) {
+   if (file_can_poll(f)) {
serial2002_tty_read_poll_wait(f, timeout);
 
if (kernel_read(f, &ch, 1, &pos) == 1)
diff --git a/drivers/vfio/virqfd.c b/drivers/vfio/virqfd.c
index 085700f1be10..2a1be859ee71 100644
--- a/drivers/vfio/virqfd.c
+++ b/drivers/vfio/virqfd.c
@@ -166,7 +166,7 @@ int vfio_virqfd_enable(void *opaque,
init_waitqueue_func_entry(&virqfd->wait, virqfd_wakeup);
init_poll_funcptr(&virqfd->pt, virqfd_ptable_queue_proc);

-   events = irqfd.file->f_op->poll(irqfd.file, &virqfd->pt);
+   events = vfs_poll(irqfd.file, &virqfd->pt);
 
/*
 * Check if there was an event already pending on the eventfd
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 1b3e8d2d5c8b..4d27e288bb1d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -208,7 +208,7 @@ int vhost_poll_start(struct vhost_poll *poll, struct file 
*file)
if (poll->wqh)
return 0;
 
-   mask = file->f_op->poll(file, &poll->table);
+   mask = vfs_poll(file, &poll->table);
if (mask)
vhost_poll_wakeup(&poll->wait, 0, 0, poll_to_key(mask));
if (mask & EPOLLERR) {
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 0f3494ed3ed0..2bebae5a38cf 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -884,8 +884,7 @@ static __poll_t ep_item_poll(const struct epitem *epi, 
poll_table *pt,
 
pt->_key = epi->event.events;
if (!is_file_epoll(epi->ffd.file))
-   return epi->ffd.file->f_op->poll(epi->ffd.file, pt) &
-  epi->event.events;
+   return vfs_poll(epi->ffd.file, pt) & epi->event.events;
 
ep = epi->ffd.file->private_data;
poll_wait(epi->ffd.file, &ep->poll_wait, pt);
@@ -2020,7 +2019,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 
/* The target file descriptor must support poll */
error = -EPERM;
-   if (!tf.file->f_op->poll)
+   if (!file_can_poll(tf.file))
goto error_tgt_fput;
 
/* Check if EPOLLWAKEUP is allowed */
diff --git a/fs/select.c b/fs/select.c
index c6c504a814f9..ba91103707ea 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -502,14 +502,10 @@ static int do_select(int n, fd_set_bits *fds, struct 
timespec64 *end_time)
continue;
f = fdget(i);
if (f.file) {
-   const struct file_operations *f_op;
-   f_op = f.file->f_op;
-   mask = DEFAULT_POLLMASK;
-   if (f_op->poll) {
-   wait_key_set(wait, in, out,
-bit, busy_flag);
-   mask = (*f_op->poll)(f.file, 
wait);
-   }
+   wait_key_set(wait, in, out, bit,
+busy_flag);
+   mask = vfs_poll(f.file, wait);
+
fdput(f);
if ((mask & POLLIN_SET) && (in & bit)) {
res_in |= bit;
@@ -819,13 +815,10 @@ static inline __poll_t do_pollfd(struct pollfd *pollfd, 
poll_table *pwait,
 

[PATCH 15/36] net: refactor socket_poll

2018-03-05 Thread Christoph Hellwig
Factor out two busy poll related helpers for later reuse, and remove
a comment that isn't very helpful, especially with the __poll_t
annotations in place.

Signed-off-by: Christoph Hellwig 
---
 include/net/busy_poll.h | 15 +++
 net/socket.c| 21 -
 2 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 71c72a939bf8..c5187438af38 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -121,6 +121,21 @@ static inline void sk_busy_loop(struct sock *sk, int 
nonblock)
 #endif
 }
 
+static inline void sock_poll_busy_loop(struct socket *sock, __poll_t events)
+{
+   if (sk_can_busy_loop(sock->sk) &&
+   events && (events & POLL_BUSY_LOOP)) {
+   /* once, only if requested by syscall */
+   sk_busy_loop(sock->sk, 1);
+   }
+}
+
+/* if this socket can poll_ll, tell the system call */
+static inline __poll_t sock_poll_busy_flag(struct socket *sock)
+{
+   return sk_can_busy_loop(sock->sk) ? POLL_BUSY_LOOP : 0;
+}
+
 /* used in the NIC receive handler to mark the skb */
 static inline void skb_mark_napi_id(struct sk_buff *skb,
struct napi_struct *napi)
diff --git a/net/socket.c b/net/socket.c
index a93c99b518ca..3f859a07641a 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1117,24 +1117,11 @@ EXPORT_SYMBOL(sock_create_lite);
 /* No kernel lock held - perfect */
 static __poll_t sock_poll(struct file *file, poll_table *wait)
 {
-   __poll_t busy_flag = 0;
-   struct socket *sock;
-
-   /*
-*  We can't return errors to poll, so it's either yes or no.
-*/
-   sock = file->private_data;
-
-   if (sk_can_busy_loop(sock->sk)) {
-   /* this socket can poll_ll so tell the system call */
-   busy_flag = POLL_BUSY_LOOP;
-
-   /* once, only if requested by syscall */
-   if (wait && (wait->_key & POLL_BUSY_LOOP))
-   sk_busy_loop(sock->sk, 1);
-   }
+   struct socket *sock = file->private_data;
+   __poll_t events = poll_requested_events(wait);
 
-   return busy_flag | sock->ops->poll(file, sock, wait);
+   sock_poll_busy_loop(sock, events);
+   return sock->ops->poll(file, sock, wait) | sock_poll_busy_flag(sock);
 }
 
 static int sock_mmap(struct file *file, struct vm_area_struct *vma)
-- 
2.14.2



Re: [RFC net-next 4/6] nfp: add ndo_set_mac_address for representors

2018-03-05 Thread Or Gerlitz
On Mon, Mar 5, 2018 at 3:28 PM, John Hurley  wrote:
> A representor hardware address does not have any meaning outside of the
> kernel netdev/networking stack. Thus there is no need for any app specific
> code for setting a representors hardware address, the default eth_mac_addr
> is sufficient.

where did you need that? does libvirt attempt to change the mac address,
or is it for bonding to call? worth mentioning the use-case in the change
log

[PATCH 14/36] aio: implement IOCB_CMD_POLL

2018-03-05 Thread Christoph Hellwig
Simple one-shot poll through the io_submit() interface.  To poll for
a file descriptor the application should submit an iocb of type
IOCB_CMD_POLL.  It will poll the fd for the events specified in the
first 32 bits of the aio_buf field of the iocb.

Unlike poll or epoll without EPOLLONESHOT, this interface always works
in one-shot mode: once the iocb is completed, it has to be
resubmitted.

Signed-off-by: Christoph Hellwig 
---
 fs/aio.c | 102 +++
 include/uapi/linux/aio_abi.h |   6 +--
 2 files changed, 104 insertions(+), 4 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index da87cbf7c67a..0bafc4975d51 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -5,6 +5,7 @@
  * Implements an efficient asynchronous io interface.
  *
  * Copyright 2000, 2001, 2002 Red Hat, Inc.  All Rights Reserved.
+ * Copyright 2018 Christoph Hellwig.
  *
  * See ../COPYING for licensing terms.
  */
@@ -156,9 +157,17 @@ struct kioctx {
unsignedid;
 };
 
+struct poll_iocb {
+   struct file *file;
+   __poll_tevents;
+   struct wait_queue_head  *head;
+   struct wait_queue_entry wait;
+};
+
 struct aio_kiocb {
union {
struct kiocbrw;
+   struct poll_iocbpoll;
};
 
struct kioctx   *ki_ctx;
@@ -1565,6 +1574,96 @@ static ssize_t aio_write(struct kiocb *req, struct iocb 
*iocb, bool vectored,
return ret;
 }
 
+static void __aio_complete_poll(struct poll_iocb *req, __poll_t mask)
+{
+   fput(req->file);
+   aio_complete(container_of(req, struct aio_kiocb, poll),
+   mangle_poll(mask), 0);
+}
+
+static void aio_complete_poll(struct poll_iocb *req, __poll_t mask)
+{
+   struct aio_kiocb *iocb = container_of(req, struct aio_kiocb, poll);
+
+   if (!(iocb->flags & AIO_IOCB_CANCELLED))
+   __aio_complete_poll(req, mask);
+}
+
+static int aio_poll_cancel(struct kiocb *rw)
+{
+   struct aio_kiocb *iocb = container_of(rw, struct aio_kiocb, rw);
+
+   remove_wait_queue(iocb->poll.head, &iocb->poll.wait);
+   __aio_complete_poll(&iocb->poll, 0); /* no events to report */
+   return 0;
+}
+
+static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int 
sync,
+   void *key)
+{
+   struct poll_iocb *req = container_of(wait, struct poll_iocb, wait);
+   struct file *file = req->file;
+   __poll_t mask = key_to_poll(key);
+
+   assert_spin_locked(&req->head->lock);
+
+   /* for instances that support it check for an event match first: */
+   if (mask && !(mask & req->events))
+   return 0;
+
+   mask = vfs_poll_mask(file, req->events);
+   if (!mask)
+   return 0;
+
+   __remove_wait_queue(req->head, &req->wait);
+   aio_complete_poll(req, mask);
+   return 1;
+}
+
+static ssize_t aio_poll(struct aio_kiocb *aiocb, struct iocb *iocb)
+{
+   struct poll_iocb *req = &aiocb->poll;
+   unsigned long flags;
+   __poll_t mask;
+
+   /* reject any unknown events outside the normal event mask. */
+   if ((u16)iocb->aio_buf != iocb->aio_buf)
+   return -EINVAL;
+   /* reject fields that are not defined for poll */
+   if (iocb->aio_offset || iocb->aio_nbytes || iocb->aio_rw_flags)
+   return -EINVAL;
+
+   req->events = demangle_poll(iocb->aio_buf) | POLLERR | POLLHUP;
+   req->file = fget(iocb->aio_fildes);
+   if (unlikely(!req->file))
+   return -EBADF;
+
+   req->head = vfs_get_poll_head(req->file, req->events);
+   if (!req->head) {
+   fput(req->file);
+   return -EINVAL; /* same as no support for IOCB_CMD_POLL */
+   }
+   if (IS_ERR(req->head)) {
+   mask = PTR_TO_POLL(req->head);
+   goto done;
+   }
+
+   init_waitqueue_func_entry(&req->wait, aio_poll_wake);
+
+   spin_lock_irqsave(&req->head->lock, flags);
+   mask = vfs_poll_mask(req->file, req->events);
+   if (!mask) {
+   __kiocb_set_cancel_fn(aiocb, aio_poll_cancel,
+   AIO_IOCB_DELAYED_CANCEL);
+   __add_wait_queue(req->head, &req->wait);
+   }
+   spin_unlock_irqrestore(&req->head->lock, flags);
+done:
+   if (mask)
+   aio_complete_poll(req, mask);
+   return -EIOCBQUEUED;
+}
+
 static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 struct iocb *iocb, bool compat)
 {
@@ -1628,6 +1727,9 @@ static int io_submit_one(struct kioctx *ctx, struct iocb 
__user *user_iocb,
case IOCB_CMD_PWRITEV:
ret = aio_write(&req->rw, iocb, true, compat);
break;
+   case IOCB_CMD_POLL:
+   ret = aio_poll(req, iocb);
+   break;
default:
pr_debug("invalid aio operation %d\n", iocb->aio_lio_opcode);

[PATCH 13/36] fs: introduce new ->get_poll_head and ->poll_mask methods

2018-03-05 Thread Christoph Hellwig
->get_poll_head returns the waitqueue that the poll operation is going
to sleep on.  Note that this means we can only use a single waitqueue
for the poll, unlike some current drivers that use two waitqueues for
different events.  But now that we have keyed wakeups and heavily use
those for poll there aren't that many good reasons left to keep the
multiple waitqueues, and if there are any, ->poll is still around; the
driver just won't support aio poll.

Signed-off-by: Christoph Hellwig 
---
 Documentation/filesystems/Locking |  7 ++-
 Documentation/filesystems/vfs.txt | 13 +
 fs/select.c   | 28 
 include/linux/fs.h|  2 ++
 include/linux/poll.h  | 27 +++
 5 files changed, 72 insertions(+), 5 deletions(-)

diff --git a/Documentation/filesystems/Locking 
b/Documentation/filesystems/Locking
index 220bba28f72b..6d227f9d7bd9 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -440,6 +440,8 @@ prototypes:
ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
int (*iterate) (struct file *, struct dir_context *);
__poll_t (*poll) (struct file *, struct poll_table_struct *);
+   struct wait_queue_head * (*get_poll_head)(struct file *, __poll_t);
+   __poll_t (*poll_mask) (struct file *, __poll_t);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
@@ -470,7 +472,7 @@ prototypes:
 };
 
 locking rules:
-   All may block.
+   All except for ->poll_mask may block.
 
 ->llseek() locking has moved from llseek to the individual llseek
 implementations.  If your fs is not using generic_file_llseek, you
@@ -498,6 +500,9 @@ in sys_read() and friends.
 the lease within the individual filesystem to record the result of the
 operation
 
+->poll_mask can be called with or without the waitqueue lock for the waitqueue
+returned from ->get_poll_head.
+
 --- dquot_operations ---
 prototypes:
int (*write_dquot) (struct dquot *);
diff --git a/Documentation/filesystems/vfs.txt 
b/Documentation/filesystems/vfs.txt
index f608180ad59d..50ee13563271 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -857,6 +857,8 @@ struct file_operations {
ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
int (*iterate) (struct file *, struct dir_context *);
__poll_t (*poll) (struct file *, struct poll_table_struct *);
+   struct wait_queue_head * (*get_poll_head)(struct file *, __poll_t);
+   __poll_t (*poll_mask) (struct file *, __poll_t);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
@@ -901,6 +903,17 @@ otherwise noted.
activity on this file and (optionally) go to sleep until there
is activity. Called by the select(2) and poll(2) system calls
 
+  get_poll_head: Returns the struct wait_queue_head that poll, select,
+  epoll or aio poll should wait on in case this instance only has single
+  waitqueue.  Can return NULL to indicate polling is not supported,
+  or a POLL* value using the POLL_TO_PTR helper in case a grave error
+  occurred and ->poll_mask shall not be called.
+
+  poll_mask: return the mask of POLL* values describing the file descriptor
+  state.  Called either before going to sleep on the waitqueue returned by
+  get_poll_head, or after it has been woken.  If ->get_poll_head and
+  ->poll_mask are implemented, ->poll does not need to be implemented.
+
   unlocked_ioctl: called by the ioctl(2) system call.
 
   compat_ioctl: called by the ioctl(2) system call when 32 bit system calls
diff --git a/fs/select.c b/fs/select.c
index ba91103707ea..cc270d7f6192 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -34,6 +34,34 @@
 
 #include 
 
+__poll_t vfs_poll(struct file *file, struct poll_table_struct *pt)
+{
+   unsigned int events = poll_requested_events(pt);
+   struct wait_queue_head *head;
+
+   if (unlikely(!file_can_poll(file)))
+   return DEFAULT_POLLMASK;
+
+   if (file->f_op->poll)
+   return file->f_op->poll(file, pt);
+
+   /*
+* Only get the poll head and do the first mask check if we are actually
+* going to sleep on this file:
+*/
+   if (pt && pt->_qproc) {
+   head = vfs_get_poll_head(file, events);
+   if (!head)
+   return DEFAULT_POLLMASK;
+   if (IS_ERR(head))
+   return PTR_TO_POLL(head);
+
+   pt->_qproc(file, head, pt);
+   }
+
+   return file->f_op->poll_mask(file, events);
+}

[PATCH 20/36] net: convert datagram_poll users tp ->poll_mask

2018-03-05 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
---
 drivers/isdn/mISDN/socket.c|  2 +-
 drivers/net/ppp/pppoe.c|  2 +-
 drivers/staging/ipx/af_ipx.c   |  2 +-
 drivers/staging/irda/net/af_irda.c |  6 +++---
 include/linux/skbuff.h |  3 +--
 include/net/udp.h  |  2 +-
 net/appletalk/ddp.c|  2 +-
 net/ax25/af_ax25.c |  2 +-
 net/bluetooth/hci_sock.c   |  2 +-
 net/can/bcm.c  |  2 +-
 net/can/raw.c  |  2 +-
 net/core/datagram.c| 13 -
 net/decnet/af_decnet.c |  6 +++---
 net/ieee802154/socket.c|  4 ++--
 net/ipv4/af_inet.c |  6 +++---
 net/ipv4/udp.c | 10 +-
 net/ipv6/af_inet6.c|  2 +-
 net/ipv6/raw.c |  4 ++--
 net/kcm/kcmsock.c  |  4 ++--
 net/key/af_key.c   |  2 +-
 net/l2tp/l2tp_ip.c |  2 +-
 net/l2tp/l2tp_ip6.c|  2 +-
 net/l2tp/l2tp_ppp.c|  2 +-
 net/llc/af_llc.c   |  2 +-
 net/netlink/af_netlink.c   |  2 +-
 net/netrom/af_netrom.c |  2 +-
 net/nfc/rawsock.c  |  4 ++--
 net/packet/af_packet.c |  9 -
 net/phonet/socket.c|  2 +-
 net/qrtr/qrtr.c|  2 +-
 net/rose/af_rose.c |  2 +-
 net/x25/af_x25.c   |  2 +-
 32 files changed, 52 insertions(+), 59 deletions(-)

diff --git a/drivers/isdn/mISDN/socket.c b/drivers/isdn/mISDN/socket.c
index c84270e16bdd..61d6e4c9e7d1 100644
--- a/drivers/isdn/mISDN/socket.c
+++ b/drivers/isdn/mISDN/socket.c
@@ -589,7 +589,7 @@ static const struct proto_ops data_sock_ops = {
.getname= data_sock_getname,
.sendmsg= mISDN_sock_sendmsg,
.recvmsg= mISDN_sock_recvmsg,
-   .poll   = datagram_poll,
+   .poll_mask  = datagram_poll_mask,
.listen = sock_no_listen,
.shutdown   = sock_no_shutdown,
.setsockopt = data_sock_setsockopt,
diff --git a/drivers/net/ppp/pppoe.c b/drivers/net/ppp/pppoe.c
index 5aa59f41bf8c..8c311e626884 100644
--- a/drivers/net/ppp/pppoe.c
+++ b/drivers/net/ppp/pppoe.c
@@ -1120,7 +1120,7 @@ static const struct proto_ops pppoe_ops = {
.socketpair = sock_no_socketpair,
.accept = sock_no_accept,
.getname= pppoe_getname,
-   .poll   = datagram_poll,
+   .poll_mask  = datagram_poll_mask,
.listen = sock_no_listen,
.shutdown   = sock_no_shutdown,
.setsockopt = sock_no_setsockopt,
diff --git a/drivers/staging/ipx/af_ipx.c b/drivers/staging/ipx/af_ipx.c
index d21a9d128d3e..3373f7f67d35 100644
--- a/drivers/staging/ipx/af_ipx.c
+++ b/drivers/staging/ipx/af_ipx.c
@@ -1967,7 +1967,7 @@ static const struct proto_ops ipx_dgram_ops = {
.socketpair = sock_no_socketpair,
.accept = sock_no_accept,
.getname= ipx_getname,
-   .poll   = datagram_poll,
+   .poll_mask  = datagram_poll_mask,
.ioctl  = ipx_ioctl,
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ipx_compat_ioctl,
diff --git a/drivers/staging/irda/net/af_irda.c 
b/drivers/staging/irda/net/af_irda.c
index 2f1e9ab3d6d0..77659b1c40ba 100644
--- a/drivers/staging/irda/net/af_irda.c
+++ b/drivers/staging/irda/net/af_irda.c
@@ -2600,7 +2600,7 @@ static const struct proto_ops irda_seqpacket_ops = {
.socketpair =   sock_no_socketpair,
.accept =   irda_accept,
.getname =  irda_getname,
-   .poll = datagram_poll,
+   .poll_mask =datagram_poll_mask,
.ioctl =irda_ioctl,
 #ifdef CONFIG_COMPAT
.compat_ioctl = irda_compat_ioctl,
@@ -2624,7 +2624,7 @@ static const struct proto_ops irda_dgram_ops = {
.socketpair =   sock_no_socketpair,
.accept =   irda_accept,
.getname =  irda_getname,
-   .poll = datagram_poll,
+   .poll_mask =datagram_poll_mask,
.ioctl =irda_ioctl,
 #ifdef CONFIG_COMPAT
.compat_ioctl = irda_compat_ioctl,
@@ -2649,7 +2649,7 @@ static const struct proto_ops irda_ultra_ops = {
.socketpair =   sock_no_socketpair,
.accept =   sock_no_accept,
.getname =  irda_getname,
-   .poll = datagram_poll,
+   .poll_mask =datagram_poll_mask,
.ioctl =irda_ioctl,
 #ifdef CONFIG_COMPAT
.compat_ioctl = irda_compat_ioctl,
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c1e66bdcf583..455f4660c2a2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3246,8 +3246,7 @@ struct sk_buff *__skb_recv_datagram(struct sock *sk, 
unsigned flags,
int *peeked, int *off, int *err);
 struct sk_buff 

[PATCH 19/36] net/unix: convert to ->poll_mask

2018-03-05 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
---
 net/unix/af_unix.c | 30 +++---
 1 file changed, 11 insertions(+), 19 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 2d465bdeccbc..619c6921dd46 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -638,9 +638,8 @@ static int unix_stream_connect(struct socket *, struct 
sockaddr *,
 static int unix_socketpair(struct socket *, struct socket *);
 static int unix_accept(struct socket *, struct socket *, int, bool);
 static int unix_getname(struct socket *, struct sockaddr *, int *, int);
-static __poll_t unix_poll(struct file *, struct socket *, poll_table *);
-static __poll_t unix_dgram_poll(struct file *, struct socket *,
-   poll_table *);
+static __poll_t unix_poll_mask(struct socket *, __poll_t);
+static __poll_t unix_dgram_poll_mask(struct socket *, __poll_t);
 static int unix_ioctl(struct socket *, unsigned int, unsigned long);
 static int unix_shutdown(struct socket *, int);
 static int unix_stream_sendmsg(struct socket *, struct msghdr *, size_t);
@@ -681,7 +680,7 @@ static const struct proto_ops unix_stream_ops = {
.socketpair =   unix_socketpair,
.accept =   unix_accept,
.getname =  unix_getname,
-   .poll = unix_poll,
+   .poll_mask =unix_poll_mask,
.ioctl =unix_ioctl,
.listen =   unix_listen,
.shutdown = unix_shutdown,
@@ -704,7 +703,7 @@ static const struct proto_ops unix_dgram_ops = {
.socketpair =   unix_socketpair,
.accept =   sock_no_accept,
.getname =  unix_getname,
-   .poll = unix_dgram_poll,
+   .poll_mask =unix_dgram_poll_mask,
.ioctl =unix_ioctl,
.listen =   sock_no_listen,
.shutdown = unix_shutdown,
@@ -726,7 +725,7 @@ static const struct proto_ops unix_seqpacket_ops = {
.socketpair =   unix_socketpair,
.accept =   unix_accept,
.getname =  unix_getname,
-   .poll = unix_dgram_poll,
+   .poll_mask =unix_dgram_poll_mask,
.ioctl =unix_ioctl,
.listen =   unix_listen,
.shutdown = unix_shutdown,
@@ -2640,13 +2639,10 @@ static int unix_ioctl(struct socket *sock, unsigned int 
cmd, unsigned long arg)
return err;
 }
 
-static __poll_t unix_poll(struct file *file, struct socket *sock, poll_table 
*wait)
+static __poll_t unix_poll_mask(struct socket *sock, __poll_t events)
 {
struct sock *sk = sock->sk;
-   __poll_t mask;
-
-   sock_poll_wait(file, sk_sleep(sk), wait);
-   mask = 0;
+   __poll_t mask = 0;
 
/* exceptional events? */
if (sk->sk_err)
@@ -2675,15 +2671,11 @@ static __poll_t unix_poll(struct file *file, struct 
socket *sock, poll_table *wa
return mask;
 }
 
-static __poll_t unix_dgram_poll(struct file *file, struct socket *sock,
-   poll_table *wait)
+static __poll_t unix_dgram_poll_mask(struct socket *sock, __poll_t events)
 {
struct sock *sk = sock->sk, *other;
-   unsigned int writable;
-   __poll_t mask;
-
-   sock_poll_wait(file, sk_sleep(sk), wait);
-   mask = 0;
+   int writable;
+   __poll_t mask = 0;
 
/* exceptional events? */
if (sk->sk_err || !skb_queue_empty(&sk->sk_error_queue))
@@ -2709,7 +2701,7 @@ static __poll_t unix_dgram_poll(struct file *file, struct 
socket *sock,
}
 
/* No write status requested, avoid expensive OUT tests. */
-   if (!(poll_requested_events(wait) & (EPOLLWRBAND|EPOLLWRNORM|EPOLLOUT)))
+   if (!(events & (EPOLLWRBAND|EPOLLWRNORM|EPOLLOUT)))
return mask;
 
writable = unix_writable(sk);
-- 
2.14.2



[PATCH 17/36] net: remove sock_no_poll

2018-03-05 Thread Christoph Hellwig
Now that sock_poll handles a NULL ->poll or ->poll_mask there is no need
for a stub.

Signed-off-by: Christoph Hellwig 
---
 crypto/af_alg.c | 1 -
 crypto/algif_hash.c | 2 --
 crypto/algif_rng.c  | 1 -
 drivers/isdn/mISDN/socket.c | 1 -
 drivers/net/ppp/pptp.c  | 1 -
 include/net/sock.h  | 2 --
 net/bluetooth/bnep/sock.c   | 1 -
 net/bluetooth/cmtp/sock.c   | 1 -
 net/bluetooth/hidp/sock.c   | 1 -
 net/core/sock.c | 6 --
 10 files changed, 17 deletions(-)

diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index c49766b03165..50d75de539f5 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -347,7 +347,6 @@ static const struct proto_ops alg_proto_ops = {
.sendpage   =   sock_no_sendpage,
.sendmsg=   sock_no_sendmsg,
.recvmsg=   sock_no_recvmsg,
-   .poll   =   sock_no_poll,
 
.bind   =   alg_bind,
.release=   af_alg_release,
diff --git a/crypto/algif_hash.c b/crypto/algif_hash.c
index 6c9b1927a520..bfcf595fd8f9 100644
--- a/crypto/algif_hash.c
+++ b/crypto/algif_hash.c
@@ -288,7 +288,6 @@ static struct proto_ops algif_hash_ops = {
.mmap   =   sock_no_mmap,
.bind   =   sock_no_bind,
.setsockopt =   sock_no_setsockopt,
-   .poll   =   sock_no_poll,
 
.release=   af_alg_release,
.sendmsg=   hash_sendmsg,
@@ -396,7 +395,6 @@ static struct proto_ops algif_hash_ops_nokey = {
.mmap   =   sock_no_mmap,
.bind   =   sock_no_bind,
.setsockopt =   sock_no_setsockopt,
-   .poll   =   sock_no_poll,
 
.release=   af_alg_release,
.sendmsg=   hash_sendmsg_nokey,
diff --git a/crypto/algif_rng.c b/crypto/algif_rng.c
index 150c2b6480ed..22df3799a17b 100644
--- a/crypto/algif_rng.c
+++ b/crypto/algif_rng.c
@@ -106,7 +106,6 @@ static struct proto_ops algif_rng_ops = {
.bind   =   sock_no_bind,
.accept =   sock_no_accept,
.setsockopt =   sock_no_setsockopt,
-   .poll   =   sock_no_poll,
.sendmsg=   sock_no_sendmsg,
.sendpage   =   sock_no_sendpage,
 
diff --git a/drivers/isdn/mISDN/socket.c b/drivers/isdn/mISDN/socket.c
index c5603d1a07d6..c84270e16bdd 100644
--- a/drivers/isdn/mISDN/socket.c
+++ b/drivers/isdn/mISDN/socket.c
@@ -746,7 +746,6 @@ static const struct proto_ops base_sock_ops = {
.getname= sock_no_getname,
.sendmsg= sock_no_sendmsg,
.recvmsg= sock_no_recvmsg,
-   .poll   = sock_no_poll,
.listen = sock_no_listen,
.shutdown   = sock_no_shutdown,
.setsockopt = sock_no_setsockopt,
diff --git a/drivers/net/ppp/pptp.c b/drivers/net/ppp/pptp.c
index 6dde9a0cfe76..87f892f1d0fe 100644
--- a/drivers/net/ppp/pptp.c
+++ b/drivers/net/ppp/pptp.c
@@ -627,7 +627,6 @@ static const struct proto_ops pptp_ops = {
.socketpair = sock_no_socketpair,
.accept = sock_no_accept,
.getname= pptp_getname,
-   .poll   = sock_no_poll,
.listen = sock_no_listen,
.shutdown   = sock_no_shutdown,
.setsockopt = sock_no_setsockopt,
diff --git a/include/net/sock.h b/include/net/sock.h
index 169c92afcafa..d9249fe65859 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1585,8 +1585,6 @@ int sock_no_connect(struct socket *, struct sockaddr *, 
int, int);
 int sock_no_socketpair(struct socket *, struct socket *);
 int sock_no_accept(struct socket *, struct socket *, int, bool);
 int sock_no_getname(struct socket *, struct sockaddr *, int *, int);
-__poll_t sock_no_poll(struct file *, struct socket *,
- struct poll_table_struct *);
 int sock_no_ioctl(struct socket *, unsigned int, unsigned long);
 int sock_no_listen(struct socket *, int);
 int sock_no_shutdown(struct socket *, int);
diff --git a/net/bluetooth/bnep/sock.c b/net/bluetooth/bnep/sock.c
index b5116fa9835e..00deacdcb51c 100644
--- a/net/bluetooth/bnep/sock.c
+++ b/net/bluetooth/bnep/sock.c
@@ -175,7 +175,6 @@ static const struct proto_ops bnep_sock_ops = {
.getname= sock_no_getname,
.sendmsg= sock_no_sendmsg,
.recvmsg= sock_no_recvmsg,
-   .poll   = sock_no_poll,
.listen = sock_no_listen,
.shutdown   = sock_no_shutdown,
.setsockopt = sock_no_setsockopt,
diff --git a/net/bluetooth/cmtp/sock.c b/net/bluetooth/cmtp/sock.c
index ce86a7bae844..e08f28fadd65 100644
--- a/net/bluetooth/cmtp/sock.c
+++ b/net/bluetooth/cmtp/sock.c
@@ -178,7 +178,6 @@ static const struct proto_ops cmtp_sock_ops = {
.getname= sock_no_getname,
.sendmsg= sock_no_sendmsg,
.recvmsg= 

[PATCH 18/36] net/tcp: convert to ->poll_mask

2018-03-05 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
---
 include/net/tcp.h   |  4 ++--
 net/ipv4/af_inet.c  |  3 ++-
 net/ipv4/tcp.c  | 31 ++-
 net/ipv6/af_inet6.c |  3 ++-
 4 files changed, 20 insertions(+), 21 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index e3fc667f9ac2..fb52f93d556c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -387,8 +387,8 @@ bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst);
 void tcp_close(struct sock *sk, long timeout);
 void tcp_init_sock(struct sock *sk);
 void tcp_init_transfer(struct sock *sk, int bpf_op);
-__poll_t tcp_poll(struct file *file, struct socket *sock,
- struct poll_table_struct *wait);
+struct wait_queue_head *tcp_get_poll_head(struct socket *sock, __poll_t events);
+__poll_t tcp_poll_mask(struct socket *sock, __poll_t events);
 int tcp_getsockopt(struct sock *sk, int level, int optname,
   char __user *optval, int __user *optlen);
 int tcp_setsockopt(struct sock *sk, int level, int optname,
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index e4329e161943..ec32cc263b18 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -952,7 +952,8 @@ const struct proto_ops inet_stream_ops = {
.socketpair= sock_no_socketpair,
.accept= inet_accept,
.getname   = inet_getname,
-   .poll  = tcp_poll,
+   .get_poll_head = tcp_get_poll_head,
+   .poll_mask = tcp_poll_mask,
.ioctl = inet_ioctl,
.listen= inet_listen,
.shutdown  = inet_shutdown,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 48636aee23c3..ad8e281066a0 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -484,33 +484,30 @@ static void tcp_tx_timestamp(struct sock *sk, u16 tsflags)
}
 }
 
+struct wait_queue_head *tcp_get_poll_head(struct socket *sock, __poll_t events)
+{
+   sock_poll_busy_loop(sock, events);
+   sock_rps_record_flow(sock->sk);
+   return sk_sleep(sock->sk);
+}
+EXPORT_SYMBOL(tcp_get_poll_head);
+
 /*
- * Wait for a TCP event.
- *
- * Note that we don't need to lock the socket, as the upper poll layers
- * take care of normal races (between the test and the event) and we don't
- * go look at any of the socket buffers directly.
+ * Socket is not locked. We are protected from async events by poll logic and
+ * correct handling of state changes made by other threads is impossible in
+ * any case.
  */
-__poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
+__poll_t tcp_poll_mask(struct socket *sock, __poll_t events)
 {
-   __poll_t mask;
struct sock *sk = sock->sk;
const struct tcp_sock *tp = tcp_sk(sk);
+   __poll_t mask = 0;
int state;
 
-   sock_poll_wait(file, sk_sleep(sk), wait);
-
state = inet_sk_state_load(sk);
if (state == TCP_LISTEN)
return inet_csk_listen_poll(sk);
 
-   /* Socket is not locked. We are protected from async events
-* by poll logic and correct handling of state changes
-* made by other threads is impossible in any case.
-*/
-
-   mask = 0;
-
/*
 * EPOLLHUP is certainly not done right. But poll() doesn't
 * have a notion of HUP in just one direction, and for a
@@ -591,7 +588,7 @@ __poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 
return mask;
 }
-EXPORT_SYMBOL(tcp_poll);
+EXPORT_SYMBOL(tcp_poll_mask);
 
 int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg)
 {
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 416917719a6f..c470549d6ef9 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -547,7 +547,8 @@ const struct proto_ops inet6_stream_ops = {
.socketpair= sock_no_socketpair,/* a do nothing */
.accept= inet_accept,   /* ok   */
.getname   = inet6_getname,
-   .poll  = tcp_poll,  /* ok   */
+   .get_poll_head = tcp_get_poll_head,
+   .poll_mask = tcp_poll_mask, /* ok   */
.ioctl = inet6_ioctl,   /* must change  */
.listen= inet_listen,   /* ok   */
.shutdown  = inet_shutdown, /* ok   */
-- 
2.14.2
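
Not part of the patch, but as a userspace illustration: the mask that the new tcp_poll_mask() computes in the kernel is exactly what poll(2) reports on a connected TCP socket. The sketch below observes the three common states (idle-writable, data pending, peer closed) from Python; it assumes a Linux-style loopback TCP pair and makes no claims about the kernel internals beyond what the patch shows.

```python
# Userspace sketch: observe the event mask a TCP socket reports via poll(2).
# This is illustrative only; it does not exercise the patched kernel paths
# directly, just the semantics tcp_poll_mask() implements.
import select
import socket

def observe_tcp_poll_events():
    # A loopback TCP connection stands in for any connected TCP socket.
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    cli = socket.create_connection(srv.getsockname())
    peer, _ = srv.accept()

    p = select.poll()
    p.register(cli, select.POLLIN | select.POLLOUT)

    # Idle connected socket: writable (EPOLLOUT), not readable.
    writable_only = dict(p.poll(0)).get(cli.fileno(), 0)

    # Peer sends data: EPOLLIN joins the mask.
    peer.sendall(b"x")
    readable = dict(p.poll(1000)).get(cli.fileno(), 0)

    # Peer closes: the FIN is reported as readable EOF (EPOLLIN);
    # POLLHUP appears only once both directions are shut down.
    cli.recv(1)
    peer.close()
    after_close = dict(p.poll(1000)).get(cli.fileno(), 0)

    srv.close()
    cli.close()
    return writable_only, readable, after_close
```

Running this on Linux shows the mask transitions the kernel-side function computes for each socket state.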



Re: [bpf-next PATCH 04/16] net: generalize sk_alloc_sg to work with scatterlist rings

2018-03-05 Thread David Miller
From: John Fastabend 
Date: Mon, 05 Mar 2018 11:51:17 -0800

> The current implementation of sk_alloc_sg expects scatterlist to always
> start at entry 0 and complete at entry MAX_SKB_FRAGS.
> 
> Future patches will want to support starting at arbitrary offset into
> scatterlist so add an additional sg_start parameter and then default
> to the current values in TLS code paths.
> 
> Signed-off-by: John Fastabend 

Acked-by: David S. Miller 
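
For readers unfamiliar with the scatterlist ring shape: the idea behind the sg_start parameter is to fill a fixed-size ring of entries from an arbitrary start index with wraparound, rather than always from slot 0. A minimal sketch of that indexing, with hypothetical names (this is not the kernel sk_alloc_sg code, just the offset/wrap arithmetic it needs):

```python
# Illustrative sketch of ring-offset allocation; names are hypothetical.
MAX_FRAGS = 17  # stands in for MAX_SKB_FRAGS

def alloc_into_ring(ring, sg_start, nbytes, page_size=4096):
    """Place nbytes into ring slots beginning at sg_start, wrapping at
    MAX_FRAGS, and return the slot indices used in order."""
    i = sg_start
    placed = []
    while nbytes > 0:
        chunk = min(nbytes, page_size)
        ring[i] = chunk          # record the chunk length in this slot
        placed.append(i)
        nbytes -= chunk
        i = (i + 1) % MAX_FRAGS  # wrap back to slot 0 past the end
    return placed
```

With sg_start fixed at 0 this degenerates to the pre-patch behavior of always starting at entry 0.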

