date:20170320

Re: [PATCH net-next v6 3/3] A Sample of using socket cookie and uid for traffic monitoring

2017-03-20 Thread Alexei Starovoitov

On Mon, Mar 20, 2017 at 09:08:57PM -0700, Chenbo Feng wrote:
> From: Chenbo Feng 
> 
> Add a sample program to demonstrate the possible usage of the
> get_socket_cookie and get_socket_uid helper functions. The program
> counts bytes and packets of in/out traffic monitored by iptables
> and stores the stats in a bpf map on a per-socket basis. The owner uid
> of the socket is stored as part of the data entry. A shell script for
> running the program is also included.
> 
> Signed-off-by: Chenbo Feng 

Acked-by: Alexei Starovoitov

Re: [PATCHv2 net-next 2/7] sctp: implement receiver-side procedures for the SSN/TSN Reset Request Parameter

2017-03-20 Thread Xin Long

On Tue, Mar 21, 2017 at 2:04 AM, Marcelo Ricardo Leitner wrote:
> On Fri, Mar 10, 2017 at 12:11:07PM +0800, Xin Long wrote:
>> This patch is to implement Receiver-Side Procedures for the SSN/TSN
>> Reset Request Parameter described in rfc6525 section 6.2.4.
>>
>> The process is fairly complicated, so it's worth including some comments
>> from section 6.2.4 in the code.
>>
>> Signed-off-by: Xin Long 
>> ---
>>  include/net/sctp/sm.h   |  4 +++
>>  net/sctp/sm_statefuns.c |  3 ++
>>  net/sctp/stream.c   | 79 
>> +
>>  3 files changed, 86 insertions(+)
>>
>> diff --git a/include/net/sctp/sm.h b/include/net/sctp/sm.h
>> index b6f682e..2629d66 100644
>> --- a/include/net/sctp/sm.h
>> +++ b/include/net/sctp/sm.h
>> @@ -293,6 +293,10 @@ struct sctp_chunk *sctp_process_strreset_inreq(
>>   struct sctp_association *asoc,
>>   union sctp_params param,
>>   struct sctp_ulpevent **evp);
>> +struct sctp_chunk *sctp_process_strreset_tsnreq(
>> + struct sctp_association *asoc,
>> + union sctp_params param,
>> + struct sctp_ulpevent **evp);
>>
>>  /* Prototypes for statetable processing. */
>>
>> diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
>> index e03bb1a..6982064 100644
>> --- a/net/sctp/sm_statefuns.c
>> +++ b/net/sctp/sm_statefuns.c
>> @@ -3872,6 +3872,9 @@ sctp_disposition_t sctp_sf_do_reconf(struct net *net,
>>   else if (param.p->type == SCTP_PARAM_RESET_IN_REQUEST)
>>   reply = sctp_process_strreset_inreq(
>>   (struct sctp_association *)asoc, param, );
>> + else if (param.p->type == SCTP_PARAM_RESET_TSN_REQUEST)
>> + reply = sctp_process_strreset_tsnreq(
>> + (struct sctp_association *)asoc, param, );
>>   /* More handles for other types will be added here, by now it
>>* just ignores other types.
>>*/
>> diff --git a/net/sctp/stream.c b/net/sctp/stream.c
>> index 1c6cc04..7e993b0 100644
>> --- a/net/sctp/stream.c
>> +++ b/net/sctp/stream.c
>> @@ -477,3 +477,82 @@ struct sctp_chunk *sctp_process_strreset_inreq(
>>
>>   return chunk;
>>  }
>> +
>> +struct sctp_chunk *sctp_process_strreset_tsnreq(
>> + struct sctp_association *asoc,
>> + union sctp_params param,
>> + struct sctp_ulpevent **evp)
>> +{
>> + __u32 init_tsn = 0, next_tsn = 0, max_tsn_seen;
>> + struct sctp_strreset_tsnreq *tsnreq = param.v;
>> + struct sctp_stream *stream = asoc->stream;
>> + __u32 result = SCTP_STRRESET_DENIED;
>> + __u32 request_seq;
>> + __u16 i;
>> +
>> + request_seq = ntohl(tsnreq->request_seq);
>> + if (request_seq > asoc->strreset_inseq) {
>> + result = SCTP_STRRESET_ERR_BAD_SEQNO;
>> + goto out;
>> + } else if (request_seq == asoc->strreset_inseq) {
>> + asoc->strreset_inseq++;
>> + }
>
> I guess I already asked this, but.. why request_seq <
> asoc->strreset_inseq is allowed?
We cannot just ignore it or respond with an error.
rfc6525#section-5.2.1:

   ... If the received RE-CONFIG chunk contains at least
   one request and based on the analysis of the Re-configuration Request
   Sequence Numbers this is the last received RE-CONFIG chunk (i.e., a
   retransmission), the same RE-CONFIG chunk MUST to be sent back in
   response, as it was earlier.
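
The three cases can be sketched in plain C as follows. This is only an illustration of the comparison logic in the quoted hunk, not the kernel code: the enum and function names are made up for the sketch, and sequence number wraparound is not handled, just as with the plain comparisons above.

```c
#include <stdint.h>

/* Hypothetical result codes for this sketch; not the kernel's values. */
enum reconf_action {
	RECONF_BAD_SEQNO,       /* request is ahead of what we expect */
	RECONF_NEW_REQUEST,     /* request is the one we expect next */
	RECONF_RETRANSMISSION   /* already seen; the old reply is resent */
};

/* Classify an incoming request sequence number against the next expected
 * one. *expected advances only when a new request is accepted.
 */
static enum reconf_action classify_request_seq(uint32_t request_seq,
					       uint32_t *expected)
{
	if (request_seq > *expected)
		return RECONF_BAD_SEQNO;
	if (request_seq == *expected) {
		(*expected)++;
		return RECONF_NEW_REQUEST;
	}
	/* request_seq < *expected: treat it as a retransmission */
	return RECONF_RETRANSMISSION;
}
```

A request below the expected value is therefore neither dropped nor rejected; it falls through to the path that builds the response again.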


>
>> +
>> + if (!(asoc->strreset_enable & SCTP_ENABLE_RESET_ASSOC_REQ))
>> + goto out;
>> +
>> + if (asoc->strreset_outstanding) {
>> + result = SCTP_STRRESET_ERR_IN_PROGRESS;
>> + goto out;
>> + }
>> +
>> + /* G3: The same processing as though a SACK chunk with no gap report
>> +  * and a cumulative TSN ACK of the Sender's Next TSN minus 1 were
>> +  * received MUST be performed.
>> +  */
>> + max_tsn_seen = sctp_tsnmap_get_max_tsn_seen(&asoc->peer.tsn_map);
>> + sctp_ulpq_reasm_flushtsn(&asoc->ulpq, max_tsn_seen);
>> + sctp_ulpq_abort_pd(&asoc->ulpq, GFP_ATOMIC);
>> +
>> + /* G1: Compute an appropriate value for the Receiver's Next TSN -- the
>> +  * TSN that the peer should use to send the next DATA chunk.  The
>> +  * value SHOULD be the smallest TSN not acknowledged by the
>> +  * receiver of the request plus 2^31.
>> +  */
>> + init_tsn = sctp_tsnmap_get_ctsn(&asoc->peer.tsn_map) + (1 << 31);
>> + sctp_tsnmap_init(&asoc->peer.tsn_map, SCTP_TSN_MAP_INITIAL,
>> +  init_tsn, GFP_ATOMIC);
>> +
>> + /* G4: The same processing as though a FWD-TSN chunk (as defined in
>> +  * [RFC3758]) with all streams affected and a new cumulative TSN
>> +  * ACK of the

Re: [PATCH net-next v6 2/3] Add a eBPF helper function to retrieve socket uid

2017-03-20 Thread Alexei Starovoitov

On Mon, Mar 20, 2017 at 09:08:56PM -0700, Chenbo Feng wrote:
> From: Chenbo Feng 
> 
> Returns the owner uid of the socket inside a sk_buff. This is useful to
> perform per-UID accounting of network traffic or per-UID packet
> filtering. The socket needs to be a fullsock, otherwise overflowuid is
> returned.
> 
> Signed-off-by: Chenbo Feng 

Acked-by: Alexei Starovoitov

[PATCH net-stable] ipv4: keep skb->dst around in presence of IP options

2017-03-20 Thread Eric Dumazet

From: Eric Dumazet 

Upstream commit 34b2cef20f19c87999fff3da4071e66937db9644
("ipv4: keep skb->dst around in presence of IP options") incorrectly
identified commit d826eb14ecef ("ipv4: PKTINFO doesnt need dst
reference") as the origin of the bug.

This patch should fix the issue for 3.2.xx stable kernels, since IPv4
options seem to get more traction these days, after years of oblivion ;)

Fixes: f84af32cbca70 ("net: ip_queue_rcv_skb() helper")
Signed-off-by: Eric Dumazet 
Reported-by: Anarcheuz Fritz 
---

This is a backport for 3.2 kernels.

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index b3648bbef0da..a6e1eeb02267 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -1009,7 +1009,8 @@ e_inval:
  */
 int ip_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
-   if (!(inet_sk(sk)->cmsg_flags & IP_CMSG_PKTINFO))
+   if (!(inet_sk(sk)->cmsg_flags & IP_CMSG_PKTINFO) &&
+   !IPCB(skb)->opt.optlen)
skb_dst_drop(skb);
return sock_queue_rcv_skb(sk, skb);
 }
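
The condition added above can be illustrated outside the kernel with a small mock. Everything here is hypothetical (the struct names, the flag value and the has_dst field are stand-ins for this sketch); it only shows that the route reference is dropped when neither PKTINFO nor IP options will need it later.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_CMSG_PKTINFO 0x1	/* hypothetical flag value for this sketch */

struct sketch_sock {
	unsigned int cmsg_flags;
};

struct sketch_skb {
	unsigned char optlen;	/* length of IP options carried by the packet */
	bool has_dst;		/* stands in for the skb's dst reference */
};

/* Drop the dst reference only when nothing later needs it: the socket did
 * not ask for PKTINFO and the packet carries no IP options.
 */
static void sketch_queue_rcv(const struct sketch_sock *sk,
			     struct sketch_skb *skb)
{
	if (!(sk->cmsg_flags & SKETCH_CMSG_PKTINFO) && !skb->optlen)
		skb->has_dst = false;
}
```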

[PATCH net-next 2/8] skb_array: introduce batch dequeuing

2017-03-20 Thread Jason Wang

Signed-off-by: Jason Wang 
---
 include/linux/skb_array.h | 25 +
 1 file changed, 25 insertions(+)

diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h
index f4dfade..90e44b9 100644
--- a/include/linux/skb_array.h
+++ b/include/linux/skb_array.h
@@ -97,21 +97,46 @@ static inline struct sk_buff *skb_array_consume(struct skb_array *a)
return ptr_ring_consume(&a->ring);
 }
 
+static inline int skb_array_consume_batched(struct skb_array *a,
+   void **array, int n)
+{
+   return ptr_ring_consume_batched(&a->ring, array, n);
+}
+
 static inline struct sk_buff *skb_array_consume_irq(struct skb_array *a)
 {
return ptr_ring_consume_irq(&a->ring);
 }
 
+static inline int skb_array_consume_batched_irq(struct skb_array *a,
+   void **array, int n)
+{
+   return ptr_ring_consume_batched_irq(&a->ring, array, n);
+}
+
 static inline struct sk_buff *skb_array_consume_any(struct skb_array *a)
 {
return ptr_ring_consume_any(&a->ring);
 }
 
+static inline int skb_array_consume_batched_any(struct skb_array *a,
+   void **array, int n)
+{
+   return ptr_ring_consume_batched_any(&a->ring, array, n);
+}
+
+
 static inline struct sk_buff *skb_array_consume_bh(struct skb_array *a)
 {
return ptr_ring_consume_bh(&a->ring);
 }
 
+static inline int skb_array_consume_batched_bh(struct skb_array *a,
+  void **array, int n)
+{
+   return ptr_ring_consume_batched_bh(&a->ring, array, n);
+}
+
 static inline int __skb_array_len_with_tag(struct sk_buff *skb)
 {
if (likely(skb)) {
-- 
2.7.4
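
For readers unfamiliar with the idea, batch dequeuing can be sketched with a toy single-threaded ring. This is not the ptr_ring implementation: there is no locking or memory ordering, and all names (sketch_ring and its functions) are invented for the illustration. It assumes, for the purposes of the sketch, that a NULL slot means "empty".

```c
#include <stddef.h>

#define SKETCH_RING_SIZE 8

/* Toy ring of pointers; a NULL slot marks an empty entry. */
struct sketch_ring {
	void *slots[SKETCH_RING_SIZE];
	int head;	/* next slot to produce into */
	int tail;	/* next slot to consume from */
};

/* Returns 0 on success, -1 if ptr is NULL or the ring is full. */
static int sketch_ring_produce(struct sketch_ring *r, void *ptr)
{
	if (!ptr || r->slots[r->head])
		return -1;
	r->slots[r->head] = ptr;
	r->head = (r->head + 1) % SKETCH_RING_SIZE;
	return 0;
}

/* Returns the oldest entry, or NULL if the ring is empty. */
static void *sketch_ring_consume(struct sketch_ring *r)
{
	void *ptr = r->slots[r->tail];

	if (ptr) {
		r->slots[r->tail] = NULL;
		r->tail = (r->tail + 1) % SKETCH_RING_SIZE;
	}
	return ptr;
}

/* Dequeue up to n entries into array; returns how many were dequeued. */
static int sketch_ring_consume_batched(struct sketch_ring *r,
				       void **array, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		void *ptr = sketch_ring_consume(r);

		if (!ptr)
			break;
		array[i] = ptr;
	}
	return i;
}
```

In a locked ring, the benefit of the batched variant is that the consumer lock can be taken once per batch instead of once per entry.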

[PATCH net-next 5/8] tun: support receiving skb through msg_control

2017-03-20 Thread Jason Wang

This patch lets tun_recvmsg() receive an skb from its caller through
msg_control. vhost_net will be the first user.

Signed-off-by: Jason Wang 
---
 drivers/net/tun.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 70dd9ec..a82bced 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1498,9 +1498,8 @@ static struct sk_buff *tun_ring_recv(struct tun_file *tfile, int noblock,
 
 static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
   struct iov_iter *to,
-  int noblock)
+  int noblock, struct sk_buff *skb)
 {
-   struct sk_buff *skb;
ssize_t ret;
int err;
 
@@ -1509,10 +1508,12 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
if (!iov_iter_count(to))
return 0;
 
-   /* Read frames from ring */
-   skb = tun_ring_recv(tfile, noblock, &err);
-   if (!skb)
-   return err;
+   if (!skb) {
+   /* Read frames from ring */
+   skb = tun_ring_recv(tfile, noblock, &err);
+   if (!skb)
+   return err;
+   }
 
ret = tun_put_user(tun, tfile, skb, to);
if (unlikely(ret < 0))
@@ -1532,7 +1533,7 @@ static ssize_t tun_chr_read_iter(struct kiocb *iocb, struct iov_iter *to)
 
if (!tun)
return -EBADFD;
-   ret = tun_do_read(tun, tfile, to, file->f_flags & O_NONBLOCK);
+   ret = tun_do_read(tun, tfile, to, file->f_flags & O_NONBLOCK, NULL);
ret = min_t(ssize_t, ret, len);
if (ret > 0)
iocb->ki_pos = ret;
@@ -1634,7 +1635,8 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
 SOL_PACKET, TUN_TX_TIMESTAMP);
goto out;
}
-   ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT);
+   ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT,
+ m->msg_control);
if (ret > (ssize_t)total_len) {
m->msg_flags |= MSG_TRUNC;
ret = flags & MSG_TRUNC ? ret : total_len;
-- 
2.7.4

[PATCH net-next 7/8] vhost_net: try batch dequing from skb array

2017-03-20 Thread Jason Wang

We used to dequeue one skb at a time from skb_array during recvmsg(). This
can be inefficient because of poor cache utilization and because the
spinlock is touched for each packet. This patch batches the dequeuing by
calling the batch dequeuing helpers explicitly on the exported skb array
and passes each skb back through msg_control so that the underlying socket
can finish the userspace copy.

Tests were done by XDP1:
- small buffer:
  Before: 1.88Mpps
  After : 2.25Mpps (+19.6%)
- mergeable buffer:
  Before: 1.83Mpps
  After : 2.10Mpps (+14.7%)

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c | 64 +
 1 file changed, 60 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9b51989..53f09f2 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -28,6 +28,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 
@@ -85,6 +87,7 @@ struct vhost_net_ubuf_ref {
struct vhost_virtqueue *vq;
 };
 
+#define VHOST_RX_BATCH 64
 struct vhost_net_virtqueue {
struct vhost_virtqueue vq;
size_t vhost_hlen;
@@ -99,6 +102,10 @@ struct vhost_net_virtqueue {
/* Reference counting for outstanding ubufs.
 * Protected by vq mutex. Writers must also take device mutex. */
struct vhost_net_ubuf_ref *ubufs;
+   struct skb_array *rx_array;
+   void *rxq[VHOST_RX_BATCH];
+   int rt;
+   int rh;
 };
 
 struct vhost_net {
@@ -201,6 +208,8 @@ static void vhost_net_vq_reset(struct vhost_net *n)
n->vqs[i].ubufs = NULL;
n->vqs[i].vhost_hlen = 0;
n->vqs[i].sock_hlen = 0;
+   n->vqs[i].rt = 0;
+   n->vqs[i].rh = 0;
}
 
 }
@@ -503,13 +512,30 @@ static void handle_tx(struct vhost_net *net)
mutex_unlock(>mutex);
 }
 
-static int peek_head_len(struct sock *sk)
+static int peek_head_len_batched(struct vhost_net_virtqueue *rvq)
+{
+   if (rvq->rh != rvq->rt)
+   goto out;
+
+   rvq->rh = rvq->rt = 0;
+   rvq->rt = skb_array_consume_batched_bh(rvq->rx_array, rvq->rxq,
+   VHOST_RX_BATCH);
+   if (!rvq->rt)
+   return 0;
+out:
+   return __skb_array_len_with_tag(rvq->rxq[rvq->rh]);
+}
+
+static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
 {
struct socket *sock = sk->sk_socket;
struct sk_buff *head;
int len = 0;
unsigned long flags;
 
+   if (rvq->rx_array)
+   return peek_head_len_batched(rvq);
+
if (sock->ops->peek_len)
return sock->ops->peek_len(sock);
 
@@ -535,12 +561,14 @@ static int sk_has_rx_data(struct sock *sk)
return skb_queue_empty(>sk_receive_queue);
 }
 
-static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
+static int vhost_net_rx_peek_head_len(struct vhost_net *net,
+ struct sock *sk)
 {
+   struct vhost_net_virtqueue *rvq = &net->vqs[VHOST_NET_VQ_RX];
struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
struct vhost_virtqueue *vq = &nvq->vq;
unsigned long uninitialized_var(endtime);
-   int len = peek_head_len(sk);
+   int len = peek_head_len(rvq, sk);
 
if (!len && vq->busyloop_timeout) {
/* Both tx vq and rx socket were polled here */
@@ -561,7 +589,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
vhost_poll_queue(>poll);
mutex_unlock(>mutex);
 
-   len = peek_head_len(sk);
+   len = peek_head_len(rvq, sk);
}
 
return len;
@@ -699,6 +727,8 @@ static void handle_rx(struct vhost_net *net)
/* On error, stop handling until the next kick. */
if (unlikely(headcount < 0))
goto out;
+   if (nvq->rx_array)
+   msg.msg_control = nvq->rxq[nvq->rh++];
/* On overrun, truncate and discard */
if (unlikely(headcount > UIO_MAXIOV)) {
iov_iter_init(_iter, READ, vq->iov, 1, 1);
@@ -841,6 +871,8 @@ static int vhost_net_open(struct inode *inode, struct file *f)
n->vqs[i].done_idx = 0;
n->vqs[i].vhost_hlen = 0;
n->vqs[i].sock_hlen = 0;
+   n->vqs[i].rt = 0;
+   n->vqs[i].rh = 0;
}
vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
 
@@ -856,11 +888,15 @@ static struct socket *vhost_net_stop_vq(struct vhost_net *n,
struct vhost_virtqueue *vq)
 {
struct socket *sock;
+   struct vhost_net_virtqueue *nvq =
+   container_of(vq, struct vhost_net_virtqueue, vq);
 
mutex_lock(>mutex);
sock = vq->private_data;
vhost_net_disable_vq(n, vq);
vq->private_data = NULL;
+   while
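
The rh/rt bookkeeping used by peek_head_len_batched() and handle_rx() above can be sketched in userspace as follows. This is an illustration under stated assumptions, not the vhost code: the packet source, the structs and all sketch_ names are hypothetical, and the batch size is reduced to 4 to keep the example small.

```c
#include <stddef.h>

#define SKETCH_RX_BATCH 4

struct sketch_pkt {
	int len;
};

/* Toy packet source standing in for the exported skb array. */
struct sketch_source {
	struct sketch_pkt *pkts;
	int count;
	int next;
};

struct sketch_rxq {
	struct sketch_source *src;
	struct sketch_pkt *cache[SKETCH_RX_BATCH];
	int rh;	/* index of the next cached packet to hand out */
	int rt;	/* number of packets currently cached */
};

/* Dequeue up to n packets from the source in one call. */
static int sketch_source_consume_batched(struct sketch_source *s,
					 struct sketch_pkt **array, int n)
{
	int i = 0;

	while (i < n && s->next < s->count)
		array[i++] = &s->pkts[s->next++];
	return i;
}

/* Length of the head packet; the cache is refilled only once drained. */
static int sketch_peek_head_len(struct sketch_rxq *q)
{
	if (q->rh == q->rt) {
		q->rh = 0;
		q->rt = sketch_source_consume_batched(q->src, q->cache,
						      SKETCH_RX_BATCH);
		if (!q->rt)
			return 0;
	}
	return q->cache[q->rh]->len;
}

/* Hand out the head packet; call only after a non-zero peek. */
static struct sketch_pkt *sketch_take_head(struct sketch_rxq *q)
{
	return q->cache[q->rh++];
}
```

Peeking repeatedly without taking returns the same head, and the source is touched once per batch rather than once per packet.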

[PATCH net-next 3/8] tun: export skb_array

2017-03-20 Thread Jason Wang

This patch exports skb_array through tun_get_skb_array(). The caller can
then manipulate the skb array directly.

Signed-off-by: Jason Wang 
---
 drivers/net/tun.c  | 13 +
 include/linux/if_tun.h |  5 +
 2 files changed, 18 insertions(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index c418f0a..70dd9ec 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2613,6 +2613,19 @@ struct socket *tun_get_socket(struct file *file)
 }
 EXPORT_SYMBOL_GPL(tun_get_socket);
 
+struct skb_array *tun_get_skb_array(struct file *file)
+{
+   struct tun_file *tfile;
+
+   if (file->f_op != &tun_fops)
+   return ERR_PTR(-EINVAL);
+   tfile = file->private_data;
+   if (!tfile)
+   return ERR_PTR(-EBADFD);
+   return &tfile->tx_array;
+}
+EXPORT_SYMBOL_GPL(tun_get_skb_array);
+
 module_init(tun_init);
 module_exit(tun_cleanup);
 MODULE_DESCRIPTION(DRV_DESCRIPTION);
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index ed6da2e..bf9bdf4 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -19,6 +19,7 @@
 
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
 struct socket *tun_get_socket(struct file *);
+struct skb_array *tun_get_skb_array(struct file *file);
 #else
 #include 
 #include 
@@ -28,5 +29,9 @@ static inline struct socket *tun_get_socket(struct file *f)
 {
return ERR_PTR(-EINVAL);
 }
+static inline struct skb_array *tun_get_skb_array(struct file *f)
+{
+   return ERR_PTR(-EINVAL);
+}
 #endif /* CONFIG_TUN */
 #endif /* __IF_TUN_H */
-- 
2.7.4

[PATCH net-next 8/8] vhost_net: use lockless peeking for skb array during busy polling

2017-03-20 Thread Jason Wang

For a socket that exports its skb array, we can use lockless peeking
to avoid touching the spinlock during busy polling.

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 53f09f2..41153a3 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -551,10 +551,13 @@ static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
return len;
 }
 
-static int sk_has_rx_data(struct sock *sk)
+static int sk_has_rx_data(struct vhost_net_virtqueue *rvq, struct sock *sk)
 {
struct socket *sock = sk->sk_socket;
 
+   if (rvq->rx_array)
+   return !__skb_array_empty(rvq->rx_array);
+
if (sock->ops->peek_len)
return sock->ops->peek_len(sock);
 
@@ -579,7 +582,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net,
endtime = busy_clock() + vq->busyloop_timeout;
 
while (vhost_can_busy_poll(>dev, endtime) &&
-  !sk_has_rx_data(sk) &&
+  !sk_has_rx_data(rvq, sk) &&
   vhost_vq_avail_empty(>dev, vq))
cpu_relax();
 
-- 
2.7.4

[PATCH net-next 4/8] tap: export skb_array

2017-03-20 Thread Jason Wang

This patch exports skb_array through tap_get_skb_array(). The caller can
then manipulate the skb array directly.

Signed-off-by: Jason Wang 
---
 drivers/net/tap.c  | 13 +
 include/linux/if_tap.h |  5 +
 2 files changed, 18 insertions(+)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 4d4173d..abdaf86 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1193,6 +1193,19 @@ struct socket *tap_get_socket(struct file *file)
 }
 EXPORT_SYMBOL_GPL(tap_get_socket);
 
+struct skb_array *tap_get_skb_array(struct file *file)
+{
+   struct tap_queue *q;
+
+   if (file->f_op != &tap_fops)
+   return ERR_PTR(-EINVAL);
+   q = file->private_data;
+   if (!q)
+   return ERR_PTR(-EBADFD);
+   return &q->skb_array;
+}
+EXPORT_SYMBOL_GPL(tap_get_skb_array);
+
 int tap_queue_resize(struct tap_dev *tap)
 {
struct net_device *dev = tap->dev;
diff --git a/include/linux/if_tap.h b/include/linux/if_tap.h
index 3482c3c..4837157 100644
--- a/include/linux/if_tap.h
+++ b/include/linux/if_tap.h
@@ -3,6 +3,7 @@
 
 #if IS_ENABLED(CONFIG_TAP)
 struct socket *tap_get_socket(struct file *);
+struct skb_array *tap_get_skb_array(struct file *file);
 #else
 #include 
 #include 
@@ -12,6 +13,10 @@ static inline struct socket *tap_get_socket(struct file *f)
 {
return ERR_PTR(-EINVAL);
 }
+static inline struct skb_array *tap_get_skb_array(struct file *f)
+{
+   return ERR_PTR(-EINVAL);
+}
 #endif /* CONFIG_TAP */
 
 #include 
-- 
2.7.4

[PATCH net-next 0/8] vhost-net rx batching

2017-03-20 Thread Jason Wang

Hi all:

This series implements rx batching for vhost-net. This is done by
batching the dequeuing from the skb_array exported by the underlying
socket and passing the skb back through msg_control to finish the
userspace copy.

Tests show up to a 19% improvement in rx pps.

Please review.

Thanks

Jason Wang (8):
  ptr_ring: introduce batch dequeuing
  skb_array: introduce batch dequeuing
  tun: export skb_array
  tap: export skb_array
  tun: support receiving skb through msg_control
  tap: support receiving skb from msg_control
  vhost_net: try batch dequing from skb array
  vhost_net: use lockless peeking for skb array during busy polling

 drivers/net/tap.c | 25 ++---
 drivers/net/tun.c | 31 +++--
 drivers/vhost/net.c   | 71 +++
 include/linux/if_tap.h|  5 
 include/linux/if_tun.h|  5 
 include/linux/ptr_ring.h  | 65 +++
 include/linux/skb_array.h | 25 +
 7 files changed, 209 insertions(+), 18 deletions(-)

-- 
2.7.4

[PATCH net-next 6/8] tap: support receiving skb from msg_control

2017-03-20 Thread Jason Wang

This patch lets tap_recvmsg() receive an skb from its caller through
msg_control. vhost_net will be the first user.

Signed-off-by: Jason Wang 
---
 drivers/net/tap.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index abdaf86..07d9174 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -824,15 +824,17 @@ static ssize_t tap_put_user(struct tap_queue *q,
 
 static ssize_t tap_do_read(struct tap_queue *q,
   struct iov_iter *to,
-  int noblock)
+  int noblock, struct sk_buff *skb)
 {
DEFINE_WAIT(wait);
-   struct sk_buff *skb;
ssize_t ret = 0;
 
if (!iov_iter_count(to))
return 0;
 
+   if (skb)
+   goto done;
+
while (1) {
if (!noblock)
prepare_to_wait(sk_sleep(&q->sk), &wait,
@@ -856,6 +858,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
if (!noblock)
finish_wait(sk_sleep(&q->sk), &wait);
 
+done:
if (skb) {
ret = tap_put_user(q, skb, to);
if (unlikely(ret < 0))
@@ -872,7 +875,7 @@ static ssize_t tap_read_iter(struct kiocb *iocb, struct iov_iter *to)
struct tap_queue *q = file->private_data;
ssize_t len = iov_iter_count(to), ret;
 
-   ret = tap_do_read(q, to, file->f_flags & O_NONBLOCK);
+   ret = tap_do_read(q, to, file->f_flags & O_NONBLOCK, NULL);
ret = min_t(ssize_t, ret, len);
if (ret > 0)
iocb->ki_pos = ret;
@@ -1155,7 +1158,8 @@ static int tap_recvmsg(struct socket *sock, struct msghdr *m,
int ret;
if (flags & ~(MSG_DONTWAIT|MSG_TRUNC))
return -EINVAL;
-   ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT);
+   ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT,
+ m->msg_control);
if (ret > total_len) {
m->msg_flags |= MSG_TRUNC;
ret = flags & MSG_TRUNC ? ret : total_len;
-- 
2.7.4

[PATCH net-next v6 2/3] Add a eBPF helper function to retrieve socket uid

2017-03-20 Thread Chenbo Feng

From: Chenbo Feng 

Returns the owner uid of the socket inside a sk_buff. This is useful to
perform per-UID accounting of network traffic or per-UID packet
filtering. The socket needs to be a fullsock, otherwise overflowuid is
returned.

Signed-off-by: Chenbo Feng 
---
 include/uapi/linux/bpf.h   |  9 -
 net/core/filter.c  | 22 ++
 tools/include/uapi/linux/bpf.h |  3 ++-
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index dc81a9f..ff42111 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -462,6 +462,12 @@ union bpf_attr {
  * @skb: pointer to skb
  * Return: 8 Bytes non-decreasing number on success or 0 if the socket
  * field is missing inside sk_buff
+ *
+ * u32 bpf_get_socket_uid(skb)
+ * Get the owner uid of the socket stored inside sk_buff.
+ * @skb: pointer to skb
+ * Return: uid of the socket owner on success or overflowuid if the socket
+ * pointer inside sk_buff is NULL or the socket is not a full socket
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -510,7 +516,8 @@ union bpf_attr {
FN(skb_change_head),\
FN(xdp_adjust_head),\
FN(probe_read_str), \
-   FN(get_socket_cookie),
+   FN(get_socket_cookie),  \
+   FN(get_socket_uid),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 5b65ae3..2f022df 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2612,6 +2612,24 @@ static const struct bpf_func_proto bpf_get_socket_cookie_proto = {
.arg1_type  = ARG_PTR_TO_CTX,
 };
 
+BPF_CALL_1(bpf_get_socket_uid, struct sk_buff *, skb)
+{
+   struct sock *sk = sk_to_full_sk(skb->sk);
+   kuid_t kuid;
+
+   if (!sk || !sk_fullsock(sk))
+   return overflowuid;
+   kuid = sock_net_uid(sock_net(sk), sk);
+   return from_kuid_munged(current_user_ns(), kuid);
+}
+
+static const struct bpf_func_proto bpf_get_socket_uid_proto = {
+   .func   = bpf_get_socket_uid,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+};
+
 static const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
@@ -2648,6 +2666,8 @@ sk_filter_func_proto(enum bpf_func_id func_id)
return &bpf_skb_load_bytes_proto;
case BPF_FUNC_get_socket_cookie:
return &bpf_get_socket_cookie_proto;
+   case BPF_FUNC_get_socket_uid:
+   return &bpf_get_socket_uid_proto;
default:
return bpf_base_func_proto(func_id);
}
@@ -2709,6 +2729,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
return &bpf_skb_under_cgroup_proto;
case BPF_FUNC_get_socket_cookie:
return &bpf_get_socket_cookie_proto;
+   case BPF_FUNC_get_socket_uid:
+   return &bpf_get_socket_uid_proto;
default:
return bpf_base_func_proto(func_id);
}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a94bdd3..4a2d56d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -504,7 +504,8 @@ union bpf_attr {
FN(skb_change_head),\
FN(xdp_adjust_head),\
FN(probe_read_str), \
-   FN(get_socket_cookie),
+   FN(get_socket_cookie),  \
+   FN(get_socket_uid),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
-- 
2.7.4
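
The fallback behaviour of the helper can be shown with a userspace mock. This is a sketch only: the structs and the sketch_ names are hypothetical, user namespace translation is left out, and 65534 is used as the overflow uid on the assumption that the kernel default has not been changed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_OVERFLOWUID 65534u	/* assumed default overflow uid */

struct sketch_sock {
	bool fullsock;	/* false for e.g. request or timewait sockets */
	uint32_t uid;	/* owner uid of the socket */
};

struct sketch_skb {
	struct sketch_sock *sk;	/* may be NULL */
};

/* Return the owner uid, or the overflow uid when the packet has no
 * socket or the socket is not a full socket.
 */
static uint32_t sketch_get_socket_uid(const struct sketch_skb *skb)
{
	const struct sketch_sock *sk = skb->sk;

	if (!sk || !sk->fullsock)
		return SKETCH_OVERFLOWUID;
	return sk->uid;
}
```

Returning the overflow uid rather than 0 keeps the "unknown" case from being mistaken for root in per-UID accounting.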

[PATCH net-next v6 1/3] Add a helper function to get socket cookie in eBPF

2017-03-20 Thread Chenbo Feng

From: Chenbo Feng 

Retrieve the socket cookie generated by sock_gen_cookie() from a sk_buff
with a known socket. Generates a new cookie if one was not yet set. If
the socket pointer inside sk_buff is NULL, 0 is returned. The helper
function could be useful for monitoring per-socket network traffic
statistics, and it provides a unique socket identifier per namespace.

Signed-off-by: Chenbo Feng 
---
 include/linux/sock_diag.h  |  1 +
 include/uapi/linux/bpf.h   |  9 -
 net/core/filter.c  | 17 +
 net/core/sock_diag.c   |  2 +-
 tools/include/uapi/linux/bpf.h |  3 ++-
 5 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/include/linux/sock_diag.h b/include/linux/sock_diag.h
index a0596ca0..a2f8109 100644
--- a/include/linux/sock_diag.h
+++ b/include/linux/sock_diag.h
@@ -24,6 +24,7 @@ void sock_diag_unregister(const struct sock_diag_handler *h);
 void sock_diag_register_inet_compat(int (*fn)(struct sk_buff *skb, struct nlmsghdr *nlh));
 void sock_diag_unregister_inet_compat(int (*fn)(struct sk_buff *skb, struct nlmsghdr *nlh));
 
+u64 sock_gen_cookie(struct sock *sk);
 int sock_diag_check_cookie(struct sock *sk, const __u32 *cookie);
 void sock_diag_save_cookie(struct sock *sk, __u32 *cookie);
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0539a0c..dc81a9f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -456,6 +456,12 @@ union bpf_attr {
  * Return:
  *   > 0 length of the string including the trailing NUL on success
  *   < 0 error
+ *
+ * u64 bpf_get_socket_cookie(skb)
+ * Get the cookie for the socket stored inside sk_buff.
+ * @skb: pointer to skb
+ * Return: 8 Bytes non-decreasing number on success or 0 if the socket
+ * field is missing inside sk_buff
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -503,7 +509,8 @@ union bpf_attr {
FN(get_numa_node_id),   \
FN(skb_change_head),\
FN(xdp_adjust_head),\
-   FN(probe_read_str),
+   FN(probe_read_str), \
+   FN(get_socket_cookie),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index ebaeaf2..5b65ae3 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2599,6 +2600,18 @@ static const struct bpf_func_proto bpf_xdp_event_output_proto = {
.arg5_type  = ARG_CONST_SIZE,
 };
 
+BPF_CALL_1(bpf_get_socket_cookie, struct sk_buff *, skb)
+{
+   return skb->sk ? sock_gen_cookie(skb->sk) : 0;
+}
+
+static const struct bpf_func_proto bpf_get_socket_cookie_proto = {
+   .func   = bpf_get_socket_cookie,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+};
+
 static const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
@@ -2633,6 +2646,8 @@ sk_filter_func_proto(enum bpf_func_id func_id)
switch (func_id) {
case BPF_FUNC_skb_load_bytes:
return &bpf_skb_load_bytes_proto;
+   case BPF_FUNC_get_socket_cookie:
+   return &bpf_get_socket_cookie_proto;
default:
return bpf_base_func_proto(func_id);
}
@@ -2692,6 +2707,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
return &bpf_get_smp_processor_id_proto;
case BPF_FUNC_skb_under_cgroup:
return &bpf_skb_under_cgroup_proto;
+   case BPF_FUNC_get_socket_cookie:
+   return &bpf_get_socket_cookie_proto;
default:
return bpf_base_func_proto(func_id);
}
diff --git a/net/core/sock_diag.c b/net/core/sock_diag.c
index 6b10573..acd2a6c 100644
--- a/net/core/sock_diag.c
+++ b/net/core/sock_diag.c
@@ -19,7 +19,7 @@ static int (*inet_rcv_compat)(struct sk_buff *skb, struct nlmsghdr *nlh);
 static DEFINE_MUTEX(sock_diag_table_mutex);
 static struct workqueue_struct *broadcast_wq;
 
-static u64 sock_gen_cookie(struct sock *sk)
+u64 sock_gen_cookie(struct sock *sk)
 {
while (1) {
u64 res = atomic64_read(&sk->sk_cookie);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 0539a0c..a94bdd3 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -503,7 +503,8 @@ union bpf_attr {
FN(get_numa_node_id),   \
FN(skb_change_head),\
FN(xdp_adjust_head),\
-   FN(probe_read_str),
+   FN(probe_read_str), \
+   FN(get_socket_cookie),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
-- 
2.7.4
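
The "generate a cookie if one was not yet set" behaviour can be sketched with C11 atomics. This is a userspace illustration, not the kernel function: the struct, the single global counter (the commit message describes the identifier as unique per namespace) and the sketch_ names are assumptions of the example.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_sock {
	_Atomic uint64_t cookie;	/* 0 means "not assigned yet" */
};

/* Hypothetical global generator; the first cookie handed out is 1. */
static _Atomic uint64_t sketch_cookie_gen;

/* Return the socket's cookie, assigning a fresh non-zero one on first use. */
static uint64_t sketch_gen_cookie(struct sketch_sock *sk)
{
	for (;;) {
		uint64_t res = atomic_load(&sk->cookie);
		uint64_t unset = 0;

		if (res)
			return res;
		res = atomic_fetch_add(&sketch_cookie_gen, 1) + 1;
		/* Only the first caller installs its value; a caller that
		 * loses the race loops and reads the winner's cookie.
		 */
		atomic_compare_exchange_strong(&sk->cookie, &unset, res);
	}
}

/* Mirror of the helper's contract: 0 when there is no socket. */
static uint64_t sketch_get_socket_cookie(struct sketch_sock *sk)
{
	return sk ? sketch_gen_cookie(sk) : 0;
}
```

Because 0 is reserved for "no socket", a program reading the helper's return value can tell the two cases apart.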

[PATCH net-next v6 0/3] net: core: Two Helper function about socket information

2017-03-20 Thread Chenbo Feng

From: Chenbo Feng 

Introduce two eBPF helper functions to get the socket cookie and
socket uid for each packet. The helper functions are useful when
the *sk field inside sk_buff is not empty. These helper functions
can be used in socket- and uid-based traffic monitoring programs.

Change since V5:
* Delete unnecessary blank lines in sample program.
* Refine the variable orders in get_uid helper function.

Change since V4:
* Using current user namespace to get uid instead of using init_ns.
* Add build setup for the example program to the Makefile.
* Change the name style of the example program binaries.

Change since V3:
* Fixed some typos and incorrect comments in sample program
* replaced raw insns with BPF_STX_XADD and add it to libbpf.h
* Use a temp dir as mount point instead and added a check for
  the user input string.
* Make the get uid helper function return the user namespace uid
  instead of kuid.
* Return overflowuid instead of 0 when no uid information is found.

Change since V2:
* Add a sample program to demonstrate the usage of the helper function.
* Moved the helper function proto invoking place.
* Add function header into tools/include
* Apply sk_to_full_sk() before getting uid.

Change since V1:
* Removed the unnecessary declarations and export command
* resolved conflict with master branch.
* Examine if the socket is a full socket before getting the uid.


Chenbo Feng (3):
  Add a helper function to get socket cookie in eBPF
  Add a eBPF helper function to retrieve socket uid
  A Sample of using socket cookie and uid for traffic monitoring

 include/linux/sock_diag.h|   1 +
 include/uapi/linux/bpf.h |  16 +-
 net/core/filter.c|  39 +
 net/core/sock_diag.c |   2 +-
 samples/bpf/Makefile |   3 +
 samples/bpf/cookie_uid_helper_example.c  | 217 +++
 samples/bpf/libbpf.h |  10 ++
 samples/bpf/run_cookie_uid_helper_example.sh |  14 ++
 tools/include/uapi/linux/bpf.h   |   4 +-
 9 files changed, 303 insertions(+), 3 deletions(-)
 create mode 100644 samples/bpf/cookie_uid_helper_example.c
 create mode 100755 samples/bpf/run_cookie_uid_helper_example.sh

-- 
2.7.4
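
As a rough picture of the per-socket accounting these helpers enable, the sketch below keeps a small table keyed by socket cookie, with the owner uid stored alongside packet and byte counters. It is a plain userspace mock, not the sample program from patch 3/3 and not a bpf map: the table layout, its size and all sketch_ names are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ENTRIES 16

struct sketch_stats {
	uint32_t uid;		/* owner uid of the socket */
	uint64_t packets;
	uint64_t bytes;
};

struct sketch_entry {
	uint64_t cookie;	/* key: socket cookie */
	struct sketch_stats stats;
	int used;
};

struct sketch_map {
	struct sketch_entry entries[SKETCH_MAX_ENTRIES];
};

/* Linear lookup by cookie; returns NULL if the cookie is unknown. */
static struct sketch_stats *sketch_map_lookup(struct sketch_map *m,
					      uint64_t cookie)
{
	int i;

	for (i = 0; i < SKETCH_MAX_ENTRIES; i++)
		if (m->entries[i].used && m->entries[i].cookie == cookie)
			return &m->entries[i].stats;
	return NULL;
}

/* Account one packet of len bytes; returns -1 if the table is full. */
static int sketch_account(struct sketch_map *m, uint64_t cookie,
			  uint32_t uid, uint32_t len)
{
	struct sketch_stats *s = sketch_map_lookup(m, cookie);
	int i;

	if (s) {
		s->packets++;
		s->bytes += len;
		return 0;
	}
	for (i = 0; i < SKETCH_MAX_ENTRIES; i++) {
		if (!m->entries[i].used) {
			m->entries[i].used = 1;
			m->entries[i].cookie = cookie;
			m->entries[i].stats.uid = uid;
			m->entries[i].stats.packets = 1;
			m->entries[i].stats.bytes = len;
			return 0;
		}
	}
	return -1;
}
```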

[PATCH net-next v6 3/3] A Sample of using socket cookie and uid for traffic monitoring

2017-03-20 Thread Chenbo Feng

From: Chenbo Feng 

Add a sample program to demonstrate the possible usage of the
get_socket_cookie and get_socket_uid helper functions. The program
counts bytes and packets of in/out traffic monitored by iptables
and stores the stats in a bpf map on a per-socket basis. The owner uid
of the socket is stored as part of the data entry. A shell script for
running the program is also included.

Signed-off-by: Chenbo Feng 
---
 samples/bpf/Makefile |   3 +
 samples/bpf/cookie_uid_helper_example.c  | 217 +++
 samples/bpf/libbpf.h |  10 ++
 samples/bpf/run_cookie_uid_helper_example.sh |  14 ++
 4 files changed, 244 insertions(+)
 create mode 100644 samples/bpf/cookie_uid_helper_example.c
 create mode 100755 samples/bpf/run_cookie_uid_helper_example.sh

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 09e9d53..f803f51 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -34,6 +34,7 @@ hostprogs-y += sampleip
 hostprogs-y += tc_l2_redirect
 hostprogs-y += lwt_len_hist
 hostprogs-y += xdp_tx_iptunnel
+hostprogs-y += per_socket_stats_example
 
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o
@@ -72,6 +73,7 @@ sampleip-objs := bpf_load.o $(LIBBPF) sampleip_user.o
 tc_l2_redirect-objs := bpf_load.o $(LIBBPF) tc_l2_redirect_user.o
 lwt_len_hist-objs := bpf_load.o $(LIBBPF) lwt_len_hist_user.o
 xdp_tx_iptunnel-objs := bpf_load.o $(LIBBPF) xdp_tx_iptunnel_user.o
+per_socket_stats_example-objs := $(LIBBPF) cookie_uid_helper_example.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -105,6 +107,7 @@ always += trace_event_kern.o
 always += sampleip_kern.o
 always += lwt_len_hist_kern.o
 always += xdp_tx_iptunnel_kern.o
+always += cookie_uid_helper_example.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/cookie_uid_helper_example.c 
b/samples/bpf/cookie_uid_helper_example.c
new file mode 100644
index 000..f6e5e58
--- /dev/null
+++ b/samples/bpf/cookie_uid_helper_example.c
@@ -0,0 +1,217 @@
+/* This test is a demo of using get_socket_uid and get_socket_cookie
+ * helper function to do per socket based network traffic monitoring.
+ * It requires iptables version higher than 1.6.1 to load pinned eBPF
+ * program into the xt_bpf match.
+ *
+ * TEST:
+ * ./run_cookie_uid_helper_example.sh
+ * Then generate some traffic in various ways. ping 0 -c 10 would work
+ * but the cookie and uid in this case could both be 0. A sample output
+ * with some traffic generated by web browser is shown below:
+ *
+ * cookie: 877, uid: 0x3e8, Pakcet Count: 20, Bytes Count: 11058
+ * cookie: 132, uid: 0x0, Pakcet Count: 2, Bytes Count: 286
+ * cookie: 812, uid: 0x3e8, Pakcet Count: 3, Bytes Count: 1726
+ * cookie: 802, uid: 0x3e8, Pakcet Count: 2, Bytes Count: 104
+ * cookie: 877, uid: 0x3e8, Pakcet Count: 20, Bytes Count: 11058
+ * cookie: 831, uid: 0x3e8, Pakcet Count: 2, Bytes Count: 104
+ * cookie: 0, uid: 0x0, Pakcet Count: 6, Bytes Count: 712
+ * cookie: 880, uid: 0xfffe, Pakcet Count: 1, Bytes Count: 70
+ *
+ * Clean up: if using the shell script, it deletes the iptables rule and
+ * unmounts the bpf program on exit. Otherwise the iptables rule needs
+ * to be deleted by hand; see run_cookie_uid_helper_example.sh for details.
+ */
+
+#define _GNU_SOURCE
+
+#define offsetof(type, member) __builtin_offsetof(type, member)
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "libbpf.h"
+
+struct stats {
+   uint32_t uid;
+   uint64_t packets;
+   uint64_t bytes;
+};
+
+static int map_fd, prog_fd;
+
+static void maps_create(void)
+{
+   map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(uint32_t),
+   sizeof(struct stats), 100, 0);
+   if (map_fd < 0)
+   error(1, errno, "map create failed!\n");
+}
+
+static void prog_load(void)
+{
+   static char log_buf[1 << 16];
+
+   struct bpf_insn prog[] = {
+   /*
+* Save sk_buff for future usage. Values stored in R6 to R10 will
+* not be reset after a bpf helper function call.
+*/
+   BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+   /*
+* pc1: BPF_FUNC_get_socket_cookie takes one parameter,
+* R1: sk_buff
+*/
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+   BPF_FUNC_get_socket_cookie),
+   /* pc2-4: save  to r7 for future usage*/
+   BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_0, -8),
+   BPF_MOV64_REG(BPF_REG_7, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8),
+   /*
+*

[PATCH net-next 1/2] net: dwc-xlgmac: declaration of dual license in headers

2017-03-20 Thread Jie Deng

This patch adds declaration of dual license in file headers.

Signed-off-by: Jie Deng 
---
 drivers/net/ethernet/synopsys/dwc-xlgmac-common.c | 6 ++
 drivers/net/ethernet/synopsys/dwc-xlgmac-desc.c   | 6 ++
 drivers/net/ethernet/synopsys/dwc-xlgmac-hw.c | 6 ++
 drivers/net/ethernet/synopsys/dwc-xlgmac-net.c| 6 ++
 drivers/net/ethernet/synopsys/dwc-xlgmac-pci.c| 6 ++
 drivers/net/ethernet/synopsys/dwc-xlgmac-reg.h| 6 ++
 drivers/net/ethernet/synopsys/dwc-xlgmac.h| 6 ++
 7 files changed, 14 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/synopsys/dwc-xlgmac-common.c 
b/drivers/net/ethernet/synopsys/dwc-xlgmac-common.c
index 726d78a..53ad707 100644
--- a/drivers/net/ethernet/synopsys/dwc-xlgmac-common.c
+++ b/drivers/net/ethernet/synopsys/dwc-xlgmac-common.c
@@ -2,10 +2,8 @@
  *
  * Copyright (c) 2017 Synopsys, Inc. (www.synopsys.com)
  *
- * This program is free software; you can redistribute it and/or modify it
- * under  the terms of  the GNU General  Public License as published by the
- * Free Software Foundation;  either version 2 of the  License, or (at your
- * option) any later version.
+ * This program is dual-licensed; you may select either version 2 of
+ * the GNU General Public License ("GPL") or BSD license ("BSD").
  *
  * This Synopsys DWC XLGMAC software driver and associated documentation
  * (hereinafter the "Software") is an unsupported proprietary work of
diff --git a/drivers/net/ethernet/synopsys/dwc-xlgmac-desc.c 
b/drivers/net/ethernet/synopsys/dwc-xlgmac-desc.c
index 55c796e..bfe810e 100644
--- a/drivers/net/ethernet/synopsys/dwc-xlgmac-desc.c
+++ b/drivers/net/ethernet/synopsys/dwc-xlgmac-desc.c
@@ -2,10 +2,8 @@
  *
  * Copyright (c) 2017 Synopsys, Inc. (www.synopsys.com)
  *
- * This program is free software; you can redistribute it and/or modify it
- * under  the terms of  the GNU General  Public License as published by the
- * Free Software Foundation;  either version 2 of the  License, or (at your
- * option) any later version.
+ * This program is dual-licensed; you may select either version 2 of
+ * the GNU General Public License ("GPL") or BSD license ("BSD").
  *
  * This Synopsys DWC XLGMAC software driver and associated documentation
  * (hereinafter the "Software") is an unsupported proprietary work of
diff --git a/drivers/net/ethernet/synopsys/dwc-xlgmac-hw.c 
b/drivers/net/ethernet/synopsys/dwc-xlgmac-hw.c
index 5cf3e90..e2a58ec 100644
--- a/drivers/net/ethernet/synopsys/dwc-xlgmac-hw.c
+++ b/drivers/net/ethernet/synopsys/dwc-xlgmac-hw.c
@@ -2,10 +2,8 @@
  *
  * Copyright (c) 2017 Synopsys, Inc. (www.synopsys.com)
  *
- * This program is free software; you can redistribute it and/or modify it
- * under  the terms of  the GNU General  Public License as published by the
- * Free Software Foundation;  either version 2 of the  License, or (at your
- * option) any later version.
+ * This program is dual-licensed; you may select either version 2 of
+ * the GNU General Public License ("GPL") or BSD license ("BSD").
  *
  * This Synopsys DWC XLGMAC software driver and associated documentation
  * (hereinafter the "Software") is an unsupported proprietary work of
diff --git a/drivers/net/ethernet/synopsys/dwc-xlgmac-net.c 
b/drivers/net/ethernet/synopsys/dwc-xlgmac-net.c
index 5e8428b..6acf86c 100644
--- a/drivers/net/ethernet/synopsys/dwc-xlgmac-net.c
+++ b/drivers/net/ethernet/synopsys/dwc-xlgmac-net.c
@@ -2,10 +2,8 @@
  *
  * Copyright (c) 2017 Synopsys, Inc. (www.synopsys.com)
  *
- * This program is free software; you can redistribute it and/or modify it
- * under  the terms of  the GNU General  Public License as published by the
- * Free Software Foundation;  either version 2 of the  License, or (at your
- * option) any later version.
+ * This program is dual-licensed; you may select either version 2 of
+ * the GNU General Public License ("GPL") or BSD license ("BSD").
  *
  * This Synopsys DWC XLGMAC software driver and associated documentation
  * (hereinafter the "Software") is an unsupported proprietary work of
diff --git a/drivers/net/ethernet/synopsys/dwc-xlgmac-pci.c 
b/drivers/net/ethernet/synopsys/dwc-xlgmac-pci.c
index 504e80d..386bafe 100644
--- a/drivers/net/ethernet/synopsys/dwc-xlgmac-pci.c
+++ b/drivers/net/ethernet/synopsys/dwc-xlgmac-pci.c
@@ -2,10 +2,8 @@
  *
  * Copyright (c) 2017 Synopsys, Inc. (www.synopsys.com)
  *
- * This program is free software; you can redistribute it and/or modify it
- * under  the terms of  the GNU General  Public License as published by the
- * Free Software Foundation;  either version 2 of the  License, or (at your
- * option) any later version.
+ * This program is dual-licensed; you may select either version 2 of
+ * the GNU General Public License ("GPL") or BSD license ("BSD").
  *
  * This Synopsys DWC XLGMAC software driver and associated documentation
  * (hereinafter the "Software") is an unsupported proprietary work of
diff --git

[PATCH net-next 2/2] net: dwc-xlgmac: add module license

2017-03-20 Thread Jie Deng

Fix the warning about missing MODULE_LICENSE().

Signed-off-by: Jie Deng 
---
 drivers/net/ethernet/synopsys/dwc-xlgmac-common.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/synopsys/dwc-xlgmac-common.c 
b/drivers/net/ethernet/synopsys/dwc-xlgmac-common.c
index 53ad707..07def2b 100644
--- a/drivers/net/ethernet/synopsys/dwc-xlgmac-common.c
+++ b/drivers/net/ethernet/synopsys/dwc-xlgmac-common.c
@@ -21,6 +21,8 @@
 #include "dwc-xlgmac.h"
 #include "dwc-xlgmac-reg.h"
 
+MODULE_LICENSE("Dual BSD/GPL");
+
 static int debug = -1;
 module_param(debug, int, 0644);
 MODULE_PARM_DESC(debug, "DWC ethernet debug level (0=none,...,16=all)");
-- 
1.9.1

Re: [PATCH 2/2] [net-next] net: dwc-xlgmac: add module license

2017-03-20 Thread Jie Deng

On 2017/3/20 16:51, Arnd Bergmann wrote:
> When building the driver as a module, we get a warning about the
> lack of a license:
>
> WARNING: modpost: missing MODULE_LICENSE() in 
> drivers/net/ethernet/synopsys/dwc-xlgmac.o
> see include/linux/module.h for more information
>
> Curiously the text in the .c files only mentions GPLv2+, while the license
> tag in the PCI driver contains both GPL and BSD. I picked the license text
> as the more definite reference here and put a GPL tag in there.
>
> Fixes: 65e0ace2c5cd ("net: dwc-xlgmac: Initial driver for DesignWare 
> Enterprise Ethernet")
> Signed-off-by: Arnd Bergmann 
> ---
>  drivers/net/ethernet/synopsys/dwc-xlgmac-common.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/ethernet/synopsys/dwc-xlgmac-common.c 
> b/drivers/net/ethernet/synopsys/dwc-xlgmac-common.c
> index 726d78ac4907..b72196ab647f 100644
> --- a/drivers/net/ethernet/synopsys/dwc-xlgmac-common.c
> +++ b/drivers/net/ethernet/synopsys/dwc-xlgmac-common.c
> @@ -25,6 +25,7 @@
>  
>  static int debug = -1;
>  module_param(debug, int, 0644);
> +MODULE_LICENSE("GPL");
>  MODULE_PARM_DESC(debug, "DWC ethernet debug level (0=none,...,16=all)");
>  static const u32 default_msg_level = (NETIF_MSG_LINK | NETIF_MSG_IFDOWN |
> NETIF_MSG_IFUP);
This driver uses a dual license. I will update the headers to include BSD. Thanks!

Re: [PATCH 1/2] [net-next] net: dwc-xlgmac: include dcbnl.h

2017-03-20 Thread Jie Deng



On 2017/3/20 16:51, Arnd Bergmann wrote:
> Without this header, we can run into a build error:
>
> drivers/net/ethernet/synopsys/dwc-xlgmac-hw.c: In function 
> 'xlgmac_config_queue_mapping':
> drivers/net/ethernet/synopsys/dwc-xlgmac-hw.c:1548:36: error: 
> 'IEEE_8021QAZ_MAX_TCS' undeclared (first use in this function)
>   prio_queues = min_t(unsigned int, IEEE_8021QAZ_MAX_TCS,
>
> Fixes: 65e0ace2c5cd ("net: dwc-xlgmac: Initial driver for DesignWare 
> Enterprise Ethernet")
> Signed-off-by: Arnd Bergmann 
> ---
>  drivers/net/ethernet/synopsys/dwc-xlgmac-hw.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/ethernet/synopsys/dwc-xlgmac-hw.c 
> b/drivers/net/ethernet/synopsys/dwc-xlgmac-hw.c
> index 5cf3e90d4834..1e25a86f6a27 100644
> --- a/drivers/net/ethernet/synopsys/dwc-xlgmac-hw.c
> +++ b/drivers/net/ethernet/synopsys/dwc-xlgmac-hw.c
> @@ -22,6 +22,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "dwc-xlgmac.h"
>  #include "dwc-xlgmac-reg.h"
Thanks.
Reviewed-by: Jie Deng

Re: [PATCH net-next v4 1/2]L2TP:Adjust intf MTU, add underlay L3, L2 hdrs

2017-03-20 Thread R. Parameswaran



Hi James,

Thanks for the response and suggestions, please see inline:

On Mon, 20 Mar 2017, James Chapman wrote:

> The patch comment of each patch should represent the changes of the
> patch. You seem to be using a common description for your two commits
> and this will look out of place when viewed using git log on one of the
> files modified by this patch. The patch summary line here is also
> inaccurate.
> 

For this specific patch, I was thinking of the following
header:

"New kernel API to get IP overhead on a socket.

A new API is needed to calculate the cumulative 
overhead imposed by the IP Header and IP options,
if any, on a socket's payload. Provided by the patch
here, this API is then used to determine the
default pseudowire MTU on an L2TP interface,
relative to the underlay MTU. The new API returns
an overhead of zero for sockets that do not belong
to the IPv4 or IPv6 address families."

Please feel free to edit or suggest changes.

> Are you using git format-patch? Its "Patch 0" can be useful to provide a
> summary description of a patch series to help reviewers.
>

Yes, I am using git format-patch, but was individually generating each
commit's patch. I just figured out how to generate a cover letter and
multiple patches in one shot with git format-patch, and will update with the
suggested changes in a day or so. I also tested the latest patch and
verified that it works correctly.

thanks,

Ramkumar


 
> James
> 
> On 18/03/17 01:53, R. Parameswaran wrote:
> > In existing kernel code, when setting up the L2TP interface, all of the
> > tunnel encapsulation headers are not taken into account when setting
> > up the MTU on the  L2TP logical interface device. Due to this, the
> > packets created by the applications on top of the L2TP layer are larger
> > than they ought to be, relative to the underlay MTU, which leads to
> > needless fragmentation once the L2TP packet is encapsulated in an outer IP
> > packet.  Specifically, the MTU calculation  does not take into account the
> > (outer) IP header imposed on the encapsulated L2TP packet, and the Layer 2
> > header imposed on the inner L2TP packet prior to encapsulation.
> >
> > Change-set here (1/2) introduces a new kernel API to compute the IP overhead
> > on an IPv4 or IPv6 socket, which is then used in the L2TP code-path.
> >
> > Signed-off-by: R. Parameswaran 
> > ---
> >  include/linux/net.h |  3 +++
> >  net/socket.c| 44 
> >  2 files changed, 47 insertions(+)
> >
> > diff --git a/include/linux/net.h b/include/linux/net.h
> > index 0620f5e..a42fab2 100644
> > --- a/include/linux/net.h
> > +++ b/include/linux/net.h
> > @@ -298,6 +298,9 @@ int kernel_sendpage(struct socket *sock, struct page 
> > *page, int offset,
> >  int kernel_sock_ioctl(struct socket *sock, int cmd, unsigned long arg);
> >  int kernel_sock_shutdown(struct socket *sock, enum sock_shutdown_cmd how);
> >  
> > +/* Following routine returns the IP overhead imposed by a socket.  */
> > +u32 kernel_sock_ip_overhead(struct sock *sk);
> > +
> >  #define MODULE_ALIAS_NETPROTO(proto) \
> > MODULE_ALIAS("net-pf-" __stringify(proto))
> >  
> > diff --git a/net/socket.c b/net/socket.c
> > index e034fe4..69598e1 100644
> > --- a/net/socket.c
> > +++ b/net/socket.c
> > @@ -3345,3 +3345,47 @@ int kernel_sock_shutdown(struct socket *sock, enum 
> > sock_shutdown_cmd how)
> > return sock->ops->shutdown(sock, how);
> >  }
> >  EXPORT_SYMBOL(kernel_sock_shutdown);
> > +
> > +/* This routine returns the IP overhead imposed by a socket i.e.
> > + * the length of the underlying IP header, depending on whether
> > + * this is an IPv4 or IPv6 socket and the length from IP options turned
> > + * on at the socket.
> > + */
> > +u32 kernel_sock_ip_overhead(struct sock *sk)
> > +{
> > +   struct inet_sock *inet;
> > +   struct ipv6_pinfo *np;
> > +   struct ip_options_rcu *opt;
> > +   struct ipv6_txoptions *optv6 = NULL;
> > +   u32 overhead = 0;
> > +   bool owned_by_user;
> > +
> > +   if (!sk)
> > +   return overhead;
> > +
> > +   owned_by_user = sock_owned_by_user(sk);
> > +   switch (sk->sk_family) {
> > +   case AF_INET:
> > +   inet = inet_sk(sk);
> > +   overhead += sizeof(struct iphdr);
> > +   opt = rcu_dereference_protected(inet->inet_opt,
> > +   owned_by_user);
> > +   if (opt)
> > +   overhead += opt->opt.optlen;
> > +   return overhead;
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +   case AF_INET6:
> > +   np = inet6_sk(sk);
> > +   overhead += sizeof(struct ipv6hdr);
> > +   if (np)
> > +   optv6 = rcu_dereference_protected(np->opt,
> > + owned_by_user);
> > +   if (optv6)
> > +   overhead += (optv6->opt_flen + optv6->opt_nflen);
> > +   return overhead;
> > +#endif /*

[PATCH net-next] liquidio: fix Coverity scan errors

2017-03-20 Thread Felix Manlunas

Fix Coverity scan errors by not dereferencing lio->glists_dma_base pointer
if it's NULL.

See http://marc.info/?l=linux-netdev&m=149002294305614&w=2

Reported-by: Stephen Hemminger 
Signed-off-by: Felix Manlunas 
Signed-off-by: VSR Burru 
---
 drivers/net/ethernet/cavium/liquidio/lio_main.c| 3 ++-
 drivers/net/ethernet/cavium/liquidio/lio_vf_main.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c 
b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 761061b..72b69bd 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -748,7 +748,8 @@ static void delete_glists(struct lio *lio)
kfree(g);
} while (g);
 
-   if (lio->glists_virt_base && lio->glists_virt_base[i]) {
+   if (lio->glists_virt_base && lio->glists_virt_base[i] &&
+   lio->glists_dma_base && lio->glists_dma_base[i]) {
lio_dma_free(lio->oct_dev,
 lio->glist_entry_size * lio->tx_qsize,
 lio->glists_virt_base[i],
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c 
b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
index 5ec5c24..8d9db23 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
@@ -506,7 +506,8 @@ static void delete_glists(struct lio *lio)
kfree(g);
} while (g);
 
-   if (lio->glists_virt_base && lio->glists_virt_base[i]) {
+   if (lio->glists_virt_base && lio->glists_virt_base[i] &&
+   lio->glists_dma_base && lio->glists_dma_base[i]) {
lio_dma_free(lio->oct_dev,
 lio->glist_entry_size * lio->tx_qsize,
 lio->glists_virt_base[i],

Re: [PATCH] enic: Store permanent MAC address during probe()

2017-03-20 Thread Govindarajulu Varadarajan


On Mon, 20 Mar 2017, PJ Waskiewicz wrote:

> On Mon, Mar 20, 2017 at 5:33 PM, Govindarajulu Varadarajan
>  wrote:
>> On Mon, 20 Mar 2017, PJ Waskiewicz wrote:
>>
>>> From: PJ Waskiewicz 
>>>
>>> The permanent MAC address is useful to store for things like ethtool,
>>> and when bonding with modes such as active/passive or LACP.
>>
>> Is this patch fixing an issue with bonding driver on enic?
>
> We noticed that running ethtool -P  on an enic, even on 4.9,
> returned nothing.  This has fallout when using bonding, where LACP or
> Active/Passive overrides the LAA on one of the slaves, one can't
> figure out what the physical MAC address is of each slave.  So not a
> problem with bonding directly, it's more secondary as a result of the
> driver not reporting the actual permanent address.
>
>>> This follows the model of other Ethernet drivers, such as ixgbe.
>>
>> While other drivers set netdev->perm_addr, doesn't this actually come free
>> in register_netdevice()?
>
> I thought it did as well, but in 4.9 when we tested it wasn't working.
> Hence the patch.  :-)

Can you try with net-next? In my setup I do not see the issue on net-next and on
4.9 kernel. The issue for all drivers was fixed in
948b337e62ca9 ("net: init perm_addr in register_netdevice()")

[PATCH net v3 1/1] net: tcp: Permit user set TCP_MAXSEG to default value

2017-03-20 Thread fgao

From: Gao Feng 

When user_mss is zero, the default value is used. But the current
code does not permit the user to set TCP_MAXSEG to the default value;
it returns -EINVAL when val is zero.

Signed-off-by: Gao Feng 
---
 v3: Correct the logic error, per Neal
 v2: Make the code clearer, per Eric
 v1: initial version

 net/ipv4/tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1e319a5..4f7f163 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2470,7 +2470,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
/* Values greater than interface MTU won't take effect. However
 * at the point when this call is done we typically don't yet
 * know which interface is going to be used */
-   if (val < TCP_MIN_MSS || val > MAX_TCP_WINDOW) {
+   if (val && (val < TCP_MIN_MSS || val > MAX_TCP_WINDOW)) {
err = -EINVAL;
break;
}
-- 
1.9.1

Re: [PATCH net v2 1/1] net: tcp: Permit user set TCP_MAXSEG to default value

2017-03-20 Thread Feng Gao

On Tue, Mar 21, 2017 at 9:23 AM, Neal Cardwell  wrote:
> On Mon, Mar 20, 2017 at 8:45 PM,  wrote:
>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>> index 1e319a5..4f7f163 100644
>> --- a/net/ipv4/tcp.c
>> +++ b/net/ipv4/tcp.c
>> @@ -2470,7 +2470,7 @@ static int do_tcp_setsockopt(struct sock *sk, int 
>> level,
>> /* Values greater than interface MTU won't take effect. 
>> However
>>  * at the point when this call is done we typically don't yet
>>  * know which interface is going to be used */
>> -   if (val < TCP_MIN_MSS || val > MAX_TCP_WINDOW) {
>> +   if (!val && (val < TCP_MIN_MSS || val > MAX_TCP_WINDOW)) {
>> err = -EINVAL;
>> break;
>
> I believe the sense of the val check is flipped in the proposed patch.
>
> I believe Eric suggested:
>
>   if (val && (val < TCP_MIN_MSS || val > MAX_TCP_WINDOW)) {
>
> Has this been tested?
>
> neal

Sorry, I skipped testing this time because the change was minor.
As a result, I wrote the wrong logic.

Regards
Feng

Re: [PATCH net v2 1/1] net: tcp: Permit user set TCP_MAXSEG to default value

2017-03-20 Thread Neal Cardwell

On Mon, Mar 20, 2017 at 8:45 PM,  wrote:
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 1e319a5..4f7f163 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2470,7 +2470,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
> /* Values greater than interface MTU won't take effect. 
> However
>  * at the point when this call is done we typically don't yet
>  * know which interface is going to be used */
> -   if (val < TCP_MIN_MSS || val > MAX_TCP_WINDOW) {
> +   if (!val && (val < TCP_MIN_MSS || val > MAX_TCP_WINDOW)) {
> err = -EINVAL;
> break;

I believe the sense of the val check is flipped in the proposed patch.

I believe Eric suggested:

  if (val && (val < TCP_MIN_MSS || val > MAX_TCP_WINDOW)) {

Has this been tested?

neal

RE: [PATCH] bna: integer overflow bug in debugfs

2017-03-20 Thread Mody, Rasesh

> From: Dan Carpenter [mailto:dan.carpen...@oracle.com]
> Sent: Friday, March 17, 2017 1:53 PM
> 
> We could allocate less memory than intended because we do:
> 
>   bnad->regdata = kzalloc(len << 2, GFP_KERNEL);
> 
> The shift can overflow leading to a crash.  This is debugfs code so the impact
> is very small.
> 
> Fixes: 7afc5dbde091 ("bna: Add debugfs interface.")
> Signed-off-by: Dan Carpenter 
> 
> diff --git a/drivers/net/ethernet/brocade/bna/bnad_debugfs.c
> b/drivers/net/ethernet/brocade/bna/bnad_debugfs.c
> index 05c1c1dd7751..cebfe3bd086e 100644
> --- a/drivers/net/ethernet/brocade/bna/bnad_debugfs.c
> +++ b/drivers/net/ethernet/brocade/bna/bnad_debugfs.c
> @@ -325,7 +325,7 @@ bnad_debugfs_write_regrd(struct file *file, const
> char __user *buf,
>   return PTR_ERR(kern_buf);
> 
>   rc = sscanf(kern_buf, "%x:%x", , );
> - if (rc < 2) {
> + if (rc < 2 || len > UINT_MAX >> 2) {
>   netdev_warn(bnad->netdev, "failed to read user buffer\n");
>   kfree(kern_buf);
>   return -EINVAL;

You are correct, thanks Dan for adding the check.

Acked-by: Rasesh Mody

Re: [PATCH v2 04/20] ARM: sun8i: dt: Add DT bindings documentation for Allwinner syscon

2017-03-20 Thread Rob Herring

On Tue, Mar 14, 2017 at 03:18:40PM +0100, Corentin Labbe wrote:
> Signed-off-by: Corentin Labbe 
> ---
>  .../devicetree/bindings/misc/allwinner,syscon.txt | 19 
> +++
>  1 file changed, 19 insertions(+)
>  create mode 100644 
> Documentation/devicetree/bindings/misc/allwinner,syscon.txt
> 
> diff --git a/Documentation/devicetree/bindings/misc/allwinner,syscon.txt 
> b/Documentation/devicetree/bindings/misc/allwinner,syscon.txt
> new file mode 100644
> index 000..9f5f1f5
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/misc/allwinner,syscon.txt
> @@ -0,0 +1,19 @@
> +* Allwinner sun8i system controller
> +
> +This file describes the bindings for the system controller present in
> +Allwinner SoC H3, A83T and A64.
> +The principal function of this syscon is to control EMAC PHY choice and
> +config.
> +
> +Required properties for the system controller:
> +- reg: address and length of the register for the device.
> +- compatible: should be "syscon" and one of the following string:
> + "allwinner,sun8i-h3-system-controller"
> + "allwinner,sun8i-a64-system-controller"
> + "allwinner,sun8i-a83t-system-controller"
> +
> +Example:
> +syscon: syscon@01c0 {
> + compatible = "syscon", "allwinner,sun8i-h3-system-controller";

Wrong order of compatibles.

> + reg = <0x01c0 0x1000>;
> +};
> -- 
> 2.10.2
>

Re: [PATCH] enic: Store permanent MAC address during probe()

2017-03-20 Thread Govindarajulu Varadarajan


On Mon, 20 Mar 2017, PJ Waskiewicz wrote:

> From: PJ Waskiewicz 
>
> The permanent MAC address is useful to store for things like ethtool,
> and when bonding with modes such as active/passive or LACP.

Hi Peter,

Is this patch fixing an issue with bonding driver on enic?

> This follows the model of other Ethernet drivers, such as ixgbe.

While other drivers set netdev->perm_addr, doesn't this actually come free in
register_netdevice()?

[PATCH net v2 1/1] net: tcp: Permit user set TCP_MAXSEG to default value

2017-03-20 Thread fgao

From: Gao Feng 

When user_mss is zero, the default value is used. But the current
code does not permit the user to set TCP_MAXSEG to the default value;
it returns -EINVAL when val is zero.

Signed-off-by: Gao Feng 
---
 v2: Make the code clearer, per Eric
 v1: initial version

 net/ipv4/tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1e319a5..4f7f163 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2470,7 +2470,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
/* Values greater than interface MTU won't take effect. However
 * at the point when this call is done we typically don't yet
 * know which interface is going to be used */
-   if (val < TCP_MIN_MSS || val > MAX_TCP_WINDOW) {
+   if (!val && (val < TCP_MIN_MSS || val > MAX_TCP_WINDOW)) {
err = -EINVAL;
break;
}
-- 
1.9.1

Re: [PATCH] enic: Store permanent MAC address during probe()

2017-03-20 Thread PJ Waskiewicz

On Mon, Mar 20, 2017 at 5:33 PM, Govindarajulu Varadarajan
 wrote:
> On Mon, 20 Mar 2017, PJ Waskiewicz wrote:
>
>> From: PJ Waskiewicz 
>>
>> The permanent MAC address is useful to store for things like ethtool,
>> and when bonding with modes such as active/passive or LACP.
>
>
> Hi Peter,
>
> Is this patch fixing an issue with bonding driver on enic?

We noticed that running ethtool -P  on an enic, even on 4.9,
returned nothing.  This has fallout when using bonding, where LACP or
Active/Passive overrides the LAA on one of the slaves, one can't
figure out what the physical MAC address is of each slave.  So not a
problem with bonding directly, it's more secondary as a result of the
driver not reporting the actual permanent address.

>
>> This follows the model of other Ethernet drivers, such as ixgbe.
>>
>
> While other drivers set netdev->perm_addr, doesn't this actually come free
> in
> register_netdevice().

I thought it did as well, but in 4.9 when we tested it wasn't working.
Hence the patch.  :-)

Cheers,
-PJ

Re: PROBLEM: null-ptr deref in ip_options_echo may lead to denial of service

2017-03-20 Thread Ben Hutchings

On Tue, 2017-03-21 at 00:33 +, Ben Hutchings wrote:
> On Sun, 2017-03-19 at 22:25 -0700, Eric Dumazet wrote:
> > On Mon, 2017-03-20 at 12:59 +0800, Anarcheuz Fritz wrote:
> > > Hi David,
> > > 
> > > 
> > > While working on some legacy kernel I stumbled upon a null-ptr deref in
> > > ip_options_echo. The bug has been verified on the latest version
> > > 3.2.87 from the supported long-term branch.
> > > 
> > 
> > Fixed in commit 34b2cef20f19c87999fff3da4071e66937db9644
> > ("ipv4: keep skb->dst around in presence of IP options")
> > 
> > For 3.2, since d826eb14ecef was not backported, following patch should
> > do it.
> > 
> > (Bug origin was f84af32cbca70 ("net: ip_queue_rcv_skb() helper"))
> 
> I see, I thought the vulnerability was introduced by d826eb14ecef.
> 
> > diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
> > index b3648bbef0da..a6e1eeb02267 100644
> > --- a/net/ipv4/ip_sockglue.c
> > +++ b/net/ipv4/ip_sockglue.c
> > @@ -1009,7 +1009,8 @@ e_inval:
> >   */
> >  int ip_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
> >  {
> > -   if (!(inet_sk(sk)->cmsg_flags & IP_CMSG_PKTINFO))
> > +   if (!(inet_sk(sk)->cmsg_flags & IP_CMSG_PKTINFO) &&
> > +   !IPCB(skb)->opt.optlen)
> >     skb_dst_drop(skb);
> >     return sock_queue_rcv_skb(sk, skb);
> >  }
> 
> Thanks to both of you; I'll queue this up for 3.2.

Actually, could I have a Signed-off-by for this, Eric?

Ben.

-- 
Ben Hutchings
Power corrupts.  Absolute power is kind of neat.
   - John Lehman, Secretary of the US Navy
1981-1987




Re: PROBLEM: null-ptr deref in ip_options_echo may lead to denial of service

2017-03-20 Thread Ben Hutchings

On Sun, 2017-03-19 at 22:25 -0700, Eric Dumazet wrote:
> On Mon, 2017-03-20 at 12:59 +0800, Anarcheuz Fritz wrote:
> > Hi David,
> > 
> > 
> > While working on some legacy kernel I stumbled upon a null-ptr deref in
> > ip_options_echo. The bug has been verified on the latest version
> > 3.2.87 from the supported long-term branch.
> > 
> 
> Fixed in commit 34b2cef20f19c87999fff3da4071e66937db9644
> ("ipv4: keep skb->dst around in presence of IP options")
> 
> For 3.2, since d826eb14ecef was not backported, following patch should
> do it.
> 
> (Bug origin was f84af32cbca70 ("net: ip_queue_rcv_skb() helper"))

I see, I thought the vulnerability was introduced by d826eb14ecef.

> diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
> index b3648bbef0da..a6e1eeb02267 100644
> --- a/net/ipv4/ip_sockglue.c
> +++ b/net/ipv4/ip_sockglue.c
> @@ -1009,7 +1009,8 @@ e_inval:
>   */
>  int ip_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
>  {
> - if (!(inet_sk(sk)->cmsg_flags & IP_CMSG_PKTINFO))
> + if (!(inet_sk(sk)->cmsg_flags & IP_CMSG_PKTINFO) &&
> + !IPCB(skb)->opt.optlen)
>   skb_dst_drop(skb);
>   return sock_queue_rcv_skb(sk, skb);
>  }

Thanks to both of you; I'll queue this up for 3.2.

Ben.

-- 
Ben Hutchings
Power corrupts.  Absolute power is kind of neat.
   - John Lehman, Secretary of the US Navy
1981-1987



[PATCH nf v3 1/1] netfilter: snmp: Fix one possible panic when snmp_trap_helper fail to register

2017-03-20 Thread fgao

From: Gao Feng 

In commit 93557f53e1fb ("netfilter: nf_conntrack: nf_conntrack snmp
helper"), the snmp_helper was replaced by nf_nat_snmp_hook, so the
snmp_helper is never registered. But the error handler still tries to
unregister it, which could cause a panic.

Remove the unused snmp_helper and the unregister call in the
error handler.

Fixes: 93557f53e1fb ("netfilter: nf_conntrack: nf_conntrack snmp helper")

Signed-off-by: Gao Feng 
---
 v3: Remove the angle brackets in description, per Sergei
 v2: Add the SHA1 ID in the description, per Sergei
 v1: Initial version

 net/ipv4/netfilter/nf_nat_snmp_basic.c | 14 +-
 1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/net/ipv4/netfilter/nf_nat_snmp_basic.c b/net/ipv4/netfilter/nf_nat_snmp_basic.c
index c9b52c3..5787364 100644
--- a/net/ipv4/netfilter/nf_nat_snmp_basic.c
+++ b/net/ipv4/netfilter/nf_nat_snmp_basic.c
@@ -1260,16 +1260,6 @@ static int help(struct sk_buff *skb, unsigned int protoff,
.timeout= 180,
 };
 
-static struct nf_conntrack_helper snmp_helper __read_mostly = {
-   .me = THIS_MODULE,
-   .help   = help,
-   .expect_policy  = &snmp_exp_policy,
-   .name   = "snmp",
-   .tuple.src.l3num= AF_INET,
-   .tuple.src.u.udp.port   = cpu_to_be16(SNMP_PORT),
-   .tuple.dst.protonum = IPPROTO_UDP,
-};
-
 static struct nf_conntrack_helper snmp_trap_helper __read_mostly = {
.me = THIS_MODULE,
.help   = help,
@@ -1294,10 +1284,8 @@ static int __init nf_nat_snmp_basic_init(void)
RCU_INIT_POINTER(nf_nat_snmp_hook, help);
 
ret = nf_conntrack_helper_register(&snmp_trap_helper);
-   if (ret < 0) {
-   nf_conntrack_helper_unregister(&snmp_helper);
+   if (ret < 0)
return ret;
-   }
return ret;
 }
 
-- 
1.9.1

Re: [PATCH nf v2 1/1] netfilter: snmp: Fix one possible panic when snmp_trap_helper fail to register

2017-03-20 Thread Feng Gao

On Tue, Mar 21, 2017 at 12:35 AM, Sergei Shtylyov
 wrote:
> On 03/20/2017 01:15 PM, Feng Gao wrote:
>
>>>> From: Gao Feng 
>>>>
>>>> In the commit <93557f53e1fb> ("netfilter: nf_conntrack: nf_conntrack
>>>> snmp
>>>
>>>
>>>
>>>Angle brackets not needed. :-)
>>>The commit citing style is the same as for the Fixes: tag.
>>
>>
>> The checkpatch.pl reports the following error, if remove the angle
>> brackets.
>
>
>Because it stops recognizing the commit ID! :-)
>
>> ERROR: Please use git commit description style 'commit <12+ chars of
>> sha1> ("<title line>")' - ie: 'commit fatal: ambig ("evision or path
>> not in the working tree.")'
>
>
>So check the patch in the correct tree because that seems to be the
> problem... Angle brackets are surely not required.

Actually, I didn't add the angle brackets at first, but the patch then
failed the checkpatch.pl check, so I had to add them.

OK, I have removed the angle brackets now and am simply ignoring the
error reported by checkpatch.pl.

Best Regards
Feng

>
>> #7:
>> In the commit 93557f53e1fb ("netfilter: nf_conntrack: nf_conntrack snmp
>>
>> total: 1 errors, 0 warnings, 0 checks, 27 lines checked
>>
>>
>> Regards
>> Feng
>
>
> [...]
>
> MBR, Sergei
>

Re: [PATCH] [v2, -net] cpsw/netcp: cpts depends on posix_timers

2017-03-20 Thread Nicolas Pitre

On Mon, 20 Mar 2017, Arnd Bergmann wrote:

> With posix timers having become optional, we get a build error with
> the cpts time sync option of the CPSW driver:
> 
> drivers/net/ethernet/ti/cpts.c: In function 'cpts_find_ts':
> drivers/net/ethernet/ti/cpts.c:291:23: error: implicit declaration of 
> function 'ptp_classify_raw'; did you mean 'ptp_classifier_init'? 
> [-Werror=implicit-function-declaration]
> 
> This adds a hard dependency on PTP_CLOCK to avoid the problem, as
> building it without PTP support makes no sense anyway.
> 
> Fixes: baa73d9e478f ("posix-timers: Make them configurable")
> Cc: Nicolas Pitre 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Arnd Bergmann 

Acked-by: Nicolas Pitre 

> ---
> v2: use 'depends on' instead of 'select' as suggested by Nico.
> ---
>  drivers/net/ethernet/ti/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/ti/Kconfig b/drivers/net/ethernet/ti/Kconfig
> index d923890a9fda..9e631952b86f 100644
> --- a/drivers/net/ethernet/ti/Kconfig
> +++ b/drivers/net/ethernet/ti/Kconfig
> @@ -76,7 +76,7 @@ config TI_CPSW
>  config TI_CPTS
>   bool "TI Common Platform Time Sync (CPTS) Support"
>   depends on TI_CPSW || TI_KEYSTONE_NETCP
> - imply PTP_1588_CLOCK
> + depends on PTP_1588_CLOCK
>   ---help---
> This driver supports the Common Platform Time Sync unit of
> the CPSW Ethernet Switch and Keystone 2 1g/10g Switch Subsystem.
> -- 
> 2.9.0
> 
>

Re: [Bridge] [PATCH net] bridge: ebtables: fix reception of frames DNAT-ed to bridge device

2017-03-20 Thread Linus Lüssing

On Sun, Mar 19, 2017 at 05:55:06PM +0100, Linus Lüssing wrote:
> On Fri, Mar 17, 2017 at 02:10:44PM +0100, Pablo Neira Ayuso wrote:
> > Wait.
> > 
> > May this break local multicast listener that are bound to the bridge
> > interface? Assuming the bridge interface got an IP address, and that
> > there is local multicast listener.
> > 
> > Missing anything here?
> 
> Hm, for multicast packets usually the code path a few lines
> later in br_handle_frame_finish() should be taken instead.
> 
> But you might be right for IP multicast packets with a unicast MAC
> destination (due to whatever reason, for instance via DNAT'ing
> again).
> 
> Will check that - thanks!

Ok, I tested DNAT'ing an IP multicast packet to the unicast MAC address
of the bridge interface.

Pinging both an IPv4 and an IPv6 multicast listener on br0 worked,
and the replies came back fine, both with and without changing
skb->pkt_type from PACKET_MULTICAST to PACKET_HOST.
("$ ping 224.1.0.123" and "$ ping6 ff02::1:ff40:707c%in0" from a
 network namespace, tied into the bridge via veth)

Also, a DNAT'ed PACKET_BROADCAST worked, with or without changing
it to PACKET_HOST.

I also checked via tcpdump that the destination MAC was changed
successfully.

So, so far I wasn't able to find any bugs with the current
patch. But I think I like the idea of leaving the skb->pkt_type
unaltered for PACKET_MULTICAST and PACKET_BROADCAST, seems cleaner.

I'd just add an "if (skb->pkt_type == PACKET_OTHERHOST)" check
then and resend a PATCH v2.
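
The check proposed above could look roughly like the following. This is
only a user-space sketch, not the actual v2 patch: struct sketch_skb,
sketch_fixup_pkt_type() and the SKETCH_ prefixed names are hypothetical
stand-ins, and the numeric values are assumed to mirror the PACKET_*
constants from <linux/if_packet.h>.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to mirror PACKET_* in <linux/if_packet.h>; prefixed to mark them as local. */
enum sketch_pkt_type {
	SKETCH_PACKET_HOST      = 0,
	SKETCH_PACKET_BROADCAST = 1,
	SKETCH_PACKET_MULTICAST = 2,
	SKETCH_PACKET_OTHERHOST = 3,
};

/* Hypothetical stand-in for the one sk_buff field that matters here. */
struct sketch_skb {
	enum sketch_pkt_type pkt_type;
};

/*
 * Reclassify a frame whose destination MAC was DNAT-ed to the bridge
 * device. Only PACKET_OTHERHOST is rewritten to PACKET_HOST; multicast
 * and broadcast frames keep their original type.
 */
static void sketch_fixup_pkt_type(struct sketch_skb *skb, bool dest_is_bridge_mac)
{
	assert(skb);

	if (dest_is_bridge_mac && skb->pkt_type == SKETCH_PACKET_OTHERHOST)
		skb->pkt_type = SKETCH_PACKET_HOST;
}
```

With this shape, a DNAT-ed multicast or broadcast frame reaches the local
stack with its pkt_type unchanged, which matches the "seems cleaner"
variant described above.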

[net-next 07/13] i40e: exit ATR mode only when adding TCP/IPv4 filter succeeds

2017-03-20 Thread Jeff Kirsher

From: Jacob Keller 

Move ATR exit check after we have sent the TCP/IPv4 filter to the ring
successfully. This avoids an issue where we potentially update the
filter count without actually succeeding in adding the filter. Now, we
only increment the fd_tcp_rule after we've succeeded. Additionally, we
will re-enable ATR mode only after deletion of the filter is actually
posted to the FDIR ring.
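
The ordering described above can be illustrated with a reduced user-space
sketch. It is not the driver code: struct sketch_pf, its fields and
sketch_add_del_tcp4() are hypothetical, and program_ret merely models the
return value of i40e_program_fdir_filter().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-in for the driver's struct i40e_pf. */
struct sketch_pf {
	unsigned int fd_tcp_rule;  /* number of programmed TCP/IPv4 sideband filters */
	bool atr_forced_off;       /* models the ATR bit in hw_disabled_flags */
};

/*
 * program_ret models the result of posting the filter to the FDIR ring:
 * zero on success, nonzero on failure.
 */
static int sketch_add_del_tcp4(struct sketch_pf *pf, bool add, int program_ret)
{
	assert(pf);

	/* Programming failed: leave the count and the ATR state untouched. */
	if (program_ret)
		return program_ret;

	if (add) {
		pf->fd_tcp_rule++;
		pf->atr_forced_off = true;
	} else {
		if (pf->fd_tcp_rule > 0)
			pf->fd_tcp_rule--;
		if (pf->fd_tcp_rule == 0)
			pf->atr_forced_off = false;
	}
	return 0;
}
```

The point of the sketch is only the early return: the bookkeeping runs
after the programming step has reported success, never before it.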

Change-ID: If5c1dea422081cc5e2de65618b01b4c3bf6bd586
Signed-off-by: Jacob Keller 
Reviewed-by: Mitch Williams 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 34 ++---
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 005257b4f218..05f3d0d5a004 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -284,23 +284,6 @@ static int i40e_add_del_fdir_tcpv4(struct i40e_vsi *vsi,
ip->saddr = fd_data->src_ip;
tcp->source = fd_data->src_port;
 
-   if (add) {
-   pf->fd_tcp_rule++;
-   if ((pf->flags & I40E_FLAG_FD_ATR_ENABLED) &&
-   I40E_DEBUG_FD & pf->hw.debug_mask)
-   dev_info(&pf->pdev->dev, "Forcing ATR off, sideband rules for TCP/IPv4 flow being applied\n");
-   pf->hw_disabled_flags |= I40E_FLAG_FD_ATR_ENABLED;
-   } else {
-   pf->fd_tcp_rule = (pf->fd_tcp_rule > 0) ?
- (pf->fd_tcp_rule - 1) : 0;
-   if (pf->fd_tcp_rule == 0) {
-   if ((pf->flags & I40E_FLAG_FD_ATR_ENABLED) &&
-   I40E_DEBUG_FD & pf->hw.debug_mask)
-   dev_info(&pf->pdev->dev, "ATR re-enabled due to no sideband TCP/IPv4 rules\n");
-   pf->hw_disabled_flags &= ~I40E_FLAG_FD_ATR_ENABLED;
-   }
-   }
-
fd_data->pctype = I40E_FILTER_PCTYPE_NONF_IPV4_TCP;
ret = i40e_program_fdir_filter(fd_data, raw_packet, pf, add);
if (ret) {
@@ -320,6 +303,23 @@ static int i40e_add_del_fdir_tcpv4(struct i40e_vsi *vsi,
 fd_data->pctype, fd_data->fd_id);
}
 
+   if (add) {
+   pf->fd_tcp_rule++;
+   if ((pf->flags & I40E_FLAG_FD_ATR_ENABLED) &&
+   I40E_DEBUG_FD & pf->hw.debug_mask)
+   dev_info(&pf->pdev->dev, "Forcing ATR off, sideband rules for TCP/IPv4 flow being applied\n");
+   pf->hw_disabled_flags |= I40E_FLAG_FD_ATR_ENABLED;
+   } else {
+   pf->fd_tcp_rule = (pf->fd_tcp_rule > 0) ?
+ (pf->fd_tcp_rule - 1) : 0;
+   if (pf->fd_tcp_rule == 0) {
+   if ((pf->flags & I40E_FLAG_FD_ATR_ENABLED) &&
+   I40E_DEBUG_FD & pf->hw.debug_mask)
+   dev_info(&pf->pdev->dev, "ATR re-enabled due to no sideband TCP/IPv4 rules\n");
+   pf->hw_disabled_flags &= ~I40E_FLAG_FD_ATR_ENABLED;
+   }
+   }
+
return 0;
 }
 
-- 
2.12.0

[net-next 09/13] i40e: reset fd_tcp_rule count when restoring filters

2017-03-20 Thread Jeff Kirsher

From: Jacob Keller 

Since we're about to reprogram the filters, we need to ensure that the
fd_tcp_rule count is correctly reset to 0. Otherwise, we will keep
a stale count that does not accurately reflect the number of programmed
TCPv4 filters.
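
As a rough illustration of why the reset matters, here is a user-space
sketch of a replay loop. The types and sketch_filter_restore() are
hypothetical; the real function walks an hlist and calls
i40e_add_del_fdir() for each entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-ins for the driver structures. */
struct sketch_pf {
	unsigned int fd_tcp_rule;
};

struct sketch_filter {
	bool is_tcp4;
};

/*
 * Replay every stored filter. Re-adding a TCP/IPv4 filter increments the
 * counter again, so the counter has to start from zero or each replay
 * would add the stored filters on top of the old value.
 */
static void sketch_filter_restore(struct sketch_pf *pf,
				  const struct sketch_filter *filters, size_t n)
{
	size_t i;

	assert(pf && (filters || n == 0));

	pf->fd_tcp_rule = 0;
	for (i = 0; i < n; i++)
		if (filters[i].is_tcp4)
			pf->fd_tcp_rule++;
}
```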

Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 221e1705c031..437b79eeb8b5 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3283,6 +3283,9 @@ static void i40e_fdir_filter_restore(struct i40e_vsi *vsi)
if (!(pf->flags & I40E_FLAG_FD_SB_ENABLED))
return;
 
+   /* Reset FDir counters as we're replaying all existing filters */
+   pf->fd_tcp_rule = 0;
+
hlist_for_each_entry_safe(filter, node,
  &pf->fdir_filter_list, fdir_node) {
i40e_add_del_fdir(vsi, filter, true);
-- 
2.12.0

[net-next 00/13][pull request] 40GbE Intel Wired LAN Driver Updates 2017-03-20

2017-03-20 Thread Jeff Kirsher

This series contains updates to i40e and i40evf only.

Philippe Reynes updates i40e and i40evf to use the new ethtool API for
{get|set}_link_ksettings.

Jake provides the remaining patches in the series, starting with a fix
for i40e where the firmware expected the port numbers for the offloaded
UDP tunnels in Little Endian format and we were sending them in Big Endian
format, which caused the wrong port number to be put in the UDP tunnel
list.  Changed the driver to use __be32 values instead of arrays for
(src|dst)_ip.  Refactored the exit flow of i40e_add_fdir_ethtool(), which
removes the dependency on having a non-zero return value.  Fixed a memory
leak by running kfree() and returning immediately when we fail to add a
flow director filter.  Fixed a potential issue where we could update the
filter count without actually succeeding in adding a filter, by moving
the ATR exit check to after we have sent the TCP/IPv4 filter to the ring
successfully.  Ensured that the fd_tcp_rule count is reset to 0 before
we reprogram the filters, so that we do not end up with a stale count
which does not correctly reflect the number of programmed filters.  Added
a check for whether we have TCP/IPv4 filters before re-enabling ATR after
flushing and replaying FDIR filters.  Added counters for each filter
type in preparation for adding code to properly check the mask value.
Fixed potential issues by explicitly checking the flow type at the
start of i40e_add_fdir_ethtool().  To avoid possible memory leaks,
we now unconditionally delete the old filter, even if it is identical to
the new filter, which also ensures we always update the filters as
expected.

The following are changes since commit fe723dff0fa4181ddb8116e72bc67d00d4239cb6:
  liquidio: fix wrong information about link modes reported to ethtool
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 40GbE

Jacob Keller (11):
  i40e: send correct port number to AdminQ when enabling UDP tunnels
  i40e: don't use arrays for (src|dst)_ip
  i40e: rework exit flow of i40e_add_fdir_ethtool
  i40e: return immediately when failing to add fdir filter
  i40e: exit ATR mode only when adding TCP/IPv4 filter succeeds
  i40e: remove redundant check for fd_tcp_rule when restoring filters
  i40e: reset fd_tcp_rule count when restoring filters
  i40e: don't re-enable ATR when flushing filters if SB has TCP4/IPv4 rules
  i40e: add counters for UDP/IPv4 and IPv4 filters
  i40e: explicitly fail on extended MAC field for ethtool_rx_flow_spec
  i40e: always remove old filter when adding new FDir filter

Philippe Reynes (2):
  i40e: use new api ethtool_{get|set}_link_ksettings
  i40evf: use new api ethtool_{get|set}_link_ksettings

 drivers/net/ethernet/intel/i40e/i40e.h |  16 +-
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 323 -
 drivers/net/ethernet/intel/i40e/i40e_main.c|  43 +--
 drivers/net/ethernet/intel/i40e/i40e_txrx.c|  84 +++---
 drivers/net/ethernet/intel/i40evf/i40evf_ethtool.c |  31 +-
 5 files changed, 273 insertions(+), 224 deletions(-)

-- 
2.12.0

[net-next 01/13] i40e: use new api ethtool_{get|set}_link_ksettings

2017-03-20 Thread Jeff Kirsher

From: Philippe Reynes 

The ethtool API {get|set}_settings is deprecated.
We move this driver to the new API {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone could test this patch.

Signed-off-by: Philippe Reynes 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 264 ++---
 1 file changed, 153 insertions(+), 111 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index a933c6c2aff8..ceb57ad59e8f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -387,7 +387,7 @@ static void i40e_phy_type_to_ethtool(struct i40e_pf *pf, u32 *supported,
  *
  **/
 static void i40e_get_settings_link_up(struct i40e_hw *hw,
- struct ethtool_cmd *ecmd,
+ struct ethtool_link_ksettings *cmd,
  struct net_device *netdev,
  struct i40e_pf *pf)
 {
@@ -395,90 +395,96 @@ static void i40e_get_settings_link_up(struct i40e_hw *hw,
u32 link_speed = hw_link_info->link_speed;
u32 e_advertising = 0x0;
u32 e_supported = 0x0;
+   u32 supported, advertising;
+
+   ethtool_convert_link_mode_to_legacy_u32(&supported,
+   cmd->link_modes.supported);
+   ethtool_convert_link_mode_to_legacy_u32(&advertising,
+   cmd->link_modes.advertising);
 
/* Initialize supported and advertised settings based on phy settings */
switch (hw_link_info->phy_type) {
case I40E_PHY_TYPE_40GBASE_CR4:
case I40E_PHY_TYPE_40GBASE_CR4_CU:
-   ecmd->supported = SUPPORTED_Autoneg |
- SUPPORTED_40000baseCR4_Full;
-   ecmd->advertising = ADVERTISED_Autoneg |
-   ADVERTISED_40000baseCR4_Full;
+   supported = SUPPORTED_Autoneg |
+   SUPPORTED_40000baseCR4_Full;
+   advertising = ADVERTISED_Autoneg |
+ ADVERTISED_40000baseCR4_Full;
break;
case I40E_PHY_TYPE_XLAUI:
case I40E_PHY_TYPE_XLPPI:
case I40E_PHY_TYPE_40GBASE_AOC:
-   ecmd->supported = SUPPORTED_40000baseCR4_Full;
+   supported = SUPPORTED_40000baseCR4_Full;
break;
case I40E_PHY_TYPE_40GBASE_SR4:
-   ecmd->supported = SUPPORTED_40000baseSR4_Full;
+   supported = SUPPORTED_40000baseSR4_Full;
break;
case I40E_PHY_TYPE_40GBASE_LR4:
-   ecmd->supported = SUPPORTED_40000baseLR4_Full;
+   supported = SUPPORTED_40000baseLR4_Full;
break;
case I40E_PHY_TYPE_10GBASE_SR:
case I40E_PHY_TYPE_10GBASE_LR:
case I40E_PHY_TYPE_1000BASE_SX:
case I40E_PHY_TYPE_1000BASE_LX:
-   ecmd->supported = SUPPORTED_10000baseT_Full;
+   supported = SUPPORTED_10000baseT_Full;
if (hw_link_info->module_type[2] &
I40E_MODULE_TYPE_1000BASE_SX ||
hw_link_info->module_type[2] &
I40E_MODULE_TYPE_1000BASE_LX) {
-   ecmd->supported |= SUPPORTED_1000baseT_Full;
+   supported |= SUPPORTED_1000baseT_Full;
if (hw_link_info->requested_speeds &
I40E_LINK_SPEED_1GB)
-   ecmd->advertising |= ADVERTISED_1000baseT_Full;
+   advertising |= ADVERTISED_1000baseT_Full;
}
if (hw_link_info->requested_speeds & I40E_LINK_SPEED_10GB)
-   ecmd->advertising |= ADVERTISED_10000baseT_Full;
+   advertising |= ADVERTISED_10000baseT_Full;
break;
case I40E_PHY_TYPE_10GBASE_T:
case I40E_PHY_TYPE_1000BASE_T:
case I40E_PHY_TYPE_100BASE_TX:
-   ecmd->supported = SUPPORTED_Autoneg |
- SUPPORTED_10000baseT_Full |
- SUPPORTED_1000baseT_Full |
- SUPPORTED_100baseT_Full;
-   ecmd->advertising = ADVERTISED_Autoneg;
+   supported = SUPPORTED_Autoneg |
+   SUPPORTED_10000baseT_Full |
+   SUPPORTED_1000baseT_Full |
+   SUPPORTED_100baseT_Full;
+   advertising = ADVERTISED_Autoneg;
if (hw_link_info->requested_speeds & I40E_LINK_SPEED_10GB)
-   ecmd->advertising |= ADVERTISED_10000baseT_Full;
+   advertising |=

[net-next 03/13] i40e: send correct port number to AdminQ when enabling UDP tunnels

2017-03-20 Thread Jeff Kirsher

From: Jacob Keller 

The firmware expects the port numbers for offloaded UDP tunnels in
Little Endian format. We accidentally sent the value in Big Endian
format which obviously will cause the wrong port number to be put into
the UDP tunnels list. This results in VxLAN and Geneve tunnel Rx
offloads being essentially disabled, unless the port number happens to
be identical after byte swapping. Note that i40e_aq_add_udp_tunnel()
will byteswap the parameter from host order into Little Endian so we
don't need to worry about passing strictly a __le16 value to the command.

This patch essentially reverts b3f5c7bc88ba ("i40e: Fix for extra byte
swap in tunnel setup", 2016-08-24), but in a way that makes the result
much more clear to the reader.
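
The byte-order flow described above can be sketched in user space as
follows. This is an illustration only: sketch_tunnel_port() and
sketch_put_le16() are hypothetical helpers, the latter standing in for
the host-to-Little-Endian conversion that the commit message attributes
to i40e_aq_add_udp_tunnel().

```c
#include <arpa/inet.h>  /* htons(), ntohs() */
#include <assert.h>
#include <stdint.h>

/* The stack hands the driver the port in network (Big Endian) order. */
static uint16_t sketch_tunnel_port(uint16_t be_port)
{
	/* Convert once, up front, and keep a plain host-order value. */
	return ntohs(be_port);
}

/*
 * Hypothetical stand-in for the conversion done when the command is
 * built: serialize a host-order value as Little Endian bytes. This is
 * correct regardless of the host's own endianness.
 */
static void sketch_put_le16(uint8_t out[2], uint16_t host_val)
{
	assert(out);

	out[0] = (uint8_t)(host_val & 0xff);
	out[1] = (uint8_t)(host_val >> 8);
}
```

For the default VxLAN port 4789 (0x12B5), the bytes placed in the
command would be 0xB5 followed by 0x12. Passing the Big Endian value
straight into the conversion step is the kind of mistake that yields a
byte-swapped port on a Little Endian host.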

Fixes: b3f5c7bc88ba ("i40e: Fix for extra byte swap in tunnel setup", 2016-08-24)
Signed-off-by: Jacob Keller 
Reviewed-by: Williams, Mitch A 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h  |  3 ++-
 drivers/net/ethernet/intel/i40e/i40e_main.c | 17 -
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index a5cf5d11d0e7..fba8495a8787 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -244,7 +244,8 @@ struct i40e_tc_configuration {
 };
 
 struct i40e_udp_port_config {
-   __be16 index;
+   /* AdminQ command interface expects port number in Host byte order */
+   u16 index;
u8 type;
 };
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 9df0d86812e7..6e63459ceb65 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -7353,7 +7353,7 @@ static void i40e_sync_udp_filters_subtask(struct i40e_pf *pf)
 {
struct i40e_hw *hw = &pf->hw;
i40e_status ret;
-   __be16 port;
+   u16 port;
int i;
 
if (!(pf->flags & I40E_FLAG_UDP_FILTER_SYNC))
@@ -7377,7 +7377,7 @@ static void i40e_sync_udp_filters_subtask(struct i40e_pf *pf)
"%s %s port %d, index %d failed, err %s aq_err %s\n",
pf->udp_ports[i].type ? "vxlan" : "geneve",
port ? "add" : "delete",
-   ntohs(port), i,
+   port, i,
i40e_stat_str(&pf->hw, ret),
i40e_aq_str(&pf->hw,
pf->hw.aq.asq_last_status));
@@ -9014,7 +9014,7 @@ static int i40e_set_features(struct net_device *netdev,
  *
  * Returns the index number or I40E_MAX_PF_UDP_OFFLOAD_PORTS if port not found
  **/
-static u8 i40e_get_udp_port_idx(struct i40e_pf *pf, __be16 port)
+static u8 i40e_get_udp_port_idx(struct i40e_pf *pf, u16 port)
 {
u8 i;
 
@@ -9037,7 +9037,7 @@ static void i40e_udp_tunnel_add(struct net_device *netdev,
struct i40e_netdev_priv *np = netdev_priv(netdev);
struct i40e_vsi *vsi = np->vsi;
struct i40e_pf *pf = vsi->back;
-   __be16 port = ti->port;
+   u16 port = ntohs(ti->port);
u8 next_idx;
u8 idx;
 
@@ -9045,8 +9045,7 @@ static void i40e_udp_tunnel_add(struct net_device *netdev,
 
/* Check if port already exists */
if (idx < I40E_MAX_PF_UDP_OFFLOAD_PORTS) {
-   netdev_info(netdev, "port %d already offloaded\n",
-   ntohs(port));
+   netdev_info(netdev, "port %d already offloaded\n", port);
return;
}
 
@@ -9055,7 +9054,7 @@ static void i40e_udp_tunnel_add(struct net_device *netdev,
 
if (next_idx == I40E_MAX_PF_UDP_OFFLOAD_PORTS) {
netdev_info(netdev, "maximum number of offloaded UDP ports reached, not adding port %d\n",
-   ntohs(port));
+   port);
return;
}
 
@@ -9089,7 +9088,7 @@ static void i40e_udp_tunnel_del(struct net_device *netdev,
struct i40e_netdev_priv *np = netdev_priv(netdev);
struct i40e_vsi *vsi = np->vsi;
struct i40e_pf *pf = vsi->back;
-   __be16 port = ti->port;
+   u16 port = ntohs(ti->port);
u8 idx;
 
idx = i40e_get_udp_port_idx(pf, port);
@@ -9121,7 +9120,7 @@ static void i40e_udp_tunnel_del(struct net_device *netdev,
return;
 not_found:
netdev_warn(netdev, "UDP port %d was not found, not deleting\n",
-   ntohs(port));
+   port);
 }
 
 static int i40e_get_phys_port_id(struct net_device *netdev,
-- 
2.12.0

[net-next 11/13] i40e: add counters for UDP/IPv4 and IPv4 filters

2017-03-20 Thread Jeff Kirsher

From: Jacob Keller 

In preparation for adding code to properly check the mask values, we
will need to know the number of active filters for each type. Add
counters for each filter type. Rename the already existing fd_tcp_rule
to fd_tcp4_filter_cnt to match the style of other names. To avoid style
warnings, avoid assigning multiple parameters at once, and fix up one
other case where we did so previously.

Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h  |  9 -
 drivers/net/ethernet/intel/i40e/i40e_main.c | 19 +--
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 17 +
 3 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 55ea1c0221e6..c0f2286c2b72 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -286,7 +286,14 @@ struct i40e_pf {
u32 fd_flush_cnt;
u32 fd_add_err;
u32 fd_atr_cnt;
-   u32 fd_tcp_rule;
+
+   /* Book-keeping of side-band filter count per flow-type.
+* This is used to detect and handle input set changes for
+* respective flow-type.
+*/
+   u16 fd_tcp4_filter_cnt;
+   u16 fd_udp4_filter_cnt;
+   u16 fd_ip4_filter_cnt;
 
struct i40e_udp_port_config udp_ports[I40E_MAX_PF_UDP_OFFLOAD_PORTS];
u16 pending_udp_bitmap;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index cc33ac835181..caccb8e97f1b 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3284,7 +3284,9 @@ static void i40e_fdir_filter_restore(struct i40e_vsi *vsi)
return;
 
/* Reset FDir counters as we're replaying all existing filters */
-   pf->fd_tcp_rule = 0;
+   pf->fd_tcp4_filter_cnt = 0;
+   pf->fd_udp4_filter_cnt = 0;
+   pf->fd_ip4_filter_cnt = 0;
 
hlist_for_each_entry_safe(filter, node,
  &pf->fdir_filter_list, fdir_node) {
@@ -5468,7 +5470,8 @@ static int i40e_up_complete(struct i40e_vsi *vsi)
/* replay FDIR SB filters */
if (vsi->type == I40E_VSI_FDIR) {
/* reset fd counters */
-   pf->fd_add_err = pf->fd_atr_cnt = 0;
+   pf->fd_add_err = 0;
+   pf->fd_atr_cnt = 0;
i40e_fdir_filter_restore(vsi);
}
 
@@ -5751,7 +5754,11 @@ static void i40e_fdir_filter_exit(struct i40e_pf *pf)
hlist_del(&filter->fdir_node);
kfree(filter);
}
+
pf->fdir_pf_active_filters = 0;
+   pf->fd_tcp4_filter_cnt = 0;
+   pf->fd_udp4_filter_cnt = 0;
+   pf->fd_ip4_filter_cnt = 0;
 }
 
 /**
@@ -6156,7 +6163,7 @@ void i40e_fdir_check_and_reenable(struct i40e_pf *pf)
if (fcnt_prog < (fcnt_avail - I40E_FDIR_BUFFER_HEAD_ROOM * 2)) {
if ((pf->flags & I40E_FLAG_FD_ATR_ENABLED) &&
(pf->hw_disabled_flags & I40E_FLAG_FD_ATR_ENABLED) &&
-   (pf->fd_tcp_rule == 0)) {
+   (pf->fd_tcp4_filter_cnt == 0)) {
pf->hw_disabled_flags &= ~I40E_FLAG_FD_ATR_ENABLED;
if (I40E_DEBUG_FD & pf->hw.debug_mask)
dev_info(&pf->pdev->dev, "ATR is being enabled since we have space in the table and there are no conflicting ntuple rules\n");
@@ -6228,7 +6235,7 @@ static void i40e_fdir_flush_and_replay(struct i40e_pf *pf)
} else {
/* replay sideband filters */
i40e_fdir_filter_restore(pf->vsi[pf->lan_vsi]);
-   if (!disable_atr && !pf->fd_tcp_rule)
+   if (!disable_atr && !pf->fd_tcp4_filter_cnt)
pf->hw_disabled_flags &= ~I40E_FLAG_FD_ATR_ENABLED;
clear_bit(__I40E_FD_FLUSH_REQUESTED, &pf->state);
if (I40E_DEBUG_FD & pf->hw.debug_mask)
@@ -8937,8 +8944,8 @@ bool i40e_set_ntuple(struct i40e_pf *pf, netdev_features_t features)
pf->flags &= ~I40E_FLAG_FD_SB_ENABLED;
pf->hw_disabled_flags &= ~I40E_FLAG_FD_SB_ENABLED;
/* reset fd counters */
-   pf->fd_add_err = pf->fd_atr_cnt = pf->fd_tcp_rule = 0;
-   pf->fdir_pf_active_filters = 0;
+   pf->fd_add_err = 0;
+   pf->fd_atr_cnt = 0;
/* if ATR was auto disabled it can be re-enabled. */
if ((pf->flags & I40E_FLAG_FD_ATR_ENABLED) &&
(pf->hw_disabled_flags & I40E_FLAG_FD_ATR_ENABLED)) {
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 05f3d0d5a004..3880e417f167 100644
---

[net-next 02/13] i40evf: use new api ethtool_{get|set}_link_ksettings

2017-03-20 Thread Jeff Kirsher

From: Philippe Reynes 

The ethtool API {get|set}_settings is deprecated.
We move this driver to the new API {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone could test this patch.

Signed-off-by: Philippe Reynes 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40evf/i40evf_ethtool.c | 31 +++---
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_ethtool.c b/drivers/net/ethernet/intel/i40evf/i40evf_ethtool.c
index 272d600c1ed0..122efbd29a19 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_ethtool.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_ethtool.c
@@ -64,51 +64,50 @@ static const struct i40evf_stats i40evf_gstrings_stats[] = {
(I40EVF_GLOBAL_STATS_LEN + I40EVF_QUEUE_STATS_LEN(_dev))
 
 /**
- * i40evf_get_settings - Get Link Speed and Duplex settings
+ * i40evf_get_link_ksettings - Get Link Speed and Duplex settings
  * @netdev: network interface device structure
- * @ecmd: ethtool command
+ * @cmd: ethtool command
  *
  * Reports speed/duplex settings. Because this is a VF, we don't know what
  * kind of link we really have, so we fake it.
  **/
-static int i40evf_get_settings(struct net_device *netdev,
-  struct ethtool_cmd *ecmd)
+static int i40evf_get_link_ksettings(struct net_device *netdev,
+struct ethtool_link_ksettings *cmd)
 {
struct i40evf_adapter *adapter = netdev_priv(netdev);
 
-   ecmd->supported = 0;
-   ecmd->autoneg = AUTONEG_DISABLE;
-   ecmd->transceiver = XCVR_DUMMY1;
-   ecmd->port = PORT_NONE;
+   ethtool_link_ksettings_zero_link_mode(cmd, supported);
+   cmd->base.autoneg = AUTONEG_DISABLE;
+   cmd->base.port = PORT_NONE;
/* Set speed and duplex */
switch (adapter->link_speed) {
case I40E_LINK_SPEED_40GB:
-   ethtool_cmd_speed_set(ecmd, SPEED_40000);
+   cmd->base.speed = SPEED_40000;
break;
case I40E_LINK_SPEED_25GB:
 #ifdef SPEED_25000
-   ethtool_cmd_speed_set(ecmd, SPEED_25000);
+   cmd->base.speed = SPEED_25000;
 #else
netdev_info(netdev,
"Speed is 25G, display not supported by this version of ethtool.\n");
 #endif
break;
case I40E_LINK_SPEED_20GB:
-   ethtool_cmd_speed_set(ecmd, SPEED_20000);
+   cmd->base.speed = SPEED_20000;
break;
case I40E_LINK_SPEED_10GB:
-   ethtool_cmd_speed_set(ecmd, SPEED_10000);
+   cmd->base.speed = SPEED_10000;
break;
case I40E_LINK_SPEED_1GB:
-   ethtool_cmd_speed_set(ecmd, SPEED_1000);
+   cmd->base.speed = SPEED_1000;
break;
case I40E_LINK_SPEED_100MB:
-   ethtool_cmd_speed_set(ecmd, SPEED_100);
+   cmd->base.speed = SPEED_100;
break;
default:
break;
}
-   ecmd->duplex = DUPLEX_FULL;
+   cmd->base.duplex = DUPLEX_FULL;
 
return 0;
 }
@@ -643,7 +642,6 @@ static int i40evf_set_rxfh(struct net_device *netdev, const u32 *indir,
 }
 
 static const struct ethtool_ops i40evf_ethtool_ops = {
-   .get_settings   = i40evf_get_settings,
.get_drvinfo= i40evf_get_drvinfo,
.get_link   = ethtool_op_get_link,
.get_ringparam  = i40evf_get_ringparam,
@@ -663,6 +661,7 @@ static const struct ethtool_ops i40evf_ethtool_ops = {
.set_rxfh   = i40evf_set_rxfh,
.get_channels   = i40evf_get_channels,
.get_rxfh_key_size  = i40evf_get_rxfh_key_size,
+   .get_link_ksettings = i40evf_get_link_ksettings,
 };
 
 /**
-- 
2.12.0

[net-next 04/13] i40e: don't use arrays for (src|dst)_ip

2017-03-20 Thread Jeff Kirsher

From: Jacob Keller 

The code originally included src_ip and dst_ip with enough space to
support ipv6 filters. However, no actual support for ipv6 filters has
been implemented. Thus, remove the arrays and just use __be32 values.
Should ipv6 support be added in the future, we can replace these with
a union that has sizes for both values.
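
One possible shape for such a union is sketched below, purely as an
illustration of the idea mentioned above. Nothing here is from the
driver: union sketch_fdir_ip, struct sketch_fdir_filter and
sketch_same_dst() are hypothetical, and uint32_t stands in for the
kernel's __be32.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum sketch_addr_family { SKETCH_V4, SKETCH_V6 };

/* Storage sized for the larger member; v4 overlays the first v6 word. */
union sketch_fdir_ip {
	uint32_t v4;     /* one 32-bit word, __be32 in the kernel */
	uint32_t v6[4];  /* four 32-bit words */
};

struct sketch_fdir_filter {
	enum sketch_addr_family family;
	union sketch_fdir_ip dst_ip;
	union sketch_fdir_ip src_ip;
};

/* Compare destination addresses, looking only at the active member. */
static int sketch_same_dst(const struct sketch_fdir_filter *a,
			   const struct sketch_fdir_filter *b)
{
	assert(a && b);

	if (a->family != b->family)
		return 0;
	if (a->family == SKETCH_V4)
		return a->dst_ip.v4 == b->dst_ip.v4;
	return memcmp(a->dst_ip.v6, b->dst_ip.v6, sizeof(a->dst_ip.v6)) == 0;
}
```

A family tag (or the existing flow_type field) would be needed so that
readers of the union know which member is valid.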

Change-Id: I1bc04032244a80eb6ebc8a4e6c723a4a665c1dd5
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h |  4 ++--
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 12 ++--
 drivers/net/ethernet/intel/i40e/i40e_txrx.c| 12 ++--
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index fba8495a8787..55ea1c0221e6 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -208,8 +208,8 @@ struct i40e_fdir_filter {
u8 flow_type;
u8 ip4_proto;
/* TX packet view of src and dst */
-   __be32 dst_ip[4];
-   __be32 src_ip[4];
+   __be32 dst_ip;
+   __be32 src_ip;
__be16 src_port;
__be16 dst_port;
__be32 sctp_v_tag;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index ceb57ad59e8f..7a22b473dbdd 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -2406,8 +2406,8 @@ static int i40e_get_ethtool_fdir_entry(struct i40e_pf *pf,
 */
fsp->h_u.tcp_ip4_spec.psrc = rule->dst_port;
fsp->h_u.tcp_ip4_spec.pdst = rule->src_port;
-   fsp->h_u.tcp_ip4_spec.ip4src = rule->dst_ip[0];
-   fsp->h_u.tcp_ip4_spec.ip4dst = rule->src_ip[0];
+   fsp->h_u.tcp_ip4_spec.ip4src = rule->dst_ip;
+   fsp->h_u.tcp_ip4_spec.ip4dst = rule->src_ip;
 
if (rule->dest_ctl == I40E_FILTER_PROGRAM_DESC_DEST_DROP_PACKET)
fsp->ring_cookie = RX_CLS_FLOW_DISC;
@@ -2630,8 +2630,8 @@ static int i40e_set_rss_hash_opt(struct i40e_pf *pf, struct ethtool_rxnfc *nfc)
 static bool i40e_match_fdir_input_set(struct i40e_fdir_filter *rule,
  struct i40e_fdir_filter *input)
 {
-   if ((rule->dst_ip[0] != input->dst_ip[0]) ||
-   (rule->src_ip[0] != input->src_ip[0]) ||
+   if ((rule->dst_ip != input->dst_ip) ||
+   (rule->src_ip != input->src_ip) ||
(rule->dst_port != input->dst_port) ||
(rule->src_port != input->src_port))
return false;
@@ -2807,8 +2807,8 @@ static int i40e_add_fdir_ethtool(struct i40e_vsi *vsi,
 */
input->dst_port = fsp->h_u.tcp_ip4_spec.psrc;
input->src_port = fsp->h_u.tcp_ip4_spec.pdst;
-   input->dst_ip[0] = fsp->h_u.tcp_ip4_spec.ip4src;
-   input->src_ip[0] = fsp->h_u.tcp_ip4_spec.ip4dst;
+   input->dst_ip = fsp->h_u.tcp_ip4_spec.ip4src;
+   input->src_ip = fsp->h_u.tcp_ip4_spec.ip4dst;
 
if (ntohl(fsp->m_ext.data[1])) {
vf_id = ntohl(fsp->h_ext.data[1]);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 6eb5dc4168f3..c4d3a40a3f10 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -219,9 +219,9 @@ static int i40e_add_del_fdir_udpv4(struct i40e_vsi *vsi,
udp = (struct udphdr *)(raw_packet + IP_HEADER_OFFSET
  + sizeof(struct iphdr));
 
-   ip->daddr = fd_data->dst_ip[0];
+   ip->daddr = fd_data->dst_ip;
udp->dest = fd_data->dst_port;
-   ip->saddr = fd_data->src_ip[0];
+   ip->saddr = fd_data->src_ip;
udp->source = fd_data->src_port;
 
fd_data->pctype = I40E_FILTER_PCTYPE_NONF_IPV4_UDP;
@@ -281,9 +281,9 @@ static int i40e_add_del_fdir_tcpv4(struct i40e_vsi *vsi,
tcp = (struct tcphdr *)(raw_packet + IP_HEADER_OFFSET
  + sizeof(struct iphdr));
 
-   ip->daddr = fd_data->dst_ip[0];
+   ip->daddr = fd_data->dst_ip;
tcp->dest = fd_data->dst_port;
-   ip->saddr = fd_data->src_ip[0];
+   ip->saddr = fd_data->src_ip;
tcp->source = fd_data->src_port;
 
if (add) {
@@ -359,8 +359,8 @@ static int i40e_add_del_fdir_ipv4(struct i40e_vsi *vsi,
memcpy(raw_packet, packet, I40E_IP_DUMMY_PACKET_LEN);
ip = (struct iphdr *)(raw_packet + IP_HEADER_OFFSET);
 
-   ip->saddr = fd_data->src_ip[0];
-   ip->daddr = fd_data->dst_ip[0];
+   ip->saddr = fd_data->src_ip;
+   ip->daddr = fd_data->dst_ip;
ip->protocol = 0;
 
fd_data->pctype = i;
-- 
2.12.0

[net-next 13/13] i40e: always remove old filter when adding new FDir filter

2017-03-20 Thread Jeff Kirsher

From: Jacob Keller 

The previous code relied on i40e_match_fdir_input_set to determine
whether to free the old filter. Change this code so that we
simply unconditionally delete the old filter, even if it's identical to
the new filter. This ensures that we don't leak any memory, and that we
always update the filters as expected.

Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
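
The replace-or-delete rule described above can be sketched in standalone C.
This is only an illustration under stated assumptions: demo_table,
demo_filter, demo_filter_new and demo_update_entry are hypothetical names,
a fixed array stands in for the driver's hlist, and no hardware is
programmed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_SLOTS 4

/* Hypothetical stand-in for a filter; not the driver's structure. */
struct demo_filter {
	unsigned int src_ip;
	unsigned int dst_ip;
};

struct demo_table {
	struct demo_filter *slot[DEMO_SLOTS];
	int active;
};

/* Allocate a filter; the table takes ownership once it is installed. */
static struct demo_filter *demo_filter_new(unsigned int src, unsigned int dst)
{
	struct demo_filter *f = malloc(sizeof(*f));

	if (f) {
		f->src_ip = src;
		f->dst_ip = dst;
	}
	return f;
}

/* Install 'input' at 'idx', or delete the slot when 'input' is NULL.
 * An old entry is always freed first, even if it matches 'input'.
 * Returns 0 on success, -1 for a bad index or deleting an empty slot.
 */
static int demo_update_entry(struct demo_table *t, size_t idx,
			     struct demo_filter *input)
{
	int err = -1;

	assert(t != NULL);
	if (idx >= DEMO_SLOTS)
		return -1;

	if (t->slot[idx]) {
		/* Remove the old rule: we are deleting or replacing it. */
		free(t->slot[idx]);
		t->slot[idx] = NULL;
		t->active--;
		err = 0;
	}

	/* No input means this was a delete; report whether a rule existed. */
	if (!input)
		return err;

	t->slot[idx] = input;
	t->active++;
	return 0;
}
```

Because the old entry is released without comparing it to the new one,
installing an identical filter twice leaves exactly one live allocation.
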
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 32 ++
 1 file changed, 7 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 905d66e87247..1c3805b4fcf3 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -2621,24 +2621,6 @@ static int i40e_set_rss_hash_opt(struct i40e_pf *pf, 
struct ethtool_rxnfc *nfc)
 }
 
 /**
- * i40e_match_fdir_input_set - Match a new filter against an existing one
- * @rule: The filter already added
- * @input: The new filter to comapre against
- *
- * Returns true if the two input set match
- **/
-static bool i40e_match_fdir_input_set(struct i40e_fdir_filter *rule,
- struct i40e_fdir_filter *input)
-{
-   if ((rule->dst_ip != input->dst_ip) ||
-   (rule->src_ip != input->src_ip) ||
-   (rule->dst_port != input->dst_port) ||
-   (rule->src_port != input->src_port))
-   return false;
-   return true;
-}
-
-/**
  * i40e_update_ethtool_fdir_entry - Updates the fdir filter entry
  * @vsi: Pointer to the targeted VSI
  * @input: The filter to update or NULL to indicate deletion
@@ -2673,22 +2655,22 @@ static int i40e_update_ethtool_fdir_entry(struct 
i40e_vsi *vsi,
 
/* if there is an old rule occupying our place remove it */
if (rule && (rule->fd_id == sw_idx)) {
-   if (input && !i40e_match_fdir_input_set(rule, input))
-   err = i40e_add_del_fdir(vsi, rule, false);
-   else if (!input)
-   err = i40e_add_del_fdir(vsi, rule, false);
+   /* Remove this rule, since we're either deleting it, or
+* replacing it.
+*/
+   err = i40e_add_del_fdir(vsi, rule, false);
hlist_del(&rule->fdir_node);
kfree(rule);
pf->fdir_pf_active_filters--;
}
 
-   /* If no input this was a delete, err should be 0 if a rule was
-* successfully found and removed from the list else -EINVAL
+   /* If we weren't given an input, this is a delete, so just return the
+* error code indicating if there was an entry at the requested slot
 */
if (!input)
return err;
 
-   /* initialize node and set software index */
+   /* Otherwise, install the new rule as requested */
INIT_HLIST_NODE(&input->fdir_node);
 
/* add filter to the list */
-- 
2.12.0

[net-next 10/13] i40e: don't re-enable ATR when flushing filters if SB has TCP4/IPv4 rules

2017-03-20 Thread Jeff Kirsher

From: Jacob Keller 

When flushing and replaying FDIR filters, it is possible we would
disable ATR, and then re-enable it even though we should have kept
it disabled due to existing TCP/IPv4 filters. Fix this by checking
whether we have TCP4/IPv4 filters before re-enabling.

Alternatively, we could instead restore ATR and then replay filters,
however, this would cause us to rapidly enable and then disable ATR in
some cases.

Change-ID: I076e4cc1e4409bce7f98f3c213295433a4ff43d8
Signed-off-by: Jacob Keller 
Reviewed-by: Avinash Dayanand 
Reviewed-by: Alan Brady 
Reviewed-by: Mitch Williams 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
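
A minimal standalone sketch of the condition this patch adds, assuming a
simplified state structure. demo_pf, DEMO_FLAG_ATR and demo_after_replay
are hypothetical names; only the shape of the check follows the diff below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_FLAG_ATR 0x1u	/* hypothetical flag bit */

struct demo_pf {
	unsigned int hw_disabled_flags;	/* bit set means feature disabled */
	int tcp_rule_count;		/* sideband TCP/IPv4 rules present */
};

/* After replaying sideband filters, clear the "ATR disabled" bit only
 * when nothing asked for ATR to stay off and no TCP/IPv4 rules exist.
 */
static void demo_after_replay(struct demo_pf *pf, bool disable_atr)
{
	assert(pf != NULL);
	if (!disable_atr && !pf->tcp_rule_count)
		pf->hw_disabled_flags &= ~DEMO_FLAG_ATR;
}
```

With one TCP/IPv4 rule present the disabled bit survives the replay, which
is the behaviour the commit message asks for.
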
 drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 437b79eeb8b5..cc33ac835181 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -6228,7 +6228,7 @@ static void i40e_fdir_flush_and_replay(struct i40e_pf *pf)
} else {
/* replay sideband filters */
i40e_fdir_filter_restore(pf->vsi[pf->lan_vsi]);
-   if (!disable_atr)
+   if (!disable_atr && !pf->fd_tcp_rule)
pf->hw_disabled_flags &= ~I40E_FLAG_FD_ATR_ENABLED;
clear_bit(__I40E_FD_FLUSH_REQUESTED, &pf->state);
if (I40E_DEBUG_FD & pf->hw.debug_mask)
-- 
2.12.0

[net-next 12/13] i40e: explicitly fail on extended MAC field for ethtool_rx_flow_spec

2017-03-20 Thread Jeff Kirsher

From: Jacob Keller 

Although we will fail the filter later due to checking flow_type which
will have a bogus invalid type, it is possible future refactoring will
remove this hidden failure case. Avoid a possible issue in the future by
explicitly checking the flow type at the start.

Change-Id: Ia98eb26f7b93ccbe38c7141e8f203ef496fc6598
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
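
The early rejection can be sketched outside the driver as follows. The
DEMO_* constants and demo_flow_spec are local stand-ins chosen for this
example, not definitions copied from the ethtool headers, and the accepted
flow type is an assumption made to keep the sketch small.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for flow-type bits (assumed values for illustration). */
#define DEMO_FLOW_EXT		0x80000000u
#define DEMO_FLOW_MAC_EXT	0x40000000u
#define DEMO_TCP_V4_FLOW	0x01u

struct demo_flow_spec {
	uint32_t flow_type;
};

/* Reject the extended MAC field up front instead of relying on the
 * later flow-type switch to fail on an unknown value.
 */
static int demo_check_flow_spec(const struct demo_flow_spec *fsp)
{
	assert(fsp != NULL);

	/* Extended MAC field is not supported */
	if (fsp->flow_type & DEMO_FLOW_MAC_EXT)
		return -EINVAL;

	switch (fsp->flow_type & ~DEMO_FLOW_EXT) {
	case DEMO_TCP_V4_FLOW:
		return 0;
	default:
		return -EINVAL;
	}
}
```

The explicit check keeps the failure in place even if the later switch is
refactored to mask off more bits.
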
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index d16a5a6b24fc..905d66e87247 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -2771,6 +2771,10 @@ static int i40e_add_fdir_ethtool(struct i40e_vsi *vsi,
 
fsp = (struct ethtool_rx_flow_spec *)&cmd->fs;
 
+   /* Extended MAC field is not supported */
+   if (fsp->flow_type & FLOW_MAC_EXT)
+   return -EINVAL;
+
if (fsp->location >= (pf->hw.func_caps.fd_filters_best_effort +
  pf->hw.func_caps.fd_filters_guaranteed)) {
return -EINVAL;
-- 
2.12.0

[net-next 05/13] i40e: rework exit flow of i40e_add_fdir_ethtool

2017-03-20 Thread Jeff Kirsher

From: Jacob Keller 

Refactor the exit flow of the i40e_add_fdir_ethtool function. Move the
free_input label to the end of the function, removing the dependency on
having a non-zero return value. Add a comment explaining why it is OK
not to free the fdir data structure: it is now stored in the
fdir_filter_list.

Change-Id: I723342181d59cd0c9f3b31140c37961ba37bb242
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
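
A small userspace sketch of the reworked exit flow, assuming a single
pointer in place of the filter list. demo_input, demo_list_head,
demo_program_hw and demo_add_filter are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_input {
	int id;
};

/* Single-slot stand-in for the filter list; owns whatever it points to. */
static struct demo_input *demo_list_head;

/* Pretend hardware programming step: negative ids are rejected. */
static int demo_program_hw(const struct demo_input *in)
{
	assert(in != NULL);
	return in->id < 0 ? -1 : 0;
}

static int demo_add_filter(int id)
{
	struct demo_input *input;
	int ret;

	input = calloc(1, sizeof(*input));
	if (!input)
		return -1;
	input->id = id;

	ret = demo_program_hw(input);
	if (ret)
		goto free_input;

	/* The list now owns 'input', possibly replacing an older entry.
	 * Do not free it here; that would leave a dangling list pointer.
	 */
	free(demo_list_head);
	demo_list_head = input;

	return 0;

free_input:
	free(input);
	return ret;
}
```

Keeping the label at the end means the success path returns before it, so
the free only runs for inputs that never reached the list.
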
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 7a22b473dbdd..d16a5a6b24fc 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -2828,12 +2828,19 @@ static int i40e_add_fdir_ethtool(struct i40e_vsi *vsi,
}
 
ret = i40e_add_del_fdir(vsi, input, true);
-free_input:
if (ret)
-   kfree(input);
-   else
-   i40e_update_ethtool_fdir_entry(vsi, input, fsp->location, NULL);
+   goto free_input;
+
+   /* Add the input filter to the fdir_input_list, possibly replacing
+* a previous filter. Do not free the input structure after adding it
+* to the list as this would cause a use-after-free bug.
+*/
+   i40e_update_ethtool_fdir_entry(vsi, input, fsp->location, NULL);
 
+   return 0;
+
+free_input:
+   kfree(input);
return ret;
 }
 
-- 
2.12.0

[net-next 06/13] i40e: return immediately when failing to add fdir filter

2017-03-20 Thread Jeff Kirsher

From: Jacob Keller 

Instead of setting err=true and checking this to determine when to free
the raw_packet near the end of the function, simply kfree and return
immediately. The resulting code is a bit cleaner and has one less
variable. This also resolves a subtle bug in the ipv4 case which could
fail to add the first filter and then never free the memory, resulting
in a small memory leak.

Change-ID: I7583aac033481dc794b4acaa14445059c8930ff1
Signed-off-by: Jacob Keller 
Reviewed-by: Avinash Dayanand 
Reviewed-by: Alan Brady 
Reviewed-by: Mitch Williams 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
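
The free-and-return-immediately pattern can be sketched in plain C as
below. This is an assumption-laden illustration: demo_ring_slot,
demo_program_filter and demo_add_fdir are hypothetical, and a single
pointer stands in for the descriptor ring that owns the buffer on success.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_PKT_LEN 34

/* Buffer owned by the pretend "ring" after a successful program call. */
static unsigned char *demo_ring_slot;

/* On success the ring takes ownership of 'pkt'; on failure it does not. */
static int demo_program_filter(unsigned char *pkt, int force_fail)
{
	assert(pkt != NULL);
	if (force_fail)
		return -EIO;

	free(demo_ring_slot);
	demo_ring_slot = pkt;
	return 0;
}

static int demo_add_fdir(int force_fail)
{
	unsigned char *raw_packet;
	int ret;

	raw_packet = calloc(1, DEMO_PKT_LEN);
	if (!raw_packet)
		return -ENOMEM;

	ret = demo_program_filter(raw_packet, force_fail);
	if (ret) {
		/* Free the buffer since it was not added to the ring. */
		free(raw_packet);
		return -EOPNOTSUPP;
	}

	return 0;
}
```

Handling the failure where it happens removes the need for a separate
error flag that a later loop iteration or branch could forget to check.
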
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 33 -
 1 file changed, 14 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index c4d3a40a3f10..005257b4f218 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -203,7 +203,6 @@ static int i40e_add_del_fdir_udpv4(struct i40e_vsi *vsi,
struct i40e_pf *pf = vsi->back;
struct udphdr *udp;
struct iphdr *ip;
-   bool err = false;
u8 *raw_packet;
int ret;
static char packet[] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0x08, 0,
@@ -230,7 +229,9 @@ static int i40e_add_del_fdir_udpv4(struct i40e_vsi *vsi,
dev_info(&pf->pdev->dev,
 "PCTYPE:%d, Filter command send failed for fd_id:%d (ret = %d)\n",
 fd_data->pctype, fd_data->fd_id, ret);
-   err = true;
+   /* Free the packet buffer since it wasn't added to the ring */
+   kfree(raw_packet);
+   return -EOPNOTSUPP;
} else if (I40E_DEBUG_FD & pf->hw.debug_mask) {
if (add)
dev_info(&pf->pdev->dev,
@@ -241,10 +242,8 @@ static int i40e_add_del_fdir_udpv4(struct i40e_vsi *vsi,
 "Filter deleted for PCTYPE %d loc = %d\n",
 fd_data->pctype, fd_data->fd_id);
}
-   if (err)
-   kfree(raw_packet);
 
-   return err ? -EOPNOTSUPP : 0;
+   return 0;
 }
 
 #define I40E_TCPIP_DUMMY_PACKET_LEN 54
@@ -263,7 +262,6 @@ static int i40e_add_del_fdir_tcpv4(struct i40e_vsi *vsi,
struct i40e_pf *pf = vsi->back;
struct tcphdr *tcp;
struct iphdr *ip;
-   bool err = false;
u8 *raw_packet;
int ret;
/* Dummy packet */
@@ -305,12 +303,13 @@ static int i40e_add_del_fdir_tcpv4(struct i40e_vsi *vsi,
 
fd_data->pctype = I40E_FILTER_PCTYPE_NONF_IPV4_TCP;
ret = i40e_program_fdir_filter(fd_data, raw_packet, pf, add);
-
if (ret) {
dev_info(&pf->pdev->dev,
 "PCTYPE:%d, Filter command send failed for fd_id:%d (ret = %d)\n",
 fd_data->pctype, fd_data->fd_id, ret);
-   err = true;
+   /* Free the packet buffer since it wasn't added to the ring */
+   kfree(raw_packet);
+   return -EOPNOTSUPP;
} else if (I40E_DEBUG_FD & pf->hw.debug_mask) {
if (add)
dev_info(&pf->pdev->dev, "Filter OK for PCTYPE %d loc = %d)\n",
@@ -321,10 +320,7 @@ static int i40e_add_del_fdir_tcpv4(struct i40e_vsi *vsi,
 fd_data->pctype, fd_data->fd_id);
}
 
-   if (err)
-   kfree(raw_packet);
-
-   return err ? -EOPNOTSUPP : 0;
+   return 0;
 }
 
 #define I40E_IP_DUMMY_PACKET_LEN 34
@@ -343,7 +339,6 @@ static int i40e_add_del_fdir_ipv4(struct i40e_vsi *vsi,
 {
struct i40e_pf *pf = vsi->back;
struct iphdr *ip;
-   bool err = false;
u8 *raw_packet;
int ret;
int i;
@@ -365,12 +360,15 @@ static int i40e_add_del_fdir_ipv4(struct i40e_vsi *vsi,
 
fd_data->pctype = i;
ret = i40e_program_fdir_filter(fd_data, raw_packet, pf, add);
-
if (ret) {
dev_info(&pf->pdev->dev,
 "PCTYPE:%d, Filter command send failed for fd_id:%d (ret = %d)\n",
 fd_data->pctype, fd_data->fd_id, ret);
-   err = true;
+   /* The packet buffer wasn't added to the ring so we
+* need to free it now.
+*/
+   kfree(raw_packet);
+   return -EOPNOTSUPP;
} else if (I40E_DEBUG_FD & pf->hw.debug_mask) {
if (add)
dev_info(&pf->pdev->dev,
@@ -383,10 +381,7 @@ static int i40e_add_del_fdir_ipv4(struct i40e_vsi *vsi,
}

[net-next 08/13] i40e: remove redundant check for fd_tcp_rule when restoring filters

2017-03-20 Thread Jeff Kirsher

From: Jacob Keller 

i40e_fdir_filter_restore re-adds all existing filters, which already
checks when adding a TCPv4 filter to disable ATR. We don't need to make
the check twice, so remove this redundant code.

Change-ID: Ia0b0690e23523915199d601494557def135c9d7f
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 6e63459ceb65..221e1705c031 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5466,12 +5466,6 @@ static int i40e_up_complete(struct i40e_vsi *vsi)
if (vsi->type == I40E_VSI_FDIR) {
/* reset fd counters */
pf->fd_add_err = pf->fd_atr_cnt = 0;
-   if (pf->fd_tcp_rule > 0) {
-   pf->hw_disabled_flags |= I40E_FLAG_FD_ATR_ENABLED;
-   if (I40E_DEBUG_FD & pf->hw.debug_mask)
-   dev_info(>pdev->dev, "Forcing ATR off, 
sideband rules for TCP/IPv4 exist\n");
-   pf->fd_tcp_rule = 0;
-   }
i40e_fdir_filter_restore(vsi);
}
 
-- 
2.12.0

[PATCH 3/3] soc: qcom: smd-rpm: Add msm8996 compatibility

2017-03-20 Thread Bjorn Andersson

With the RPM driver transitioned to RPMSG we can reuse the SMD-RPM
driver on top of GLINK for 8996, without any modifications.

Signed-off-by: Bjorn Andersson 
---
 drivers/soc/qcom/smd-rpm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/soc/qcom/smd-rpm.c b/drivers/soc/qcom/smd-rpm.c
index 0dcf1bf33126..c2346752b3ea 100644
--- a/drivers/soc/qcom/smd-rpm.c
+++ b/drivers/soc/qcom/smd-rpm.c
@@ -225,6 +225,7 @@ static const struct of_device_id qcom_smd_rpm_of_match[] = {
{ .compatible = "qcom,rpm-apq8084" },
{ .compatible = "qcom,rpm-msm8916" },
{ .compatible = "qcom,rpm-msm8974" },
+   { .compatible = "qcom,rpm-msm8996" },
{}
 };
 MODULE_DEVICE_TABLE(of, qcom_smd_rpm_of_match);
-- 
2.12.0

[PATCH 1/3] soc: qcom: smd: Transition client drivers from smd to rpmsg

2017-03-20 Thread Bjorn Andersson

By moving these client drivers to use RPMSG instead of the direct SMD
API we can reuse them on top of the newly added GLINK wire-protocol
support found in the 820 and 835 Qualcomm platforms.

As the new (RPMSG-based) and old SMD implementations are mutually
exclusive we have to change all client drivers in one commit, to make
sure we have a working system before and after this transition.

Signed-off-by: Bjorn Andersson 
---

Based on v4.11-rc3 with Arnd's Kconfig dependency fixes for BT_QCOMSMD
(https://lkml.org/lkml/2017/3/20/1038).

 drivers/bluetooth/Kconfig  |  2 +-
 drivers/bluetooth/btqcomsmd.c  | 32 +--
 drivers/net/wireless/ath/wcn36xx/Kconfig   |  2 +-
 drivers/net/wireless/ath/wcn36xx/main.c|  6 ++--
 drivers/net/wireless/ath/wcn36xx/smd.c | 10 +++---
 drivers/net/wireless/ath/wcn36xx/smd.h |  6 ++--
 drivers/net/wireless/ath/wcn36xx/wcn36xx.h |  2 +-
 drivers/soc/qcom/Kconfig   |  4 +--
 drivers/soc/qcom/smd-rpm.c | 43 +
 drivers/soc/qcom/wcnss_ctrl.c  | 50 +-
 include/linux/soc/qcom/wcnss_ctrl.h| 11 ---
 net/qrtr/Kconfig   |  2 +-
 net/qrtr/smd.c | 42 -
 13 files changed, 108 insertions(+), 104 deletions(-)

diff --git a/drivers/bluetooth/Kconfig b/drivers/bluetooth/Kconfig
index 08e054507d0b..a6a9dd4d0eef 100644
--- a/drivers/bluetooth/Kconfig
+++ b/drivers/bluetooth/Kconfig
@@ -344,7 +344,7 @@ config BT_WILINK
 
 config BT_QCOMSMD
tristate "Qualcomm SMD based HCI support"
-   depends on QCOM_SMD || (COMPILE_TEST && QCOM_SMD=n)
+   depends on RPMSG || (COMPILE_TEST && RPMSG=n)
depends on QCOM_WCNSS_CTRL || (COMPILE_TEST && QCOM_WCNSS_CTRL=n)
select BT_QCA
help
diff --git a/drivers/bluetooth/btqcomsmd.c b/drivers/bluetooth/btqcomsmd.c
index 8d4868af9bbd..ef730c173d4b 100644
--- a/drivers/bluetooth/btqcomsmd.c
+++ b/drivers/bluetooth/btqcomsmd.c
@@ -14,7 +14,7 @@
 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 
@@ -26,8 +26,8 @@
 struct btqcomsmd {
struct hci_dev *hdev;
 
-   struct qcom_smd_channel *acl_channel;
-   struct qcom_smd_channel *cmd_channel;
+   struct rpmsg_endpoint *acl_channel;
+   struct rpmsg_endpoint *cmd_channel;
 };
 
 static int btqcomsmd_recv(struct hci_dev *hdev, unsigned int type,
@@ -48,19 +48,19 @@ static int btqcomsmd_recv(struct hci_dev *hdev, unsigned 
int type,
return hci_recv_frame(hdev, skb);
 }
 
-static int btqcomsmd_acl_callback(struct qcom_smd_channel *channel,
- const void *data, size_t count)
+static int btqcomsmd_acl_callback(struct rpmsg_device *rpdev, void *data,
+ int count, void *priv, u32 addr)
 {
-   struct btqcomsmd *btq = qcom_smd_get_drvdata(channel);
+   struct btqcomsmd *btq = priv;
 
btq->hdev->stat.byte_rx += count;
return btqcomsmd_recv(btq->hdev, HCI_ACLDATA_PKT, data, count);
 }
 
-static int btqcomsmd_cmd_callback(struct qcom_smd_channel *channel,
- const void *data, size_t count)
+static int btqcomsmd_cmd_callback(struct rpmsg_device *rpdev, void *data,
+ int count, void *priv, u32 addr)
 {
-   struct btqcomsmd *btq = qcom_smd_get_drvdata(channel);
+   struct btqcomsmd *btq = priv;
 
return btqcomsmd_recv(btq->hdev, HCI_EVENT_PKT, data, count);
 }
@@ -72,12 +72,12 @@ static int btqcomsmd_send(struct hci_dev *hdev, struct 
sk_buff *skb)
 
switch (hci_skb_pkt_type(skb)) {
case HCI_ACLDATA_PKT:
-   ret = qcom_smd_send(btq->acl_channel, skb->data, skb->len);
+   ret = rpmsg_send(btq->acl_channel, skb->data, skb->len);
hdev->stat.acl_tx++;
hdev->stat.byte_tx += skb->len;
break;
case HCI_COMMAND_PKT:
-   ret = qcom_smd_send(btq->cmd_channel, skb->data, skb->len);
+   ret = rpmsg_send(btq->cmd_channel, skb->data, skb->len);
hdev->stat.cmd_tx++;
break;
default:
@@ -114,18 +114,15 @@ static int btqcomsmd_probe(struct platform_device *pdev)
wcnss = dev_get_drvdata(pdev->dev.parent);
 
btq->acl_channel = qcom_wcnss_open_channel(wcnss, "APPS_RIVA_BT_ACL",
-  btqcomsmd_acl_callback);
+  btqcomsmd_acl_callback, btq);
if (IS_ERR(btq->acl_channel))
return PTR_ERR(btq->acl_channel);
 
btq->cmd_channel = qcom_wcnss_open_channel(wcnss, "APPS_RIVA_BT_CMD",
-  btqcomsmd_cmd_callback);
+  btqcomsmd_cmd_callback, btq);
if

[PATCH 2/3] soc: qcom: smd: Remove standalone driver

2017-03-20 Thread Bjorn Andersson

Remove the standalone SMD implementation as we have transitioned the
client drivers to use the RPMSG based one.

Also remove all dependencies on QCOM_SMD from Kconfig files, in order to
keep them selectable in the absence of the removed symbol.

Signed-off-by: Bjorn Andersson 
---

Based on v4.11-rc3 with Arnd's Kconfig dependency fixes in remoteproc
(https://lkml.org/lkml/2017/3/20/1027).

 drivers/remoteproc/Kconfig |6 +-
 drivers/rpmsg/Kconfig  |1 -
 drivers/soc/qcom/Kconfig   |8 -
 drivers/soc/qcom/Makefile  |1 -
 drivers/soc/qcom/smd.c | 1560 
 include/linux/rpmsg/qcom_smd.h |2 +-
 include/linux/soc/qcom/smd.h   |  139 
 7 files changed, 4 insertions(+), 1713 deletions(-)
 delete mode 100644 drivers/soc/qcom/smd.c
 delete mode 100644 include/linux/soc/qcom/smd.h

diff --git a/drivers/remoteproc/Kconfig b/drivers/remoteproc/Kconfig
index 1dc43fc5f65f..faad69a1a597 100644
--- a/drivers/remoteproc/Kconfig
+++ b/drivers/remoteproc/Kconfig
@@ -76,7 +76,7 @@ config QCOM_ADSP_PIL
depends on OF && ARCH_QCOM
depends on REMOTEPROC
depends on QCOM_SMEM
-   depends on RPMSG_QCOM_SMD || QCOM_SMD || (COMPILE_TEST && QCOM_SMD=n && 
RPMSG_QCOM_SMD=n)
+   depends on RPMSG_QCOM_SMD || (COMPILE_TEST && RPMSG_QCOM_SMD=n)
select MFD_SYSCON
select QCOM_MDT_LOADER
select QCOM_RPROC_COMMON
@@ -93,7 +93,7 @@ config QCOM_Q6V5_PIL
depends on OF && ARCH_QCOM
depends on QCOM_SMEM
depends on REMOTEPROC
-   depends on RPMSG_QCOM_SMD || QCOM_SMD || (COMPILE_TEST && QCOM_SMD=n && 
RPMSG_QCOM_SMD=n)
+   depends on RPMSG_QCOM_SMD || (COMPILE_TEST && RPMSG_QCOM_SMD=n)
select MFD_SYSCON
select QCOM_RPROC_COMMON
select QCOM_SCM
@@ -104,7 +104,7 @@ config QCOM_Q6V5_PIL
 config QCOM_WCNSS_PIL
tristate "Qualcomm WCNSS Peripheral Image Loader"
depends on OF && ARCH_QCOM
-   depends on RPMSG_QCOM_SMD || QCOM_SMD || (COMPILE_TEST && QCOM_SMD=n && 
RPMSG_QCOM_SMD=n)
+   depends on RPMSG_QCOM_SMD || (COMPILE_TEST && RPMSG_QCOM_SMD=n)
depends on QCOM_SMEM
depends on REMOTEPROC
select QCOM_MDT_LOADER
diff --git a/drivers/rpmsg/Kconfig b/drivers/rpmsg/Kconfig
index f12ac0b28263..edc008f55663 100644
--- a/drivers/rpmsg/Kconfig
+++ b/drivers/rpmsg/Kconfig
@@ -16,7 +16,6 @@ config RPMSG_CHAR
 config RPMSG_QCOM_SMD
tristate "Qualcomm Shared Memory Driver (SMD)"
depends on QCOM_SMEM
-   depends on QCOM_SMD=n
select RPMSG
help
  Say y here to enable support for the Qualcomm Shared Memory Driver
diff --git a/drivers/soc/qcom/Kconfig b/drivers/soc/qcom/Kconfig
index 751dce0c19b3..4428a580a995 100644
--- a/drivers/soc/qcom/Kconfig
+++ b/drivers/soc/qcom/Kconfig
@@ -33,14 +33,6 @@ config QCOM_SMEM
  The driver provides an interface to items in a heap shared among all
  processors in a Qualcomm platform.
 
-config QCOM_SMD
-   tristate "Qualcomm Shared Memory Driver (SMD)"
-   depends on QCOM_SMEM
-   help
- Say y here to enable support for the Qualcomm Shared Memory Driver
- providing communication channels to remote processors in Qualcomm
- platforms.
-
 config QCOM_SMD_RPM
tristate "Qualcomm Resource Power Manager (RPM) over SMD"
depends on RPMSG && OF
diff --git a/drivers/soc/qcom/Makefile b/drivers/soc/qcom/Makefile
index 1f30260b06b8..414f0de274fa 100644
--- a/drivers/soc/qcom/Makefile
+++ b/drivers/soc/qcom/Makefile
@@ -1,7 +1,6 @@
 obj-$(CONFIG_QCOM_GSBI)+=  qcom_gsbi.o
 obj-$(CONFIG_QCOM_MDT_LOADER)  += mdt_loader.o
 obj-$(CONFIG_QCOM_PM)  +=  spm.o
-obj-$(CONFIG_QCOM_SMD) +=  smd.o
 obj-$(CONFIG_QCOM_SMD_RPM) += smd-rpm.o
 obj-$(CONFIG_QCOM_SMEM) += smem.o
 obj-$(CONFIG_QCOM_SMEM_STATE) += smem_state.o
diff --git a/drivers/soc/qcom/smd.c b/drivers/soc/qcom/smd.c
deleted file mode 100644
index 322034ab9d37..
--- a/drivers/soc/qcom/smd.c
+++ /dev/null
@@ -1,1560 +0,0 @@
-/*
- * Copyright (c) 2015, Sony Mobile Communications AB.
- * Copyright (c) 2012-2013, The Linux Foundation. All rights reserved.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 and
- * only version 2 as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
- */
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-/*
- * The Qualcomm Shared Memory communication solution provides point-to-point
- * channels for clients to send and

[net-next sample action optimization v4 1/4] openvswitch: Deferred fifo API change.

2017-03-20 Thread Andy Zhou

The add_deferred_actions() API currently requires actions to be passed in
as a fully encoded netlink message. So far, both the 'sample' and 'recirc'
actions happen to carry their actions as fully encoded netlink messages.
However, this requirement is more restrictive than necessary: a future
patch will need to pass in action lists that are not fully encoded
by themselves.

Signed-off-by: Andy Zhou 
Acked-by: Joe Stringer 
---
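
The API change can be sketched as a FIFO whose entries carry an explicit
length next to the actions pointer. All demo_* names are hypothetical, a
byte buffer stands in for netlink attributes, and the fifo is simplified
compared to the datapath's.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_FIFO_SIZE 4

struct demo_deferred {
	const unsigned char *actions;	/* may point into a larger buffer */
	int actions_len;		/* explicit length, not derived */
};

struct demo_fifo {
	int head;
	int tail;
	struct demo_deferred fifo[DEMO_FIFO_SIZE];
};

/* Reserve the next entry, or return NULL when the fifo is full. */
static struct demo_deferred *demo_fifo_put(struct demo_fifo *f)
{
	assert(f != NULL);
	if (f->head >= DEMO_FIFO_SIZE)
		return NULL;
	return &f->fifo[f->head++];
}

/* Take the oldest entry, or return NULL when the fifo is empty. */
static struct demo_deferred *demo_fifo_get(struct demo_fifo *f)
{
	assert(f != NULL);
	if (f->tail == f->head)
		return NULL;
	return &f->fifo[f->tail++];
}

/* Queue an action list given as pointer plus length. */
static struct demo_deferred *demo_add_deferred(struct demo_fifo *f,
					       const unsigned char *actions,
					       int actions_len)
{
	struct demo_deferred *da = demo_fifo_put(f);

	if (da) {
		da->actions = actions;
		da->actions_len = actions_len;
	}
	return da;
}
```

Storing the length lets a caller queue a slice of a larger message, which
a self-describing "fully encoded" attribute could not express.
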
 net/openvswitch/actions.c | 18 +++---
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index c82301c..75182e9 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -51,6 +51,7 @@ static int do_execute_actions(struct datapath *dp, struct 
sk_buff *skb,
 struct deferred_action {
struct sk_buff *skb;
const struct nlattr *actions;
+   int actions_len;
 
/* Store pkt_key clone when creating deferred action. */
struct sw_flow_key pkt_key;
@@ -119,8 +120,9 @@ static struct deferred_action *action_fifo_put(struct 
action_fifo *fifo)
 
 /* Return true if fifo is not full */
 static struct deferred_action *add_deferred_actions(struct sk_buff *skb,
-   const struct sw_flow_key 
*key,
-   const struct nlattr *attr)
+   const struct sw_flow_key *key,
+   const struct nlattr *actions,
+   const int actions_len)
 {
struct action_fifo *fifo;
struct deferred_action *da;
@@ -129,7 +131,8 @@ static struct deferred_action *add_deferred_actions(struct 
sk_buff *skb,
da = action_fifo_put(fifo);
if (da) {
da->skb = skb;
-   da->actions = attr;
+   da->actions = actions;
+   da->actions_len = actions_len;
da->pkt_key = *key;
}
 
@@ -966,7 +969,8 @@ static int sample(struct datapath *dp, struct sk_buff *skb,
/* Skip the sample action when out of memory. */
return 0;
 
-   if (!add_deferred_actions(skb, key, a)) {
+   if (!add_deferred_actions(skb, key, nla_data(acts_list),
+ nla_len(acts_list))) {
if (net_ratelimit())
pr_warn("%s: deferred actions limit reached, dropping 
sample action\n",
ovs_dp_name(dp));
@@ -1123,7 +1127,7 @@ static int execute_recirc(struct datapath *dp, struct 
sk_buff *skb,
return 0;
}
 
-   da = add_deferred_actions(skb, key, NULL);
+   da = add_deferred_actions(skb, key, NULL, 0);
if (da) {
da->pkt_key.recirc_id = nla_get_u32(a);
} else {
@@ -1278,10 +1282,10 @@ static void process_deferred_actions(struct datapath 
*dp)
struct sk_buff *skb = da->skb;
struct sw_flow_key *key = &da->pkt_key;
const struct nlattr *actions = da->actions;
+   int actions_len = da->actions_len;
 
if (actions)
-   do_execute_actions(dp, skb, key, actions,
-  nla_len(actions));
+   do_execute_actions(dp, skb, key, actions, actions_len);
else
ovs_dp_process_packet(skb, key);
} while (!action_fifo_is_empty(fifo));
-- 
1.8.3.1

[net-next sample action optimization v4 0/4]

2017-03-20 Thread Andy Zhou

The sample action can be used for translating OpenFlow 'clone' action.
However its implementation has not been sufficiently optimized for this
use case. This series attempts to close the gap.

Patch 3 commit message has more details on the specific optimizations
implemented.

---
v3->v4: Enhance patch 4.
Fix two bugs pointed out by Pravin,
Remove 'is_sample' variable.

v2->v3: Enhance patch 4, Refactor to move more common logic to clone_execute().

v1->v2: Address Pravin's comment, Refactor recirc and sample
to share more common code

Andy Zhou (4):
openvswitch: Deferred fifo API change.
openvswitch: Refactor recirc key allocation.
openvswitch: Optimize sample action for the clone use cases
Openvswitch: Refactor sample and recirc actions implementation

 include/uapi/linux/openvswitch.h |  15 +++
 net/openvswitch/actions.c| 271 ++-
 net/openvswitch/datapath.h   |   2 -
 net/openvswitch/flow_netlink.c   | 141 +---
 4 files changed, 263 insertions(+), 166 deletions(-)

-- 
1.8.3.1

[net-next sample action optimization v4 4/4] Openvswitch: Refactor sample and recirc actions implementation

2017-03-20 Thread Andy Zhou

Add clone_execute(), which both the sample and the recirc
action implementations can use.

Signed-off-by: Andy Zhou 
---
 net/openvswitch/actions.c | 176 --
 1 file changed, 93 insertions(+), 83 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 3529f7b..e461067 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -44,10 +44,6 @@
 #include "conntrack.h"
 #include "vport.h"
 
-static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
- struct sw_flow_key *key,
- const struct nlattr *attr, int len);
-
 struct deferred_action {
struct sk_buff *skb;
const struct nlattr *actions;
@@ -166,6 +162,12 @@ static bool is_flow_key_valid(const struct sw_flow_key 
*key)
return !(key->mac_proto & SW_FLOW_KEY_INVALID);
 }
 
+static int clone_execute(struct datapath *dp, struct sk_buff *skb,
+struct sw_flow_key *key,
+u32 recirc_id,
+const struct nlattr *actions, int len,
+bool last, bool clone_flow_key);
+
 static void update_ethertype(struct sk_buff *skb, struct ethhdr *hdr,
 __be16 ethertype)
 {
@@ -938,10 +940,9 @@ static int sample(struct datapath *dp, struct sk_buff *skb,
 {
struct nlattr *actions;
struct nlattr *sample_arg;
-   struct sw_flow_key *orig_key = key;
int rem = nla_len(attr);
-   int err = 0;
const struct sample_arg *arg;
+   bool clone_flow_key;
 
/* The first action is always 'OVS_SAMPLE_ATTR_ARG'. */
sample_arg = nla_data(attr);
@@ -955,43 +956,9 @@ static int sample(struct datapath *dp, struct sk_buff *skb,
return 0;
}
 
-   /* Unless the last action, sample works on the clone of SKB.  */
-   skb = last ? skb : skb_clone(skb, GFP_ATOMIC);
-   if (!skb) {
-   /* Out of memory, skip this sample action.
-*/
-   return 0;
-   }
-
-   /* In case the sample actions won't change 'key',
-* it can be used directly to execute sample actions.
-* Otherwise, allocate a new key from the
-* next recursion level of 'flow_keys'. If
-* successful, execute the sample actions without
-* deferring.
-*
-* Defer the sample actions if the recursion
-* limit has been reached.
-*/
-   if (!arg->exec) {
-   __this_cpu_inc(exec_actions_level);
-   key = clone_key(key);
-   }
-
-   if (key) {
-   err = do_execute_actions(dp, skb, key, actions, rem);
-   } else if (!add_deferred_actions(skb, orig_key, actions, rem)) {
-
-   if (net_ratelimit())
-   pr_warn("%s: deferred action limit reached, drop sample 
action\n",
-   ovs_dp_name(dp));
-   kfree_skb(skb);
-   }
-
-   if (!arg->exec)
-   __this_cpu_dec(exec_actions_level);
-
-   return err;
+   clone_flow_key = !arg->exec;
+   return clone_execute(dp, skb, key, 0, actions, rem, last,
+clone_flow_key);
 }
 
 static void execute_hash(struct sk_buff *skb, struct sw_flow_key *key,
@@ -1102,10 +1069,9 @@ static int execute_masked_set_action(struct sk_buff *skb,
 
 static int execute_recirc(struct datapath *dp, struct sk_buff *skb,
  struct sw_flow_key *key,
- const struct nlattr *a, int rem)
+ const struct nlattr *a, bool last)
 {
-   struct sw_flow_key *recirc_key;
-   struct deferred_action *da;
+   u32 recirc_id;
 
if (!is_flow_key_valid(key)) {
int err;
@@ -1116,40 +1082,8 @@ static int execute_recirc(struct datapath *dp, struct 
sk_buff *skb,
}
BUG_ON(!is_flow_key_valid(key));
 
-   if (!nla_is_last(a, rem)) {
-   /* Recirc action is the not the last action
-* of the action list, need to clone the skb.
-*/
-   skb = skb_clone(skb, GFP_ATOMIC);
-
-   /* Skip the recirc action when out of memory, but
-* continue on with the rest of the action list.
-*/
-   if (!skb)
-   return 0;
-   }
-
-   /* If within the limit of 'OVS_DEFERRED_ACTION_THRESHOLD',
-* recirc immediately, otherwise, defer it for later execution.
-*/
-   recirc_key = clone_key(key);
-   if (recirc_key) {
-   recirc_key->recirc_id = nla_get_u32(a);
-   ovs_dp_process_packet(skb, recirc_key);
-   } else {
-   da = add_deferred_actions(skb, key, NULL, 0);
-   if (da) {
-   recirc_key = >pkt_key;
-   recirc_key->recirc_id =

RE: [Intel-wired-lan] [PATCH] i40e: fix memcpy with swapped arguments

2017-03-20 Thread Keller, Jacob E

> -----Original Message-----
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@lists.osuosl.org] On
> Behalf Of Colin King
> Sent: Monday, March 20, 2017 7:46 AM
> To: Kirsher, Jeffrey T ; intel-wired-
> l...@lists.osuosl.org; netdev@vger.kernel.org
> Cc: kernel-janit...@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: [Intel-wired-lan] [PATCH] i40e: fix memcpy with swapped arguments
> 
> From: Colin Ian King 

Hi there,

> 
> The current code copies an uninitialized params into
> cdev->lan_info.params and then passes the uninitialized params
> to the call cdev->client->ops->l2_param_change.  I believe the
> order of the source and destination in the memcpy is the wrong
> way around and should be swapped.
> 

So you are correct that params is uninitialized. However, the fix here is not 
correct. Somehow we dropped the code for initializing the parameters.

See commit d7ce6422d6e6 ("i40e: don't check params until after checking for 
client instance", 2017-02-09) It looks like the commit itself was malformed 
when applied upstream, and a later commit which should have preserved the 
changes 3140aa9a78c9 ("i40e: KISS the client interface", 2017-03-14) 
accidentally dropped them.

I'll provide a patch to get this back into the correct state.

Thanks for catching this.

Regards,
Jake

[net-next sample action optimization v4 2/4] openvswitch: Refactor recirc key allocation.

2017-03-20 Thread Andy Zhou

The logic for allocating and copying a key for each 'exec_actions_level'
was specific to execute_recirc(). However, future patches will reuse it
as well. Refactor the logic into its own function, clone_key().

Signed-off-by: Andy Zhou 
Acked-by: Pravin B Shelar 
---
 net/openvswitch/actions.c | 66 ---
 1 file changed, 40 insertions(+), 26 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 75182e9..8c9c60c 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2007-2014 Nicira, Inc.
+ * Copyright (c) 2007-2017 Nicira, Inc.
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of version 2 of the GNU General Public
@@ -83,14 +83,31 @@ struct action_fifo {
struct deferred_action fifo[DEFERRED_ACTION_FIFO_SIZE];
 };
 
-struct recirc_keys {
+struct action_flow_keys {
struct sw_flow_key key[OVS_DEFERRED_ACTION_THRESHOLD];
 };
 
 static struct action_fifo __percpu *action_fifos;
-static struct recirc_keys __percpu *recirc_keys;
+static struct action_flow_keys __percpu *flow_keys;
 static DEFINE_PER_CPU(int, exec_actions_level);
 
+/* Make a clone of the 'key', using the pre-allocated percpu 'flow_keys'
+ * space. Return NULL if out of key spaces.
+ */
+static struct sw_flow_key *clone_key(const struct sw_flow_key *key_)
+{
+   struct action_flow_keys *keys = this_cpu_ptr(flow_keys);
+   int level = this_cpu_read(exec_actions_level);
+   struct sw_flow_key *key = NULL;
+
+   if (level <= OVS_DEFERRED_ACTION_THRESHOLD) {
+   key = &keys->key[level - 1];
+   *key = *key_;
+   }
+
+   return key;
+}
+
 static void action_fifo_init(struct action_fifo *fifo)
 {
fifo->head = 0;
@@ -1090,8 +1107,8 @@ static int execute_recirc(struct datapath *dp, struct sk_buff *skb,
  struct sw_flow_key *key,
  const struct nlattr *a, int rem)
 {
+   struct sw_flow_key *recirc_key;
struct deferred_action *da;
-   int level;
 
if (!is_flow_key_valid(key)) {
int err;
@@ -1115,29 +1132,26 @@ static int execute_recirc(struct datapath *dp, struct sk_buff *skb,
return 0;
}
 
-   level = this_cpu_read(exec_actions_level);
-   if (level <= OVS_DEFERRED_ACTION_THRESHOLD) {
-   struct recirc_keys *rks = this_cpu_ptr(recirc_keys);
-   struct sw_flow_key *recirc_key = &rks->key[level - 1];
-
-   *recirc_key = *key;
+   /* If within the limit of 'OVS_DEFERRED_ACTION_THRESHOLD',
+* recirc immediately, otherwise, defer it for later execution.
+*/
+   recirc_key = clone_key(key);
+   if (recirc_key) {
recirc_key->recirc_id = nla_get_u32(a);
ovs_dp_process_packet(skb, recirc_key);
-
-   return 0;
-   }
-
-   da = add_deferred_actions(skb, key, NULL, 0);
-   if (da) {
-   da->pkt_key.recirc_id = nla_get_u32(a);
} else {
-   kfree_skb(skb);
-
-   if (net_ratelimit())
-   pr_warn("%s: deferred action limit reached, drop recirc action\n",
-   ovs_dp_name(dp));
+   da = add_deferred_actions(skb, key, NULL, 0);
+   if (da) {
+   recirc_key = &da->pkt_key;
+   recirc_key->recirc_id = nla_get_u32(a);
+   } else {
+   /* Log an error in case action fifo is full.  */
+   kfree_skb(skb);
+   if (net_ratelimit())
+   pr_warn("%s: deferred action limit reached, drop recirc action\n",
+   ovs_dp_name(dp));
+   }
}
-
return 0;
 }
 
@@ -1327,8 +1341,8 @@ int action_fifos_init(void)
if (!action_fifos)
return -ENOMEM;
 
-   recirc_keys = alloc_percpu(struct recirc_keys);
-   if (!recirc_keys) {
+   flow_keys = alloc_percpu(struct action_flow_keys);
+   if (!flow_keys) {
free_percpu(action_fifos);
return -ENOMEM;
}
@@ -1339,5 +1353,5 @@ int action_fifos_init(void)
 void action_fifos_exit(void)
 {
free_percpu(action_fifos);
-   free_percpu(recirc_keys);
+   free_percpu(flow_keys);
 }
-- 
1.8.3.1
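
The level-indexed key pool that clone_key() relies on can be sketched in
plain userspace C as follows. This is an illustration under stated
assumptions, not the kernel code: 'struct flow_key', 'struct key_pool',
KEY_POOL_DEPTH (an arbitrary stand-in for OVS_DEFERRED_ACTION_THRESHOLD) and
clone_key_sketch() are all hypothetical names, and the pool is passed in
explicitly where the kernel uses per-CPU storage.

```c
#include <assert.h>
#include <stddef.h>

#define KEY_POOL_DEPTH 4	/* arbitrary stand-in for the kernel threshold */

struct flow_key {
	unsigned int recirc_id;
	unsigned int in_port;
};

/* One pool per execution context; the kernel code uses per-CPU storage. */
struct key_pool {
	struct flow_key key[KEY_POOL_DEPTH];
};

/* Copy 'src' into the slot for nesting level 'level' (1-based).
 * Returns NULL when the nesting is deeper than the pool.
 */
static struct flow_key *clone_key_sketch(struct key_pool *pool, int level,
					 const struct flow_key *src)
{
	struct flow_key *key = NULL;

	assert(level >= 1);
	if (level <= KEY_POOL_DEPTH) {
		key = &pool->key[level - 1];
		*key = *src;
	}

	return key;
}
```

Because each nesting level owns one slot, a caller can modify the clone
(for example its recirc_id) without touching the original key, and a NULL
return tells the caller to fall back to deferring the action.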

[net-next sample action optimization v4 3/4] openvswitch: Optimize sample action for the clone use cases

2017-03-20 Thread Andy Zhou

With the introduction of the OpenFlow 'clone' action, the OVS user space
can now translate the 'clone' action into the kernel datapath 'sample'
action, with 100% probability, to ensure that the clone semantics,
which is that the packet seen by the clone action is the same as the
packet seen by the action after clone, is faithfully carried out
in the datapath.

While the sample action in the datapath has the matching semantics,
its implementation is only optimized for its original use.
Specifically, there are two limitations: First, there is a nesting
restriction of 3 levels, enforced at flow downloading time. This
limit turns out to be too restrictive for the 'clone' use case.
Second, the implementation avoids a recursive call only if the sample
action list has a single userspace action.

The main optimization implemented in this series removes the static
nesting limit check and instead implements a run time recursion limit
check, plus recursion avoidance similar to that of the 'recirc' action.
This optimization solves both issues #1 and #2 above.

One related optimization attempts to avoid copying the flow key as
long as the enclosed actions do not change the flow key. The
detection is performed only once, at flow downloading time.

Another related optimization is to rewrite the action list
at flow downloading time in order to save the fast path from parsing
the sample action list in its original form repeatedly.

Signed-off-by: Andy Zhou 
---
 include/uapi/linux/openvswitch.h |  15 +
 net/openvswitch/actions.c| 107 ++---
 net/openvswitch/datapath.h   |   2 -
 net/openvswitch/flow_netlink.c   | 141 +++
 4 files changed, 167 insertions(+), 98 deletions(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 7f41f7d..66d1c3c 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -578,10 +578,25 @@ enum ovs_sample_attr {
OVS_SAMPLE_ATTR_PROBABILITY, /* u32 number */
OVS_SAMPLE_ATTR_ACTIONS, /* Nested OVS_ACTION_ATTR_* attributes. */
__OVS_SAMPLE_ATTR_MAX,
+
+#ifdef __KERNEL__
+   OVS_SAMPLE_ATTR_ARG  /* struct sample_arg  */
+#endif
 };
 
 #define OVS_SAMPLE_ATTR_MAX (__OVS_SAMPLE_ATTR_MAX - 1)
 
+#ifdef __KERNEL__
+struct sample_arg {
+   bool exec;   /* When true, actions in sample will not
+ * change flow keys. False otherwise.
+ */
+   u32  probability;/* Same value as
+ * 'OVS_SAMPLE_ATTR_PROBABILITY'.
+ */
+};
+#endif
+
 /**
  * enum ovs_userspace_attr - Attributes for %OVS_ACTION_ATTR_USERSPACE action.
  * @OVS_USERSPACE_ATTR_PID: u32 Netlink PID to which the %OVS_PACKET_CMD_ACTION
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 8c9c60c..3529f7b 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -928,73 +928,70 @@ static int output_userspace(struct datapath *dp, struct sk_buff *skb,
return ovs_dp_upcall(dp, skb, key, &upcall, cutlen);
 }
 
+/* When 'last' is true, sample() should always consume the 'skb'.
+ * Otherwise, sample() should keep 'skb' intact regardless what
+ * actions are executed within sample().
+ */
 static int sample(struct datapath *dp, struct sk_buff *skb,
  struct sw_flow_key *key, const struct nlattr *attr,
- const struct nlattr *actions, int actions_len)
+ bool last)
 {
-   const struct nlattr *acts_list = NULL;
-   const struct nlattr *a;
-   int rem;
-   u32 cutlen = 0;
+   struct nlattr *actions;
+   struct nlattr *sample_arg;
+   struct sw_flow_key *orig_key = key;
+   int rem = nla_len(attr);
+   int err = 0;
+   const struct sample_arg *arg;
 
-   for (a = nla_data(attr), rem = nla_len(attr); rem > 0;
-a = nla_next(a, &rem)) {
-   u32 probability;
+   /* The first action is always 'OVS_SAMPLE_ATTR_ARG'. */
+   sample_arg = nla_data(attr);
+   arg = nla_data(sample_arg);
+   actions = nla_next(sample_arg, &rem);
 
-   switch (nla_type(a)) {
-   case OVS_SAMPLE_ATTR_PROBABILITY:
-   probability = nla_get_u32(a);
-   if (!probability || prandom_u32() > probability)
-   return 0;
-   break;
-
-   case OVS_SAMPLE_ATTR_ACTIONS:
-   acts_list = a;
-   break;
-   }
+   if ((arg->probability != U32_MAX) &&
+   (!arg->probability || prandom_u32() > arg->probability)) {
+   if (last)
+   consume_skb(skb);
+   return 0;
}
 
-   rem = nla_len(acts_list);
-   a = nla_data(acts_list);
-
-   /* Actions

Re: [PATCH net-next v5 2/3] Add a eBPF helper function to retrieve socket uid

2017-03-20 Thread Daniel Borkmann


On 03/20/2017 07:41 PM, Chenbo Feng wrote:

From: Chenbo Feng 

Returns the owner uid of the socket inside a sk_buff. This is useful to
perform per-UID accounting of network traffic or per-UID packet
filtering. The socket needs to be a fullsock, otherwise overflowuid is
returned.

Signed-off-by: Chenbo Feng 
---
  include/uapi/linux/bpf.h   |  9 -
  net/core/filter.c  | 22 ++
  tools/include/uapi/linux/bpf.h |  3 ++-
  3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index dc81a9f..ff42111 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -462,6 +462,12 @@ union bpf_attr {
   * @skb: pointer to skb
   * Return: 8 Bytes non-decreasing number on success or 0 if the socket
   * field is missing inside sk_buff
+ *
+ * u32 bpf_get_socket_uid(skb)
+ * Get the owner uid of the socket stored inside sk_buff.
+ * @skb: pointer to skb
+ * Return: uid of the socket owner on success or 0 if the socket pointer
+ * inside sk_buff is NULL
   */
  #define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -510,7 +516,8 @@ union bpf_attr {
FN(skb_change_head),\
FN(xdp_adjust_head),\
FN(probe_read_str), \
-   FN(get_socket_cookie),
+   FN(get_socket_cookie),  \
+   FN(get_socket_uid),

  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
   * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 5b65ae3..a7c25c1 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2612,6 +2612,24 @@ static const struct bpf_func_proto bpf_get_socket_cookie_proto = {
.arg1_type  = ARG_PTR_TO_CTX,
  };

+BPF_CALL_1(bpf_get_socket_uid, struct sk_buff *, skb)
+{
+   kuid_t kuid;
+   struct sock *sk = sk_to_full_sk(skb->sk);


Minor nit, please change the order into:

struct sock *sk = sk_to_full_sk(skb->sk);
kuid_t kuid;


+   if (!sk || !sk_fullsock(sk))
+   return overflowuid;
+   kuid = sock_net_uid(sock_net(sk), sk);
+   return from_kuid_munged(current_user_ns(), kuid);
+}
+
+static const struct bpf_func_proto bpf_get_socket_uid_proto = {
+   .func   = bpf_get_socket_uid,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+};
+


Rest looks good, thanks.

Acked-by: Daniel Borkmann

Re: [PATCH net-next v5 3/3] A Sample of using socket cookie and uid for traffic monitoring

2017-03-20 Thread Daniel Borkmann


On 03/20/2017 07:41 PM, Chenbo Feng wrote:

From: Chenbo Feng 

Add a sample program to demonstrate the possible usage of the
get_socket_cookie and get_socket_uid helper functions. The program will
count bytes and packets of in/out traffic monitored by iptables
and store the stats in a bpf map on a per-socket basis. The owner uid of
the socket will be stored as part of the data entry. A shell script for
running the program is also included.

Signed-off-by: Chenbo Feng 

[...]

+int main(int argc, char *argv[])
+{
+   if (argc > 2) {
+   printf("Too many argument provided\n");
+   return 1;

[...]

+
+   return 0;
+}
+
+
+
+
+
+


Did these bits slip in accidentally?


diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index 3705fba..8ab36a0 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -135,6 +135,16 @@ struct bpf_insn;
.off   = OFF,   \
.imm   = 0 })


[...]

Re: [PATCH net-next v5 1/3] Add a helper function to get socket cookie in eBPF

2017-03-20 Thread Daniel Borkmann


On 03/20/2017 07:41 PM, Chenbo Feng wrote:

From: Chenbo Feng 

Retrieve the socket cookie generated by sock_gen_cookie() from a sk_buff
with a known socket. Generates a new cookie if one was not yet set. If
the socket pointer inside sk_buff is NULL, 0 is returned. The helper
function could be useful in monitoring per socket networking traffic
statistics and provide a unique socket identifier per namespace.

Signed-off-by: Chenbo Feng 


Acked-by: Daniel Borkmann

Re: [PATCH net 1/1] net: tcp: Permit user set TCP_MAXSEG to default value

2017-03-20 Thread Eric Dumazet

On Tue, 2017-03-21 at 05:30 +0800, f...@ikuai8.com wrote:
> From: Gao Feng 
> 
> When user_mss is zero, it means use the default value. But the current
> code doesn't permit the user to set TCP_MAXSEG to the default value.
> It returns -EINVAL when val is zero.
> 
> Signed-off-by: Gao Feng 
> ---
>  net/ipv4/tcp.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 1e319a5..dd5e8e2 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2470,7 +2470,8 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
>   /* Values greater than interface MTU won't take effect. However
>* at the point when this call is done we typically don't yet
>* know which interface is going to be used */
> - if (val < TCP_MIN_MSS || val > MAX_TCP_WINDOW) {
> + if (val < 0 || (val > 0 && val < TCP_MIN_MSS) ||
> + val > MAX_TCP_WINDOW) {
>   err = -EINVAL;
>   break;
>   }

This is a convoluted way to express that val == 0 is accepted ...

What about :

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1e319a525d51b0b603a5ccc5143381c752b9f2c7..7db78d72896ac7c4befba5966704ed18ecbac409 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2469,8 +2469,9 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
case TCP_MAXSEG:
/* Values greater than interface MTU won't take effect. However
 * at the point when this call is done we typically don't yet
-* know which interface is going to be used */
-   if (val < TCP_MIN_MSS || val > MAX_TCP_WINDOW) {
+* know which interface is going to be used.
+*/
+   if (val && (val < TCP_MIN_MSS || val > MAX_TCP_WINDOW)) {
err = -EINVAL;
break;
}
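
The accepted range in the suggestion above can be checked in isolation with
a small predicate. This is a sketch under assumptions: maxseg_val_ok() is a
hypothetical name, and the two constants are assumed to match TCP_MIN_MSS
and MAX_TCP_WINDOW in the kernel headers (written here as plain ints, where
the kernel defines them as unsigned).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values for TCP_MIN_MSS and MAX_TCP_WINDOW; illustrative only. */
#define SKETCH_TCP_MIN_MSS	88
#define SKETCH_MAX_TCP_WINDOW	32767

/* Returns true when 'val' is acceptable for TCP_MAXSEG:
 * 0 selects the default, anything else must lie in the valid range.
 */
static bool maxseg_val_ok(int val)
{
	if (val && (val < SKETCH_TCP_MIN_MSS || val > SKETCH_MAX_TCP_WINDOW))
		return false;

	return true;
}
```

With these assumed limits, assert(maxseg_val_ok(0)) and
assert(!maxseg_val_ok(87)) both hold: zero is let through as "use the
default", while every other out-of-range value is still rejected.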

[PATCH] enic: Store permanent MAC address during probe()

2017-03-20 Thread PJ Waskiewicz

From: PJ Waskiewicz 

The permanent MAC address is useful to store for things like ethtool,
and when bonding with modes such as active/passive or LACP.  This
follows the model of other Ethernet drivers, such as ixgbe.

This was verified on a C220 chassis with the Cisco VNIC Ethernet device.

Signed-off-by: PJ Waskiewicz 
---
 drivers/net/ethernet/cisco/enic/enic_main.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
index 4b87bee..8bb2114 100644
--- a/drivers/net/ethernet/cisco/enic/enic_main.c
+++ b/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -964,6 +964,16 @@ void enic_reset_addr_lists(struct enic *enic)
enic->flags = 0;
 }
 
+static int enic_set_perm_mac_addr(struct net_device *netdev, char *addr)
+{
+   if (!is_valid_ether_addr(addr) && !is_zero_ether_addr(addr))
+   return -EADDRNOTAVAIL;
+
+   memcpy(netdev->perm_addr, addr, netdev->addr_len);
+
+   return 0;
+}
+
 static int enic_set_mac_addr(struct net_device *netdev, char *addr)
 {
struct enic *enic = netdev_priv(netdev);
@@ -2872,6 +2882,14 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
goto err_out_dev_deinit;
}
 
+   /* Store off permanent MAC address
+*/
+   err = enic_set_perm_mac_addr(netdev, enic->mac_addr);
+   if (err) {
+   dev_err(dev, "Invalid MAC address, aborting\n");
+   goto err_out_dev_deinit;
+   }
+
enic->tx_coalesce_usecs = enic->config.intr_timer_usec;
/* rx coalesce time already got initialized. This gets used
 * if adaptive coal is turned off
-- 
2.10.2
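
The address check in enic_set_perm_mac_addr() can be exercised outside the
kernel with the sketch below. All names here (mac_is_zero(),
mac_is_multicast(), mac_is_valid(), store_perm_mac()) are hypothetical
userspace stand-ins; the sketch assumes that a "valid" Ethernet address is
one that is neither multicast nor all-zero, and it returns -1 where the
patch returns -EADDRNOTAVAIL.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define ETH_ALEN_SKETCH 6

static bool mac_is_zero(const unsigned char *addr)
{
	static const unsigned char zero[ETH_ALEN_SKETCH];

	return memcmp(addr, zero, ETH_ALEN_SKETCH) == 0;
}

static bool mac_is_multicast(const unsigned char *addr)
{
	return (addr[0] & 0x01) != 0;	/* group bit of the first octet */
}

/* Valid means usable as a station address: neither multicast nor all-zero. */
static bool mac_is_valid(const unsigned char *addr)
{
	return !mac_is_multicast(addr) && !mac_is_zero(addr);
}

/* Mirrors the check in the patch: reject only addresses that are invalid
 * and not all-zero; an all-zero address is copied through.
 */
static int store_perm_mac(unsigned char *perm, const unsigned char *addr)
{
	if (!mac_is_valid(addr) && !mac_is_zero(addr))
		return -1;

	memcpy(perm, addr, ETH_ALEN_SKETCH);

	return 0;
}
```

Under these assumptions the only addresses rejected are multicast ones
(including broadcast); a unicast address and the all-zero address are both
stored.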

Re: [net-next PATCH 1/2] net: Busy polling should ignore sender CPUs

2017-03-20 Thread Eric Dumazet

On Mon, 2017-03-20 at 14:48 -0700, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> This patch is a cleanup/fix for NAPI IDs following the changes that made it
> so that sender_cpu and napi_id were doing a better job of sharing the same
> location in the sk_buff.
> 
> One issue I found is that we weren't validating the napi_id as being valid
> before we started trying to setup the busy polling.  This change corrects
> that by using the MIN_NAPI_ID value that is now used in both allocating the
> NAPI IDs, as well as validating them.
> 
> Fixes: 52bd2d62ce675 ("net: better skb->sender_cpu and skb->napi_id cohabitation")
> Signed-off-by: Alexander Duyck 
> ---

This Fixes: tag seems not really needed here.

If really busy polling is attempted to a socket with a  napi id,
nothing bad happens. This fits the advisory model of busy polling...

Otherwise, your patch would be a candidate for net tree.

Also note that as soon as sk_can_busy_loop(sk) returns some status,
another cpu might already have changed sk->sk_napi_id to something else,
possibly with a  napi id again.

If your upcoming code depends on sk->sk_napi_id being verified, then
you need to read it once.
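
The "read it once" advice can be sketched with C11 atomics in userspace.
The names below (struct sock_sketch, SKETCH_MIN_NAPI_ID,
get_valid_napi_id()) are hypothetical; in kernel code the single load would
typically be done with READ_ONCE() rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_MIN_NAPI_ID 9u	/* stand-in for NR_CPUS + 1 with 8 CPUs */

struct sock_sketch {
	_Atomic unsigned int napi_id;	/* may be rewritten by another thread */
};

/* Load the ID exactly once, validate the local copy, and hand the same
 * copy back to the caller so later code cannot observe a different value.
 */
static bool get_valid_napi_id(struct sock_sketch *sk, unsigned int *id_out)
{
	unsigned int id = atomic_load_explicit(&sk->napi_id,
					       memory_order_relaxed);

	if (id < SKETCH_MIN_NAPI_ID)
		return false;

	*id_out = id;

	return true;
}
```

The design point is that the value that was validated and the value that is
later used are the same local copy; checking the field and then reading it
again would reopen the window described above.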

Re: [PATCH v2 03/20] ARM: sun8i: dt: Add DT bindings documentation for Allwinner dwmac-sun8i

2017-03-20 Thread Rob Herring

On Tue, Mar 14, 2017 at 03:18:39PM +0100, Corentin Labbe wrote:
> This patch adds documentation for Device-Tree bindings for the
> Allwinner dwmac-sun8i driver.

"dt-bindings: net: ..." is the preferred subject prefix if you respin.
 
> 
> Signed-off-by: Corentin Labbe 
> ---
>  .../devicetree/bindings/net/dwmac-sun8i.txt| 77 
> ++
>  1 file changed, 77 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/dwmac-sun8i.txt

Acked-by: Rob Herring

Re: [PATCH net-next 10/12] dt-bindings: net: document bcmgenet WoL interrupt

2017-03-20 Thread Rob Herring

On Mon, Mar 13, 2017 at 05:41:40PM -0700, Doug Berger wrote:
> A third interrupt cell can be provided to optionally specify
> the interrupt used for handling Wake on LAN events.
> 
> Typically the wake up handling uses a separate interrupt
> controller, so the interrupts-extended property is used to
> accommodate this.

Using interrupts vs. interrupts-extended is beyond the scope of the 
binding doc. IOW, documenting interrupts is enough and using
interrupts-extended is allowed.

> 
> Signed-off-by: Doug Berger 
> ---
>  Documentation/devicetree/bindings/net/brcm,bcmgenet.txt | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/devicetree/bindings/net/brcm,bcmgenet.txt b/Documentation/devicetree/bindings/net/brcm,bcmgenet.txt
> index 10587bdadbbe..01a70463cbc5 100644
> --- a/Documentation/devicetree/bindings/net/brcm,bcmgenet.txt
> +++ b/Documentation/devicetree/bindings/net/brcm,bcmgenet.txt
> @@ -4,9 +4,12 @@ Required properties:
>  - compatible: should contain one of "brcm,genet-v1", "brcm,genet-v2",
>"brcm,genet-v3", "brcm,genet-v4".
>  - reg: address and length of the register set for the device
> -- interrupts: must be two cells, the first cell is the general purpose
> -  interrupt line, while the second cell is the interrupt for the ring
> -  RX and TX queues operating in ring mode
> +- interrupts and/or interrupts-extended: must be two cells, the first cell
> +  is the general purpose interrupt line, while the second cell is the
> +  interrupt for the ring RX and TX queues operating in ring mode.  An
> +  optional third interrupt cell for Wake-on-LAN can be specified.

Does the generic wakeup source work for this?

> +  See Documentation/devicetree/bindings/interrupt-controller/interrupts.txt
> +  for information on the property specifics.
>  - phy-mode: see ethernet.txt file in the same directory
>  - #address-cells: should be 1
>  - #size-cells: should be 1
> -- 
> 2.11.1
>

Re: [PATCH net-next] stmmac: call stmmac_init_phy from stmmac_dvr_probe

2017-03-20 Thread Florian Fainelli

On 03/20/2017 11:34 AM, Niklas Cassel wrote:
> On 03/20/2017 06:43 PM, Florian Fainelli wrote:
>> On 03/20/2017 10:29 AM, Niklas Cassel wrote:
>>> From: Niklas Cassel 
>>>
>>> It is usually possible to do
>>> ethtool -s autoneg on
>>> so that you trigger an autoneg before calling
>>> ip link set dev eth0 up
>> This is completely driver specific and there is no guarantee for this to
>> work universally across all device drivers because when your interface
>> is brought down, the most sensible thing to expect in return is that
>> your PHY is powered down (unless your interface participates in
>> Wake-on-LAN).
>>
>>> However, stmmac returns -EBUSY if !netif_running.
>>> The only reason for this appears to be that stmmac_init_phy
>>> is called from stmmac_open instead of from stmmac_dvr_probe.
>>>
>>> Move stmmac_init_phy to stmmac_dvr_probe so that ethtool
>>> works as soon as register_netdev has been called.
>>> stmmac_check_ether_addr was also moved to probe,
>>> so that the ordering doesn't change.
>> Are you sure this is a good idea? There are many drivers that moved the
>> PHY probe into ndo_open() for mainly two things:
>>
>> - phy_connect() starts the PHY state machine and starting the state
>> machine without a network device running is kind of wasting cycles
>>
>> - if the interface is probed, but not used, you are keeping the Ethernet
>> link running without being able to service packets, which is at best a
>> waste of power
> 
> Hello Florian
> 
> Thank you for your input.
> I can see the point in keeping phy_connect in ndo_open.
> 
> What I dislike is the -EBUSY from stmmac_ethtool_get_link_ksettings,
> since this will create warnings in user space by our favorite monolith.
> (Please don't flame me, I dislike it as much as you guys.)
> 
> [ WARNING ] systemd-udevd[236]: link_config: could not get ethtool features for eth0
> [ WARNING ] systemd-udevd[236]: Could not set offload features of eth0: Device or resource busy

Then let's silence the warning, because it's not really helpful in the
first place.

> 
> 
> However, it is kind of sad that drivers are so inconsistent of what goes
> in probe and what goes in ndo_open...which is tied together with the
> whole mess of when certain ethtool commands work or do not work.

Well, inconsistent here is kind of a big statement; what I meant to say is
that your proposed change actually makes things inconsistent.

> 
> Do you know of a good way to avoid the -EBUSY in 
> stmmac_ethtool_get_link_ksettings,
> but still keep phy_connect in ndo_open?

Let me rephrase this: doing an ethtool operation on an interface that is
not open is not a defined behavior which adheres to a contract with the
kernel. ethtool operations on interfaces that are DOWN, may succeed if
they do not involve a piece of hardware that needs to be active (e.g:
the PHY) but there is no guarantee that a) the change is immediately
applied, or b) that the results are consistent.

The best thing to do IMHO is just silence the warning, return an error
code, and come back to configure the interface at a later time, when it is
guaranteed to be operational.

> The current code checks netif_running(), which checks __LINK_STATE_START,
> which gets set by __dev_open().
> stmmac_ethtool_get_link_ksettings also returns -ENODEV if ndev->phydev == NULL.
> 


-- 
Florian

Re: [PATCH 1/7] Documentation: dt: net: Update the ath9k binding for SoC devices

2017-03-20 Thread Rob Herring

On Mon, Mar 13, 2017 at 10:05:09PM +0100, Alban wrote:
> The current binding only covers PCI devices, so extend it for SoC devices.
> 
> Most SoC platforms use an MTD partition for the calibration data
> instead of an EEPROM. The qca,no-eeprom property was added to allow
> loading the EEPROM content using firmware loading. This new binding
> replaces this hack with NVMEM cells, so we also mark the qca,no-eeprom
> property as deprecated in case anyone ever used it.
> 
> Signed-off-by: Alban 
> ---
>  .../devicetree/bindings/net/wireless/qca,ath9k.txt | 41 
> --
>  1 file changed, 38 insertions(+), 3 deletions(-)

For the subject, "dt-bindings: net: ..." and one nit below, otherwise:

Acked-by: Rob Herring 

> 
> diff --git a/Documentation/devicetree/bindings/net/wireless/qca,ath9k.txt b/Documentation/devicetree/bindings/net/wireless/qca,ath9k.txt
> index b7396c8..61f5f6d 100644
> --- a/Documentation/devicetree/bindings/net/wireless/qca,ath9k.txt
> +++ b/Documentation/devicetree/bindings/net/wireless/qca,ath9k.txt
> @@ -27,16 +27,34 @@ Required properties:
>   - 0034 for AR9462
>   - 0036 for AR9565
>   - 0037 for AR9485
> + For SoC devices the compatible should be "qca,-wmac"
> + and one of the following fallbacks:
> + - "qca,ar9100-wmac"
> + - "qca,ar9330-wmac"
> + - "qca,ar9340-wmac"
> + - "qca,qca9550-wmac"
> + - "qca,qca9530-wmac"
>  - reg: Address and length of the register set for the device.
>  
> +Required properties for SoC devices:
> +- interrupt-parent: phandle of the parent interrupt controller.
> +- interrupts: Interrupt specifier for the controllers interrupt.
> +
>  Optional properties:
> +- mac-address: See ethernet.txt in the parent directory
> +- local-mac-address: See ethernet.txt in the parent directory
> +- clock-names: has to be "ref"
> +- clocks: phandle of the reference clock
> +- resets: phandle of the reset line
> +- nvmem-cell-names: has to be "eeprom" and/or "address"
> +- nvmem-cells: phandle to the eeprom nvmem cell and/or to the mac address
> + nvmem cell.
> +
> +Deprecated properties:
>  - qca,no-eeprom: Indicates that there is no physical EEPROM connected to the
>   ath9k wireless chip (in this case the calibration /
>   EEPROM data will be loaded from userspace using the
>   kernel firmware loader).
> -- mac-address: See ethernet.txt in the parent directory
> -- local-mac-address: See ethernet.txt in the parent directory
> -
>  
>  In this example, the node is defined as child node of the PCI controller:
>   {
> @@ -46,3 +64,20 @@ In this example, the node is defined as child node of the PCI controller:
>   qca,no-eeprom;
>   };
>  };
> +
> +In this example it is defined as a SoC device:
> + wmac@180c {

wifi@...

> + compatible = "qca,ar9132-wmac", "qca,ar9100-wmac";
> + reg = <0x180c 0x3>;
> +
> + interrupt-parent = <>;
> + interrupts = <2>;
> +
> + clock-names = "ref";
> + clocks = <>;
> +
> + nvmem-cell-names = "eeprom", "address";
> + nvmem-cells = <_eeprom>, <_address>;
> +
> + resets = < 22>;
> + };
> -- 
> 2.7.4
>

Re: [net-next PATCH 2/2] tcp: Record Rx hash and NAPI ID in tcp_child_process

2017-03-20 Thread Eric Dumazet

On Mon, Mar 20, 2017 at 2:48 PM, Alexander Duyck wrote:
> From: Alexander Duyck 
>
> While working on some recent busy poll changes we found that child sockets
> were being instantiated without NAPI ID being set.  In our first attempt to
> fix it, it was suggested that we should just pull programming the NAPI ID
> into the function itself since all callers will need to have it set.
>
> In addition to NAPI ID I have decided to also pull in populating the Rx
> hash since it likely has the same problem as NAPI ID but just doesn't have
> the visibility.

It looks like Rx hash was initialized elsewhere
(tcp_get_cookie_sock() & tcp_check_req())

So this probably could be cleaned up, if done at the proper place ;)

Re: [PATCH net v2] net: Do not hold the reference for the same sk_rx_dst

2017-03-20 Thread Kaiwen Xu

On Sun, Mar 19, 2017 at 09:09:38PM -0700, Cong Wang wrote:
> On Sat, Mar 18, 2017 at 9:03 PM, Kaiwen Xu  wrote:
> > On Sat, Mar 18, 2017 at 08:49:43PM -0700, Cong Wang wrote:
> >> On Sat, Mar 18, 2017 at 6:48 PM, Kevin Xu  wrote:
> >> > In some rare cases, inet_sk_rx_dst_set() may be called multiple times
> >> > on the same dst, causing double refcounting. Eventually, it
> >> > prevents net_device to be destroyed. The bug manifested as
> >> >
> >> > unregister_netdevice: waiting for lo to become free. Usage count = 1
> >> >
> >> > in the kernel log, preventing new network namespace creation.
> >> >
> >> > Signed-off-by: Kevin Xu 
> >>
> >> Don't know why you don't follow the discussion on your v1...
> >>
> >> It is protected by bh_lock_sock(), so your patch is not needed
> >> at all.
> >>
> >> Read net/ipv4/udp.c:
> >>
> >> 1762 /* For TCP sockets, sk_rx_dst is protected by socket lock
> >> 1763  * For UDP, we use xchg() to guard against concurrent changes.
> >> 1764  */
> >
> > I probably misunderstood. Do you mean v2 patch is actually not needed or
> > the whole workaround is not necessary?
> 
> Your patch, no matter v1 or v2, is not needed because we use
> bh_lock_sock() to serialize inet_sk_rx_dst_set(), unless you find
> a case where we miss the bh_lock_sock(), but you don't say it in
> your changelog. "some rare cases" is not enough to justify this bug.

I see, thanks for your explanation! I will try to dig in more to see if
I can find the root cause.

linux-next-20170320 breaks stmmac on meson (Amlogic S905GXBB)

2017-03-20 Thread Heiner Kallweit

As reported by Corentin Labbe before:
stmmac in the latest next kernel is broken also on meson8b.

The following commit seems to create the trouble:
6deee2221e11 "net: stmmac: prepare dma op mode config for multiple queues"

I also get queue timeout errors.

Rgds, Heiner

[net-next PATCH 0/2] NAPI ID fixups related to busy polling

2017-03-20 Thread Alexander Duyck

These two patches are a couple of minor clean-ups related to busy polling.
The first one addresses the fact that we were trying to busy poll on
sender_cpu values instead of true NAPI IDs.  The second addresses the fact
that there were a few paths where TCP sockets were being instantiated based
on a received packet, but not recording the hash or NAPI ID of the packet
that was used to instantiate them.

---

Alexander Duyck (2):
  net: Busy polling should ignore sender CPUs
  tcp: Record Rx hash and NAPI ID in tcp_child_process


 include/net/busy_poll.h  |   11 +--
 net/core/dev.c   |6 +++---
 net/ipv4/tcp_ipv4.c  |2 --
 net/ipv4/tcp_minisocks.c |5 +
 net/ipv6/tcp_ipv6.c  |2 --
 5 files changed, 17 insertions(+), 9 deletions(-)

--

[net-next PATCH 1/2] net: Busy polling should ignore sender CPUs

2017-03-20 Thread Alexander Duyck

From: Alexander Duyck 

This patch is a cleanup/fix for NAPI IDs following the changes that made it
so that sender_cpu and napi_id were doing a better job of sharing the same
location in the sk_buff.

One issue I found is that we weren't validating the napi_id as being valid
before we started trying to setup the busy polling.  This change corrects
that by using the MIN_NAPI_ID value that is now used in both allocating the
NAPI IDs, as well as validating them.

Fixes: 52bd2d62ce675 ("net: better skb->sender_cpu and skb->napi_id cohabitation")
Signed-off-by: Alexander Duyck 
---
 include/net/busy_poll.h |   11 +--
 net/core/dev.c  |6 +++---
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index c0452de83086..edf1310212a1 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -35,6 +35,12 @@
 extern unsigned int sysctl_net_busy_read __read_mostly;
 extern unsigned int sysctl_net_busy_poll __read_mostly;
 
+/* 0 - Reserved to indicate value not set
+ * 1..NR_CPUS - Reserved for sender_cpu
+ *  NR_CPUS+1..~0 - Region available for NAPI IDs
+ */
+#define MIN_NAPI_ID ((unsigned int)(NR_CPUS + 1))
+
 static inline bool net_busy_loop_on(void)
 {
return sysctl_net_busy_poll;
@@ -58,10 +64,11 @@ static inline unsigned long busy_loop_end_time(void)
 
 static inline bool sk_can_busy_loop(const struct sock *sk)
 {
-   return sk->sk_ll_usec && sk->sk_napi_id && !signal_pending(current);
+   return sk->sk_ll_usec &&
+  (sk->sk_napi_id >= MIN_NAPI_ID) &&
+  !signal_pending(current);
 }
 
-
 static inline bool busy_loop_timeout(unsigned long end_time)
 {
unsigned long now = busy_loop_us_clock();
diff --git a/net/core/dev.c b/net/core/dev.c
index 7869ae3837ca..5bbe30c08a5b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5143,10 +5143,10 @@ static void napi_hash_add(struct napi_struct *napi)
 
spin_lock(&napi_hash_lock);
 
-   /* 0..NR_CPUS+1 range is reserved for sender_cpu use */
+   /* 0..NR_CPUS range is reserved for sender_cpu use */
do {
-   if (unlikely(++napi_gen_id < NR_CPUS + 1))
-   napi_gen_id = NR_CPUS + 1;
+   if (unlikely(++napi_gen_id < MIN_NAPI_ID))
+   napi_gen_id = MIN_NAPI_ID;
} while (napi_by_id(napi_gen_id));
napi->napi_id = napi_gen_id;
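
The ID generation loop in this hunk can be tried out in userspace with the
sketch below. It assumes 8 CPUs and uses hypothetical names (gen_id,
used_ids, id_in_use(), alloc_napi_id()); the small array stands in for the
kernel's hash lookup via napi_by_id(), and no locking is modelled.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NR_CPUS		8u
#define SKETCH_MIN_NAPI_ID	(SKETCH_NR_CPUS + 1u)
#define MAX_TRACKED		16u

static unsigned int gen_id;		/* last ID handed out; 0 means none yet */
static unsigned int used_ids[MAX_TRACKED];
static unsigned int used_count;

/* Stand-in for the hash lookup: is this ID already taken? */
static bool id_in_use(unsigned int id)
{
	for (unsigned int i = 0; i < used_count; i++)
		if (used_ids[i] == id)
			return true;

	return false;
}

static unsigned int alloc_napi_id(void)
{
	do {
		/* Unsigned wraparound lands in the reserved range; skip it. */
		if (++gen_id < SKETCH_MIN_NAPI_ID)
			gen_id = SKETCH_MIN_NAPI_ID;
	} while (id_in_use(gen_id));

	assert(used_count < MAX_TRACKED);
	used_ids[used_count++] = gen_id;

	return gen_id;
}
```

Every ID this sketch hands out is at least SKETCH_MIN_NAPI_ID, so a consumer
that checks "id >= MIN_NAPI_ID" can tell a real NAPI ID apart from a
sender_cpu value stored in the same field.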

[net-next PATCH 2/2] tcp: Record Rx hash and NAPI ID in tcp_child_process

2017-03-20 Thread Alexander Duyck

From: Alexander Duyck 

While working on some recent busy poll changes we found that child sockets
were being instantiated without NAPI ID being set.  In our first attempt to
fix it, it was suggested that we should just pull programming the NAPI ID
into the function itself since all callers will need to have it set.

In addition to NAPI ID I have decided to also pull in populating the Rx
hash since it likely has the same problem as NAPI ID but just doesn't have
the visibility.

Reported-by: Sridhar Samudrala 
Suggested-by: Eric Dumazet 
Signed-off-by: Alexander Duyck 
---
 net/ipv4/tcp_ipv4.c  |2 --
 net/ipv4/tcp_minisocks.c |5 +
 net/ipv6/tcp_ipv6.c  |2 --
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 7482b5d11861..20cbd2f07f28 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1409,8 +1409,6 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
if (!nsk)
goto discard;
if (nsk != sk) {
-   sock_rps_save_rxhash(nsk, skb);
-   sk_mark_napi_id(nsk, skb);
if (tcp_child_process(sk, nsk, skb)) {
rsk = nsk;
goto reset;
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 692f974e5abe..245b63856c04 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 
 int sysctl_tcp_abort_on_overflow __read_mostly;
 
@@ -798,6 +799,10 @@ int tcp_child_process(struct sock *parent, struct sock *child,
int ret = 0;
int state = child->sk_state;
 
+   /* record Rx hash and NAPI ID of child */
+   sock_rps_save_rxhash(child, skb);
+   sk_mark_napi_id(child, skb);
+
tcp_segs_in(tcp_sk(child), skb);
if (!sock_owned_by_user(child)) {
ret = tcp_rcv_state_process(child, skb);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 0f08d718a002..ee13e380c0dd 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1293,8 +1293,6 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
goto discard;
 
if (nsk != sk) {
-   sock_rps_save_rxhash(nsk, skb);
-   sk_mark_napi_id(nsk, skb);
if (tcp_child_process(sk, nsk, skb))
goto reset;
if (opt_skb)

Re: [PATCH] net: veth: use new api ethtool_{get|set}_link_ksettings

2017-03-20 Thread Philippe Reynes

Hi Xin Long,

On 3/20/17, Xin Long  wrote:
> On Sat, Mar 18, 2017 at 7:13 PM, Philippe Reynes  wrote:
>> The ethtool api {get|set}_settings is deprecated.
>> We move this driver to new api {get|set}_link_ksettings.
>>
>> Signed-off-by: Philippe Reynes 
>> ---
>>  drivers/net/veth.c |   22 ++
>>  1 files changed, 10 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> index 8c39d6d..730b133 100644
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -45,18 +45,16 @@ struct veth_priv {
>> { "peer_ifindex" },
>>  };
>>
>> -static int veth_get_settings(struct net_device *dev, struct ethtool_cmd
>> *cmd)
>> +static int veth_get_link_ksettings(struct net_device *dev,
>> +  struct ethtool_link_ksettings *cmd)
>>  {
>> -   cmd->supported  = 0;
>> -   cmd->advertising= 0;
>> -   ethtool_cmd_speed_set(cmd, SPEED_10000);
>> -   cmd->duplex = DUPLEX_FULL;
>> -   cmd->port   = PORT_TP;
>> -   cmd->phy_address= 0;
>> -   cmd->transceiver= XCVR_INTERNAL;
>> -   cmd->autoneg= AUTONEG_DISABLE;
>> -   cmd->maxtxpkt   = 0;
>> -   cmd->maxrxpkt   = 0;
>> +   ethtool_link_ksettings_zero_link_mode(cmd, supported);
>> +   ethtool_link_ksettings_zero_link_mode(cmd, advertising);
>> +   cmd->base.speed = SPEED_10000;
>> +   cmd->base.duplex= DUPLEX_FULL;
>> +   cmd->base.port  = PORT_TP;
>> +   cmd->base.phy_address   = 0;
> It seem always:
> memset(&link_ksettings, 0, sizeof(link_ksettings));
> err = dev->ethtool_ops->get_link_ksettings(dev, &link_ksettings);

You're right.

> do we really need:
>ethtool_link_ksettings_zero_link_mode(cmd, supported);
>ethtool_link_ksettings_zero_link_mode(cmd, advertising);
>cmd->base.phy_address   = 0;
> ?

As this patch is just an API change, I prefer to keep this code.
Of course, if David prefers to remove it, I'll remove it.

But you're right, there are a lot of "get_link_ksettings" functions
that set some fields to 0. That is useless, because the core already
zeroes the structure just before calling the callback.



Regards,
Philippe
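
The point about redundant zeroing can be illustrated with a small userspace sketch. Everything here is hypothetical (struct link_settings, core_get_settings(), demo_get_settings() and DEMO_SPEED are made-up names, not the ethtool API); it only models a caller that clears the structure before invoking the driver callback.

```c
#include <string.h>

#define DEMO_SPEED 10000u	/* arbitrary value for this sketch */

/* Hypothetical, much reduced stand-in for the settings structure. */
struct link_settings {
	unsigned int speed;
	unsigned char duplex;
	unsigned char phy_address;
	unsigned long supported;
	unsigned long advertising;
};

typedef int (*get_settings_fn)(struct link_settings *s);

/* Models the core: it zeroes the structure, then calls the driver. */
static int core_get_settings(get_settings_fn cb, struct link_settings *out)
{
	memset(out, 0, sizeof(*out));
	return cb(out);
}

/* Models a driver callback that only fills in the non-zero fields. */
static int demo_get_settings(struct link_settings *s)
{
	s->speed = DEMO_SPEED;
	s->duplex = 1;
	return 0;
}
```

Because the caller clears the structure first, phy_address, supported and advertising come back as 0 even though the callback never touches them.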

Re: mlx5e backports for v4.9 -stable

2017-03-20 Thread Saeed Mahameed



On 03/17/2017 02:06 AM, David Miller wrote:
> 
> Commits:
> 
> 
> From b0d4660b4cc52e6477ca3a43435351d565dfcedc Mon Sep 17 00:00:00 2001
> From: Tariq Toukan 
> Date: Wed, 22 Feb 2017 17:20:14 +0200
> Subject: [PATCH] net/mlx5e: Fix broken CQE compression initialization
> 
> 
> and
> 
> 
> From 6dc4b54e77282caf17f0ff72aa32dd296037fbc0 Mon Sep 17 00:00:00 2001
> From: Saeed Mahameed 
> Date: Wed, 22 Feb 2017 17:20:15 +0200
> Subject: [PATCH] net/mlx5e: Update MPWQE stride size when modifying CQE
>  compress state
> 
> 
> do not apply even closely to v4.9 while I was working on -stable backports.
> 
> Please provide proper backports of these two patches if you want them to
> show up in v4.9 -stable.
> 

Hi Dave,

thank you for trying. We will provide the patches, but I don't know what the
right procedure is.

Is it OK to post the patches applied on top of tag v4.9.16 of
kernel/git/stable/linux-stable.git?
To whom should I send them?

thanks,
Saeed.

[PATCH net 1/1] net: tcp: Permit user set TCP_MAXSEG to default value

2017-03-20 Thread fgao

From: Gao Feng 

When user_mss is zero, it means the default value is used. But the current
code doesn't permit the user to set TCP_MAXSEG back to the default value:
it returns -EINVAL when val is zero.

Signed-off-by: Gao Feng 
---
 net/ipv4/tcp.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1e319a5..dd5e8e2 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2470,7 +2470,8 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
/* Values greater than interface MTU won't take effect. However
 * at the point when this call is done we typically don't yet
 * know which interface is going to be used */
-   if (val < TCP_MIN_MSS || val > MAX_TCP_WINDOW) {
+   if (val < 0 || (val > 0 && val < TCP_MIN_MSS) ||
+   val > MAX_TCP_WINDOW) {
err = -EINVAL;
break;
}
-- 
1.9.1
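
The range check after this change can be tried out in userspace with the sketch below. check_user_mss() is a hypothetical helper, not a kernel function, and the two constants are the values I believe the kernel headers used at the time (88 and 32767); treat them as assumptions.

```c
#include <errno.h>

/* Assumed values of the kernel constants; not taken from the patch. */
#define TCP_MIN_MSS 88
#define MAX_TCP_WINDOW 32767

/* Hypothetical helper with the same condition as the patched check:
 * 0 selects the default, otherwise val must lie in
 * [TCP_MIN_MSS, MAX_TCP_WINDOW]. */
static int check_user_mss(int val)
{
	if (val < 0 || (val > 0 && val < TCP_MIN_MSS) ||
	    val > MAX_TCP_WINDOW)
		return -EINVAL;
	return 0;
}
```

The only behavioural difference from the old condition is that val == 0 is now accepted; every other value is treated as before.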

Re: [net-next sample action optimization v3 4/4] Openvswitch: Refactor sample and recirc actions implementation

2017-03-20 Thread Andy Zhou

On Sat, Mar 18, 2017 at 12:22 PM, Pravin Shelar  wrote:
> On Thu, Mar 16, 2017 at 3:48 PM, Andy Zhou  wrote:
>> Added clone_execute() that both the sample and the recirc
>> action implementation can use.
>>
>> Signed-off-by: Andy Zhou 
>> ---
>>  net/openvswitch/actions.c | 175 
>> --
>>  1 file changed, 92 insertions(+), 83 deletions(-)
>>
>> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
>> index 3529f7b..e38fa7f 100644
>> --- a/net/openvswitch/actions.c
>> +++ b/net/openvswitch/actions.c
>> @@ -44,10 +44,6 @@
>>  #include "conntrack.h"
>>  #include "vport.h"
>>
>> -static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
>> - struct sw_flow_key *key,
>> - const struct nlattr *attr, int len);
>> -
>>  struct deferred_action {
>> struct sk_buff *skb;
>> const struct nlattr *actions;
>> @@ -166,6 +162,12 @@ static bool is_flow_key_valid(const struct sw_flow_key 
>> *key)
>> return !(key->mac_proto & SW_FLOW_KEY_INVALID);
>>  }
>>
>> +static int clone_execute(struct datapath *dp, struct sk_buff *skb,
>> +struct sw_flow_key *key,
>> +u32 recirc_id,
>> +const struct nlattr *actions, int len,
>> +bool last, bool clone_flow_key);
>> +
> With this function the diff stat looks much better.
>
> ...
> ...
>> +/* Execute the actions on the clone of the packet. The effect of the
>> + * execution does not affect the original 'skb' nor the original 'key'.
>> + *
>> + * The execution may be deferred in case the actions can not be executed
>> + * immediately.
>> + */
>> +static int clone_execute(struct datapath *dp, struct sk_buff *skb,
>> +struct sw_flow_key *key, u32 recirc_id,
>> +const struct nlattr *actions, int len,
>> +bool last, bool clone_flow_key)
>> +{
>> +   bool is_sample = actions;
> Standard practice is use !! to convert pointer to boolean.
> I think this function does not need to know about sample action. So we
> can rename the boolean to have_actions or something similar.
O.K.  We can just check for actions pointer.

However, it will be obvious we are actually checking for sample or recirc
action. I will add some comments.

>
>> +   struct deferred_action *da;
>> +   struct sw_flow_key *clone;
>> +   int err = 0;
>> +
>> +   skb = last ? skb : skb_clone(skb, GFP_ATOMIC);
>> +   if (!skb) {
>> +   /* Out of memory, skip this action.
>> +*/
>> +   return 0;
>> +   }
>> +
>> +   /* In case the sample actions won't change the 'key',
>> +* current key can be used directly to execute sample actions.
>> +* Otherwise, allocate a new key from the
>> +* next recursion level of 'flow_keys'. If
>> +* successful, execute the sample actions without
>> +* deferring.
>> +*/
>> +   if (is_sample && clone_flow_key)
>> +   __this_cpu_inc(exec_actions_level);
>> +
> There is no need to increment actions level up here. it is only
> required for do_execute_actions(). ovs_dp_process_packet() already
> does it.

Right, that's why it is only done for 'is_sample'. There is a bug though:
the 'inc' needs to be done after the clone.  I will rearrange the code so that
the 'inc' sits directly above do_execute_actions().
>
>
>> +   clone = clone_flow_key ? clone_key(key) : key;
>> +   if (clone) {
>> +   if (is_sample) {
>> +   err = do_execute_actions(dp, skb, clone,
>> +actions, len);
>> +   } else {
>> +   clone->recirc_id = recirc_id;
>> +   ovs_dp_process_packet(skb, clone);
>> +   }
>> +   }
> wont this execute action twice, once here and again in deferred actions list?

Right, the return is missing for the if (clone) case.  I will post v4
soon. Thanks for the review
and comments.
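
As a rough illustration of the control flow being discussed (run immediately when a key is available, otherwise defer, otherwise drop), here is a standalone sketch. It is not the OVS code: clone_execute_sketch() and its boolean inputs are hypothetical stand-ins for clone_key() and add_deferred_actions() succeeding or failing.

```c
#include <stdbool.h>

enum clone_result { RAN_NOW, DEFERRED, DROPPED };

/* Hypothetical stand-in for clone_execute(): have_key models clone_key()
 * succeeding, queue_has_room models add_deferred_actions() succeeding.
 * *executions counts how often the actions end up being run. */
static enum clone_result clone_execute_sketch(bool have_key,
					      bool queue_has_room,
					      int *executions)
{
	if (have_key) {
		(*executions)++;	/* run the actions right away */
		return RAN_NOW;		/* the early return that was missing */
	}
	if (queue_has_room) {
		(*executions)++;	/* models the single later run */
		return DEFERRED;
	}
	return DROPPED;			/* no key space, no queue space */
}
```

Without the early return, the first case would fall through to the deferral path and the same actions would be queued again.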
>
>> +
>> +   if (is_sample && clone_flow_key)
>> +   __this_cpu_dec(exec_actions_level);
>> +
>> +   /* Out of 'flow_keys' space. Defer them */
>> +   da = add_deferred_actions(skb, key, actions, len);
>> +   if (da) {
>> +   if (!is_sample) {
>> +   key = &da->pkt_key;
>> +   key->recirc_id = recirc_id;
>> +   }
>> +   } else {
>> +   /* Drop the SKB and log an error. */
>> +   kfree_skb(skb);
>> +
>> +   if (net_ratelimit()) {
>> +   if (is_sample) {
>> +   pr_warn("%s: deferred action limit reached, 
>> drop sample action\n",
>> +   ovs_dp_name(dp));
>> +   } else {
>> +

Re: [PATCH net-next 0/2] netvsc: performance regressions in net-next

2017-03-20 Thread Stephen Hemminger

On Mon, 20 Mar 2017 13:28:03 -0700
Stephen Hemminger  wrote:

> Fix for performance regression introduced with NAPI change;
> and followup cleanup of vmbus core code.
> 
> Stephen Hemminger (2):
>   netvsc: fix NAPI performance regression
>   vmbus: no longer expose iterator internals
> 
>  drivers/hv/ring_buffer.c| 51 
> +
>  drivers/net/hyperv/hyperv_net.h |  1 +
>  drivers/net/hyperv/netvsc.c | 41 +++--
>  include/linux/hyperv.h  | 22 +-
>  4 files changed, 46 insertions(+), 69 deletions(-)
> 

Drop this version.  The code is correct, but the second patch causes
clashes with some later patches. Will resend.

[PATCH net-next 2/2] vmbus: no longer expose iterator internals

2017-03-20 Thread Stephen Hemminger

Since NAPI no longer needs to jump out of iterator early,
no need to expose the internal iterator steps.

Signed-off-by: Stephen Hemminger 
---
 drivers/hv/ring_buffer.c | 51 
 include/linux/hyperv.h   | 22 +
 2 files changed, 27 insertions(+), 46 deletions(-)

diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
index c3f1a9e33cef..280e2010913f 100644
--- a/drivers/hv/ring_buffer.c
+++ b/drivers/hv/ring_buffer.c
@@ -458,13 +458,31 @@ struct vmpacket_descriptor *hv_pkt_iter_first(struct vmbus_channel *channel)
 EXPORT_SYMBOL_GPL(hv_pkt_iter_first);
 
 /*
+ * Update host ring buffer after iterating over packets.
+ */
+static void hv_pkt_iter_close(struct vmbus_channel *channel)
+{
+   struct hv_ring_buffer_info *rbi = &channel->inbound;
+
+   /*
+* Make sure all reads are done before we update the read index since
+* the writer may start writing to the read area once the read index
+* is updated.
+*/
+   virt_rmb();
+   rbi->ring_buffer->read_index = rbi->priv_read_index;
+
+   hv_signal_on_read(channel);
+}
+
+/*
  * Get next vmbus packet from ring buffer.
  *
  * Advances the current location (priv_read_index) and checks for more
- * data. If the end of the ring buffer is reached, then return NULL.
+ * data. If at end of list, return NULL and update host.
  */
 struct vmpacket_descriptor *
-__hv_pkt_iter_next(struct vmbus_channel *channel,
+hv_pkt_iter_next(struct vmbus_channel *channel,
   const struct vmpacket_descriptor *desc)
 {
struct hv_ring_buffer_info *rbi = &channel->inbound;
@@ -476,29 +494,12 @@ __hv_pkt_iter_next(struct vmbus_channel *channel,
if (rbi->priv_read_index >= dsize)
rbi->priv_read_index -= dsize;
 
-   /* more data? */
-   if (hv_pkt_iter_avail(rbi) < sizeof(struct vmpacket_descriptor))
+   /* if no more data? */
+   if (hv_pkt_iter_avail(rbi) < sizeof(struct vmpacket_descriptor)) {
+   hv_pkt_iter_close(channel);
return NULL;
-   else
-   return hv_get_ring_buffer(rbi) + rbi->priv_read_index;
-}
-EXPORT_SYMBOL_GPL(__hv_pkt_iter_next);
-
-/*
- * Update host ring buffer after iterating over packets.
- */
-void hv_pkt_iter_close(struct vmbus_channel *channel)
-{
-   struct hv_ring_buffer_info *rbi = &channel->inbound;
-
-   /*
-* Make sure all reads are done before we update the read index since
-* the writer may start writing to the read area once the read index
-* is updated.
-*/
-   virt_rmb();
-   rbi->ring_buffer->read_index = rbi->priv_read_index;
+   }
 
-   hv_signal_on_read(channel);
+   return hv_get_ring_buffer(rbi) + rbi->priv_read_index;
 }
-EXPORT_SYMBOL_GPL(hv_pkt_iter_close);
+EXPORT_SYMBOL_GPL(hv_pkt_iter_next);
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 36162485d663..7df6ab5b3067 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1550,32 +1550,12 @@ static inline u32 hv_pkt_datalen(const struct vmpacket_descriptor *desc)
return (desc->len8 << 3) - (desc->offset8 << 3);
 }
 
-
 struct vmpacket_descriptor *
 hv_pkt_iter_first(struct vmbus_channel *channel);
 
 struct vmpacket_descriptor *
-__hv_pkt_iter_next(struct vmbus_channel *channel,
-  const struct vmpacket_descriptor *pkt);
-
-void hv_pkt_iter_close(struct vmbus_channel *channel);
-
-/*
- * Get next packet descriptor from iterator
- * If at end of list, return NULL and update host.
- */
-static inline struct vmpacket_descriptor *
 hv_pkt_iter_next(struct vmbus_channel *channel,
-const struct vmpacket_descriptor *pkt)
-{
-   struct vmpacket_descriptor *nxt;
-
-   nxt = __hv_pkt_iter_next(channel, pkt);
-   if (!nxt)
-   hv_pkt_iter_close(channel);
-
-   return nxt;
-}
+  const struct vmpacket_descriptor *pkt);
 
 #define foreach_vmbus_pkt(pkt, channel) \
for (pkt = hv_pkt_iter_first(channel); pkt; \
-- 
2.11.0
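
The iterator contract after this patch (advance a private cursor, publish the read index only once the end is reached) can be modelled with the simplified sketch below. It is not the vmbus code: struct ring and the ring_iter_*() functions are hypothetical, the "ring" is a flat array without wraparound, and the memory barrier and host signalling are reduced to a counter.

```c
#include <stddef.h>

/* Hypothetical, simplified ring: flat array of packets, no wraparound. */
struct ring {
	const int *pkts;
	size_t count;		/* packets available */
	size_t priv_read;	/* private cursor, advanced while iterating */
	size_t read_index;	/* index visible to the writer */
	int closes;		/* how often read_index was published */
};

/* Publish the private cursor; stands in for the barrier plus signal. */
static void ring_iter_close(struct ring *r)
{
	r->read_index = r->priv_read;
	r->closes++;
}

static const int *ring_iter_first(struct ring *r)
{
	return r->priv_read < r->count ? &r->pkts[r->priv_read] : NULL;
}

/* Advance; at the end of the data, publish and return NULL. */
static const int *ring_iter_next(struct ring *r)
{
	r->priv_read++;
	if (r->priv_read >= r->count) {
		ring_iter_close(r);
		return NULL;
	}
	return &r->pkts[r->priv_read];
}
```

A caller only needs the first/next pair in a for loop; the close step happens implicitly when next returns NULL, which is why it no longer has to be exported.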

[PATCH net-next 1/2] netvsc: fix NAPI performance regression

2017-03-20 Thread Stephen Hemminger

When using NAPI, the single stream performance declined significantly
because the poll routine was updating the host after every burst
of packets. This excess signalling caused host throttling.

This fix restores the old behavior. Host is only signalled
after the ring has been emptied.

Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/hyperv_net.h |  1 +
 drivers/net/hyperv/netvsc.c | 41 ++---
 2 files changed, 19 insertions(+), 23 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 6b5f75217694..a33f2ee86044 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -723,6 +723,7 @@ struct net_device_context {
 /* Per channel data */
 struct netvsc_channel {
struct vmbus_channel *channel;
+   const struct vmpacket_descriptor *desc;
struct napi_struct napi;
struct multi_send_data msd;
struct multi_recv_comp mrc;
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 989b7cd99380..727762d0f13b 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -1173,7 +1173,6 @@ static int netvsc_process_raw_pkt(struct hv_device *device,
  struct vmbus_channel *channel,
  struct netvsc_device *net_device,
  struct net_device *ndev,
- u64 request_id,
  const struct vmpacket_descriptor *desc)
 {
struct net_device_context *net_device_ctx = netdev_priv(ndev);
@@ -1195,7 +1194,7 @@ static int netvsc_process_raw_pkt(struct hv_device *device,
 
default:
netdev_err(ndev, "unhandled packet type %d, tid %llx\n",
-  desc->type, request_id);
+  desc->type, desc->trans_id);
break;
}
 
@@ -1222,28 +1221,20 @@ int netvsc_poll(struct napi_struct *napi, int budget)
u16 q_idx = channel->offermsg.offer.sub_channel_index;
struct net_device *ndev = hv_get_drvdata(device);
struct netvsc_device *net_device = net_device_to_netvsc_device(ndev);
-   const struct vmpacket_descriptor *desc;
int work_done = 0;
 
-   desc = hv_pkt_iter_first(channel);
-   while (desc) {
-   int count;
+   /* If starting a new interval */
+   if (!nvchan->desc)
+   nvchan->desc = hv_pkt_iter_first(channel);
 
-   count = netvsc_process_raw_pkt(device, channel, net_device,
-  ndev, desc->trans_id, desc);
-   work_done += count;
-   desc = __hv_pkt_iter_next(channel, desc);
-
-   /* If receive packet budget is exhausted, reschedule */
-   if (work_done >= budget) {
-   work_done = budget;
-   break;
-   }
+   while (nvchan->desc && work_done < budget) {
+   work_done += netvsc_process_raw_pkt(device, channel, net_device,
+   ndev, nvchan->desc);
+   nvchan->desc = hv_pkt_iter_next(channel, nvchan->desc);
}
-   hv_pkt_iter_close(channel);
 
-   /* If budget was not exhausted and
-* not doing busy poll
+   /* If receive ring was exhausted
+* and not doing busy poll
 * then re-enable host interrupts
 *  and reschedule if ring is not empty.
 */
@@ -1253,7 +1244,9 @@ int netvsc_poll(struct napi_struct *napi, int budget)
napi_reschedule(napi);
 
netvsc_chk_recv_comp(net_device, channel, q_idx);
-   return work_done;
+
+   /* Driver may overshoot since multiple packets per descriptor */
+   return min(work_done, budget);
 }
 
 /* Call back when data is available in host ring buffer.
@@ -1263,10 +1256,12 @@ void netvsc_channel_cb(void *context)
 {
struct netvsc_channel *nvchan = context;
 
-   /* disable interupts from host */
-   hv_begin_read(&nvchan->channel->inbound);
+   if (napi_schedule_prep(&nvchan->napi)) {
+   /* disable interupts from host */
+   hv_begin_read(&nvchan->channel->inbound);
 
-   napi_schedule(&nvchan->napi);
+   __napi_schedule(&nvchan->napi);
+   }
 }
 
 /*
-- 
2.11.0
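
The behaviour this patch restores (keep the iterator position across poll calls and update the host only once the ring has been emptied) can be sketched as below. This is a simplified model, not the driver: struct chan and poll_sketch() are hypothetical, each descriptor is assumed to carry exactly one packet, and the host update is reduced to a counter.

```c
#include <stdbool.h>

/* Hypothetical channel state for the sketch. */
struct chan {
	int pending;		/* descriptors still in the ring */
	bool iterating;		/* an iteration is in progress across polls */
	int host_updates;	/* how often the host was updated */
};

/* Process up to 'budget' descriptors; update the host only when the
 * ring runs empty, not after every burst. */
static int poll_sketch(struct chan *c, int budget)
{
	int work_done = 0;

	/* Start a new iteration only if none is in progress. */
	if (!c->iterating)
		c->iterating = c->pending > 0;

	while (c->iterating && work_done < budget) {
		c->pending--;
		work_done++;
		if (c->pending == 0) {
			c->host_updates++;
			c->iterating = false;
		}
	}
	return work_done;
}
```

With 10 pending descriptors and a budget of 4, three polls drain the ring and the host is updated once, rather than once per poll.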

[PATCH net-next 0/2] netvsc: performance regressions in net-next

2017-03-20 Thread Stephen Hemminger

Fix for performance regression introduced with NAPI change;
and followup cleanup of vmbus core code.

Stephen Hemminger (2):
  netvsc: fix NAPI performance regression
  vmbus: no longer expose iterator internals

 drivers/hv/ring_buffer.c| 51 +
 drivers/net/hyperv/hyperv_net.h |  1 +
 drivers/net/hyperv/netvsc.c | 41 +++--
 include/linux/hyperv.h  | 22 +-
 4 files changed, 46 insertions(+), 69 deletions(-)

-- 
2.11.0

Re: [PATCH net-next v4 2/2]L2TP:Adjust intf MTU, add underlay L3, L2 hdrs

2017-03-20 Thread R. Parameswaran



Hi James,

Please see inline:

On Mon, 20 Mar 2017, James Chapman wrote:

> I suggest change the wording of the first paragraph in the patch comment
> to better represent why the changes are being made. Perhaps something
> like the following?
> 
> "Existing L2TP kernel code does not derive the optimal MTU for Ethernet
> pseudowires and instead leaves this to a userspace L2TP daemon or
> operator. If an MTU is not specified, the existing kernel code chooses
> an MTU that does not take account of all tunnel header overheads, which
> can lead to unwanted IP fragmentation. When L2TP is used without a
> control plane (userspace daemon), we would prefer that the kernel does a
> better job of choosing a default pseudowire MTU, taking account of all
> tunnel header overheads, including IP header options, if any. This patch
> addresses this."
> 

This reads quite a bit better, thanks for suggesting this. I will
pick it up. Plan to  retain the second paragraph while removing the 1/2, 
2/2 references, while keeping the patch rev at v4. 
I'll also respond to your email on the other patch in a bit, with suggested 
text which you could review/comment on. I'll re-post with changes after 
that. 

thanks,

Ramkumar
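
For reference, the overhead accounting that the quoted patch below performs can be written as a small standalone function. This is only a sketch: l2tp_eth_mtu() is a hypothetical name, the header sizes are the usual 8 bytes for UDP and 14 for Ethernet, and the clamp to 0 is my own addition rather than something the patch does.

```c
#define UDP_HDR_LEN 8u	/* sizeof(struct udphdr) */
#define ETH_HLEN 14u	/* inner Ethernet header */

/* Hypothetical helper: MTU left for the L2TP Ethernet device after
 * subtracting the underlay L3 header, the optional UDP header, the
 * L2TP session header and the inner Ethernet header. */
static unsigned int l2tp_eth_mtu(unsigned int underlay_mtu,
				 unsigned int l3_overhead,
				 unsigned int l2tp_hdr_len,
				 int udp_encap)
{
	unsigned int overhead = l3_overhead + l2tp_hdr_len + ETH_HLEN;

	if (udp_encap)
		overhead += UDP_HDR_LEN;

	/* Clamp instead of wrapping around (an assumption of this sketch). */
	return underlay_mtu > overhead ? underlay_mtu - overhead : 0;
}
```

For example, with a 1500-byte underlay MTU, a 20-byte IPv4 header, UDP encapsulation and an assumed 12-byte session header, the result is 1500 - (20 + 8 + 12 + 14) = 1446.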

> 
> On 18/03/17 02:00, R. Parameswaran wrote:
> > In existing kernel code, when setting up the L2TP interface, all of the
> > tunnel encapsulation headers are not taken into account when setting
> > up the MTU on the  L2TP logical interface device. Due to this, the
> > packets created by the applications on top of the L2TP layer are larger
> > than they ought to be, relative to the underlay MTU, which leads to
> > needless fragmentation once the L2TP packet is encapsulated in an outer IP
> > packet.  Specifically, the MTU calculation  does not take into account the
> > (outer) IP header imposed on the encapsulated L2TP packet, and the Layer 2
> > header imposed on the inner L2TP packet prior to encapsulation.
> >
> > Change-set here (2/2) uses the new kernel API to compute the IP overhead
> > on an IPv4 or IPv6 socket, introduced in 1/2, in the L2TP Eth device setup
> > to factor the additional encap overheads from the underlay IP header and
> > Ethernet header on overlay (inner packet), to size the MTU on the L2TP
> > logical device to its correct value.
> >
> > Signed-off-by: R. Parameswaran 
> > ---
> >  net/l2tp/l2tp_eth.c | 55 
> > +
> >  1 file changed, 51 insertions(+), 4 deletions(-)
> >
> > diff --git a/net/l2tp/l2tp_eth.c b/net/l2tp/l2tp_eth.c
> > index 8bf18a5..f143fa4 100644
> > --- a/net/l2tp/l2tp_eth.c
> > +++ b/net/l2tp/l2tp_eth.c
> > @@ -30,6 +30,9 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> > +#include 
> > +#include 
> >  
> >  #include "l2tp_core.h"
> >  
> > @@ -204,6 +207,53 @@ static void l2tp_eth_show(struct seq_file *m, void 
> > *arg)
> >  }
> >  #endif
> >  
> > +static void l2tp_eth_adjust_mtu(struct l2tp_tunnel *tunnel,
> > +   struct l2tp_session *session,
> > +   struct net_device *dev)
> > +{
> > +   unsigned int overhead = 0;
> > +   struct dst_entry *dst;
> > +   u32 l3_overhead = 0;
> > +
> > +   /* if the encap is UDP, account for UDP header size */
> > +   if (tunnel->encap == L2TP_ENCAPTYPE_UDP) {
> > +   overhead += sizeof(struct udphdr);
> > +   dev->needed_headroom += sizeof(struct udphdr);
> > +   }
> > +   if (session->mtu != 0) {
> > +   dev->mtu = session->mtu;
> > +   dev->needed_headroom += session->hdr_len;
> > +   return;
> > +   }
> > +   l3_overhead = kernel_sock_ip_overhead(tunnel->sock);
> > +   if (l3_overhead == 0) {
> > +   /* L3 Overhead couldn't be identified, this could be
> > +* because tunnel->sock was NULL or the socket's
> > +* address family was not IPv4 or IPv6,
> > +* dev mtu stays at 1500.
> > +*/
> > +   return;
> > +   }
> > +   /* Adjust MTU, factor overhead - underlay L3, overlay L2 hdr
> > +* UDP overhead, if any, was already factored in above.
> > +*/
> > +   overhead += session->hdr_len + ETH_HLEN + l3_overhead;
> > +
> > +   /* If PMTU discovery was enabled, use discovered MTU on L2TP device */
> > +   dst = sk_dst_get(tunnel->sock);
> > +   if (dst) {
> > +   /* dst_mtu will use PMTU if found, else fallback to intf MTU */
> > +   u32 pmtu = dst_mtu(dst);
> > +
> > +   if (pmtu != 0)
> > +   dev->mtu = pmtu;
> > +   dst_release(dst);
> > +   }
> > +   session->mtu = dev->mtu - overhead;
> > +   dev->mtu = session->mtu;
> > +   dev->needed_headroom += session->hdr_len;
> > +}
> > +
> >  static int l2tp_eth_create(struct net *net, u32 tunnel_id, u32 session_id, 
> > u32 peer_session_id, struct l2tp_session_cfg *cfg)
> >  {
> > struct net_device *dev;
> > @@ -253,13 +303,10 @@ static int l2tp_eth_create(struct net *net, u32 
> >

linux-next-20170320 break stmmac on dwmac-sunxi

2017-03-20 Thread Corentin Labbe

Hello

I just pushed next-20170320 to my boards, and stmmac stopped working on both the
in-tree dwmac-sunxi and my development dwmac-sun8i.
It seems that interrupts never fire and the transmit queue times out.
I will try to bisect this problem, but perhaps other people could try to
reproduce it.

Regards
Corentin Labbe

Re: "TCP: eth0: Driver has suspect GRO implementation, TCP performance may be compromised." message with "ethtool -K eth0 gro off"

2017-03-20 Thread Marcelo Ricardo Leitner

On Sun, Mar 19, 2017 at 12:20:26PM -0700, Eric Dumazet wrote:
> On Sun, 2017-03-19 at 13:14 +0100, Markus Trippelsdorf wrote:
> > On 2017.02.06 at 19:12 -0200, Marcelo Ricardo Leitner wrote:
> > > On Fri, Feb 03, 2017 at 06:47:33AM -0800, Eric Dumazet wrote:
> > > > On Fri, 2017-02-03 at 12:28 -0200, Marcelo Ricardo Leitner wrote:
> > > > 
> > > > > Aren't you mixing the endpoints here? MSS is the largest amount of 
> > > > > data
> > > > > that the peer can receive in a single segment, and not how much it 
> > > > > will
> > > > > send. For the sending part, that depends on what the other peer
> > > > > announced, and we can have 2 different MSS in a single connection, one
> > > > > for each peer.
> > > > > 
> > > > > If a peer later wants to send larger segments, it can, but it must
> > > > > respect the mss advertised by the other peer during handshake.
> > > > > 
> > > > 
> > > > I am not mixing endpoints, you are.
> > > > 
> > > > If you need to be convinced, please grab :
> > > > https://patchwork.ozlabs.org/patch/723028/
> > > > 
> > > > And just watch "ss -temoi ..." 
> > > 
> > > I still don't get it, but I also hit the warning on my laptop, using
> > > iwlwifi. Not sure what I did in order to trigger it, it was by accident.
> > 
> > After many weeks without any warning, I've hit the issue again today:

Nice!

> > 
> >  TCP: eth0: Driver has suspect GRO implementation, TCP performance may be 
> > compromised. rcv_mss:1448 advmss:1448 len:1460
> > 
> 
> It is very possible the sender suddenly forgot to use TCP timestamps.

By those 12 bytes, seems so, yes.

> This warning is a hint, and can not assume senders are not dumb.

Agreed. But we can make it consider such cases. What about the following
patch? (untested)

I think we can directly account for the size of the timestamps in there,
as that won't make a difference to congestion control in case it's
wrong, and also validate against MTU if we have it. I didn't subtract
the headers from MTU on purpose, as dealing with ipv4/ipv6 there is
not worth it, for the same reason.

This should silence this false positive.
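
The arithmetic behind the proposal, as a userspace sketch: the aligned TCP timestamp option takes 12 bytes, so a sender that stops using timestamps can deliver 1448 + 12 = 1460 bytes of payload. gro_warn_needed() below is a hypothetical helper that combines the two conditions from the patch; it leaves out the once-only flag and the surrounding rcv_mss update, and uses dev_mtu == 0 to mean "device not found".

```c
#include <stdbool.h>

#define TSTAMP_OPT_ALIGNED 12u	/* 10-byte option plus 2 bytes of padding */

/* Hypothetical helper: should the "suspect GRO" warning fire for a
 * segment of 'len' payload bytes? */
static bool gro_warn_needed(unsigned int len, unsigned int advmss,
			    unsigned int dev_mtu)
{
	unsigned int rcv_mss = len < advmss ? len : advmss;

	/* Tolerate a peer that dropped the timestamp option. */
	if (rcv_mss + TSTAMP_OPT_ALIGNED >= len)
		return false;

	/* Only warn if the device is unknown or len reaches its MTU. */
	return dev_mtu == 0 || len >= dev_mtu;
}
```

With the values from the report (advmss 1448, len 1460) the check stays quiet, while a genuinely oversized segment such as 3000 bytes on a 1500-byte MTU still triggers it.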

---8<---

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 96b67a8b18c3..96a99446ddce 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -126,7 +126,8 @@ int sysctl_tcp_invalid_ratelimit __read_mostly = HZ/2;
 #define REXMIT_LOST1 /* retransmit packets marked lost */
 #define REXMIT_NEW 2 /* FRTO-style transmit of unsent/new packets */
 
-static void tcp_gro_dev_warn(struct sock *sk, const struct sk_buff *skb)
+static void tcp_gro_dev_warn(struct sock *sk, const struct sk_buff *skb,
+unsigned int len)
 {
static bool __once __read_mostly;
 
@@ -137,8 +138,9 @@ static void tcp_gro_dev_warn(struct sock *sk, const struct sk_buff *skb)
 
rcu_read_lock();
dev = dev_get_by_index_rcu(sock_net(sk), skb->skb_iif);
-   pr_warn("%s: Driver has suspect GRO implementation, TCP performance may be compromised.\n",
-   dev ? dev->name : "Unknown driver");
+   if (!dev || len >= dev->mtu)
+   pr_warn("%s: Driver has suspect GRO implementation, TCP performance may be compromised.\n",
+   dev ? dev->name : "Unknown driver");
rcu_read_unlock();
}
 }
@@ -161,8 +163,9 @@ static void tcp_measure_rcv_mss(struct sock *sk, const struct sk_buff *skb)
if (len >= icsk->icsk_ack.rcv_mss) {
icsk->icsk_ack.rcv_mss = min_t(unsigned int, len,
   tcp_sk(sk)->advmss);
-   if (unlikely(icsk->icsk_ack.rcv_mss != len))
-   tcp_gro_dev_warn(sk, skb);
+   /* The + 12 accounts for the possible lack of timestamps */
+   if (unlikely(icsk->icsk_ack.rcv_mss + 12 < len))
+   tcp_gro_dev_warn(sk, skb, len);
} else {
/* Otherwise, we make more careful check taking into account,
 * that SACKs block is variable.

[PATCH] sock: introduce SO_MEMINFO getsockopt

2017-03-20 Thread Josh Hunt

Allows reading of SK_MEMINFO_VARS via socket option. This way an
application can get all meminfo related information in single socket
option call instead of multiple calls.

Adds helper function, sk_get_meminfo(), and uses that for both
getsockopt and sock_diag_put_meminfo().

Suggested by Eric Dumazet.

Signed-off-by: Josh Hunt 
Reviewed-by: Jason Baron 
Acked-by: Eric Dumazet 
---
 arch/alpha/include/uapi/asm/socket.h   |  2 ++
 arch/avr32/include/uapi/asm/socket.h   |  2 ++
 arch/frv/include/uapi/asm/socket.h |  2 ++
 arch/ia64/include/uapi/asm/socket.h|  2 ++
 arch/m32r/include/uapi/asm/socket.h|  2 ++
 arch/mips/include/uapi/asm/socket.h|  3 +++
 arch/mn10300/include/uapi/asm/socket.h |  2 ++
 arch/parisc/include/uapi/asm/socket.h  |  2 ++
 arch/powerpc/include/uapi/asm/socket.h |  2 ++
 arch/s390/include/uapi/asm/socket.h|  2 ++
 arch/sparc/include/uapi/asm/socket.h   |  2 ++
 arch/xtensa/include/uapi/asm/socket.h  |  2 ++
 include/net/sock.h |  2 ++
 include/uapi/asm-generic/socket.h  |  2 ++
 net/core/sock.c| 30 ++
 net/core/sock_diag.c   | 10 +-
 16 files changed, 60 insertions(+), 9 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
index afc901b..089db42 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -99,4 +99,6 @@
 
 #define SCM_TIMESTAMPING_OPT_STATS 54
 
+#define SO_MEMINFO 55
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/avr32/include/uapi/asm/socket.h b/arch/avr32/include/uapi/asm/socket.h
index 5a65042..6eabcbd 100644
--- a/arch/avr32/include/uapi/asm/socket.h
+++ b/arch/avr32/include/uapi/asm/socket.h
@@ -92,4 +92,6 @@
 
 #define SCM_TIMESTAMPING_OPT_STATS 54
 
+#define SO_MEMINFO 55
+
 #endif /* _UAPI__ASM_AVR32_SOCKET_H */
diff --git a/arch/frv/include/uapi/asm/socket.h b/arch/frv/include/uapi/asm/socket.h
index 81e0353..bd497f8 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -92,5 +92,7 @@
 
 #define SCM_TIMESTAMPING_OPT_STATS 54
 
+#define SO_MEMINFO 55
+
 #endif /* _ASM_SOCKET_H */
 
diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h
index 57feb0c..f1bb546 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -101,4 +101,6 @@
 
 #define SCM_TIMESTAMPING_OPT_STATS 54
 
+#define SO_MEMINFO 55
+
 #endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h b/arch/m32r/include/uapi/asm/socket.h
index 5853f8e9..459c460 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -92,4 +92,6 @@
 
 #define SCM_TIMESTAMPING_OPT_STATS 54
 
+#define SO_MEMINFO 55
+
 #endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index 566ecdc..688c18d 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -110,4 +110,7 @@
 
 #define SCM_TIMESTAMPING_OPT_STATS 54
 
+#define SO_MEMINFO 55
+
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h b/arch/mn10300/include/uapi/asm/socket.h
index 0e12527..312d2c4 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -92,4 +92,6 @@
 
 #define SCM_TIMESTAMPING_OPT_STATS 54
 
+#define SO_MEMINFO 55
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index 7a109b7..b98ec38 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -91,4 +91,6 @@
 
 #define SCM_TIMESTAMPING_OPT_STATS 0x402F
 
+#define SO_MEMINFO 0x4030
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/powerpc/include/uapi/asm/socket.h 
b/arch/powerpc/include/uapi/asm/socket.h
index 44583a5..099a889 100644
--- a/arch/powerpc/include/uapi/asm/socket.h
+++ b/arch/powerpc/include/uapi/asm/socket.h
@@ -99,4 +99,6 @@
 
 #define SCM_TIMESTAMPING_OPT_STATS 54
 
+#define SO_MEMINFO 55
+
 #endif /* _ASM_POWERPC_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h 
b/arch/s390/include/uapi/asm/socket.h
index b24a64c..6199bb3 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -98,4 +98,6 @@
 
 #define SCM_TIMESTAMPING_OPT_STATS 54
 
+#define SO_MEMINFO 55
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h 
b/arch/sparc/include/uapi/asm/socket.h
index a25dc32..12cd8c2 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -88,6 +88,8 @@
 
 #define SCM_TIMESTAMPING_OPT_STATS

Re: [PATCH v4 2/2] can: spi: hi311x: Add Holt HI-311x CAN driver

2017-03-20 Thread Akshay Bhat

Hi Wolfgang,

On 03/20/2017 12:46 PM, Wolfgang Grandegger wrote:
..snip..
>>
>> The top 3 bits of HI3110_READ_ERR (BUSOFF, TXERRP, RXERRP) are valid
>> even if HI3110_INT_BUSERR is not set.
> 
> I'm confused! If you disable BUSERR interrupts, you do not get the
> status bits any longer, you said. But the manual says: "Bits 4:0 in the
> ERR register can be read to determine the source of the error.", which
> excludes the above bits... but obviously the controller does it that way.
> 

I agree this feature could use better documentation.

Based on testing:

If BUSERRIE bit in INTE is clear,
On cable disconnect: No interrupts related to bus errors are generated.
So status change messages do not go out.

If BUSERRIE bit in INTE is set,
On cable disconnect: Interrupts are generated due to ACKERR. The
interrupt routine reads the BUSOFF/TXERRP/RXERRP bits of ERR register
and reports the state. If CAN_CTRLMODE_BERR_REPORTING is set, then
protocol errors are also reported by the interrupt.
On cable re-connect: Interrupts are generated due to TXCPLT. The
interrupt routine reads the BUSOFF/TXERRP/RXERRP bits of ERR register
and reports the state.

Let me know if this helps clear things up :)
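For anyone following along, the state reporting described above can be sketched in plain C. This is only an illustration under stated assumptions: the macro names and the exact bit positions (bits 7, 6 and 5) are hypothetical; the thread only says that BUSOFF, TXERRP and RXERRP are the top three bits of the ERR register, so the datasheet should be checked for the real order.

```c
#include <stdint.h>

/* Hypothetical names; assumed layout: the top three bits of ERR. */
#define ERR_BUSOFF (1u << 7)
#define ERR_TXERRP (1u << 6)
#define ERR_RXERRP (1u << 5)

enum bus_state { BUS_ERROR_ACTIVE, BUS_ERROR_PASSIVE, BUS_OFF };

/* Map the status bits of ERR to a state; bits 4:0 (error source) are ignored. */
static enum bus_state decode_err(uint8_t err)
{
	if (err & ERR_BUSOFF)
		return BUS_OFF;
	if (err & (ERR_TXERRP | ERR_RXERRP))
		return BUS_ERROR_PASSIVE;
	return BUS_ERROR_ACTIVE;
}
```

In this model the interrupt routine would call decode_err() on every bus error or TX-complete interrupt, which matches the observation that the state is only reported when some interrupt actually fires.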

Re: [PATCH net] sctp: out_qlen should be updated when pruning unsent queue

2017-03-20 Thread Marcelo Ricardo Leitner

On Sat, Mar 18, 2017 at 08:03:59PM +0800, Xin Long wrote:
> This patch is to fix the issue that sctp_prsctp_prune_sent forgot
> to update q->out_qlen when removing a chunk from unsent queue.
> 
> Fixes: 8dbdf1f5b09c ("sctp: implement prsctp PRIO policy")
> Signed-off-by: Xin Long 

Acked-by: Marcelo Ricardo Leitner 

> ---
>  net/sctp/outqueue.c | 11 +--
>  1 file changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
> index db352e5..025ccff 100644
> --- a/net/sctp/outqueue.c
> +++ b/net/sctp/outqueue.c
> @@ -382,17 +382,18 @@ static int sctp_prsctp_prune_sent(struct 
> sctp_association *asoc,
>  }
>  
>  static int sctp_prsctp_prune_unsent(struct sctp_association *asoc,
> - struct sctp_sndrcvinfo *sinfo,
> - struct list_head *queue, int msg_len)
> + struct sctp_sndrcvinfo *sinfo, int msg_len)
>  {
> + struct sctp_outq *q = &asoc->outqueue;
>   struct sctp_chunk *chk, *temp;
>  
> - list_for_each_entry_safe(chk, temp, queue, list) {
> + list_for_each_entry_safe(chk, temp, &q->out_chunk_list, list) {
>   if (!SCTP_PR_PRIO_ENABLED(chk->sinfo.sinfo_flags) ||
>   chk->sinfo.sinfo_timetolive <= sinfo->sinfo_timetolive)
>   continue;
>  
>   list_del_init(&chk->list);
> + q->out_qlen -= chk->skb->len;
>   asoc->sent_cnt_removable--;
>   asoc->abandoned_unsent[SCTP_PR_INDEX(PRIO)]++;
>  
> @@ -431,9 +432,7 @@ void sctp_prsctp_prune(struct sctp_association *asoc,
>   return;
>   }
>  
> - sctp_prsctp_prune_unsent(asoc, sinfo,
> -  &asoc->outqueue.out_chunk_list,
> -  msg_len);
> + sctp_prsctp_prune_unsent(asoc, sinfo, msg_len);
>  }
>  
>  /* Mark all the eligible packets on a transport for retransmission.  */
> -- 
> 2.1.0
> 
>
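The accounting issue this patch fixes can be illustrated with a small userspace model. It is a sketch only: struct chunk, struct outq and prune_unsent() are hypothetical stand-ins for the kernel structures, and a plain integer priority replaces the sinfo_timetolive comparison used by the PRIO policy.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; all names are hypothetical. */
struct chunk {
	struct chunk *next;
	size_t len;   /* plays the role of chk->skb->len */
	int prio;     /* a larger value means lower priority */
};

struct outq {
	struct chunk *unsent;  /* singly linked list of unsent chunks */
	size_t out_qlen;       /* total bytes still queued */
};

/* Drop unsent chunks whose priority value is above 'prio'; return bytes freed. */
static size_t prune_unsent(struct outq *q, int prio)
{
	struct chunk **pp = &q->unsent;
	size_t freed = 0;

	while (*pp) {
		struct chunk *chk = *pp;

		if (chk->prio <= prio) {
			pp = &chk->next;          /* keep this chunk */
			continue;
		}
		*pp = chk->next;                  /* unlink the chunk */
		chk->next = NULL;
		q->out_qlen -= chk->len;          /* the step the fix adds */
		freed += chk->len;
	}
	return freed;
}
```

Without the out_qlen update, the queue would keep reporting bytes for chunks that are no longer on the list.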

[PATCH net-next v5 1/3] Add a helper function to get socket cookie in eBPF

2017-03-20 Thread Chenbo Feng

From: Chenbo Feng 

Retrieve the socket cookie generated by sock_gen_cookie() from a sk_buff
with a known socket. Generates a new cookie if one was not yet set. If
the socket pointer inside sk_buff is NULL, 0 is returned. The helper
function could be useful in monitoring per-socket networking traffic
statistics and provides a unique socket identifier per namespace.

Signed-off-by: Chenbo Feng 
---
 include/linux/sock_diag.h  |  1 +
 include/uapi/linux/bpf.h   |  9 -
 net/core/filter.c  | 17 +
 net/core/sock_diag.c   |  2 +-
 tools/include/uapi/linux/bpf.h |  3 ++-
 5 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/include/linux/sock_diag.h b/include/linux/sock_diag.h
index a0596ca0..a2f8109 100644
--- a/include/linux/sock_diag.h
+++ b/include/linux/sock_diag.h
@@ -24,6 +24,7 @@ void sock_diag_unregister(const struct sock_diag_handler *h);
 void sock_diag_register_inet_compat(int (*fn)(struct sk_buff *skb, struct 
nlmsghdr *nlh));
 void sock_diag_unregister_inet_compat(int (*fn)(struct sk_buff *skb, struct 
nlmsghdr *nlh));
 
+u64 sock_gen_cookie(struct sock *sk);
 int sock_diag_check_cookie(struct sock *sk, const __u32 *cookie);
 void sock_diag_save_cookie(struct sock *sk, __u32 *cookie);
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0539a0c..dc81a9f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -456,6 +456,12 @@ union bpf_attr {
  * Return:
  *   > 0 length of the string including the trailing NUL on success
  *   < 0 error
+ *
+ * u64 bpf_get_socket_cookie(skb)
+ * Get the cookie for the socket stored inside sk_buff.
+ * @skb: pointer to skb
+ * Return: 8 Bytes non-decreasing number on success or 0 if the socket
+ * field is missing inside sk_buff
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -503,7 +509,8 @@ union bpf_attr {
FN(get_numa_node_id),   \
FN(skb_change_head),\
FN(xdp_adjust_head),\
-   FN(probe_read_str),
+   FN(probe_read_str), \
+   FN(get_socket_cookie),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index ebaeaf2..5b65ae3 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2599,6 +2600,18 @@ static const struct bpf_func_proto 
bpf_xdp_event_output_proto = {
.arg5_type  = ARG_CONST_SIZE,
 };
 
+BPF_CALL_1(bpf_get_socket_cookie, struct sk_buff *, skb)
+{
+   return skb->sk ? sock_gen_cookie(skb->sk) : 0;
+}
+
+static const struct bpf_func_proto bpf_get_socket_cookie_proto = {
+   .func   = bpf_get_socket_cookie,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+};
+
 static const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
@@ -2633,6 +2646,8 @@ sk_filter_func_proto(enum bpf_func_id func_id)
switch (func_id) {
case BPF_FUNC_skb_load_bytes:
return &bpf_skb_load_bytes_proto;
+   case BPF_FUNC_get_socket_cookie:
+   return &bpf_get_socket_cookie_proto;
default:
return bpf_base_func_proto(func_id);
}
@@ -2692,6 +2707,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
return &bpf_get_smp_processor_id_proto;
case BPF_FUNC_skb_under_cgroup:
return &bpf_skb_under_cgroup_proto;
+   case BPF_FUNC_get_socket_cookie:
+   return &bpf_get_socket_cookie_proto;
default:
return bpf_base_func_proto(func_id);
}
diff --git a/net/core/sock_diag.c b/net/core/sock_diag.c
index 6b10573..acd2a6c 100644
--- a/net/core/sock_diag.c
+++ b/net/core/sock_diag.c
@@ -19,7 +19,7 @@ static int (*inet_rcv_compat)(struct sk_buff *skb, struct 
nlmsghdr *nlh);
 static DEFINE_MUTEX(sock_diag_table_mutex);
 static struct workqueue_struct *broadcast_wq;
 
-static u64 sock_gen_cookie(struct sock *sk)
+u64 sock_gen_cookie(struct sock *sk)
 {
while (1) {
u64 res = atomic64_read(&sk->sk_cookie);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 0539a0c..a94bdd3 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -503,7 +503,8 @@ union bpf_attr {
FN(get_numa_node_id),   \
FN(skb_change_head),\
FN(xdp_adjust_head),\
-   FN(probe_read_str),
+   FN(probe_read_str), \
+   FN(get_socket_cookie),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
-- 
2.7.4
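As a reading aid, the "generate a cookie if one was not yet set" behaviour can be modelled in userspace with C11 atomics. This is a sketch under assumptions, not the kernel code: struct fake_sock, cookie_gen and both function names are hypothetical, and the counter plus compare-and-swap step is my reading of how sock_gen_cookie() works beyond the single line visible in the diff context.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct sock; only the cookie field is modelled. */
struct fake_sock {
	_Atomic uint64_t cookie;   /* 0 means "not assigned yet" */
};

/* Assumed to be a per-namespace counter in the kernel; global here. */
static _Atomic uint64_t cookie_gen;

/* Return the socket's cookie, assigning the next counter value on first use. */
static uint64_t gen_cookie(struct fake_sock *sk)
{
	for (;;) {
		uint64_t res = atomic_load(&sk->cookie);
		uint64_t expected = 0;

		if (res)
			return res;
		res = atomic_fetch_add(&cookie_gen, 1) + 1;
		/* Only one racing caller wins; the others re-read the winner's value. */
		atomic_compare_exchange_strong(&sk->cookie, &expected, res);
	}
}

/* Mirrors the helper's contract: 0 when no socket is attached. */
static uint64_t get_socket_cookie(struct fake_sock *sk)
{
	return sk ? gen_cookie(sk) : 0;
}
```

Because 0 is reserved for "no socket", a monitoring program can treat a zero cookie as unattributed traffic.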

[PATCH net-next v5 2/3] Add a eBPF helper function to retrieve socket uid

2017-03-20 Thread Chenbo Feng

From: Chenbo Feng 

Returns the owner uid of the socket inside a sk_buff. This is useful to
perform per-UID accounting of network traffic or per-UID packet
filtering. The socket needs to be a full socket; otherwise overflowuid
is returned.

Signed-off-by: Chenbo Feng 
---
 include/uapi/linux/bpf.h   |  9 -
 net/core/filter.c  | 22 ++
 tools/include/uapi/linux/bpf.h |  3 ++-
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index dc81a9f..ff42111 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -462,6 +462,12 @@ union bpf_attr {
  * @skb: pointer to skb
  * Return: 8 Bytes non-decreasing number on success or 0 if the socket
  * field is missing inside sk_buff
+ *
+ * u32 bpf_get_socket_uid(skb)
+ * Get the owner uid of the socket stored inside sk_buff.
+ * @skb: pointer to skb
+ * Return: uid of the socket owner on success or overflowuid if the
+ * socket is missing or is not a full socket
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -510,7 +516,8 @@ union bpf_attr {
FN(skb_change_head),\
FN(xdp_adjust_head),\
FN(probe_read_str), \
-   FN(get_socket_cookie),
+   FN(get_socket_cookie),  \
+   FN(get_socket_uid),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 5b65ae3..a7c25c1 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2612,6 +2612,24 @@ static const struct bpf_func_proto 
bpf_get_socket_cookie_proto = {
.arg1_type  = ARG_PTR_TO_CTX,
 };
 
+BPF_CALL_1(bpf_get_socket_uid, struct sk_buff *, skb)
+{
+   kuid_t kuid;
+   struct sock *sk = sk_to_full_sk(skb->sk);
+
+   if (!sk || !sk_fullsock(sk))
+   return overflowuid;
+   kuid = sock_net_uid(sock_net(sk), sk);
+   return from_kuid_munged(current_user_ns(), kuid);
+}
+
+static const struct bpf_func_proto bpf_get_socket_uid_proto = {
+   .func   = bpf_get_socket_uid,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+};
+
 static const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
@@ -2648,6 +2666,8 @@ sk_filter_func_proto(enum bpf_func_id func_id)
return &bpf_skb_load_bytes_proto;
case BPF_FUNC_get_socket_cookie:
return &bpf_get_socket_cookie_proto;
+   case BPF_FUNC_get_socket_uid:
+   return &bpf_get_socket_uid_proto;
default:
return bpf_base_func_proto(func_id);
}
@@ -2709,6 +2729,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
return &bpf_skb_under_cgroup_proto;
case BPF_FUNC_get_socket_cookie:
return &bpf_get_socket_cookie_proto;
+   case BPF_FUNC_get_socket_uid:
+   return &bpf_get_socket_uid_proto;
default:
return bpf_base_func_proto(func_id);
}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a94bdd3..4a2d56d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -504,7 +504,8 @@ union bpf_attr {
FN(skb_change_head),\
FN(xdp_adjust_head),\
FN(probe_read_str), \
-   FN(get_socket_cookie),
+   FN(get_socket_cookie),  \
+   FN(get_socket_uid),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
-- 
2.7.4
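The fallback behaviour of the uid helper can be summarised with a tiny userspace model. Everything here is hypothetical naming for illustration; the value 65534 is an assumption about the default overflowuid (it matches the 0xfffe seen in the sample output of patch 3/3), and user namespace translation is left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define OVERFLOW_UID 65534u  /* assumed default value of overflowuid */

/* Hypothetical stand-in for struct sock. */
struct fake_sock {
	bool full;      /* models sk_fullsock() */
	uint32_t uid;   /* owner uid, already mapped to the caller's namespace */
};

/* Return the owner uid, or OVERFLOW_UID when there is no usable socket. */
static uint32_t get_socket_uid(const struct fake_sock *sk)
{
	if (!sk || !sk->full)
		return OVERFLOW_UID;
	return sk->uid;
}
```

Returning overflowuid rather than 0 keeps unattributed packets from being counted against root.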

[PATCH net-next v5 3/3] A Sample of using socket cookie and uid for traffic monitoring

2017-03-20 Thread Chenbo Feng

From: Chenbo Feng 

Add a sample program to demonstrate the possible usage of the
get_socket_cookie and get_socket_uid helper functions. The program
counts the bytes and packets of in/out traffic monitored by iptables
and stores the stats in a bpf map on a per-socket basis. The owner uid
of the socket is stored as part of the data entry. A shell script for
running the program is also included.

Signed-off-by: Chenbo Feng 
---
 samples/bpf/Makefile |   3 +
 samples/bpf/cookie_uid_helper_example.c  | 223 +++
 samples/bpf/libbpf.h |  10 ++
 samples/bpf/run_cookie_uid_helper_example.sh |  14 ++
 4 files changed, 250 insertions(+)
 create mode 100644 samples/bpf/cookie_uid_helper_example.c
 create mode 100755 samples/bpf/run_cookie_uid_helper_example.sh

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 09e9d53..f803f51 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -34,6 +34,7 @@ hostprogs-y += sampleip
 hostprogs-y += tc_l2_redirect
 hostprogs-y += lwt_len_hist
 hostprogs-y += xdp_tx_iptunnel
+hostprogs-y += per_socket_stats_example
 
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o
@@ -72,6 +73,7 @@ sampleip-objs := bpf_load.o $(LIBBPF) sampleip_user.o
 tc_l2_redirect-objs := bpf_load.o $(LIBBPF) tc_l2_redirect_user.o
 lwt_len_hist-objs := bpf_load.o $(LIBBPF) lwt_len_hist_user.o
 xdp_tx_iptunnel-objs := bpf_load.o $(LIBBPF) xdp_tx_iptunnel_user.o
+per_socket_stats_example-objs := $(LIBBPF) cookie_uid_helper_example.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -105,6 +107,7 @@ always += trace_event_kern.o
 always += sampleip_kern.o
 always += lwt_len_hist_kern.o
 always += xdp_tx_iptunnel_kern.o
+always += cookie_uid_helper_example.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/cookie_uid_helper_example.c 
b/samples/bpf/cookie_uid_helper_example.c
new file mode 100644
index 000..c4e36ad
--- /dev/null
+++ b/samples/bpf/cookie_uid_helper_example.c
@@ -0,0 +1,223 @@
+/* This test is a demo of using the get_socket_uid and get_socket_cookie
+ * helper functions to do per-socket network traffic monitoring.
+ * It requires an iptables version higher than 1.6.1 to load a pinned eBPF
+ * program into the xt_bpf match.
+ *
+ * TEST:
+ * ./run_cookie_uid_helper_example.sh
+ * Then generate some traffic in various ways. ping 0 -c 10 would work
+ * but the cookie and uid in this case could both be 0. A sample output
+ * with some traffic generated by web browser is shown below:
+ *
+ * cookie: 877, uid: 0x3e8, Pakcet Count: 20, Bytes Count: 11058
+ * cookie: 132, uid: 0x0, Pakcet Count: 2, Bytes Count: 286
+ * cookie: 812, uid: 0x3e8, Pakcet Count: 3, Bytes Count: 1726
+ * cookie: 802, uid: 0x3e8, Pakcet Count: 2, Bytes Count: 104
+ * cookie: 877, uid: 0x3e8, Pakcet Count: 20, Bytes Count: 11058
+ * cookie: 831, uid: 0x3e8, Pakcet Count: 2, Bytes Count: 104
+ * cookie: 0, uid: 0x0, Pakcet Count: 6, Bytes Count: 712
+ * cookie: 880, uid: 0xfffe, Pakcet Count: 1, Bytes Count: 70
+ *
+ * Clean up: if using the shell script, the script will delete the iptables
+ * rule and unmount the bpf program on exit. Otherwise the iptables rule needs
+ * to be deleted by hand; see run_cookie_uid_helper_example.sh for details.
+ */
+
+#define _GNU_SOURCE
+
+#define offsetof(type, member) __builtin_offsetof(type, member)
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "libbpf.h"
+
+struct stats {
+   uint32_t uid;
+   uint64_t packets;
+   uint64_t bytes;
+};
+
+static int map_fd, prog_fd;
+
+static void maps_create(void)
+{
+   map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(uint32_t),
+   sizeof(struct stats), 100, 0);
+   if (map_fd < 0)
+   error(1, errno, "map create failed!\n");
+}
+
+static void prog_load(void)
+{
+   static char log_buf[1 << 16];
+
+   struct bpf_insn prog[] = {
+   /*
+* Save sk_buff for future usage. value stored in R6 to R10 will
+* not be reset after a bpf helper function call.
+*/
+   BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+   /*
+* pc1: BPF_FUNC_get_socket_cookie takes one parameter,
+* R1: sk_buff
+*/
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+   BPF_FUNC_get_socket_cookie),
+   /* pc2-4: save the address of the stored cookie to r7 for future usage */
+   BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_0, -8),
+   BPF_MOV64_REG(BPF_REG_7, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8),
+   /*
+*

[PATCH net-next v5 0/3] net: core: Two Helper function about socket information

2017-03-20 Thread Chenbo Feng

From: Chenbo Feng 

Introduce two eBPF helper functions to get the socket cookie and
socket uid for each packet. The helper functions are useful when
the *sk field inside sk_buff is not empty. They can be used in
socket- and uid-based traffic monitoring programs.

Change since V4:
* Use the current user namespace to get the uid instead of init_ns.
* Add the build setup of the example program to the Makefile.
* Change the naming style of the example program binaries.

Change since V3:
* Fixed some typos and incorrect comments in the sample program.
* Replaced raw insns with BPF_STX_XADD and added it to libbpf.h.
* Use a temp dir as the mount point instead and added a check for
  the user input string.
* Make the get uid helper function return the user namespace uid
  instead of the kuid.
* Return overflowuid instead of 0 when no uid information is found.

Change since V2:
* Add a sample program to demonstrate the usage of the helper functions.
* Moved the helper function proto invoking place.
* Add function header into tools/include
* Apply sk_to_full_sk() before getting uid.

Change since V1:
* Removed the unnecessary declarations and export command
* resolved conflict with master branch.
* Examine if the socket is a full socket before getting the uid.


Chenbo Feng (3):
  Add a helper function to get socket cookie in eBPF
  Add a eBPF helper function to retrieve socket uid
  A Sample of using socket cookie and uid for traffic monitoring

 include/linux/sock_diag.h|   1 +
 include/uapi/linux/bpf.h |  16 +-
 net/core/filter.c|  39 +
 net/core/sock_diag.c |   2 +-
 samples/bpf/Makefile |   3 +
 samples/bpf/cookie_uid_helper_example.c  | 223 +++
 samples/bpf/libbpf.h |  10 ++
 samples/bpf/run_cookie_uid_helper_example.sh |  14 ++
 tools/include/uapi/linux/bpf.h   |   4 +-
 9 files changed, 309 insertions(+), 3 deletions(-)
 create mode 100644 samples/bpf/cookie_uid_helper_example.c
 create mode 100755 samples/bpf/run_cookie_uid_helper_example.sh

-- 
2.7.4

SCTP MSG_MORE code

2017-03-20 Thread David Laight

Something needs to be done with SCTP MSG_MORE before the end of the rc cycle.
The current code is definitely broken.

I objected to the last 'fix' patch because it clears the flag in a place where
I don't think it is necessary to do so, so it could generate extra ethernet
frames.

David

Re: [PATCH net-next] stmmac: call stmmac_init_phy from stmmac_dvr_probe

2017-03-20 Thread Niklas Cassel

On 03/20/2017 06:43 PM, Florian Fainelli wrote:
> On 03/20/2017 10:29 AM, Niklas Cassel wrote:
>> From: Niklas Cassel 
>>
>> It is usually possible to do
>> ethtool -s autoneg on
>> so that you trigger an autoneg before calling
>> ip link set dev eth0 up
> This is completely driver specific and there is no guarantee for this to
> work universally across all device drivers because when your interface
> is brought down, the most sensible thing to expect in return is that
> your PHY is powered down (unless your interface participates in
> Wake-on-LAN).
>
>> However, stmmac returns -EBUSY if !netif_running.
>> The only reason for this appears to be that stmmac_init_phy
>> is called from stmmac_open instead of from stmmac_dvr_probe.
>>
>> Move stmmac_init_phy to stmmac_dvr_probe so that ethtool
>> works as soon as register_netdev has been called.
>> stmmac_check_ether_addr was also moved to probe,
>> so that the ordering doesn't change.
> Are you sure this is a good idea? There are many drivers that moved the
> PHY probe into ndo_open() for mainly two things:
>
> - phy_connect() starts the PHY state machine and starting the state
> machine without a network device running is kind of wasting cycles
>
> - if the interface is probed, but not used, you are keeping the Ethernet
> link running without being able to service packets, which is at best a
> waste of power

Hello Florian

Thank you for your input.
I can see the point in keeping phy_connect in ndo_open.

What I dislike is the -EBUSY from stmmac_ethtool_get_link_ksettings,
since this will create warnings in user space by our favorite monolith.
(Please don't flame me, I dislike it as much as you guys.)

[ WARNING ] systemd-udevd[236]: link_config: could not get ethtool features for eth0
[ WARNING ] systemd-udevd[236]: Could not set offload features of eth0: Device or resource busy

However, it is kind of sad that drivers are so inconsistent about what goes
in probe and what goes in ndo_open, which is tied together with the
whole mess of when certain ethtool commands work or do not work.

Do you know of a good way to avoid the -EBUSY in stmmac_ethtool_get_link_ksettings,
but still keep phy_connect in ndo_open?
The current code checks netif_running(), which checks __LINK_STATE_START,
which gets set by __dev_open().
stmmac_ethtool_get_link_ksettings also returns -ENODEV if ndev->phydev == NULL.
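To make the two failure cases concrete, here is a small userspace model of the checks described above. It is only a sketch: struct fake_netdev and the function name are hypothetical, and the order of the two checks is an assumption, since the mail only states that both return codes exist.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device. */
struct fake_netdev {
	bool running;   /* models netif_running(), i.e. __LINK_STATE_START */
	void *phydev;   /* NULL until the PHY has been connected */
};

/* Model of the precondition checks before link settings are reported. */
static int get_link_ksettings_check(const struct fake_netdev *ndev)
{
	if (ndev->phydev == NULL)
		return -ENODEV;
	if (!ndev->running)
		return -EBUSY;
	return 0;
}
```

In this model, a device whose PHY is connected at probe time but which has not been opened yet hits the -EBUSY branch, which is the case behind the udev warnings quoted above.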

[PATCH net-next 1/2] net: vrf: performance improvements for IPv4

2017-03-20 Thread David Ahern

The VRF driver allows users to implement device based features for an
entire domain. For example, a qdisc or netfilter rules can be attached
to a VRF device or tcpdump can be used to view packets for all devices
in the L3 domain.

The device-based features come with a performance penalty, most
notably in the Tx path. The VRF driver uses the l3mdev_l3_out hook
to switch the dst on an skb to its private dst. This allows the skb
to traverse the xmit stack with the device set to the VRF device
which in turn enables the netfilter and qdisc features. The VRF
driver then performs the FIB lookup again and reinserts the packet.

This patch avoids the redirect for IPv4 packets if a qdisc has not
been attached to a VRF device which is the default config. In this
case the netfilter hooks and network taps are directly traversed in
the l3mdev_l3_out handler. If a qdisc is attached to a VRF device,
then the redirect using the vrf dst is done.

Additional overhead is removed by only checking packet taps if a
socket is open on the device (vrf_dev->ptype_all list is not empty).
Packet sockets bound to any device will still get a copy of the
packet via the real ingress or egress interface.

The end result of this change is a decrease in the overhead of VRF
for the default, baseline case (i.e., no netfilter rules, no packet
sockets, no qdisc) to ~3% for UDP, which has a lookup per packet, and
< 1% overhead for connected sockets that leverage early demux and
avoid FIB lookups.

Signed-off-by: David Ahern 
---
 drivers/net/vrf.c | 106 --
 1 file changed, 96 insertions(+), 10 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 7f28021d9d93..cdf7253ae89e 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -104,6 +104,23 @@ static void vrf_get_stats64(struct net_device *dev,
}
 }
 
+/* by default VRF devices do not have a qdisc and are expected
+ * to be created with only a single queue.
+ */
+static bool qdisc_tx_is_default(const struct net_device *dev)
+{
+   struct netdev_queue *txq;
+   struct Qdisc *qdisc;
+
+   if (dev->num_tx_queues > 1)
+   return false;
+
+   txq = netdev_get_tx_queue(dev, 0);
+   qdisc = rcu_access_pointer(txq->qdisc);
+
+   return !qdisc->enqueue;
+}
+
 /* Local traffic destined to local address. Reinsert the packet to rx
  * path, similar to loopback handling.
  */
@@ -357,6 +374,29 @@ static netdev_tx_t vrf_xmit(struct sk_buff *skb, struct 
net_device *dev)
return ret;
 }
 
+static int vrf_finish_direct(struct net *net, struct sock *sk,
+struct sk_buff *skb)
+{
+   struct net_device *vrf_dev = skb->dev;
+
+   if (!list_empty(&vrf_dev->ptype_all) &&
+   likely(skb_headroom(skb) >= ETH_HLEN)) {
+   struct ethhdr *eth = (struct ethhdr *)skb_push(skb, ETH_HLEN);
+
+   ether_addr_copy(eth->h_source, vrf_dev->dev_addr);
+   eth_zero_addr(eth->h_dest);
+   eth->h_proto = skb->protocol;
+
+   rcu_read_lock_bh();
+   dev_queue_xmit_nit(skb, vrf_dev);
+   rcu_read_unlock_bh();
+
+   skb_pull(skb, ETH_HLEN);
+   }
+
+   return 1;
+}
+
 #if IS_ENABLED(CONFIG_IPV6)
 /* modelled after ip6_finish_output2 */
 static int vrf_finish_output6(struct net *net, struct sock *sk,
@@ -607,18 +647,13 @@ static int vrf_output(struct net *net, struct sock *sk, 
struct sk_buff *skb)
  * packet to go through device based features such as qdisc, netfilter
  * hooks and packet sockets with skb->dev set to vrf device.
  */
-static struct sk_buff *vrf_ip_out(struct net_device *vrf_dev,
- struct sock *sk,
- struct sk_buff *skb)
+static struct sk_buff *vrf_ip_out_redirect(struct net_device *vrf_dev,
+  struct sk_buff *skb)
 {
struct net_vrf *vrf = netdev_priv(vrf_dev);
struct dst_entry *dst = NULL;
struct rtable *rth;
 
-   /* don't divert multicast */
-   if (ipv4_is_multicast(ip_hdr(skb)->daddr))
-   return skb;
-
rcu_read_lock();
 
rth = rcu_dereference(vrf->rth);
@@ -640,6 +675,55 @@ static struct sk_buff *vrf_ip_out(struct net_device 
*vrf_dev,
return skb;
 }
 
+static int vrf_output_direct(struct net *net, struct sock *sk,
+struct sk_buff *skb)
+{
+   skb->protocol = htons(ETH_P_IP);
+
+   return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
+   net, sk, skb, NULL, skb->dev,
+   vrf_finish_direct,
+   !(IPCB(skb)->flags & IPSKB_REROUTED));
+}
+
+static struct sk_buff *vrf_ip_out_direct(struct net_device *vrf_dev,
+struct sock *sk,
+struct sk_buff *skb)
+{
+   struct net *net =

[PATCH net-next 0/2] net: vrf: performance improvements

2017-03-20 Thread David Ahern

Device based features for VRF such as qdisc, netfilter and packet
captures are implemented by switching the dst on skbuffs to its per-VRF
dst. This has the effect of controlling the output function, which points
to a function in the VRF driver. [1] The skb proceeds down the stack with
dst->dev pointing to the VRF device. Netfilter, qdisc and tc rules and
network taps are evaluated based on this device. Finally, the skb makes
it to the vrf_xmit function which resets the dst based on a FIB lookup.

The feature comes at a cost: between 5 and 10% depending on the test (TCP vs
UDP, stream vs RR and IPv4 vs IPv6). The main cost is requiring a FIB
lookup in the VRF driver for each packet sent through it. The FIB lookup
is required because the real dst gets dropped so that the skb can
traverse the stack with dst->dev set to the VRF device.

All of that is really driven by the qdisc and not replicating the
processing of __dev_queue_xmit if a qdisc is set up on the device. But,
VRF devices by default do not have a qdisc and really have no need for
multiple Tx queues. This means the performance overhead is inflicted upon
all users for the potential use case of a qdisc being configured.

The overhead can be avoided by checking if the default configuration
applies to a specific VRF device before switching the dst. If a device
does not have a qdisc, the pass through netfilter hooks and packet taps
can be done inline without dropping the dst and thus avoiding the
performance penalty. With this change the performance overhead of VRF drops
to between negligible (within run-over-run variance) and 3%, depending on
test type.
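The decision described in the paragraph above can be sketched as a userspace model. The struct and function names here are hypothetical; the logic follows the qdisc_tx_is_default() check and the direct/redirect split that the patches in this series introduce.

```c
#include <stdbool.h>

enum tx_path { TX_DIRECT, TX_REDIRECT };

/* Hypothetical stand-in for the VRF net_device state that matters here. */
struct fake_vrf_dev {
	unsigned int num_tx_queues;
	bool qdisc_has_enqueue;   /* false for the default "noqueue" setup */
};

/* Default config: a single Tx queue and no qdisc with an enqueue handler. */
static bool qdisc_is_default(const struct fake_vrf_dev *dev)
{
	if (dev->num_tx_queues > 1)
		return false;
	return !dev->qdisc_has_enqueue;
}

/* Keep the real dst (direct) unless a qdisc forces the redirect path. */
static enum tx_path choose_tx_path(const struct fake_vrf_dev *dev)
{
	return qdisc_is_default(dev) ? TX_DIRECT : TX_REDIRECT;
}
```

Only devices that opt in to a qdisc take the redirect path and pay for the second FIB lookup.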

netperf performance comparison for 3 cases:
1. L3_MASTER_DEVICE compiled out
2. VRF with this patch set
3. current VRF code

IPv4

         no-l3mdev   new-vrf   old-vrf
TCP_RR       28778    28938*     27169
TCP_CRR      10706    10490       9770
UDP_RR       30750    29813      29256

* Although higher in the final run used for submitting this patch set, I
  think what this really represents is a negligible performance overhead for
  VRF with this change (i.e., within the +-1% variance of runs). Most
  notably the FIB lookups in the Tx path are avoided for TCP_RR.

IPv6

         no-l3mdev   new-vrf   old-vrf
TCP_RR       29495    29432      27794
TCP_CRR      10520    10338       9870
UDP_RR       26137    27019*     26511

* UDP is consistently better with VRF for two reasons:
  1. Source address selection with L3 domains considers fewer
     addresses, since only addresses on interfaces in the domain are
     considered for the selection. Specifically, perf-top shows
     ipv6_get_saddr_eval, ipv6_dev_get_saddr and __ipv6_dev_get_saddr
     running much lower with vrf than without.

  2. The VRF table contains all routes (i.e, there are no separate local
 and main tables per VRF). That means ip6_pol_route_output only has 1
 lookup for VRF where it does 2 without it (1 in the local table and 1
 in the main table).
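The percentages quoted in this cover letter can be rechecked from the netperf numbers with a one-line helper. This is just arithmetic for the reader, not part of the series; the function name is made up, and the inputs assume the IPv4 rows read 28778/28938/27169 (TCP_RR), 10706/10490/9770 (TCP_CRR) and 30750/29813/29256 (UDP_RR).

```c
/* Relative throughput loss of 'measured' against 'baseline', in percent. */
static double overhead_pct(double baseline, double measured)
{
	return (baseline - measured) / baseline * 100.0;
}
```

For example, overhead_pct(30750, 29813) is roughly 3.0 for IPv4 UDP_RR with the new code, while overhead_pct(10706, 9770) is roughly 8.7 for IPv4 TCP_CRR with the old code, which is in line with the "~3%" and "between 5 and 10%" figures above.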

[1] http://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf

David Ahern (2):
  net: vrf: performance improvements for IPv4
  net: vrf: performance improvements for IPv6

 drivers/net/vrf.c | 172 +++---
 1 file changed, 152 insertions(+), 20 deletions(-)

-- 
2.1.4

[PATCH net-next 2/2] net: vrf: performance improvements for IPv6

2017-03-20 Thread David Ahern

The VRF driver allows users to implement device based features for an
entire domain. For example, a qdisc or netfilter rules can be attached
to a VRF device or tcpdump can be used to view packets for all devices
in the L3 domain.

The device-based features come with a performance penalty, most
notably in the Tx path. The VRF driver uses the l3mdev_l3_out hook
to switch the dst on an skb to its private dst. This allows the skb
to traverse the xmit stack with the device set to the VRF device
which in turn enables the netfilter and qdisc features. The VRF
driver then performs the FIB lookup again and reinserts the packet.

This patch avoids the redirect for IPv6 packets if a qdisc has not
been attached to a VRF device which is the default config. In this
case the netfilter hooks and network taps are directly traversed in
the l3mdev_l3_out handler. If a qdisc is attached to a VRF device,
then the redirect using the vrf dst is done.

Additional overhead is removed by only checking packet taps if a
socket is open on the device (vrf_dev->ptype_all list is not empty).
Packet sockets bound to any device will still get a copy of the
packet via the real ingress or egress interface.

The end result of this change is a decrease in the overhead of VRF
for the default, baseline case (i.e., no netfilter rules, no packet
sockets, no qdisc). Results range from a +3% improvement for UDP, which
has a lookup per packet (VRF being better than no l3mdev), to a ~2% loss
for TCP_CRR, which connects a socket for each request-response.

Signed-off-by: David Ahern 
---
 drivers/net/vrf.c | 66 ++-
 1 file changed, 56 insertions(+), 10 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index cdf7253ae89e..4140ff878d63 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -445,18 +445,13 @@ static int vrf_output6(struct net *net, struct sock *sk, 
struct sk_buff *skb)
  * packet to go through device based features such as qdisc, netfilter
  * hooks and packet sockets with skb->dev set to vrf device.
  */
-static struct sk_buff *vrf_ip6_out(struct net_device *vrf_dev,
-  struct sock *sk,
-  struct sk_buff *skb)
+static struct sk_buff *vrf_ip6_out_redirect(struct net_device *vrf_dev,
+   struct sk_buff *skb)
 {
struct net_vrf *vrf = netdev_priv(vrf_dev);
struct dst_entry *dst = NULL;
struct rt6_info *rt6;
 
-   /* don't divert link scope packets */
-   if (rt6_need_strict(&ipv6_hdr(skb)->daddr))
-   return skb;
-
rcu_read_lock();
 
rt6 = rcu_dereference(vrf->rt6);
@@ -478,6 +473,55 @@ static struct sk_buff *vrf_ip6_out(struct net_device *vrf_dev,
return skb;
 }
 
+static int vrf_output6_direct(struct net *net, struct sock *sk,
+ struct sk_buff *skb)
+{
+   skb->protocol = htons(ETH_P_IPV6);
+
+   return NF_HOOK_COND(NFPROTO_IPV6, NF_INET_POST_ROUTING,
+   net, sk, skb, NULL, skb->dev,
+   vrf_finish_direct,
+   !(IPCB(skb)->flags & IPSKB_REROUTED));
+}
+
+static struct sk_buff *vrf_ip6_out_direct(struct net_device *vrf_dev,
+ struct sock *sk,
+ struct sk_buff *skb)
+{
+   struct net *net = dev_net(vrf_dev);
+   int err;
+
+   skb->dev = vrf_dev;
+
+   err = nf_hook(NFPROTO_IPV6, NF_INET_LOCAL_OUT, net, sk,
+ skb, NULL, vrf_dev, vrf_output6_direct);
+
+   if (likely(err == 1))
+   err = vrf_output6_direct(net, sk, skb);
+
+   /* reset skb device */
+   if (likely(err == 1))
+   nf_reset(skb);
+   else
+   skb = NULL;
+
+   return skb;
+}
+
+static struct sk_buff *vrf_ip6_out(struct net_device *vrf_dev,
+  struct sock *sk,
+  struct sk_buff *skb)
+{
+   /* don't divert link scope packets */
+   if (rt6_need_strict(&ipv6_hdr(skb)->daddr))
+   return skb;
+
+   if (qdisc_tx_is_default(vrf_dev))
+   return vrf_ip6_out_direct(vrf_dev, sk, skb);
+
+   return vrf_ip6_out_redirect(vrf_dev, skb);
+}
+
 /* holding rtnl */
 static void vrf_rt6_release(struct net_device *dev, struct net_vrf *vrf)
 {
@@ -1064,9 +1108,11 @@ static struct sk_buff *vrf_ip6_rcv(struct net_device *vrf_dev,
skb->dev = vrf_dev;
skb->skb_iif = vrf_dev->ifindex;
 
-   skb_push(skb, skb->mac_len);
-   dev_queue_xmit_nit(skb, vrf_dev);
-   skb_pull(skb, skb->mac_len);
+   if (!list_empty(&vrf_dev->ptype_all)) {
+   skb_push(skb, skb->mac_len);
+   dev_queue_xmit_nit(skb, vrf_dev);
+   skb_pull(skb, skb->mac_len);
+   }

[PATCH 4.10 34/63] team: use ETH_MAX_MTU as max mtu

2017-03-20 Thread Greg Kroah-Hartman

4.10-stable review patch.  If anyone has any objections, please let me know.

--

From: Jarod Wilson 


[ Upstream commit 3331aa378e9bcbd0d16de9034b0c20f4050e26b4 ]

This restores the ability to set a team device's mtu to anything higher
than 1500. Similar to the reported issue with bonding, the team driver
calls ether_setup(), which sets an initial max_mtu of 1500, while the
underlying hardware can handle something much larger. Just set it to
ETH_MAX_MTU to support all possible values, and the limitations of the
underlying devices will prevent setting anything too large.
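
The effect can be illustrated with a small userspace model of the MTU
range check. This is a sketch under assumptions, not kernel code: the
model_* names are hypothetical, and the constants mirror the values of
ETH_MIN_MTU (68), ETH_DATA_LEN (1500) and ETH_MAX_MTU (0xFFFF) as I
recall them from the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed values, mirroring the kernel's Ethernet MTU constants. */
#define MODEL_ETH_MIN_MTU	68u
#define MODEL_ETH_DATA_LEN	1500u
#define MODEL_ETH_MAX_MTU	0xFFFFu

/* Hypothetical, reduced view of a net_device's MTU limits. */
struct model_dev {
	unsigned int min_mtu;
	unsigned int max_mtu;
};

/* Models the defaults that ether_setup() leaves on a device. */
static void model_ether_setup(struct model_dev *dev)
{
	dev->min_mtu = MODEL_ETH_MIN_MTU;
	dev->max_mtu = MODEL_ETH_DATA_LEN;
}

/* Models the core range check applied before an MTU change. */
static bool model_mtu_in_range(const struct model_dev *dev,
			       unsigned int new_mtu)
{
	return new_mtu >= dev->min_mtu && new_mtu <= dev->max_mtu;
}

/* The team device and every underlying port must accept the new MTU. */
static bool model_team_change_mtu(const struct model_dev *team,
				  const struct model_dev *ports,
				  size_t n_ports, unsigned int new_mtu)
{
	size_t i;

	assert(team != NULL);
	assert(ports != NULL || n_ports == 0);

	if (!model_mtu_in_range(team, new_mtu))
		return false;

	for (i = 0; i < n_ports; i++)
		if (!model_mtu_in_range(&ports[i], new_mtu))
			return false;

	return true;
}
```

With only the ether_setup() defaults, a request for 9000 is rejected at
the team device itself. Once max_mtu is raised to the model's
ETH_MAX_MTU value, the same request succeeds as long as every port's
own limit allows it.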

Fixes: 91572088e3fd ("net: use core MTU range checking in core net infra")
CC: Cong Wang 
CC: Jiri Pirko 
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/net/team/team.c |1 +
 1 file changed, 1 insertion(+)

--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -2075,6 +2075,7 @@ static int team_dev_type_check_change(st
 static void team_setup(struct net_device *dev)
 {
ether_setup(dev);
+   dev->max_mtu = ETH_MAX_MTU;
 
dev->netdev_ops = &team_netdev_ops;
dev->ethtool_ops = &team_ethtool_ops;
