date:20170516

Re: [patch net-next v3 06/10] net: sched: introduce helpers to work with filter chains

2017-05-16 Thread Jiri Pirko

Wed, May 17, 2017 at 12:17:11AM CEST, xiyou.wangc...@gmail.com wrote:
>On Tue, May 16, 2017 at 10:27 AM, Jiri Pirko  wrote:
>> +static struct tcf_proto *tcf_chain_tp_prev(struct tcf_chain_info 
>> *chain_info)
>> +{
>> +   return rtnl_dereference(*chain_info->pprev);
>> +}
>> +
>> +static void tcf_chain_tp_insert(struct tcf_chain *chain,
>> +   struct tcf_chain_info *chain_info,
>> +   struct tcf_proto *tp)
>> +{
>> +   if (chain->p_filter_chain &&
>> +   *chain_info->pprev == chain->filter_chain)
>> +   *chain->p_filter_chain = tp;
>> +   RCU_INIT_POINTER(tp->next, rtnl_dereference(*chain_info->pprev));
>
>Use tcf_chain_tp_prev()?

Ok. Will do that.

>
>
>> +   rcu_assign_pointer(*chain_info->pprev, tp);
>> +}
>> +
>> +static void tcf_chain_tp_remove(struct tcf_chain *chain,
>> +   struct tcf_chain_info *chain_info,
>> +   struct tcf_proto *tp)
>> +{
>> +   struct tcf_proto *next = rtnl_dereference(chain_info->next);
>> +
>> +   if (chain->p_filter_chain && tp == chain->filter_chain)
>> +   *chain->p_filter_chain = next;
>> +   RCU_INIT_POINTER(*chain_info->pprev, next);
>> +}
>> +
>> +static struct tcf_proto *tcf_chain_tp_find(struct tcf_chain *chain,
>> +  struct tcf_chain_info *chain_info,
>> +  u32 protocol, u32 prio,
>> +  bool prio_allocate)
>> +{
>> +   struct tcf_proto **pprev;
>> +   struct tcf_proto *tp;
>> +
>> +   /* Check the chain for existence of proto-tcf with this priority */
>> +   for (pprev = &chain->filter_chain;
>> +(tp = rtnl_dereference(*pprev)); pprev = &tp->next) {
>
>Use tcf_chain_tp_prev()?

Can't be done.

Re: [patch net-next v3 05/10] net: sched: move TC_H_MAJ macro call into tcf_auto_prio

2017-05-16 Thread Jiri Pirko

Wed, May 17, 2017 at 12:38:08AM CEST, xiyou.wangc...@gmail.com wrote:
>On Tue, May 16, 2017 at 2:03 PM, Jiri Pirko  wrote:
>> Tue, May 16, 2017 at 11:01:52PM CEST, xiyou.wangc...@gmail.com wrote:
>>>On Tue, May 16, 2017 at 10:27 AM, Jiri Pirko  wrote:
>>>> From: Jiri Pirko 
>>>>
>>>> Call the helper from the function rather than to always adjust the
>>>> return value of the function.
>>>
>>>And rename the function name to reflect this change?
>>
>> ? What do you suggest?
>
>tcf_auto_major_prio()?

That makes no sense. prio is passed from user in upper 2 bytes (god
knows why but that is how it is). The helper returns prio, that's it.
Nothing major about it...
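
To illustrate the point about prio living in the upper 2 bytes, here is a
minimal userspace sketch. It assumes the mask values below mirror the ones in
the uapi header linux/pkt_sched.h; demo_split_info() is a hypothetical name
used only for this example and is not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the handle masks, assumed to match linux/pkt_sched.h. */
#define TC_H_MAJ_MASK 0xFFFF0000U
#define TC_H_MIN_MASK 0x0000FFFFU
#define TC_H_MAJ(h)   ((h) & TC_H_MAJ_MASK)
#define TC_H_MIN(h)   ((h) & TC_H_MIN_MASK)

/* Hypothetical helper: split a tcm_info-style word into its two halves. */
static void demo_split_info(uint32_t info, uint32_t *prio, uint32_t *protocol)
{
	assert(prio && protocol);
	*prio = TC_H_MAJ(info);     /* masked only, so it stays in the upper 16 bits */
	*protocol = TC_H_MIN(info); /* lower 16 bits */
}
```

The macro only masks, it does not shift, so the value it yields is still
positioned in the upper half of the word.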

Re: [patch net-next v3 02/10] net: sched: introduce tcf block infractructure

2017-05-16 Thread Jiri Pirko

Wed, May 17, 2017 at 12:34:04AM CEST, xiyou.wangc...@gmail.com wrote:
>On Tue, May 16, 2017 at 2:34 PM, David Miller  wrote:
>> From: Cong Wang 
>> Date: Tue, 16 May 2017 13:51:30 -0700
>>
>>> On Tue, May 16, 2017 at 10:27 AM, Jiri Pirko  wrote:
>>>> +int tcf_block_get(struct tcf_block **p_block,
>>>> + struct tcf_proto __rcu **p_filter_chain)
>>>> +{
>>>> +   struct tcf_block *block = kzalloc(sizeof(*block), GFP_KERNEL);
>>>> +
>>>> +   if (!block)
>>>> +   return -ENOMEM;
>>>> +   block->p_filter_chain = p_filter_chain;
>>>> +   *p_block = block;
>>>> +   return 0;
>>>> +}
>>>> +EXPORT_SYMBOL(tcf_block_get);
>>>
>>>
>>> XXX_get() is usually for refcnt'ing, here you only allocate
>>> a block, so please rename it to tcf_block_alloc().
>>
>> Later in the series he adds refcounting to these objects.
>>
>> He explained this to Jamal too.
>
>I have read all patches, unless I miss something, block itself
>is not refcn'ted, only chains are, so it makes no sense to get
>a block, right?

It's not in this series. I just prepare the design so later on I can
easily add the block sharing between qdiscs.

[PATCH net-next] net: fix __skb_try_recv_from_queue to return the old behavior

2017-05-16 Thread Andrei Vagin

This function has to return NULL in the error case, because there is a
separate error variable.

The offset has to be changed only if an skb is returned.

Cc: Paolo Abeni 
Cc: Eric Dumazet 
Cc: David S. Miller 
Fixes: 65101aeca522 ("net/sock: factor out dequeue/peek with offset code")
Signed-off-by: Andrei Vagin 
---
 net/core/datagram.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/net/core/datagram.c b/net/core/datagram.c
index a4592b4..bc46118 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -170,20 +170,21 @@ struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
  struct sk_buff **last)
 {
struct sk_buff *skb;
+   int _off = *off;
 
*last = queue->prev;
skb_queue_walk(queue, skb) {
if (flags & MSG_PEEK) {
-   if (*off >= skb->len && (skb->len || *off ||
+   if (_off >= skb->len && (skb->len || _off ||
 skb->peeked)) {
-   *off -= skb->len;
+   _off -= skb->len;
continue;
}
if (!skb->len) {
skb = skb_set_peeked(skb);
if (unlikely(IS_ERR(skb))) {
*err = PTR_ERR(skb);
-   return skb;
+   return NULL;
}
}
*peeked = 1;
@@ -193,6 +194,7 @@ struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
if (destructor)
destructor(sk, skb);
}
+   *off = _off;
return skb;
}
return NULL;
@@ -253,8 +255,6 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, 
unsigned int flags,
 
*peeked = 0;
do {
-   int _off = *off;
-
/* Again only user level code calls this function, so nothing
 * interrupt level will suddenly eat the receive_queue.
 *
@@ -263,8 +263,10 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, 
unsigned int flags,
 */
spin_lock_irqsave(&queue->lock, cpu_flags);
skb = __skb_try_recv_from_queue(sk, queue, flags, destructor,
-   peeked, &_off, err, last);
+   peeked, off, &error, last);
spin_unlock_irqrestore(&queue->lock, cpu_flags);
+   if (error)
+   goto no_packet;
if (skb)
return skb;
 
-- 
2.9.3
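
The "change the offset only if an skb is returned" rule from the patch above
can be sketched in userspace as follows. This is a simplified illustration,
not the kernel code: struct demo_pkt and demo_peek_at_offset() are
hypothetical names, a plain array stands in for the skb queue, and the
skb->peeked part of the condition is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an skb: only the length matters here. */
struct demo_pkt {
	int len;
};

/*
 * Walk pkts[0..n), skipping whole packets covered by *off, in the spirit of
 * the MSG_PEEK branch. *off is written back only when a packet is returned.
 */
static struct demo_pkt *demo_peek_at_offset(struct demo_pkt *pkts, size_t n,
					    int *off)
{
	int _off;
	size_t i;

	assert(off);
	_off = *off;                /* work on a local copy, as the patch does */
	for (i = 0; i < n; i++) {
		if (_off >= pkts[i].len && (pkts[i].len || _off)) {
			_off -= pkts[i].len;
			continue;
		}
		*off = _off;        /* commit only on success */
		return &pkts[i];
	}
	return NULL;                /* nothing found: *off is left untouched */
}
```

Because the caller's offset is untouched on the NULL path, a caller that
retries after waiting for more packets starts again from the offset it
originally asked for.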

[PATCH net-next V5 0/9] vhost_net rx batch dequeuing

2017-05-16 Thread Jason Wang

This series implements rx batching for vhost-net. It batches the
dequeuing from the skb_array exported by the underlying socket and
passes the skb back through msg_control to finish the userspace copy.
This is also a prerequisite for further batching on the rx path.

Tests show up to a 7.56% improvement in rx pps on top of batch
zeroing, and no obvious change in TCP_STREAM/TCP_RR results.

Please review.

Thanks

Changes from V4:
- drop batch zeroing patch
- renew the performance numbers
- move skb pointer array out of vhost_net structure

Changes from V3:
- add batch zeroing patch to fix the build warnings

Changes from V2:
- rebase to net-next HEAD
- use unconsume helpers to put skb back on releasing
- introduce and use vhost_net internal buffer helpers
- renew performance numbers on top of batch zeroing

Changes from V1:
- switch to use for() in __ptr_ring_consume_batched()
- rename peek_head_len_batched() to fetch_skbs()
- use skb_array_consume_batched() instead of
  skb_array_consume_batched_bh() since no consumer run in bh
- drop the lockless peeking patch since skb_array could be resized, so
  it's not safe to call lockless one

Jason Wang (8):
  skb_array: introduce skb_array_unconsume
  ptr_ring: introduce batch dequeuing
  skb_array: introduce batch dequeuing
  tun: export skb_array
  tap: export skb_array
  tun: support receiving skb through msg_control
  tap: support receiving skb from msg_control
  vhost_net: try batch dequing from skb array

Michael S. Tsirkin (1):
  ptr_ring: add ptr_ring_unconsume

 drivers/net/tap.c |  25 +++--
 drivers/net/tun.c |  31 ---
 drivers/vhost/net.c   | 128 +++---
 include/linux/if_tap.h|   5 ++
 include/linux/if_tun.h|   5 ++
 include/linux/ptr_ring.h  | 120 +++
 include/linux/skb_array.h |  31 +++
 7 files changed, 327 insertions(+), 18 deletions(-)

-- 
2.7.4

[PATCH net-next V5 2/9] skb_array: introduce skb_array_unconsume

2017-05-16 Thread Jason Wang

Signed-off-by: Jason Wang 
---
 include/linux/skb_array.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h
index f4dfade..79850b6 100644
--- a/include/linux/skb_array.h
+++ b/include/linux/skb_array.h
@@ -156,6 +156,12 @@ static void __skb_array_destroy_skb(void *ptr)
kfree_skb(ptr);
 }
 
+static inline void skb_array_unconsume(struct skb_array *a,
+  struct sk_buff **skbs, int n)
+{
+   ptr_ring_unconsume(&a->ring, (void **)skbs, n, __skb_array_destroy_skb);
+}
+
 static inline int skb_array_resize(struct skb_array *a, int size, gfp_t gfp)
 {
return ptr_ring_resize(&a->ring, size, gfp, __skb_array_destroy_skb);
-- 
2.7.4

[PATCH net-next V5 4/9] skb_array: introduce batch dequeuing

2017-05-16 Thread Jason Wang

Signed-off-by: Jason Wang 
---
 include/linux/skb_array.h | 25 +
 1 file changed, 25 insertions(+)

diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h
index 79850b6..35226cd 100644
--- a/include/linux/skb_array.h
+++ b/include/linux/skb_array.h
@@ -97,21 +97,46 @@ static inline struct sk_buff *skb_array_consume(struct 
skb_array *a)
return ptr_ring_consume(&a->ring);
 }
 
+static inline int skb_array_consume_batched(struct skb_array *a,
+   struct sk_buff **array, int n)
+{
+   return ptr_ring_consume_batched(&a->ring, (void **)array, n);
+}
+
 static inline struct sk_buff *skb_array_consume_irq(struct skb_array *a)
 {
return ptr_ring_consume_irq(&a->ring);
 }
 
+static inline int skb_array_consume_batched_irq(struct skb_array *a,
+   struct sk_buff **array, int n)
+{
+   return ptr_ring_consume_batched_irq(&a->ring, (void **)array, n);
+}
+
 static inline struct sk_buff *skb_array_consume_any(struct skb_array *a)
 {
return ptr_ring_consume_any(&a->ring);
 }
 
+static inline int skb_array_consume_batched_any(struct skb_array *a,
+   struct sk_buff **array, int n)
+{
+   return ptr_ring_consume_batched_any(&a->ring, (void **)array, n);
+}
+
+
 static inline struct sk_buff *skb_array_consume_bh(struct skb_array *a)
 {
return ptr_ring_consume_bh(&a->ring);
 }
 
+static inline int skb_array_consume_batched_bh(struct skb_array *a,
+  struct sk_buff **array, int n)
+{
+   return ptr_ring_consume_batched_bh(&a->ring, (void **)array, n);
+}
+
 static inline int __skb_array_len_with_tag(struct sk_buff *skb)
 {
if (likely(skb)) {
-- 
2.7.4

[PATCH net-next V5 5/9] tun: export skb_array

2017-05-16 Thread Jason Wang

This patch exports the skb_array through tun_get_skb_array(). Callers can
then manipulate the skb array directly.

Signed-off-by: Jason Wang 
---
 drivers/net/tun.c  | 13 +
 include/linux/if_tun.h |  5 +
 2 files changed, 18 insertions(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index bbd707b..3cbfc5c 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2626,6 +2626,19 @@ struct socket *tun_get_socket(struct file *file)
 }
 EXPORT_SYMBOL_GPL(tun_get_socket);
 
+struct skb_array *tun_get_skb_array(struct file *file)
+{
+   struct tun_file *tfile;
+
+   if (file->f_op != &tun_fops)
+   return ERR_PTR(-EINVAL);
+   tfile = file->private_data;
+   if (!tfile)
+   return ERR_PTR(-EBADFD);
+   return &tfile->tx_array;
+}
+EXPORT_SYMBOL_GPL(tun_get_skb_array);
+
 module_init(tun_init);
 module_exit(tun_cleanup);
 MODULE_DESCRIPTION(DRV_DESCRIPTION);
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index ed6da2e..bf9bdf4 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -19,6 +19,7 @@
 
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
 struct socket *tun_get_socket(struct file *);
+struct skb_array *tun_get_skb_array(struct file *file);
 #else
 #include 
 #include 
@@ -28,5 +29,9 @@ static inline struct socket *tun_get_socket(struct file *f)
 {
return ERR_PTR(-EINVAL);
 }
+static inline struct skb_array *tun_get_skb_array(struct file *f)
+{
+   return ERR_PTR(-EINVAL);
+}
 #endif /* CONFIG_TUN */
 #endif /* __IF_TUN_H */
-- 
2.7.4

[PATCH net-next V5 9/9] vhost_net: try batch dequing from skb array

2017-05-16 Thread Jason Wang

We used to dequeue one skb at a time from the skb_array during
recvmsg(). This can be inefficient because of poor cache utilization
and because the spinlock is taken for each packet. This patch batches
the dequeuing by calling the batch dequeuing helpers explicitly on the
exported skb array and passing the skb back through msg_control so that
the underlying socket can finish the userspace copy. Batch dequeuing is
also a prerequisite for further batching improvements on the receive path.

Tests were done by pktgen on tap with XDP1 in guest. Host is Intel(R)
Xeon(R) CPU E5-2650 0 @ 2.00GHz.

rx batch | pps

0   2.25Mpps
1   2.33Mpps (+3.56%)
4   2.33Mpps (+3.56%)
16  2.35Mpps (+4.44%)
64  2.42Mpps (+7.56%) <- Default rx batching
128 2.40Mpps (+6.67%)
256 2.38Mpps (+5.78%)

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c | 128 +---
 1 file changed, 122 insertions(+), 6 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f61f852..e3d7ea1 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -28,6 +28,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 
@@ -85,6 +87,13 @@ struct vhost_net_ubuf_ref {
struct vhost_virtqueue *vq;
 };
 
+#define VHOST_RX_BATCH 64
+struct vhost_net_buf {
+   struct sk_buff **queue;
+   int tail;
+   int head;
+};
+
 struct vhost_net_virtqueue {
struct vhost_virtqueue vq;
size_t vhost_hlen;
@@ -99,6 +108,8 @@ struct vhost_net_virtqueue {
/* Reference counting for outstanding ubufs.
 * Protected by vq mutex. Writers must also take device mutex. */
struct vhost_net_ubuf_ref *ubufs;
+   struct skb_array *rx_array;
+   struct vhost_net_buf rxq;
 };
 
 struct vhost_net {
@@ -117,6 +128,71 @@ struct vhost_net {
 
 static unsigned vhost_net_zcopy_mask __read_mostly;
 
+static void *vhost_net_buf_get_ptr(struct vhost_net_buf *rxq)
+{
+   if (rxq->tail != rxq->head)
+   return rxq->queue[rxq->head];
+   else
+   return NULL;
+}
+
+static int vhost_net_buf_get_size(struct vhost_net_buf *rxq)
+{
+   return rxq->tail - rxq->head;
+}
+
+static int vhost_net_buf_is_empty(struct vhost_net_buf *rxq)
+{
+   return rxq->tail == rxq->head;
+}
+
+static void *vhost_net_buf_consume(struct vhost_net_buf *rxq)
+{
+   void *ret = vhost_net_buf_get_ptr(rxq);
+   ++rxq->head;
+   return ret;
+}
+
+static int vhost_net_buf_produce(struct vhost_net_virtqueue *nvq)
+{
+   struct vhost_net_buf *rxq = &nvq->rxq;
+
+   rxq->head = 0;
+   rxq->tail = skb_array_consume_batched(nvq->rx_array, rxq->queue,
+ VHOST_RX_BATCH);
+   return rxq->tail;
+}
+
+static void vhost_net_buf_unproduce(struct vhost_net_virtqueue *nvq)
+{
+   struct vhost_net_buf *rxq = &nvq->rxq;
+
+   if (nvq->rx_array && !vhost_net_buf_is_empty(rxq)) {
+   skb_array_unconsume(nvq->rx_array, rxq->queue + rxq->head,
+   vhost_net_buf_get_size(rxq));
+   rxq->head = rxq->tail = 0;
+   }
+}
+
+static int vhost_net_buf_peek(struct vhost_net_virtqueue *nvq)
+{
+   struct vhost_net_buf *rxq = &nvq->rxq;
+
+   if (!vhost_net_buf_is_empty(rxq))
+   goto out;
+
+   if (!vhost_net_buf_produce(nvq))
+   return 0;
+
+out:
+   return __skb_array_len_with_tag(vhost_net_buf_get_ptr(rxq));
+}
+
+static void vhost_net_buf_init(struct vhost_net_buf *rxq)
+{
+   rxq->head = rxq->tail = 0;
+}
+
 static void vhost_net_enable_zcopy(int vq)
 {
vhost_net_zcopy_mask |= 0x1 << vq;
@@ -201,6 +277,7 @@ static void vhost_net_vq_reset(struct vhost_net *n)
n->vqs[i].ubufs = NULL;
n->vqs[i].vhost_hlen = 0;
n->vqs[i].sock_hlen = 0;
+   vhost_net_buf_init(&n->vqs[i].rxq);
}
 
 }
@@ -503,15 +580,14 @@ static void handle_tx(struct vhost_net *net)
mutex_unlock(&vq->mutex);
 }
 
-static int peek_head_len(struct sock *sk)
+static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
 {
-   struct socket *sock = sk->sk_socket;
struct sk_buff *head;
int len = 0;
unsigned long flags;
 
-   if (sock->ops->peek_len)
-   return sock->ops->peek_len(sock);
+   if (rvq->rx_array)
+   return vhost_net_buf_peek(rvq);
 
spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
head = skb_peek(&sk->sk_receive_queue);
@@ -537,10 +613,11 @@ static int sk_has_rx_data(struct sock *sk)
 
 static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
 {
+   struct vhost_net_virtqueue *rvq = &net->vqs[VHOST_NET_VQ_RX];
struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
struct vhost_virtqueue *vq = &nvq->vq;
unsigned long uninitialized_var(endtime);
-   int len = peek_head_len(sk);
+   int len = peek_head_len(rvq, sk);
 
if (!len &&

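The head/tail bookkeeping of the vhost_net_buf helpers above can be tried out
in userspace with a sketch like the following. All demo_* names are
hypothetical, a plain pointer array stands in for the skb_array, and the batch
size is shrunk to 4 (the patch uses VHOST_RX_BATCH = 64).

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_RX_BATCH 4

struct demo_buf {
	void *queue[DEMO_RX_BATCH];
	int head;   /* next entry to hand out */
	int tail;   /* number of entries filled by the last refill */
};

/* Refill from src[0..avail); stands in for skb_array_consume_batched(). */
static int demo_buf_produce(struct demo_buf *b, void **src, int avail)
{
	int i, n = avail < DEMO_RX_BATCH ? avail : DEMO_RX_BATCH;

	assert(n >= 0);
	for (i = 0; i < n; i++)
		b->queue[i] = src[i];
	b->head = 0;
	b->tail = n;
	return n;
}

static int demo_buf_is_empty(const struct demo_buf *b)
{
	return b->tail == b->head;
}

/* Unlike the helper in the patch, this sketch leaves head alone when empty. */
static void *demo_buf_consume(struct demo_buf *b)
{
	if (demo_buf_is_empty(b))
		return NULL;
	return b->queue[b->head++];
}
```

The buffer is refilled only once it has been drained, so head always restarts
from 0 and the lock on the underlying ring is taken once per batch instead of
once per packet.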
[PATCH net-next V5 7/9] tun: support receiving skb through msg_control

2017-05-16 Thread Jason Wang

This patch lets tun_recvmsg() receive an skb from its caller through
msg_control. vhost_net will be the first user.

Signed-off-by: Jason Wang 
---
 drivers/net/tun.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 3cbfc5c..f8041f9c 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1510,9 +1510,8 @@ static struct sk_buff *tun_ring_recv(struct tun_file 
*tfile, int noblock,
 
 static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
   struct iov_iter *to,
-  int noblock)
+  int noblock, struct sk_buff *skb)
 {
-   struct sk_buff *skb;
ssize_t ret;
int err;
 
@@ -1521,10 +1520,12 @@ static ssize_t tun_do_read(struct tun_struct *tun, 
struct tun_file *tfile,
if (!iov_iter_count(to))
return 0;
 
-   /* Read frames from ring */
-   skb = tun_ring_recv(tfile, noblock, &err);
-   if (!skb)
-   return err;
+   if (!skb) {
+   /* Read frames from ring */
+   skb = tun_ring_recv(tfile, noblock, &err);
+   if (!skb)
+   return err;
+   }
 
ret = tun_put_user(tun, tfile, skb, to);
if (unlikely(ret < 0))
@@ -1544,7 +1545,7 @@ static ssize_t tun_chr_read_iter(struct kiocb *iocb, 
struct iov_iter *to)
 
if (!tun)
return -EBADFD;
-   ret = tun_do_read(tun, tfile, to, file->f_flags & O_NONBLOCK);
+   ret = tun_do_read(tun, tfile, to, file->f_flags & O_NONBLOCK, NULL);
ret = min_t(ssize_t, ret, len);
if (ret > 0)
iocb->ki_pos = ret;
@@ -1646,7 +1647,8 @@ static int tun_recvmsg(struct socket *sock, struct msghdr 
*m, size_t total_len,
 SOL_PACKET, TUN_TX_TIMESTAMP);
goto out;
}
-   ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT);
+   ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT,
+ m->msg_control);
if (ret > (ssize_t)total_len) {
m->msg_flags |= MSG_TRUNC;
ret = flags & MSG_TRUNC ? ret : total_len;
-- 
2.7.4

[PATCH net-next V5 3/9] ptr_ring: introduce batch dequeuing

2017-05-16 Thread Jason Wang

This patch introduces a batched version of consuming: a consumer can
dequeue more than one pointer from the ring at a time. We don't care
about reordering of the reads here, so no compiler barrier is needed.

Signed-off-by: Jason Wang 
---
 include/linux/ptr_ring.h | 65 
 1 file changed, 65 insertions(+)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 796b90f..d8c97ec 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -278,6 +278,22 @@ static inline void *__ptr_ring_consume(struct ptr_ring *r)
return ptr;
 }
 
+static inline int __ptr_ring_consume_batched(struct ptr_ring *r,
+void **array, int n)
+{
+   void *ptr;
+   int i;
+
+   for (i = 0; i < n; i++) {
+   ptr = __ptr_ring_consume(r);
+   if (!ptr)
+   break;
+   array[i] = ptr;
+   }
+
+   return i;
+}
+
 /*
  * Note: resize (below) nests producer lock within consumer lock, so if you
  * call this in interrupt or BH context, you must disable interrupts/BH when
@@ -328,6 +344,55 @@ static inline void *ptr_ring_consume_bh(struct ptr_ring *r)
return ptr;
 }
 
+static inline int ptr_ring_consume_batched(struct ptr_ring *r,
+  void **array, int n)
+{
+   int ret;
+
+   spin_lock(&r->consumer_lock);
+   ret = __ptr_ring_consume_batched(r, array, n);
+   spin_unlock(&r->consumer_lock);
+
+   return ret;
+}
+
+static inline int ptr_ring_consume_batched_irq(struct ptr_ring *r,
+  void **array, int n)
+{
+   int ret;
+
+   spin_lock_irq(&r->consumer_lock);
+   ret = __ptr_ring_consume_batched(r, array, n);
+   spin_unlock_irq(&r->consumer_lock);
+
+   return ret;
+}
+
+static inline int ptr_ring_consume_batched_any(struct ptr_ring *r,
+  void **array, int n)
+{
+   unsigned long flags;
+   int ret;
+
+   spin_lock_irqsave(&r->consumer_lock, flags);
+   ret = __ptr_ring_consume_batched(r, array, n);
+   spin_unlock_irqrestore(&r->consumer_lock, flags);
+
+   return ret;
+}
+
+static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
+ void **array, int n)
+{
+   int ret;
+
+   spin_lock_bh(&r->consumer_lock);
+   ret = __ptr_ring_consume_batched(r, array, n);
+   spin_unlock_bh(&r->consumer_lock);
+
+   return ret;
+}
+
 /* Cast to structure type and call a function without discarding from FIFO.
  * Function must return a value.
  * Callers must take consumer_lock.
-- 
2.7.4
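
The batched consume loop from the patch above can be sketched in userspace as
follows. This is a simplified illustration under stated assumptions: struct
demo_ring and the demo_* functions are hypothetical, there is no locking, and
a NULL slot is taken to mean "empty".

```c
#include <assert.h>
#include <stddef.h>

struct demo_ring {
	void **queue;
	int size;
	int consumer;   /* index of the next slot to read */
};

/* Take one entry, or return NULL if the next slot is empty. */
static void *demo_ring_consume(struct demo_ring *r)
{
	void *ptr = r->queue[r->consumer];

	if (ptr) {
		r->queue[r->consumer] = NULL;
		if (++r->consumer >= r->size)
			r->consumer = 0;
	}
	return ptr;
}

/* Fill array[] with up to n entries and return how many were taken. */
static int demo_ring_consume_batched(struct demo_ring *r, void **array, int n)
{
	int i;

	assert(n >= 0);
	for (i = 0; i < n; i++) {
		void *ptr = demo_ring_consume(r);

		if (!ptr)
			break;
		array[i] = ptr;
	}
	return i;
}
```

The return value is the number of entries actually dequeued, which can be
smaller than n when the ring runs dry.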

[PATCH net-next V5 8/9] tap: support receiving skb from msg_control

2017-05-16 Thread Jason Wang

This patch lets tap_recvmsg() receive an skb from its caller through
msg_control. vhost_net will be the first user.

Signed-off-by: Jason Wang 
---
 drivers/net/tap.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index abdaf86..9af3239 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -824,15 +824,17 @@ static ssize_t tap_put_user(struct tap_queue *q,
 
 static ssize_t tap_do_read(struct tap_queue *q,
   struct iov_iter *to,
-  int noblock)
+  int noblock, struct sk_buff *skb)
 {
DEFINE_WAIT(wait);
-   struct sk_buff *skb;
ssize_t ret = 0;
 
if (!iov_iter_count(to))
return 0;
 
+   if (skb)
+   goto put;
+
while (1) {
if (!noblock)
prepare_to_wait(sk_sleep(&q->sk), &wait,
@@ -856,6 +858,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
if (!noblock)
finish_wait(sk_sleep(&q->sk), &wait);
 
+put:
if (skb) {
ret = tap_put_user(q, skb, to);
if (unlikely(ret < 0))
@@ -872,7 +875,7 @@ static ssize_t tap_read_iter(struct kiocb *iocb, struct 
iov_iter *to)
struct tap_queue *q = file->private_data;
ssize_t len = iov_iter_count(to), ret;
 
-   ret = tap_do_read(q, to, file->f_flags & O_NONBLOCK);
+   ret = tap_do_read(q, to, file->f_flags & O_NONBLOCK, NULL);
ret = min_t(ssize_t, ret, len);
if (ret > 0)
iocb->ki_pos = ret;
@@ -1155,7 +1158,8 @@ static int tap_recvmsg(struct socket *sock, struct msghdr 
*m,
int ret;
if (flags & ~(MSG_DONTWAIT|MSG_TRUNC))
return -EINVAL;
-   ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT);
+   ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT,
+ m->msg_control);
if (ret > total_len) {
m->msg_flags |= MSG_TRUNC;
ret = flags & MSG_TRUNC ? ret : total_len;
-- 
2.7.4

[PATCH net-next V5 6/9] tap: export skb_array

2017-05-16 Thread Jason Wang

This patch exports the skb_array through tap_get_skb_array(). Callers can
then manipulate the skb array directly.

Signed-off-by: Jason Wang 
---
 drivers/net/tap.c  | 13 +
 include/linux/if_tap.h |  5 +
 2 files changed, 18 insertions(+)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 4d4173d..abdaf86 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1193,6 +1193,19 @@ struct socket *tap_get_socket(struct file *file)
 }
 EXPORT_SYMBOL_GPL(tap_get_socket);
 
+struct skb_array *tap_get_skb_array(struct file *file)
+{
+   struct tap_queue *q;
+
+   if (file->f_op != &tap_fops)
+   return ERR_PTR(-EINVAL);
+   q = file->private_data;
+   if (!q)
+   return ERR_PTR(-EBADFD);
+   return &q->skb_array;
+}
+EXPORT_SYMBOL_GPL(tap_get_skb_array);
+
 int tap_queue_resize(struct tap_dev *tap)
 {
struct net_device *dev = tap->dev;
diff --git a/include/linux/if_tap.h b/include/linux/if_tap.h
index 3482c3c..4837157 100644
--- a/include/linux/if_tap.h
+++ b/include/linux/if_tap.h
@@ -3,6 +3,7 @@
 
 #if IS_ENABLED(CONFIG_TAP)
 struct socket *tap_get_socket(struct file *);
+struct skb_array *tap_get_skb_array(struct file *file);
 #else
 #include 
 #include 
@@ -12,6 +13,10 @@ static inline struct socket *tap_get_socket(struct file *f)
 {
return ERR_PTR(-EINVAL);
 }
+static inline struct skb_array *tap_get_skb_array(struct file *f)
+{
+   return ERR_PTR(-EINVAL);
+}
 #endif /* CONFIG_TAP */
 
 #include 
-- 
2.7.4

[PATCH net-next V5 1/9] ptr_ring: add ptr_ring_unconsume

2017-05-16 Thread Jason Wang

From: "Michael S. Tsirkin" 

Applications that consume a batch of entries in one go
can benefit from the ability to return some of them
to the ring.

Add an API for that, assuming there's space. If there's no space we
naturally can't do this and have to drop entries, but that implies the
ring is full, so we'd likely drop some anyway.

Signed-off-by: Michael S. Tsirkin 
Signed-off-by: Jason Wang 
---
 include/linux/ptr_ring.h | 55 
 1 file changed, 55 insertions(+)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 6b2e0dd..796b90f 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -403,6 +403,61 @@ static inline int ptr_ring_init(struct ptr_ring *r, int 
size, gfp_t gfp)
return 0;
 }
 
+/*
+ * Return entries into ring. Destroy entries that don't fit.
+ *
+ * Note: this is expected to be a rare slow path operation.
+ *
+ * Note: producer lock is nested within consumer lock, so if you
+ * resize you must make sure all uses nest correctly.
+ * In particular if you consume ring in interrupt or BH context, you must
+ * disable interrupts/BH when doing so.
+ */
+static inline void ptr_ring_unconsume(struct ptr_ring *r, void **batch, int n,
+ void (*destroy)(void *))
+{
+   unsigned long flags;
+   int head;
+
+   spin_lock_irqsave(&r->consumer_lock, flags);
+   spin_lock(&r->producer_lock);
+
+   if (!r->size)
+   goto done;
+
+   /*
+* Clean out buffered entries (for simplicity). This way following code
+* can test entries for NULL and if not assume they are valid.
+*/
+   head = r->consumer_head - 1;
+   while (likely(head >= r->consumer_tail))
+   r->queue[head--] = NULL;
+   r->consumer_tail = r->consumer_head;
+
+   /*
+* Go over entries in batch, start moving head back and copy entries.
+* Stop when we run into previously unconsumed entries.
+*/
+   while (n) {
+   head = r->consumer_head - 1;
+   if (head < 0)
+   head = r->size - 1;
+   if (r->queue[head]) {
+   /* This batch entry will have to be destroyed. */
+   goto done;
+   }
+   r->queue[head] = batch[--n];
+   r->consumer_tail = r->consumer_head = head;
+   }
+
+done:
+   /* Destroy all entries left in the batch. */
+   while (n)
+   destroy(batch[--n]);
+   spin_unlock(&r->producer_lock);
+   spin_unlock_irqrestore(&r->consumer_lock, flags);
+}
+
 static inline void **__ptr_ring_swap_queue(struct ptr_ring *r, void **queue,
   int size, gfp_t gfp,
   void (*destroy)(void *))
-- 
2.7.4

[PATCH net v3] net: x25: fix one potential use-after-free issue

2017-05-16 Thread linzhang

x25_init() does not properly unregister the related resources in its
error path. This can result in a kernel oops if x25_init() fails, so
add the proper unregister calls to the error path.

Also adjust the coding style and make x25_register_sysctl() return
failure properly.

Signed-off-by: linzhang 
---

changelog:

v1 -> v2:
* make x25_register_sysctl properly return failure

v2 -> v3:
* keep the same labels as v1
* fix missing semicolon

---
 include/net/x25.h|  4 ++--
 net/x25/af_x25.c | 24 
 net/x25/sysctl_net_x25.c |  5 -
 3 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/include/net/x25.h b/include/net/x25.h
index c383aa4..6d30a01 100644
--- a/include/net/x25.h
+++ b/include/net/x25.h
@@ -298,10 +298,10 @@ int x25_decode(struct sock *, struct sk_buff *, int *, 
int *, int *, int *,
 
 /* sysctl_net_x25.c */
 #ifdef CONFIG_SYSCTL
-void x25_register_sysctl(void);
+int x25_register_sysctl(void);
 void x25_unregister_sysctl(void);
 #else
-static inline void x25_register_sysctl(void) {};
+static inline int x25_register_sysctl(void) { return 0; };
 static inline void x25_unregister_sysctl(void) {};
 #endif /* CONFIG_SYSCTL */
 
diff --git a/net/x25/af_x25.c b/net/x25/af_x25.c
index 8b911c2..5a1a98d 100644
--- a/net/x25/af_x25.c
+++ b/net/x25/af_x25.c
@@ -1791,32 +1791,40 @@ void x25_kill_by_neigh(struct x25_neigh *nb)
 
 static int __init x25_init(void)
 {
-   int rc = proto_register(&x25_proto, 0);
+   int rc;
 
-   if (rc != 0)
+   rc = proto_register(&x25_proto, 0);
+   if (rc)
goto out;
 
rc = sock_register(&x25_family_ops);
-   if (rc != 0)
+   if (rc)
goto out_proto;
 
dev_add_pack(&x25_packet_type);
 
rc = register_netdevice_notifier(&x25_dev_notifier);
-   if (rc != 0)
+   if (rc)
goto out_sock;
 
-   pr_info("Linux Version 0.2\n");
+   rc = x25_register_sysctl();
+   if (rc)
+   goto out_dev;
 
-   x25_register_sysctl();
rc = x25_proc_init();
-   if (rc != 0)
-   goto out_dev;
+   if (rc)
+   goto out_sysctl;
+
+   pr_info("Linux Version 0.2\n");
+
 out:
return rc;
+out_sysctl:
+   x25_unregister_sysctl();
 out_dev:
unregister_netdevice_notifier(&x25_dev_notifier);
 out_sock:
+   dev_remove_pack(&x25_packet_type);
sock_unregister(AF_X25);
 out_proto:
proto_unregister(&x25_proto);
diff --git a/net/x25/sysctl_net_x25.c b/net/x25/sysctl_net_x25.c
index a06dfe1..ba078c8 100644
--- a/net/x25/sysctl_net_x25.c
+++ b/net/x25/sysctl_net_x25.c
@@ -73,9 +73,12 @@
{ },
 };
 
-void __init x25_register_sysctl(void)
+int __init x25_register_sysctl(void)
 {
x25_table_header = register_net_sysctl(&init_net, "net/x25", x25_table);
+   if (!x25_table_header)
+   return -ENOMEM;
+   return 0;
 }
 
 void x25_unregister_sysctl(void)
-- 
1.8.3.1
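
The error-path ordering that the x25 patch above restores (undo, in reverse
order, exactly the steps that succeeded) can be sketched in userspace like
this. The demo_* names and the fail_at parameter are hypothetical and exist
only to make the unwinding observable; this is not the x25 code.

```c
#include <assert.h>

#define DEMO_STEPS 3

static int demo_registered[DEMO_STEPS];   /* 1 while step i is registered */

/* Hypothetical registration step: fails when i == fail_at. */
static int demo_register(int i, int fail_at)
{
	assert(i >= 0 && i < DEMO_STEPS);
	if (i == fail_at)
		return -1;
	demo_registered[i] = 1;
	return 0;
}

static void demo_unregister(int i)
{
	demo_registered[i] = 0;
}

/* Register three resources; on failure undo, in reverse, what succeeded. */
static int demo_init(int fail_at)
{
	int rc;

	rc = demo_register(0, fail_at);
	if (rc)
		goto out;
	rc = demo_register(1, fail_at);
	if (rc)
		goto out_first;
	rc = demo_register(2, fail_at);
	if (rc)
		goto out_second;
	return 0;

out_second:
	demo_unregister(1);
out_first:
	demo_unregister(0);
out:
	return rc;
}
```

Each label undoes one earlier step and falls through to the next, so a failure
at any point leaves nothing registered.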

Re: [net-next,v2,1/3] net/sock: factor out dequeue/peek with offset code

2017-05-16 Thread Andrei Vagin

On Tue, May 16, 2017 at 11:20:13AM +0200, Paolo Abeni wrote:
> And update __sk_queue_drop_skb() to work on the specified queue.
> This will help the udp protocol to use an additional private
> rx queue in a later patch.

CRIU tests fail with this patch:

recvmsg(14, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="packet dgram 
right\0", iov_len=212960}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 
MSG_PEEK|MSG_DONTWAIT) = 19 <0.48>
writev(9, [{iov_base="\4\0\0\0", iov_len=4}, {iov_base="\10\t\20\23", 
iov_len=4}], 2) = 8 <0.85>
write(9, "packet dgram right\0", 19)= 19 <0.62>
recvmsg(14, {msg_namelen=0}, MSG_PEEK|MSG_DONTWAIT) = -1 EFAULT (Bad address) 
<0.46>

Without this patch, strace looks like this:
recvmsg(14, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="packet dgram right\0", 
iov_len=212960}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 
MSG_PEEK|MSG_DONTWAIT) = 19 <0.24>
writev(9, [{iov_base="\4\0\0\0", iov_len=4}, {iov_base="\10\t\20\23", 
iov_len=4}], 2) = 8 <0.37>
write(9, "packet dgram right\0", 19)= 19 <0.30>
recvmsg(14, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="packet dgram 
left\0", iov_len=212960}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 
MSG_PEEK|MSG_DONTWAIT) = 18 <0.23>
writev(9, [{iov_base="\4\0\0\0", iov_len=4}, {iov_base="\10\t\20\22", 
iov_len=4}], 2) = 8 <0.30>
write(9, "packet dgram left\0", 18) = 18 <0.30>
recvmsg(14, {msg_namelen=0}, MSG_PEEK|MSG_DONTWAIT) = -1 EAGAIN (Resource 
temporarily unavailable) <0.23>


https://travis-ci.org/avagin/criu/jobs/232990442

> 
> Signed-off-by: Paolo Abeni 
> Acked-by: Eric Dumazet 
> ---
>  include/linux/skbuff.h |  7 
>  include/net/sock.h |  4 +--
>  net/core/datagram.c| 90 
> --
>  3 files changed, 60 insertions(+), 41 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index a098d95..bfc7892 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -3056,6 +3056,13 @@ static inline void skb_frag_list_init(struct sk_buff 
> *skb)
>  
>  int __skb_wait_for_more_packets(struct sock *sk, int *err, long *timeo_p,
>   const struct sk_buff *skb);
> +struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
> +   struct sk_buff_head *queue,
> +   unsigned int flags,
> +   void (*destructor)(struct sock *sk,
> +struct sk_buff *skb),
> +   int *peeked, int *off, int *err,
> +   struct sk_buff **last);
>  struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned flags,
>   void (*destructor)(struct sock *sk,
>  struct sk_buff *skb),
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 66349e4..49d226f 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2035,8 +2035,8 @@ void sk_reset_timer(struct sock *sk, struct timer_list 
> *timer,
>  
>  void sk_stop_timer(struct sock *sk, struct timer_list *timer);
>  
> -int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb,
> - unsigned int flags,
> +int __sk_queue_drop_skb(struct sock *sk, struct sk_buff_head *sk_queue,
> + struct sk_buff *skb, unsigned int flags,
>   void (*destructor)(struct sock *sk,
>  struct sk_buff *skb));
>  int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index db1866f2..a4592b4 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -161,6 +161,43 @@ static struct sk_buff *skb_set_peeked(struct sk_buff 
> *skb)
>   return skb;
>  }
>  
> +struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
> +   struct sk_buff_head *queue,
> +   unsigned int flags,
> +   void (*destructor)(struct sock *sk,
> +struct sk_buff *skb),
> +   int *peeked, int *off, int *err,
> +   struct sk_buff **last)
> +{
> + struct sk_buff *skb;
> +
> + *last = queue->prev;
> + skb_queue_walk(queue, skb) {
> + if (flags & MSG_PEEK) {
> + if (*off >= skb->len && (skb->len || *off ||
> +  skb->peeked)) {
> + *off -= skb->len;
> + continue;
> + }
> + if (!skb->len) {
> + skb =

[PATCH net-next 6/6] net: phy: marvell: checkpatch - Fix remaining long lines

2017-05-16 Thread Andrew Lunn

Fold lines longer than 80 characters

Signed-off-by: Andrew Lunn 
---
 drivers/net/phy/marvell.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
index d510eda92af5..88cd97b44ba6 100644
--- a/drivers/net/phy/marvell.c
+++ b/drivers/net/phy/marvell.c
@@ -217,9 +217,11 @@ static int marvell_config_intr(struct phy_device *phydev)
int err;
 
if (phydev->interrupts == PHY_INTERRUPT_ENABLED)
-   err = phy_write(phydev, MII_M1011_IMASK, MII_M1011_IMASK_INIT);
+   err = phy_write(phydev, MII_M1011_IMASK,
+   MII_M1011_IMASK_INIT);
else
-   err = phy_write(phydev, MII_M1011_IMASK, MII_M1011_IMASK_CLEAR);
+   err = phy_write(phydev, MII_M1011_IMASK,
+   MII_M1011_IMASK_CLEAR);
 
return err;
 }
@@ -1394,7 +1396,8 @@ static int m88e1121_did_interrupt(struct phy_device 
*phydev)
return 0;
 }
 
-static void m88e1318_get_wol(struct phy_device *phydev, struct ethtool_wolinfo 
*wol)
+static void m88e1318_get_wol(struct phy_device *phydev,
+struct ethtool_wolinfo *wol)
 {
wol->supported = WAKE_MAGIC;
wol->wolopts = 0;
@@ -1410,7 +1413,8 @@ static void m88e1318_get_wol(struct phy_device *phydev, 
struct ethtool_wolinfo *
return;
 }
 
-static int m88e1318_set_wol(struct phy_device *phydev, struct ethtool_wolinfo 
*wol)
+static int m88e1318_set_wol(struct phy_device *phydev,
+   struct ethtool_wolinfo *wol)
 {
int err, oldpage, temp;
 
-- 
2.11.0

[PATCH net-next 0/6] net: phy: marvell: Checkpatch cleanup

2017-05-16 Thread Andrew Lunn

I will be contributing a few new features to the Marvell PHY driver
soon. Start by making the code mostly checkpatch clean. There should
not be any functional changes. Just comments set into the correct
format, missing blank lines, turn some comparisons around, and
refactoring to reduce indentation depth.

There is still one camel in the code, but it actually makes sense, so
leave it in peace.


Andrew Lunn (6):
  net: phy: Marvell: checkpatch - Comments
  net: phy: marvell: Checkpatch - Missing or extra blank lines
  net: phy: marvell: Checkpatch - assignments and comparisons
  net: phy: marvell: Refactor some bigger functions
  net: phy: marvell: Add helpers to get/set page
  net: phy: marvell: checkpatch - Fix remaining long lines

 drivers/net/phy/marvell.c | 636 +-
 1 file changed, 352 insertions(+), 284 deletions(-)

-- 
2.11.0

[PATCH net-next 3/6] net: phy: marvell: Checkpatch - assignments and comparisons

2017-05-16 Thread Andrew Lunn

Avoid multiple assignments
Comparisons should place the constant on the right side of the test

Signed-off-by: Andrew Lunn 
---
 drivers/net/phy/marvell.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
index f52656ec618f..e9632f576a24 100644
--- a/drivers/net/phy/marvell.c
+++ b/drivers/net/phy/marvell.c
@@ -1101,7 +1101,7 @@ static int marvell_read_status_page(struct phy_device 
*phydev, int page)
if (err)
return err;
 
-   if (AUTONEG_ENABLE == phydev->autoneg) {
+   if (phydev->autoneg == AUTONEG_ENABLE) {
status = phy_read(phydev, MII_M1011_PHY_STATUS);
if (status < 0)
return status;
@@ -1126,7 +1126,8 @@ static int marvell_read_status_page(struct phy_device 
*phydev, int page)
phydev->duplex = DUPLEX_HALF;
 
status = status & MII_M1011_PHY_STATUS_SPD_MASK;
-   phydev->pause = phydev->asym_pause = 0;
+   phydev->pause = 0;
+   phydev->asym_pause = 0;
 
switch (status) {
case MII_M1011_PHY_STATUS_1000:
@@ -1185,7 +1186,8 @@ static int marvell_read_status_page(struct phy_device 
*phydev, int page)
else
phydev->speed = SPEED_10;
 
-   phydev->pause = phydev->asym_pause = 0;
+   phydev->pause = 0;
+   phydev->asym_pause = 0;
phydev->lp_advertising = 0;
}
 
-- 
2.11.0

[PATCH net-next 1/6] net: phy: Marvell: checkpatch - Comments

2017-05-16 Thread Andrew Lunn

Use net style comment blocks, and wrap one block with long lines.

Signed-off-by: Andrew Lunn 
---
 drivers/net/phy/marvell.c | 28 +++-
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
index 272b051a0199..2aacbf8e0eb3 100644
--- a/drivers/net/phy/marvell.c
+++ b/drivers/net/phy/marvell.c
@@ -257,7 +257,8 @@ static int marvell_config_aneg(struct phy_device *phydev)
 
/* The Marvell PHY has an errata which requires
 * that certain registers get written in order
-* to restart autonegotiation */
+* to restart autonegotiation
+*/
err = phy_write(phydev, MII_BMCR, BMCR_RESET);
 
if (err < 0)
@@ -299,8 +300,7 @@ static int marvell_config_aneg(struct phy_device *phydev)
if (phydev->autoneg != AUTONEG_ENABLE) {
int bmcr;
 
-   /*
-* A write to speed/duplex bits (that is performed by
+   /* A write to speed/duplex bits (that is performed by
 * genphy_config_aneg() call above) must be followed by
 * a software reset. Otherwise, the write has no effect.
 */
@@ -359,8 +359,7 @@ static int m88e_config_aneg(struct phy_device *phydev)
 }
 
 #ifdef CONFIG_OF_MDIO
-/*
- * Set and/or override some configuration registers based on the
+/* Set and/or override some configuration registers based on the
  * marvell,reg-init property stored in the of_node for the phydev.
  *
  * marvell,reg-init = ,...;
@@ -1057,7 +1056,8 @@ static int marvell_update_link(struct phy_device *phydev, 
int fiber)
int status;
 
/* Use the generic register for copper link, or specific
-* register for fiber case */
+* register for fiber case
+*/
if (fiber) {
status = phy_read(phydev, MII_M1011_PHY_STATUS);
if (status < 0)
@@ -1092,7 +1092,8 @@ static int marvell_read_status_page(struct phy_device 
*phydev, int page)
int fiber;
 
/* Detect and update the link, but return if there
-* was an error */
+* was an error
+*/
if (page == MII_M_FIBER)
fiber = 1;
else
@@ -1217,12 +1218,13 @@ static int marvell_read_status(struct phy_device 
*phydev)
if (err < 0)
goto error;
 
-   /* If the fiber link is up, it is the selected and used link.
-* In this case, we need to stay in the fiber page.
-* Please to be careful about that, avoid to restore Copper page
-* in other functions which could break the behaviour
-* for some fiber phy like 88E1512.
-* */
+   /* If the fiber link is up, it is the selected and
+* used link. In this case, we need to stay in the
+* fiber page. Please to be careful about that, avoid
+* to restore Copper page in other functions which
+* could break the behaviour for some fiber phy like
+* 88E1512.
+*/
if (phydev->link)
return 0;
 
-- 
2.11.0

[PATCH net-next 4/6] net: phy: marvell: Refactor some bigger functions

2017-05-16 Thread Andrew Lunn

Break up big functions by using a number of smaller helper
functions. This solves some of the "line over 80 characters" warnings
by reducing the indentation level.

Signed-off-by: Andrew Lunn 
---
 drivers/net/phy/marvell.c | 484 ++
 1 file changed, 271 insertions(+), 213 deletions(-)

diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
index e9632f576a24..b84380db945e 100644
--- a/drivers/net/phy/marvell.c
+++ b/drivers/net/phy/marvell.c
@@ -695,102 +695,133 @@ static int m88e3016_config_init(struct phy_device 
*phydev)
return marvell_config_init(phydev);
 }
 
-static int m88e_config_init(struct phy_device *phydev)
+static int m88e_config_init_rgmii(struct phy_device *phydev)
 {
int err;
int temp;
 
-   if (phy_interface_is_rgmii(phydev)) {
-   temp = phy_read(phydev, MII_M_PHY_EXT_CR);
-   if (temp < 0)
-   return temp;
+   temp = phy_read(phydev, MII_M_PHY_EXT_CR);
+   if (temp < 0)
+   return temp;
 
-   if (phydev->interface == PHY_INTERFACE_MODE_RGMII_ID) {
-   temp |= (MII_M_RX_DELAY | MII_M_TX_DELAY);
-   } else if (phydev->interface == PHY_INTERFACE_MODE_RGMII_RXID) {
-   temp &= ~MII_M_TX_DELAY;
-   temp |= MII_M_RX_DELAY;
-   } else if (phydev->interface == PHY_INTERFACE_MODE_RGMII_TXID) {
-   temp &= ~MII_M_RX_DELAY;
-   temp |= MII_M_TX_DELAY;
-   }
+   if (phydev->interface == PHY_INTERFACE_MODE_RGMII_ID) {
+   temp |= (MII_M_RX_DELAY | MII_M_TX_DELAY);
+   } else if (phydev->interface == PHY_INTERFACE_MODE_RGMII_RXID) {
+   temp &= ~MII_M_TX_DELAY;
+   temp |= MII_M_RX_DELAY;
+   } else if (phydev->interface == PHY_INTERFACE_MODE_RGMII_TXID) {
+   temp &= ~MII_M_RX_DELAY;
+   temp |= MII_M_TX_DELAY;
+   }
 
-   err = phy_write(phydev, MII_M_PHY_EXT_CR, temp);
-   if (err < 0)
-   return err;
+   err = phy_write(phydev, MII_M_PHY_EXT_CR, temp);
+   if (err < 0)
+   return err;
 
-   temp = phy_read(phydev, MII_M_PHY_EXT_SR);
-   if (temp < 0)
-   return temp;
+   temp = phy_read(phydev, MII_M_PHY_EXT_SR);
+   if (temp < 0)
+   return temp;
 
-   temp &= ~(MII_M_HWCFG_MODE_MASK);
+   temp &= ~(MII_M_HWCFG_MODE_MASK);
 
-   if (temp & MII_M_HWCFG_FIBER_COPPER_RES)
-   temp |= MII_M_HWCFG_MODE_FIBER_RGMII;
-   else
-   temp |= MII_M_HWCFG_MODE_COPPER_RGMII;
+   if (temp & MII_M_HWCFG_FIBER_COPPER_RES)
+   temp |= MII_M_HWCFG_MODE_FIBER_RGMII;
+   else
+   temp |= MII_M_HWCFG_MODE_COPPER_RGMII;
 
-   err = phy_write(phydev, MII_M_PHY_EXT_SR, temp);
-   if (err < 0)
-   return err;
-   }
+   return phy_write(phydev, MII_M_PHY_EXT_SR, temp);
+}
 
-   if (phydev->interface == PHY_INTERFACE_MODE_SGMII) {
-   temp = phy_read(phydev, MII_M_PHY_EXT_SR);
-   if (temp < 0)
-   return temp;
+static int m88e_config_init_sgmii(struct phy_device *phydev)
+{
+   int err;
+   int temp;
 
-   temp &= ~(MII_M_HWCFG_MODE_MASK);
-   temp |= MII_M_HWCFG_MODE_SGMII_NO_CLK;
-   temp |= MII_M_HWCFG_FIBER_COPPER_AUTO;
+   temp = phy_read(phydev, MII_M_PHY_EXT_SR);
+   if (temp < 0)
+   return temp;
 
-   err = phy_write(phydev, MII_M_PHY_EXT_SR, temp);
-   if (err < 0)
-   return err;
+   temp &= ~(MII_M_HWCFG_MODE_MASK);
+   temp |= MII_M_HWCFG_MODE_SGMII_NO_CLK;
+   temp |= MII_M_HWCFG_FIBER_COPPER_AUTO;
 
-   /* make sure copper is selected */
-   err = phy_read(phydev, MII_M1145_PHY_EXT_ADDR_PAGE);
-   if (err < 0)
-   return err;
+   err = phy_write(phydev, MII_M_PHY_EXT_SR, temp);
+   if (err < 0)
+   return err;
 
-   err = phy_write(phydev, MII_M1145_PHY_EXT_ADDR_PAGE,
-   err & (~0xff));
-   if (err < 0)
-   return err;
-   }
+   /* make sure copper is selected */
+   err = phy_read(phydev, MII_M1145_PHY_EXT_ADDR_PAGE);
+   if (err < 0)
+   return err;
 
-   if (phydev->interface == PHY_INTERFACE_MODE_RTBI) {
-   temp = phy_read(phydev, MII_M_PHY_EXT_CR);
-   if (temp < 0)
-   return temp;
-

[PATCH net-next 2/6] net: phy: marvell: Checkpatch - Missing or extra blank lines

2017-05-16 Thread Andrew Lunn

Remove the extra blank lines, add one in where recommended.

Signed-off-by: Andrew Lunn 
---
 drivers/net/phy/marvell.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
index 2aacbf8e0eb3..f52656ec618f 100644
--- a/drivers/net/phy/marvell.c
+++ b/drivers/net/phy/marvell.c
@@ -420,7 +420,6 @@ static int marvell_of_reg_init(struct phy_device *phydev)
ret = phy_write(phydev, reg, val);
if (ret < 0)
goto err;
-
}
 err:
if (current_page != saved_page) {
@@ -449,7 +448,6 @@ static int m88e1121_config_aneg(struct phy_device *phydev)
return err;
 
if (phy_interface_is_rgmii(phydev)) {
-
mscr = phy_read(phydev, MII_88E1121_PHY_MSCR_REG) &
MII_88E1121_PHY_MSCR_DELAY_MASK;
 
@@ -703,7 +701,6 @@ static int m88e_config_init(struct phy_device *phydev)
int temp;
 
if (phy_interface_is_rgmii(phydev)) {
-
temp = phy_read(phydev, MII_M_PHY_EXT_CR);
if (temp < 0)
return temp;
@@ -968,6 +965,7 @@ static int m88e1145_config_init(struct phy_device *phydev)
 
if (phydev->interface == PHY_INTERFACE_MODE_RGMII_ID) {
int temp = phy_read(phydev, MII_M1145_PHY_EXT_CR);
+
if (temp < 0)
return temp;
 
@@ -1312,6 +1310,7 @@ static int marvell_resume(struct phy_device *phydev)
 static int marvell_aneg_done(struct phy_device *phydev)
 {
int retval = phy_read(phydev, MII_M1011_PHY_STATUS);
+
return (retval < 0) ? retval : (retval & MII_M1011_PHY_STATUS_RESOLVED);
 }
 
-- 
2.11.0

[PATCH net-next 5/6] net: phy: marvell: Add helpers to get/set page

2017-05-16 Thread Andrew Lunn

Makes the code a bit more readable, and solves quite a few checkpatch
warnings of lines longer than 80 characters.
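
As an illustration only, the save/switch/restore pattern that such page helpers make easier to read can be sketched against a mock device. Everything below (struct mock_phy, the mock_* functions, the register number 22, the page and register counts) is a hypothetical stand-in and not the driver's API.

```c
#include <stdint.h>

#define MOCK_PAGE_REG	22	/* hypothetical page-select register */
#define MOCK_NUM_PAGES	4
#define MOCK_NUM_REGS	32

/* Mock PHY: one bank of registers per page plus the selected page. */
struct mock_phy {
	int page;
	uint16_t regs[MOCK_NUM_PAGES][MOCK_NUM_REGS];
};

static int mock_phy_read(struct mock_phy *phy, int reg)
{
	if (reg < 0 || reg >= MOCK_NUM_REGS)
		return -1;
	if (reg == MOCK_PAGE_REG)
		return phy->page;
	return phy->regs[phy->page][reg];
}

static int mock_phy_write(struct mock_phy *phy, int reg, int val)
{
	if (reg < 0 || reg >= MOCK_NUM_REGS)
		return -1;
	if (reg == MOCK_PAGE_REG) {
		if (val < 0 || val >= MOCK_NUM_PAGES)
			return -1;
		phy->page = val;
		return 0;
	}
	phy->regs[phy->page][reg] = (uint16_t)val;
	return 0;
}

/* Thin wrappers in the spirit of the get/set page helpers. */
static int mock_get_page(struct mock_phy *phy)
{
	return mock_phy_read(phy, MOCK_PAGE_REG);
}

static int mock_set_page(struct mock_phy *phy, int page)
{
	return mock_phy_write(phy, MOCK_PAGE_REG, page);
}

/* Write one register on a given page, then restore the previous page.
 * The first error wins; the restore is attempted whenever the page
 * switch succeeded.
 */
static int mock_write_paged(struct mock_phy *phy, int page, int reg, int val)
{
	int oldpage, err, restore;

	oldpage = mock_get_page(phy);
	if (oldpage < 0)
		return oldpage;

	err = mock_set_page(phy, page);
	if (err < 0)
		return err;

	err = mock_phy_write(phy, reg, val);
	restore = mock_set_page(phy, oldpage);

	return err < 0 ? err : restore;
}
```

With the wrappers, the call sites state the intent (get page, set page) instead of repeating the register name.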

Signed-off-by: Andrew Lunn 
---
 drivers/net/phy/marvell.c | 115 --
 1 file changed, 59 insertions(+), 56 deletions(-)

diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
index b84380db945e..d510eda92af5 100644
--- a/drivers/net/phy/marvell.c
+++ b/drivers/net/phy/marvell.c
@@ -189,6 +189,16 @@ struct marvell_priv {
struct device *hwmon_dev;
 };
 
+static int marvell_get_page(struct phy_device *phydev)
+{
+   return phy_read(phydev, MII_MARVELL_PHY_PAGE);
+}
+
+static int marvell_set_page(struct phy_device *phydev, int page)
+{
+   return phy_write(phydev, MII_MARVELL_PHY_PAGE, page);
+}
+
 static int marvell_ack_interrupt(struct phy_device *phydev)
 {
int err;
@@ -385,7 +395,7 @@ static int marvell_of_reg_init(struct phy_device *phydev)
if (!paddr || len < (4 * sizeof(*paddr)))
return 0;
 
-   saved_page = phy_read(phydev, MII_MARVELL_PHY_PAGE);
+   saved_page = marvell_get_page(phydev);
if (saved_page < 0)
return saved_page;
current_page = saved_page;
@@ -393,15 +403,15 @@ static int marvell_of_reg_init(struct phy_device *phydev)
ret = 0;
len /= sizeof(*paddr);
for (i = 0; i < len - 3; i += 4) {
-   u16 reg_page = be32_to_cpup(paddr + i);
+   u16 page = be32_to_cpup(paddr + i);
u16 reg = be32_to_cpup(paddr + i + 1);
u16 mask = be32_to_cpup(paddr + i + 2);
u16 val_bits = be32_to_cpup(paddr + i + 3);
int val;
 
-   if (reg_page != current_page) {
-   current_page = reg_page;
-   ret = phy_write(phydev, MII_MARVELL_PHY_PAGE, reg_page);
+   if (page != current_page) {
+   current_page = page;
+   ret = marvell_set_page(phydev, page);
if (ret < 0)
goto err;
}
@@ -423,7 +433,7 @@ static int marvell_of_reg_init(struct phy_device *phydev)
}
 err:
if (current_page != saved_page) {
-   i = phy_write(phydev, MII_MARVELL_PHY_PAGE, saved_page);
+   i = marvell_set_page(phydev, saved_page);
if (ret == 0)
ret = i;
}
@@ -440,10 +450,9 @@ static int m88e1121_config_aneg(struct phy_device *phydev)
 {
int err, oldpage, mscr;
 
-   oldpage = phy_read(phydev, MII_MARVELL_PHY_PAGE);
+   oldpage = marvell_get_page(phydev);
 
-   err = phy_write(phydev, MII_MARVELL_PHY_PAGE,
-   MII_88E1121_PHY_MSCR_PAGE);
+   err = marvell_set_page(phydev, MII_88E1121_PHY_MSCR_PAGE);
if (err < 0)
return err;
 
@@ -464,7 +473,7 @@ static int m88e1121_config_aneg(struct phy_device *phydev)
return err;
}
 
-   phy_write(phydev, MII_MARVELL_PHY_PAGE, oldpage);
+   marvell_set_page(phydev, oldpage);
 
err = phy_write(phydev, MII_BMCR, BMCR_RESET);
if (err < 0)
@@ -482,10 +491,9 @@ static int m88e1318_config_aneg(struct phy_device *phydev)
 {
int err, oldpage, mscr;
 
-   oldpage = phy_read(phydev, MII_MARVELL_PHY_PAGE);
+   oldpage = marvell_get_page(phydev);
 
-   err = phy_write(phydev, MII_MARVELL_PHY_PAGE,
-   MII_88E1121_PHY_MSCR_PAGE);
+   err = marvell_set_page(phydev, MII_88E1121_PHY_MSCR_PAGE);
if (err < 0)
return err;
 
@@ -496,7 +504,7 @@ static int m88e1318_config_aneg(struct phy_device *phydev)
if (err < 0)
return err;
 
-   err = phy_write(phydev, MII_MARVELL_PHY_PAGE, oldpage);
+   err = marvell_set_page(phydev, oldpage);
if (err < 0)
return err;
 
@@ -596,7 +604,7 @@ static int m88e1510_config_aneg(struct phy_device *phydev)
 {
int err;
 
-   err = phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_COPPER);
+   err = marvell_set_page(phydev, MII_M_COPPER);
if (err < 0)
goto error;
 
@@ -606,7 +614,7 @@ static int m88e1510_config_aneg(struct phy_device *phydev)
goto error;
 
/* Then the fiber link */
-   err = phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_FIBER);
+   err = marvell_set_page(phydev, MII_M_FIBER);
if (err < 0)
goto error;
 
@@ -614,10 +622,10 @@ static int m88e1510_config_aneg(struct phy_device *phydev)
if (err < 0)
goto error;
 
-   return phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_COPPER);
+   return marvell_set_page(phydev, MII_M_COPPER);
 
 error:
-   phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_COPPER);
+   marvell_set_page(phydev, MII_M_COPPER);
return

Re: [PATCH 2/6] wl1251: Use request_firmware_prefer_user() for loading NVS calibration data

2017-05-16 Thread Luis R. Rodriguez

On Tue, May 16, 2017 at 10:41:08AM +0200, Arend Van Spriel wrote:
> On 16-5-2017 1:13, Luis R. Rodriguez wrote:
> > On Fri, May 12, 2017 at 11:02:26PM +0200, Arend Van Spriel wrote:
> >> try again.. replacing email address from Michał
> >> On 12-5-2017 22:55, Arend Van Spriel wrote:
> >>> Let me explain the idea to refresh your memory (and mine). It started
> >>> when we were working on adding driver support for OpenWrt in brcmfmac.
> >>> The driver requests for firmware calibration data, but on routers it is
> >>> stored in flash. So after failing on the firmware request we now call a
> >>> platform specific API. That was my itch, but it was not bad enough to go
> >>> and scratch. Now for the N900 case there is a similar scenario, although it
> >>> has an additional requirement to go to user-space due to the need to use a
> >>> proprietary library to obtain the NVS calibration data. My thought: Why
> >>> should firmware_class care?
> > 
> > Agreed.
> > 
> >>> So the idea is that firmware_class provides
> >>> a registry for modules that can produce a certain firmware "file". Those
> >>> modules can do whatever is needed. If they need to use umh so be it.
> >>> They would only register themselves with firmware_class on platforms
> >>> that need them. It would basically be replacing the fallback mechanism
> >>> and only be effective on certain platforms.
> > 
> > Sure, so it sounds like the work that Daniel Wagner and Tom Gundersen worked
> > on [0], which provides a firmwared with two modes, best-effort and
> > final-mode, would address what you are looking for but without requiring any
> > upstream changes, *and* it also helps solve the rootfs race remote-proc folks
> > had concerns over.
> > 
> > The other added gain of this solution is that if folks need their own
> > proprietary concoction, they can just fork firmwared and have that do
> > whatever it needs for the specific device on the specific rootfs. That is,
> > firmwared can be the upstream solution if folks need it, but if folks need
> > something custom they can just mimic the implementation: best-effort and
> > final-mode.
> > 
> > Yet another added gain of this solution is that we can choose *not* to
> > support the custom fallback mechanism, as it's not needed; the udev event
> > should suffice to let userspace do what it needs.
> > 
> > Lastly, if we did not want to deal with timeouts for the way the driver data
> > API implements it, I think we might be able to do away with them for async
> > requests if we assume there will be a daemon that spawns in final-mode
> > eventually. Since it *knows* when the rootfs is ready, it should be able to
> > do a final lookup; if that returns -ENOENT, then indeed we know we can give
> > up. Now, how and whether we want to deal with timeouts when using the driver
> > data API for the fallback mechanism is worth considering, given it does not
> > have fallback mechanism support yet. If we *add* them, it would seem this
> > would also put an implicit race against userspace finishing initialization
> > and running firmwared in final-mode.
> 
> Just to be clear. When you are saying "rootfs" in this story, you mean
> any (mounted) file-system which may hold the firmware. At least that was
> one of the arguments. In kernel space we can not know how the system is
> setup in terms of mount points, let alone on which mounted file-system
> the firmware resides.

Right, wherever the hell that thing is, which could be on a cryptic FUSE
drive waiting for some bits to be decrypted from Elon Musk on a spaceship on his
way to Mars, and only userspace knows how to decrypt this thing through some
evil proprietary thing, way way after a full bootup.

> > Johannes, do you recall the corner cases we spoke about regarding timeouts?
> > Does this match what we spoke about?
> > 
> >>> Let me know if this idea is still of interest and I will rebase what I
> >>> have for an RFC round.
> > 
> > Since no upstream delta is needed for firmwared I'd like to first encourage
> > evaluating the above. Distributions don't carry it yet, which may be seen as
> > an issue, but since what we are looking for are corner cases, only folks
> > needing to deploy a specific solution would need it or a custom proprietary
> > solution.
> 
> Ok. I will go try and run firmwared in OpenWrt on a router platform.
> Have to steal one from a colleague :-p Will study firmwared.

The final-mode is the trick.

> > [0] https://github.com/teg/firmwared.git
> > 
> > PS.
> > 
> > Note that firmware signing will require an additional file, the detached
> > signature. The driver data API does not currently support the fallback
> > mechanism so we would not have to worry about that yet but once we add
> > fallback support we'd need to consider this.
> 
> Do you have references to the firmware signing design. Is the idea to
> have one signature and all "firmware files" need to be signed with it?

Nope, I'm afraid a lot has

Re: [PATCH v1] samples/bpf: Add a .gitignore for binaries

2017-05-16 Thread David Ahern

On 5/13/17 3:30 AM, Mickaël Salaün wrote:
> 
> On 13/02/2017 02:43, David Ahern wrote:
>> On 2/12/17 2:23 PM, Mickaël Salaün wrote:
>>> diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore
>>> new file mode 100644
>>> index ..a7562a5ef4c2
>>> --- /dev/null
>>> +++ b/samples/bpf/.gitignore
>>> @@ -0,0 +1,32 @@
>>> +fds_example
>>> +lathist
>>
>> ...
>>
>> Listing each target is going to be a PITA to maintain. It would be
>> better to put targets into a build directory (bin?) and ignore the
>> directory.
>>
> 
> It would require a lot of modifications to the Makefile and more
> complexity. It seems much more simple for everyone to stick to a simple
> gitignore file easily maintainable:
> $ awk '$1 == "hostprogs-y" { print $3 }' < Makefile > .gitignore
> 
> Alexei, Daniel, what do you think about this? Do you want me to send a
> v2 with the new tests?
> 

The problem stems from the fact that bpf samples do not really fall into
the 'hostprogs' category (see "4 Host Program support" in
Documentation/kbuild/makefiles.txt). Fixing samples/bpf to not rely on
it is the better long term solution. Building of tools/ for example does
not rely on it so there is an existing example of leveraging kernel
headers without the overhead.

Re: [PATCH net] selftests/bpf: fix broken build due to types.h

2017-05-16 Thread Yonghong Song




On 5/16/17 12:18 PM, David Miller wrote:
> Please correct the address of the netdev list (it is just plain
> 'netdev' not 'linux-netdev').

Thanks. Shortly after my first email, I sent a corrected submit as well.
Sorry for the spam.

> Secondly, __always_inline should not be defined by types.h
>
> That has to come from linux/compiler.h which we have no reason
> to define a private version of for eBPF clang compilation.
>
> The problem is that via several layers of indirection, linux/types.h
> eventually includes linux/compiler.h and that is probably the more
> appropriate thing for you to do.

Right. I found out that simply including string.h will eventually include
linux/compiler.h, so I do not need to define __always_inline
explicitly.

Will send a revised patch soon.

Thanks,

Yonghong

Re: [PATCH v2 1/3] bpf: Use 1<<16 as ceiling for immediate alignment in verifier.

2017-05-16 Thread Alexei Starovoitov


On 5/16/17 5:37 AM, Edward Cree wrote:
> On 15/05/17 17:04, David Miller wrote:
>> If we use 1<<31, then sequences like:
>>
>> R1 = 0
>> R1 <<= 2
>>
>> do silly things.
> Hmm.  It might be a bit late for this, but I wonder if, instead of handling
>  alignments as (1 << align), you could store them as -(1 << align), i.e.
>  leading 1s followed by 'align' 0s.
> Now the alignment of 0 is 0 (really 1 << 32), which doesn't change when
>  left-shifted some more.  Shifts of other numbers' alignments also do the
>  right thing, e.g. align(6) << 2 = (-2) << 2 = -8 = align(6 << 2).  Of
>  course you do all this in unsigned, to make sure right shifts work.
> This also makes other arithmetic simple to track; for instance, align(a + b)
>  is at worst align(a) | align(b).  (Of course, this bound isn't tight.)
> A number is 2^(n+1)-aligned if the 2^n bit of its alignment is cleared.
> Considered as unsigned numbers, smaller values are stricter alignments.
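
To make the encoding above concrete, here is a small standalone sketch in plain C. It is not verifier code; align_of(), align_shl(), align_add() and align_is_multiple_of() are hypothetical names chosen for illustration, and 32-bit unsigned arithmetic is assumed.

```c
#include <stdint.h>

/* Alignment of x in the negated encoding: leading 1s followed by as many
 * 0s as x has trailing zero bits. align_of(0) is 0, the strictest value.
 */
static uint32_t align_of(uint32_t x)
{
	uint32_t low = x & (0u - x);	/* lowest set bit, 0 when x == 0 */

	return 0u - low;
}

/* Left-shifting a value left-shifts its alignment (n must be < 32). */
static uint32_t align_shl(uint32_t a, unsigned int n)
{
	return a << n;
}

/* Conservative (not tight) alignment of a sum. */
static uint32_t align_add(uint32_t a, uint32_t b)
{
	return a | b;
}

/* True if a value with alignment 'a' is known to be a multiple of
 * 'size', which must be a power of two: all lower bits must be clear.
 */
static int align_is_multiple_of(uint32_t a, uint32_t size)
{
	return (a & (size - 1)) == 0;
}
```

For example, align_of(6) is 0xfffffffe, and shifting it left by two gives 0xfffffff8, which equals align_of(24). The bound for sums is loose in cases such as 2 + 2: the OR of the two alignments still only promises 2-byte alignment although the sum is 4.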


following this line of thinking it feels that it should be possible
to get rid of 'aux_off' and 'aux_off_align' and simplify the code.
I mean we can always do
dst_reg->min_align = min(dst_reg->min_align, src_reg->min_align);

and don't use 'off' as part of alignment checks at all.
So this bit:
if ((ip_align + reg_off + off) % size != 0) {
can be removed
and replaced with
a = alignof(ip_align)
a = min(a, reg->align)
if (a % size != 0)
and do this check always and not only after if (reg->id)

In check_packet_ptr_add():
- if (had_id)
-  dst_reg->aux_off_align = min(dst_reg->aux_off_align,
-   src_reg->min_align);
- else
-  dst_reg->aux_off_align = src_reg->min_align;

+ if (had_id)
+  dst_reg->min_align = min(dst_reg->min_align, src_reg->min_align);
+ else
+  dst_reg->min_align = src_reg->min_align;

in that sense packet_ptr_add() will be no different than
align logic we do in adjust_reg_min_max_vals()

Thoughts?
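
A rough standalone sketch of the simplified check, using ordinary power-of-two alignments. The names (pow2_align_of, access_is_aligned, ALIGN_CEILING) are made up for illustration, the 1<<16 ceiling is borrowed from the patch subject, and the real verifier tracks far more state than this.

```c
#include <stdbool.h>
#include <stdint.h>

#define ALIGN_CEILING	(1u << 16)	/* ceiling taken from the patch subject */

static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* Largest power of two dividing x; zero is treated as maximally aligned. */
static uint32_t pow2_align_of(uint32_t x)
{
	return x ? (x & (0u - x)) : ALIGN_CEILING;
}

/* a = alignof(ip_align); a = min(a, reg->align); reject if a % size != 0.
 * reg_min_align is assumed to be a non-zero power of two.
 */
static bool access_is_aligned(uint32_t ip_align, uint32_t reg_min_align,
			      uint32_t size)
{
	uint32_t a;

	if (!size)
		return false;

	a = min_u32(pow2_align_of(ip_align), reg_min_align);
	return a % size == 0;
}
```

With ip_align = 2 and a register known to be 4-byte aligned, a 2-byte access passes and a 4-byte access is rejected, because the combined alignment is only 2.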

Re: [patch net-next v3 05/10] net: sched: move TC_H_MAJ macro call into tcf_auto_prio

2017-05-16 Thread Cong Wang

On Tue, May 16, 2017 at 2:03 PM, Jiri Pirko  wrote:
> Tue, May 16, 2017 at 11:01:52PM CEST, xiyou.wangc...@gmail.com wrote:
>>On Tue, May 16, 2017 at 10:27 AM, Jiri Pirko  wrote:
>>> From: Jiri Pirko 
>>>
>>> Call the helper from the function rather than to always adjust the
>>> return value of the function.
>>
>>And rename the function name to reflect this change?
>
> ? What do you suggest?

tcf_auto_major_prio()?

Re: [PATCH v3 net-next 5/7] net: don't make false software transmit timestamps

2017-05-16 Thread Willem de Bruijn

On Tue, May 16, 2017 at 8:44 AM, Miroslav Lichvar  wrote:
> If software timestamping is enabled by the SO_TIMESTAMP(NS) option
> when a message without timestamp is already waiting in the queue, the
> __sock_recv_timestamp() function will read the current time to make a
> timestamp in order to always have something for the application.
>
> However, this applies also to outgoing packets looped back to the error
> queue when hardware timestamping is enabled by the SO_TIMESTAMPING
> option.

This is already the case for sockets that have both software receive
timestamps and hardware tx timestamps enabled, independent from
the new option SOF_TIMESTAMPING_OPT_TX_SWHW, right? If so,
then this behavior must remain.

Re: [patch net-next v3 02/10] net: sched: introduce tcf block infractructure

2017-05-16 Thread Cong Wang

On Tue, May 16, 2017 at 2:34 PM, David Miller  wrote:
> From: Cong Wang 
> Date: Tue, 16 May 2017 13:51:30 -0700
>
>> On Tue, May 16, 2017 at 10:27 AM, Jiri Pirko  wrote:
>>> +int tcf_block_get(struct tcf_block **p_block,
>>> + struct tcf_proto __rcu **p_filter_chain)
>>> +{
>>> +   struct tcf_block *block = kzalloc(sizeof(*block), GFP_KERNEL);
>>> +
>>> +   if (!block)
>>> +   return -ENOMEM;
>>> +   block->p_filter_chain = p_filter_chain;
>>> +   *p_block = block;
>>> +   return 0;
>>> +}
>>> +EXPORT_SYMBOL(tcf_block_get);
>>
>>
>> XXX_get() is usually for refcnt'ing, here you only allocate
>> a block, so please rename it to tcf_block_alloc().
>
> Later in the series he adds refcounting to these objects.
>
> He explained this to Jamal too.

I have read all patches and, unless I missed something, the block itself
is not refcnt'ed, only chains are, so it makes no sense to "get"
a block, right?

Re: [PATCH v3 net-next 6/7] net: allow simultaneous SW and HW transmit timestamping

2017-05-16 Thread Willem de Bruijn

On Tue, May 16, 2017 at 8:44 AM, Miroslav Lichvar  wrote:
> Add SOF_TIMESTAMPING_OPT_TX_SWHW option to allow an outgoing packet to
> be looped to the socket's error queue with a software timestamp even
> when a hardware transmit timestamp is expected to be provided by the
> driver.
>
> Applications using this option will receive two separate messages from
> the error queue, one with a software timestamp and the other with a
> hardware timestamp. As the hardware timestamp is saved to the shared skb
> info, which may happen before the first message with software timestamp
> is received by the application, the hardware timestamp is copied to the
> SCM_TIMESTAMPING control message only when the skb has no software
> timestamp or it is an incoming packet.
>
> While changing sw_tx_timestamp(), inline it in skb_tx_timestamp() as
> there are no other users.
>
> CC: Richard Cochran 
> CC: Willem de Bruijn 
> Signed-off-by: Miroslav Lichvar 

Acked-by: Willem de Bruijn
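
For application writers, a minimal sketch of requesting both software and hardware transmit timestamps on a socket. It assumes kernel headers that already carry SOF_TIMESTAMPING_OPT_TX_SWHW; tx_swhw_flags() and enable_tx_swhw() are hypothetical helper names, and a driver may additionally need hardware timestamping enabled through its own configuration.

```c
#include <linux/net_tstamp.h>
#include <sys/socket.h>

/* Flags asking for software and hardware transmit timestamps, reported
 * as two separate error-queue messages thanks to OPT_TX_SWHW.
 */
static unsigned int tx_swhw_flags(void)
{
	return SOF_TIMESTAMPING_TX_SOFTWARE |	/* generate SW tx timestamps */
	       SOF_TIMESTAMPING_SOFTWARE |	/* report SW timestamps */
	       SOF_TIMESTAMPING_TX_HARDWARE |	/* generate HW tx timestamps */
	       SOF_TIMESTAMPING_RAW_HARDWARE |	/* report HW timestamps */
	       SOF_TIMESTAMPING_OPT_TX_SWHW;	/* do not suppress the SW one */
}

/* Apply the flags to a socket; returns 0 on success, -1 with errno set. */
static int enable_tx_swhw(int fd)
{
	unsigned int flags = tx_swhw_flags();

	return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
			  &flags, sizeof(flags));
}
```

After a send, the application would read the socket's error queue with recvmsg() and MSG_ERRQUEUE and expect one message per timestamp type.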

Re: [PATCH v3 net-next 3/7] net: add function to retrieve original skb device using NAPI ID

2017-05-16 Thread Willem de Bruijn

On Tue, May 16, 2017 at 8:44 AM, Miroslav Lichvar  wrote:
> Since commit b68581778cd0 ("net: Make skb->skb_iif always track
> skb->dev") skbs don't have the original index of the interface which
> received the packet. This information is now needed for a new control
> message related to hardware timestamping.
>
> Instead of adding a new field to skb, we can find the device by the NAPI
> ID if it is available, i.e. CONFIG_NET_RX_BUSY_POLL is enabled and the
> driver is using NAPI. Add dev_get_by_napi_id() and also skb_napi_id() to
> hide the CONFIG_NET_RX_BUSY_POLL ifdef.
>
> CC: Richard Cochran 
> CC: Willem de Bruijn 
> Suggested-by: Willem de Bruijn 
> Signed-off-by: Miroslav Lichvar 

Acked-by: Willem de Bruijn

Re: [PATCH v3 net-next 4/7] net: add new control message for incoming HW-timestamped packets

2017-05-16 Thread Willem de Bruijn

On Tue, May 16, 2017 at 8:44 AM, Miroslav Lichvar  wrote:
> Add SOF_TIMESTAMPING_OPT_PKTINFO option to request a new control message
> for incoming packets with hardware timestamps. It contains the index of
> the real interface which received the packet and the length of the
> packet at layer 2.
>
> The index is useful with bonding, bridges and other interfaces, where
> IP_PKTINFO doesn't allow applications to determine which PHC made the
> timestamp. With the L2 length (and link speed) it is possible to
> transpose preamble timestamps to trailer timestamps, which are used in
> the NTP protocol.
>
> While this information could be provided by two new socket options
> independently from timestamping, it doesn't look like they would be very
> useful. With this option any performance impact is limited to hardware
> timestamping.
>
> Use dev_get_by_napi_id() to get the device and its index. On kernels
> with disabled CONFIG_NET_RX_BUSY_POLL or drivers not using NAPI, a zero
> index will be returned in the control message.
>
> CC: Richard Cochran 
> CC: Willem de Bruijn 
> Signed-off-by: Miroslav Lichvar 

Acked-by: Willem de Bruijn
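
As a rough illustration of the preamble-to-trailer transposition mentioned in
the patch description: the trailer timestamp is the preamble timestamp plus the
time the frame spends on the wire. The sketch below is plain userspace C and is
not part of the patch. The function names, the Mbit/s unit for the link speed,
and the assumption that the reported L2 length covers every byte between the
two timestamping points are all hypothetical.

```c
#include <stdint.h>

/* Time in nanoseconds needed to serialize len_bytes at link_mbps Mbit/s.
 * One bit at 1 Mbit/s takes 1000 ns, hence the factor of 8 * 1000.
 * Returns 0 when the link speed is unknown (reported as 0).
 */
static uint64_t l2_wire_time_ns(uint32_t len_bytes, uint32_t link_mbps)
{
	if (link_mbps == 0)
		return 0;
	return ((uint64_t)len_bytes * 8u * 1000u) / link_mbps;
}

/* Shift a timestamp taken at the start of the frame to the end of it. */
static uint64_t preamble_to_trailer_ns(uint64_t preamble_ts_ns,
				       uint32_t len_bytes, uint32_t link_mbps)
{
	return preamble_ts_ns + l2_wire_time_ns(len_bytes, link_mbps);
}
```

For a 1500-byte frame on a 1000 Mbit/s link the correction is 12000 ns, which
is large compared with the accuracy hardware timestamping can otherwise reach.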

Re: [patch net-next v3 06/10] net: sched: introduce helpers to work with filter chains

2017-05-16 Thread Cong Wang

On Tue, May 16, 2017 at 10:27 AM, Jiri Pirko  wrote:
> +static struct tcf_proto *tcf_chain_tp_prev(struct tcf_chain_info *chain_info)
> +{
> +   return rtnl_dereference(*chain_info->pprev);
> +}
> +
> +static void tcf_chain_tp_insert(struct tcf_chain *chain,
> +   struct tcf_chain_info *chain_info,
> +   struct tcf_proto *tp)
> +{
> +   if (chain->p_filter_chain &&
> +   *chain_info->pprev == chain->filter_chain)
> +   *chain->p_filter_chain = tp;
> +   RCU_INIT_POINTER(tp->next, rtnl_dereference(*chain_info->pprev));

Use tcf_chain_tp_prev()?


> +   rcu_assign_pointer(*chain_info->pprev, tp);
> +}
> +
> +static void tcf_chain_tp_remove(struct tcf_chain *chain,
> +   struct tcf_chain_info *chain_info,
> +   struct tcf_proto *tp)
> +{
> +   struct tcf_proto *next = rtnl_dereference(chain_info->next);
> +
> +   if (chain->p_filter_chain && tp == chain->filter_chain)
> +   *chain->p_filter_chain = next;
> +   RCU_INIT_POINTER(*chain_info->pprev, next);
> +}
> +
> +static struct tcf_proto *tcf_chain_tp_find(struct tcf_chain *chain,
> +  struct tcf_chain_info *chain_info,
> +  u32 protocol, u32 prio,
> +  bool prio_allocate)
> +{
> +   struct tcf_proto **pprev;
> +   struct tcf_proto *tp;
> +
> +   /* Check the chain for existence of proto-tcf with this priority */
> +   for (pprev = &chain->filter_chain;
> +(tp = rtnl_dereference(*pprev)); pprev = &tp->next) {

Use tcf_chain_tp_prev()?
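
The helpers quoted above rely on the pointer-to-pointer ("pprev") idiom: the
walk remembers the address of the link that points at the current element, so
insert and remove need no special case for the list head. A minimal userspace
sketch of that idiom follows. It leaves out RCU and RTNL entirely, and the
struct and function names are made up for illustration rather than taken from
the patch.

```c
#include <stddef.h>

struct node {
	int prio;
	struct node *next;
};

/* Return the address of the link that points at the first node whose
 * priority is >= prio, or the address of the terminating NULL link.
 */
static struct node **chain_find(struct node **head, int prio)
{
	struct node **pprev;

	for (pprev = head; *pprev; pprev = &(*pprev)->next)
		if ((*pprev)->prio >= prio)
			break;
	return pprev;
}

/* Link n in front of whatever *pprev currently points at. */
static void chain_insert(struct node **pprev, struct node *n)
{
	n->next = *pprev;
	*pprev = n;
}

/* Unlink the node *pprev points at, if there is one. */
static void chain_remove(struct node **pprev)
{
	if (*pprev)
		*pprev = (*pprev)->next;
}
```

In the walk itself pprev is advanced by taking the address of the current
node's next field, so at that point it is being assigned rather than
dereferenced.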

Re: [PATCH v8 3/5] rxrpc: check return value of skb_to_sgvec always

2017-05-16 Thread Jason A. Donenfeld

On Mon, May 15, 2017 at 3:11 PM, David Howells  wrote:
> skb_to_sgvec() can return -EMSGSIZE in some circumstances.  You shouldn't
> return -ENOMEM here in such a case.

Noted. I'll fix this up for the next round.
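
A minimal sketch of the point under review, assuming a caller that simply
forwards whatever negative errno the conversion step produced instead of
replacing it with -ENOMEM. The helper names and the capacity of 16 entries are
hypothetical stand-ins and do not come from the rxrpc code.

```c
#include <errno.h>

#define SKETCH_MAX_SG 16	/* hypothetical scatterlist capacity */

/* Stand-in for skb_to_sgvec(): returns the number of entries used, or
 * -EMSGSIZE when the buffer needs more entries than are available.
 */
static int sketch_to_sgvec(int nr_frags)
{
	if (nr_frags > SKETCH_MAX_SG)
		return -EMSGSIZE;
	return nr_frags;
}

/* Caller that propagates the callee's error code unchanged. */
static int sketch_secure_packet(int nr_frags)
{
	int nsg = sketch_to_sgvec(nr_frags);

	if (nsg < 0)
		return nsg;	/* pass -EMSGSIZE through, do not remap it */
	/* ... work over nsg scatterlist entries would go here ... */
	return 0;
}
```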

Re: [PATCH v8 1/5] skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow

2017-05-16 Thread Jason A. Donenfeld

On Mon, May 15, 2017 at 3:12 PM, David Howells  wrote:
> Is there a reason you moved skb_to_sgvec() in the file rather than just moving
> the comment to it (since you moved the comment anyway)?

1) Because it's easier to understand skb_to_sgvec_nomark as a variant
of skb_to_sgvec, so I'd rather skb_to_sgvec to be first when reading.
2) Because skb_to_sgvec relies on the return value of __skb_to_sgvec,
and so when assessing it, it's sometimes nice to be able to look at
why it will return different things. In that case, it's easier to have
both functions within the same view without scrolling.

It's the little things that make life easier sometimes.

[PATCH net-next] tcp: warn on negative reordering values

2017-05-16 Thread Soheil Hassas Yeganeh

From: Soheil Hassas Yeganeh 

Commit bafbb9c73241 ("tcp: eliminate negative reordering
in tcp_clean_rtx_queue") fixes an issue for negative
reordering metrics.

To be resilient to such errors, warn and return
when a negative metric is passed to tcp_update_reordering().

Signed-off-by: Soheil Hassas Yeganeh 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_input.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index f27dff64e59e..eb5eb87060a2 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -886,6 +886,9 @@ static void tcp_update_reordering(struct sock *sk, const 
int metric,
struct tcp_sock *tp = tcp_sk(sk);
int mib_idx;
 
+   if (WARN_ON_ONCE(metric < 0))
+   return;
+
if (metric > tp->reordering) {
tp->reordering = min(sysctl_tcp_max_reordering, metric);
 
-- 
2.13.0.303.g4ebf302169-goog
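
For readers unfamiliar with the pattern, here is a userspace sketch of what
the added guard does: complain once, then ignore the bad input on every call.
It is only an approximation of WARN_ON_ONCE(), which in the kernel also dumps
a stack trace. The helper names and the cap of 300 standing in for
sysctl_tcp_max_reordering are assumptions made for the example.

```c
#include <stdio.h>

/* Print msg the first time cond is true; always return cond. */
static int sketch_warn_on_once(int cond, int *warned, const char *msg)
{
	if (cond && !*warned) {
		*warned = 1;
		fprintf(stderr, "WARNING: %s\n", msg);
	}
	return cond;
}

static int sketch_max_reordering = 300;	/* illustrative cap */

/* Raise *reordering to metric (capped); reject negative metrics. */
static void sketch_update_reordering(int *reordering, int metric)
{
	static int warned;

	if (sketch_warn_on_once(metric < 0, &warned,
				"negative reordering metric"))
		return;

	if (metric > *reordering)
		*reordering = metric < sketch_max_reordering ?
			      metric : sketch_max_reordering;
}
```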

Re: [patch net-next v3 02/10] net: sched: introduce tcf block infractructure

2017-05-16 Thread David Miller

From: Cong Wang 
Date: Tue, 16 May 2017 13:51:30 -0700

> On Tue, May 16, 2017 at 10:27 AM, Jiri Pirko  wrote:
>> +int tcf_block_get(struct tcf_block **p_block,
>> + struct tcf_proto __rcu **p_filter_chain)
>> +{
>> +   struct tcf_block *block = kzalloc(sizeof(*block), GFP_KERNEL);
>> +
>> +   if (!block)
>> +   return -ENOMEM;
>> +   block->p_filter_chain = p_filter_chain;
>> +   *p_block = block;
>> +   return 0;
>> +}
>> +EXPORT_SYMBOL(tcf_block_get);
> 
> 
> XXX_get() is usually for refcnt'ing, here you only allocate
> a block, so please rename it to tcf_block_alloc().

Later in the series he adds refcounting to these objects.

He explained this to Jamal too.

[PATCH net-next 10/15] tcp: uses jiffies_32 to feed tp->chrono_start

2017-05-16 Thread Eric Dumazet

tcp_time_stamp will no longer be tied to jiffies.

Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp.c| 2 +-
 net/ipv4/tcp_output.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 
d0bb61ee28bbceff8f2e27416ce87fec94935973..b85bfe7cb11dca68952cc4be19b169d893963fef
 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2757,7 +2757,7 @@ static void tcp_get_info_chrono_stats(const struct 
tcp_sock *tp,
for (i = TCP_CHRONO_BUSY; i < __TCP_CHRONO_MAX; ++i) {
stats[i] = tp->chrono_stat[i - 1];
if (i == tp->chrono_type)
-   stats[i] += tcp_time_stamp - tp->chrono_start;
+   stats[i] += tcp_jiffies32 - tp->chrono_start;
stats[i] *= USEC_PER_SEC / HZ;
total += stats[i];
}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 
f0fd1b4fdb3291638fcdca613d826db2cd27f517..1011ea40c2ba4c12cce21149cab176e1fa4db583
 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2202,7 +2202,7 @@ static bool tcp_small_queue_check(struct sock *sk, const 
struct sk_buff *skb,
 
 static void tcp_chrono_set(struct tcp_sock *tp, const enum tcp_chrono new)
 {
-   const u32 now = tcp_time_stamp;
+   const u32 now = tcp_jiffies32;
 
if (tp->chrono_type > TCP_CHRONO_UNSPEC)
tp->chrono_stat[tp->chrono_type - 1] += now - tp->chrono_start;
-- 
2.13.0.303.g4ebf302169-goog
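
The two hunks above touch the two halves of the chrono accounting: one
accumulates elapsed ticks when the state changes, the other adds the still
running interval and converts ticks to microseconds when stats are read. A
self-contained sketch of that bookkeeping, with a fixed tick rate of 250 and
made-up type and function names (none of them are the kernel's):

```c
#include <stdint.h>

#define SKETCH_HZ		250u
#define SKETCH_USEC_PER_SEC	1000000u

enum sketch_chrono {
	CHRONO_UNSPEC,
	CHRONO_BUSY,
	CHRONO_RWND_LIMITED,
	CHRONO_SNDBUF_LIMITED,
	CHRONO_MAX
};

struct sketch_chrono_state {
	uint32_t start;				/* tick when type was entered */
	uint32_t stat[CHRONO_MAX - 1];		/* accumulated ticks per type */
	enum sketch_chrono type;		/* currently running type */
};

/* Close the running interval, then start timing new_type at tick now. */
static void sketch_chrono_set(struct sketch_chrono_state *c,
			      enum sketch_chrono new_type, uint32_t now)
{
	if (c->type > CHRONO_UNSPEC)
		c->stat[c->type - 1] += now - c->start;
	c->start = now;
	c->type = new_type;
}

/* Total time spent in type, in microseconds, as of tick now.
 * type must not be CHRONO_UNSPEC.
 */
static uint64_t sketch_chrono_usec(const struct sketch_chrono_state *c,
				   enum sketch_chrono type, uint32_t now)
{
	uint64_t ticks = c->stat[type - 1];

	if (type == c->type)
		ticks += now - c->start;
	return ticks * (SKETCH_USEC_PER_SEC / SKETCH_HZ);
}
```

Both halves have to read the same clock. If the start were recorded in one
unit and the delta taken in another, the accumulated totals would be
meaningless, which is presumably why both sites are switched together.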

[PATCH net-next 07/15] tcp: bic,cubic: use tcp_jiffies32 instead of tcp_time_stamp

2017-05-16 Thread Eric Dumazet

Use tcp_jiffies32 instead of tcp_time_stamp, since
tcp_time_stamp will soon be only used for TCP TS option.

Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_bic.c   |  6 +++---
 net/ipv4/tcp_cubic.c | 12 ++--
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/tcp_bic.c b/net/ipv4/tcp_bic.c
index 
36087bca9f489646c2ca5aae3111449a956dd33b..609965f0e29836ed95605a2c7f3170e67c641058
 100644
--- a/net/ipv4/tcp_bic.c
+++ b/net/ipv4/tcp_bic.c
@@ -84,14 +84,14 @@ static void bictcp_init(struct sock *sk)
 static inline void bictcp_update(struct bictcp *ca, u32 cwnd)
 {
if (ca->last_cwnd == cwnd &&
-   (s32)(tcp_time_stamp - ca->last_time) <= HZ / 32)
+   (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
return;
 
ca->last_cwnd = cwnd;
-   ca->last_time = tcp_time_stamp;
+   ca->last_time = tcp_jiffies32;
 
if (ca->epoch_start == 0) /* record the beginning of an epoch */
-   ca->epoch_start = tcp_time_stamp;
+   ca->epoch_start = tcp_jiffies32;
 
/* start off normal */
if (cwnd <= low_window) {
diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index 
2052ca740916d0872a41125ab61b769b334a314b..57ae5b5ae643efad106f5d6ac224ca54a52f9689
 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -231,21 +231,21 @@ static inline void bictcp_update(struct bictcp *ca, u32 
cwnd, u32 acked)
ca->ack_cnt += acked;   /* count the number of ACKed packets */
 
if (ca->last_cwnd == cwnd &&
-   (s32)(tcp_time_stamp - ca->last_time) <= HZ / 32)
+   (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
return;
 
/* The CUBIC function can update ca->cnt at most once per jiffy.
 * On all cwnd reduction events, ca->epoch_start is set to 0,
 * which will force a recalculation of ca->cnt.
 */
-   if (ca->epoch_start && tcp_time_stamp == ca->last_time)
+   if (ca->epoch_start && tcp_jiffies32 == ca->last_time)
goto tcp_friendliness;
 
ca->last_cwnd = cwnd;
-   ca->last_time = tcp_time_stamp;
+   ca->last_time = tcp_jiffies32;
 
if (ca->epoch_start == 0) {
-   ca->epoch_start = tcp_time_stamp;   /* record beginning */
+   ca->epoch_start = tcp_jiffies32;/* record beginning */
ca->ack_cnt = acked;/* start counting */
ca->tcp_cwnd = cwnd;/* syn with cubic */
 
@@ -276,7 +276,7 @@ static inline void bictcp_update(struct bictcp *ca, u32 
cwnd, u32 acked)
 * if the cwnd < 1 million packets !!!
 */
 
-   t = (s32)(tcp_time_stamp - ca->epoch_start);
+   t = (s32)(tcp_jiffies32 - ca->epoch_start);
t += msecs_to_jiffies(ca->delay_min >> 3);
/* change the unit from HZ to bictcp_HZ */
t <<= BICTCP_HZ;
@@ -448,7 +448,7 @@ static void bictcp_acked(struct sock *sk, const struct 
ack_sample *sample)
return;
 
/* Discard delay samples right after fast recovery */
-   if (ca->epoch_start && (s32)(tcp_time_stamp - ca->epoch_start) < HZ)
+   if (ca->epoch_start && (s32)(tcp_jiffies32 - ca->epoch_start) < HZ)
return;
 
delay = (sample->rtt_us << 3) / USEC_PER_MSEC;
-- 
2.13.0.303.g4ebf302169-goog
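
The (s32)(now - then) casts in the hunks above are what keep these
comparisons correct when the 32-bit tick counter wraps. A small userspace
sketch of the same arithmetic, with hypothetical helper names, assuming the
usual two's-complement conversion from uint32_t to int32_t:

```c
#include <stdint.h>

/* Signed distance from then to now; correct across a counter wrap as
 * long as the two values are less than 2^31 ticks apart.
 */
static int32_t sketch_ticks_delta(uint32_t now, uint32_t then)
{
	return (int32_t)(now - then);
}

/* Nonzero if tick a is later than tick b, wrap included. */
static int sketch_ticks_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}
```

With then = 0xFFFFFFFB and now = 5 the delta is 10 ticks, whereas a plain
unsigned comparison of the two raw values would order them the wrong way
round.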

[PATCH net-next 15/15] tcp: switch TCP TS option (RFC 7323) to 1ms clock

2017-05-16 Thread Eric Dumazet

TCP Timestamps option is defined in RFC 7323

Traditionally on linux, it has been tied to the internal
'jiffies' variable, because it had been a cheap and good enough
generator.

For TCP flows on the Internet, 1 ms resolution would be much better
than 4ms or 10ms (HZ=250 or HZ=100 respectively)

For TCP flows in the DC, Google has used usec resolution for more
than two years with great success [1]

Receive size autotuning (DRS) is indeed more precise and converges
faster to optimal window size.

This patch converts tp->tcp_mstamp to a plain u64 value storing
a 1 usec TCP clock.

This choice will allow us to upstream the 1 usec TS option as
discussed in IETF 97.

[1] 
https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf

Signed-off-by: Eric Dumazet 
---
 include/linux/skbuff.h   | 62 +-
 include/linux/tcp.h  | 22 -
 include/net/tcp.h| 59 
 net/ipv4/syncookies.c|  8 ++--
 net/ipv4/tcp.c   |  4 +-
 net/ipv4/tcp_bbr.c   | 22 -
 net/ipv4/tcp_input.c | 96 
 net/ipv4/tcp_ipv4.c  | 17 +++
 net/ipv4/tcp_lp.c| 12 ++---
 net/ipv4/tcp_minisocks.c |  4 +-
 net/ipv4/tcp_output.c| 16 +++
 net/ipv4/tcp_rate.c  | 16 +++
 net/ipv4/tcp_recovery.c  | 23 +-
 net/ipv4/tcp_timer.c |  8 ++--
 net/ipv6/syncookies.c|  2 +-
 net/ipv6/tcp_ipv6.c  |  4 +-
 net/netfilter/nf_synproxy_core.c |  2 +-
 17 files changed, 178 insertions(+), 199 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 
bfc7892f6c33c9fdfb7c0d8110f80cfb12d1ae61..7c0cb2ce8b01a9be366d8cdb7e3661f65ebff3c9
 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -506,66 +506,6 @@ typedef unsigned int sk_buff_data_t;
 typedef unsigned char *sk_buff_data_t;
 #endif
 
-/**
- * struct skb_mstamp - multi resolution time stamps
- * @stamp_us: timestamp in us resolution
- * @stamp_jiffies: timestamp in jiffies
- */
-struct skb_mstamp {
-   union {
-   u64 v64;
-   struct {
-   u32 stamp_us;
-   u32 stamp_jiffies;
-   };
-   };
-};
-
-/**
- * skb_mstamp_get - get current timestamp
- * @cl: place to store timestamps
- */
-static inline void skb_mstamp_get(struct skb_mstamp *cl)
-{
-   u64 val = local_clock();
-
-   do_div(val, NSEC_PER_USEC);
-   cl->stamp_us = (u32)val;
-   cl->stamp_jiffies = (u32)jiffies;
-}
-
-/**
- * skb_mstamp_delta - compute the difference in usec between two skb_mstamp
- * @t1: pointer to newest sample
- * @t0: pointer to oldest sample
- */
-static inline u32 skb_mstamp_us_delta(const struct skb_mstamp *t1,
- const struct skb_mstamp *t0)
-{
-   s32 delta_us = t1->stamp_us - t0->stamp_us;
-   u32 delta_jiffies = t1->stamp_jiffies - t0->stamp_jiffies;
-
-   /* If delta_us is negative, this might be because interval is too big,
-* or local_clock() drift is too big : fallback using jiffies.
-*/
-   if (delta_us <= 0 ||
-   delta_jiffies >= (INT_MAX / (USEC_PER_SEC / HZ)))
-
-   delta_us = jiffies_to_usecs(delta_jiffies);
-
-   return delta_us;
-}
-
-static inline bool skb_mstamp_after(const struct skb_mstamp *t1,
-   const struct skb_mstamp *t0)
-{
-   s32 diff = t1->stamp_jiffies - t0->stamp_jiffies;
-
-   if (!diff)
-   diff = t1->stamp_us - t0->stamp_us;
-   return diff > 0;
-}
-
 /** 
  * struct sk_buff - socket buffer
  * @next: Next buffer in list
@@ -646,7 +586,7 @@ struct sk_buff {
 
union {
ktime_t tstamp;
-   struct skb_mstamp skb_mstamp;
+   u64 skb_mstamp;
};
};
struct rb_node  rbnode; /* used in netem & tcp stack */
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 
22854f0284347a3bb047709478525ee5a9dd9b36..542ca1ae02c4f64833b287c0fd744283ee518909
 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -123,7 +123,7 @@ struct tcp_request_sock_ops;
 struct tcp_request_sock {
struct inet_request_sockreq;
const struct tcp_request_sock_ops *af_specific;
-   struct skb_mstamp   snt_synack; /* first SYNACK sent time */
+   u64 snt_synack; /* first SYNACK sent time */
booltfo_listener;
u32 txhash;
u32 rcv_isn;
@@ -211,7 +211,7 @@ struct tcp_sock {
 
/* Information of the most recently

[PATCH net-next 06/15] tcp_bbr: use tcp_jiffies32 instead of tcp_time_stamp

2017-05-16 Thread Eric Dumazet

Use tcp_jiffies32 instead of tcp_time_stamp, since
tcp_time_stamp will soon be only used for TCP TS option.

Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_bbr.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index 
92b045c72163def1c1d6aa0f2002760186aa5dc3..40dc4fc5f6acba91634290e1cacde69a3584248f
 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -730,12 +730,12 @@ static void bbr_update_min_rtt(struct sock *sk, const 
struct rate_sample *rs)
bool filter_expired;
 
/* Track min RTT seen in the min_rtt_win_sec filter window: */
-   filter_expired = after(tcp_time_stamp,
+   filter_expired = after(tcp_jiffies32,
   bbr->min_rtt_stamp + bbr_min_rtt_win_sec * HZ);
if (rs->rtt_us >= 0 &&
(rs->rtt_us <= bbr->min_rtt_us || filter_expired)) {
bbr->min_rtt_us = rs->rtt_us;
-   bbr->min_rtt_stamp = tcp_time_stamp;
+   bbr->min_rtt_stamp = tcp_jiffies32;
}
 
if (bbr_probe_rtt_mode_ms > 0 && filter_expired &&
@@ -754,7 +754,7 @@ static void bbr_update_min_rtt(struct sock *sk, const 
struct rate_sample *rs)
/* Maintain min packets in flight for max(200 ms, 1 round). */
if (!bbr->probe_rtt_done_stamp &&
tcp_packets_in_flight(tp) <= bbr_cwnd_min_target) {
-   bbr->probe_rtt_done_stamp = tcp_time_stamp +
+   bbr->probe_rtt_done_stamp = tcp_jiffies32 +
msecs_to_jiffies(bbr_probe_rtt_mode_ms);
bbr->probe_rtt_round_done = 0;
bbr->next_rtt_delivered = tp->delivered;
@@ -762,8 +762,8 @@ static void bbr_update_min_rtt(struct sock *sk, const 
struct rate_sample *rs)
if (bbr->round_start)
bbr->probe_rtt_round_done = 1;
if (bbr->probe_rtt_round_done &&
-   after(tcp_time_stamp, bbr->probe_rtt_done_stamp)) {
-   bbr->min_rtt_stamp = tcp_time_stamp;
+   after(tcp_jiffies32, bbr->probe_rtt_done_stamp)) {
+   bbr->min_rtt_stamp = tcp_jiffies32;
bbr->restore_cwnd = 1;  /* snap to prior_cwnd */
bbr_reset_mode(sk);
}
@@ -810,7 +810,7 @@ static void bbr_init(struct sock *sk)
bbr->probe_rtt_done_stamp = 0;
bbr->probe_rtt_round_done = 0;
bbr->min_rtt_us = tcp_min_rtt(tp);
-   bbr->min_rtt_stamp = tcp_time_stamp;
+   bbr->min_rtt_stamp = tcp_jiffies32;
 
minmax_reset(&bbr->bw, bbr->rtt_cnt, 0);  /* init max bw to 0 */
 
-- 
2.13.0.303.g4ebf302169-goog

[PATCH net-next 12/15] tcp_westwood: use tcp_jiffies32 instead of tcp_time_stamp

2017-05-16 Thread Eric Dumazet

This CC does not need 1 ms tcp_time_stamp and can use
the jiffy based 'timestamp'.

Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_westwood.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_westwood.c b/net/ipv4/tcp_westwood.c
index 
9775453b8d174c848dc09df83d1fa185422cd8cc..bec9cafbe3f92938e5d79d743d629b2f33464418
 100644
--- a/net/ipv4/tcp_westwood.c
+++ b/net/ipv4/tcp_westwood.c
@@ -68,7 +68,7 @@ static void tcp_westwood_init(struct sock *sk)
w->cumul_ack = 0;
w->reset_rtt_min = 1;
w->rtt_min = w->rtt = TCP_WESTWOOD_INIT_RTT;
-   w->rtt_win_sx = tcp_time_stamp;
+   w->rtt_win_sx = tcp_jiffies32;
w->snd_una = tcp_sk(sk)->snd_una;
w->first_ack = 1;
 }
@@ -116,7 +116,7 @@ static void tcp_westwood_pkts_acked(struct sock *sk,
 static void westwood_update_window(struct sock *sk)
 {
struct westwood *w = inet_csk_ca(sk);
-   s32 delta = tcp_time_stamp - w->rtt_win_sx;
+   s32 delta = tcp_jiffies32 - w->rtt_win_sx;
 
/* Initialize w->snd_una with the first acked sequence number in order
 * to fix mismatch between tp->snd_una and w->snd_una for the first
@@ -140,7 +140,7 @@ static void westwood_update_window(struct sock *sk)
westwood_filter(w, delta);
 
w->bk = 0;
-   w->rtt_win_sx = tcp_time_stamp;
+   w->rtt_win_sx = tcp_jiffies32;
}
 }
 
-- 
2.13.0.303.g4ebf302169-goog

[PATCH net-next 01/15] tcp: use tp->tcp_mstamp in output path

2017-05-16 Thread Eric Dumazet

Idea is to later convert tp->tcp_mstamp to a full u64 counter
using usec resolution, so that we can later have fine
grained TCP TS clock (RFC 7323), regardless of HZ value.

We try to refresh tp->tcp_mstamp only when necessary.

Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_ipv4.c |  1 +
 net/ipv4/tcp_output.c   | 21 +++--
 net/ipv4/tcp_recovery.c |  1 -
 net/ipv4/tcp_timer.c|  3 ++-
 4 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 
5ab2aac5ca191075383fc75214da816873bb222c..d8fe25db79f223e3fde85882effd2ac6ec15f8ca
 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -483,6 +483,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
skb = tcp_write_queue_head(sk);
BUG_ON(!skb);
 
+   skb_mstamp_get(&tp->tcp_mstamp);
remaining = icsk->icsk_rto -
min(icsk->icsk_rto,
tcp_time_stamp - tcp_skb_timestamp(skb));
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 
a32172d69a03cbe76b45ec3094222f6c3a73e27d..4c8a6eaba6b39a2aea061dd6857ed8df954c5ca2
 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -997,8 +997,8 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff 
*skb, int clone_it,
BUG_ON(!skb || !tcp_skb_pcount(skb));
tp = tcp_sk(sk);
 
+   skb->skb_mstamp = tp->tcp_mstamp;
if (clone_it) {
-   skb_mstamp_get(&skb->skb_mstamp);
TCP_SKB_CB(skb)->tx.in_flight = TCP_SKB_CB(skb)->end_seq
- tp->snd_una;
tcp_rate_skb_sent(sk, skb);
@@ -1906,7 +1906,6 @@ static bool tcp_tso_should_defer(struct sock *sk, struct 
sk_buff *skb,
const struct inet_connection_sock *icsk = inet_csk(sk);
u32 age, send_win, cong_win, limit, in_flight;
struct tcp_sock *tp = tcp_sk(sk);
-   struct skb_mstamp now;
struct sk_buff *head;
int win_divisor;
 
@@ -1962,8 +1961,8 @@ static bool tcp_tso_should_defer(struct sock *sk, struct 
sk_buff *skb,
}
 
head = tcp_write_queue_head(sk);
-   skb_mstamp_get(&now);
-   age = skb_mstamp_us_delta(&now, &head->skb_mstamp);
+
+   age = skb_mstamp_us_delta(&tp->tcp_mstamp, &head->skb_mstamp);
/* If next ACK is likely to come too late (half srtt), do not defer */
if (age < (tp->srtt_us >> 4))
goto send_now;
@@ -2280,6 +2279,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int 
mss_now, int nonagle,
}
 
max_segs = tcp_tso_segs(sk, mss_now);
+   skb_mstamp_get(&tp->tcp_mstamp);
while ((skb = tcp_send_head(sk))) {
unsigned int limit;
 
@@ -2291,7 +2291,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int 
mss_now, int nonagle,
 
if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE) 
{
/* "skb_mstamp" is used as a start point for the 
retransmit timer */
-   skb_mstamp_get(&skb->skb_mstamp);
+   skb->skb_mstamp = tp->tcp_mstamp;
goto repair; /* Skip network transmission */
}
 
@@ -2879,7 +2879,7 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff 
*skb, int segs)
 skb_headroom(skb) >= 0x)) {
struct sk_buff *nskb;
 
-   skb_mstamp_get(&skb->skb_mstamp);
+   skb->skb_mstamp = tp->tcp_mstamp;
nskb = __pskb_copy(skb, MAX_TCP_HEADER, GFP_ATOMIC);
err = nskb ? tcp_transmit_skb(sk, nskb, 0, GFP_ATOMIC) :
 -ENOBUFS;
@@ -3095,7 +3095,7 @@ void tcp_send_active_reset(struct sock *sk, gfp_t 
priority)
skb_reserve(skb, MAX_TCP_HEADER);
tcp_init_nondata_skb(skb, tcp_acceptable_seq(sk),
 TCPHDR_ACK | TCPHDR_RST);
-   skb_mstamp_get(&skb->skb_mstamp);
+   skb_mstamp_get(&tcp_sk(sk)->tcp_mstamp);
/* Send it off. */
if (tcp_transmit_skb(sk, skb, 0, priority))
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTFAILED);
@@ -3453,7 +3453,8 @@ int tcp_connect(struct sock *sk)
return -ENOBUFS;
 
tcp_init_nondata_skb(buff, tp->write_seq++, TCPHDR_SYN);
-   tp->retrans_stamp = tcp_time_stamp;
+   skb_mstamp_get(&tp->tcp_mstamp);
+   tp->retrans_stamp = tp->tcp_mstamp.stamp_jiffies;
tcp_connect_queue_skb(sk, buff);
tcp_ecn_send_syn(sk, buff);
 
@@ -3572,7 +3573,6 @@ void tcp_send_ack(struct sock *sk)
skb_set_tcp_pure_ack(buff);
 
/* Send it off, this clears delayed acks for us. */
-   skb_mstamp_get(&buff->skb_mstamp);
tcp_transmit_skb(sk, buff, 0, (__force gfp_t)0);
 }
 EXPORT_SYMBOL_GPL(tcp_send_ack);
@@ -3606,15 +3606,16 @@ static int tcp_xmit_probe_skb(struct sock *sk, int 
urgent, int mib)
 * send it.
 */
tcp_init_nondata_skb(skb,

[PATCH net-next 13/15] tcp_lp: cache tcp_time_stamp

2017-05-16 Thread Eric Dumazet

tcp_time_stamp will become slightly more expensive soon,
cache its value.

Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_lp.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_lp.c b/net/ipv4/tcp_lp.c
index 
d6fb6c067af4641f232b94e7c590c212648e8173..ef3122abb3734a63011fba035f7a7aae431da8de
 100644
--- a/net/ipv4/tcp_lp.c
+++ b/net/ipv4/tcp_lp.c
@@ -264,18 +264,19 @@ static void tcp_lp_pkts_acked(struct sock *sk, const 
struct ack_sample *sample)
 {
struct tcp_sock *tp = tcp_sk(sk);
struct lp *lp = inet_csk_ca(sk);
+   u32 now = tcp_time_stamp;
u32 delta;
 
if (sample->rtt_us > 0)
tcp_lp_rtt_sample(sk, sample->rtt_us);
 
/* calc inference */
-   delta = tcp_time_stamp - tp->rx_opt.rcv_tsecr;
+   delta = now - tp->rx_opt.rcv_tsecr;
if ((s32)delta > 0)
lp->inference = 3 * delta;
 
/* test if within inference */
-   if (lp->last_drop && (tcp_time_stamp - lp->last_drop < lp->inference))
+   if (lp->last_drop && (now - lp->last_drop < lp->inference))
lp->flag |= LP_WITHIN_INF;
else
lp->flag &= ~LP_WITHIN_INF;
@@ -312,7 +313,7 @@ static void tcp_lp_pkts_acked(struct sock *sk, const struct 
ack_sample *sample)
tp->snd_cwnd = max(tp->snd_cwnd >> 1U, 1U);
 
/* record this drop time */
-   lp->last_drop = tcp_time_stamp;
+   lp->last_drop = now;
 }
 
 static struct tcp_congestion_ops tcp_lp __read_mostly = {
-- 
2.13.0.303.g4ebf302169-goog

[PATCH net-next 05/15] tcp: use tcp_jiffies32 to feed tp->snd_cwnd_stamp

2017-05-16 Thread Eric Dumazet

Use tcp_jiffies32 instead of tcp_time_stamp to feed
tp->snd_cwnd_stamp.

tcp_time_stamp will soon be a little bit more expensive
than simply reading 'jiffies'.

Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_input.c   | 14 +++---
 net/ipv4/tcp_metrics.c |  2 +-
 net/ipv4/tcp_output.c  |  8 
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 
c0b3f909df394214785749704f2760171fe9d160..6a15c9b80b09829799dc37d89ecdbf11ec9ff904
 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -463,7 +463,7 @@ void tcp_init_buffer_space(struct sock *sk)
tp->window_clamp = max(2 * tp->advmss, maxwin - tp->advmss);
 
tp->rcv_ssthresh = min(tp->rcv_ssthresh, tp->window_clamp);
-   tp->snd_cwnd_stamp = tcp_time_stamp;
+   tp->snd_cwnd_stamp = tcp_jiffies32;
 }
 
 /* 5. Recalculate window clamp after socket hit its memory bounds. */
@@ -1954,7 +1954,7 @@ void tcp_enter_loss(struct sock *sk)
}
tp->snd_cwnd   = 1;
tp->snd_cwnd_cnt   = 0;
-   tp->snd_cwnd_stamp = tcp_time_stamp;
+   tp->snd_cwnd_stamp = tcp_jiffies32;
 
tp->retrans_out = 0;
tp->lost_out = 0;
@@ -2383,7 +2383,7 @@ static void tcp_undo_cwnd_reduction(struct sock *sk, bool 
unmark_loss)
tcp_ecn_withdraw_cwr(tp);
}
}
-   tp->snd_cwnd_stamp = tcp_time_stamp;
+   tp->snd_cwnd_stamp = tcp_jiffies32;
tp->undo_marker = 0;
 }
 
@@ -2520,7 +2520,7 @@ static inline void tcp_end_cwnd_reduction(struct sock *sk)
if (inet_csk(sk)->icsk_ca_state == TCP_CA_CWR ||
(tp->undo_marker && tp->snd_ssthresh < TCP_INFINITE_SSTHRESH)) {
tp->snd_cwnd = tp->snd_ssthresh;
-   tp->snd_cwnd_stamp = tcp_time_stamp;
+   tp->snd_cwnd_stamp = tcp_jiffies32;
}
tcp_ca_event(sk, CA_EVENT_COMPLETE_CWR);
 }
@@ -2590,7 +2590,7 @@ static void tcp_mtup_probe_success(struct sock *sk)
   tcp_mss_to_mtu(sk, tp->mss_cache) /
   icsk->icsk_mtup.probe_size;
tp->snd_cwnd_cnt = 0;
-   tp->snd_cwnd_stamp = tcp_time_stamp;
+   tp->snd_cwnd_stamp = tcp_jiffies32;
tp->snd_ssthresh = tcp_current_ssthresh(sk);
 
icsk->icsk_mtup.search_low = icsk->icsk_mtup.probe_size;
@@ -2976,7 +2976,7 @@ static void tcp_cong_avoid(struct sock *sk, u32 ack, u32 
acked)
const struct inet_connection_sock *icsk = inet_csk(sk);
 
icsk->icsk_ca_ops->cong_avoid(sk, ack, acked);
-   tcp_sk(sk)->snd_cwnd_stamp = tcp_time_stamp;
+   tcp_sk(sk)->snd_cwnd_stamp = tcp_jiffies32;
 }
 
 /* Restart timer after forward progress on connection.
@@ -5019,7 +5019,7 @@ static void tcp_new_space(struct sock *sk)
 
if (tcp_should_expand_sndbuf(sk)) {
tcp_sndbuf_expand(sk);
-   tp->snd_cwnd_stamp = tcp_time_stamp;
+   tp->snd_cwnd_stamp = tcp_jiffies32;
}
 
sk->sk_write_space(sk);
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 
653bbd67e3a39b68d27d26d17571c00ce2854bfd..102b2c90bb807d3a88d31b59324baf72cf901cdf
 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -524,7 +524,7 @@ void tcp_init_metrics(struct sock *sk)
tp->snd_cwnd = 1;
else
tp->snd_cwnd = tcp_init_cwnd(tp, dst);
-   tp->snd_cwnd_stamp = tcp_time_stamp;
+   tp->snd_cwnd_stamp = tcp_jiffies32;
 }
 
 bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 
be9f8f483e21bdbb4d944fcdae8560f3ae11ee64..4bd50f0b236ba23fe521a76dd9d35ee16acb061f
 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -151,7 +151,7 @@ void tcp_cwnd_restart(struct sock *sk, s32 delta)
while ((delta -= inet_csk(sk)->icsk_rto) > 0 && cwnd > restart_cwnd)
cwnd >>= 1;
tp->snd_cwnd = max(cwnd, restart_cwnd);
-   tp->snd_cwnd_stamp = tcp_time_stamp;
+   tp->snd_cwnd_stamp = tcp_jiffies32;
tp->snd_cwnd_used = 0;
 }
 
@@ -1576,7 +1576,7 @@ static void tcp_cwnd_application_limited(struct sock *sk)
}
tp->snd_cwnd_used = 0;
}
-   tp->snd_cwnd_stamp = tcp_time_stamp;
+   tp->snd_cwnd_stamp = tcp_jiffies32;
 }
 
 static void tcp_cwnd_validate(struct sock *sk, bool is_cwnd_limited)
@@ -1597,14 +1597,14 @@ static void tcp_cwnd_validate(struct sock *sk, bool 
is_cwnd_limited)
if (tcp_is_cwnd_limited(sk)) {
/* Network is feed fully. */
tp->snd_cwnd_used = 0;
-   tp->snd_cwnd_stamp = tcp_time_stamp;
+   tp->snd_cwnd_stamp = tcp_jiffies32;
} else {
/* Network starves. */
if (tp->packets_out > tp->snd_cwnd_used)
tp->snd_cwnd_used = tp->packets_out;

[PATCH net-next 08/15] tcp: use tcp_jiffies32 for rcv_tstamp and lrcvtime

2017-05-16 Thread Eric Dumazet

Use tcp_jiffies32 instead of tcp_time_stamp, since
tcp_time_stamp will soon be only used for TCP TS option.

Signed-off-by: Eric Dumazet 
---
 include/net/tcp.h| 4 ++--
 net/ipv4/tcp_input.c | 6 +++---
 net/ipv4/tcp_minisocks.c | 2 +-
 net/ipv4/tcp_output.c| 2 +-
 net/ipv4/tcp_timer.c | 2 +-
 5 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 
feba4c0406e551d7e57da3411476735731b4d817..5b2932b8363fb8546322ebff7c74663139b3371d
 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1307,8 +1307,8 @@ static inline u32 keepalive_time_elapsed(const struct 
tcp_sock *tp)
 {
const struct inet_connection_sock *icsk = &tp->inet_conn;
 
-   return min_t(u32, tcp_time_stamp - icsk->icsk_ack.lrcvtime,
- tcp_time_stamp - tp->rcv_tstamp);
+   return min_t(u32, tcp_jiffies32 - icsk->icsk_ack.lrcvtime,
+ tcp_jiffies32 - tp->rcv_tstamp);
 }
 
 static inline int tcp_fin_time(const struct sock *sk)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 
6a15c9b80b09829799dc37d89ecdbf11ec9ff904..eeb4967df25a8dc35128d0a0848b5ae7ee6d63e3
 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -672,7 +672,7 @@ static void tcp_event_data_recv(struct sock *sk, struct 
sk_buff *skb)
 
tcp_rcv_rtt_measure(tp);
 
-   now = tcp_time_stamp;
+   now = tcp_jiffies32;
 
if (!icsk->icsk_ack.ato) {
/* The _first_ data packet received, initialize
@@ -3636,7 +3636,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff 
*skb, int flag)
 */
sk->sk_err_soft = 0;
icsk->icsk_probes_out = 0;
-   tp->rcv_tstamp = tcp_time_stamp;
+   tp->rcv_tstamp = tcp_jiffies32;
if (!prior_packets)
goto no_queue;
 
@@ -5554,7 +5554,7 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff 
*skb)
struct inet_connection_sock *icsk = inet_csk(sk);
 
tcp_set_state(sk, TCP_ESTABLISHED);
-   icsk->icsk_ack.lrcvtime = tcp_time_stamp;
+   icsk->icsk_ack.lrcvtime = tcp_jiffies32;
 
if (skb) {
icsk->icsk_af_ops->sk_rx_dst_set(sk, skb);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 
717be4de53248352c758b50557987d898340dd4f..59c32e0086c0e46d7955dffe211ec03bb18dcb12
 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -447,7 +447,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
newtp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
minmax_reset(&newtp->rtt_min, tcp_time_stamp, ~0U);
newicsk->icsk_rto = TCP_TIMEOUT_INIT;
-   newicsk->icsk_ack.lrcvtime = tcp_time_stamp;
+   newicsk->icsk_ack.lrcvtime = tcp_jiffies32;
 
newtp->packets_out = 0;
newtp->retrans_out = 0;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 
4bd50f0b236ba23fe521a76dd9d35ee16acb061f..cbda5de164495cf318960489bd8edf98fe3a5033
 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3324,7 +3324,7 @@ static void tcp_connect_init(struct sock *sk)
if (likely(!tp->repair))
tp->rcv_nxt = 0;
else
-   tp->rcv_tstamp = tcp_time_stamp;
+   tp->rcv_tstamp = tcp_jiffies32;
tp->rcv_wup = tp->rcv_nxt;
tp->copied_seq = tp->rcv_nxt;
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 
5f6f219a431e41a90b3c5d667a1a22b50f4464cf..9e0616cb8c17a6385ac97fc0cd657ef9413a1749
 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -451,7 +451,7 @@ void tcp_retransmit_timer(struct sock *sk)
tp->snd_una, tp->snd_nxt);
}
 #endif
-   if (tcp_time_stamp - tp->rcv_tstamp > TCP_RTO_MAX) {
+   if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
tcp_write_err(sk);
goto out;
}
-- 
2.13.0.303.g4ebf302169-goog

[PATCH net-next 03/15] dccp: do not use tcp_time_stamp

2017-05-16 Thread Eric Dumazet

Use our own macro instead of abusing tcp_time_stamp

Signed-off-by: Eric Dumazet 
---
 net/dccp/ccids/ccid2.c | 8 
 net/dccp/ccids/ccid2.h | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/dccp/ccids/ccid2.c b/net/dccp/ccids/ccid2.c
index 
5e3a7302f7747e4c4f3134eacab2f2c65b13402f..e1295d5f2c562e8785f59a0f5bd7064f471e85ab
 100644
--- a/net/dccp/ccids/ccid2.c
+++ b/net/dccp/ccids/ccid2.c
@@ -233,7 +233,7 @@ static void ccid2_hc_tx_packet_sent(struct sock *sk, 
unsigned int len)
 {
struct dccp_sock *dp = dccp_sk(sk);
struct ccid2_hc_tx_sock *hc = ccid2_hc_tx_sk(sk);
-   const u32 now = ccid2_time_stamp;
+   const u32 now = ccid2_jiffies32;
struct ccid2_seq *next;
 
/* slow-start after idle periods (RFC 2581, RFC 2861) */
@@ -466,7 +466,7 @@ static void ccid2_new_ack(struct sock *sk, struct ccid2_seq *seqp,
 * The cleanest solution is to not use the ccid2s_sent field at all
 * and instead use DCCP timestamps: requires changes in other places.
 */
-   ccid2_rtt_estimator(sk, ccid2_time_stamp - seqp->ccid2s_sent);
+   ccid2_rtt_estimator(sk, ccid2_jiffies32 - seqp->ccid2s_sent);
 }
 
 static void ccid2_congestion_event(struct sock *sk, struct ccid2_seq *seqp)
@@ -478,7 +478,7 @@ static void ccid2_congestion_event(struct sock *sk, struct ccid2_seq *seqp)
return;
}
 
-   hc->tx_last_cong = ccid2_time_stamp;
+   hc->tx_last_cong = ccid2_jiffies32;
 
hc->tx_cwnd  = hc->tx_cwnd / 2 ? : 1U;
hc->tx_ssthresh  = max(hc->tx_cwnd, 2U);
@@ -731,7 +731,7 @@ static int ccid2_hc_tx_init(struct ccid *ccid, struct sock *sk)
 
hc->tx_rto   = DCCP_TIMEOUT_INIT;
hc->tx_rpdupack  = -1;
-   hc->tx_last_cong = hc->tx_lsndtime = hc->tx_cwnd_stamp = ccid2_time_stamp;
+   hc->tx_last_cong = hc->tx_lsndtime = hc->tx_cwnd_stamp = ccid2_jiffies32;
hc->tx_cwnd_used = 0;
setup_timer(&hc->tx_rtotimer, ccid2_hc_tx_rto_expire,
(unsigned long)sk);
diff --git a/net/dccp/ccids/ccid2.h b/net/dccp/ccids/ccid2.h
index 18c97543e522a6b9a5c8a3c817d4b40224adde48..6e50ef2898fb9dd9080217cc167defea6a2e9021 100644
--- a/net/dccp/ccids/ccid2.h
+++ b/net/dccp/ccids/ccid2.h
@@ -27,7 +27,7 @@
  * CCID-2 timestamping faces the same issues as TCP timestamping.
  * Hence we reuse/share as much of the code as possible.
  */
-#define ccid2_time_stamp   tcp_time_stamp
+#define ccid2_jiffies32	((u32)jiffies)
 
 /* NUMDUPACK parameter from RFC 4341, p. 6 */
 #define NUMDUPACK  3
-- 
2.13.0.303.g4ebf302169-goog

[PATCH net-next 11/15] tcp: use tcp_jiffies32 in __tcp_oow_rate_limited()

2017-05-16 Thread Eric Dumazet

This place wants to use tcp_jiffies32; this is good enough.

Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_input.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index eeb4967df25a8dc35128d0a0848b5ae7ee6d63e3..85575888365a10643e096f9e019adaa3eda87d40 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3390,7 +3390,7 @@ static bool __tcp_oow_rate_limited(struct net *net, int mib_idx,
   u32 *last_oow_ack_time)
 {
if (*last_oow_ack_time) {
-   s32 elapsed = (s32)(tcp_time_stamp - *last_oow_ack_time);
+   s32 elapsed = (s32)(tcp_jiffies32 - *last_oow_ack_time);
 
if (0 <= elapsed && elapsed < sysctl_tcp_invalid_ratelimit) {
NET_INC_STATS(net, mib_idx);
@@ -3398,7 +3398,7 @@ static bool __tcp_oow_rate_limited(struct net *net, int mib_idx,
}
}
 
-   *last_oow_ack_time = tcp_time_stamp;
+   *last_oow_ack_time = tcp_jiffies32;
 
return false;   /* not rate-limited: go ahead, send dupack now! */
 }
-- 
2.13.0.303.g4ebf302169-goog
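
The check touched by this patch can be read on its own as a small rate limiter. Below is a minimal userspace sketch of that pattern, assuming the caller passes in a 32-bit tick value in place of tcp_jiffies32. The names rate_limited() and min_interval are illustrative only and do not exist in the kernel. As in the patched function, a stored value of 0 is treated as "no previous event".

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code.
 * Returns true when an event at tick 'now' arrives less than
 * 'min_interval' ticks after the last accepted event.
 * '*last' is only updated when the event is accepted.
 */
static bool rate_limited(uint32_t now, uint32_t *last, int32_t min_interval)
{
	if (*last) {
		/* The signed cast keeps the comparison valid across a 32-bit wrap. */
		int32_t elapsed = (int32_t)(now - *last);

		if (0 <= elapsed && elapsed < min_interval)
			return true;	/* too soon: suppress */
	}

	*last = now;
	return false;			/* accepted */
}
```

Because a rejected event does not refresh *last, a steady stream of events is still let through once per interval.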

[PATCH net-next 04/15] tcp: use tcp_jiffies32 to feed tp->lsndtime

2017-05-16 Thread Eric Dumazet

Use tcp_jiffies32 instead of tcp_time_stamp to feed
tp->lsndtime.

tcp_time_stamp will soon be a little bit more expensive
than simply reading 'jiffies'.

Signed-off-by: Eric Dumazet 
---
 include/net/tcp.h | 2 +-
 net/ipv4/tcp.c| 2 +-
 net/ipv4/tcp_cubic.c  | 2 +-
 net/ipv4/tcp_input.c  | 4 ++--
 net/ipv4/tcp_output.c | 4 ++--
 net/ipv4/tcp_timer.c  | 4 ++--
 6 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 4b45be5708215bae4551a5430b63ab2777baf447..feba4c0406e551d7e57da3411476735731b4d817 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1245,7 +1245,7 @@ static inline void tcp_slow_start_after_idle_check(struct sock *sk)
if (!sysctl_tcp_slow_start_after_idle || tp->packets_out ||
ca_ops->cong_control)
return;
-   delta = tcp_time_stamp - tp->lsndtime;
+   delta = tcp_jiffies32 - tp->lsndtime;
if (delta > inet_csk(sk)->icsk_rto)
tcp_cwnd_restart(sk, delta);
 }
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1e4c76d2b8278ba71d6cc2cf7ebfe483e241f76e..d0bb61ee28bbceff8f2e27416ce87fec94935973 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2841,7 +2841,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
info->tcpi_retrans = tp->retrans_out;
info->tcpi_fackets = tp->fackets_out;
 
-   now = tcp_time_stamp;
+   now = tcp_jiffies32;
info->tcpi_last_data_sent = jiffies_to_msecs(now - tp->lsndtime);
info->tcpi_last_data_recv = jiffies_to_msecs(now - icsk->icsk_ack.lrcvtime);
info->tcpi_last_ack_recv = jiffies_to_msecs(now - tp->rcv_tstamp);
diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index 0683ba447d775b6101a929a6aca3eb255cff8932..2052ca740916d0872a41125ab61b769b334a314b 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -155,7 +155,7 @@ static void bictcp_cwnd_event(struct sock *sk, enum tcp_ca_event event)
 {
if (event == CA_EVENT_TX_START) {
struct bictcp *ca = inet_csk_ca(sk);
-   u32 now = tcp_time_stamp;
+   u32 now = tcp_jiffies32;
s32 delta;
 
delta = now - tcp_sk(sk)->lsndtime;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 06e2dbc2b4a212a054fd88e57bb902c55a171b11..c0b3f909df394214785749704f2760171fe9d160 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5571,7 +5571,7 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
/* Prevent spurious tcp_cwnd_restart() on first data
 * packet.
 */
-   tp->lsndtime = tcp_time_stamp;
+   tp->lsndtime = tcp_jiffies32;
 
tcp_init_buffer_space(sk);
 
@@ -6008,7 +6008,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
tcp_update_pacing_rate(sk);
 
/* Prevent spurious tcp_cwnd_restart() on first data packet */
-   tp->lsndtime = tcp_time_stamp;
+   tp->lsndtime = tcp_jiffies32;
 
tcp_initialize_rcv_mss(sk);
tcp_fast_path_on(tp);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 4c8a6eaba6b39a2aea061dd6857ed8df954c5ca2..be9f8f483e21bdbb4d944fcdae8560f3ae11ee64 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -160,7 +160,7 @@ static void tcp_event_data_sent(struct tcp_sock *tp,
struct sock *sk)
 {
struct inet_connection_sock *icsk = inet_csk(sk);
-   const u32 now = tcp_time_stamp;
+   const u32 now = tcp_jiffies32;
 
if (tcp_packets_in_flight(tp) == 0)
tcp_ca_event(sk, CA_EVENT_TX_START);
@@ -1918,7 +1918,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
/* Avoid bursty behavior by allowing defer
 * only if the last write was recent.
 */
-   if ((s32)(tcp_time_stamp - tp->lsndtime) > 0)
+   if ((s32)(tcp_jiffies32 - tp->lsndtime) > 0)
goto send_now;
 
in_flight = tcp_packets_in_flight(tp);
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index ec7c5473c788d77ae459b38492f2f2606d00d1ba..5f6f219a431e41a90b3c5d667a1a22b50f4464cf 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -63,7 +63,7 @@ static int tcp_out_of_resources(struct sock *sk, bool do_reset)
 
/* If peer does not open window for long time, or did not transmit
 * anything for long time, penalize it. */
-   if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*TCP_RTO_MAX || !do_reset)
+   if ((s32)(tcp_jiffies32 - tp->lsndtime) > 2*TCP_RTO_MAX || !do_reset)
shift++;
 
/* If some dubious ICMP arrived, penalize even more. */
@@ -73,7 +73,7 @@ static int tcp_out_of_resources(struct sock *sk, bool do_reset)
if (tcp_check_oom(sk, shift)) {
/* Catch exceptional cases, when connection requires reset.

[PATCH net-next 09/15] tcp: use tcp_jiffies32 to feed probe_timestamp

2017-05-16 Thread Eric Dumazet

Use tcp_jiffies32 instead of tcp_time_stamp, since
tcp_time_stamp will soon be only used for TCP TS option.

Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_output.c | 6 +++---
 net/ipv4/tcp_timer.c  | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index cbda5de164495cf318960489bd8edf98fe3a5033..f0fd1b4fdb3291638fcdca613d826db2cd27f517 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1475,7 +1475,7 @@ void tcp_mtup_init(struct sock *sk)
icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, net->ipv4.sysctl_tcp_base_mss);
icsk->icsk_mtup.probe_size = 0;
if (icsk->icsk_mtup.enabled)
-   icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;
+   icsk->icsk_mtup.probe_timestamp = tcp_jiffies32;
 }
 EXPORT_SYMBOL(tcp_mtup_init);
 
@@ -1987,7 +1987,7 @@ static inline void tcp_mtu_check_reprobe(struct sock *sk)
s32 delta;
 
interval = net->ipv4.sysctl_tcp_probe_interval;
-   delta = tcp_time_stamp - icsk->icsk_mtup.probe_timestamp;
+   delta = tcp_jiffies32 - icsk->icsk_mtup.probe_timestamp;
if (unlikely(delta >= interval * HZ)) {
int mss = tcp_current_mss(sk);
 
@@ -1999,7 +1999,7 @@ static inline void tcp_mtu_check_reprobe(struct sock *sk)
icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
 
/* Update probe time stamp */
-   icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;
+   icsk->icsk_mtup.probe_timestamp = tcp_jiffies32;
}
 }
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 9e0616cb8c17a6385ac97fc0cd657ef9413a1749..6629f47aa7f0182ece7873afcc3daa6f0019e228 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -115,7 +115,7 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
if (net->ipv4.sysctl_tcp_mtu_probing) {
if (!icsk->icsk_mtup.enabled) {
icsk->icsk_mtup.enabled = 1;
-   icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;
+   icsk->icsk_mtup.probe_timestamp = tcp_jiffies32;
tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
} else {
struct net *net = sock_net(sk);
-- 
2.13.0.303.g4ebf302169-goog

[PATCH net-next 14/15] tcp: replace misc tcp_time_stamp to tcp_jiffies32

2017-05-16 Thread Eric Dumazet

After this patch, all uses of tcp_time_stamp will require
a change when we introduce 1 ms and/or 1 us TCP TS option.

Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp.c   | 2 +-
 net/ipv4/tcp_htcp.c  | 2 +-
 net/ipv4/tcp_input.c | 2 +-
 net/ipv4/tcp_minisocks.c | 2 +-
 net/ipv4/tcp_output.c| 4 ++--
 5 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index b85bfe7cb11dca68952cc4be19b169d893963fef..85005480052626c5769ef100a868c88fad803f75 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -386,7 +386,7 @@ void tcp_init_sock(struct sock *sk)
 
icsk->icsk_rto = TCP_TIMEOUT_INIT;
tp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
-   minmax_reset(&tp->rtt_min, tcp_time_stamp, ~0U);
+   minmax_reset(&tp->rtt_min, tcp_jiffies32, ~0U);
 
/* So many TCP implementations out there (incorrectly) count the
 * initial SYN frame in their delayed-ACK and congestion control
diff --git a/net/ipv4/tcp_htcp.c b/net/ipv4/tcp_htcp.c
index 4a4d8e76738fa2831dcc3ecec5924dd3dfb7bf58..3eb78cde6ff0a22b7b411f0ae4258b6ef74ffe73 100644
--- a/net/ipv4/tcp_htcp.c
+++ b/net/ipv4/tcp_htcp.c
@@ -104,7 +104,7 @@ static void measure_achieved_throughput(struct sock *sk,
const struct inet_connection_sock *icsk = inet_csk(sk);
const struct tcp_sock *tp = tcp_sk(sk);
struct htcp *ca = inet_csk_ca(sk);
-   u32 now = tcp_time_stamp;
+   u32 now = tcp_jiffies32;
 
if (icsk->icsk_ca_state == TCP_CA_Open)
ca->pkts_acked = sample->pkts_acked;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 85575888365a10643e096f9e019adaa3eda87d40..10e6775464f647a65ea0d19c10b421f9cd38923d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2911,7 +2911,7 @@ static void tcp_update_rtt_min(struct sock *sk, u32 rtt_us)
struct tcp_sock *tp = tcp_sk(sk);
u32 wlen = sysctl_tcp_min_rtt_wlen * HZ;
 
-   minmax_running_min(&tp->rtt_min, wlen, tcp_time_stamp,
+   minmax_running_min(&tp->rtt_min, wlen, tcp_jiffies32,
   rtt_us ? : jiffies_to_usecs(1));
 }
 
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 59c32e0086c0e46d7955dffe211ec03bb18dcb12..6504f1082bdfda77bfc1b53d0d85928e5083a24e 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -445,7 +445,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
 
newtp->srtt_us = 0;
newtp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
-   minmax_reset(&newtp->rtt_min, tcp_time_stamp, ~0U);
+   minmax_reset(&newtp->rtt_min, tcp_jiffies32, ~0U);
newicsk->icsk_rto = TCP_TIMEOUT_INIT;
newicsk->icsk_ack.lrcvtime = tcp_jiffies32;
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1011ea40c2ba4c12cce21149cab176e1fa4db583..65472e931a0b79f7078a4da7db802dfcc32c7621 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2418,10 +2418,10 @@ bool tcp_schedule_loss_probe(struct sock *sk)
timeout = max_t(u32, timeout, msecs_to_jiffies(10));
 
/* If RTO is shorter, just schedule TLP in its place. */
-   tlp_time_stamp = tcp_time_stamp + timeout;
+   tlp_time_stamp = tcp_jiffies32 + timeout;
rto_time_stamp = (u32)inet_csk(sk)->icsk_timeout;
if ((s32)(tlp_time_stamp - rto_time_stamp) > 0) {
-   s32 delta = rto_time_stamp - tcp_time_stamp;
+   s32 delta = rto_time_stamp - tcp_jiffies32;
if (delta > 0)
timeout = delta;
}
-- 
2.13.0.303.g4ebf302169-goog
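
The last hunk above compares two deadlines that are both kept as 32-bit tick values. Here is a minimal userspace sketch of that clamp, assuming all three inputs are expressed in the same tick unit. The function name clamp_probe_timeout() and its parameters are hypothetical and only mirror the arithmetic in the hunk.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code.
 * If a probe scheduled at now + timeout would fire after the pending
 * RTO deadline, shorten the timeout so the probe fires at the RTO
 * deadline instead. Deadlines already in the past leave the timeout
 * untouched.
 */
static uint32_t clamp_probe_timeout(uint32_t now, uint32_t timeout,
				    uint32_t rto_deadline)
{
	uint32_t probe_deadline = now + timeout;

	/* Signed difference: valid while the deadlines are < 2^31 ticks apart. */
	if ((int32_t)(probe_deadline - rto_deadline) > 0) {
		int32_t delta = (int32_t)(rto_deadline - now);

		if (delta > 0)
			timeout = (uint32_t)delta;
	}
	return timeout;
}
```

The unsigned subtraction followed by a signed cast is what lets the comparison survive a wrap of the 32-bit counter.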

[PATCH net-next 02/15] tcp: introduce tcp_jiffies32

2017-05-16 Thread Eric Dumazet

We abuse tcp_time_stamp for two different cases :

1) base to generate TCP Timestamp options (RFC 7323)

2) A 32bit version of jiffies since some TCP fields
   are 32bit wide to save memory.

Since we want a 1 ms TCP TS clock in the future,
regardless of the HZ value, we want to clean things up.

tcp_jiffies32 is the truncated jiffies value,
which will be used only in places where we want a 'host'
timestamp.

Signed-off-by: Eric Dumazet 
---
 include/net/tcp.h | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index b4dc93dae98c2d175ccadce150083705d237555e..4b45be5708215bae4551a5430b63ab2777baf447 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -700,11 +700,14 @@ u32 __tcp_select_window(struct sock *sk);
 
 void tcp_send_window_probe(struct sock *sk);
 
-/* TCP timestamps are only 32-bits, this causes a slight
- * complication on 64-bit systems since we store a snapshot
- * of jiffies in the buffer control blocks below.  We decided
- * to use only the low 32-bits of jiffies and hide the ugly
- * casts with the following macro.
+/* TCP uses 32bit jiffies to save some space.
+ * Note that this is different from tcp_time_stamp, which
+ * historically has been the same until linux-4.13.
+ */
+#define tcp_jiffies32 ((u32)jiffies)
+
+/* Generator for TCP TS option (RFC 7323)
+ * Currently tied to 'jiffies' but will soon be driven by 1 ms clock.
  */
 #define tcp_time_stamp ((__u32)(jiffies))
 
-- 
2.13.0.303.g4ebf302169-goog
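
For readers less used to truncated tick counters, here is a minimal userspace sketch of why a 32-bit copy of a wider counter stays usable across a wrap. It assumes a fake 64-bit counter (fake_jiffies) standing in for the kernel's jiffies. All names here are illustrative and are not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's 'jiffies' counter. */
static uint64_t fake_jiffies;

/* Userspace analogue of the tcp_jiffies32 idea: keep the low 32 bits. */
static uint32_t ticks32(void)
{
	return (uint32_t)fake_jiffies;
}

/* Ticks elapsed since 'then'; modular arithmetic absorbs one wrap. */
static uint32_t ticks32_since(uint32_t then)
{
	return ticks32() - then;
}

/* True when 'a' is later than 'b', if they are < 2^31 ticks apart. */
static bool ticks32_after(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}
```

Fields that store such a value only need 4 bytes, at the cost of the "less than 2^31 ticks apart" assumption in every comparison.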

[PATCH net-next 00/15] tcp: TCP TS option use 1 ms clock

2017-05-16 Thread Eric Dumazet

TCP Timestamps option is defined in RFC 7323

Traditionally on Linux, it has been tied to the internal
'jiffies' variable, because that was a cheap and good enough
generator.

Unfortunately some distros use HZ=250 or even HZ=100 (4 ms or 10 ms
per tick), leading to not very useful TCP timestamps.

For TCP flows in the DC, Google has used usec resolution for more
than two years with great success [1].
RCVBUF autotuning is more precise.

This series converts tp->tcp_mstamp to a plain u64 value storing
a 1 usec TCP clock.

This choice will allow us to upstream the 1 usec TS option as
discussed in IETF 97.

Kathleen Nichols [2] and others advocate for 1ms TS clocks for
network analysis. (1ms being the lowest value supported by RFC 7323.)

[1] https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf
[2] http://netseminar.stanford.edu/seminars/02_02_17.pdf

Eric Dumazet (15):
  tcp: use tp->tcp_mstamp in output path
  tcp: introduce tcp_jiffies32
  dccp: do not use tcp_time_stamp
  tcp: use tcp_jiffies32 to feed tp->lsndtime
  tcp: use tcp_jiffies32 to feed tp->snd_cwnd_stamp
  tcp_bbr: use tcp_jiffies32 instead of tcp_time_stamp
  tcp: bic,cubic: use tcp_jiffies32 instead of tcp_time_stamp
  tcp: use tcp_jiffies32 for rcv_tstamp and lrcvtime
  tcp: use tcp_jiffies32 to feed probe_timestamp
  tcp: uses jiffies_32 to feed tp->chrono_start
  tcp: use tcp_jiffies32 in __tcp_oow_rate_limited()
  tcp_westwood: use tcp_jiffies32 instead of tcp_time_stamp
  tcp_lp: cache tcp_time_stamp
  tcp: replace misc tcp_time_stamp to tcp_jiffies32
  tcp: switch TCP TS option (RFC 7323) to 1ms clock

 include/linux/skbuff.h   |  62 +--
 include/linux/tcp.h  |  22 +++
 include/net/tcp.h|  74 ++-
 net/dccp/ccids/ccid2.c   |   8 +--
 net/dccp/ccids/ccid2.h   |   2 +-
 net/ipv4/syncookies.c|   8 +--
 net/ipv4/tcp.c   |  10 ++--
 net/ipv4/tcp_bbr.c   |  34 +--
 net/ipv4/tcp_bic.c   |   6 +-
 net/ipv4/tcp_cubic.c |  14 ++---
 net/ipv4/tcp_htcp.c  |   2 +-
 net/ipv4/tcp_input.c | 126 +++
 net/ipv4/tcp_ipv4.c  |  16 ++---
 net/ipv4/tcp_lp.c|  17 +++---
 net/ipv4/tcp_metrics.c   |   2 +-
 net/ipv4/tcp_minisocks.c |   8 +--
 net/ipv4/tcp_output.c|  51 
 net/ipv4/tcp_rate.c  |  16 ++---
 net/ipv4/tcp_recovery.c  |  24 
 net/ipv4/tcp_timer.c |  17 +++---
 net/ipv4/tcp_westwood.c  |   6 +-
 net/ipv6/syncookies.c|   2 +-
 net/ipv6/tcp_ipv6.c  |   4 +-
 net/netfilter/nf_synproxy_core.c |   2 +-
 24 files changed, 259 insertions(+), 274 deletions(-)

-- 
2.13.0.303.g4ebf302169-goog
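
To make the resolution argument concrete: a jiffies-driven TS clock advances once every 1000/HZ ms, while a 1 usec clock can be scaled down to a 1 ms TS value. The sketch below assumes a plain division by 1000 and truncation to 32 bits. The helper names are made up for illustration, and the series itself may derive the value differently.

```c
#include <stdint.h>

#define DEMO_USEC_PER_MSEC 1000ULL

/* Milliseconds per tick for a jiffies-driven clock; hz must be non-zero. */
static unsigned int jiffy_resolution_ms(unsigned int hz)
{
	return 1000U / hz;
}

/*
 * Hypothetical conversion from a 64-bit microsecond clock to a
 * 32-bit, 1 ms timestamp value. The result wraps modulo 2^32 ms.
 */
static uint32_t ts_ms_from_usec(uint64_t usec_clock)
{
	return (uint32_t)(usec_clock / DEMO_USEC_PER_MSEC);
}
```

With HZ=100 the jiffies-based value only changes every 10 ms, whereas the converted value changes every millisecond regardless of HZ.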

[PATCH] netlink: Change rtnl_dump_done to always show error

2017-05-16 Thread David Ahern

The original code which became rtnl_dump_done only shows netlink errors
if the protocol is NETLINK_SOCK_DIAG, but netlink dumps always append
the length, which contains any error encountered during the dump. Update
rtnl_dump_done to always show the error if there is one.

As an *example* without this patch, dumping a route object that exceeds
the internal buffer size terminates with no message to the user -- the
dump just ends because the NLMSG_DONE attribute was received. With this
patch the user at least gets a message that the dump was aborted.

$ ip ro ls
default via 10.0.2.2 dev eth0
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15
10.10.0.0/16 dev veth1 proto kernel scope link src 10.10.0.1
172.16.1.0/24 dev br0.11 proto kernel scope link src 172.16.1.1
Error: Buffer too small for object
Dump terminated

The point of this patch is to notify the user of a failure versus
silently exiting on a partial dump. Because the NLMSG_DONE attribute
was received, the entire dump needs to be restarted to use a larger
buffer for EMSGSIZE errors. That could be done automatically but it
has other user impacts (e.g., duplicate output if the dump is
restarted) and should be the subject of a different patch.

Signed-off-by: David Ahern 
---
 lib/libnetlink.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/lib/libnetlink.c b/lib/libnetlink.c
index 5b75b2db4e0b..d4b831f67ea2 100644
--- a/lib/libnetlink.c
+++ b/lib/libnetlink.c
@@ -266,21 +266,27 @@ static int rtnl_dump_done(const struct rtnl_handle *rth,
 {
int len = *(int *)NLMSG_DATA(h);
 
-   if (rth->proto == NETLINK_SOCK_DIAG) {
-   if (h->nlmsg_len < NLMSG_LENGTH(sizeof(int))) {
-   fprintf(stderr, "DONE truncated\n");
-   return -1;
-   }
-
+   if (h->nlmsg_len < NLMSG_LENGTH(sizeof(int))) {
+   fprintf(stderr, "DONE truncated\n");
+   return -1;
+   }
 
-   if (len < 0) {
-   errno = -len;
-   if (errno == ENOENT || errno == EOPNOTSUPP)
-   return -1;
+   if (len < 0) {
+   errno = -len;
+   switch (errno) {
+   case ENOENT:
+   case EOPNOTSUPP:
+   return -1;
+   case EMSGSIZE:
+   fprintf(stderr,
+   "Error: Buffer too small for object.\n");
+   break;
+   default:
perror("RTNETLINK answers");
-   return len;
}
+   return len;
}
+
return 0;
 }
 
-- 
2.11.0 (Apple Git-81)
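
The error handling in the new rtnl_dump_done can be read separately from the netlink parsing. Below is a minimal userspace sketch of just that decision, assuming the integer from the DONE payload has already been extracted. report_dump_done() is a hypothetical name. It takes the output stream as a parameter and uses strerror() instead of perror() so it can be exercised without a netlink socket.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not iproute2 code.
 * 'len' is the integer carried in the DONE message: a negative value
 * is a negated errno from the kernel, anything else means success.
 * Returns 0 on success, -1 for "quiet" errors, and the negative
 * value itself after printing a message for all other errors.
 */
static int report_dump_done(int len, FILE *out)
{
	if (len >= 0)
		return 0;

	switch (-len) {
	case ENOENT:
	case EOPNOTSUPP:
		return -1;	/* caller decides how to report these */
	case EMSGSIZE:
		fprintf(out, "Error: Buffer too small for object.\n");
		break;
	default:
		fprintf(out, "RTNETLINK answers: %s\n", strerror(-len));
	}
	return len;
}
```

Passing the stream in keeps the sketch testable. The patch itself writes to stderr directly.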

Re: [PATCH net-next] net: dsa: store CPU port pointer in the tree

2017-05-16 Thread Florian Fainelli

On 05/16/2017 11:10 AM, Vivien Didelot wrote:
> A dsa_switch_tree instance holds a dsa_switch pointer and a port index
> to identify the switch port to which the CPU is attached.
> 
> Now that the DSA layer has a dsa_port structure to hold this data, use
> it to point the switch CPU port.
> 
> This patch simply substitutes s/dst->cpu_switch/dst->cpu_dp->ds/ and
> s/dst->cpu_port/dst->cpu_dp->index/.
> 
> Signed-off-by: Vivien Didelot 

Reviewed-by: Florian Fainelli 
--
Florian

Re: [PATCH net-next 0/2] net: dsa: Sort various lists

2017-05-16 Thread Vivien Didelot

Hi Andrew,

Andrew Lunn  writes:

> As we gain more DSA drivers and tagging protocols, the lists are
> getting a bit unruly. Do some sorting.

I'm glad to see that I'm not the only one picky about alphabetically
ordering (when possible) files, function names and so on.

Thanks!

Vivien

Re: [PATCH net-next 2/2] drivers: net: DSA: Sort drivers

2017-05-16 Thread Florian Fainelli

On 05/16/2017 01:40 PM, Andrew Lunn wrote:
> With more drivers being added, it is time to sort the drivers to
> impose some order.
> 
> Signed-off-by: Andrew Lunn 

Reviewed-by: Florian Fainelli 
-- 
Florian

Re: [PATCH net-next 2/2] drivers: net: DSA: Sort drivers

2017-05-16 Thread Vivien Didelot

Andrew Lunn  writes:

> With more drivers being added, it is time to sort the drivers to
> impose some order.
>
> Signed-off-by: Andrew Lunn 

Reviewed-by: Vivien Didelot

Re: [PATCH net-next 1/2] net: dsa: Sort DSA tagging protocol drivers

2017-05-16 Thread Vivien Didelot

Andrew Lunn  writes:

> With more tag protocols being added, regain some order by sorting the
> entries in various places.
>
> Signed-off-by: Andrew Lunn 

Reviewed-by: Vivien Didelot

Re: [patch net-next v3 01/10] net: sched: move tc_classify function to cls_api.c

2017-05-16 Thread Jiri Pirko

Tue, May 16, 2017 at 11:03:14PM CEST, xiyou.wangc...@gmail.com wrote:
>On Tue, May 16, 2017 at 2:00 PM, Jiri Pirko  wrote:
>> Tue, May 16, 2017 at 10:25:35PM CEST, xiyou.wangc...@gmail.com wrote:
>>>On Tue, May 16, 2017 at 10:27 AM, Jiri Pirko  wrote:
>>>> From: Jiri Pirko 
>>>>
>>>> Move tc_classify function to cls_api.c where it belongs, rename it to
>>>> fit the namespace.
>>>>
>>>
>>>It is not a pure move, you silently remove the CONFIG_NET_CLS_ACT
>>>macros in tc_classify(). Probably not buggy, just redundancy when
>>>actions are not compiled.
>>
>> Please see include/net/pkt_cls.h in this patch.
>>
>> If CONFIG_NET_CLS_ACT is not defined, there is a stub there.
>
>I am sure it is not NET_CLS_ACT:

Oh, will fix this. Thanks.

>
>
>#ifdef CONFIG_NET_CLS
> void tcf_destroy_chain(struct tcf_proto __rcu **fl);
>+int tcf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
>+struct tcf_result *res, bool compat_mode);
>+
> #else
> static inline void tcf_destroy_chain(struct tcf_proto __rcu **fl)
> {
> }
>+
>+static inline int tcf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
>+  struct tcf_result *res, bool compat_mode)
>+{
>+   return TC_ACT_UNSPEC;
>+}
> #endif

Re: [patch net-next v3 01/10] net: sched: move tc_classify function to cls_api.c

2017-05-16 Thread Cong Wang

On Tue, May 16, 2017 at 2:00 PM, Jiri Pirko  wrote:
> Tue, May 16, 2017 at 10:25:35PM CEST, xiyou.wangc...@gmail.com wrote:
>>On Tue, May 16, 2017 at 10:27 AM, Jiri Pirko  wrote:
>>> From: Jiri Pirko 
>>>
>>> Move tc_classify function to cls_api.c where it belongs, rename it to
>>> fit the namespace.
>>>
>>
>>It is not a pure move, you silently remove the CONFIG_NET_CLS_ACT
>>macros in tc_classify(). Probably not buggy, just redundancy when
>>actions are not compiled.
>
> Please see include/net/pkt_cls.h in this patch.
>
> If CONFIG_NET_CLS_ACT is not defined, there is a stub there.

I am sure it is not NET_CLS_ACT:


#ifdef CONFIG_NET_CLS
 void tcf_destroy_chain(struct tcf_proto __rcu **fl);
+int tcf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
+struct tcf_result *res, bool compat_mode);
+
 #else
 static inline void tcf_destroy_chain(struct tcf_proto __rcu **fl)
 {
 }
+
+static inline int tcf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
+  struct tcf_result *res, bool compat_mode)
+{
+   return TC_ACT_UNSPEC;
+}
 #endif
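
The point about which config symbol guards the stub can be illustrated with a reduced example. This is a sketch only: CONFIG_DEMO_CLS, demo_classify() and DEMO_ACT_UNSPEC are invented names that mimic the shape of the header above, and the real declaration branch is never built here because the symbol is left undefined.

```c
/* Left undefined on purpose so the stub branch is the one compiled. */
/* #define CONFIG_DEMO_CLS 1 */

#define DEMO_ACT_UNSPEC (-1)

#ifdef CONFIG_DEMO_CLS
/* Real implementation would live in a separate .c file. */
int demo_classify(int pkt);
#else
/* Stub used when the feature is compiled out: callers need no #ifdef. */
static inline int demo_classify(int pkt)
{
	(void)pkt;
	return DEMO_ACT_UNSPEC;
}
#endif
```

The stub only stands in for the symbol that guards it, so code that used to be wrapped in a different symbol is not covered by it automatically.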

Re: [patch net-next v3 05/10] net: sched: move TC_H_MAJ macro call into tcf_auto_prio

2017-05-16 Thread Jiri Pirko

Tue, May 16, 2017 at 11:01:52PM CEST, xiyou.wangc...@gmail.com wrote:
>On Tue, May 16, 2017 at 10:27 AM, Jiri Pirko  wrote:
>> From: Jiri Pirko 
>>
>> Call the helper from the function rather than to always adjust the
>> return value of the function.
>
>And rename the function name to reflect this change?

? What do you suggest?

Re: [patch net-next v3 05/10] net: sched: move TC_H_MAJ macro call into tcf_auto_prio

2017-05-16 Thread Cong Wang

On Tue, May 16, 2017 at 10:27 AM, Jiri Pirko  wrote:
> From: Jiri Pirko 
>
> Call the helper from the function rather than to always adjust the
> return value of the function.

And rename the function name to reflect this change?

Re: [patch net-next v3 01/10] net: sched: move tc_classify function to cls_api.c

2017-05-16 Thread Jiri Pirko

Tue, May 16, 2017 at 10:25:35PM CEST, xiyou.wangc...@gmail.com wrote:
>On Tue, May 16, 2017 at 10:27 AM, Jiri Pirko  wrote:
>> From: Jiri Pirko 
>>
>> Move tc_classify function to cls_api.c where it belongs, rename it to
>> fit the namespace.
>>
>
>It is not a pure move, you silently remove the CONFIG_NET_CLS_ACT
>macros in tc_classify(). Probably not buggy, just redundancy when
>actions are not compiled.

Please see include/net/pkt_cls.h in this patch.

If CONFIG_NET_CLS_ACT is not defined, there is a stub there.

Re: [patch net-next v3 02/10] net: sched: introduce tcf block infractructure

2017-05-16 Thread Jiri Pirko

Tue, May 16, 2017 at 10:51:30PM CEST, xiyou.wangc...@gmail.com wrote:
>On Tue, May 16, 2017 at 10:27 AM, Jiri Pirko  wrote:
>> +int tcf_block_get(struct tcf_block **p_block,
>> + struct tcf_proto __rcu **p_filter_chain)
>> +{
>> +   struct tcf_block *block = kzalloc(sizeof(*block), GFP_KERNEL);
>> +
>> +   if (!block)
>> +   return -ENOMEM;
>> +   block->p_filter_chain = p_filter_chain;
>> +   *p_block = block;
>> +   return 0;
>> +}
>> +EXPORT_SYMBOL(tcf_block_get);
>
>
>XXX_get() is usually for refcnt'ing, here you only allocate
>a block, so please rename it to tcf_block_alloc().

I already replied to the same comment from Jamal.


>
>
>> +
>> +void tcf_block_put(struct tcf_block *block)
>> +{
>> +   if (!block)
>> +   return;
>> +   tcf_destroy_chain(block->p_filter_chain);
>> +   kfree(block);
>> +}
>> +EXPORT_SYMBOL(tcf_block_put);
>
>Ditto, tcf_block_destroy().

Re: [patch net-next v3 02/10] net: sched: introduce tcf block infractructure

2017-05-16 Thread Cong Wang

On Tue, May 16, 2017 at 10:27 AM, Jiri Pirko  wrote:
> +int tcf_block_get(struct tcf_block **p_block,
> + struct tcf_proto __rcu **p_filter_chain)
> +{
> +   struct tcf_block *block = kzalloc(sizeof(*block), GFP_KERNEL);
> +
> +   if (!block)
> +   return -ENOMEM;
> +   block->p_filter_chain = p_filter_chain;
> +   *p_block = block;
> +   return 0;
> +}
> +EXPORT_SYMBOL(tcf_block_get);


XXX_get() is usually for refcnt'ing, here you only allocate
a block, so please rename it to tcf_block_alloc().


> +
> +void tcf_block_put(struct tcf_block *block)
> +{
> +   if (!block)
> +   return;
> +   tcf_destroy_chain(block->p_filter_chain);
> +   kfree(block);
> +}
> +EXPORT_SYMBOL(tcf_block_put);

Ditto, tcf_block_destroy().
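
For reference, the naming convention being invoked looks roughly like this in a reduced form. This is a generic sketch with invented names (struct demo_block and its helpers), not the tcf_block code under review: alloc creates the object, while get and put only move a reference count.

```c
#include <stdlib.h>

/* Hypothetical refcounted object, not the tcf_block from the patch. */
struct demo_block {
	unsigned int refcnt;
};

/* Allocate a new object holding one reference; NULL on failure. */
static struct demo_block *demo_block_alloc(void)
{
	struct demo_block *b = calloc(1, sizeof(*b));

	if (b)
		b->refcnt = 1;
	return b;
}

/* Take an extra reference on an existing object. */
static struct demo_block *demo_block_get(struct demo_block *b)
{
	b->refcnt++;
	return b;
}

/* Drop a reference; returns 1 if this freed the object, else 0. */
static int demo_block_put(struct demo_block *b)
{
	if (--b->refcnt)
		return 0;
	free(b);
	return 1;
}
```

Under this convention a function that always allocates would be named alloc, and one that always frees would be named destroy.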

Re: [PATCH v2 net-next 06/12] ep93xx_eth: add GRO support

2017-05-16 Thread Eric Dumazet

On Tue, May 16, 2017 at 1:41 PM, Alexander Sverdlin wrote:

> It turns out I've already used this patch for two weeks in 4.11, but I've
> spent a couple of hours now torturing the new driver and was not able to
> provoke any inadequate behavior. It either receives all packets in time or
> not at all. If IRQs were edge-triggered, I'd expect some stale packets, which
> do not arrive at first, but then appear with the packets coming next. This is
> not the case. I've used the pktgen module for this, with minimal packets and
> different bursts.
>
> netperf shows 45Mbit/s on the UDP_STREAM test, which is also a fair amount
> for a 200MHz CPU.
>
> So, I see no problems with the change.
>

Thanks a lot for testing!

Re: [PATCH net-next] net: dsa: store CPU port pointer in the tree

2017-05-16 Thread Andrew Lunn

On Tue, May 16, 2017 at 02:10:33PM -0400, Vivien Didelot wrote:
> A dsa_switch_tree instance holds a dsa_switch pointer and a port index
> to identify the switch port to which the CPU is attached.
> 
> Now that the DSA layer has a dsa_port structure to hold this data, use
> it to point the switch CPU port.
> 
> This patch simply substitutes s/dst->cpu_switch/dst->cpu_dp->ds/ and
> s/dst->cpu_port/dst->cpu_dp->index/.
> 
> Signed-off-by: Vivien Didelot 

Reviewed-by: Andrew Lunn 

Andrew

[PATCH net-next 1/2] net: dsa: Sort DSA tagging protocol drivers

2017-05-16 Thread Andrew Lunn

With more tag protocols being added, regain some order by sorting the
entries in various places.

Signed-off-by: Andrew Lunn 
---
 include/net/dsa.h  |  8 ++++----
 net/dsa/Kconfig    |  8 ++++----
 net/dsa/Makefile   |  6 +++---
 net/dsa/dsa.c      | 18 +++++++++---------
 net/dsa/dsa_priv.h | 18 +++++++++---------
 5 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 8e24677b1c62..f5b3ab645624 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -27,13 +27,13 @@ struct fixed_phy_status;
 
 enum dsa_tag_protocol {
DSA_TAG_PROTO_NONE = 0,
+   DSA_TAG_PROTO_BRCM,
DSA_TAG_PROTO_DSA,
-   DSA_TAG_PROTO_TRAILER,
DSA_TAG_PROTO_EDSA,
-   DSA_TAG_PROTO_BRCM,
-   DSA_TAG_PROTO_QCA,
-   DSA_TAG_PROTO_MTK,
DSA_TAG_PROTO_LAN9303,
+   DSA_TAG_PROTO_MTK,
+   DSA_TAG_PROTO_QCA,
+   DSA_TAG_PROTO_TRAILER,
DSA_TAG_LAST,   /* MUST BE LAST */
 };
 
diff --git a/net/dsa/Kconfig b/net/dsa/Kconfig
index 81a0868edb1d..297389b2ab35 100644
--- a/net/dsa/Kconfig
+++ b/net/dsa/Kconfig
@@ -25,16 +25,16 @@ config NET_DSA_TAG_DSA
 config NET_DSA_TAG_EDSA
bool
 
-config NET_DSA_TAG_TRAILER
+config NET_DSA_TAG_LAN9303
bool
 
-config NET_DSA_TAG_QCA
+config NET_DSA_TAG_MTK
bool
 
-config NET_DSA_TAG_MTK
+config NET_DSA_TAG_TRAILER
bool
 
-config NET_DSA_TAG_LAN9303
+config NET_DSA_TAG_QCA
bool
 
 endif
diff --git a/net/dsa/Makefile b/net/dsa/Makefile
index 0b747d75e65a..f8c0251d1f43 100644
--- a/net/dsa/Makefile
+++ b/net/dsa/Makefile
@@ -6,7 +6,7 @@ dsa_core-y += dsa.o slave.o dsa2.o switch.o legacy.o
 dsa_core-$(CONFIG_NET_DSA_TAG_BRCM) += tag_brcm.o
 dsa_core-$(CONFIG_NET_DSA_TAG_DSA) += tag_dsa.o
 dsa_core-$(CONFIG_NET_DSA_TAG_EDSA) += tag_edsa.o
-dsa_core-$(CONFIG_NET_DSA_TAG_TRAILER) += tag_trailer.o
-dsa_core-$(CONFIG_NET_DSA_TAG_QCA) += tag_qca.o
-dsa_core-$(CONFIG_NET_DSA_TAG_MTK) += tag_mtk.o
 dsa_core-$(CONFIG_NET_DSA_TAG_LAN9303) += tag_lan9303.o
+dsa_core-$(CONFIG_NET_DSA_TAG_MTK) += tag_mtk.o
+dsa_core-$(CONFIG_NET_DSA_TAG_QCA) += tag_qca.o
+dsa_core-$(CONFIG_NET_DSA_TAG_TRAILER) += tag_trailer.o
diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index 26130ae438da..c0a1307c87dd 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -40,26 +40,26 @@ static const struct dsa_device_ops none_ops = {
 };
 
 const struct dsa_device_ops *dsa_device_ops[DSA_TAG_LAST] = {
+#ifdef CONFIG_NET_DSA_TAG_BRCM
+   [DSA_TAG_PROTO_BRCM] = &brcm_netdev_ops,
+#endif
 #ifdef CONFIG_NET_DSA_TAG_DSA
[DSA_TAG_PROTO_DSA] = &dsa_netdev_ops,
 #endif
 #ifdef CONFIG_NET_DSA_TAG_EDSA
[DSA_TAG_PROTO_EDSA] = &edsa_netdev_ops,
 #endif
-#ifdef CONFIG_NET_DSA_TAG_TRAILER
-   [DSA_TAG_PROTO_TRAILER] = &trailer_netdev_ops,
+#ifdef CONFIG_NET_DSA_TAG_LAN9303
+   [DSA_TAG_PROTO_LAN9303] = &lan9303_netdev_ops,
 #endif
-#ifdef CONFIG_NET_DSA_TAG_BRCM
-   [DSA_TAG_PROTO_BRCM] = &brcm_netdev_ops,
+#ifdef CONFIG_NET_DSA_TAG_MTK
+   [DSA_TAG_PROTO_MTK] = &mtk_netdev_ops,
 #endif
 #ifdef CONFIG_NET_DSA_TAG_QCA
[DSA_TAG_PROTO_QCA] = &qca_netdev_ops,
 #endif
-#ifdef CONFIG_NET_DSA_TAG_MTK
-   [DSA_TAG_PROTO_MTK] = &mtk_netdev_ops,
-#endif
-#ifdef CONFIG_NET_DSA_TAG_LAN9303
-   [DSA_TAG_PROTO_LAN9303] = &lan9303_netdev_ops,
+#ifdef CONFIG_NET_DSA_TAG_TRAILER
+   [DSA_TAG_PROTO_TRAILER] = &trailer_netdev_ops,
 #endif
[DSA_TAG_PROTO_NONE] = &none_ops,
 };
 };
diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index f4a88e485213..e9003b79cbbc 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -75,25 +75,25 @@ void dsa_slave_unregister_notifier(void);
 int dsa_switch_register_notifier(struct dsa_switch *ds);
 void dsa_switch_unregister_notifier(struct dsa_switch *ds);
 
+/* tag_brcm.c */
+extern const struct dsa_device_ops brcm_netdev_ops;
+
 /* tag_dsa.c */
 extern const struct dsa_device_ops dsa_netdev_ops;
 
 /* tag_edsa.c */
 extern const struct dsa_device_ops edsa_netdev_ops;
 
-/* tag_trailer.c */
-extern const struct dsa_device_ops trailer_netdev_ops;
+/* tag_lan9303.c */
+extern const struct dsa_device_ops lan9303_netdev_ops;
 
-/* tag_brcm.c */
-extern const struct dsa_device_ops brcm_netdev_ops;
+/* tag_mtk.c */
+extern const struct dsa_device_ops mtk_netdev_ops;
 
 /* tag_qca.c */
 extern const struct dsa_device_ops qca_netdev_ops;
 
-/* tag_mtk.c */
-extern const struct dsa_device_ops mtk_netdev_ops;
-
-/* tag_lan9303.c */
-extern const struct dsa_device_ops lan9303_netdev_ops;
+/* tag_trailer.c */
+extern const struct dsa_device_ops trailer_netdev_ops;
 
 #endif
-- 
2.11.0

Re: [PATCH v2 net-next 06/12] ep93xx_eth: add GRO support

2017-05-16 Thread Alexander Sverdlin

Hello all,

On 15/05/17 23:02, Alexander Sverdlin wrote:
>>> I don't know if we really care about this hardware anymore (I don't),
>>> but the ep93xx platform is still listed as being maintained in the
>>> MAINTAINERS file -- adding Ryan and Hartley.
>> I no longer have any ep93xx hardware to test with, and I never looked at
>> the Ethernet, so don't know the details. I think there are still a
>> handful of users. Adding Alexander who sent an ADC support series this
>> week, who might be able to test this?
> Yes, I very much care about ep93xx code being functional :)
> I'll test the patches tomorrow.

It turns out I have already been running this patch for two weeks on 4.11, but
I've now spent a couple of hours torturing the new driver and was not able to
provoke any incorrect behavior. It either receives all packets in time or not at all.
If the IRQs were edge-triggered, I'd expect some stale packets, which do not
arrive at first but then appear with the packets coming next. This is not
the case. I've used the pktgen module for this, with minimal packets and
different bursts.

netperf shows 45Mbit/s on the UDP_STREAM test, which is also a fair amount for
a 200MHz CPU.

So, I see no problems with the change.

--
Alexander.

[PATCH net-next 0/2] net: dsa: Sort various lists

2017-05-16 Thread Andrew Lunn

As we gain more DSA drivers and tagging protocols, the lists are
getting a bit unruly. Do some sorting.

Andrew Lunn (2):
  net: dsa: Sort DSA tagging protocol drivers
  drivers: net: DSA: Sort drivers

 drivers/net/dsa/Kconfig  | 40 
 drivers/net/dsa/Makefile |  6 +++---
 include/net/dsa.h|  8 
 net/dsa/Kconfig  |  8 
 net/dsa/Makefile |  6 +++---
 net/dsa/dsa.c| 18 +-
 net/dsa/dsa_priv.h   | 18 +-
 7 files changed, 52 insertions(+), 52 deletions(-)

-- 
2.11.0

[PATCH net-next 2/2] drivers: net: DSA: Sort drivers

2017-05-16 Thread Andrew Lunn

With more drivers being added, it is time to sort the drivers to
impose some order.

Signed-off-by: Andrew Lunn 
---
 drivers/net/dsa/Kconfig  | 40 
 drivers/net/dsa/Makefile |  6 +++---
 2 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/drivers/net/dsa/Kconfig b/drivers/net/dsa/Kconfig
index 862ee22303c2..68131a45ac5e 100644
--- a/drivers/net/dsa/Kconfig
+++ b/drivers/net/dsa/Kconfig
@@ -1,13 +1,7 @@
 menu "Distributed Switch Architecture drivers"
depends on HAVE_NET_DSA
 
-config NET_DSA_MV88E6060
-   tristate "Marvell 88E6060 ethernet switch chip support"
-   depends on NET_DSA
-   select NET_DSA_TAG_TRAILER
-   ---help---
- This enables support for the Marvell 88E6060 ethernet switch
- chip.
+source "drivers/net/dsa/b53/Kconfig"
 
 config NET_DSA_BCM_SF2
tristate "Broadcom Starfighter 2 Ethernet switch support"
@@ -21,19 +15,6 @@ config NET_DSA_BCM_SF2
  This enables support for the Broadcom Starfighter 2 Ethernet
  switch chips.
 
-source "drivers/net/dsa/b53/Kconfig"
-
-source "drivers/net/dsa/mv88e6xxx/Kconfig"
-
-config NET_DSA_QCA8K
-   tristate "Qualcomm Atheros QCA8K Ethernet switch family support"
-   depends on NET_DSA
-   select NET_DSA_TAG_QCA
-   select REGMAP
-   ---help---
- This enables support for the Qualcomm Atheros QCA8K Ethernet
- switch chips.
-
 config NET_DSA_LOOP
tristate "DSA mock-up Ethernet switch chip support"
depends on NET_DSA
@@ -50,6 +31,25 @@ config NET_DSA_MT7530
  This enables support for the Mediatek MT7530 Ethernet switch
  chip.
 
+config NET_DSA_MV88E6060
+   tristate "Marvell 88E6060 ethernet switch chip support"
+   depends on NET_DSA
+   select NET_DSA_TAG_TRAILER
+   ---help---
+ This enables support for the Marvell 88E6060 ethernet switch
+ chip.
+
+source "drivers/net/dsa/mv88e6xxx/Kconfig"
+
+config NET_DSA_QCA8K
+   tristate "Qualcomm Atheros QCA8K Ethernet switch family support"
+   depends on NET_DSA
+   select NET_DSA_TAG_QCA
+   select REGMAP
+   ---help---
+ This enables support for the Qualcomm Atheros QCA8K Ethernet
+ switch chips.
+
 config NET_DSA_SMSC_LAN9303
tristate
select NET_DSA_TAG_LAN9303
diff --git a/drivers/net/dsa/Makefile b/drivers/net/dsa/Makefile
index edd630361736..9613f36083a6 100644
--- a/drivers/net/dsa/Makefile
+++ b/drivers/net/dsa/Makefile
@@ -1,11 +1,11 @@
-obj-$(CONFIG_NET_DSA_MV88E6060) += mv88e6060.o
 obj-$(CONFIG_NET_DSA_BCM_SF2)  += bcm-sf2.o
 bcm-sf2-objs   := bcm_sf2.o bcm_sf2_cfp.o
-obj-$(CONFIG_NET_DSA_QCA8K)+= qca8k.o
+obj-$(CONFIG_NET_DSA_LOOP) += dsa_loop.o dsa_loop_bdinfo.o
 obj-$(CONFIG_NET_DSA_MT7530)   += mt7530.o
+obj-$(CONFIG_NET_DSA_MV88E6060) += mv88e6060.o
+obj-$(CONFIG_NET_DSA_QCA8K)+= qca8k.o
 obj-$(CONFIG_NET_DSA_SMSC_LAN9303) += lan9303-core.o
 obj-$(CONFIG_NET_DSA_SMSC_LAN9303_I2C) += lan9303_i2c.o
 obj-$(CONFIG_NET_DSA_SMSC_LAN9303_MDIO) += lan9303_mdio.o
 obj-y  += b53/
 obj-y  += mv88e6xxx/
-obj-$(CONFIG_NET_DSA_LOOP) += dsa_loop.o dsa_loop_bdinfo.o
-- 
2.11.0

[PATCH net 1/2] bnxt_en: Call bnxt_dcb_init() after getting firmware DCBX configuration.

2017-05-16 Thread Michael Chan

In the current code, bnxt_dcb_init() is called too early before we
determine if the firmware DCBX agent is running or not.  As a result,
we are not setting the DCB_CAP_DCBX_HOST and DCB_CAP_DCBX_LLD_MANAGED
flags properly to report to DCBNL.

Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index b56c54d..03f55da 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -7630,8 +7630,6 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
dev->min_mtu = ETH_ZLEN;
dev->max_mtu = BNXT_MAX_MTU;
 
-   bnxt_dcb_init(bp);
-
 #ifdef CONFIG_BNXT_SRIOV
init_waitqueue_head(>sriov_cfg_wait);
 #endif
@@ -7669,6 +7667,7 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
bnxt_hwrm_func_qcfg(bp);
bnxt_hwrm_port_led_qcaps(bp);
bnxt_ethtool_init(bp);
+   bnxt_dcb_init(bp);
 
bnxt_set_rx_skb_mode(bp, false);
bnxt_set_tpa_flags(bp);
-- 
1.8.3.1

[PATCH net 2/2] bnxt_en: Check status of firmware DCBX agent before setting DCB_CAP_DCBX_HOST.

2017-05-16 Thread Michael Chan

Otherwise, all the host based DCBX settings from lldpad will fail if the
firmware DCBX agent is running.

Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_dcb.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_dcb.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_dcb.c
index 46de2f8..5c6dd0c 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_dcb.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_dcb.c
@@ -553,8 +553,10 @@ static u8 bnxt_dcbnl_setdcbx(struct net_device *dev, u8 mode)
if ((mode & DCB_CAP_DCBX_VER_CEE) || !(mode & DCB_CAP_DCBX_VER_IEEE))
return 1;
 
-   if ((mode & DCB_CAP_DCBX_HOST) && BNXT_VF(bp))
-   return 1;
+   if (mode & DCB_CAP_DCBX_HOST) {
+   if (BNXT_VF(bp) || (bp->flags & BNXT_FLAG_FW_LLDP_AGENT))
+   return 1;
+   }
 
if (mode == bp->dcbx_cap)
return 0;
-- 
1.8.3.1
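
The check added by this patch can be modelled outside the driver. The following is a minimal sketch, not the driver code: the function name, its plain boolean parameters, and the three `CAP_*` macros are stand-ins made up for illustration (the real bits come from the kernel's dcbnl header, and the real code reads `BNXT_VF(bp)` and `bp->flags`). It only shows which mode requests are rejected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the DCBX capability bits (assumed values, for
 * illustration only; the driver uses the DCB_CAP_DCBX_* definitions). */
#define CAP_DCBX_HOST     0x01
#define CAP_DCBX_VER_CEE  0x04
#define CAP_DCBX_VER_IEEE 0x08

/*
 * Hypothetical model of the acceptance test in bnxt_dcbnl_setdcbx().
 * Returns 1 when the requested mode is rejected and 0 when it is accepted,
 * matching the return convention visible in the patch.
 */
static uint8_t setdcbx_check(uint8_t mode, bool is_vf, bool fw_lldp_agent)
{
	/* Only IEEE DCBX is accepted; CEE is refused. */
	if ((mode & CAP_DCBX_VER_CEE) || !(mode & CAP_DCBX_VER_IEEE))
		return 1;

	/* Host-managed DCBX is refused on a VF or while the firmware
	 * agent is running; the second condition is what the patch adds. */
	if (mode & CAP_DCBX_HOST) {
		if (is_vf || fw_lldp_agent)
			return 1;
	}

	return 0;
}
```

With this model, a request for host-managed IEEE DCBX succeeds on a PF without a firmware agent and fails once the firmware agent flag is set, which is the lldpad failure case the commit message describes.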

[PATCH net 0/2] bnxt_en: DCBX fixes.

2017-05-16 Thread Michael Chan

2 bug fixes for the case where the NIC's firmware DCBX agent is enabled.
With these fixes, we will return the proper information to lldpad.

Michael Chan (2):
  bnxt_en: Call bnxt_dcb_init() after getting firmware DCBX
configuration.
  bnxt_en: Check status of firmware DCBX agent before setting
DCB_CAP_DCBX_HOST.

 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 +--
 drivers/net/ethernet/broadcom/bnxt/bnxt_dcb.c | 6 --
 2 files changed, 5 insertions(+), 4 deletions(-)

-- 
1.8.3.1

[PATCH net] net: fix compile error in skb_orphan_partial()

2017-05-16 Thread Eric Dumazet

From: Eric Dumazet 

If CONFIG_INET is not set, net/core/sock.c cannot be compiled:

net/core/sock.c: In function ‘skb_orphan_partial’:
net/core/sock.c:1810:2: error: implicit declaration of function
‘skb_is_tcp_pure_ack’ [-Werror=implicit-function-declaration]
  if (skb_is_tcp_pure_ack(skb))
  ^

Fix this by always including 

Fixes: f6ba8d33cfbb ("netem: fix skb_orphan_partial()")
Signed-off-by: Eric Dumazet 
Reported-by: Paul Gortmaker 
Reported-by: Randy Dunlap 
Reported-by: Stephen Rothwell 
---
 net/core/sock.c |3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index e43e71d7856b385111cd4c4b1bd835a78c670c60..727f924b7f91f495d9e7a4e7297c9c937d3258ed 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -139,10 +139,7 @@
 
 #include 
 
-#ifdef CONFIG_INET
 #include 
-#endif
-
 #include 
 
 static DEFINE_MUTEX(proto_list_mutex);

Re: [patch net-next v3 01/10] net: sched: move tc_classify function to cls_api.c

2017-05-16 Thread Cong Wang

On Tue, May 16, 2017 at 10:27 AM, Jiri Pirko  wrote:
> From: Jiri Pirko 
>
> Move tc_classify function to cls_api.c where it belongs, rename it to
> fit the namespace.
>

It is not a pure move; you silently remove the CONFIG_NET_CLS_ACT
macros in tc_classify(). Probably not buggy, just redundant code when
actions are not compiled in.

Re: switchdev offload & ecmp

2017-05-16 Thread Nicolas Dichtel

Le 16/05/2017 à 16:11, Ido Schimmel a écrit :
> On Tue, May 16, 2017 at 02:57:47PM +0200, Nicolas Dichtel wrote:
>>>> I suspect that there can be scenarii where some packets of a flow are forwarded
>>>> by the driver and some other are forwarded by the kernel.
>>>
>>> Can you elaborate? The kernel only sees specific packets, which were
>>> trapped to the CPU. See:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/tree/drivers/net/ethernet/mellanox/mlxsw/spectrum.c#n2996
>> Ok, this part was not clear for me, thank you for the pointer.
>>
>> So, when an arp resolution is needed, the packets are not trapped to the CPU,
>> the device manages the queue itself?
> 
> There are two cases here. If you need an ARP resolution following a hit
> of a directly connected route and this neighbour isn't in the device's
> table, then packet is trapped (HOST_MISS_IPV4 in above list) to the CPU
> and triggers ARP resolution in the kernel. Eventually a NETEVENT will be
> sent and the neighbour will be programmed to the device.
> 
> If you need an ARP resolution of a nexthop, then this is a bit
> different. If you have an ECMP group with several nexthops, then once
> one of them is resolved, packets will be forwarded using it. To make
> sure other nexthops will also be resolved we try to periodically refresh
> them. Otherwise packets will always be forwarded using a single nexthop,
> as the kernel won't have motivation to resolve the others.
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c#n987
> 
> In case no nexthops can be resolved, then packets will be trapped to the
> CPU (RTR_INGRESS0 in above list) and forwarded by the kernel.
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c#n1896
> 
Ok, thank you for the details.

Regards,
Nicolas
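
The two cases Ido describes can be summarised as a small decision table. This is a sketch of the explanation in the thread only, not of the mlxsw driver: the enum, the function names, and their parameters are hypothetical, and the trap names in the comments are just the ones mentioned in the mail.

```c
#include <assert.h>

/* Where a packet ends up, according to the explanation quoted above. */
enum fwd_path {
	FWD_BY_DEVICE,		/* forwarded in hardware */
	TRAP_HOST_MISS,		/* trapped to the CPU (HOST_MISS_IPV4 in the mail) */
	TRAP_RTR_INGRESS	/* trapped to the CPU (RTR_INGRESS0 in the mail) */
};

/* Directly connected route: the packet is trapped when the neighbour is not
 * yet in the device's table, which triggers ARP resolution in the kernel. */
static enum fwd_path connected_route_path(int neigh_in_device_table)
{
	return neigh_in_device_table ? FWD_BY_DEVICE : TRAP_HOST_MISS;
}

/* Route via an ECMP group: one resolved nexthop is enough for the device to
 * forward; only when none is resolved does the kernel forward the packet. */
static enum fwd_path ecmp_route_path(unsigned int resolved_nexthops)
{
	return resolved_nexthops > 0 ? FWD_BY_DEVICE : TRAP_RTR_INGRESS;
}
```

The sketch leaves out the periodic refresh of unresolved nexthops, which per the mail is what keeps traffic from staying on a single nexthop.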

Re: linux-next: Tree for May 16 (net/core)

2017-05-16 Thread Eric Dumazet

On Tue, May 16, 2017 at 12:44 PM, Paul Gortmaker
 wrote:
> On Tue, May 16, 2017 at 12:28 PM, Randy Dunlap  wrote:
>> On 05/15/17 18:21, Stephen Rothwell wrote:
>>> Hi all,
>>>
>>> Changes since 20170515:
>>>
>>
>> on i386 or x86_64:
>>
>> when CONFIG_INET is not enabled:
>>
>> ../net/core/sock.c: In function 'skb_orphan_partial':
>> ../net/core/sock.c:1810:2: error: implicit declaration of function 'skb_is_tcp_pure_ack' [-Werror=implicit-function-declaration]
>>   if (skb_is_tcp_pure_ack(skb))
>
> Automated bisect on an ARM build with the same issue reveals:
>
> f6ba8d33cfbb46df569972e64dbb5bb7e929bfd9 is the first bad commit
> commit f6ba8d33cfbb46df569972e64dbb5bb7e929bfd9
> Author: Eric Dumazet 
> Date:   Thu May 11 15:24:41 2017 -0700
>
> netem: fix skb_orphan_partial()
>
> I should have known that lowering skb->truesize was dangerous :/
>
> In case packets are not leaving the host via a standard Ethernet device,
> but looped back to local sockets, bad things can happen, as reported
> by Michael Madsen ( https://bugzilla.kernel.org/show_bug.cgi?id=195713 )
>
> So instead of tweaking skb->truesize, lets change skb->destructor
> and keep a reference on the owner socket via its sk_refcnt.
>
> Fixes: f2f872f9272a ("netem: Introduce skb_orphan_partial() helper")
> Signed-off-by: Eric Dumazet 
> Reported-by: Michael Madsen 
> Signed-off-by: David S. Miller 
>
> :04 04 7bfb7a6f5e12373b1c50ede2455b6ddd6d79cee0
> b45b7255322f1dff5e3ab8d3d707cf38a91c76ce M  net
> bisect run success
>
> http://kisskb.ellerman.id.au/kisskb/buildresult/13033081/
>
> I'm guessing Eric already knows about this but I've Cc'd him just in case.

I was not aware of this, I will submit a fix, thanks.

Re: [PATCH net-next] geneve: add rtnl changelink support

2017-05-16 Thread Girish Moodalbail


On 5/16/17 12:31 PM, David Miller wrote:

> From: Girish Moodalbail 
> Date: Mon, 15 May 2017 10:47:04 -0700
>
>>   if (data[IFLA_GENEVE_REMOTE]) {
>> - info.key.u.ipv4.dst =
>> + info->key.u.ipv4.dst =
>>   nla_get_in_addr(data[IFLA_GENEVE_REMOTE]);
>>  
>> - if (IN_MULTICAST(ntohl(info.key.u.ipv4.dst))) {
>> + if (IN_MULTICAST(ntohl(info->key.u.ipv4.dst))) {
>>   netdev_dbg(dev, "multicast remote is unsupported\n");
>>   return -EINVAL;
>>   }
>> + if (changelink &&
>> + ip_tunnel_info_af(>info) == AF_INET6) {
>> + info->mode &= ~IP_TUNNEL_INFO_IPV6;
>> + info->key.tun_flags &= ~TUNNEL_CSUM;
>> + *use_udp6_rx_checksums = false;
>> + }
>>   }
>
> I don't understand this "changelink" guarded code. Why do you need to
> clear all of this state out if the existing tunnel type is AF_INET6,
> and only when doing a changelink?
>
> In any event, I think you need to add a comment explaining it.



If the geneve link was overlaid on an IPv6 network and the user now modifies
the link to run over an IPv4 network by doing

# ip link set gen0 type geneve id 100 remote 192.168.13.2

then we will need to

 - reset info->mode to be not IPv6 type,
 - reset the checksum flag, since the default for UDP checksum over IPv4 is 'no', and
 - set use_udp6_rx_checksums to its default value, which is false.

I will capture the above information concisely in a comment around that
'changelink' guard.


thanks,
~Girish
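
The three resets listed above can be sketched in isolation. This is a simplified model under stated assumptions, not the geneve driver: `struct tnl_cfg`, the two `TNL_*` macros, and both helper functions are hypothetical stand-ins for the tunnel info fields and flags named in the patch, and the address-family test is reduced to checking the mode bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TNL_INFO_IPV6 0x04	/* stand-in for IP_TUNNEL_INFO_IPV6 (assumed value) */
#define TNL_CSUM      0x01	/* stand-in for TUNNEL_CSUM (assumed value) */

/* Hypothetical, reduced view of the state the patch touches. */
struct tnl_cfg {
	uint8_t  mode;
	uint16_t tun_flags;
	bool     use_udp6_rx_checksums;
};

/* A link that currently runs over IPv6, as left behind by the REMOTE6 branch. */
static struct tnl_cfg ipv6_link(void)
{
	struct tnl_cfg cfg = { TNL_INFO_IPV6, TNL_CSUM, true };

	return cfg;
}

/* Model of setting an IPv4 remote: on changelink, IPv6-only state inherited
 * from the existing link is cleared so the IPv4 defaults apply again. */
static struct tnl_cfg apply_ipv4_remote(struct tnl_cfg cfg, bool changelink)
{
	if (changelink && (cfg.mode & TNL_INFO_IPV6)) {
		cfg.mode &= (uint8_t)~TNL_INFO_IPV6;	  /* no longer an IPv6 tunnel */
		cfg.tun_flags &= (uint16_t)~TNL_CSUM;	  /* UDP checksum defaults to off on IPv4 */
		cfg.use_udp6_rx_checksums = false;	  /* back to its default */
	}
	return cfg;
}
```

On a fresh newlink the info starts from defaults, so there is no stale IPv6 state to clear; that is why the reset only matters in the changelink case.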

Re: [PATCH net-next] geneve: add rtnl changelink support

2017-05-16 Thread Pravin Shelar

On Mon, May 15, 2017 at 10:47 AM, Girish Moodalbail
 wrote:
> This patch adds changelink rtnl operation support for geneve devices.
> Code changes involve:
>   - refactor geneve_newlink into geneve_nl2info to be used by both
> geneve_newlink and geneve_changelink
>   - geneve_nl2info takes a changelink boolean argument to isolate
> changelink checks and updates.
>   - Allow changing only a few attributes:
> - return -EOPNOTSUPP for attributes that cannot be changed for
>   now. Incremental patches can make the non-supported one
>   available in the future if needed.
>
Thanks for working on this.

> Signed-off-by: Girish Moodalbail 
> ---
>  drivers/net/geneve.c | 149 
> ---
>  1 file changed, 117 insertions(+), 32 deletions(-)
>
...
> @@ -1169,45 +1181,58 @@ static void init_tnl_info(struct ip_tunnel_info *info, __u16 dst_port)
> info->key.tp_dst = htons(dst_port);
>  }
>
> -static int geneve_newlink(struct net *net, struct net_device *dev,
> - struct nlattr *tb[], struct nlattr *data[])
> +static int geneve_nl2info(struct net_device *dev, struct nlattr *tb[],
> + struct nlattr *data[], struct ip_tunnel_info *info,
> + bool *metadata, bool *use_udp6_rx_checksums,
> + bool changelink)
>  {
> -   bool use_udp6_rx_checksums = false;
> -   struct ip_tunnel_info info;
> -   bool metadata = false;
> +   struct geneve_dev *geneve = netdev_priv(dev);
>
> -   init_tnl_info(, GENEVE_UDP_PORT);
> +   if (changelink) {
> +   /* if changelink operation, start with old existing info */
> +   memcpy(info, >info, sizeof(*info));
> +   *metadata = geneve->collect_md;
> +   *use_udp6_rx_checksums = geneve->use_udp6_rx_checksums;
> +   } else {
> +   init_tnl_info(info, GENEVE_UDP_PORT);
> +   }
>
> if (data[IFLA_GENEVE_REMOTE] && data[IFLA_GENEVE_REMOTE6])
> return -EINVAL;
>
> if (data[IFLA_GENEVE_REMOTE]) {
> -   info.key.u.ipv4.dst =
> +   info->key.u.ipv4.dst =
> nla_get_in_addr(data[IFLA_GENEVE_REMOTE]);
>
> -   if (IN_MULTICAST(ntohl(info.key.u.ipv4.dst))) {
> +   if (IN_MULTICAST(ntohl(info->key.u.ipv4.dst))) {
> netdev_dbg(dev, "multicast remote is unsupported\n");
> return -EINVAL;
> }
> +   if (changelink &&
> +   ip_tunnel_info_af(>info) == AF_INET6) {
> +   info->mode &= ~IP_TUNNEL_INFO_IPV6;
> +   info->key.tun_flags &= ~TUNNEL_CSUM;
> +   *use_udp6_rx_checksums = false;
> +   }
This allows changelink to change the IPv4 address, but no changes are
made to the geneve tunnel port hash table after this update. We also
need to check whether there are any conflicts with existing ports.

What is the barrier between the rx/tx threads and changelink process?

> }
>
> if (data[IFLA_GENEVE_REMOTE6]) {
>   #if IS_ENABLED(CONFIG_IPV6)
> -   info.mode = IP_TUNNEL_INFO_IPV6;
> -   info.key.u.ipv6.dst =
> +   info->mode = IP_TUNNEL_INFO_IPV6;
> +   info->key.u.ipv6.dst =
> nla_get_in6_addr(data[IFLA_GENEVE_REMOTE6]);
>
> -   if (ipv6_addr_type() &
> +   if (ipv6_addr_type(>key.u.ipv6.dst) &
> IPV6_ADDR_LINKLOCAL) {
> netdev_dbg(dev, "link-local remote is unsupported\n");
> return -EINVAL;
> }
> -   if (ipv6_addr_is_multicast()) {
> +   if (ipv6_addr_is_multicast(>key.u.ipv6.dst)) {
> netdev_dbg(dev, "multicast remote is unsupported\n");
> return -EINVAL;
> }
> -   info.key.tun_flags |= TUNNEL_CSUM;
> -   use_udp6_rx_checksums = true;
> +   info->key.tun_flags |= TUNNEL_CSUM;
> +   *use_udp6_rx_checksums = true;
Same here. We need to check/fix the geneve tunnel hash table according
to the new IP address.

>  #else
> return -EPFNOSUPPORT;
>  #endif
> @@ -1216,48 +1241,107 @@ static int geneve_newlink(struct net *net, struct net_device *dev,
...
>
> -   if (data[IFLA_GENEVE_PORT])
> -   info.key.tp_dst = nla_get_be16(data[IFLA_GENEVE_PORT]);
> +   if (data[IFLA_GENEVE_PORT]) {
> +   if (changelink)
> +   return -EOPNOTSUPP;
> +   info->key.tp_dst = nla_get_be16(data[IFLA_GENEVE_PORT]);
> +   }
> +
> +   if (data[IFLA_GENEVE_COLLECT_METADATA]) {
> +   if (changelink)
> +   return -EOPNOTSUPP;
Rather

Re: [PATCH] liquidio: use pcie_flr instead of duplicating it

2017-05-16 Thread David Miller

From: Christoph Hellwig 
Date: Tue, 16 May 2017 16:21:46 +0200

> Signed-off-by: Christoph Hellwig 
> Tested-by: Felix Manlunas 

Applied to net-next, thanks.

Re: [PATCH] net: phy: Remove residual magic from PHY drivers

2017-05-16 Thread David Miller

From: Andrew Lunn 
Date: Tue, 16 May 2017 18:29:11 +0200

> commit fa8cddaf903c ("net phylib: Remove unnecessary condition check in phy")
> removed the only place where the PHY flag PHY_HAS_MAGICANEG was
> checked. But it left the flag being set in the drivers. Remove the flag.
> 
> Signed-off-by: Andrew Lunn 

Applied to net-next, thanks.

Re: [PATCH 1/1] dt-binding: net: wireless: fix node name in the BCM43xx example

2017-05-16 Thread Martin Blumenstingl

Hi Arend,

On Tue, May 16, 2017 at 12:05 AM, Arend Van Spriel
 wrote:
> On 15-5-2017 22:13, Martin Blumenstingl wrote:
>> The example in the BCM43xx documentation uses "brcmf" as node name.
>> However, wireless devices should be named "wifi" instead. Fix this to
>
> Hi Martin,
>
> Since when is that a rule. I never got the memo and the DTC did not ever
> complain to me about the naming. That being said I do not really care
> and I suppose it is for the sake of consistency only.
I'm not sure if it's actually a rule or (as you already noted) just
for consistency. Back when I added devicetree support to ath9k, Rob
pointed out that the node should be named "wifi" (instead of "ath9k"),
see [0].

>> make sure that .dts authors can simply use the documentation as
>> reference (or simply copy the node from the documentation and then
>> adjust only the board specific bits).
>
> Please feel free to add my...
>
> Acked-by: Arend van Spriel 
thank you!

@Rob: maybe you can ACK this as well if you're fine with this patch?

>> Signed-off-by: Martin Blumenstingl 
>> ---
>>  Documentation/devicetree/bindings/net/wireless/brcm,bcm43xx-fmac.txt | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/Documentation/devicetree/bindings/net/wireless/brcm,bcm43xx-fmac.txt b/Documentation/devicetree/bindings/net/wireless/brcm,bcm43xx-fmac.txt
>> index 5dbf169cd81c..590f622188de 100644
>> --- a/Documentation/devicetree/bindings/net/wireless/brcm,bcm43xx-fmac.txt
>> +++ b/Documentation/devicetree/bindings/net/wireless/brcm,bcm43xx-fmac.txt
>> @@ -31,7 +31,7 @@ mmc3: mmc@01c12000 {
>>   non-removable;
>>   status = "okay";
>>
>> - brcmf: bcrmf@1 {
>> + brcmf: wifi@1 {
>>   reg = <1>;
>>   compatible = "brcm,bcm4329-fmac";
>>   interrupt-parent = <>;
>>

[0] http://www.mail-archive.com/ath9k-devel@lists.ath9k.org/msg14678.html

Re: [PATCH 4.4-only] openvswitch: clear sender cpu before forwarding packets

2017-05-16 Thread Joe Stringer

On 16 May 2017 at 07:25, Anoob Soman  wrote:
> Similar to commit c29390c6dfee ("xps: must clear sender_cpu before
> forwarding") the skb->sender_cpu needs to be cleared before forwarding
> packets.
>
> Fixes: 2bd82484bb4c ("xps: fix xps for stacked devices")
> Signed-off-by: Anoob Soman 

Is this needed for 4.1 too?

Re: [PATCH net-next] cxgb4: update latest firmware version supported

2017-05-16 Thread David Miller

From: Ganesh Goudar 
Date: Tue, 16 May 2017 20:56:52 +0530

> Change t4fw_version.h to update latest firmware version
> number to 1.16.43.0.
> 
> Signed-off-by: Ganesh Goudar 

People are hitting regressions in 'net' due to using firmware allowed
by the current defines in combination with the FEC disabling commit.

So it doesn't make sense to "fix" this in 'net-next'.

Re: [PATCH v2 1/3] bpf: Use 1<<16 as ceiling for immediate alignment in verifier.

2017-05-16 Thread David Miller

From: Edward Cree 
Date: Tue, 16 May 2017 13:37:42 +0100

> On 15/05/17 17:04, David Miller wrote:
>> If we use 1<<31, then sequences like:
>>
>>  R1 = 0
>>  R1 <<= 2
>>
>> do silly things.
> Hmm.  It might be a bit late for this, but I wonder if, instead of handling
>  alignments as (1 << align), you could store them as -(1 << align), i.e.
>  leading 1s followed by 'align' 0s.
> Now the alignment of 0 is 0 (really 1 << 32), which doesn't change when
>  left-shifted some more.  Shifts of other numbers' alignments also do the
>  right thing, e.g. align(6) << 2 = (-2) << 2 = -8 = align(6 << 2).  Of
>  course you do all this in unsigned, to make sure right shifts work.
> This also makes other arithmetic simple to track; for instance, align(a + b)
>  is at worst align(a) | align(b).  (Of course, this bound isn't tight.)
> A number is 2^(n+1)-aligned if the 2^n bit of its alignment is cleared.
> Considered as unsigned numbers, smaller values are stricter alignments.

Thanks for the bit twiddling suggestion, I'll take a look!
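
Edward's encoding is easy to try out in userspace. The sketch below is only an illustration of the arithmetic in his mail, not verifier code: the three helper names are hypothetical, and the encoding is computed from a concrete value via its lowest set bit, whereas the verifier would track it per register.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: encode the alignment of n as -(lowest set bit of n),
 * i.e. leading 1s followed by 'align' 0s. For n == 0 the result is 0, which
 * the mail reads as "really 1 << 32". */
static uint32_t align_enc(uint32_t n)
{
	uint32_t lowbit = n & (uint32_t)(0u - n);	/* 0 when n == 0 */

	return (uint32_t)(0u - lowbit);			/* unsigned negation */
}

/* Conservative (not tight) bound on the encoded alignment of a + b. */
static uint32_t align_add_bound(uint32_t a_enc, uint32_t b_enc)
{
	return a_enc | b_enc;
}

/* Nonzero if the encoding guarantees 2^(n+1)-alignment, i.e. bit n (the
 * 2^n bit) of the encoding is cleared. n must be below 32. */
static int is_pow2_aligned(uint32_t enc, unsigned int n)
{
	return !(enc & (UINT32_C(1) << n));
}
```

Because smaller unsigned values mean stricter alignment, the encoded alignment of a sum never exceeds the OR of the operands' encodings, and the encoding of 0 stays 0 under any left shift, which avoids the `R1 = 0; R1 <<= 2` problem quoted above.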

Re: [PATCH net-next] bnx2x: Remove open coded carrier check

2017-05-16 Thread David Miller

From: Leon Romanovsky 
Date: Tue, 16 May 2017 15:20:56 +0300

> From: Leon Romanovsky 
> 
> There is inline function to test if carrier present,
> so it makes open-coded solution redundant.
> 
> Signed-off-by: Leon Romanovsky 

Applied.

Re: [PATCH] [net, 4.12] mlx5e: add CONFIG_INET dependency

2017-05-16 Thread David Miller

From: Arnd Bergmann 
Date: Tue, 16 May 2017 13:27:49 +0200

> We now reference the arp_tbl, which requires IPv4 support to be
> enabled in the kernel, otherwise we get a link error:
> 
> drivers/net/built-in.o: In function `mlx5e_tc_update_neigh_used_value':
> (.text+0x16afec): undefined reference to `arp_tbl'
> drivers/net/built-in.o: In function `mlx5e_rep_neigh_init':
> en_rep.c:(.text+0x16c16d): undefined reference to `arp_tbl'
> drivers/net/built-in.o: In function `mlx5e_rep_netevent_event':
> en_rep.c:(.text+0x16cbb5): undefined reference to `arp_tbl'
> 
> This adds a Kconfig dependency for it.
> 
> Fixes: 232c001398ae ("net/mlx5e: Add support to neighbour update flow")
> Signed-off-by: Arnd Bergmann 

Applied, thanks.

Re: [patch iproute2 v2 repost 1/3] tc_filter: add support for chain index

2017-05-16 Thread Jiri Pirko

Tue, May 16, 2017 at 08:16:58PM CEST, step...@networkplumber.org wrote:
>On Tue, 16 May 2017 19:29:35 +0200
>Jiri Pirko  wrote:
>
>> From: Jiri Pirko 
>> 
>> Allow user to put filter to a specific chain identified by index.
>> 
>> Signed-off-by: Jiri Pirko 
>
>This will have to wait for the chain bits to show up upstream in net-next.
>

Sure. I just like to send the kernel patches alongside the related
iproute2 patches. Thanks.

Re: [PATCH v2 net-next] tcp: internal implementation for pacing

2017-05-16 Thread David Miller

From: Eric Dumazet 
Date: Tue, 16 May 2017 04:24:36 -0700

> BBR congestion control depends on pacing, and pacing is
> currently handled by sch_fq packet scheduler for performance reasons,
> and also because implementing pacing with FQ was convenient to truly
> avoid bursts.
> 
> However there are many cases where this packet scheduler constraint
> is not practical.
> - Many linux hosts are not focusing on handling thousands of TCP
>   flows in the most efficient way.
> - Some routers use fq_codel or other AQM, but still would like
>   to use BBR for the few TCP flows they initiate/terminate.
> 
> This patch implements an automatic fallback to internal pacing.
 ...

Looks great, applied, thanks!

Re: linux-next: Tree for May 16 (net/core)

2017-05-16 Thread Paul Gortmaker

On Tue, May 16, 2017 at 12:28 PM, Randy Dunlap  wrote:
> On 05/15/17 18:21, Stephen Rothwell wrote:
>> Hi all,
>>
>> Changes since 20170515:
>>
>
> on i386 or x86_64:
>
> when CONFIG_INET is not enabled:
>
> ../net/core/sock.c: In function 'skb_orphan_partial':
> ../net/core/sock.c:1810:2: error: implicit declaration of function 'skb_is_tcp_pure_ack' [-Werror=implicit-function-declaration]
>   if (skb_is_tcp_pure_ack(skb))

Automated bisect on an ARM build with the same issue reveals:

f6ba8d33cfbb46df569972e64dbb5bb7e929bfd9 is the first bad commit
commit f6ba8d33cfbb46df569972e64dbb5bb7e929bfd9
Author: Eric Dumazet 
Date:   Thu May 11 15:24:41 2017 -0700

netem: fix skb_orphan_partial()

I should have known that lowering skb->truesize was dangerous :/

In case packets are not leaving the host via a standard Ethernet device,
but looped back to local sockets, bad things can happen, as reported
by Michael Madsen ( https://bugzilla.kernel.org/show_bug.cgi?id=195713 )

So instead of tweaking skb->truesize, lets change skb->destructor
and keep a reference on the owner socket via its sk_refcnt.

Fixes: f2f872f9272a ("netem: Introduce skb_orphan_partial() helper")
Signed-off-by: Eric Dumazet 
Reported-by: Michael Madsen 
Signed-off-by: David S. Miller 

:04 04 7bfb7a6f5e12373b1c50ede2455b6ddd6d79cee0
b45b7255322f1dff5e3ab8d3d707cf38a91c76ce M  net
bisect run success

http://kisskb.ellerman.id.au/kisskb/buildresult/13033081/

I'm guessing Eric already knows about this but I've Cc'd him just in case.

P.
--

>
>
> --
> ~Randy

Re: [PATCH net-next v2 0/3] udp: scalability improvements

2017-05-16 Thread David Miller

From: Paolo Abeni 
Date: Tue, 16 May 2017 11:20:12 +0200

> This patch series implement an idea suggested by Eric Dumazet to
> reduce the contention of the udp sk_receive_queue lock when the socket is
> under flood.

Series applied, thanks a lot.

Re: [PATCH net-next] geneve: add rtnl changelink support

2017-05-16 Thread David Miller

From: Girish Moodalbail 
Date: Mon, 15 May 2017 10:47:04 -0700

>   if (data[IFLA_GENEVE_REMOTE]) {
> - info.key.u.ipv4.dst =
> + info->key.u.ipv4.dst =
>   nla_get_in_addr(data[IFLA_GENEVE_REMOTE]);
>  
> - if (IN_MULTICAST(ntohl(info.key.u.ipv4.dst))) {
> + if (IN_MULTICAST(ntohl(info->key.u.ipv4.dst))) {
>   netdev_dbg(dev, "multicast remote is unsupported\n");
>   return -EINVAL;
>   }
> + if (changelink &&
> + ip_tunnel_info_af(>info) == AF_INET6) {
> + info->mode &= ~IP_TUNNEL_INFO_IPV6;
> + info->key.tun_flags &= ~TUNNEL_CSUM;
> + *use_udp6_rx_checksums = false;
> + }
>   }

I don't understand this "changelink" guarded code. Why do you need to
clear all of this state out if the existing tunnel type is AF_INET6,
and only when doing a changelink?

In any event, I think you need to add a comment explaining it.

Re: [PATCH] net/smc: mark as BROKEN due to remote memory exposure

2017-05-16 Thread Doug Ledford

On Tue, 2017-05-16 at 14:52 -0400, David Miller wrote:
> From: Doug Ledford 
> Date: Tue, 16 May 2017 14:03:22 -0400
> 
> > On Tue, 2017-05-16 at 13:36 -0400, David Miller wrote:
> >> From: Doug Ledford 
> >> Date: Tue, 16 May 2017 13:20:44 -0400
> >> 
> >> > Anyway, we're just talking out what happened, when what we really need
> >> > to focus on is moving forward.  Again, your thoughts on marking SMC
> >> > EXPERIMENTAL until it's fixed up and unfreezing the API in case we need
> >> > to adjust it to work on different link layers?
> >> 
> >> Something like:
> >> 
> >> http://patchwork.ozlabs.org/patch/762803/
> >> 
> >> with the addition of the EXPERIMENTAL dependency?
> >> 
> >> Sure.
> > 
> > Perfect.  I assume you'll submit it since it's in your patchworks?
> 
> Ok I applied the patch referenced above, but we don't actually have
> an EXPERIMENTAL symbol.  The closest thing we have is BROKEN and
> even in this situation that's a bit harsh.

I hadn't realized EXPERIMENTAL was gone.  Which is too bad, because
that's entirely appropriate in this case, and would have had the
desired side effect of keeping it out of any non-cutting edge distros
and warning people of possible API changes.  With EXPERIMENTAL gone,
the closest thing we have is drivers/staging, since that tends to imply
some of the same consequences.  I know you think BROKEN is overly
harsh, but I'm not sure we should just do nothing.  How about we take a
few days to let some of the RDMA people closely review the 143 page
(egads!) rfc (http://www.rfc-editor.org/info/rfc7609) to see if we
think it can be fixed to use multiple link layers with the existing API
in place or if it will require something other than AF_SMC.  If we need
to break API, then I think we should either fix it ASAP and send that
fix to the 4.11 stable series (which probably violates the normative
stable patch size/scope) or if the fix will take longer than this
kernel cycle, then move it to staging both here and in 4.11 stable, and
fix it there and then move it back.  Something like that would prevent
the kind of API flappage we ought not to do.

-- 
Doug Ledford 
    GPG KeyID: B826A3330E572FDD

Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

Re: [PATCH net] selftests/bpf: fix broken build due to types.h

2017-05-16 Thread David Miller


Please correct the address of the netdev list (it is just plain
'netdev' not 'linux-netdev').

Secondly, __always_inline should not be defined by types.h

That has to come from linux/compiler.h which we have no reason
to define a private version of for eBPF clang compilation.

The problem is that via several layers of indirection, linux/types.h
eventually includes linux/compiler.h and that is probably the more
appropriate thing for you to do.

Re: [PATCH 2/3] bpf: Track alignment of MAP pointers in verifier.

2017-05-16 Thread David Miller

From: Daniel Borkmann 
Date: Mon, 15 May 2017 23:55:47 +0200

> I'm actually wondering about the min_align/aux_off/aux_off_align and
> given this is not really related to varlen_map_access and we currently
> just skip this.
> 
> We should make sure that when env->strict_alignment is false that we
> ignore any difference in min_align/aux_off/aux_off_align, afaik, the
> min_align would also be set on regs other than ptr_to_pkt.

Ok I see what you are saying, alignment related register state has to
be taken into consideration during pruning but only when
env->strict_alignment is true.

->min_align is set on any register upon which a calculation is
performed.

> What about compare_ptrs_to_packet() for when env->strict_alignment is
> true in ptr_to_pkt case?

Yes we need to do something there, and yes we do need testcases.

You also remind me that I was thinking about whether we should
propagate alignment state through branches.  For example, on
the taken path of a JEQ we can set both arms of the test to
have the larger of the two arms' alignments.
