Re: Doing crypto in small stack buffers (bluetooth vs vmalloc-stack crash, etc)

2016-06-22 Thread Herbert Xu
On Thu, Jun 23, 2016 at 11:48:25AM +0800, Herbert Xu wrote:
> 
> No we never had such an API in the kernel.  However, I see that
> rxkad does some pretty silly things and we should be able to avoid
> using the stack in pretty much all cases.  Let me try to come up with
> something.

Here it is:

---8<---
Subject: rxrpc: Avoid using stack memory in SG lists in rxkad

rxkad uses stack memory in SG lists which would not work if stacks
were allocated from vmalloc memory.  In fact, in most cases this
isn't even necessary as the stack memory ends up getting copied
over to kmalloc memory.

This patch eliminates all the unnecessary stack memory uses by
supplying the final destination directly to the crypto API.  In the
two instances where a temporary buffer is actually needed, we switch
to using the skb->cb area instead of the stack.

Finally, there is no need to split a split-page buffer into two SG
entries, so the code dealing with that has been removed.

Signed-off-by: Herbert Xu 

diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index f0b807a..8ee5933 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -277,6 +277,7 @@ struct rxrpc_connection {
struct key  *key;   /* security for this connection (client) */
struct key  *server_key;/* security for this service */
struct crypto_skcipher  *cipher;/* encryption handle */
+   struct rxrpc_crypt  csum_iv_head;   /* leading block for csum_iv */
struct rxrpc_crypt  csum_iv;/* packet checksum base */
unsigned long   events;
 #define RXRPC_CONN_CHALLENGE   0   /* send challenge packet */
diff --git a/net/rxrpc/rxkad.c b/net/rxrpc/rxkad.c
index 6b726a0..ee142de 100644
--- a/net/rxrpc/rxkad.c
+++ b/net/rxrpc/rxkad.c
@@ -105,11 +105,9 @@ static void rxkad_prime_packet_security(struct rxrpc_connection *conn)
 {
struct rxrpc_key_token *token;
SKCIPHER_REQUEST_ON_STACK(req, conn->cipher);
-   struct scatterlist sg[2];
+   struct rxrpc_crypt *csum_iv;
+   struct scatterlist sg;
struct rxrpc_crypt iv;
-   struct {
-   __be32 x[4];
-   } tmpbuf __attribute__((aligned(16))); /* must all be in same page */
 
_enter("");
 
@@ -119,24 +117,21 @@ static void rxkad_prime_packet_security(struct rxrpc_connection *conn)
token = conn->key->payload.data[0];
memcpy(&iv, token->kad->session_key, sizeof(iv));
 
-   tmpbuf.x[0] = htonl(conn->epoch);
-   tmpbuf.x[1] = htonl(conn->cid);
-   tmpbuf.x[2] = 0;
-   tmpbuf.x[3] = htonl(conn->security_ix);
+   csum_iv = &conn->csum_iv_head;
+   csum_iv[0].x[0] = htonl(conn->epoch);
+   csum_iv[0].x[1] = htonl(conn->cid);
+   csum_iv[1].x[0] = 0;
+   csum_iv[1].x[1] = htonl(conn->security_ix);
 
-   sg_init_one(&sg[0], &tmpbuf, sizeof(tmpbuf));
-   sg_init_one(&sg[1], &tmpbuf, sizeof(tmpbuf));
+   sg_init_one(&sg, csum_iv, 16);
 
skcipher_request_set_tfm(req, conn->cipher);
skcipher_request_set_callback(req, 0, NULL, NULL);
-   skcipher_request_set_crypt(req, &sg[1], &sg[0], sizeof(tmpbuf), iv.x);
+   skcipher_request_set_crypt(req, &sg, &sg, 16, iv.x);
 
crypto_skcipher_encrypt(req);
skcipher_request_zero(req);
 
-   memcpy(&conn->csum_iv, &tmpbuf.x[2], sizeof(conn->csum_iv));
-   ASSERTCMP((u32 __force)conn->csum_iv.n[0], ==, (u32 __force)tmpbuf.x[2]);
-
_leave("");
 }
 
@@ -150,12 +145,9 @@ static int rxkad_secure_packet_auth(const struct rxrpc_call *call,
 {
struct rxrpc_skb_priv *sp;
SKCIPHER_REQUEST_ON_STACK(req, call->conn->cipher);
+   struct rxkad_level1_hdr hdr;
struct rxrpc_crypt iv;
-   struct scatterlist sg[2];
-   struct {
-   struct rxkad_level1_hdr hdr;
-   __be32  first;  /* first four bytes of data and padding */
-   } tmpbuf __attribute__((aligned(8))); /* must all be in same page */
+   struct scatterlist sg;
u16 check;
 
sp = rxrpc_skb(skb);
@@ -165,24 +157,21 @@ static int rxkad_secure_packet_auth(const struct rxrpc_call *call,
check = sp->hdr.seq ^ sp->hdr.callNumber;
data_size |= (u32)check << 16;
 
-   tmpbuf.hdr.data_size = htonl(data_size);
-   memcpy(&tmpbuf.first, sechdr + 4, sizeof(tmpbuf.first));
+   hdr.data_size = htonl(data_size);
+   memcpy(sechdr, &hdr, sizeof(hdr));
 
/* start the encryption afresh */
memset(&iv, 0, sizeof(iv));
 
-   sg_init_one(&sg[0], &tmpbuf, sizeof(tmpbuf));
-   sg_init_one(&sg[1], &tmpbuf, sizeof(tmpbuf));
+   sg_init_one(&sg, sechdr, 8);
 
skcipher_request_set_tfm(req, call->conn->cipher);
skcipher_request_set_callback(req, 0, NULL, NULL);
-   skcipher_request_set_crypt(req, &sg[1], &sg[0], sizeof(tmpbuf), iv.x);
+   skcipher_request_set_crypt(req, &sg, &sg, 8, iv.x);
 
crypto_skcipher_encry

[PATCH V2 0/3] basic device IOTLB support for vhost_net

2016-06-22 Thread Jason Wang
This patch series implements a device IOTLB for vhost.  It can be
used in co-operation with a userspace IOMMU implementation (qemu) to
provide a secure DMA environment (DMAR) in the guest.

The idea is simple.  When vhost meets an IOTLB miss, it requests the
assistance of userspace to do the translation.  This is done as
follows:

- on an IOTLB miss, vhost notifies userspace through the vhost_net
  fd, and userspace then reads the fault address, size and access
  type from the vhost fd.
- userspace writes the translation result back to the vhost fd, and
  vhost can then update its IOTLB.
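The miss/reply cycle above can be modelled in plain C.  This is a
hypothetical, self-contained sketch: the struct and function names
here are invented for illustration and are not the vhost uapi; the
resolver callback stands in for userspace answering a miss on the fd.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model only -- not the vhost uapi. */
#define IOTLB_SLOTS 16

struct iotlb_entry {
	uint64_t iova_start, iova_last; /* inclusive iova range */
	uint64_t uaddr;                 /* translated userspace address */
	int valid;
};

struct iotlb {
	struct iotlb_entry e[IOTLB_SLOTS];
	int next;   /* FIFO replacement cursor, as the changelog chose */
	int misses;
};

/* Stands in for userspace answering a miss read on the vhost fd. */
typedef int (*resolver_fn)(uint64_t iova, uint64_t *uaddr,
			   uint64_t *start, uint64_t *last);

static int iotlb_lookup(struct iotlb *t, uint64_t iova, uint64_t *uaddr,
			resolver_fn resolve)
{
	for (int i = 0; i < IOTLB_SLOTS; i++) {
		struct iotlb_entry *e = &t->e[i];
		if (e->valid && iova >= e->iova_start && iova <= e->iova_last) {
			*uaddr = e->uaddr + (iova - e->iova_start);
			return 0; /* hit */
		}
	}
	/* Miss: "notify userspace" and wait for the translation reply. */
	t->misses++;
	struct iotlb_entry n = { .valid = 1 };
	if (resolve(iova, &n.uaddr, &n.iova_start, &n.iova_last))
		return -1;
	t->e[t->next] = n;              /* FIFO replacement */
	t->next = (t->next + 1) % IOTLB_SLOTS;
	return iotlb_lookup(t, iova, uaddr, resolve);
}

/* Offset-map resolver over 4K pages, for demonstration only. */
static int demo_resolve(uint64_t iova, uint64_t *uaddr,
			uint64_t *start, uint64_t *last)
{
	*start = iova & ~0xfffULL;
	*last = *start + 0xfff;
	*uaddr = *start + 0x100000; /* arbitrary fixed offset */
	return 0;
}
```

A second lookup within the same cached page is served without another
round-trip, which is why the fixed-mapping (dpdk) case stays fast.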

The code is optimized for fixed-mapping users, e.g. dpdk in the
guest.  It will be slow if dynamic mappings are used in the guest; we
could do optimizations on top.

The code is designed to be architecture-independent and should be
easily ported to any architecture.

Stress tested with l2fwd/vfio in the guest with 4K/2M/1G page sizes.
In the 1G hugepage case, a 100% TLB hit rate was observed.

I ran a benchmark on this with l2fwd in the guest.  For 2MB pages
there is no difference in 64B performance, and I noticed a 4%-5% drop
in 1500B performance compared to UIO in the guest.  We could add a
shortcut to bypass the IOTLB for virtqueue accesses, but I think it's
better to do this on top.

Changes from V1:
- Fix i386 build warnings
- Drop access parameter for vhost_get_vq_desc() (fixes VHOST SCSI build error)

Changes from RFC V3:
- rebase on latest
- minor tweak on commit log
- use VHOST_ACCESS_RO instead of VHOST_ACCESS_WO in vhost_copy_from_user()
- switch to use atomic userspace access helper in vhost_get/put_user()
- remove debug codes in vhost_iotlb_miss()
- use FIFO instead of FILO when doing TLB replacement
- fix unbalanced lock in vhost_process_iotlb_msg()

Changes from RFC V2:
- introduce memory accessors for vhost
- switch from ioctls to ordinary file read/write for iotlb miss and
  updating
- do not assume the virtqueue is virtually mapped contiguously; all
  virtqueue accesses are done through the IOTLB
- verify memory access during IOTLB update and fail early
- introduce a module parameter for the size of IOTLB

Changes from RFC V1:
- support any size/range of update and invalidation by introducing
  the interval tree.
- convert from per-device iotlb requests to per-virtqueue iotlb
  requests; this solves the possible deadlock in V1.
- read/write permission check support.

Please review.

Jason Wang (3):
  vhost: introduce vhost memory accessors
  vhost: convert pre sorted vhost memory array to interval tree
  vhost: device IOTLB API

 drivers/vhost/net.c|  58 +++-
 drivers/vhost/vhost.c  | 806 +++--
 drivers/vhost/vhost.h  |  57 +++-
 include/uapi/linux/vhost.h |  28 ++
 4 files changed, 829 insertions(+), 120 deletions(-)

-- 
1.8.3.1



[PATCH V2 2/3] vhost: convert pre sorted vhost memory array to interval tree

2016-06-22 Thread Jason Wang
The current pre-sorted memory region array has some limitations for
the future device IOTLB conversion:

1) adding and removing a single region needs extra work, and it is
   expected to be slow because of sorting or memory re-allocation.
2) removing a large range which may intersect several regions of
   different sizes needs extra work.
3) a replacement policy like LRU needs tricks.

To overcome the above shortcomings, this patch converts it to an
interval tree, which can easily address the above issues with almost
no extra work.

The patch could be used for:

- Extend the current API and let userspace send only diffs of the
  memory table.
- Simplify Device IOTLB implementation.
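The interval semantics the patch gets from INTERVAL_TREE_DEFINE can
be illustrated with a deliberately naive flat-array version.  All
names below are invented for the example; the point is the three
operations the cover letter lists — single-region add, removal of a
large range that intersects several regions, and lookup — which the
rbtree-backed interval tree performs in O(log n) instead of O(n).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Naive stand-in for the kernel's interval tree; example names only. */
struct region {
	uint64_t start, last; /* inclusive */
	int used;
};

#define MAX_REGIONS 32
static struct region regions[MAX_REGIONS];

static int region_add(uint64_t start, uint64_t last)
{
	for (int i = 0; i < MAX_REGIONS; i++) {
		if (!regions[i].used) {
			regions[i] = (struct region){ start, last, 1 };
			return 0;
		}
	}
	return -1;
}

/* Remove every region intersecting [start, last] -- the "remove a
 * large range which may intersect several regions" case. */
static int region_remove_range(uint64_t start, uint64_t last)
{
	int removed = 0;
	for (int i = 0; i < MAX_REGIONS; i++) {
		if (regions[i].used &&
		    regions[i].start <= last && regions[i].last >= start) {
			regions[i].used = 0;
			removed++;
		}
	}
	return removed;
}

/* First region covering addr, or NULL. */
static struct region *region_lookup(uint64_t addr)
{
	for (int i = 0; i < MAX_REGIONS; i++)
		if (regions[i].used &&
		    addr >= regions[i].start && addr <= regions[i].last)
			return &regions[i];
	return NULL;
}
```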

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c   |   8 +--
 drivers/vhost/vhost.c | 182 --
 drivers/vhost/vhost.h |  27 ++--
 3 files changed, 128 insertions(+), 89 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 1d3e45f..fc66956 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1038,20 +1038,20 @@ static long vhost_net_reset_owner(struct vhost_net *n)
struct socket *tx_sock = NULL;
struct socket *rx_sock = NULL;
long err;
-   struct vhost_memory *memory;
+   struct vhost_umem *umem;
 
mutex_lock(&n->dev.mutex);
err = vhost_dev_check_owner(&n->dev);
if (err)
goto done;
-   memory = vhost_dev_reset_owner_prepare();
-   if (!memory) {
+   umem = vhost_dev_reset_owner_prepare();
+   if (!umem) {
err = -ENOMEM;
goto done;
}
vhost_net_stop(n, &tx_sock, &rx_sock);
vhost_net_flush(n);
-   vhost_dev_reset_owner(&n->dev, memory);
+   vhost_dev_reset_owner(&n->dev, umem);
vhost_net_vq_reset(n);
 done:
mutex_unlock(&n->dev.mutex);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 9f2a63a..166e779 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "vhost.h"
 
@@ -42,6 +43,10 @@ enum {
 #define vhost_used_event(vq) ((__virtio16 __user *)&vq->avail->ring[vq->num])
 #define vhost_avail_event(vq) ((__virtio16 __user *)&vq->used->ring[vq->num])
 
+INTERVAL_TREE_DEFINE(struct vhost_umem_node,
+rb, __u64, __subtree_last,
+START, LAST, , vhost_umem_interval_tree);
+
 #ifdef CONFIG_VHOST_CROSS_ENDIAN_LEGACY
 static void vhost_disable_cross_endian(struct vhost_virtqueue *vq)
 {
@@ -300,10 +305,10 @@ static void vhost_vq_reset(struct vhost_dev *dev,
vq->call_ctx = NULL;
vq->call = NULL;
vq->log_ctx = NULL;
-   vq->memory = NULL;
vhost_reset_is_le(vq);
vhost_disable_cross_endian(vq);
vq->busyloop_timeout = 0;
+   vq->umem = NULL;
 }
 
 static int vhost_worker(void *data)
@@ -407,7 +412,7 @@ void vhost_dev_init(struct vhost_dev *dev,
mutex_init(&dev->mutex);
dev->log_ctx = NULL;
dev->log_file = NULL;
-   dev->memory = NULL;
+   dev->umem = NULL;
dev->mm = NULL;
spin_lock_init(&dev->work_lock);
INIT_LIST_HEAD(&dev->work_list);
@@ -512,27 +517,36 @@ err_mm:
 }
 EXPORT_SYMBOL_GPL(vhost_dev_set_owner);
 
-struct vhost_memory *vhost_dev_reset_owner_prepare(void)
+static void *vhost_kvzalloc(unsigned long size)
 {
-   return kmalloc(offsetof(struct vhost_memory, regions), GFP_KERNEL);
+   void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+
+   if (!n)
+   n = vzalloc(size);
+   return n;
+}
+
+struct vhost_umem *vhost_dev_reset_owner_prepare(void)
+{
+   return vhost_kvzalloc(sizeof(struct vhost_umem));
 }
 EXPORT_SYMBOL_GPL(vhost_dev_reset_owner_prepare);
 
 /* Caller should have device mutex */
-void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_memory *memory)
+void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_umem *umem)
 {
int i;
 
vhost_dev_cleanup(dev, true);
 
/* Restore memory to default empty mapping. */
-   memory->nregions = 0;
-   dev->memory = memory;
+   INIT_LIST_HEAD(&umem->umem_list);
+   dev->umem = umem;
/* We don't need VQ locks below since vhost_dev_cleanup makes sure
 * VQs aren't running.
 */
for (i = 0; i < dev->nvqs; ++i)
-   dev->vqs[i]->memory = memory;
+   dev->vqs[i]->umem = umem;
 }
 EXPORT_SYMBOL_GPL(vhost_dev_reset_owner);
 
@@ -549,6 +563,21 @@ void vhost_dev_stop(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_stop);
 
+static void vhost_umem_clean(struct vhost_umem *umem)
+{
+   struct vhost_umem_node *node, *tmp;
+
+   if (!umem)
+   return;
+
+   list_for_each_entry_safe(node, tmp, &umem->umem_list, link) {
+   vhost_umem_interval_tree_remove(node, &umem->umem_tree);
+   list_del(&node->link);
+  

[PATCH V2 3/3] vhost: device IOTLB API

2016-06-22 Thread Jason Wang
This patch implements a device IOTLB for vhost.  It can be used in
co-operation with a userspace (qemu) implementation of DMA
remapping.

The idea is simple: cache the translations in a software device
IOTLB (implemented as an interval tree) in vhost, and use the
vhost_net file descriptor for reporting IOTLB misses and for IOTLB
update/invalidation.  When vhost meets an IOTLB miss, the fault
address, size and access type can be read from the file.  After
userspace finishes the translation, it writes the translated address
to the vhost_net file to update the device IOTLB.

When the device IOTLB (VHOST_F_DEVICE_IOTLB) is enabled, all vq
addresses set by ioctl are treated as iovas instead of virtual
addresses, and access can only be done through the IOTLB instead of
direct userspace memory access.  Before each round of vq processing,
all vq metadata is prefetched into the device IOTLB to make sure no
translation fault happens during vq processing.

In most cases the virtqueue is mapped contiguously even in the
virtual address space, so the IOTLB translation for the virtqueue
itself may be a little slower.  We can add a fast path on top of
this patch.
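The "prefetch vq metadata before each round" idea boils down to a
coverage check: the whole metadata iova range must already have
cached translations, possibly spread across several cached ranges,
before the processing loop starts.  A hypothetical sketch (names
invented for the example, not the vhost code):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative coverage check; all names invented for the example. */
struct tlb_range {
	uint64_t start, last; /* inclusive iova range */
};

/* Return 1 if [iova, iova+len) is fully covered by cached ranges,
 * walking across possibly multiple entries; 0 on the first hole
 * (where the real code would issue an IOTLB miss request). */
static int range_covered(const struct tlb_range *cache, int n,
			 uint64_t iova, uint64_t len)
{
	uint64_t addr = iova, end = iova + len - 1;

	while (addr <= end) {
		int hit = 0;
		for (int i = 0; i < n; i++) {
			if (addr >= cache[i].start && addr <= cache[i].last) {
				if (cache[i].last >= end)
					return 1; /* fully covered */
				/* jump past this cached range */
				addr = cache[i].last + 1;
				hit = 1;
				break;
			}
		}
		if (!hit)
			return 0; /* hole: translation fault would occur */
	}
	return 1;
}
```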

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c|  50 +++-
 drivers/vhost/vhost.c  | 634 ++---
 drivers/vhost/vhost.h  |  32 ++-
 include/uapi/linux/vhost.h |  28 ++
 4 files changed, 697 insertions(+), 47 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index fc66956..4ccebad 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -61,7 +61,8 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
 enum {
VHOST_NET_FEATURES = VHOST_FEATURES |
 (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
-(1ULL << VIRTIO_NET_F_MRG_RXBUF)
+(1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+(1ULL << VHOST_F_DEVICE_IOTLB)
 };
 
 enum {
@@ -334,7 +335,7 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
 {
unsigned long uninitialized_var(endtime);
int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
-   out_num, in_num, NULL, NULL);
+ out_num, in_num, NULL, NULL);
 
if (r == vq->num && vq->busyloop_timeout) {
preempt_disable();
@@ -344,7 +345,7 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
cpu_relax_lowlatency();
preempt_enable();
r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
-   out_num, in_num, NULL, NULL);
+ out_num, in_num, NULL, NULL);
}
 
return r;
@@ -377,6 +378,9 @@ static void handle_tx(struct vhost_net *net)
if (!sock)
goto out;
 
+   if (!vq_iotlb_prefetch(vq))
+   goto out;
+
vhost_disable_notify(&net->dev, vq);
 
hdr_size = nvq->vhost_hlen;
@@ -638,6 +642,10 @@ static void handle_rx(struct vhost_net *net)
sock = vq->private_data;
if (!sock)
goto out;
+
+   if (!vq_iotlb_prefetch(vq))
+   goto out;
+
vhost_disable_notify(&net->dev, vq);
vhost_net_disable_vq(net, vq);
 
@@ -1086,6 +1094,11 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features)
mutex_unlock(&n->dev.mutex);
return -EFAULT;
}
+   if ((features & (1ULL << VHOST_F_DEVICE_IOTLB))) {
+   if (vhost_init_device_iotlb(&n->dev, true))
+   return -EFAULT;
+   }
+
for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
mutex_lock(&n->vqs[i].vq.mutex);
n->vqs[i].vq.acked_features = features;
@@ -1168,9 +1181,40 @@ static long vhost_net_compat_ioctl(struct file *f, unsigned int ioctl,
 }
 #endif
 
+static ssize_t vhost_net_chr_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+   struct file *file = iocb->ki_filp;
+   struct vhost_net *n = file->private_data;
+   struct vhost_dev *dev = &n->dev;
+   int noblock = file->f_flags & O_NONBLOCK;
+
+   return vhost_chr_read_iter(dev, to, noblock);
+}
+
+static ssize_t vhost_net_chr_write_iter(struct kiocb *iocb,
+   struct iov_iter *from)
+{
+   struct file *file = iocb->ki_filp;
+   struct vhost_net *n = file->private_data;
+   struct vhost_dev *dev = &n->dev;
+
+   return vhost_chr_write_iter(dev, from);
+}
+
+static unsigned int vhost_net_chr_poll(struct file *file, poll_table *wait)
+{
+   struct vhost_net *n = file->private_data;
+   struct vhost_dev *dev = &n->dev;
+
+   return vhost_chr_poll(file, dev, wait);
+}
+
 static const struct file_operations vhost_net_fops = {
.owner  = THIS_MODULE,
.release= vhost_net_release,
+   .read_iter  = vhost_net_chr_read_iter,
+ 

[PATCH V2 1/3] vhost: introduce vhost memory accessors

2016-06-22 Thread Jason Wang
This patch introduces vhost memory accessors, which are just
wrappers around the userspace address access helpers.  This is a
prerequisite for the vhost device IOTLB implementation, which will
add IOTLB translation to those accessors.

Signed-off-by: Jason Wang 
---
 drivers/vhost/vhost.c | 50 +++---
 1 file changed, 35 insertions(+), 15 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 669fef1..9f2a63a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -651,6 +651,22 @@ static int memory_access_ok(struct vhost_dev *d, struct vhost_memory *mem,
return 1;
 }
 
+#define vhost_put_user(vq, x, ptr)  __put_user(x, ptr)
+
+static int vhost_copy_to_user(struct vhost_virtqueue *vq, void *to,
+ const void *from, unsigned size)
+{
+   return __copy_to_user(to, from, size);
+}
+
+#define vhost_get_user(vq, x, ptr) __get_user(x, ptr)
+
+static int vhost_copy_from_user(struct vhost_virtqueue *vq, void *to,
+   void *from, unsigned size)
+{
+   return __copy_from_user(to, from, size);
+}
+
 static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
struct vring_desc __user *desc,
struct vring_avail __user *avail,
@@ -1156,7 +1172,8 @@ EXPORT_SYMBOL_GPL(vhost_log_write);
 static int vhost_update_used_flags(struct vhost_virtqueue *vq)
 {
void __user *used;
-   if (__put_user(cpu_to_vhost16(vq, vq->used_flags), &vq->used->flags) < 0)
+   if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
+  &vq->used->flags) < 0)
return -EFAULT;
if (unlikely(vq->log_used)) {
/* Make sure the flag is seen before log. */
@@ -1174,7 +1191,8 @@ static int vhost_update_used_flags(struct vhost_virtqueue *vq)
 
static int vhost_update_avail_event(struct vhost_virtqueue *vq, u16 avail_event)
 {
-   if (__put_user(cpu_to_vhost16(vq, vq->avail_idx), vhost_avail_event(vq)))
+   if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
+  vhost_avail_event(vq)))
return -EFAULT;
if (unlikely(vq->log_used)) {
void __user *used;
@@ -1212,7 +1230,7 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq)
r = -EFAULT;
goto err;
}
-   r = __get_user(last_used_idx, &vq->used->idx);
+   r = vhost_get_user(vq, last_used_idx, &vq->used->idx);
if (r)
goto err;
vq->last_used_idx = vhost16_to_cpu(vq, last_used_idx);
@@ -1392,7 +1410,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 
/* Check it isn't doing very strange things with descriptor numbers. */
last_avail_idx = vq->last_avail_idx;
-   if (unlikely(__get_user(avail_idx, &vq->avail->idx))) {
+   if (unlikely(vhost_get_user(vq, avail_idx, &vq->avail->idx))) {
vq_err(vq, "Failed to access avail idx at %p\n",
   &vq->avail->idx);
return -EFAULT;
@@ -1414,8 +1432,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 
/* Grab the next descriptor number they're advertising, and increment
 * the index we've seen. */
-   if (unlikely(__get_user(ring_head,
-   &vq->avail->ring[last_avail_idx & (vq->num - 1)]))) {
+   if (unlikely(vhost_get_user(vq, ring_head,
+&vq->avail->ring[last_avail_idx & (vq->num - 1)]))) {
vq_err(vq, "Failed to read head: idx %d address %p\n",
   last_avail_idx,
   &vq->avail->ring[last_avail_idx % vq->num]);
@@ -1450,7 +1468,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
   i, vq->num, head);
return -EINVAL;
}
-   ret = __copy_from_user(&desc, vq->desc + i, sizeof desc);
+   ret = vhost_copy_from_user(vq, &desc, vq->desc + i,
+  sizeof desc);
if (unlikely(ret)) {
vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
   i, vq->desc + i);
@@ -1538,15 +1557,15 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
start = vq->last_used_idx & (vq->num - 1);
used = vq->used->ring + start;
if (count == 1) {
-   if (__put_user(heads[0].id, &used->id)) {
+   if (vhost_put_user(vq, heads[0].id, &used->id)) {
vq_err(vq, "Failed to write used id");
return -EFAULT;
}
-   if (__put_user(heads[0].len, &used->len)) {
+   if (vhost_put_user(vq, heads[0].len, &used->len)) {
vq_err(vq, "Failed to write used len");
return -EFAULT;
}
-   } else i

Re: 802.3ad bonding aggregator reselection

2016-06-22 Thread Jay Vosburgh
Veli-Matti Lintu  wrote:
[...]
>> I don't see an issue with the above behavior when ad_select is
>> set to the default value of "stable"; bonding does reselect a new
>> aggregator when all links fail, and it appears to follow the standard.
>
>In my testing ad_select=stable does not reselect a new aggregator when
>all links have failed. Reselection seems to occur only when a link
>comes up after the failure. Here's an example of two bonds having three
>links each. Aggregator ID 3 is active with three ports and ID 2 also
>has three ports up.

Yes, I've since observed that as well.

[...]
>Are you able to reproduce this or is reselection working as expected?

Reselection is not working correctly at all.

I'm working up a more comprehensive fix; the setting of BEGIN in
the older code masked a number of issues in the reselection logic that
never came up because setting BEGIN would do a full reselection from
scratch at every slave carrier state change (meaning that no aggregator
ever ended up with link down ports as members).

My test patch at the moment is below (this is against net); any
testing or review would be appreciated.  I have not tested the ad_select
bandwidth behavior of this yet; I've been testing stable and count
first.

This patch should be conformant to the standard, which requires
link down ports to remain selected, but implementations are free to
choose an active aggregator however they wish.

diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
index b9304a295f86..57be940c4c37 100644
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -657,6 +657,20 @@ static void __set_agg_ports_ready(struct aggregator *aggregator, int val)
}
 }
 
+static int __agg_active_ports(struct aggregator *agg)
+{
+   struct port *port;
+   int active = 0;
+
+   for (port = agg->lag_ports; port;
+port = port->next_port_in_aggregator) {
+   if (port->is_enabled)
+   active++;
+   }
+
+   return active;
+}
+
 /**
  * __get_agg_bandwidth - get the total bandwidth of an aggregator
  * @aggregator: the aggregator we're looking at
@@ -665,38 +679,39 @@ static void __set_agg_ports_ready(struct aggregator *aggregator, int val)
 static u32 __get_agg_bandwidth(struct aggregator *aggregator)
 {
u32 bandwidth = 0;
+   int nports = __agg_active_ports(aggregator);
 
-   if (aggregator->num_of_ports) {
+   if (nports) {
switch (__get_link_speed(aggregator->lag_ports)) {
case AD_LINK_SPEED_1MBPS:
-   bandwidth = aggregator->num_of_ports;
+   bandwidth = nports;
break;
case AD_LINK_SPEED_10MBPS:
-   bandwidth = aggregator->num_of_ports * 10;
+   bandwidth = nports * 10;
break;
case AD_LINK_SPEED_100MBPS:
-   bandwidth = aggregator->num_of_ports * 100;
+   bandwidth = nports * 100;
break;
case AD_LINK_SPEED_1000MBPS:
-   bandwidth = aggregator->num_of_ports * 1000;
+   bandwidth = nports * 1000;
break;
case AD_LINK_SPEED_2500MBPS:
-   bandwidth = aggregator->num_of_ports * 2500;
+   bandwidth = nports * 2500;
break;
case AD_LINK_SPEED_10000MBPS:
-   bandwidth = aggregator->num_of_ports * 10000;
+   bandwidth = nports * 10000;
break;
case AD_LINK_SPEED_20000MBPS:
-   bandwidth = aggregator->num_of_ports * 20000;
+   bandwidth = nports * 20000;
break;
case AD_LINK_SPEED_40000MBPS:
-   bandwidth = aggregator->num_of_ports * 40000;
+   bandwidth = nports * 40000;
break;
case AD_LINK_SPEED_56000MBPS:
-   bandwidth = aggregator->num_of_ports * 56000;
+   bandwidth = nports * 56000;
break;
case AD_LINK_SPEED_100000MBPS:
-   bandwidth = aggregator->num_of_ports * 100000;
+   bandwidth = nports * 100000;
break;
default:
bandwidth = 0; /* to silence the compiler */
@@ -1530,10 +1545,10 @@ static struct aggregator *ad_agg_selection_test(struct aggregator *best,
 
switch (__get_agg_selection_mode(curr->lag_ports)) {
case BOND_AD_COUNT:
-   if (curr->num_of_ports > best->num_of_ports)
+   if (__agg_active_ports(curr) > __agg_active_ports(best))
return curr;
 
-

[PATCH 1/2] net: thunderx: Fix link status reporting

2016-06-22 Thread sunil . kovvuri
From: Sunil Goutham 

Check for SMU RX local/remote faults along with the SPU link
status.  Otherwise, the link is sometimes UP at our end but DOWN at
the link partner's side.  Also, due to an issue in BGX, it is
occasionally seen that initialization doesn't happen properly and
SMU RX reports faults while everything is fine at the SPU.  This
patch reinitializes the LMAC to recover from that condition.

Also fixed the LMAC disable sequence to properly bring down the link.

Signed-off-by: Sunil Goutham 
Signed-off-by: Tao Wang 

---
 drivers/net/ethernet/cavium/thunder/thunder_bgx.c | 86 +++
 drivers/net/ethernet/cavium/thunder/thunder_bgx.h |  2 +
 2 files changed, 57 insertions(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/thunder_bgx.c b/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
index 3ed2198..b96aae0 100644
--- a/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
+++ b/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
@@ -551,7 +551,9 @@ static int bgx_xaui_check_link(struct lmac *lmac)
}
 
/* Clear rcvflt bit (latching high) and read it back */
-   bgx_reg_modify(bgx, lmacid, BGX_SPUX_STATUS2, SPU_STATUS2_RCVFLT);
+   if (bgx_reg_read(bgx, lmacid, BGX_SPUX_STATUS2) & SPU_STATUS2_RCVFLT)
+   bgx_reg_modify(bgx, lmacid,
+  BGX_SPUX_STATUS2, SPU_STATUS2_RCVFLT);
if (bgx_reg_read(bgx, lmacid, BGX_SPUX_STATUS2) & SPU_STATUS2_RCVFLT) {
dev_err(&bgx->pdev->dev, "Receive fault, retry training\n");
if (bgx->use_training) {
@@ -570,13 +572,6 @@ static int bgx_xaui_check_link(struct lmac *lmac)
return -1;
}
 
-   /* Wait for MAC RX to be ready */
-   if (bgx_poll_reg(bgx, lmacid, BGX_SMUX_RX_CTL,
-SMU_RX_CTL_STATUS, true)) {
-   dev_err(&bgx->pdev->dev, "SMU RX link not okay\n");
-   return -1;
-   }
-
/* Wait for BGX RX to be idle */
if (bgx_poll_reg(bgx, lmacid, BGX_SMUX_CTL, SMU_CTL_RX_IDLE, false)) {
dev_err(&bgx->pdev->dev, "SMU RX not idle\n");
@@ -589,29 +584,30 @@ static int bgx_xaui_check_link(struct lmac *lmac)
return -1;
}
 
-   if (bgx_reg_read(bgx, lmacid, BGX_SPUX_STATUS2) & SPU_STATUS2_RCVFLT) {
-   dev_err(&bgx->pdev->dev, "Receive fault\n");
-   return -1;
-   }
-
-   /* Receive link is latching low. Force it high and verify it */
-   bgx_reg_modify(bgx, lmacid, BGX_SPUX_STATUS1, SPU_STATUS1_RCV_LNK);
-   if (bgx_poll_reg(bgx, lmacid, BGX_SPUX_STATUS1,
-SPU_STATUS1_RCV_LNK, false)) {
-   dev_err(&bgx->pdev->dev, "SPU receive link down\n");
-   return -1;
-   }
-
+   /* Clear receive packet disable */
cfg = bgx_reg_read(bgx, lmacid, BGX_SPUX_MISC_CONTROL);
cfg &= ~SPU_MISC_CTL_RX_DIS;
bgx_reg_write(bgx, lmacid, BGX_SPUX_MISC_CONTROL, cfg);
-   return 0;
+
+   /* Check for MAC RX faults */
+   cfg = bgx_reg_read(bgx, lmacid, BGX_SMUX_RX_CTL);
+   /* 0 - Link is okay, 1 - Local fault, 2 - Remote fault */
+   cfg &= SMU_RX_CTL_STATUS;
+   if (!cfg)
+   return 0;
+
+   /* Rx local/remote fault seen.
+* Do lmac reinit to see if condition recovers
+*/
+   bgx_lmac_xaui_init(bgx, lmacid, bgx->lmac_type);
+
+   return -1;
 }
 
 static void bgx_poll_for_link(struct work_struct *work)
 {
struct lmac *lmac;
-   u64 link;
+   u64 spu_link, smu_link;
 
lmac = container_of(work, struct lmac, dwork.work);
 
@@ -621,8 +617,11 @@ static void bgx_poll_for_link(struct work_struct *work)
bgx_poll_reg(lmac->bgx, lmac->lmacid, BGX_SPUX_STATUS1,
 SPU_STATUS1_RCV_LNK, false);
 
-   link = bgx_reg_read(lmac->bgx, lmac->lmacid, BGX_SPUX_STATUS1);
-   if (link & SPU_STATUS1_RCV_LNK) {
+   spu_link = bgx_reg_read(lmac->bgx, lmac->lmacid, BGX_SPUX_STATUS1);
+   smu_link = bgx_reg_read(lmac->bgx, lmac->lmacid, BGX_SMUX_RX_CTL);
+
+   if ((spu_link & SPU_STATUS1_RCV_LNK) &&
+   !(smu_link & SMU_RX_CTL_STATUS)) {
lmac->link_up = 1;
if (lmac->bgx->lmac_type == BGX_MODE_XLAUI)
lmac->last_speed = 4;
@@ -636,9 +635,15 @@ static void bgx_poll_for_link(struct work_struct *work)
}
 
if (lmac->last_link != lmac->link_up) {
+   if (lmac->link_up) {
+   if (bgx_xaui_check_link(lmac)) {
+   /* Errors, clear link_up state */
+   lmac->link_up = 0;
+   lmac->last_speed = SPEED_UNKNOWN;
+   lmac->last_duplex = DUPLEX_UNKNOWN;
+   }
+   }
lmac->last_link = lmac->link_up;
-   if (lmac->link_up)
-   bgx_xaui_check_link(lmac);
}
 
queue_delay

[PATCH 2/2] net: thunderx: Fix TL4 configuration for secondary Qsets

2016-06-22 Thread sunil . kovvuri
From: Sunil Goutham 

The TL4 calculation for a given SQ of secondary Qsets is incorrect
and goes out of bounds.  Also, for some SQs the chosen TL4 will
transmit data via a different BGX interface than the primary Qset's
interface.

This patch fixes this issue.
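The corrected index selection can be sketched in isolation.  The
constants below are placeholders chosen for the example, not the real
thunderx register values; the structure of the arithmetic mirrors the
patch: primary Qsets index by (bgx, lmac) as before, while secondary
Qsets skip past the primary block and index by (lmac, svf) so their
traffic stays on the primary Qset's interface.

```c
#include <assert.h>

/* Placeholder constants for illustration only -- not the hardware values. */
#define NIC_TL4_PER_LMAC 8
#define NIC_TL4_PER_BGX  64
#define MAX_LMAC_PER_BGX 4
#define MAX_SQS_PER_VF   2

/* Primary Qset: index by (bgx, lmac) as before the patch. */
static int tl4_primary(int bgx, int lmac, int sq_idx)
{
	return bgx * NIC_TL4_PER_BGX + lmac * NIC_TL4_PER_LMAC + sq_idx;
}

/* Secondary Qset: skip past the primary block, then index by
 * (lmac, svf) so every secondary SQ stays in a bounded, per-interface
 * region instead of spilling into another BGX's range. */
static int tl4_secondary(int bgx, int lmac, int svf, int sq_idx)
{
	int tl4 = MAX_LMAC_PER_BGX * NIC_TL4_PER_LMAC;

	tl4 += lmac * NIC_TL4_PER_LMAC * MAX_SQS_PER_VF;
	tl4 += svf * NIC_TL4_PER_LMAC;
	tl4 += bgx * NIC_TL4_PER_BGX;
	return tl4 + sq_idx;
}
```

With these placeholder constants, the primary region for bgx 0 ends at
index 31 and every secondary (lmac, svf) pair gets its own disjoint
8-entry block starting at 32.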

Signed-off-by: Sunil Goutham 
---
 drivers/net/ethernet/cavium/thunder/nic_main.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nic_main.c b/drivers/net/ethernet/cavium/thunder/nic_main.c
index 95f17f8..16ed203 100644
--- a/drivers/net/ethernet/cavium/thunder/nic_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nic_main.c
@@ -499,6 +499,7 @@ static void nic_tx_channel_cfg(struct nicpf *nic, u8 vnic,
u32 rr_quantum;
u8 sq_idx = sq->sq_num;
u8 pqs_vnic;
+   int svf;
 
if (sq->sqs_mode)
pqs_vnic = nic->pqs_vf[vnic];
@@ -511,10 +512,19 @@ static void nic_tx_channel_cfg(struct nicpf *nic, u8 vnic,
/* 24 bytes for FCS, IPG and preamble */
rr_quantum = ((NIC_HW_MAX_FRS + 24) / 4);
 
-   tl4 = (lmac * NIC_TL4_PER_LMAC) + (bgx * NIC_TL4_PER_BGX);
+   if (!sq->sqs_mode) {
+   tl4 = (lmac * NIC_TL4_PER_LMAC) + (bgx * NIC_TL4_PER_BGX);
+   } else {
+   for (svf = 0; svf < MAX_SQS_PER_VF; svf++) {
+   if (nic->vf_sqs[pqs_vnic][svf] == vnic)
+   break;
+   }
+   tl4 = (MAX_LMAC_PER_BGX * NIC_TL4_PER_LMAC);
+   tl4 += (lmac * NIC_TL4_PER_LMAC * MAX_SQS_PER_VF);
+   tl4 += (svf * NIC_TL4_PER_LMAC);
+   tl4 += (bgx * NIC_TL4_PER_BGX);
+   }
tl4 += sq_idx;
-   if (sq->sqs_mode)
-   tl4 += vnic * 8;
 
tl3 = tl4 / (NIC_MAX_TL4 / NIC_MAX_TL3);
nic_reg_write(nic, NIC_PF_QSET_0_127_SQ_0_7_CFG2 |
-- 
2.7.4



[PATCH 0/2] net: thunderx: Miscellaneous fixes

2016-06-22 Thread sunil . kovvuri
From: Sunil Goutham 

This 2-patch series fixes issues with physical link status reporting
and transmit datapath configuration for secondary Qsets.

Sunil Goutham (2):
  net: thunderx: Fix link status reporting
  net: thunderx: Fix TL4 configuration for secondary Qsets

 drivers/net/ethernet/cavium/thunder/nic_main.c| 16 -
 drivers/net/ethernet/cavium/thunder/thunder_bgx.c | 86 +++
 drivers/net/ethernet/cavium/thunder/thunder_bgx.h |  2 +
 3 files changed, 70 insertions(+), 34 deletions(-)

-- 
2.7.4



Re: [PATCH net-next 10/19] net: hns: bugfix about pfc pause frame statistics

2016-06-22 Thread Yisen Zhuang


在 2016/6/22 17:41, Andy Shevchenko 写道:
> On Wed, 2016-06-22 at 09:43 +0800, Yisen Zhuang wrote:
>>
>> 在 2016/6/21 18:32, Andy Shevchenko 写道:
>>> On Tue, 2016-06-21 at 11:56 +0800, Yisen Zhuang wrote:
 From: Daode Huang 

For SoC hip06, PFC pause is handled in dsaf, while on hip05 it is
handled in XGMAC, so move the pfc pause statistics to dsaf and
remove the old pfc pause frame statistics.

>>>  
>>>
+static char *hns_dsaf_get_node_stats_strings(char *data, int node,
+   struct dsaf_device *dsaf_dev)
  {
char *buff = data;
 +  int i;
 +  bool is_ver1 = AE_IS_VER1(dsaf_dev->dsaf_ver);
  
snprintf(buff, ETH_GSTRING_LEN, "innod%d_pad_drop_pkts", node);
buff = buff + ETH_GSTRING_LEN;
@@ -2502,6 +2530,18 @@ static char *hns_dsaf_get_node_stats_strings(char *data, int node)
buff = buff + ETH_GSTRING_LEN;
snprintf(buff, ETH_GSTRING_LEN, "innod%d_stp_drop_pkts", node);
buff = buff + ETH_GSTRING_LEN;
 +  if ((node < DSAF_SERVICE_NW_NUM) && (!is_ver1)) {
>>>
>>> Redundant parens.
>>>
 +  for (i = 0; i < DSAF_PRIO_NR; i++) {
 +  snprintf(buff, ETH_GSTRING_LEN,
 +   "inod%d_pfc_prio%d_pkts", node,
 i);
 +  buff = buff + ETH_GSTRING_LEN;
>>>
>>> buff += ...
>>>
 +  }
 +  for (i = 0; i < DSAF_PRIO_NR; i++) {
 +  snprintf(buff, ETH_GSTRING_LEN,
 +   "onod%d_pfc_prio%d_pkts", node,
 i);
 +  buff = buff + ETH_GSTRING_LEN;
>>>
>>> Ditto.
>>>
  {
u64 *p = data;
 +  int i;
struct dsaf_hw_stats *hw_stats = &ddev-
> hw_stats[node_num];
 +  bool is_ver1 = AE_IS_VER1(ddev->dsaf_ver);
  
p[0] = hw_stats->pad_drop;
p[1] = hw_stats->man_pkts;
 @@ -2527,8 +2569,16 @@ static u64 *hns_dsaf_get_node_stats(struct
 dsaf_device *ddev, u64 *data,
p[10] = hw_stats->local_addr_false;
p[11] = hw_stats->vlan_drop;
p[12] = hw_stats->stp_drop;
 -  p[13] = hw_stats->tx_pkts;
 +  if ((node_num < DSAF_SERVICE_NW_NUM) && (!is_ver1)) {
 +  for (i = 0; i < DSAF_PRIO_NR; i++) {
 +  p[13 + i] = hw_stats->rx_pfc[i];
 +  p[13 + i + DSAF_PRIO_NR] = hw_stats-
> tx_pfc[i];
 +  }
>>>
>>> Two different approaches how to assign data. Above uses 2 for-loops,
>>> here you put everything to one.
>>
>> The above can't be merged into one for-loop, because the length of the
>> string is unknowable.
> 
> It doesn't matter since you are incrementing the start position by a
> constant.
> 
> snprintf(buff, ETH_GSTRING_LEN, "inod%d_pfc_prio%d_pkts", node, i);
> snprintf(buff, ETH_GSTRING_LEN, "onod%d_pfc_prio%d_pkts", node, i);
> 
> Same approach as below can be used
> 
> snprintf(buff + 0 * ETH_GSTRING_LEN * DSAF_PRIO_NR, ETH_GSTRING_LEN, ...
> snprintf(buff + 1 * ETH_GSTRING_LEN * DSAF_PRIO_NR, ETH_GSTRING_LEN, ...
> 
> Of course to make it less verbose you may add new definition(s) and/ or
> variable(s).
> 
>>
>> And here we put everything in one loop to reduce code.
>>
> 
> I would suggest using the following pattern for such lines:
> p[13 + i + 0 * DSAF_PRIO_NR] = hw_stats->rx_pfc[i];
> p[13 + i + 1 * DSAF_PRIO_NR] = hw_stats->tx_pfc[i];
> 
> That allows the reader to see what you are doing here.
> 
> P.S. This is for future patches since the current one is already applied.

Hi Andy,

Many thanks for your suggestions. I will fix it in a new patch.

Thanks,

Yisen

> 



Re: [patch net-next v5 0/4] return offloaded stats as default and expose original sw stats

2016-06-22 Thread Jiri Pirko
Wed, Jun 22, 2016 at 09:32:25PM CEST, ro...@cumulusnetworks.com wrote:
>On Tue, Jun 21, 2016 at 8:15 AM, Jiri Pirko  wrote:
>> From: Jiri Pirko 
>>
>> The problem we try to handle is about offloaded forwarded packets
>> which are not seen by kernel. Let me try to draw it:
>>
>> port1   port2 (HW stats are counted here)
>>   \  /
>>\/
>> \  /
>>  --(A) ASIC --(B)--
>> |
>>(C)
>> |
>>CPU (SW stats are counted here)
>>
>>
>> Now we have a couple of flows for TX and RX (direction does not matter here):
>>
>> 1) port1->A->ASIC->C->CPU
>>
>>For this flow, HW and SW stats are equal.
>>
>> 2) port1->A->ASIC->C->CPU->C->ASIC->B->port2
>>
>>For this flow, HW and SW stats are equal.
>>
>> 3) port1->A->ASIC->B->port2
>>
>>For this flow, SW stats are 0.
>>
>> The purpose of this patchset is to provide facility for user to
>> find out the difference between flows 1+2 and 3. In other words, user
>> will be able to see the statistics for the slow-path (through kernel).
>>
>> Also note that HW stats are what someone calls "accumulated" stats.
>> Every packet counted by SW is also counted by HW. Not the other way around.
>>
>> As a default the accumulated stats (HW) will be exposed to user
>> so the userspace apps can react properly.
>>
>>
>
>curious, how do you plan to handle virtual device counters like vlan
>and vxlan stats ?.

Yes, that is another problem (1). We will most probably have to push stats
up to these devices. But that problem is orthogonal to this one. The user
will still need two sets of stats, with HW stats being the default. So this
patchset's infra is going to be used as well.


>we can't separate CPU and HW stats there. In some cases (or ASICs) HW
>counters do not include CPU-generated packets... you will have to add
>CPU-generated pkt counters to the hw counters for such virtual device stats.

Can you please provide an example of how that could happen?


>
>example: In the switchdev model, when the user queries bridge vlan stats,
>you will have to add the hw stats to the bridge driver vlan stats and
>return them to the user.

Yep, that is (1).


>
>Having a consistent model for all kinds of stats will help.


Re: [PATCH net-next V2] tun: introduce tx skb ring

2016-06-22 Thread Jason Wang



On 2016-06-23 02:18, Michael S. Tsirkin wrote:

On Fri, Jun 17, 2016 at 03:41:20AM +0300, Michael S. Tsirkin wrote:

>Would it help to have ptr_ring_resize that gets an array of
>rings and resizes them both to same length?

OK, here it is. Untested so far, and no skb wrapper.
Pls let me know whether this is what you had in mind.


Exactly what I want.

Thanks


[PATCH] i40e: Remove redundant memset

2016-06-22 Thread Amitoj Kaur Chawla
Remove a redundant call to memset before a call to memcpy.

The Coccinelle semantic patch used to make this change is as follows:
@@
expression e1,e2,e3,e4;
@@

- memset(e1,e2,e3);
  memcpy(e1,e4,e3);

Signed-off-by: Amitoj Kaur Chawla 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 3449129..45c1671 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -7961,7 +7961,6 @@ static int i40e_config_rss_aq(struct i40e_vsi *vsi, const 
u8 *seed,
u8 *rss_lut;
int ret, i;
 
-   memset(&rss_key, 0, sizeof(rss_key));
memcpy(&rss_key, seed, sizeof(rss_key));
 
rss_lut = kzalloc(pf->rss_table_size, GFP_KERNEL);
-- 
1.9.1



[PATCH 1/2] Bluetooth: Add LED triggers for HCI frames tx and rx

2016-06-22 Thread Guodong Xu
Two LED triggers are defined: tx_led and rx_led. When frames pass
through the HCI core layer, for tx or for rx, the combined LED
blinks.

Verified on HiKey, 96boards. It uses hi6220 SoC and TI WL1835 combo
chip.

Signed-off-by: Guodong Xu 
---
 include/net/bluetooth/hci_core.h |  1 +
 net/bluetooth/hci_core.c |  3 +++
 net/bluetooth/leds.c | 15 +++
 net/bluetooth/leds.h |  2 ++
 4 files changed, 21 insertions(+)

diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h
index dc71473..37b8dd9 100644
--- a/include/net/bluetooth/hci_core.h
+++ b/include/net/bluetooth/hci_core.h
@@ -398,6 +398,7 @@ struct hci_dev {
bdaddr_trpa;
 
struct led_trigger  *power_led;
+   struct led_trigger  *tx_led, *rx_led;
 
int (*open)(struct hci_dev *hdev);
int (*close)(struct hci_dev *hdev);
diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c
index 45a9fc6..c6e1210 100644
--- a/net/bluetooth/hci_core.c
+++ b/net/bluetooth/hci_core.c
@@ -3248,6 +3248,7 @@ int hci_recv_frame(struct hci_dev *hdev, struct sk_buff 
*skb)
skb_queue_tail(&hdev->rx_q, skb);
queue_work(hdev->workqueue, &hdev->rx_work);
 
+   hci_leds_blink_oneshot(hdev->rx_led);
return 0;
 }
 EXPORT_SYMBOL(hci_recv_frame);
@@ -3325,6 +3326,8 @@ static void hci_send_frame(struct hci_dev *hdev, struct 
sk_buff *skb)
BT_ERR("%s sending frame failed (%d)", hdev->name, err);
kfree_skb(skb);
}
+
+   hci_leds_blink_oneshot(hdev->tx_led);
 }
 
 /* Send HCI command */
diff --git a/net/bluetooth/leds.c b/net/bluetooth/leds.c
index 8319c84..c4825d5 100644
--- a/net/bluetooth/leds.c
+++ b/net/bluetooth/leds.c
@@ -19,6 +19,8 @@ struct hci_basic_led_trigger {
 #define to_hci_basic_led_trigger(arg) container_of(arg, \
struct hci_basic_led_trigger, led_trigger)
 
+#define BLUETOOTH_BLINK_DELAY  50 /* ms */
+
 void hci_leds_update_powered(struct hci_dev *hdev, bool enabled)
 {
if (hdev->power_led)
@@ -37,6 +39,15 @@ static void power_activate(struct led_classdev *led_cdev)
led_trigger_event(led_cdev->trigger, powered ? LED_FULL : LED_OFF);
 }
 
+void hci_leds_blink_oneshot(struct led_trigger *trig)
+{
+   unsigned long led_delay = BLUETOOTH_BLINK_DELAY;
+
+   if (!trig)
+   return;
+   led_trigger_blink_oneshot(trig, &led_delay, &led_delay, 0);
+}
+
 static struct led_trigger *led_allocate_basic(struct hci_dev *hdev,
void (*activate)(struct led_classdev *led_cdev),
const char *name)
@@ -71,4 +82,8 @@ void hci_leds_init(struct hci_dev *hdev)
 {
/* initialize power_led */
hdev->power_led = led_allocate_basic(hdev, power_activate, "power");
+   /* initialize tx_led */
+   hdev->tx_led = led_allocate_basic(hdev, NULL, "tx");
+   /* initialize rx_led */
+   hdev->rx_led = led_allocate_basic(hdev, NULL, "rx");
 }
diff --git a/net/bluetooth/leds.h b/net/bluetooth/leds.h
index a9c4d6e..9b1cccd 100644
--- a/net/bluetooth/leds.h
+++ b/net/bluetooth/leds.h
@@ -9,8 +9,10 @@
 #if IS_ENABLED(CONFIG_BT_LEDS)
 void hci_leds_update_powered(struct hci_dev *hdev, bool enabled);
 void hci_leds_init(struct hci_dev *hdev);
+void hci_leds_blink_oneshot(struct led_trigger *trig);
 #else
 static inline void hci_leds_update_powered(struct hci_dev *hdev,
   bool enabled) {}
 static inline void hci_leds_init(struct hci_dev *hdev) {}
+static inline void hci_leds_blink_oneshot(struct led_trigger *trig) {}
 #endif
-- 
1.9.1



[PATCH 2/2] arm64: dts: hikey: set bluetooth led trigger

2016-06-22 Thread Guodong Xu
Set the bluetooth LED trigger to hci0-rx, so the LED blinks when HCI
frames are received.

Signed-off-by: Guodong Xu 
---
 arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts 
b/arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts
index e92a30c..5032792 100644
--- a/arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts
+++ b/arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts
@@ -132,7 +132,7 @@
bt_active_led {
label = "bt_active";
gpios = <&gpio4 7 0>; /* <&gpio_bt_active_led>; */
-   linux,default-trigger = "hci0rx";
+   linux,default-trigger = "hci0-rx";
default-state = "off";
};
};
-- 
1.9.1



[PATCH] tipc: Use kmemdup instead of kmalloc and memcpy

2016-06-22 Thread Amitoj Kaur Chawla
Replace calls to kmalloc followed by a memcpy with a direct call to
kmemdup.

The Coccinelle semantic patch used to make this change is as follows:
@@
expression from,to,size,flag;
statement S;
@@

-  to = \(kmalloc\|kzalloc\)(size,flag);
+  to = kmemdup(from,size,flag);
   if (to==NULL || ...) S
-  memcpy(to, from, size);

Signed-off-by: Amitoj Kaur Chawla 
---
 net/tipc/server.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/tipc/server.c b/net/tipc/server.c
index 2446bfb..38a6f33 100644
--- a/net/tipc/server.c
+++ b/net/tipc/server.c
@@ -411,13 +411,12 @@ static struct outqueue_entry *tipc_alloc_entry(void 
*data, int len)
if (!entry)
return NULL;
 
-   buf = kmalloc(len, GFP_ATOMIC);
+   buf = kmemdup(data, len, GFP_ATOMIC);
if (!buf) {
kfree(entry);
return NULL;
}
 
-   memcpy(buf, data, len);
entry->iov.iov_base = buf;
entry->iov.iov_len = len;
 
-- 
1.9.1



Re: esp: Fix ESN generation under UDP encapsulation

2016-06-22 Thread Blair Steven
This change tests okay in my setup.

Thanks very much
-Blair

On 06/20/2016 10:59 PM, Steffen Klassert wrote:
> On Sat, Jun 18, 2016 at 01:03:36PM +0800, Herbert Xu wrote:
>> On Fri, Jun 17, 2016 at 12:24:29PM +0200, Steffen Klassert wrote:
>>> On Wed, Jun 15, 2016 at 12:44:54AM +, Blair Steven wrote:
 The restoration is happening - but being actioned at the wrong location.

 The destination IP address is being saved and restored, and the SPI
 being written directly after the destination IP address. From my
 understanding though, the ESN shuffling should have saved and restored
 the UDP source / dest ports + SPI.
>>> Yes, looks like we copy with a wrong offset if udp encapsulation
>>> is used, skb_transport_header() does not point to the esp header
>>> in this case. Ccing Herbert, he changed this part when switching
>>> to the new AEAD interface with
>>> commit 7021b2e1cddd ("esp4: Switch to new AEAD interface").
>> Thanks for catching this!
>>
>> I think rather than changing the transport header (which isn't
>> quite right because UDP still is the transport protocol), we can
>> just save the offset locally.  Something like this:
>>
>> ---8<---
>> Blair Steven noticed that ESN in conjunction with UDP encapsulation
>> is broken because we set the temporary ESP header to the wrong spot.
>>
>> This patch fixes this by first of all using the right spot, i.e.,
>> 4 bytes off the real ESP header, and then saving this information
>> so that after encryption we can restore it properly.
>>
>> Fixes: 7021b2e1cddd ("esp4: Switch to new AEAD interface")
>> Reported-by: Blair Steven 
>> Signed-off-by: Herbert Xu 
> Looks good.
> Blair could you please test this?
>
> Thanks!


Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support

2016-06-22 Thread Yuval Mintz
> Then again, if you're basically saying every HW-assisted offload on
> receive should be done under LRO flag, what would be the use case
> where a GRO-assisted offload would help?

> I.e., afaik LRO is superior to GRO in `brute force' -
> it creates better packed packets and utilizes memory better
> [with all the obvious cons such as inability for defragmentation].
> So if you'd have the choice of having an adpater perform 'classic'
> LRO aggregation or something that resembles a GRO packet,
> what would be the gain from doing the latter?

Just to relate to the bnx2x/qede differences in the current implementation -
when this GRO hw-offload was added to bnx2x, it already
supported classical LRO, and due to the above statement, whenever LRO
was set the driver aggregated incoming traffic as classic LRO.
I agree that in hindsight the lack of distinction between sw/hw GRO
was hurting us.

qede isn't implementing LRO, so we could easily mark this feature
under LRO there - but the question is: given that the adapter can support
LRO, if we're going to suffer from all the shortages that arise from
putting this feature under LRO, why should we bother?

You can argue that we might need a new feature bit for control
over such a feature; If we don't do that, is there any gain in all of this?


Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support

2016-06-22 Thread Yuval Mintz
>> Claiming that hardware assist GRO is not possible is a plain mantra.

> I have no issue claiming hardware assist GRO is possible.  My problem
> is saying that the GRO feature flag can be used to enable it.  I would
> argue that any packet aggregation at the device or driver level is LRO
> regardless of how close it comes to GRO feature wise.  GRO should only
> be occurring in the receive path after calling the appropriate GRO
> function.  Otherwise it becomes really difficult to work around any
> possible issues that the hardware assisted GRO introduces without
> paying a huge penalty.  We need to keep these feature flags separate.
> I thought that was the whole reason why we have the distinction
> between LRO and GRO in the first place.

Then again, if you're basically saying every HW-assisted offload on
receive should be done under LRO flag, what would be the use case
where a GRO-assisted offload would help?

I.e., afaik LRO is superior to GRO in `brute force' -
it creates better packed packets and utilizes memory better
[with all the obvious cons such as inability for defragmentation].
So if you'd have the choice of having an adapter perform 'classic'
LRO aggregation or something that resembles a GRO packet,
what would be the gain from doing the latter?


Re: [PATCH] Maxim/driver: Add driver for maxim ds26522

2016-06-22 Thread kbuild test robot
Hi,

[auto build test ERROR on net-next/master]
[also build test ERROR on next-20160622]
[cannot apply to v4.7-rc4]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Zhao-Qiang/Maxim-driver-Add-driver-for-maxim-ds26522/20160623-095007
config: i386-allmodconfig (attached as .config)
compiler: gcc-6 (Debian 6.1.1-1) 6.1.1 20160430
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All error/warnings (new ones prefixed by >>):

   In file included from include/linux/module.h:18:0,
from drivers/net/wan/slic_ds26522.c:15:
>> drivers/net/wan/slic_ds26522.c:39:20: error: expected ',' or ';' before 
>> 'DRV_DESC'
MODULE_DESCRIPTION(DRV_DESC);
   ^
   include/linux/moduleparam.h:23:26: note: in definition of macro 
'__MODULE_INFO'
  = __stringify(tag) "=" info
 ^~~~
>> include/linux/module.h:218:42: note: in expansion of macro 'MODULE_INFO'
#define MODULE_DESCRIPTION(_description) MODULE_INFO(description, 
_description)
 ^~~
>> drivers/net/wan/slic_ds26522.c:39:1: note: in expansion of macro 
>> 'MODULE_DESCRIPTION'
MODULE_DESCRIPTION(DRV_DESC);
^~

vim +39 drivers/net/wan/slic_ds26522.c

 9   * under  the terms of  the GNU General  Public License as published by 
the
10   * Free Software Foundation;  either version 2 of the  License, or (at 
your
11   * option) any later version.
12   */
13  
14  #include 
  > 15  #include 
16  #include 
17  #include 
18  #include 
19  #include 
20  #include 
21  #include 
22  #include 
23  #include 
24  #include 
25  #include 
26  #include 
27  #include "slic_ds26522.h"
28  
29  #define DRV_NAME "ds26522"
30  
31  #define SLIC_TRANS_LEN 1
32  #define SLIC_TWO_LEN 2
33  #define SLIC_THREE_LEN 3
34  
35  static struct spi_device *g_spi;
36  
37  MODULE_LICENSE("GPL");
38  MODULE_AUTHOR("Zhao Qiang");
  > 39  MODULE_DESCRIPTION(DRV_DESC);
40  
41  /* the read/write format of address is
42   * w/r|A13|A12|A11|A10|A9|A8|A7|A6|A5|A4|A3|A2|A1|A0|x

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




RE: [PATCH 2/4] soc: fsl: add GUTS driver for QorIQ platforms

2016-06-22 Thread Yangbo Lu
Hi Arnd,

Could you comment on these?
Thanks.


Best regards,
Yangbo Lu


> -Original Message-
> From: Scott Wood [mailto:o...@buserror.net]
> Sent: Saturday, June 11, 2016 9:51 AM
> To: Arnd Bergmann; linuxppc-...@lists.ozlabs.org
> Cc: Mark Rutland; Ulf Hansson; linux-ker...@vger.kernel.org; linux-
> i...@vger.kernel.org; linux-...@vger.kernel.org; Qiang Zhao; Russell King;
> Bhupesh Sharma; Joerg Roedel; Claudiu Manoil; devicet...@vger.kernel.org;
> Kumar Gala; Rob Herring; Santosh Shilimkar; linux-arm-
> ker...@lists.infradead.org; netdev@vger.kernel.org; linux-
> m...@vger.kernel.org; Xiaobo Xie; Yang-Leo Li; iommu@lists.linux-
> foundation.org; Yangbo Lu
> Subject: Re: [PATCH 2/4] soc: fsl: add GUTS driver for QorIQ platforms
> 
> On Thu, 2016-06-02 at 10:43 +0200, Arnd Bergmann wrote:
> > On Wednesday, June 1, 2016 8:47:22 PM CEST Scott Wood wrote:
> > > On Mon, 2016-05-30 at 15:15 +0200, Arnd Bergmann wrote:
> > > > diff --git a/drivers/soc/fsl/guts.c b/drivers/soc/fsl/guts.c new
> > > > file mode 100644 index ..2f30698f5bcf
> > > > --- /dev/null
> > > > +++ b/drivers/soc/fsl/guts.c
> > > > @@ -0,0 +1,130 @@
> > > > +/*
> > > > + * Freescale QorIQ Platforms GUTS Driver
> > > > + *
> > > > + * Copyright (C) 2016 Freescale Semiconductor, Inc.
> > > > + *
> > > > + * This program is free software; you can redistribute it and/or
> > > > +modify
> > > > + * it under the terms of the GNU General Public License as
> > > > +published by
> > > > + * the Free Software Foundation; either version 2 of the License,
> > > > +or
> > > > + * (at your option) any later version.
> > > > + */
> > > > +
> > > > +#include 
> > > > +#include  #include 
> > > > +#include  #include  #include
> > > > + #include 
> > > > +
> > > > +#define GUTS_PVR   0x0a0
> > > > +#define GUTS_SVR   0x0a4
> > > > +
> > > > +struct guts {
> > > > +   void __iomem *regs;
> > >
> > > We already have a struct to define guts.  Why are you not using it?
> > > Why do you consider using it to be "abuse"?  What if we want to move
> > > more guts functionality into this driver?
> >
> > This structure was in the original patch, I left it in there, only
> > removed the inclusion of the powerpc header file, which seemed to be
> > misplaced.
> 
> I'm not referring to "struct guts".  I'm referring to changing "struct
> ccsr_guts __iomem *regs" into "void __iomem *regs".
> 
> And it's not a powerpc header file.
> 
> > > > +/*
> > > > + * Table for matching compatible strings, for device tree
> > > > + * guts node, for Freescale QorIQ SOCs.
> > > > + */
> > > > +static const struct of_device_id fsl_guts_of_match[] = {
> > > > +   /* For T4 & B4 Series SOCs */
> > > > +   { .compatible = "fsl,qoriq-device-config-1.0", .data = "T4/B4
> > > > series" },
> > > [snip]
> > > > +   { .compatible = "fsl,qoriq-device-config-2.0", .data = "P
> > > > series"
> > >
> > > As noted in my comment on patch 3/4, these descriptions are reversed.
> > >
> > > They're also incomplete.  t2080 has device config 2.0.  t1040 is
> > > described as
> > > 2.0 though it should probably be 2.1 (or better, drop the generic
> > > compatible altogether).
> >
> > Ok. Ideally I think we'd even look up the specific SoC names from the
> > SVC rather than the compatible string. I just didn't have a good list
> > for those to put in the driver.
> 
> The list is in arch/powerpc/include/asm/mpc85xx.h but I don't know why we
> need to convert it to a string in the first place.
> 
> >
> > > > +   /*
> > > > +* syscon devices default to little-endian, but on powerpc we
> > > > have
> > > > +* existing device trees with big-endian maps and an absent
> > > > endianess
> > > > +* "big-property"
> > > > +*/
> > > > +   if (!IS_ENABLED(CONFIG_POWERPC) &&
> > > > +   !of_property_read_bool(dev->of_node, "big-endian"))
> > > > +   guts->little_endian = true;
> > >
> > > This is not a syscon device (Yangbo's patch to add a guts node on
> > > ls2080 is the only guts node that says "syscon", and that was a
> > > leftover from earlier revisions and should probably be removed).
> > > Even if it were, where is it documented that syscon defaults to
> > > little-endian?
> >
> > Documentation/devicetree/bindings/regmap/regmap.txt
> >
> > We had a little screwup here, basically regmap (and by consequence,
> > syscon) always defaulted to little-endian way before that was
> > documented, so it's too late to change it,
> 
> What causes a device node to fall under the jurisdiction of regmap.txt?
>  Again, these nodes do not claim "syscon" compatibility.
> 
> > although I agree it would have made sense to document regmap to
> > default to big-endian on powerpc.
> 
> Please don't.  It's enough of a mess as is; no need to start throwing in
> architecture ifdefs.
> 
> > > Documentation/devicetree/bindings/common-properties.txt says that
> > > the individual binding specifies the default.  The default for this
> > > node shoul

Re: [PATCH net-next v10 2/5] openvswitch: set skb protocol and mac_len when receiving on internal device

2016-06-22 Thread Simon Horman
On Tue, Jun 21, 2016 at 09:30:17AM -0700, pravin shelar wrote:
> On Mon, Jun 20, 2016 at 7:25 PM, Simon Horman
>  wrote:
> > [Cc Jiri Benc]
> >
> > On Sat, Jun 18, 2016 at 06:38:54PM -0700, pravin shelar wrote:
> >> On Thu, Jun 16, 2016 at 10:53 PM, Simon Horman
> >>  wrote:
> >> > On Tue, Jun 07, 2016 at 03:45:27PM -0700, pravin shelar wrote:
> >> >> On Mon, Jun 6, 2016 at 8:08 PM, Simon Horman 
> >> >>  wrote:
> >> >> > On Thu, Jun 02, 2016 at 03:01:47PM -0700, pravin shelar wrote:
> >> >> >> On Wed, Jun 1, 2016 at 11:24 PM, Simon Horman
> >> >> >>  wrote:
> >> >> >> > * Set skb protocol based on contents of packet. I have observed 
> >> >> >> > this is
> >> >> >> >   necessary to get actual protocol of a packet when it is injected 
> >> >> >> > into an
> >> >> >> >   internal device e.g. by libnet in which case skb protocol will 
> >> >> >> > be set to
> >> >> >> >   ETH_ALL.
> >> 
> >> 
> >> >> > eth_type = eth_type_trans(skb, skb->dev);
> >> >> > skb->mac_len = skb->data - skb_mac_header(skb);
> >> >> > __skb_push(skb, skb->mac_len);
> >> >> >
> >> >> > if (eth_type == htons(ETH_P_8021Q))
> >> >> > skb->mac_len += VLAN_HLEN;
> >> >> >
> >> >> > Perhaps that logic ought to be in a helper used by both 
> >> >> > internal_dev_xmit()
> >> >> > and netdev_port_receive(). Or somehow centralised in 
> >> >> > ovs_vport_receive().
> >> >>
> >> >> This does looks bit complex. Can we use other skb metadata like
> >> >> skb_mac_header_was_set()?
> >> >
> >> > Yes, I think that can be made to work if skb->mac_header is unset
> >> > for l3 packets in netdev_port_receive(). The following is an incremental
> >> > patch on the entire series. Is this the kind of thing you had in mind?
> >> >
> >> > diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
> >> > index 86f2cfb19de3..42587d5bf894 100644
> >> > --- a/net/openvswitch/flow.c
> >> > +++ b/net/openvswitch/flow.c
> >> > @@ -729,7 +729,7 @@ int ovs_flow_key_extract(const struct ip_tunnel_info 
> >> > *tun_info,
> >> > key->phy.skb_mark = skb->mark;
> >> > ovs_ct_fill_key(skb, key);
> >> > key->ovs_flow_hash = 0;
> >> > -   key->phy.is_layer3 = skb->mac_len == 0;
> >> > +   key->phy.is_layer3 = skb_mac_header_was_set(skb) == 0;
> >> > key->recirc_id = 0;
> >> >
> >> > err = key_extract(skb, key);
> >> > diff --git a/net/openvswitch/vport-internal_dev.c 
> >> > b/net/openvswitch/vport-internal_dev.c
> >> > index 484ba529c682..8973d4db509b 100644
> >> > --- a/net/openvswitch/vport-internal_dev.c
> >> > +++ b/net/openvswitch/vport-internal_dev.c
> >> > @@ -50,7 +50,6 @@ static int internal_dev_xmit(struct sk_buff *skb, 
> >> > struct net_device *netdev)
> >> >
> >> > skb->protocol = eth_type_trans(skb, netdev);
> >> > skb_push(skb, ETH_HLEN);
> >> > -   skb_reset_mac_len(skb);
> >> >
> >> > len = skb->len;
> >> > rcu_read_lock();
> >> > diff --git a/net/openvswitch/vport-netdev.c 
> >> > b/net/openvswitch/vport-netdev.c
> >> > index 3df36df62ee9..4cf3f12ffc99 100644
> >> > --- a/net/openvswitch/vport-netdev.c
> >> > +++ b/net/openvswitch/vport-netdev.c
> >> > @@ -60,22 +60,9 @@ static void netdev_port_receive(struct sk_buff *skb)
> >> > if (vport->dev->type == ARPHRD_ETHER) {
> >> > skb_push(skb, ETH_HLEN);
> >> > skb_postpush_rcsum(skb, skb->data, ETH_HLEN);
> >> > -   } else if (vport->dev->type == ARPHRD_NONE) {
> >> > -   if (skb->protocol == htons(ETH_P_TEB)) {
> >> > -   __be16 eth_type;
> >> > -
> >> > -   if (unlikely(skb->len < ETH_HLEN))
> >> > -   goto error;
> >> > -
> >> > -   eth_type = eth_type_trans(skb, skb->dev);
> >> > -   skb->mac_len = skb->data - skb_mac_header(skb);
> >> > -   __skb_push(skb, skb->mac_len);
> >> > -
> >> > -   if (eth_type == htons(ETH_P_8021Q))
> >> > -   skb->mac_len += VLAN_HLEN;
> >> > -   } else {
> >> > -   skb->mac_len = 0;
> >> > -   }
> >> > +   } else if (vport->dev->type == ARPHRD_NONE &&
> >> > +  skb->protocol != htons(ETH_P_TEB)) {
> >> > +   skb->mac_header = (typeof(skb->mac_header))~0U;
> >> > }
> >> >
> >> > ovs_vport_receive(vport, skb, skb_tunnel_info(skb));
> >>
> >> This certainly looks better. I was wondering if we can unset the mac
> >> header offset in L3 tunnel devices itself. So there is no need to have
> >> this check here.
> >
> > I think that might be possible for GRE by modifying the following in
> > __ipgre_rcv().
> >
> > if (tunnel->dev->type != ARPHRD_NONE)
> > skb_pop_mac_header(skb);
> > else
> > skb_reset_mac_header(skb);
> >
> > But I am unsure what s

Re: [PATCH V4 1/1] net: ethernet: Add TSE PCS support to dwmac-socfpga

2016-06-22 Thread Tien Hock Loh
Hi Peppe, 

On Wed, 2016-06-22 at 11:00 +0200, Giuseppe CAVALLARO wrote:
> Hello Tien Hock
> 
> On 6/21/2016 10:46 AM, th...@altera.com wrote:
> > From: Tien Hock Loh 
> >
> > This adds support for TSE PCS that uses SGMII adapter when the phy-mode of
> > the dwmac is set to sgmii
> >
> > Signed-off-by: Tien Hock Loh 
> 
> IIUC, you are keeping the two timers w/o looking.
> 
> Is there any motivation behind? I had understood you wanted
> to review it.

I've merged them into one timer, aneg_link_timer, with one timer callback
(which individually invokes auto_nego_timer_callback and
pcs_link_timer_callback) in the patch. Is that not what you were
expecting?

> 
> Let me know
> 
> Regards
> Peppe
> 
> >
> > ---
> > v2:
> > - Refactored the TSE PCS out from the dwmac-socfpga.c file
> > - Added binding documentation for TSE PCS sgmii adapter
> > v3:
> > - Added missing license header for new source files
> > - Updated tse_pcs.h include headers
> > - Standardize if statements
> > v4:
> > - Reset SGMII adapter on speed change
> > - Do not enable SGMII adapter if speed is not supported
> > - On init, if PCS reset fails, do not enable adapter
> > ---
> >  .../devicetree/bindings/net/socfpga-dwmac.txt  |  19 ++
> >  drivers/net/ethernet/stmicro/stmmac/Makefile   |   2 +-
> >  drivers/net/ethernet/stmicro/stmmac/altr_tse_pcs.c | 276 
> > +
> >  drivers/net/ethernet/stmicro/stmmac/altr_tse_pcs.h |  36 +++
> >  .../net/ethernet/stmicro/stmmac/dwmac-socfpga.c| 149 +--
> >  5 files changed, 460 insertions(+), 22 deletions(-)
> >  create mode 100644 drivers/net/ethernet/stmicro/stmmac/altr_tse_pcs.c
> >  create mode 100644 drivers/net/ethernet/stmicro/stmmac/altr_tse_pcs.h
> >
> > diff --git a/Documentation/devicetree/bindings/net/socfpga-dwmac.txt 
> > b/Documentation/devicetree/bindings/net/socfpga-dwmac.txt
> > index 72d82d6..dd10f2f 100644
> > --- a/Documentation/devicetree/bindings/net/socfpga-dwmac.txt
> > +++ b/Documentation/devicetree/bindings/net/socfpga-dwmac.txt
> > @@ -17,9 +17,26 @@ Required properties:
> >  Optional properties:
> >  altr,emac-splitter: Should be the phandle to the emac splitter soft IP 
> > node if
> > DWMAC controller is connected emac splitter.
> > +phy-mode: The phy mode the ethernet operates in
> > +altr,sgmii_to_sgmii_converter: phandle to the TSE SGMII converter
> > +
> > +This device node has additional phandle dependency, the sgmii converter:
> > +
> > +Required properties:
> > + - compatible  : Should be altr,gmii-to-sgmii-2.0
> > + - reg-names   : Should be "eth_tse_control_port"
> >
> >  Example:
> >
> > +gmii_to_sgmii_converter: phy@0x10240 {
> > +   compatible = "altr,gmii-to-sgmii-2.0";
> > +   reg = <0x0001 0x0240 0x0008>,
> > +   <0x0001 0x0200 0x0040>;
> > +   reg-names = "eth_tse_control_port";
> > +   clocks = <&sgmii_1_clk_0 &emac1 1 &sgmii_clk_125 &sgmii_clk_125>;
> > +   clock-names = "tse_pcs_ref_clk_clock_connection", "tse_rx_cdr_refclk";
> > +};
> > +
> >  gmac0: ethernet@ff70 {
> > compatible = "altr,socfpga-stmmac", "snps,dwmac-3.70a", "snps,dwmac";
> > altr,sysmgr-syscon = <&sysmgr 0x60 0>;
> > @@ -30,4 +47,6 @@ gmac0: ethernet@ff70 {
> > mac-address = [00 00 00 00 00 00];/* Filled in by U-Boot */
> > clocks = <&emac_0_clk>;
> > clock-names = "stmmaceth";
> > +   phy-mode = "sgmii";
> > +   altr,gmii-to-sgmii-converter = <&gmii_to_sgmii_converter>;
> >  };
> > diff --git a/drivers/net/ethernet/stmicro/stmmac/Makefile 
> > b/drivers/net/ethernet/stmicro/stmmac/Makefile
> > index 0fb362d..0ff76e8 100644
> > --- a/drivers/net/ethernet/stmicro/stmmac/Makefile
> > +++ b/drivers/net/ethernet/stmicro/stmmac/Makefile
> > @@ -11,7 +11,7 @@ obj-$(CONFIG_DWMAC_IPQ806X)   += dwmac-ipq806x.o
> >  obj-$(CONFIG_DWMAC_LPC18XX)+= dwmac-lpc18xx.o
> >  obj-$(CONFIG_DWMAC_MESON)  += dwmac-meson.o
> >  obj-$(CONFIG_DWMAC_ROCKCHIP)   += dwmac-rk.o
> > -obj-$(CONFIG_DWMAC_SOCFPGA)+= dwmac-socfpga.o
> > +obj-$(CONFIG_DWMAC_SOCFPGA)+= dwmac-socfpga.o altr_tse_pcs.o
> >  obj-$(CONFIG_DWMAC_STI)+= dwmac-sti.o
> >  obj-$(CONFIG_DWMAC_SUNXI)  += dwmac-sunxi.o
> >  obj-$(CONFIG_DWMAC_GENERIC)+= dwmac-generic.o
> > diff --git a/drivers/net/ethernet/stmicro/stmmac/altr_tse_pcs.c 
> > b/drivers/net/ethernet/stmicro/stmmac/altr_tse_pcs.c
> > new file mode 100644
> > index 000..40bfaac
> > --- /dev/null
> > +++ b/drivers/net/ethernet/stmicro/stmmac/altr_tse_pcs.c
> > @@ -0,0 +1,276 @@
> > +/* Copyright Altera Corporation (C) 2016. All rights reserved.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License, version 2,
> > + * as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + 

[PATCH] Maxim/driver: Add driver for maxim ds26522

2016-06-22 Thread Zhao Qiang
Signed-off-by: Zhao Qiang 
---
 drivers/net/wan/Kconfig|  10 ++
 drivers/net/wan/Makefile   |   1 +
 drivers/net/wan/slic_ds26522.c | 256 +
 drivers/net/wan/slic_ds26522.h | 134 +
 4 files changed, 401 insertions(+)
 create mode 100644 drivers/net/wan/slic_ds26522.c
 create mode 100644 drivers/net/wan/slic_ds26522.h

diff --git a/drivers/net/wan/Kconfig b/drivers/net/wan/Kconfig
index 9e314b7..bd3bf3f 100644
--- a/drivers/net/wan/Kconfig
+++ b/drivers/net/wan/Kconfig
@@ -291,6 +291,16 @@ config FSL_UCC_HDLC
  To compile this driver as a module, choose M here: the
  module will be called fsl_ucc_hdlc.
 
+config SLIC_DS26522
+   tristate "Slic Maxim ds26522 card support"
+   depends on SPI
+   help
+ This module initializes and configures the slic maxim card
+ in T1 or E1 mode.
+
+ To compile this driver as a module, choose M here: the
+ module will be called slic_ds26522.
+
 config DSCC4_PCISYNC
bool "Etinc PCISYNC features"
depends on DSCC4
diff --git a/drivers/net/wan/Makefile b/drivers/net/wan/Makefile
index 25fec40..73c2326 100644
--- a/drivers/net/wan/Makefile
+++ b/drivers/net/wan/Makefile
@@ -33,6 +33,7 @@ obj-$(CONFIG_PCI200SYN)   += pci200syn.o
 obj-$(CONFIG_PC300TOO) += pc300too.o
 obj-$(CONFIG_IXP4XX_HSS)   += ixp4xx_hss.o
 obj-$(CONFIG_FSL_UCC_HDLC) += fsl_ucc_hdlc.o
+obj-$(CONFIG_SLIC_DS26522) += slic_ds26522.o
 
 clean-files := wanxlfw.inc
 $(obj)/wanxl.o:$(obj)/wanxlfw.inc
diff --git a/drivers/net/wan/slic_ds26522.c b/drivers/net/wan/slic_ds26522.c
new file mode 100644
index 000..67fd8e7
--- /dev/null
+++ b/drivers/net/wan/slic_ds26522.c
@@ -0,0 +1,256 @@
+/*
+ * drivers/net/wan/slic_ds26522.c
+ *
+ * Copyright (C) 2016 Freescale Semiconductor, Inc.
+ *
+ * Author:Zhao Qiang
+ *
+ * This program is free software; you can redistribute  it and/or modify it
+ * under  the terms of  the GNU General  Public License as published by the
+ * Free Software Foundation;  either version 2 of the  License, or (at your
+ * option) any later version.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "slic_ds26522.h"
+
+#define DRV_NAME "ds26522"
+
+#define SLIC_TRANS_LEN 1
+#define SLIC_TWO_LEN 2
+#define SLIC_THREE_LEN 3
+
+static struct spi_device *g_spi;
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Zhao Qiang");
+MODULE_DESCRIPTION(DRV_DESC);
+
+/* the read/write format of address is
+ * w/r|A13|A12|A11|A10|A9|A8|A7|A6|A5|A4|A3|A2|A1|A0|x
+ */
+static void slic_write(struct spi_device *spi, u16 addr,
+  u8 data)
+{
+   u8 temp[3];
+
+   addr = bitrev16(addr) >> 1;
+   data = bitrev8(data);
+   temp[0] = (u8)((addr >> 8) & 0x7f);
+   temp[1] = (u8)(addr & 0xfe);
+   temp[2] = data;
+
+   /* write spi addr and value */
+   spi_write(spi, &temp[0], SLIC_THREE_LEN);
+}
+
+static u8 slic_read(struct spi_device *spi, u16 addr)
+{
+   u8 temp[2];
+   u8 data;
+
+   addr = bitrev16(addr) >> 1;
+   temp[0] = (u8)(((addr >> 8) & 0x7f) | 0x80);
+   temp[1] = (u8)(addr & 0xfe);
+
+   spi_write_then_read(spi, &temp[0], SLIC_TWO_LEN, &data,
+   SLIC_TRANS_LEN);
+
+   data = bitrev8(data);
+   return data;
+}
+
+static bool get_slic_product_code(struct spi_device *spi)
+{
+   u8 device_id;
+
+   device_id = slic_read(spi, DS26522_IDR_ADDR);
+   if ((device_id & 0xf8) == 0x68)
+   return true;
+   else
+   return false;
+}
+
+static void ds26522_e1_spec_config(struct spi_device *spi)
+{
+   /* Receive E1 Mode, Framer Disabled */
+   slic_write(spi, DS26522_RMMR_ADDR, DS26522_RMMR_E1);
+
+   /* Transmit E1 Mode, Framer Disable */
+   slic_write(spi, DS26522_TMMR_ADDR, DS26522_TMMR_E1);
+
+   /* Receive E1 Mode Framer Enable */
+   slic_write(spi, DS26522_RMMR_ADDR,
+  slic_read(spi, DS26522_RMMR_ADDR) | DS26522_RMMR_FRM_EN);
+
+   /* Transmit E1 Mode Framer Enable */
+   slic_write(spi, DS26522_TMMR_ADDR,
+  slic_read(spi, DS26522_TMMR_ADDR) | DS26522_TMMR_FRM_EN);
+
+   /* RCR1, receive E1 B8zs & ESF */
+   slic_write(spi, DS26522_RCR1_ADDR,
+  DS26522_RCR1_E1_HDB3 | DS26522_RCR1_E1_CCS);
+
+   /* RSYSCLK=2.048MHz, RSYNC-Output */
+   slic_write(spi, DS26522_RIOCR_ADDR,
+  DS26522_RIOCR_2048KHZ | DS26522_RIOCR_RSIO_OUT);
+
+   /* TCR1 Transmit E1 b8zs */
+   slic_write(spi, DS26522_TCR1_ADDR, DS26522_TCR1_TB8ZS);
+
+   /* TSYSCLK=2.048MHz, TSYNC-Output */
+   slic_write(spi, DS26522_TIOCR_ADDR,
+  DS26522_TIOCR_2048KHZ | DS26522_TIOCR_TSIO_OUT);
+
+   /* Set E1TAF */
+   slic_write(spi, DS26522_E1TAF_ADDR, DS26522_E1TAF_DEFAULT);
+
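The bit-reversed SPI address encoding used by slic_write()/slic_read() above (the w/r|A13..A0|x wire format) can be modeled in a few lines of Python. This is an illustrative sketch of the byte layout only, not driver code; bitrev() stands in for the kernel's bitrev16()/bitrev8() helpers, and slic_addr_bytes() is a hypothetical name:

```python
def bitrev(value, width):
    # Reverse the low `width` bits of `value`, like the kernel's
    # bitrev16()/bitrev8() helpers.
    out = 0
    for i in range(width):
        if value & (1 << i):
            out |= 1 << (width - 1 - i)
    return out

def slic_addr_bytes(addr, read=False):
    # Build the two SPI header bytes for a DS26522 register access.
    # Wire format: w/r | A13..A0 | x -- the driver sends the address
    # MSB-first by bit-reversing the 16-bit value and shifting right.
    a = bitrev(addr, 16) >> 1
    b0 = (a >> 8) & 0x7f          # top bit clear = write
    if read:
        b0 |= 0x80                # top bit set = read
    b1 = a & 0xfe                 # low bit is the don't-care 'x'
    return bytes([b0, b1])
```

For example, register address 0x0001 encodes to header bytes 0x40 0x00 for a write, and a read of the same register sets the top bit (0xC0 0x00), matching the `| 0x80` in slic_read().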

Re: [PATCH v4 net] ipvs: fix bind to link-local mcast IPv6 address in backup

2016-06-22 Thread Simon Horman
On Fri, Jun 17, 2016 at 09:42:49AM +0300, Julian Anastasov wrote:
> 
>   Hello,
> 
> On Thu, 16 Jun 2016, Quentin Armitage wrote:
> 
> > When using HEAD from
> > https://git.kernel.org/cgit/utils/kernel/ipvsadm/ipvsadm.git/,
> > the command:
> > ipvsadm --start-daemon backup --mcast-interface eth0.60 \
> > --mcast-group ff02::1:81
> > fails with the error message:
> > Argument list too long
> > 
> > whereas both:
> > ipvsadm --start-daemon master --mcast-interface eth0.60 \
> > --mcast-group ff02::1:81
> > and:
> > ipvsadm --start-daemon backup --mcast-interface eth0.60 \
> > --mcast-group 224.0.0.81
> > are successful.
> > 
> > The error message "Argument list too long" isn't helpful. The error occurs
> > because an IPv6 address is given in backup mode.
> > 
> > The error is in make_receive_sock() in net/netfilter/ipvs/ip_vs_sync.c,
> > since it fails to set the interface on the address or the socket before
> > calling inet6_bind() (via sock->ops->bind), where the test
> > 'if (!sk->sk_bound_dev_if)' failed.
> > 
> > Setting sock->sk->sk_bound_dev_if on the socket before calling
> > inet6_bind() resolves the issue.
> > 
> > Fixes: d33288172e72 ("ipvs: add more mcast parameters for the sync daemon")
> > Signed-off-by: Quentin Armitage 
> 
>   Looks good to me, thanks!
> 
> Acked-by: Julian Anastasov 
> 
>   Simon, please apply to ipvs tree. Patch compiles
> also on stable 4.4.13, 4.5.7 and 4.6.2, so no need for
> special versions. The ack is also for the other 3 patches
> from v4 (for ipvs-next) but they depend on this patch.

Thanks, done.
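The failure mode fixed above is the general rule that binding to an IPv6 link-local (multicast) address needs an interface scope. A minimal userspace sketch of the same requirement, with error handling elided and the function name purely illustrative:

```c
/* Sketch: binding a UDP socket to an IPv6 link-local multicast address
 * requires an interface scope -- via sin6_scope_id here, or by binding
 * the socket to a device as the ipvs fix does with sk_bound_dev_if.
 * Without it, inet6_bind() rejects the address.
 */
#include <arpa/inet.h>
#include <net/if.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

int bind_ll_mcast(const char *ifname, const char *group, int port)
{
	struct sockaddr_in6 sa;
	int fd = socket(AF_INET6, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;
	memset(&sa, 0, sizeof(sa));
	sa.sin6_family = AF_INET6;
	sa.sin6_port = htons(port);
	inet_pton(AF_INET6, group, &sa.sin6_addr);
	/* The crucial step: scope the link-local address to an interface. */
	sa.sin6_scope_id = if_nametoindex(ifname);
	if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
		return -1;
	return fd;
}
```

Omitting the sin6_scope_id assignment reproduces the bind failure that ipvsadm surfaced as "Argument list too long" (E2BIG-style errno leaking to userspace).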


[PATCH v2] net/mlx5: use mlx5_buf_alloc_node instead of mlx5_buf_alloc in mlx5_wq_ll_create

2016-06-22 Thread Wang Sheng-Hui
Fixes: 311c7c71c9bb ("net/mlx5e: Allocate DMA coherent memory on reader NUMA node")

Commit 311c7c71c9bb ("net/mlx5e: Allocate DMA coherent memory on
reader NUMA node") introduced mlx5_*_alloc_node() but missed converting
some call sites and warning messages. This patch makes two changes:
* Use mlx5_buf_alloc_node() instead of mlx5_buf_alloc() in
  mlx5_wq_ll_create()
* Update the failure warning messages to use the _node suffix for the
  mlx5_*_alloc function names

Changes since V1:
* Add Fixes line in commit log

Signed-off-by: Wang Sheng-Hui 
---
 drivers/net/ethernet/mellanox/mlx5/core/wq.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/wq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/wq.c
index ce21ee5..821a087 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/wq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/wq.c
@@ -75,14 +75,14 @@ int mlx5_wq_cyc_create(struct mlx5_core_dev *mdev, struct 
mlx5_wq_param *param,
 
err = mlx5_db_alloc_node(mdev, &wq_ctrl->db, param->db_numa_node);
if (err) {
-   mlx5_core_warn(mdev, "mlx5_db_alloc() failed, %d\n", err);
+   mlx5_core_warn(mdev, "mlx5_db_alloc_node() failed, %d\n", err);
return err;
}
 
err = mlx5_buf_alloc_node(mdev, mlx5_wq_cyc_get_byte_size(wq),
  &wq_ctrl->buf, param->buf_numa_node);
if (err) {
-   mlx5_core_warn(mdev, "mlx5_buf_alloc() failed, %d\n", err);
+   mlx5_core_warn(mdev, "mlx5_buf_alloc_node() failed, %d\n", err);
goto err_db_free;
}
 
@@ -111,14 +111,14 @@ int mlx5_cqwq_create(struct mlx5_core_dev *mdev, struct 
mlx5_wq_param *param,
 
err = mlx5_db_alloc_node(mdev, &wq_ctrl->db, param->db_numa_node);
if (err) {
-   mlx5_core_warn(mdev, "mlx5_db_alloc() failed, %d\n", err);
+   mlx5_core_warn(mdev, "mlx5_db_alloc_node() failed, %d\n", err);
return err;
}
 
err = mlx5_buf_alloc_node(mdev, mlx5_cqwq_get_byte_size(wq),
  &wq_ctrl->buf, param->buf_numa_node);
if (err) {
-   mlx5_core_warn(mdev, "mlx5_buf_alloc() failed, %d\n", err);
+   mlx5_core_warn(mdev, "mlx5_buf_alloc_node() failed, %d\n", err);
goto err_db_free;
}
 
@@ -148,13 +148,14 @@ int mlx5_wq_ll_create(struct mlx5_core_dev *mdev, struct 
mlx5_wq_param *param,
 
err = mlx5_db_alloc_node(mdev, &wq_ctrl->db, param->db_numa_node);
if (err) {
-   mlx5_core_warn(mdev, "mlx5_db_alloc() failed, %d\n", err);
+   mlx5_core_warn(mdev, "mlx5_db_alloc_node() failed, %d\n", err);
return err;
}
 
-   err = mlx5_buf_alloc(mdev, mlx5_wq_ll_get_byte_size(wq), &wq_ctrl->buf);
+   err = mlx5_buf_alloc_node(mdev, mlx5_wq_ll_get_byte_size(wq),
+ &wq_ctrl->buf, param->buf_numa_node);
if (err) {
-   mlx5_core_warn(mdev, "mlx5_buf_alloc() failed, %d\n", err);
+   mlx5_core_warn(mdev, "mlx5_buf_alloc_node() failed, %d\n", err);
goto err_db_free;
}
 
-- 
2.7.4



Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support

2016-06-22 Thread Rick Jones

On 06/22/2016 04:10 PM, Rick Jones wrote:

My systems are presently in the midst of an install but I should be able
to demonstrate it in the morning (US Pacific time, modulo the shuttle
service of a car repair place)


The installs finished sooner than I thought.  So, receiver:


root@np-cp1-comp0001-mgmt:/home/stack# uname -a
Linux np-cp1-comp0001-mgmt 4.4.11-2-amd64-hpelinux #hpelinux1 SMP Mon 
May 23 15:39:22 UTC 2016 x86_64 GNU/Linux

root@np-cp1-comp0001-mgmt:/home/stack# ethtool -i hed2
driver: bnx2x
version: 1.712.30-0
firmware-version: bc 7.10.10
bus-info: :05:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

the hed2 interface is a port of an HPE 630M NIC, based on the BCM57840:

05:00.0 Ethernet controller: Broadcom Corporation BCM57840 NetXtreme II 
10/20-Gigabit Ethernet (rev 11)

Subsystem: Hewlett-Packard Company HP FlexFabric 20Gb 2-port 630M 
Adapter

(The pci.ids entry being from before that 10 GbE IP was purchased from 
Broadcom by QLogic...)


Verify that LRO is disabled (IIRC it is enabled by default):

root@np-cp1-comp0001-mgmt:/home/stack# ethtool -k hed2 | grep large
large-receive-offload: off

Verify that disable_tpa is not set:

root@np-cp1-comp0001-mgmt:/home/stack# cat 
/sys/module/bnx2x/parameters/disable_tpa
0

So this means we will see NIC-firmware GRO.

Start a tcpdump on the receiver:
root@np-cp1-comp0001-mgmt:/home/stack# tcpdump -s 96 -c 200 -i hed2 
-w foo.pcap port 12867
tcpdump: listening on hed2, link-type EN10MB (Ethernet), capture size 96 
bytes


Start a netperf test targeting that system, specifying a smaller MSS:

stack@np-cp1-comp0002-mgmt:~$ ./netperf -H np-cp1-comp0001-guest -- -G 
1400 -P 12867 -O throughput,transport_mss
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to 
np-cp1-comp0001-guest () port 12867 AF_INET : demo

Throughput Transport
   MSS
   bytes

3372.82    1388

Come back to the receiver and post-process the tcpdump capture to get 
the average segment size for the data segments:


200 packets captured
2000916 packets received by filter
0 packets dropped by kernel
root@np-cp1-comp0001-mgmt:/home/stack# tcpdump -n -r foo.pcap | fgrep -v 
"length 0" | awk '{sum += $NF}END{print "Average:",sum/NR}'

reading from file foo.pcap, link-type EN10MB (Ethernet)
Average: 2741.93
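The awk post-processing above can also be expressed as a short Python helper; this is an equivalent sketch (not what was actually run), assuming each data line ends with tcpdump's trailing "length N" field:

```python
def average_segment_size(lines):
    # Average the trailing "length N" field over data segments,
    # skipping pure ACKs -- equivalent to:
    #   fgrep -v "length 0" | awk '{sum += $NF} END {print sum/NR}'
    sizes = [int(line.rsplit("length", 1)[1])
             for line in lines
             if line.strip() and not line.rstrip().endswith("length 0")]
    return sum(sizes) / len(sizes) if sizes else 0.0
```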

and finally a snippet of the capture:

00:37:47.333414 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [S], seq 
1236484791, win 28000, options [mss 1400,sackOK,TS val 1491134 ecr 
0,nop,wscale 7], length 0
00:37:47.333488 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [S.], 
seq 134167501, ack 1236484792, win 28960, options [mss 1460,sackOK,TS 
val 1499053 ecr 1491134,nop,wscale 7], length 0
00:37:47.333731 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], ack 
1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], length 0
00:37:47.333788 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
1:2777, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], 
length 2776
00:37:47.333815 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 
2777, win 270, options [nop,nop,TS val 1499053 ecr 1491134], length 0
00:37:47.333822 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
2777:5553, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], 
length 2776
00:37:47.333837 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 
5553, win 313, options [nop,nop,TS val 1499053 ecr 1491134], length 0
00:37:47.333842 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
5553:8329, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], 
length 2776
00:37:47.333856 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
8329:11105, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 
1499053], length 2776
00:37:47.333869 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 
8329, win 357, options [nop,nop,TS val 1499053 ecr 1491134], length 0
00:37:47.333879 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
11105:13881, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 
1499053], length 2776
00:37:47.333891 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 
11105, win 400, options [nop,nop,TS val 1499053 ecr 1491134], length 0
00:37:47.333911 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 
13881, win 444, options [nop,nop,TS val 1499053 ecr 1491134], length 0
00:37:47.333964 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
13881:16657, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 
1499053], length 2776
00:37:47.333982 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
16657:19433, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 
1499053], length 2776
00:37:47.333989 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
19433:22209, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 
1499053], length 2776
00:37:47.333994 IP 192.168.2.8.12867 > 192.16

Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support

2016-06-22 Thread Alexander Duyck
On Wed, Jun 22, 2016 at 4:52 PM, Rick Jones  wrote:
> On 06/22/2016 03:56 PM, Alexander Duyck wrote:
>>
>> On Wed, Jun 22, 2016 at 3:47 PM, Eric Dumazet 
>> wrote:
>>>
>>> On Wed, 2016-06-22 at 14:52 -0700, Rick Jones wrote:

 Had the bnx2x-driven NICs' firmware not had that rather unfortunate
 assumption about MSSes I probably would never have noticed.
>
>
>
>> It could be that you and Rick are running different firmware. I
>> believe you can expose that via "ethtool -i".  This is the ugly bit
>> about all this.  We are offloading GRO into the firmware of these
>> devices with no idea how any of it works and by linking GRO to LRO on
>> the same device you are stuck having to accept either the firmware
>> offload or nothing at all.  That is kind of the point Rick was trying
>> to get at.
>
>
> I think you are typing a bit too far ahead into my keyboard with that last
> sentence.  And I may not have been sufficiently complete in what I wrote.
> If the bnx2x-driven NICs' firmware had been coalescing more than two
> segments together, not only would I probably not have noticed, I probably
> would not have been upset to learn it was NIC-firmware GRO rather than
> stack.
>
> My complaint is the specific bug of coalescing only two segments when their
> size is unexpected, and the difficulty present in disabling the bnx2x-driven
> NICs' firmware GRO.  I don't have a problem necessarily with the existence
> of NIC-firmware GRO in general.  I just want to be able to enable/disable it
> easily.

Right.  Which is why I thought we were supposed to be using the LRO
flag when a NIC is performing the GRO instead of it being done by the
kernel.

I have no problem with us doing hardware GRO.  The only issue I have
is with us identifying it as GRO.

In my opinion the way I would have preferred to have seen the bnx2x
handle all this is to make it that the NETIF_F_LRO flag controlled if
the NIC performed aggregation or not and the module parameter
determine if the LRO conforms to the GRO type layout for the frame.

As it is now I can easily see this causing issues if someone updates
how GRO/GSO handles frames without realizing they now need to make
sure the bnx2x and qede drivers need to generate the same frame layout
as well.

- Alex


Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support

2016-06-22 Thread Alexander Duyck
On Wed, Jun 22, 2016 at 4:31 PM, Eric Dumazet  wrote:
> On Wed, 2016-06-22 at 15:56 -0700, Alexander Duyck wrote:
>
>> It could be that you and Rick are running different firmware. I
>> believe you can expose that via "ethtool -i".  This is the ugly bit
>> about all this.  We are offloading GRO into the firmware of these
>> devices with no idea how any of it works and by linking GRO to LRO on
>> the same device you are stuck having to accept either the firmware
>> offload or nothing at all.  That is kind of the point Rick was trying
>> to get at.
>
>
> Well, we offload TSO to the NIC. Or checksums as much as we can.
>
> At one point we have to trust the NIC we plug in our hosts, right ?
>
> Why RX is so different than TX exactly ?

Exactly my point.  The problem I have with the bnx2x and qede drivers
is that the NIC driver is using the GRO flag to enable LRO on the
device. It is like arguing that the NIC can do segmentation based on
the GSO flag instead of the TSO flag.  We should be keeping TSO/LRO as
the device features while GSO/GRO are kernel features that don't
directly change any settings on the device.

> Yes, LRO is hard to implement properly compared to TSO,
> but not something that is _fundamentally_ broken.

Yes and no.  More often than not there are issues that pop up as
protocols change.  I'm not assuming they are fundamentally broken.
The bit that I consider broken is that I cannot disable the NIC LRO
functionality without having to disable the stack from doing GRO.  All
I am asking for is LRO controls all of the NIC aggregation features
and GRO be kept a software only feature instead of being tied to a
hardware feature.

> Claiming that hardware assist GRO is not possible is a plain mantra.

I have no issue claiming hardware assist GRO is possible.  My problem
is saying that the GRO feature flag can be used to enable it.  I would
argue that any packet aggregation at the device or driver level is LRO
regardless of how close it comes to GRO feature wise.  GRO should only
be occurring in the receive path after calling the appropriate GRO
function.  Otherwise it becomes really difficult to work around any
possible issues that the hardware assisted GRO introduces without
paying a huge penalty.  We need to keep these feature flags separate.
I thought that was the whole reason why we have the distinction
between LRO and GRO in the first place.

- Alex


Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support

2016-06-22 Thread Tom Herbert
On Wed, Jun 22, 2016 at 4:31 PM, Eric Dumazet  wrote:
> On Wed, 2016-06-22 at 15:56 -0700, Alexander Duyck wrote:
>
>> It could be that you and Rick are running different firmware. I
>> believe you can expose that via "ethtool -i".  This is the ugly bit
>> about all this.  We are offloading GRO into the firmware of these
>> devices with no idea how any of it works and by linking GRO to LRO on
>> the same device you are stuck having to accept either the firmware
>> offload or nothing at all.  That is kind of the point Rick was trying
>> to get at.
>
>
> Well, we offload TSO to the NIC. Or checksums as much as we can.
>
> At one point we have to trust the NIC we plug in our hosts, right ?
>
> Why RX is so different than TX exactly ?
>
LRO requires parsing the packet and hence DPI for things like foo/UDP.
TSO is much more straightforward and does not rely on parsing (e.g.
LRO would require NIC to parse EH, TSO doesn't). Also, TSO is
work-conserving, whereas LRO is not. I don't believe there is any
standard for how NICs are supposed to determine when to deliver packets
to the host. All of this makes LRO much less deterministic and means
there's a lot of variance between NIC implementations. Plus there's
also the question of whether a vendor is testing IPv6. I do think all
this can be "fixed" once we have programmable NICs and we can program
our own implementation of LRO, but until then I think there is
inherent risk in using it. GRO rocks though!

Tom

> Yes, LRO is hard to implement properly compared to TSO,
> but not something that is _fundamentally_ broken.
>
> Claiming that hardware assist GRO is not possible is a plain mantra.
>
>


Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support

2016-06-22 Thread Rick Jones

On 06/22/2016 03:56 PM, Alexander Duyck wrote:

On Wed, Jun 22, 2016 at 3:47 PM, Eric Dumazet  wrote:

On Wed, 2016-06-22 at 14:52 -0700, Rick Jones wrote:

Had the bnx2x-driven NICs' firmware not had that rather unfortunate
assumption about MSSes I probably would never have noticed.




It could be that you and Rick are running different firmware. I
believe you can expose that via "ethtool -i".  This is the ugly bit
about all this.  We are offloading GRO into the firmware of these
devices with no idea how any of it works and by linking GRO to LRO on
the same device you are stuck having to accept either the firmware
offload or nothing at all.  That is kind of the point Rick was trying
to get at.


I think you are typing a bit too far ahead into my keyboard with that 
last sentence.  And I may not have been sufficiently complete in what I 
wrote.  If the bnx2x-driven NICs' firmware had been coalescing more than 
two segments together, not only would I probably not have noticed, I 
probably would not have been upset to learn it was NIC-firmware GRO 
rather than stack.


My complaint is the specific bug of coalescing only two segments when 
their size is unexpected, and the difficulty present in disabling the 
bnx2x-driven NICs' firmware GRO.  I don't have a problem necessarily 
with the existence of NIC-firmware GRO in general.  I just want to be 
able to enable/disable it easily.


rick jones

Of course, what I really want are much, Much, MUCH larger MTUs.  It 
isn't for nothing that I used to refer to TSO as "Poor man's Jumbo 
Frames" :)


[no subject]

2016-06-22 Thread Mrs Alice Walton
my name is Mrs. Alice Walton, a business woman an America Citizen and the 
heiress to the fortune of Walmart stores, born October 7, 1949. I have a 
proposal for you


Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support

2016-06-22 Thread Eric Dumazet
On Wed, 2016-06-22 at 14:32 -0700, Alexander Duyck wrote:

> The idea behind GRO was to make it so that we had a generic way to
> handle this in software.  For the most part drivers doing LRO in
> software were doing the same thing that the GRO was doing.  The only
> reason it was deprecated is because GRO was capable of doing more than
> LRO could since we add one parser and suddenly all devices saw the
> benefit instead of just one specific device.  It is best to keep those
> two distinct solutions and then let the user sort out if they want to
> have the aggregation done by the device or the kernel.

Presumably we could add feature flags to selectively enable parts of LRO
(really just GRO offloading) for NICs that partially match GRO
requirements.

Patch 5/5 seems to enable the hardware feature(s) :

+   p_ramrod->tpa_param.tpa_ipv4_tunn_en_flg = 1;
+   p_ramrod->tpa_param.tpa_ipv6_tunn_en_flg = 1;

So this NIC seems to have a way to control its LRO engine.

If some horrible bug happens for GRE+IPv6+TCP, you could disable LRO
coping with GRE encapsulation, instead of disabling whole LRO

We have software fallback. Nice, with quite heavy cpu cost.

If we can use offload without breaking rules, use it.

Some NICs have terrible LRO performance (I won't give details here),
but others are ok.

Some NICs have terrible TSO performance for small numbers of segments
(gso_segs < 4).




Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support

2016-06-22 Thread Eric Dumazet
On Wed, 2016-06-22 at 15:56 -0700, Alexander Duyck wrote:

> It could be that you and Rick are running different firmware. I
> believe you can expose that via "ethtool -i".  This is the ugly bit
> about all this.  We are offloading GRO into the firmware of these
> devices with no idea how any of it works and by linking GRO to LRO on
> the same device you are stuck having to accept either the firmware
> offload or nothing at all.  That is kind of the point Rick was trying
> to get at.


Well, we offload TSO to the NIC. Or checksums as much as we can.

At one point we have to trust the NIC we plug in our hosts, right ?

Why RX is so different than TX exactly ?

Yes, LRO is hard to implement properly compared to TSO,
but not something that is _fundamentally_ broken.

Claiming that hardware assist GRO is not possible is a plain mantra.




Re: r8169 regression: UDP packets dropped intermittantly

2016-06-22 Thread Francois Romieu
Jonathan Woithe  :
[...]
> to mainline (in which case I'll keep watching out for it)?  Or is the
> out-of-tree workaround mentioned above considered to be the long term
> fix for those who encounter the problem?

It's a workaround. Nothing less, nothing more.

IIRC the ga311 irq signaling was a bit special. I almost surely broke
it at some point.

-- 
Ueimor


Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support

2016-06-22 Thread Rick Jones

On 06/22/2016 03:47 PM, Eric Dumazet wrote:

On Wed, 2016-06-22 at 14:52 -0700, Rick Jones wrote:

On 06/22/2016 11:22 AM, Yuval Mintz wrote:

But seriously, this isn't really anything new but rather a step forward in
the direction we've already taken - bnx2x/qede are already performing
the same for non-encapsulated TCP.


Since you mention bnx2x...   I would argue that the NIC firmware on
those NICs driven by bnx2x is doing it badly.  Not so much from a
functional standpoint I suppose, but from a performance one.  The
NIC-firmware GRO done there has this rather unfortunate assumption about
"all MSSes will be directly driven by my own physical MTU" and when it
sees segments of a size other than would be suggested by the physical
MTU, will coalesce only two segments together.  They then do not get
further coalesced in the stack.

Suffice it to say this does not do well from a performance standpoint.

One can disable LRO via ethtool for these NICs, but what that does is
disable old-school LRO, not GRO-in-the-NIC.  To get that disabled, one
must also get the bnx2x module loaded with "disable-tpa=1" so the Linux
stack GRO gets used instead.

Had the bnx2x-driven NICs' firmware not had that rather unfortunate
assumption about MSSes I probably would never have noticed.


I do not see this behavior on my bnx2x nics ?

ip ro add 10.246.11.52 via 10.246.11.254 dev eth0 mtu 1000
lpk51:~# ./netperf -H 10.246.11.52 -l 1000
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
10.246.11.52 () port 0 AF_INET


I first saw this with VMs which themselves had 1400 byte MTUs on their 
vNICs, speaking through bnx2x-driven NICs with a 1500 byte MTU, but I did 
later reproduce it by tweaking the MTU of my sending side NIC to 
something like 1400 bytes and running a "bare iron" netperf.  I believe 
you may be able to achieve the same thing by having netperf set a 
smaller MSS via the test-specific -G option.


My systems are presently in the midst of an install but I should be able 
to demonstrate it in the morning (US Pacific time, modulo the shuttle 
service of a car repair place)



On receiver :


Paranoid question, but is LRO disabled on the receiver?  I don't know 
that LRO exhibits the behaviour, just GRO-in-the-NIC.


rick



15:46:08.296241 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
ack 303360, win 8192, options [nop,nop,TS val 1245217243 ecr
1245306446], length 0
15:46:08.296430 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.],
seq 303360:327060, ack 1, win 229, options [nop,nop,TS val 1245306446
ecr 1245217242], length 23700
15:46:08.296441 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
ack 327060, win 8192, options [nop,nop,TS val 1245217243 ecr
1245306446], length 0
15:46:08.296644 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.],
seq 327060:350760, ack 1, win 229, options [nop,nop,TS val 1245306446
ecr 1245217242], length 23700
15:46:08.296655 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
ack 350760, win 8192, options [nop,nop,TS val 1245217244 ecr
1245306446], length 0
15:46:08.296854 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.],
seq 350760:374460, ack 1, win 229, options [nop,nop,TS val 1245306446
ecr 1245217242], length 23700
15:46:08.296897 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
ack 374460, win 8192, options [nop,nop,TS val 1245217244 ecr
1245306446], length 0
15:46:08.297054 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.],
seq 374460:398160, ack 1, win 229, options [nop,nop,TS val 1245306446
ecr 1245217242], length 23700
15:46:08.297099 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
ack 398160, win 8192, options [nop,nop,TS val 1245217244 ecr
1245306446], length 0
15:46:08.297258 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.],
seq 398160:420912, ack 1, win 229, options [nop,nop,TS val 1245306446
ecr 1245217242], length 22752
15:46:08.297301 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
ack 420912, win 8192, options [nop,nop,TS val 1245217244 ecr
1245306446], length 0





Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support

2016-06-22 Thread Alexander Duyck
On Wed, Jun 22, 2016 at 3:47 PM, Eric Dumazet  wrote:
> On Wed, 2016-06-22 at 14:52 -0700, Rick Jones wrote:
>> On 06/22/2016 11:22 AM, Yuval Mintz wrote:
>> > But seriously, this isn't really anything new but rather a step forward in
>> > the direction we've already taken - bnx2x/qede are already performing
>> > the same for non-encapsulated TCP.
>>
>> Since you mention bnx2x...   I would argue that the NIC firmware on
>> those NICs driven by bnx2x is doing it badly.  Not so much from a
>> functional standpoint I suppose, but from a performance one.  The
>> NIC-firmware GRO done there has this rather unfortunate assumption about
>> "all MSSes will be directly driven by my own physical MTU" and when it
>> sees segments of a size other than would be suggested by the physical
>> MTU, will coalesce only two segments together.  They then do not get
>> further coalesced in the stack.
>>
>> Suffice it to say this does not do well from a performance standpoint.
>>
>> One can disable LRO via ethtool for these NICs, but what that does is
>> disable old-school LRO, not GRO-in-the-NIC.  To get that disabled, one
>> must also get the bnx2x module loaded with "disable-tpa=1" so the Linux
>> stack GRO gets used instead.
>>
>> Had the bnx2x-driven NICs' firmware not had that rather unfortunate
>> assumption about MSSes I probably would never have noticed.
>
> I do not see this behavior on my bnx2x nics ?

It could be that you and Rick are running different firmware. I
believe you can expose that via "ethtool -i".  This is the ugly bit
about all this.  We are offloading GRO into the firmware of these
devices with no idea how any of it works and by linking GRO to LRO on
the same device you are stuck having to accept either the firmware
offload or nothing at all.  That is kind of the point Rick was trying
to get at.

The preferred setup would be to have any aggregation offload on the
device be flagged as LRO and then the offload performed by the kernel
be GRO.  By linking the two features as has apparently happened with
bnx2x and qede, it limits the user's options and kind of forces them
into an all-or-nothing situation unless they have insight into
proprietary driver options to disable these features.

- Alex


Re: [PATCH net-next 0/8] tou: Transports over UDP - part I

2016-06-22 Thread Tom Herbert
On Wed, Jun 22, 2016 at 3:15 PM, Richard Weinberger
 wrote:
> On Thu, Jun 16, 2016 at 7:51 PM, Tom Herbert  wrote:
>> Transports over UDP is intended to encapsulate TCP and other transport
>> protocols directly and securely in UDP.
>>
>> The goal of this work is twofold:
>>
>> 1) Allow applications to run their own transport layer stack (i.e.from
>>userspace). This eliminates dependencies on the OS (e.g. solves a
>>major dependency issue for Facebook on clients).
>
> Facebook on clients would be a Facebook app on mobile devices?
> Does that mean that the Facebook app is so advanced and complicated
> that it needs a special TCP stack?!
>
Yes, in the sense that the Facebook app is probably the biggest single
app in mobile and probably has about the most users. Advancing the
transport layer, especially with regard to security and privacy, is
critical to maintaining long-term viability. That being said, security,
protocol ossification, middlebox intrusion, and the demise of the E2E
model are everyone's problem. One major issue here, probably the biggest
on the whole Internet, is that the upgrade story for core software (FW,
OS, etc.) in devices attached to the Internet is miserable -- to the
point that some people think this undermines the future of the Internet
(e.g.
http://www.darkreading.com/vulnerabilities---threats/internet-of-things-devices-are-doomed/d/d-id/1315735).
TOU is a means to eliminate the dependencies we have on devices or
OSes being secure or being updated in a timely fashion to provide
security improvements.

Tom

> --
> Thanks,
> //richard


Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support

2016-06-22 Thread Eric Dumazet
On Wed, 2016-06-22 at 14:52 -0700, Rick Jones wrote:
> On 06/22/2016 11:22 AM, Yuval Mintz wrote:
> > But seriously, this isn't really anything new but rather a step forward in
> > the direction we've already taken - bnx2x/qede are already performing
> > the same for non-encapsulated TCP.
> 
> Since you mention bnx2x...   I would argue that the NIC firmware on 
> those NICs driven by bnx2x is doing it badly.  Not so much from a 
> functional standpoint I suppose, but from a performance one.  The 
> NIC-firmware GRO done there has this rather unfortunate assumption about 
> "all MSSes will be directly driven by my own physical MTU" and when it 
> sees segments of a size other than would be suggested by the physical 
> MTU, will coalesce only two segments together.  They then do not get 
> further coalesced in the stack.
> 
> Suffice it to say this does not do well from a performance standpoint.
> 
> One can disable LRO via ethtool for these NICs, but what that does is 
> disable old-school LRO, not GRO-in-the-NIC.  To get that disabled, one 
> must also get the bnx2x module loaded with "disable-tpa=1" so the Linux 
> stack GRO gets used instead.
> 
> Had the bnx2x-driven NICs' firmware not had that rather unfortunate 
> assumption about MSSes I probably would never have noticed.

I do not see this behavior on my bnx2x nics ?

ip ro add 10.246.11.52 via 10.246.11.254 dev eth0 mtu 1000
lpk51:~# ./netperf -H 10.246.11.52 -l 1000
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
10.246.11.52 () port 0 AF_INET


On receiver :

15:46:08.296241 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
ack 303360, win 8192, options [nop,nop,TS val 1245217243 ecr
1245306446], length 0
15:46:08.296430 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.],
seq 303360:327060, ack 1, win 229, options [nop,nop,TS val 1245306446
ecr 1245217242], length 23700
15:46:08.296441 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
ack 327060, win 8192, options [nop,nop,TS val 1245217243 ecr
1245306446], length 0
15:46:08.296644 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.],
seq 327060:350760, ack 1, win 229, options [nop,nop,TS val 1245306446
ecr 1245217242], length 23700
15:46:08.296655 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
ack 350760, win 8192, options [nop,nop,TS val 1245217244 ecr
1245306446], length 0
15:46:08.296854 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.],
seq 350760:374460, ack 1, win 229, options [nop,nop,TS val 1245306446
ecr 1245217242], length 23700
15:46:08.296897 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
ack 374460, win 8192, options [nop,nop,TS val 1245217244 ecr
1245306446], length 0
15:46:08.297054 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.],
seq 374460:398160, ack 1, win 229, options [nop,nop,TS val 1245306446
ecr 1245217242], length 23700
15:46:08.297099 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
ack 398160, win 8192, options [nop,nop,TS val 1245217244 ecr
1245306446], length 0
15:46:08.297258 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.],
seq 398160:420912, ack 1, win 229, options [nop,nop,TS val 1245306446
ecr 1245217242], length 22752
15:46:08.297301 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
ack 420912, win 8192, options [nop,nop,TS val 1245217244 ecr
1245306446], length 0




Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support

2016-06-22 Thread Hannes Frederic Sowa
On 22.06.2016 14:32, Alexander Duyck wrote:
> On Wed, Jun 22, 2016 at 11:22 AM, Yuval Mintz  wrote:
>> This series adds driver support for the processing of tunnel
>> [specifically vxlan/geneve/gre tunnels] packets which are
>> aggregated [GROed] by the hardware before driver passes
>> such packets upto the stack.
> First off I am pretty sure this isn't GRO.  This is LRO.
 Nopes. This is GRO - each MSS-sized frame will arrive on its
 own frag, whereas the headers [both  internal & external]
 would arrive on the linear part of the SKB.
>>
>>> No it is LRO, it just very closely mimics GRO.  If it is in hardware
>>> it is LRO.  GRO is extensible in software and bugs can be fixed in
>>> kernel, LRO is not extensible and any bugs in it are found in hardware
>>> or the driver and would have to be fixed there.  It all comes down to
>>> bug isolation.  If we find that disabling LRO makes the bug go away we
>>> know the hardware aggregation has an issue.  Both features should not
>>> operate on the same bit.  Otherwise when some bug is found in your
>>> implementation it will be blamed on GRO when the bug is actually in
>>> LRO.
>> I'm not aware of a definition stating GRO *has* to be extensible in SW;
>> AFAIK the LRO/GRO distinction revolves around multiple areas [criteria
>> for aggregation, method of aggregation, etc.].
> 
> The idea behind GRO was to make it so that we had a generic way to
> handle this in software.  For the most part drivers doing LRO in
> software were doing the same thing that the GRO was doing.  The only
> reason it was deprecated is because GRO was capable of doing more than
> LRO could since we add one parser and suddenly all devices saw the
> benefit instead of just one specific device.  It is best to keep those
> two distinct solutions and then let the user sort out if they want to
> have the aggregation done by the device or the kernel.

The most important point while talking about GRO is that it is
symmetrical. The GSO functions need to be able to completely undo the
effect GRO did.

LRO is, for example, completely disabled for all devices in the
namespace as soon as forwarding is enabled, because segmentation isn't
available for LRO. This led to the definition of GRO and GSO, as far as
I was told, and I would personally take it as the definition of GRO:
things like maximum segment size, flags and extensions must all be
aggregated and the metadata stored, so that GSO can rebuild exactly the
original fragments (or a reasonable approximation of the originals).

Again, LRO does not provide this, thus we have to disable it on all
devices that participate in forwarding. AFAIK one problem we constantly
saw is that the MSS was not correctly stored in the metadata, and thus
segmentation didn't recreate same-sized packets.

There are other small rules enforced nowadays. If we upgrade the GSO
side we need to be able to modify GRO as well, and vice versa.

The problem with a hardware implementation is that we might not be able
to modify core kernel code anymore, because hardware got into this
contract on how it implements GRO, so I agree with Alex, it would better
be kept a software only implementation.

>>> I realize this causes some pain when routing or bridging as LRO is
>>> disabled but that is kind of the point.  We don't want the hardware to
>>> be mangling frames when we are attempting to route packets between any
>>> two given devices.
> 
>> Actually, while I might disagree on whether this is LRO/GRO, I don't think
>> there's any problem for us to base this on the LRO feature - I.e., when we
>> started working on qede we didn't bother implementing LRO as we understood
>> it was deprecated, but this encapsulated aggregation is configurable; If it
>> makes life easier for everyone if we make the configuration based on the LRO
>> configuration, so that when LRO is disabled we won't have this turned on
>> it can be easily done.
> 
> If you can move this over to an LRO feature I think it would go a long
> way towards cleaning up many of my complaints with this.  LRO really
> isn't acceptable for routing or bridging scenarios whereas GRO
> sometimes is.  If we place your code under the same limitations as the
> LRO feature bit then we can probably look at allowing the tunnel
> aggregation to be performed since we can be more certain that the
> tunnel will be terminating at the local endpoint.
> 
> Also I don't know if you have been paying attention to recent
> discussions on the mailing list but the fact is GRO over UDP tunnels
> is still a subject for debate.  This patch takes things in the
> opposite direction of where we are currently talking about going with
> GRO.  I've added Hannes and Tom to this discussion just to make sure I
> have the proper understanding of all this as my protocol knowledge is
> somewhat lacking.
 I guess we're on the exact opposite side of the discussion - I.e., we're
 the vendor that tries pushing

Re: [PATCH] mellanox: mlx5: Use logging functions to reduce text ~10k/5%

2016-06-22 Thread Joe Perches
On Wed, 2016-06-22 at 14:40 -0600, Jason Gunthorpe wrote:
> On Wed, Jun 22, 2016 at 11:23:59AM -0700, Joe Perches wrote:
> > The output changes now do not include line #, but do include the
> > function offset.
> I've been using a technique like this in some code with good results:
> 
> struct source_location
> {
>    const char *file;
>    const char *func;
>    const char *format;
>    uint16_t line;
> };
> #define _LOCATION(format) ({static const struct source_location __location__ = {\
>    __FILE__,__PRETTY_FUNCTION__,format,__LINE__};\
>    &__location__;})
> 
> void _mlx5_core_err(const struct source_location *loc,
>                     struct mlx5_core_dev *dev, ...);
> #define mlx5_core_err(dev,format,...) \
>         _mlx5_core_err(_LOCATION(format),dev,__VA_ARGS__)
> 
> The call site .text overhead is about the same as what you have, but
> this still retains the function and line number information in
> .rodata.

Hello Jason.

As far as I know, no kernel code currently uses a _LOCATION
like macro.

I think your proposal is nearly identical in code size to the
existing call.  Also, compiler format/argument checking is
eliminated, and I think that is a significant negative.

Using the kernel vsprintf %pS or %ps extension is pretty common.

Using printk("%pS", __builtin_return_address(0)); in the called
function has no overhead at all and yields almost exactly
the same information.

Using more expressive messages is generally better than using
printk("%d", __LINE__);



Re: Regarding VRF support in Linux Kernel

2016-06-22 Thread David Ahern

On 6/22/16 4:03 PM, Randy Dunlap wrote:
> On 06/22/16 14:05, Ajith Adapa wrote:
>> Hi,
>>
>> I am following the steps present in
>> (https://www.kernel.org/doc/Documentation/networking/vrf.txt) and
>> trying to create IPv4 VRF in latest Fedora 24 distribution with 4.5
>> Linux kernel.
>>
>> [root@localhost ip]# uname -a
>> Linux localhost.localdomain 4.5.5-300.fc24.x86_64 #1 SMP Thu May 19
>> 13:05:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>>
>> I am getting operation not supported error as shown below
>>
>> # ip link add vrf-blue type vrf table 10
>> RTNETLINK answers: Operation not supported

VRF is not enabled in Fedora 24

David

>> Is there any CONFIG option to enable VRF in Linux kernel ?
>
> Yes, in drivers/net/Kconfig:
>
> config NET_VRF
> 	tristate "Virtual Routing and Forwarding (Lite)"
> 	depends on IP_MULTIPLE_TABLES
> 	depends on NET_L3_MASTER_DEV
> 	depends on IPV6 || IPV6=n
> 	depends on IPV6_MULTIPLE_TABLES || IPV6=n
> 	---help---
> 	  This option enables the support for mapping interfaces into VRF's. The
> 	  support enables VRF devices.


Re: [PATCH net-next 0/8] tou: Transports over UDP - part I

2016-06-22 Thread Richard Weinberger
On Thu, Jun 16, 2016 at 7:51 PM, Tom Herbert  wrote:
> Transports over UDP is intended to encapsulate TCP and other transport
> protocols directly and securely in UDP.
>
> The goal of this work is twofold:
>
> 1) Allow applications to run their own transport layer stack (i.e.from
>userspace). This eliminates dependencies on the OS (e.g. solves a
>major dependency issue for Facebook on clients).

Facebook on clients would be a Facebook app on mobile devices?
Does that mean that the Facebook app is so advanced and complicated
that it needs a special TCP stack?!

-- 
Thanks,
//richard


RE: [PATCH v6] r8152: Add support for setting pass through MAC address on RTL8153-AD

2016-06-22 Thread Mario_Limonciello
> -Original Message-
> From: Limonciello, Mario
> Sent: Tuesday, June 14, 2016 5:27 PM
> To: 'David Miller' ; pali.ro...@gmail.com
> Cc: gre...@linuxfoundation.org; and...@lunn.ch;
> hayesw...@realtek.com; linux-ker...@vger.kernel.org;
> netdev@vger.kernel.org; linux-...@vger.kernel.org;
> anthony.w...@canonical.com
> Subject: RE: [PATCH v6] r8152: Add support for setting pass through MAC
> address on RTL8153-AD
> 
> > -Original Message-
> > From: David Miller [mailto:da...@davemloft.net]
> > Sent: Tuesday, June 14, 2016 1:35 PM
> > To: pali.ro...@gmail.com
> > Cc: gre...@linuxfoundation.org; and...@lunn.ch; Limonciello, Mario
> > ; hayesw...@realtek.com; linux-
> > ker...@vger.kernel.org; netdev@vger.kernel.org; linux-
> > u...@vger.kernel.org; anthony.w...@canonical.com
> > Subject: Re: [PATCH v6] r8152: Add support for setting pass through MAC
> > address on RTL8153-AD
> >
> > From: Pali Rohár 
> > Date: Tue, 14 Jun 2016 18:47:36 +0200
> >
> > > You have never seen two ethernet cards with same MAC addresses?
> Right
> > I
> > > have not seen two USB, but there is non zero chance that could happen.
> >
> > It would be an error scenerio, and something to be avoided.
> >
> > It is a valid and correct assumption that one is able to put
> > several devices at the same time on the same physical network
> > and expect it to work.
> >
> > The behavior added by the change in question invalidates that.
> >
> > I'm trying to consider the long term aspects of this, which is that if
> > more devices adopt this scheme we're in trouble if we blindly
> > interpret the MAC address in this way.
> >
> 
> Do you mean if other manufacturers start to ship devices with
> RTL8153-AD's w/ this pass through bit set and people start to try to
> mix and match?
> 
> > This firmware MAC property facility seems to be designed with only an
> > extremely narrow use case being considered.
> 
> Yes, as I understand it this is the reason that it's only on such specific 
> devices
> that the mac address pass through bit is actually set on the efuse.

David,

Did you have any more thoughts about this?  I'm happy to make some other
adjustments to the patch, if you have some recommendations.


Re: Regarding VRF support in Linux Kernel

2016-06-22 Thread Randy Dunlap
On 06/22/16 14:05, Ajith Adapa wrote:
> Hi,
> 
> I am following the steps present in
> (https://www.kernel.org/doc/Documentation/networking/vrf.txt) and
> trying to create IPv4 VRF in latest Fedora 24 distribution with 4.5
> Linux kernel.
> 
> [root@localhost ip]# uname -a
> Linux localhost.localdomain 4.5.5-300.fc24.x86_64 #1 SMP Thu May 19
> 13:05:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
> 
> I am getting operation not supported error as shown below
> 
> # ip link add vrf-blue type vrf table 10
> RTNETLINK answers: Operation not supported
> 
> Is there any CONFIG option to enable VRF in Linux kernel ?

Yes, in drivers/net/Kconfig:

config NET_VRF
tristate "Virtual Routing and Forwarding (Lite)"
depends on IP_MULTIPLE_TABLES
depends on NET_L3_MASTER_DEV
depends on IPV6 || IPV6=n
depends on IPV6_MULTIPLE_TABLES || IPV6=n
---help---
  This option enables the support for mapping interfaces into VRF's. The
  support enables VRF devices.
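Spelled out as kernel configuration, the dependencies above translate into roughly the following .config fragment (a sketch; whether NET_VRF is built as a module or built in is a distribution choice):

```
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_NET_L3_MASTER_DEV=y
CONFIG_NET_VRF=m
```

With a kernel built this way, the `ip link add vrf-blue type vrf table 10` command from the report should succeed instead of returning "Operation not supported".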




-- 
~Randy


Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support

2016-06-22 Thread Rick Jones

On 06/22/2016 11:22 AM, Yuval Mintz wrote:

But seriously, this isn't really anything new but rather a step forward in
the direction we've already taken - bnx2x/qede are already performing
the same for non-encapsulated TCP.


Since you mention bnx2x...   I would argue that the NIC firmware on 
those NICs driven by bnx2x is doing it badly.  Not so much from a 
functional standpoint I suppose, but from a performance one.  The 
NIC-firmware GRO done there has this rather unfortunate assumption about 
"all MSSes will be directly driven by my own physical MTU" and when it 
sees segments of a size other than would be suggested by the physical 
MTU, will coalesce only two segments together.  They then do not get 
further coalesced in the stack.


Suffice it to say this does not do well from a performance standpoint.

One can disable LRO via ethtool for these NICs, but what that does is 
disable old-school LRO, not GRO-in-the-NIC.  To get that disabled, one 
must also get the bnx2x module loaded with "disable-tpa=1" so the Linux 
stack GRO gets used instead.


Had the bnx2x-driven NICs' firmware not had that rather unfortunate 
assumption about MSSes I probably would never have noticed.


happy benchmarking,

rick jones


Re: [RFC nf-next 1/3] netfilter: bridge: add and use br_nf_hook_thresh

2016-06-22 Thread Aaron Conole
Aaron Conole  writes:

> From: Florian Westphal 
>
> This replaces the last uses of NF_HOOK_THRESH().
> Followup patch will remove it and rename nf_hook_thresh.
>
> The reason is that inet (non-bridge) netfilter no longer invokes the
> hooks from hooks, so we do no longer need the thresh value to skip hooks
> with a lower priority.
>
> The bridge netfilter however may need to do this. br_nf_hook_thresh is a
> wrapper that is supposed to do this, i.e. only call hooks with a
> priority that exceeds NF_BR_PRI_BRNF.
>
> It's used only in the recursion cases of br_netfilter.
>
> Signed-off-by: Florian Westphal 
> Signed-off-by: Aaron Conole 
> ---

I will make sure to fix the address in this patch.  Apologies.


[RFC nf-next 3/3] netfilter: replace list_head with single linked list

2016-06-22 Thread Aaron Conole
The netfilter hook list never uses the prev pointer, and so can be
trimmed to be a smaller singly-linked list.

In addition to having a more light weight structure for hook traversal,
struct net becomes 5568 bytes (down from 6400) and struct net_device
becomes 2176 bytes (down from 2240).

Signed-off-by: Aaron Conole 
Signed-off-by: Florian Westphal 
---

NOTE: The unregister list code is ugly right now - I will fix it up into a
  cleaner version in the next round of submission.

 include/linux/netdevice.h |   2 +-
 include/linux/netfilter.h |  18 +++---
 include/linux/netfilter_ingress.h |  14 +++--
 include/net/netfilter/nf_queue.h  |   9 ++-
 include/net/netns/netfilter.h |   2 +-
 net/bridge/br_netfilter_hooks.c   |  21 +++
 net/netfilter/core.c  | 115 --
 net/netfilter/nf_internals.h  |  10 ++--
 net/netfilter/nf_queue.c  |  15 +++--
 net/netfilter/nfnetlink_queue.c   |   5 +-
 10 files changed, 121 insertions(+), 90 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e84d9d2..8235f67 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1747,7 +1747,7 @@ struct net_device {
 #endif
struct netdev_queue __rcu *ingress_queue;
 #ifdef CONFIG_NETFILTER_INGRESS
-   struct list_headnf_hooks_ingress;
+   struct nf_hook_entry __rcu *nf_hooks_ingress;
 #endif
 
unsigned char   broadcast[MAX_ADDR_LEN];
diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index ad444f0..3390a84 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -55,12 +55,12 @@ struct nf_hook_state {
struct net_device *out;
struct sock *sk;
struct net *net;
-   struct list_head *hook_list;
+   struct nf_hook_entry *hook_list;
int (*okfn)(struct net *, struct sock *, struct sk_buff *);
 };
 
 static inline void nf_hook_state_init(struct nf_hook_state *p,
- struct list_head *hook_list,
+ struct nf_hook_entry *hook_list,
  unsigned int hook,
  int thresh, u_int8_t pf,
  struct net_device *indev,
@@ -97,6 +97,12 @@ struct nf_hook_ops {
int priority;
 };
 
+struct nf_hook_entry {
+   struct nf_hook_entry __rcu  *next;
+   struct nf_hook_ops  ops;
+   const struct nf_hook_ops*orig_ops;
+};
+
 struct nf_sockopt_ops {
struct list_head list;
 
@@ -161,8 +167,6 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook,
 int (*okfn)(struct net *, struct sock *, struct sk_buff *),
 int thresh)
 {
-   struct list_head *hook_list;
-
 #ifdef HAVE_JUMP_LABEL
if (__builtin_constant_p(pf) &&
__builtin_constant_p(hook) &&
@@ -170,14 +174,14 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook,
return 1;
 #endif
 
-   hook_list = &net->nf.hooks[pf][hook];
-
-   if (!list_empty(hook_list)) {
+   if (rcu_access_pointer(net->nf.hooks[pf][hook])) {
+   struct nf_hook_entry *hook_list;
struct nf_hook_state state;
int ret;
 
/* We may already have this, but read-locks nest anyway */
rcu_read_lock();
+   hook_list = rcu_dereference(net->nf.hooks[pf][hook]);
nf_hook_state_init(&state, hook_list, hook, thresh,
   pf, indev, outdev, sk, net, okfn);
 
diff --git a/include/linux/netfilter_ingress.h b/include/linux/netfilter_ingress.h
index 6965ba0..e3e3f6d 100644
--- a/include/linux/netfilter_ingress.h
+++ b/include/linux/netfilter_ingress.h
@@ -11,23 +11,27 @@ static inline bool nf_hook_ingress_active(const struct sk_buff *skb)
	if (!static_key_false(&nf_hooks_needed[NFPROTO_NETDEV][NF_NETDEV_INGRESS]))
		return false;
 #endif
-   return !list_empty(&skb->dev->nf_hooks_ingress);
+   return rcu_access_pointer(skb->dev->nf_hooks_ingress) != NULL;
 }
 
 /* caller must hold rcu_read_lock */
 static inline int nf_hook_ingress(struct sk_buff *skb)
 {
+   struct nf_hook_entry *e = rcu_dereference(skb->dev->nf_hooks_ingress);
struct nf_hook_state state;
 
-   nf_hook_state_init(&state, &skb->dev->nf_hooks_ingress,
-  NF_NETDEV_INGRESS, INT_MIN, NFPROTO_NETDEV,
-  skb->dev, NULL, NULL, dev_net(skb->dev), NULL);
+   if (unlikely(!e))
+   return 0;
+
+   nf_hook_state_init(&state, e, NF_NETDEV_INGRESS, INT_MIN,
+  NFPROTO_NETDEV, skb->dev, NULL, NULL,
+  dev_net(skb->dev), NULL);
return nf_hook_slow(skb, &state);
 }
 
 static inline void nf_

[RFC nf-next 0/3] Compact netfilter hooks list

2016-06-22 Thread Aaron Conole
This series makes a simple change to shrink the netfilter hook list
from a double linked list, to a singly linked list.  Since the hooks
are always traversed in-order, there is no need to maintain a previous
pointer.

This series is being submitted for early feedback. This was jointly
developed by Florian Westphal.

Aaron Conole (1):
  netfilter: replace list_head with single linked list

Florian Westphal (2):
  netfilter: bridge: add and use br_nf_hook_thresh
  netfilter: call nf_hook_state_init with rcu_read_lock held

 include/linux/netdevice.h  |   2 +-
 include/linux/netfilter.h  |  26 --
 include/linux/netfilter_ingress.h  |  15 ++--
 include/net/netfilter/br_netfilter.h   |   6 ++
 include/net/netfilter/nf_queue.h   |   9 +-
 include/net/netns/netfilter.h  |   2 +-
 net/bridge/br_netfilter_hooks.c|  50 +--
 net/bridge/br_netfilter_ipv6.c |  12 ++-
 net/bridge/netfilter/ebt_redirect.c|   2 +-
 net/bridge/netfilter/ebtables.c|   2 +-
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c |   2 +-
 net/ipv4/netfilter/nf_conntrack_proto_icmp.c   |   2 +-
 net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c |   2 +-
 net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c |   2 +-
 net/netfilter/core.c   | 120 +++--
 net/netfilter/nf_conntrack_core.c  |   2 +-
 net/netfilter/nf_conntrack_h323_main.c |   2 +-
 net/netfilter/nf_conntrack_helper.c|   2 +-
 net/netfilter/nf_internals.h   |  10 +--
 net/netfilter/nf_queue.c   |  15 ++--
 net/netfilter/nfnetlink_cthelper.c |   2 +-
 net/netfilter/nfnetlink_log.c  |   8 +-
 net/netfilter/nfnetlink_queue.c|   7 +-
 net/netfilter/xt_helper.c  |   2 +-
 24 files changed, 193 insertions(+), 111 deletions(-)

-- 
2.5.5



[RFC nf-next 1/3] netfilter: bridge: add and use br_nf_hook_thresh

2016-06-22 Thread Aaron Conole
From: Florian Westphal 

This replaces the last uses of NF_HOOK_THRESH().
Followup patch will remove it and rename nf_hook_thresh.

The reason is that inet (non-bridge) netfilter no longer invokes the
hooks from hooks, so we do no longer need the thresh value to skip hooks
with a lower priority.

The bridge netfilter however may need to do this. br_nf_hook_thresh is a
wrapper that is supposed to do this, i.e. only call hooks with a
priority that exceeds NF_BR_PRI_BRNF.

It's used only in the recursion cases of br_netfilter.

Signed-off-by: Florian Westphal 
Signed-off-by: Aaron Conole 
---
 include/net/netfilter/br_netfilter.h |  6 
 net/bridge/br_netfilter_hooks.c  | 57 ++--
 net/bridge/br_netfilter_ipv6.c   | 12 
 3 files changed, 59 insertions(+), 16 deletions(-)

diff --git a/include/net/netfilter/br_netfilter.h b/include/net/netfilter/br_netfilter.h
index e8d1448..0b0c35c 100644
--- a/include/net/netfilter/br_netfilter.h
+++ b/include/net/netfilter/br_netfilter.h
@@ -15,6 +15,12 @@ static inline struct nf_bridge_info *nf_bridge_alloc(struct sk_buff *skb)
 
 void nf_bridge_update_protocol(struct sk_buff *skb);
 
+int br_nf_hook_thresh(unsigned int hook, struct net *net, struct sock *sk,
+ struct sk_buff *skb, struct net_device *indev,
+ struct net_device *outdev,
+ int (*okfn)(struct net *, struct sock *,
+ struct sk_buff *));
+
 static inline struct nf_bridge_info *
 nf_bridge_info_get(const struct sk_buff *skb)
 {
diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index 2d25979..19f230c 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -395,11 +396,10 @@ bridged_dnat:
skb->dev = nf_bridge->physindev;
nf_bridge_update_protocol(skb);
nf_bridge_push_encap_header(skb);
-   NF_HOOK_THRESH(NFPROTO_BRIDGE,
-  NF_BR_PRE_ROUTING,
-  net, sk, skb, skb->dev, NULL,
-  br_nf_pre_routing_finish_bridge,
-  1);
+   br_nf_hook_thresh(NF_BR_PRE_ROUTING,
+ net, sk, skb, skb->dev,
+ NULL,
+ br_nf_pre_routing_finish_bridge);
return 0;
}
ether_addr_copy(eth_hdr(skb)->h_dest, dev->dev_addr);
@@ -417,10 +417,8 @@ bridged_dnat:
skb->dev = nf_bridge->physindev;
nf_bridge_update_protocol(skb);
nf_bridge_push_encap_header(skb);
-   NF_HOOK_THRESH(NFPROTO_BRIDGE, NF_BR_PRE_ROUTING, net, sk, skb,
-  skb->dev, NULL,
-  br_handle_frame_finish, 1);
-
+   br_nf_hook_thresh(NF_BR_PRE_ROUTING, net, sk, skb, skb->dev, NULL,
+ br_handle_frame_finish);
return 0;
 }
 
@@ -992,6 +990,47 @@ static struct notifier_block brnf_notifier __read_mostly = {
 .notifier_call = brnf_device_event,
 };
 
+/* recursively invokes nf_hook_slow (again), skipping already-called
+ * hooks (< NF_BR_PRI_BRNF).
+ *
+ * Called with rcu read lock held.
+ */
+int br_nf_hook_thresh(unsigned int hook, struct net *net,
+ struct sock *sk, struct sk_buff *skb,
+ struct net_device *indev,
+ struct net_device *outdev,
+ int (*okfn)(struct net *, struct sock *,
+ struct sk_buff *))
+{
+   struct nf_hook_ops *elem;
+   struct nf_hook_state state;
+   struct list_head *head;
+   int ret;
+
+   head = &net->nf.hooks[NFPROTO_BRIDGE][hook];
+
+   list_for_each_entry_rcu(elem, head, list) {
+   if (elem->priority > NF_BR_PRI_BRNF)
+   break;
+   }
+
+   if (&elem->list == head)
+   return okfn(net, sk, skb);
+
+   nf_hook_state_init(&state, head, hook, NF_BR_PRI_BRNF + 1,
+  NFPROTO_BRIDGE, indev, outdev, sk, net, okfn);
+
+   ret = nf_hook_slow(skb, &state);
+   if (ret == 1)
+   ret = okfn(net, sk, skb);
+
+   return ret;
+}
+
 #ifdef CONFIG_SYSCTL
 static
 int brnf_sysctl_call_tables(struct ctl_table *ctl, int write,
diff --git a/net/bridge/br_netfilter_ipv6.c b/net/bridge/br_netfilter_ipv6.c
index 5e59a84..5989661 100644
--- a/net

[RFC nf-next 2/3] netfilter: call nf_hook_state_init with rcu_read_lock held

2016-06-22 Thread Aaron Conole
From: Florian Westphal 

This makes things simpler because we can store the head of the list
in the nf_state structure without worrying about concurrent add/delete
of hook elements from the list.

Signed-off-by: Florian Westphal 
Signed-off-by: Aaron Conole 
---
 include/linux/netfilter.h  | 8 +++-
 include/linux/netfilter_ingress.h  | 1 +
 net/bridge/netfilter/ebt_redirect.c| 2 +-
 net/bridge/netfilter/ebtables.c| 2 +-
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c | 2 +-
 net/ipv4/netfilter/nf_conntrack_proto_icmp.c   | 2 +-
 net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c | 2 +-
 net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c | 2 +-
 net/netfilter/core.c   | 5 +
 net/netfilter/nf_conntrack_core.c  | 2 +-
 net/netfilter/nf_conntrack_h323_main.c | 2 +-
 net/netfilter/nf_conntrack_helper.c| 2 +-
 net/netfilter/nfnetlink_cthelper.c | 2 +-
 net/netfilter/nfnetlink_log.c  | 8 ++--
 net/netfilter/nfnetlink_queue.c| 2 +-
 net/netfilter/xt_helper.c  | 2 +-
 16 files changed, 27 insertions(+), 19 deletions(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 9230f9a..ad444f0 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -174,10 +174,16 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook,
 
if (!list_empty(hook_list)) {
struct nf_hook_state state;
+   int ret;
 
+   /* We may already have this, but read-locks nest anyway */
+   rcu_read_lock();
nf_hook_state_init(&state, hook_list, hook, thresh,
   pf, indev, outdev, sk, net, okfn);
-   return nf_hook_slow(skb, &state);
+
+   ret = nf_hook_slow(skb, &state);
+   rcu_read_unlock();
+   return ret;
}
return 1;
 }
diff --git a/include/linux/netfilter_ingress.h b/include/linux/netfilter_ingress.h
index 5fcd375..6965ba0 100644
--- a/include/linux/netfilter_ingress.h
+++ b/include/linux/netfilter_ingress.h
@@ -14,6 +14,7 @@ static inline bool nf_hook_ingress_active(const struct sk_buff *skb)
return !list_empty(&skb->dev->nf_hooks_ingress);
 }
 
+/* caller must hold rcu_read_lock */
 static inline int nf_hook_ingress(struct sk_buff *skb)
 {
struct nf_hook_state state;
diff --git a/net/bridge/netfilter/ebt_redirect.c b/net/bridge/netfilter/ebt_redirect.c
index 20396499..2e7c4f9 100644
--- a/net/bridge/netfilter/ebt_redirect.c
+++ b/net/bridge/netfilter/ebt_redirect.c
@@ -24,7 +24,7 @@ ebt_redirect_tg(struct sk_buff *skb, const struct xt_action_param *par)
return EBT_DROP;
 
if (par->hooknum != NF_BR_BROUTING)
-   /* rcu_read_lock()ed by nf_hook_slow */
+   /* rcu_read_lock()ed by nf_hook_thresh */
ether_addr_copy(eth_hdr(skb)->h_dest,
br_port_get_rcu(par->in)->br->dev->dev_addr);
else
diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c
index 5a61f35..6faa2c3 100644
--- a/net/bridge/netfilter/ebtables.c
+++ b/net/bridge/netfilter/ebtables.c
@@ -148,7 +148,7 @@ ebt_basic_match(const struct ebt_entry *e, const struct sk_buff *skb,
return 1;
if (FWINV2(ebt_dev_check(e->out, out), EBT_IOUT))
return 1;
-   /* rcu_read_lock()ed by nf_hook_slow */
+   /* rcu_read_lock()ed by nf_hook_thresh */
if (in && (p = br_port_get_rcu(in)) != NULL &&
FWINV2(ebt_dev_check(e->logical_in, p->br->dev), EBT_ILOGICALIN))
return 1;
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
index ae1a71a..eab0239 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
@@ -110,7 +110,7 @@ static unsigned int ipv4_helper(void *priv,
if (!help)
return NF_ACCEPT;
 
-   /* rcu_read_lock()ed by nf_hook_slow */
+   /* rcu_read_lock()ed by nf_hook_thresh */
helper = rcu_dereference(help->helper);
if (!helper)
return NF_ACCEPT;
diff --git a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
index c567e1b..2c08d6a 100644
--- a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
+++ b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
@@ -149,7 +149,7 @@ icmp_error_message(struct net *net, struct nf_conn *tmpl, struct sk_buff *skb,
return -NF_ACCEPT;
}
 
-   /* rcu_read_lock()ed by nf_hook_slow */
+   /* rcu_read_lock()ed by nf_hook_thresh */
innerproto = __nf_ct_l4proto_find(PF_INET, origtuple.dst.protonum);
 
/* Ordinarily, we'd expect the inverted tupleproto, but it's
diff --git a/net/ipv6/netfilter/nf_conn

Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support

2016-06-22 Thread Alexander Duyck
On Wed, Jun 22, 2016 at 11:22 AM, Yuval Mintz  wrote:
> This series adds driver support for the processing of tunnel
> [specifically vxlan/geneve/gre tunnels] packets which are
> aggregated [GROed] by the hardware before the driver passes
> such packets up to the stack.
 First off I am pretty sure this isn't GRO.  This is LRO.
>>> Nopes. This is GRO - each MSS-sized frame will arrive on its
>>> own frag, whereas the headers [both internal & external]
>>> would arrive on the linear part of the SKB.
>
>> No it is LRO, it just very closely mimics GRO.  If it is in hardware
>> it is LRO.  GRO is extensible in software and bugs can be fixed in
>> kernel, LRO is not extensible and any bugs in it are found in hardware
>> or the driver and would have to be fixed there.  It all comes down to
>> bug isolation.  If we find that disabling LRO makes the bug go away we
>> know the hardware aggregation has an issue.  Both features should not
>> operate on the same bit.  Otherwise when some bug is found in your
>> implementation it will be blamed on GRO when the bug is actually in
>> LRO.
> I'm not aware of a definition stating GRO *has* to be extensible in SW;
> AFAIK the LRO/GRO distinction revolves around multiple areas [criteria
> for aggregation, method of aggregation, etc.].

The idea behind GRO was to give us a generic way to handle this in
software.  For the most part, drivers doing LRO in software were doing
the same thing that GRO was doing.  The only reason LRO was deprecated
is that GRO was capable of doing more: we could add one parser and
suddenly all devices saw the benefit instead of just one specific
device.  It is best to keep those two solutions distinct and then let
the user sort out whether they want the aggregation done by the device
or by the kernel.

>> I realize this causes some pain when routing or bridging as LRO is
>> disabled but that is kind of the point.  We don't want the hardware to
>> be mangling frames when we are attempting to route packets between any
>> two given devices.

> Actually, while I might disagree on whether this is LRO/GRO, I don't think
> there's any problem for us to base this on the LRO feature - i.e., when we
> started working on qede we didn't bother implementing LRO as we understood
> it was deprecated, but this encapsulated aggregation is configurable. If it
> makes life easier for everyone, we can base the configuration on the LRO
> feature, so that this aggregation is turned off whenever LRO is disabled;
> that can easily be done.

If you can move this over to an LRO feature I think it would go a long
way towards cleaning up many of my complaints with this.  LRO really
isn't acceptable for routing or bridging scenarios whereas GRO
sometimes is.  If we place your code under the same limitations as the
LRO feature bit then we can probably look at allowing the tunnel
aggregation to be performed since we can be more certain that the
tunnel will be terminating at the local endpoint.

 Also I don't know if you have been paying attention to recent
 discussions on the mailing list but the fact is GRO over UDP tunnels
 is still a subject for debate.  This patch takes things in the
 opposite direction of where we are currently talking about going with
 GRO.  I've added Hannes and Tom to this discussion just to make sure I
 have the proper understanding of all this as my protocol knowledge is
 somewhat lacking.
>>> I guess we're on the exact opposite side of the discussion - i.e., we're
>>> the vendor that tries pushing offload capabilities to the device.
>>> Do notice however that we're not planning on pushing anything new
>>> feature-wise, just like we haven't done anything for regular TCP GRO -
>>> All we do is allow our HW/FW to aggregate packets instead of the stack.
>
>> If you have been following the list then arguably you didn't fully
>> understand what has been going on.  I just went through and cleaned up
>> all the VXLAN, GENEVE, and VXLAN-GPE mess that we have been creating
>> and tried to get this consolidated so that we could start to have
>> conversations about this without things being outright rejected.  I
>> feel like you guys have just proven the argument for the other side
>> that said as soon as we start supporting any of it, the vendors were
>> going to go nuts and try to stuff everything and the kitchen sink into
>> the NICs.  The fact is I am starting to think they were right.

> You hurt my feelings; I started going nuts ages ago ;-)

I'm not saying anything about going nuts, I'm just saying you kind of
decided to sprint out into a mine-field where I have been treading the
last week or so.  I'm just wanting to avoid having to see all this
code getting ripped out.

> But seriously, this isn't really anything new but rather a step forward in
> the direction we've already taken - bnx2x/qede are already performing
> the same for non-encapsulated TCP.

Right.  That goes on for other NI

[PATCH net-next v2 2/4] cgroup: bpf: Add BPF_MAP_TYPE_CGROUP_ARRAY

2016-06-22 Thread Martin KaFai Lau
Add a BPF_MAP_TYPE_CGROUP_ARRAY and its bpf_map_ops implementation.
To update an element, the caller is expected to obtain a cgroup2-backed
fd by open(cgroup2_dir) and then update the array with that fd.

Signed-off-by: Martin KaFai Lau 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Tejun Heo 
Acked-by: Alexei Starovoitov 
---
 include/uapi/linux/bpf.h |  1 +
 kernel/bpf/arraymap.c| 43 +++
 kernel/bpf/syscall.c |  3 ++-
 3 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 406459b..ef4e386 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -84,6 +84,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_PERCPU_HASH,
BPF_MAP_TYPE_PERCPU_ARRAY,
BPF_MAP_TYPE_STACK_TRACE,
+   BPF_MAP_TYPE_CGROUP_ARRAY,
 };
 
 enum bpf_prog_type {
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 5af3073..aacd40b 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -539,3 +539,46 @@ static int __init register_perf_event_array_map(void)
return 0;
 }
 late_initcall(register_perf_event_array_map);
+
+#ifdef CONFIG_CGROUPS
+static void *cgroup_fd_array_get_ptr(struct bpf_map *map,
+struct file *map_file /* not used */,
+int fd)
+{
+   return cgroup_get_from_fd(fd);
+}
+
+static void cgroup_fd_array_put_ptr(void *ptr)
+{
+   /* cgroup_put free cgrp after a rcu grace period */
+   cgroup_put(ptr);
+}
+
+static void cgroup_fd_array_free(struct bpf_map *map)
+{
+   bpf_fd_array_map_clear(map);
+   fd_array_map_free(map);
+}
+
+static const struct bpf_map_ops cgroup_array_ops = {
+   .map_alloc = fd_array_map_alloc,
+   .map_free = cgroup_fd_array_free,
+   .map_get_next_key = array_map_get_next_key,
+   .map_lookup_elem = fd_array_map_lookup_elem,
+   .map_delete_elem = fd_array_map_delete_elem,
+   .map_fd_get_ptr = cgroup_fd_array_get_ptr,
+   .map_fd_put_ptr = cgroup_fd_array_put_ptr,
+};
+
+static struct bpf_map_type_list cgroup_array_type __read_mostly = {
+   .ops = &cgroup_array_ops,
+   .type = BPF_MAP_TYPE_CGROUP_ARRAY,
+};
+
+static int __init register_cgroup_array_map(void)
+{
+   bpf_register_map_type(&cgroup_array_type);
+   return 0;
+}
+late_initcall(register_cgroup_array_map);
+#endif
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c23a4e93..cac13f1 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -393,7 +393,8 @@ static int map_update_elem(union bpf_attr *attr)
} else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
err = bpf_percpu_array_update(map, key, value, attr->flags);
} else if (map->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY ||
-  map->map_type == BPF_MAP_TYPE_PROG_ARRAY) {
+  map->map_type == BPF_MAP_TYPE_PROG_ARRAY ||
+  map->map_type == BPF_MAP_TYPE_CGROUP_ARRAY) {
rcu_read_lock();
err = bpf_fd_array_map_update_elem(map, f.file, key, value,
   attr->flags);
-- 
2.5.1



[PATCH net-next v2 3/4] cgroup: bpf: Add bpf_skb_in_cgroup_proto

2016-06-22 Thread Martin KaFai Lau
Adds a bpf helper, bpf_skb_in_cgroup, to decide if a skb->sk
belongs to a descendant of a cgroup2.  It is similar to the
feature added in netfilter:
commit c38c4597e4bf ("netfilter: implement xt_cgroup cgroup2 path match")

The user is expected to populate a BPF_MAP_TYPE_CGROUP_ARRAY
which will be used by the bpf_skb_in_cgroup.

The bpf verifier is modified to ensure that BPF_MAP_TYPE_CGROUP_ARRAY
and bpf_skb_in_cgroup() are always used together.

Signed-off-by: Martin KaFai Lau 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Tejun Heo 
Acked-by: Alexei Starovoitov 
---
 include/uapi/linux/bpf.h | 12 
 kernel/bpf/verifier.c|  8 
 net/core/filter.c| 40 
 3 files changed, 60 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ef4e386..bad309f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -314,6 +314,18 @@ enum bpf_func_id {
 */
BPF_FUNC_skb_get_tunnel_opt,
BPF_FUNC_skb_set_tunnel_opt,
+
+   /**
+* bpf_skb_in_cgroup(skb, map, index) - Check cgroup2 membership of skb
+* @skb: pointer to skb
+* @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type
+* @index: index of the cgroup in the bpf_map
+* Return:
+*   == 0 skb failed the cgroup2 descendant test
+*   == 1 skb succeeded the cgroup2 descendant test
+*   < 0 error
+*/
+   BPF_FUNC_skb_in_cgroup,
__BPF_FUNC_MAX_ID,
 };
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 668e079..68753e0 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1062,6 +1062,10 @@ static int check_map_func_compatibility(struct bpf_map *map, int func_id)
if (func_id != BPF_FUNC_get_stackid)
goto error;
break;
+   case BPF_MAP_TYPE_CGROUP_ARRAY:
+   if (func_id != BPF_FUNC_skb_in_cgroup)
+   goto error;
+   break;
default:
break;
}
@@ -1081,6 +1085,10 @@ static int check_map_func_compatibility(struct bpf_map *map, int func_id)
if (map->map_type != BPF_MAP_TYPE_STACK_TRACE)
goto error;
break;
+   case BPF_FUNC_skb_in_cgroup:
+   if (map->map_type != BPF_MAP_TYPE_CGROUP_ARRAY)
+   goto error;
+   break;
default:
break;
}
diff --git a/net/core/filter.c b/net/core/filter.c
index df6860c..a16f7d2 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2024,6 +2024,42 @@ bpf_get_skb_set_tunnel_proto(enum bpf_func_id which)
}
 }
 
+#ifdef CONFIG_CGROUPS
+static u64 bpf_skb_in_cgroup(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+   struct sk_buff *skb = (struct sk_buff *)(long)r1;
+   struct bpf_map *map = (struct bpf_map *)(long)r2;
+   struct bpf_array *array = container_of(map, struct bpf_array, map);
+   struct cgroup *cgrp;
+   struct sock *sk;
+   u32 i = (u32)r3;
+
+   WARN_ON_ONCE(!rcu_read_lock_held());
+
+   sk = skb->sk;
+   if (!sk || !sk_fullsock(sk))
+   return -ENOENT;
+
+   if (unlikely(i >= array->map.max_entries))
+   return -E2BIG;
+
+   cgrp = READ_ONCE(array->ptrs[i]);
+   if (unlikely(!cgrp))
+   return -ENOENT;
+
+   return cgroup_is_descendant(sock_cgroup_ptr(&sk->sk_cgrp_data), cgrp);
+}
+
+static const struct bpf_func_proto bpf_skb_in_cgroup_proto = {
+   .func   = bpf_skb_in_cgroup,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+   .arg2_type  = ARG_CONST_MAP_PTR,
+   .arg3_type  = ARG_ANYTHING,
+};
+#endif
+
 static const struct bpf_func_proto *
 sk_filter_func_proto(enum bpf_func_id func_id)
 {
@@ -2086,6 +2122,10 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
return &bpf_get_route_realm_proto;
case BPF_FUNC_perf_event_output:
return bpf_get_event_output_proto();
+#ifdef CONFIG_CGROUPS
+   case BPF_FUNC_skb_in_cgroup:
+   return &bpf_skb_in_cgroup_proto;
+#endif
default:
return sk_filter_func_proto(func_id);
}
-- 
2.5.1



[PATCH net-next v2 1/4] cgroup: Add cgroup_get_from_fd

2016-06-22 Thread Martin KaFai Lau
Add a helper function to get a cgroup2 from a fd.  It will be
stored in a bpf array (BPF_MAP_TYPE_CGROUP_ARRAY) which will
be introduced in a later patch.

Signed-off-by: Martin KaFai Lau 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Tejun Heo 
---
 include/linux/cgroup.h |  1 +
 kernel/cgroup.c| 35 +++
 2 files changed, 36 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index a20320c..984f73b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -87,6 +87,7 @@ struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry,
   struct cgroup_subsys *ss);
 
 struct cgroup *cgroup_get_from_path(const char *path);
+struct cgroup *cgroup_get_from_fd(int fd);
 
 int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
 int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 86cb5c6..14617968 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -62,6 +62,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /*
@@ -6205,6 +6206,40 @@ struct cgroup *cgroup_get_from_path(const char *path)
 }
 EXPORT_SYMBOL_GPL(cgroup_get_from_path);
 
+/**
+ * cgroup_get_from_fd - get a cgroup pointer from a fd
+ * @fd: fd obtained by open(cgroup2_dir)
+ *
+ * Find the cgroup from a fd which should be obtained
+ * by opening a cgroup directory.  Returns a pointer to the
+ * cgroup on success. ERR_PTR is returned if the cgroup
+ * cannot be found.
+ */
+struct cgroup *cgroup_get_from_fd(int fd)
+{
+   struct cgroup_subsys_state *css;
+   struct cgroup *cgrp;
+   struct file *f;
+
+   f = fget_raw(fd);
+   if (!f)
+   return ERR_PTR(-EBADF);
+
+   css = css_tryget_online_from_dir(f->f_path.dentry, NULL);
+   fput(f);
+   if (IS_ERR(css))
+   return ERR_CAST(css);
+
+   cgrp = css->cgroup;
+   if (!cgroup_on_dfl(cgrp)) {
+   cgroup_put(cgrp);
+   return ERR_PTR(-EBADF);
+   }
+
+   return cgrp;
+}
+EXPORT_SYMBOL_GPL(cgroup_get_from_fd);
+
 /*
  * sock->sk_cgrp_data handling.  For more info, see sock_cgroup_data
  * definition in cgroup-defs.h.
-- 
2.5.1



[PATCH net-next v2 4/4] cgroup: bpf: Add an example to do cgroup checking in BPF

2016-06-22 Thread Martin KaFai Lau
test_cgrp2_array_pin.c:
A userland program that creates a bpf_map (BPF_MAP_TYPE_CGROUP_ARRAY),
populates/updates it with a cgroup2-backed fd and pins it to a
bpf-fs file.  The pinned file can be loaded by tc and then used
by the bpf prog later.  This program can also update an existing pinned
array, which can be useful for debugging/testing purposes.

test_cgrp2_tc_kern.c:
A bpf prog which should be loaded by tc.  It is to demonstrate
the usage of bpf_skb_in_cgroup.

test_cgrp2_tc.sh:
A script that glues the test_cgrp2_array_pin.c and
test_cgrp2_tc_kern.c together.  The idea is like:
1. Use test_cgrp2_array_pin.c to populate a BPF_MAP_TYPE_CGROUP_ARRAY
   with a cgroup fd
2. Load the test_cgrp2_tc_kern.o by tc
3. Do a 'ping -6 ff02::1%ve' to ensure the packet has been
   dropped because of a match on the cgroup

Most of the lines in test_cgrp2_tc.sh are boilerplate
to set up the cgroup/bpf-fs/net-devices/netns...etc.  It is
not bulletproof on errors but should work well enough and
give enough debug info if things did not go well.

Signed-off-by: Martin KaFai Lau 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Tejun Heo 
Acked-by: Alexei Starovoitov 
---
 samples/bpf/Makefile   |   3 +
 samples/bpf/bpf_helpers.h  |   2 +
 samples/bpf/test_cgrp2_array_pin.c | 109 +
 samples/bpf/test_cgrp2_tc.sh   | 189 +
 samples/bpf/test_cgrp2_tc_kern.c   |  71 ++
 5 files changed, 374 insertions(+)
 create mode 100644 samples/bpf/test_cgrp2_array_pin.c
 create mode 100755 samples/bpf/test_cgrp2_tc.sh
 create mode 100644 samples/bpf/test_cgrp2_tc_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 0bf2478..a98b780 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -20,6 +20,7 @@ hostprogs-y += offwaketime
 hostprogs-y += spintest
 hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
+hostprogs-y += test_cgrp2_array_pin
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -40,6 +41,7 @@ offwaketime-objs := bpf_load.o libbpf.o offwaketime_user.o
 spintest-objs := bpf_load.o libbpf.o spintest_user.o
 map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
+test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -61,6 +63,7 @@ always += map_perf_test_kern.o
 always += test_overhead_tp_kern.o
 always += test_overhead_kprobe_kern.o
 always += parse_varlen.o parse_simple.o parse_ldabs.o
+always += test_cgrp2_tc_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index 7904a2a..84e3fd9 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -70,6 +70,8 @@ static int (*bpf_l3_csum_replace)(void *ctx, int off, int from, int to, int flag
(void *) BPF_FUNC_l3_csum_replace;
 static int (*bpf_l4_csum_replace)(void *ctx, int off, int from, int to, int flags) =
(void *) BPF_FUNC_l4_csum_replace;
+static int (*bpf_skb_in_cgroup)(void *ctx, void *map, int index) =
+   (void *) BPF_FUNC_skb_in_cgroup;
 
 #if defined(__x86_64__)
 
diff --git a/samples/bpf/test_cgrp2_array_pin.c b/samples/bpf/test_cgrp2_array_pin.c
new file mode 100644
index 000..70e86f7
--- /dev/null
+++ b/samples/bpf/test_cgrp2_array_pin.c
@@ -0,0 +1,109 @@
+/* Copyright (c) 2016 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "libbpf.h"
+
+static void usage(void)
+{
+   printf("Usage: test_cgrp2_array_pin [...]\n");
+   printf("   -F  File to pin an BPF cgroup array\n");
+   printf("   -U  Update an already pinned BPF cgroup array\n");
+   printf("   -v  Full path of the cgroup2\n");
+   printf("   -h  Display this help\n");
+}
+
+int main(int argc, char **argv)
+{
+   const char *pinned_file = NULL, *cg2 = NULL;
+   int create_array = 1;
+   int array_key = 0;
+   int array_fd = -1;
+   int cg2_fd = -1;
+   int ret = -1;
+   int opt;
+
+   while ((opt = getopt(argc, argv, "F:U:v:")) != -1) {
+   switch (opt) {
+   /* General args */
+   case 'F':
+   pinned_file = optarg;
+   break;
+   case 'U':
+   pinned_file = optarg;
+   create_array = 0;
+   break;
+   case 'v':
+   cg2 = optarg;
+   break;
+   default:
+   usage();
+   goto out;
+   }
+

[PATCH net-next v2 0/4] cgroup: bpf: cgroup2 membership test on skb

2016-06-22 Thread Martin KaFai Lau
v2:
- Fix two return cases in cgroup_get_from_fd()
- Fix compilation errors when CONFIG_CGROUPS is not used:
  - arraymap.c: avoid registering BPF_MAP_TYPE_CGROUP_ARRAY
  - filter.c: tc_cls_act_func_proto() returns NULL on BPF_FUNC_skb_in_cgroup
- Add comments to BPF_FUNC_skb_in_cgroup and cgroup_get_from_fd()

v1 cover letter:
This series is to implement a bpf-way to
check the cgroup2 membership of a skb (sk_buff).

It is similar to the feature added in netfilter:
c38c4597e4bf ("netfilter: implement xt_cgroup cgroup2 path match")

The current target is the tc-like usage.



Regarding VRF support in Linux Kernel

2016-06-22 Thread Ajith Adapa
Hi,

I am following the steps present in
(https://www.kernel.org/doc/Documentation/networking/vrf.txt) and
trying to create IPv4 VRF in latest Fedora 24 distribution with 4.5
Linux kernel.

[root@localhost ip]# uname -a
Linux localhost.localdomain 4.5.5-300.fc24.x86_64 #1 SMP Thu May 19
13:05:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

I am getting operation not supported error as shown below

# ip link add vrf-blue type vrf table 10
RTNETLINK answers: Operation not supported

Is there any CONFIG option to enable VRF in the Linux kernel?

Regards,
Ajith


Re: [PATCH net-next v3] tcp: use RFC6298 compliant TCP RTO calculation

2016-06-22 Thread Yuchung Cheng
On Wed, Jun 22, 2016 at 4:21 AM, Hagen Paul Pfeifer  wrote:
>
> > On June 22, 2016 at 7:53 AM Yuchung Cheng  wrote:
> >
> > Thanks for the patience. I've collected data from some Google Web
> > servers. They serve both a mix of US and SouthAm users using
> > HTTP1 and HTTP2. The traffic is Web browsing (e.g., search, maps,
> > gmails, etc but not Youtube videos). The mean RTT is about 100ms.
> >
> > The user connections were split into 4 groups of different TCP RTO
> > configs. Each group has many millions of connections but the
> > size variation among groups is well under 1%.
> >
> > B: baseline Linux
> > D: this patch
> > R: change RTTVAR averaging as in D, but bound RTO to 1sec per RFC6298
> > Y: change RTTVAR averaging as in D, but bound RTTVAR to 200ms instead (like B)
> >
> > For mean TCP latency of HTTP responses (first byte sent to last byte
> > acked), B < R < Y < D. But the differences are so insignificant (<1%).
> > The median, 95pctl, and 99pctl show similar indifference. In summary
> > there's hardly visible impact on latency. I also looked at only
> > responses less than 4KB but do not see a different picture.
> >
> > The main difference is the retransmission rate, where R =~ Y < B =~ D.
> > R and Y are ~20% lower than B and D. Parsing the SNMP stats reveals
> > more interesting details. The table shows the deltas in percentage to
> > the baseline B.
> >
> >                 D      R      Y
> > --------------------------------
> > Timeout       +12%   -16%   -16%
> > TailLossProb  +28%    -7%    -7%
> > DSACK_rcvd    +37%    -7%    -7%
> > Cwnd-undo     +16%   -29%   -29%
> >
> > RTO change affects TLP because TLP will use the min of RTO and TLP
> > timer value to arm the probe timer.
> >
> > The stats indicate that the main culprit of spurious timeouts / rtx is
> > the RTO lower-bound. But they also show the RFC RTTVAR averaging is as
> > good as the current Linux approach.
> >
> > Given that I would recommend we revise this patch to use the RFC
> > averaging but keep existing lower-bound (of RTTVAR to 200ms). We can
> > further experiment the lower-bound and change that in a separate
> > patch.
>
> Great news Yuchung!
>
> Then Daniel will prepare v4 with a min-rto lower bound:
>
> max(RTTVAR, tcp_rto_min_us(struct sock))
>
> Any further suggestions Yuchung, Eric? We will also feed this v4 in our test 
> environment to check the behavior for sender limited, non-continuous flows.
yes a small one: I think the patch should change __tcp_set_rto()
instead of tcp_set_rto() so it applies to recurring timeouts as well.


>
> Hagen


Re: [PATCH net-next 7/9] net/mlx5e: Add 50G missing link mode to ethtool and mlx5 driver

2016-06-22 Thread Saeed Mahameed
On Wed, Jun 22, 2016 at 7:47 PM, David Decotigny  wrote:
> maybe split this into 2 separate patches?

sure


Re: [PATCH] mellanox: mlx5: Use logging functions to reduce text ~10k/5%

2016-06-22 Thread Jason Gunthorpe
On Wed, Jun 22, 2016 at 11:23:59AM -0700, Joe Perches wrote:

> The output changes now do not include line #, but do include the
> function offset.

I've been using a technique like this in some code with good results:

struct source_location
{
   const char *file;
   const char *func;
   const char *format;
   uint16_t line;
};
#define _LOCATION(format) ({static const struct source_location __location__ = {\
 __FILE__,__PRETTY_FUNCTION__,format,__LINE__};\
 &__location__;})

void _mlx5_core_err(const struct source_location *loc, struct mlx5_core_dev *dev, ...);
#define mlx5_core_err(dev,format,...) \
   _mlx5_core_err(_LOCATION(format),dev,__VA_ARGS__)

The call site .text overhead is the about same as what you have, but
this still retains the function and line number information in
.rodata.

Jason


Re: [PATCH net 0/2] mlx4_en fixes for 4.7-rc

2016-06-22 Thread David Miller
From: Tariq Toukan 
Date: Mon, 20 Jun 2016 17:35:07 +0300

> This small patchset includes two small fixes for mlx4_en driver.
> 
> One allows a clean shutdown even when clients do not release their
> netdev reference.
> The other adds error return values to the VLAN VID add/kill functions.

Series applied.


Re: [PATCH] can: only call can_stat_update with procfs

2016-06-22 Thread Marc Kleine-Budde
On 06/20/2016 05:51 PM, Arnd Bergmann wrote:
> The change to leave out procfs support in CAN when CONFIG_PROC_FS
> is not set was incomplete and leads to a build error:
> 
> net/built-in.o: In function `can_init':
> :(.init.text+0x9858): undefined reference to `can_stat_update'
> ERROR: "can_stat_update" [net/can/can.ko] undefined!
> 
> This tries a better approach, encapsulating all of the calls
> within IS_ENABLED(), so we also leave out the timer function
> from the object file.
> 
> Signed-off-by: Arnd Bergmann 
> Fixes: a20fadf85312 ("can: build proc support only if CONFIG_PROC_FS is activated")

Applied to can-next.

Tnx,
Marc

-- 
Pengutronix e.K.  | Marc Kleine-Budde   |
Industrial Linux Solutions| Phone: +49-231-2826-924 |
Vertretung West/Dortmund  | Fax:   +49-5121-206917- |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |





Re: [net-next,v4] openvswitch: Add packet len info to upcall.

2016-06-22 Thread David Miller
From: William Tu 
Date: Mon, 20 Jun 2016 07:26:17 -0700

> The commit f2a4d086ed4c ("openvswitch: Add packet truncation support.")
> introduces packet truncation before sending to userspace upcall receiver.
> This patch passes up the skb->len before truncation so that the upcall
> receiver knows the original packet size. Potentially this will be used
> by sFlow, where OVS translates sFlow config header=N to a sample action,
> truncating the packet to N bytes in the kernel datapath. Thus, only N bytes
> instead of the full packet size are copied from kernel to userspace, saving
> kernel-to-userspace bandwidth.
> 
> Signed-off-by: William Tu 

Applied.


Re: [PATCH] kcm: fix /proc memory leak

2016-06-22 Thread David Miller
From: Jiri Slaby 
Date: Mon, 20 Jun 2016 11:36:28 +0200

> Every open of /proc/net/kcm leaks 16 bytes of memory as is reported by
> kmemleak:
> unreferenced object 0x88059c0e3458 (size 192):
>   comm "cat", pid 1401, jiffies 4294935742 (age 310.720s)
>   hex dump (first 32 bytes):
> 28 45 71 96 05 88 ff ff 00 10 00 00 00 00 00 00  (Eq.
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>   backtrace:
> [] kmem_cache_alloc_trace+0x16e/0x230
> [] seq_open+0x79/0x1d0
> [] kcm_seq_open+0x0/0x30 [kcm]
> [] seq_open+0x79/0x1d0
> [] __seq_open_private+0x2f/0xa0
> [] seq_open_net+0x38/0xa0
> ...
> 
> It is caused by a missing free in the ->release path. So fix it by
> providing seq_release_net as the ->release method.
> 
> Signed-off-by: Jiri Slaby 
> Fixes: cd6e111bf5 (kcm: Add statistics and proc interfaces)

Applied and queued up for -stable, thanks.


Re: [PATCH net 1/1] tipc: unclone unbundled buffers before forwarding

2016-06-22 Thread David Miller
From: Jon Maloy 
Date: Mon, 20 Jun 2016 09:20:46 -0400

> When extracting an individual message from a received "bundle" buffer,
> we just create a clone of the base buffer, and adjust it to point into
> the right position of the linearized data area of the latter. This works
> well for regular message reception, but during periods of extremely high
> load it may happen that an extracted buffer, e.g, a connection probe, is
> reversed and forwarded through an external interface while the preceding
> extracted message is still unhandled. When this happens, the header or
> data area of the preceding message will be partially overwritten by a
> MAC header, leading to unpredictable consequences, such as a link
> reset.
> 
> We now fix this by ensuring that the msg_reverse() function never
> returns a cloned buffer, and that the returned buffer always contains
> sufficient valid head and tail room to be forwarded.
> 
> Reported-by: Erik Hugne 
> Acked-by: Ying Xue 
> Signed-off-by: Jon Maloy 

Applied.


Re: [PATCH net] team: Fix possible deadlock during team enslave

2016-06-22 Thread David Miller
From: Ido Schimmel 
Date: Mon, 20 Jun 2016 11:53:20 +0300

> Both dev_uc_sync_multiple() and dev_mc_sync_multiple() require the
> source device to be locked by netif_addr_lock_bh(), but this is missing
> in team's enslave function, so add it.
> 
> This fixes the following lockdep warning:
> 
> Possible interrupt unsafe locking scenario:
> 
> CPU0CPU1
> 
>lock(_xmit_ETHER/1);
> local_irq_disable();
> lock(&(&mc->mca_lock)->rlock);
> lock(&team_netdev_addr_lock_key);
>
>  lock(&(&mc->mca_lock)->rlock);
> 
>   *** DEADLOCK ***
> 
> Fixes: cb41c997d444 ("team: team should sync the port's uc/mc addrs when add a port")
> Signed-off-by: Jiri Pirko 
> Signed-off-by: Ido Schimmel 

Applied.


Re: pull-request: can 2016-06-20

2016-06-22 Thread David Miller
From: Marc Kleine-Budde 
Date: Mon, 20 Jun 2016 10:00:27 +0200

> this is a pull request of 3 patches for the upcoming linux-4.7 release.
> 
> The first patch is by Thor Thayer for the c_can/d_can driver. It fixes the
> register access on Altera Cyclone devices, which caused CAN frames to have 0x0
> in the first two bytes incorrectly. Wolfgang Grandegger's patch for the at91
> driver fixes a hanging driver under high bus load situations. A patch for the
> gs_usb driver by Maximilian Schneider adds support for the bytewerk.org
> candleLight interface.

Pulled, thanks.


Re: [PATCH 1/2] net: ethernet: sun4i-emac: use phydev from struct net_device

2016-06-22 Thread David Miller
From: Philippe Reynes 
Date: Sat, 18 Jun 2016 15:15:39 +0200

> The private structure contains a pointer to phydev, but the structure
> net_device already contains such a pointer. So we can remove the pointer
> phydev in the private structure, and update the driver to use the
> one contained in struct net_device.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH 2/2] net: ethernet: sun4i-emac: use phy_ethtool_{get|set}_link_ksettings

2016-06-22 Thread David Miller
From: Philippe Reynes 
Date: Sat, 18 Jun 2016 15:15:40 +0200

> There are two generic functions phy_ethtool_{get|set}_link_ksettings,
> so we can use them instead of defining the same code in the driver.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH 1/2] net: ethernet: bgmac: use phydev from struct net_device

2016-06-22 Thread David Miller
From: Philippe Reynes 
Date: Sun, 19 Jun 2016 22:37:05 +0200

> The private structure contains a pointer to phydev, but struct
> net_device already contains such a pointer. So we can remove the
> phydev pointer from the private structure, and update the driver to
> use the one contained in struct net_device.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH 2/2] net: ethernet: bgmac: use phy_ethtool_{get|set}_link_ksettings

2016-06-22 Thread David Miller
From: Philippe Reynes 
Date: Sun, 19 Jun 2016 22:37:06 +0200

> There are two generic functions, phy_ethtool_{get|set}_link_ksettings,
> so we can use them instead of defining the same code in the driver.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH 2/2] net: ethernet: altera_tse: use phy_ethtool_{get|set}_link_ksettings

2016-06-22 Thread David Miller
From: Philippe Reynes 
Date: Sat, 18 Jun 2016 16:37:21 +0200

> There are two generic functions, phy_ethtool_{get|set}_link_ksettings,
> so we can use them instead of defining the same code in the driver.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH 1/2] net: ethernet: altera_tse: use phydev from struct net_device

2016-06-22 Thread David Miller
From: Philippe Reynes 
Date: Sat, 18 Jun 2016 16:37:20 +0200

> The private structure contains a pointer to phydev, but struct
> net_device already contains such a pointer. So we can remove the
> phydev pointer from the private structure, and update the driver to
> use the one contained in struct net_device.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [patch 1/3 -next] liquidio: a couple indenting tweaks

2016-06-22 Thread David Miller

Dan, please don't mix bug fixes and cleanups in the same patch
series.


Re: [PATCH] net: cavium: liquidio: Use vzalloc instead of vmalloc

2016-06-22 Thread David Miller
From: Amitoj Kaur Chawla 
Date: Sat, 18 Jun 2016 12:23:20 +0530

> vzalloc combines vmalloc and memset 0.
> 
> The Coccinelle semantic patch used to make this change is as follows:
> @@
> type T;
> T *d;
> expression e;
> statement S;
> @@
> 
> d =
> -vmalloc
> +vzalloc
>  (...);
> if (!d) S
> -   memset(d, 0, sizeof(T));
> 
> Signed-off-by: Amitoj Kaur Chawla 

This does not apply cleanly to net-next.
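
For reference, the transformation the quoted semantic patch performs looks like this in kernel C (a sketch using a hypothetical `struct foo_dev`, not the actual liquidio code):

```c
/* Before: vmalloc() followed by an explicit memset() to zero. */
struct foo_dev *d = vmalloc(sizeof(struct foo_dev));
if (!d)
        return NULL;
memset(d, 0, sizeof(struct foo_dev));

/* After: vzalloc() returns virtually contiguous, already-zeroed memory,
 * so the separate memset() disappears. */
struct foo_dev *d = vzalloc(sizeof(struct foo_dev));
if (!d)
        return NULL;
```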


Re: [PATCH v2 net-next 2/2] tcp: reduce cpu usage when SO_SNDBUF is set

2016-06-22 Thread Eric Dumazet
On Wed, 2016-06-22 at 15:20 -0400, Jason Baron wrote:


> hmm, I think we need the smp_mb() there. From
> tcp_poll() we have:
> 
> 1) set_bit(SOCK_NOSPACE, ...)  (write)
> 2) smp_mb__after_atomic();
> 3) if (sk_stream_is_writeable(sk)) (read)
> 
> while in tcp_check_space() its:
> 
> 1) the state that sk_stream_is_writeable() cares about (write)
> 2) smp_mb();
> 3) if (sk->sk_socket && test_bit(SOCK_NOSPACE,...) (read)


Oh right, thanks for checking.

> 
> So if we can show that there are sufficient barriers
> for #1 (directly above), maybe it can be down-graded or
> eliminated. But it would still seem somewhat fragile.
> 
> Note I didn't observe any missing wakeups here, but I
> just wanted to make sure we didn't miss any, since they
> can be quite hard to debug.



Re: [PATCH iproute2 net-next v4 0/5] bridge: json support for fdb and vlan show

2016-06-22 Thread Jiri Pirko
Wed, Jun 22, 2016 at 08:10:47PM CEST, step...@networkplumber.org wrote:
>On Wed, 22 Jun 2016 16:53:44 +0200
>Jiri Pirko  wrote:
>
>> Wed, Jun 22, 2016 at 03:45:50PM CEST, ro...@cumulusnetworks.com wrote:
>> >From: Roopa Prabhu 
>> >
>> >This patch series adds json support for a few bridge show commands.
>> >We plan to follow up with json support for additional commands soon.  
>> 
>> I'm just curious, what is your use case for this? Apps can use rtnetlink
>> socket directly.
>
>Try using netlink in perl or python, it is quite difficult.

pyroute2? Quite easy...


Re: [patch net-next v5 0/4] return offloaded stats as default and expose original sw stats

2016-06-22 Thread Roopa Prabhu
On Tue, Jun 21, 2016 at 8:15 AM, Jiri Pirko  wrote:
> From: Jiri Pirko 
>
> The problem we try to handle is about offloaded forwarded packets
> which are not seen by the kernel. Let me try to draw it:
>
>        port1                       port2 (HW stats are counted here)
>           \                       /
>            \                     /
>             \                   /
>              --(A)-- ASIC --(B)--
>                     |
>                    (C)
>                     |
>                    CPU (SW stats are counted here)
>
>
> Now we have a couple of flows for TX and RX (direction does not matter here):
>
> 1) port1->A->ASIC->C->CPU
>
>For this flow, HW and SW stats are equal.
>
> 2) port1->A->ASIC->C->CPU->C->ASIC->B->port2
>
>For this flow, HW and SW stats are equal.
>
> 3) port1->A->ASIC->B->port2
>
>For this flow, SW stats are 0.
>
> The purpose of this patchset is to provide a facility for the user to
> find out the difference between flows 1+2 and 3. In other words, the user
> will be able to see the statistics for the slow-path (through the kernel).
>
> Also note that HW stats are what someone calls "accumulated" stats.
> Every packet counted by SW is also counted by HW. Not the other way around.
>
> By default the accumulated stats (HW) will be exposed to the user
> so the userspace apps can react properly.
>
>

Curious, how do you plan to handle virtual device counters like vlan
and vxlan stats? We can't separate CPU and HW stats there. In some
cases (or ASICs) HW counters do not include CPU generated packets;
you will have to add CPU generated packet counters to the HW counters
for such virtual device stats.

Example: in the switchdev model, for bridge vlan stats, when the user
queries bridge vlan stats, you will have to add the HW stats to the
bridge driver vlan stats and return the sum to the user.

Having a consistent model for all kinds of stats will help.


Re: [PATCH net-next 0/8] tou: Transports over UDP - part I

2016-06-22 Thread David Miller
From: David Ahern 
Date: Tue, 21 Jun 2016 22:06:01 -0600

> On 6/21/16 9:42 PM, Jerry Chu wrote:
>> Yes, TOU may lower the bar for random hacks by Joe Random. But I'd
>> argue no large organization would seriously consider or dare deploy
>> a TCP stack with random hacks.
> 
> There are userspace network stacks that have been around for years and
> widely deployed on devices that basically use Linux as the boot OS.

I'm not talking about TCP stacks that are trying to do things right.


Re: [PATCH net-next 0/8] tou: Transports over UDP - part I

2016-06-22 Thread David Miller
From: Jerry Chu 
Date: Tue, 21 Jun 2016 20:42:19 -0700

> I don't believe TOU will lead to a proliferation of TCP
> implementations in the userland - getting a solid TCP implementation
> is hard.

The fear isn't doing legitimate things.

It's making TCP stacks that do evil stuff on purpose.

Also, making proprietary TCP stacks that override the kernel one.

And finally, here's the best part, all of the above can be done as a
new, huge, potential attack vector for hackers.

All they need is to get this evil TCP stack working once, then it's
in every root kit out there.  If you can take over the TCP stack of
a several hundred thousand machine strong botnet, imagine what you
could do.

And the TOU facility... facilitates this.


Re: [PATCH v2 net-next 2/2] tcp: reduce cpu usage when SO_SNDBUF is set

2016-06-22 Thread Jason Baron



On 06/22/2016 02:51 PM, Eric Dumazet wrote:

On Wed, 2016-06-22 at 11:43 -0700, Eric Dumazet wrote:

On Wed, 2016-06-22 at 14:18 -0400, Jason Baron wrote:

For 1/2 (getting the correct memory barrier), should I re-submit
that as a separate patch?

Are you sure a full memory barrier (smp_mb()) is needed?

Maybe smp_wmb() would be enough ?

(And smp_rmb() in tcp_poll() ?)

Well, in tcp_poll() smp_mb__after_atomic() is fine as it follows
set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);

(although we might add a comment why we should keep
sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk) before the set_bit() !)

But presumably smp_wmb() would be enough in tcp_check_space()






hmm, I think we need the smp_mb() there. From
tcp_poll() we have:

1) set_bit(SOCK_NOSPACE, ...)  (write)
2) smp_mb__after_atomic();
3) if (sk_stream_is_writeable(sk)) (read)

while in tcp_check_space() its:

1) the state that sk_stream_is_writeable() cares about (write)
2) smp_mb();
3) if (sk->sk_socket && test_bit(SOCK_NOSPACE,...) (read)

So if we can show that there are sufficient barriers
for #1 (directly above), maybe it can be down-graded or
eliminated. But it would still seem somewhat fragile.

Note I didn't observe any missing wakeups here, but I
just wanted to make sure we didn't miss any, since they
can be quite hard to debug.

Thanks,

-Jason


Re: [PATCH v2 net-next 2/2] tcp: reduce cpu usage when SO_SNDBUF is set

2016-06-22 Thread Eric Dumazet
On Wed, 2016-06-22 at 11:43 -0700, Eric Dumazet wrote:
> On Wed, 2016-06-22 at 14:18 -0400, Jason Baron wrote:
> > 
> 
> > 
> > For 1/2 (getting the correct memory barrier), should I re-submit
> > that as a separate patch?
> 
> Are you sure a full memory barrier (smp_mb()) is needed?
> 
> Maybe smp_wmb() would be enough ?
> 
> (And smp_rmb() in tcp_poll() ?)

Well, in tcp_poll() smp_mb__after_atomic() is fine as it follows 
set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);

(although we might add a comment why we should keep
sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk) before the set_bit() !)

But presumably smp_wmb() would be enough in tcp_check_space()







Re: [PATCH net-next 00/18] mlx5 RoCE/RDMA packet sniffer

2016-06-22 Thread David Miller
From: Saeed Mahameed 
Date: Tue, 21 Jun 2016 16:10:26 +0300

> I would like to drop this series,

Don't worry, I dropped it several days before this.


Re: [PATCH v2 net-next 2/2] tcp: reduce cpu usage when SO_SNDBUF is set

2016-06-22 Thread Eric Dumazet
On Wed, 2016-06-22 at 14:18 -0400, Jason Baron wrote:
> 

> 
> For 1/2 (getting the correct memory barrier), should I re-submit
> that as a separate patch?

Are you sure a full memory barrier (smp_mb()) is needed?

Maybe smp_wmb() would be enough ?

(And smp_rmb() in tcp_poll() ?)





[RFC] UAPI for 6lowpan

2016-06-22 Thread Alexander Aring
Hi all,

I am currently looking for a nice solution for adding a UAPI for 6LoWPAN.
The current 6LoWPAN UAPI is some dev-hacking stuff in debugfs [0].

Marcel told me that before I try to send upstream patches for radvd [1] I
should definitely introduce a stable UAPI concept.

The question here is: what would be the right UAPI? Maybe extend
existing ones, e.g. RTNL?

I would identify the following three cases where we need a UAPI in
different 6LoWPAN handling layers; for all cases we want to use netlink
of course:

 - 6LoWPAN specific only:

This is 6LoWPAN specific only and has nothing to do with L2 (e.g.
802.15.4 or BTLE) and L3 (IPv6).

As an example, this would be a per-interface context table for stateful
compression. This context table contains IPv6 prefixes, and 6LoWPAN can
compress L3 address information by some context table lookup.

This context table needs to be accessible from userspace.

Question:
What would be the right netlink UAPI here?

In this case I think I cannot use RTNL, because RTNL is also bound to
PF_*, which is in our case PF_INET6. RTNL is also a very generic UAPI, and
adding "IPv6 prefixes for some compression algorithm" isn't something
which can be found in other PF_* or existing RTNL commands.

I would use our own netlink implementation here, "nl6lowpan" - or does RTNL
have some mechanism to provide such exotic settings per device type?

 - 6LoWPAN link-layer specific:

Sometimes we need L2 information for the 6LoWPAN interface, e.g. the short
address for 802.15.4 6LoWPAN.

In case of 802.15.4 6LoWPAN we create the 6LoWPAN interface like the
following ip command:

ip link add link wpan0 name lowpan0 type lowpan

Here wpan0 is the L2 device, which doesn't have the capability to run IPv6
on it.

So theoretically we could get from the lowpan0 interface (6LoWPAN
interface) the ifindex of wpan0 interface and get the necessary L2
information from another netlink API (nl802154).

But this would only work for the 802.15.4 case; BTLE works differently and
uses a different L2 UAPI.

I think we should simply ignore such theoretical handling of the L2
netlink API from the upper 6LoWPAN interface and provide the L2 information
directly in the "6LoWPAN netlink UAPI", e.g. the short address netlink
attribute will be NULL for BTLE 6LoWPAN interfaces.

 - IPv6 Layer but 6LoWPAN specific:

These are settings which are 6LoWPAN specific but sit in the L3 layer
(IPv6).

As an example, this would be "6LoWPAN neighbour entry parameters", which
could also be L2 specific (e.g. short address on 802.15.4) or not.

In this case I would use the existing neighbour cache UAPI. I already tried
to implement such a UAPI by introducing a NDA_PRIVATE nested attribute
which depends on the device type. Here is the draft (dumping functionality
only):

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 890158e..650b558 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1222,6 +1222,8 @@ struct net_device_ops {
netdev_features_t features);
int (*ndo_neigh_construct)(struct neighbour *n);
void(*ndo_neigh_destroy)(struct neighbour *n);
+   int (*ndo_neigh_fill_info)(struct sk_buff *skb,
+  struct neighbour *n);
 
int (*ndo_fdb_add)(struct ndmsg *ndm,
   struct nlattr *tb[],
@@ -1704,6 +1706,7 @@ struct net_device {
unsigned char   addr_assign_type;
unsigned char   addr_len;
unsigned short  neigh_priv_len;
+   unsigned short  neigh_priv_nlmsg_len;
unsigned short  dev_id;
unsigned short  dev_port;
spinlock_t  addr_list_lock;
diff --git a/include/uapi/linux/neighbour.h b/include/uapi/linux/neighbour.h
index bd99a8d..0072008 100644
--- a/include/uapi/linux/neighbour.h
+++ b/include/uapi/linux/neighbour.h
@@ -15,6 +15,13 @@ struct ndmsg {
 };
 
 enum {
+   NDA_6LOWPAN_802154_SHORT_ADDR,
+   __NDA_6LOWPAN_MAX
+};
+
+#define NDA_6LOWPAN_MAX (__NDA_6LOWPAN_MAX - 1)
+
+enum {
NDA_UNSPEC,
NDA_DST,
NDA_LLADDR,
@@ -26,6 +33,7 @@ enum {
NDA_IFINDEX,
NDA_MASTER,
NDA_LINK_NETNSID,
+   NDA_PRIVATE,
__NDA_MAX
 };
 
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 29dd8cc..b6b3abb 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -2153,6 +2153,7 @@ static int neigh_fill_info(struct sk_buff *skb, struct 
neighbour *neigh,
unsigned long now = jiffies;
struct nda_cacheinfo ci;
struct nlmsghdr *nlh;
+   struct nlattr *nest;
struct ndmsg *ndm;
 
nlh = nlmsg_put(skb, pid, seq, type, sizeof(*ndm), flags);
@@ -2182,6 +2183,17 @@ static int neigh_fill_info(struct sk_buff *skb, struct 
neighbour *neigh,
}
}
 
+   if (neigh->dev->netdev_op

Re: [PATCH iproute2 net-next v4 0/5] bridge: json support for fdb and vlan show

2016-06-22 Thread Roopa Prabhu
On Wed, Jun 22, 2016 at 11:10 AM, Stephen Hemminger
 wrote:
> On Wed, 22 Jun 2016 16:53:44 +0200
> Jiri Pirko  wrote:
>
>> Wed, Jun 22, 2016 at 03:45:50PM CEST, ro...@cumulusnetworks.com wrote:
>> >From: Roopa Prabhu 
>> >
>> >This patch series adds json support for a few bridge show commands.
>> >We plan to follow up with json support for additional commands soon.
>>
>> I'm just curious, what is your use case for this? Apps can use rtnetlink
>> socket directly.
>
> Try using netlink in perl or python, it is quite difficult.

yep, ++


[PATCH] mellanox: mlx5: Use logging functions to reduce text ~10k/5%

2016-06-22 Thread Joe Perches
The logging macros create a bit of duplicated code/text.

Use specialized functions to reduce the duplication.

(defconfig/x86-64)
$ size drivers/net/ethernet/mellanox/mlx5/core/built-in.o*
   text    data  bss     dec    hex  filename
 178634    2059   16  180709  2c1e5  drivers/net/ethernet/mellanox/mlx5/core/built-in.o.new
 188679    2059   16  190754  2e922  drivers/net/ethernet/mellanox/mlx5/core/built-in.o.old

The output changes now do not include line #,
but do include the function offset.

Signed-off-by: Joe Perches 
---
 drivers/net/ethernet/mellanox/mlx5/core/main.c | 34 ++
 .../net/ethernet/mellanox/mlx5/core/mlx5_core.h| 13 +++--
 2 files changed, 38 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index a19b593..34cbaf0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1557,3 +1557,37 @@ static void __exit cleanup(void)
 
 module_init(init);
 module_exit(cleanup);
+
+void mlx5_core_err(struct mlx5_core_dev *dev, const char *fmt, ...)
+{
+   struct va_format vaf;
+   va_list args;
+
+   va_start(args, fmt);
+
+   vaf.fmt = fmt;
+   vaf.va = &args;
+
+   dev_err(&dev->pdev->dev, "%s:%pS:(pid %d): %pV",
+   dev->priv.name, __builtin_return_address(0), current->pid,
+   &vaf);
+
+   va_end(args);
+}
+
+void mlx5_core_warn(struct mlx5_core_dev *dev, const char *fmt, ...)
+{
+   struct va_format vaf;
+   va_list args;
+
+   va_start(args, fmt);
+
+   vaf.fmt = fmt;
+   vaf.va = &args;
+
+   dev_warn(&dev->pdev->dev, "%s:%pS:(pid %d): %pV",
+    dev->priv.name, __builtin_return_address(0), current->pid,
+    &vaf);
+
+   va_end(args);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h 
b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 2f86ec6..31430a7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -57,15 +57,10 @@ do {
\
    mlx5_core_dbg(__dev, format, ##__VA_ARGS__);\
 } while (0)
 
-#define mlx5_core_err(__dev, format, ...)  \
-   dev_err(&(__dev)->pdev->dev, "%s:%s:%d:(pid %d): " format,  \
-      (__dev)->priv.name, __func__, __LINE__, current->pid,\
-      ##__VA_ARGS__)
-
-#define mlx5_core_warn(__dev, format, ...) \
-   dev_warn(&(__dev)->pdev->dev, "%s:%s:%d:(pid %d): " format, \
-   (__dev)->priv.name, __func__, __LINE__, current->pid,   \
-   ##__VA_ARGS__)
+__printf(2, 3)
+void mlx5_core_err(struct mlx5_core_dev *dev, const char *fmt, ...);
+__printf(2, 3)
+void mlx5_core_warn(struct mlx5_core_dev *dev, const char *fmt, ...);
 
 #define mlx5_core_info(__dev, format, ...) \
    dev_info(&(__dev)->pdev->dev, format, ##__VA_ARGS__)




Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support

2016-06-22 Thread Yuval Mintz
>>>> This series adds driver support for the processing of tunnel
>>>> [specifically vxlan/geneve/gre tunnels] packets which are
>>>> aggregated [GROed] by the hardware before the driver passes
>>>> such packets up to the stack.
>>> First off I am pretty sure this isn't GRO.  This is LRO.
>> Nopes. This is GRO - each MSS-sized frame will arrive on its
>> own frag, whereas the headers [both  internal & external]
>> would arrive on the linear part of the SKB.

> No it is LRO, it just very closely mimics GRO.  If it is in hardware
> it is LRO.  GRO is extensible in software and bugs can be fixed in
> kernel, LRO is not extensible and any bugs in it are found in hardware
> or the driver and would have to be fixed there.  It all comes down to
> bug isolation.  If we find that disabling LRO makes the bug go away we
> know the hardware aggregation has an issue.  Both features should not
> operate on the same bit.  Otherwise when some bug is found in your
> implementation it will be blamed on GRO when the bug is actually in
> LRO.
I'm not aware of a definition stating GRO *has* to be extensible in SW;
AFAIK the LRO/GRO distinction revolves around multiple areas [criteria
for aggregation, method of aggregation, etc.].

> I realize this causes some pain when routing or bridging as LRO is
> disabled but that is kind of the point.  We don't want the hardware to
> be mangling frames when we are attempting to route packets between any
> two given devices.
Actually, while I might disagree on whether this is LRO/GRO, I don't think
there's any problem for us to base this on the LRO feature - i.e., when we
started working on qede we didn't bother implementing LRO as we understood
it was deprecated, but this encapsulated aggregation is configurable. If it
makes life easier for everyone, we can base the configuration on the LRO
configuration, so that when LRO is disabled we won't have this turned on;
that can be easily done.

>>> Also I don't know if you have been paying attention to recent
>>> discussions on the mailing list but the fact is GRO over UDP tunnels
>>> is still a subject for debate.  This patch takes things in the
>>> opposite direction of where we are currently talking about going with
>>> GRO.  I've added Hannes and Tom to this discussion just to make sure I
>>> have the proper understanding of all this as my protocol knowledge is
>>> somewhat lacking.
>> I guess we're on the exact opposite side of the discussion - I.e., we're
>> the vendor that tries pushing offload capabilities to the device.
>> Do notice however that we're not planning on pushing anything new
>> feature-wise, just like we haven't done anything for regular TCP GRO -
>> All we do is allow our HW/FW to aggregate packets instead of stack.

> If you have been following the list then arguably you didn't fully
> understand what has been going on.  I just went through and cleaned up
> all the VXLAN, GENEVE, and VXLAN-GPE mess that we have been creating
> and tried to get this consolidated so that we could start to have
> conversations about this without things being outright rejected.  I
> feel like you guys have just prooven the argument for the other side
> that said as soon as we start support any of it, the vendors were
> going to go nuts and try to stuff everything and the kitchen sink into
> the NICs.  The fact is I am starting to think they were right.
You hurt my feelings; I started going nuts ages ago ;-)
But seriously, this isn't really anything new but rather a step forward in
the direction we've already taken - bnx2x/qede are already performing
the same for non-encapsulated TCP.
And while I understand why you're being suspicious of such an addition,
I don't entirely see how it affects any of that discussion - I already conceded
that we can make this configurable, so if any routing decision were
added that results in packets NOT supposed to be aggregated, the feature
can be turned off [at worst, at the expense of it not having its benefit
on other connections].

>>> Ideally we need to be able to identify that a given packet terminates
>>> on a local socket in our namespace before we could begin to perform
>>> any sort of mangling on the local packets.  It is always possible that
>>> we could be looking at a frame that uses the same UDP port but is not
>>> the tunnel protocol if we are performing bridging or routing.  The
>>> current GRO implementation fails in that regard and there are
>>> discussions between several of us on how to deal with that.  It is
>>> likely that we would be forcing GRO for tunnels to move a bit further
>>> up the stack if bridging or routing so that we could verify that the
>>> frame is not being routed back out before we could aggregate it.

>> I'm aware of the on-going discussion, but I'm not sure this should
>> bother us greatly - the aggregation is done based on the
>> inner TCP connection; I.e., it's guaranteed all the frames belong to
>> the same TCP connection. While HW requires the UDP port for the
>> init

Re: [PATCH v2 net-next 2/2] tcp: reduce cpu usage when SO_SNDBUF is set

2016-06-22 Thread Jason Baron



On 06/22/2016 01:34 PM, Eric Dumazet wrote:

On Wed, 2016-06-22 at 11:32 -0400, Jason Baron wrote:

From: Jason Baron 

When SO_SNDBUF is set and we are under tcp memory pressure, the effective
write buffer space can be much lower than what was set using SO_SNDBUF. For
example, we may have set the buffer to 100kb, but we may only be able to
write 10kb. In this scenario poll()/select()/epoll(), are going to
continuously return POLLOUT, followed by -EAGAIN from write(), and thus
result in a tight loop. Note that epoll in edge-triggered mode does not have
this issue since it only triggers once data has been ack'd. There is no
issue here when SO_SNDBUF is not set, since the tcp layer will auto tune
the sk->sndbuf.


Still, generating one POLLOUT event per incoming ACK will not please
epoll() users in edge-trigger mode.

Host is under global memory pressure, so we probably want to drain
socket queues _and_ reduce cpu pressure.

A strategy to ensure all sockets converge to small amounts ASAP is simply
the best answer.

Letting big SO_SNDBUF offenders hog memory while their queue is big
is not going to help sockets that cannot get ACKs
(elephants get more ACKs than mice, so they have more chance to succeed
in their new allocations).

Your patch adds a lot of complex logic in tcp_sendmsg() and
tcp_sendpage().


I would prefer a simpler patch like :




Ok, fair enough. I'm going to assume that you will submit this as
a formal patch.

For 1/2 (getting the correct memory barrier), should I re-submit
that as a separate patch?

Thanks,

-Jason



Re: [PATCH net-next V2] tun: introduce tx skb ring

2016-06-22 Thread Michael S. Tsirkin
On Fri, Jun 17, 2016 at 03:41:20AM +0300, Michael S. Tsirkin wrote:
> Would it help to have a ptr_ring_resize that gets an array of
> rings and resizes them all to the same length?

OK, here it is. Untested so far, and no skb wrapper.
Pls let me know whether this is what you had in mind.

-->

ptr_ring: support resizing multiple queues

Signed-off-by: Michael S. Tsirkin 

---

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index a29b023..e576801 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -354,20 +354,14 @@ static inline int ptr_ring_init(struct ptr_ring *r, int 
size, int pad, gfp_t gfp
return 0;
 }
 
-static inline int ptr_ring_resize(struct ptr_ring *r, int size, gfp_t gfp,
- void (*destroy)(void *))
+static inline void **__ptr_ring_swap_queue(struct ptr_ring *r, void **queue,
+  int size, gfp_t gfp,
+  void (*destroy)(void *))
 {
-   unsigned long flags;
int producer = 0;
-   void **queue = __ptr_ring_init_queue_alloc(size, gfp);
void **old;
void *ptr;
 
-   if (!queue)
-   return -ENOMEM;
-
-   spin_lock_irqsave(&(r)->producer_lock, flags);
-
while ((ptr = ptr_ring_consume(r)))
if (producer < size)
queue[producer++] = ptr;
@@ -380,6 +374,23 @@ static inline int ptr_ring_resize(struct ptr_ring *r, int 
size, gfp_t gfp,
old = r->queue;
r->queue = queue;
 
+   return old;
+}
+
+static inline int ptr_ring_resize(struct ptr_ring *r, int size, gfp_t gfp,
+ void (*destroy)(void *))
+{
+   unsigned long flags;
+   void **queue = __ptr_ring_init_queue_alloc(size, gfp);
+   void **old;
+
+   if (!queue)
+   return -ENOMEM;
+
+   spin_lock_irqsave(&(r)->producer_lock, flags);
+
+   old = __ptr_ring_swap_queue(r, queue, size, gfp, destroy);
+
spin_unlock_irqrestore(&(r)->producer_lock, flags);
 
kfree(old);
@@ -387,6 +398,49 @@ static inline int ptr_ring_resize(struct ptr_ring *r, int 
size, gfp_t gfp,
return 0;
 }
 
+static inline int ptr_ring_resize_multiple(struct ptr_ring **rings, int nrings,
+  int size,
+  gfp_t gfp, void (*destroy)(void *))
+{
+   unsigned long flags;
+   void ***queues;
+   int i;
+
+   queues = kmalloc(nrings * sizeof *queues, gfp);
+   if (!queues)
+   goto noqueues;
+
+   for (i = 0; i < nrings; ++i) {
+   queues[i] = __ptr_ring_init_queue_alloc(size, gfp);
+   if (!queues[i])
+   goto nomem;
+   }
+
+   for (i = 0; i < nrings; ++i) {
+   spin_lock_irqsave(&(rings[i])->producer_lock, flags);
+   queues[i] = __ptr_ring_swap_queue(rings[i], queues[i],
+ size, gfp, destroy);
+   spin_unlock_irqrestore(&(rings[i])->producer_lock, flags);
+   }
+
+   for (i = 0; i < nrings; ++i)
+   kfree(queues[i]);
+
+   kfree(queues);
+
+   return 0;
+
+nomem:
+   while (--i >= 0)
+   kfree(queues[i]);
+
+   kfree(queues);
+
+noqueues:
+   return -ENOMEM;
+}
+
 static inline void ptr_ring_cleanup(struct ptr_ring *r, void (*destroy)(void 
*))
 {
void *ptr;
diff --git a/tools/virtio/ringtest/ptr_ring.c b/tools/virtio/ringtest/ptr_ring.c
index 26dc1d2..deb36af 100644
--- a/tools/virtio/ringtest/ptr_ring.c
+++ b/tools/virtio/ringtest/ptr_ring.c
@@ -17,6 +17,11 @@
 typedef pthread_spinlock_t  spinlock_t;
 
 typedef int gfp_t;
+static void *kmalloc(unsigned size, gfp_t gfp)
+{
+   return memalign(64, size);
+}
+
 static void *kzalloc(unsigned size, gfp_t gfp)
 {
void *p = memalign(64, size);
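
A caller-side sketch of the new helper (hypothetical names throughout; the skb wrapper mentioned above does not exist yet):

```c
/* Resize the tx rings of all queues of a device to the same new size.
 * Entries that no longer fit are released through the destroy callback. */
static void example_destroy(void *ptr)
{
        kfree_skb(ptr);
}

static int example_resize_all(struct example_queue **queues, int nqueues,
                              int new_size)
{
        struct ptr_ring *rings[EXAMPLE_MAX_QUEUES];
        int i;

        for (i = 0; i < nqueues; i++)
                rings[i] = &queues[i]->tx_ring;

        return ptr_ring_resize_multiple(rings, nqueues, new_size,
                                        GFP_KERNEL, example_destroy);
}
```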


Re: [PATCH RFC] sched: split classification and enqueue

2016-06-22 Thread Stephen Hemminger
On Wed, 22 Jun 2016 12:03:55 +0200
Florian Westphal  wrote:

> Currently, classification and enqueue are done in a single step.
> 
> The core acquires the qdisc lock, then calls the ->enqueue() function
> of the qdisc.
> 
> It's the job of the qdisc and its attached classifiers to figure out what
> to do next.
> 
> Typically the enqueue function will call tc_classify() to lookup a
> child class, then call ->enqueue of the child qdisc.
> 
> This can repeat a number of times until a leaf qdisc is reached; this leaf
> will do the real enqueue operation (pfifo for example).
> 
> While this approach gives qdiscs and the classifier/action subsystem
> a lot of control, it has one major drawback:  The root qdisc lock
> is held for a prolonged period of time while we recurse through
> the qdisc hierarchy from root to leaf.
> 
> This (unfinished!) hack splits classification and enqueue into
> two steps.
> 
> Before enqueueing the packet and *before* acquiring the root qdisc lock,
> the new qdisc ->classify() function is invoked.
> 
> This function -- much like enqueue in the current scheme -- looks up
> a child class and/or determines the next qdisc where the packet needs
> to be handled via the classifier action subsystem.
> Then it invokes the new ->classify() hook of the child, which can repeat
> until the leaf qdisc is reached.
> 
> Whenever a classify operation has succeeded, each classify "level"
> stores itself (qdisc) and a cookie (typically just a pointer to the
> class) in the newly added Qdisc_classify_state struct.
> 
> After qdisc_classify returns, this struct contains the path that
> the skb is supposed to take through the qdisc hierarchy in *reverse*
> order, i.e. starting from the leaf up to the root.
> 
> Then, the root qdisc lock is taken and the *leaf* (first entry in
> Qdisc_classify_state) ->enqueue function is invoked.
> 
> If this succeeds, the kernel will invoke the new ->senqeue
> function for all the remaining entries in Qdisc_classify_state.
> 
> This function is responsible to perform needed qdisc internal actions,
> f.e. scheduling a class for transmit.
> 
> This reduces the amount of work done under qdisc root lock.
> Very simple test w. hfsc + u32 filter, gso off, 10G link and 20 netperf
> TCP_STREAM:
> 
>   before: 8.2 GB/s
>   after: 9.4 GB/s (same as w. mq+fq_codel)
> 
> Known issues with this patch:
>  - only converted a few qdiscs
>  - qdisc stats will race since they're changed outside of root lock.
>  - need to protect vs. class removal during classify ops
>  - mirred needs some work for this (but not worse than current situation)
>  - use-after-free in some places, need to ensure that skb remains valid
>iff enqueue returns != DROP.
>  - senqeue is a horrid name.  It is probably better to split qdiscs
>into leaf and non-leaf ones, where leaf qdiscs implement ->enqueue() and
>the others a notify_enqueue() (which doesn't return a value).
> 
> So far I did not see any fundamental issues with this approach.
> 
> If you see any, I'd be happy to learn about them before I spend more
> cycles on this.  The main work to be done is to convert qstats to
> percpu counters and then convert all the qdiscs to this new scheme,
> and of course to extend this to all intree schedulers.
> 
> If you attend NF workshop in Amsterdam feel free to yell at me in person 
> there.
> ---
>  include/net/sch_generic.h | 50 +
>  net/core/dev.c| 10 ++-
>  net/sched/sch_drr.c   | 38 +++--
>  net/sched/sch_fq_codel.c  | 71 
> +--
>  net/sched/sch_generic.c   | 32 +
>  net/sched/sch_hfsc.c  | 38 +++--
>  6 files changed, 232 insertions(+), 7 deletions(-)
> 
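
The two-step flow described in the RFC can be sketched as pseudocode (the struct and hook names follow the text above, but the signatures and helpers are invented for illustration):

```
struct Qdisc_classify_state state;

/* Step 1: walk the hierarchy root -> leaf WITHOUT the root qdisc lock,
 * recording a (qdisc, cookie) pair for each level of the chosen path. */
if (qdisc_classify(root, skb, &state) != 0)
        goto drop;

/* Step 2: take the root lock once, do the real enqueue at the leaf,
 * then let each ancestor perform its internal actions (leaf -> root). */
spin_lock(qdisc_lock(root));
ret = leaf(&state)->enqueue(skb, leaf(&state));   /* e.g. pfifo */
if (ret == NET_XMIT_SUCCESS)
        for_each_recorded_level(qdisc, cookie, &state)   /* pseudocode */
                qdisc->senqeue(skb, cookie);   /* e.g. schedule the class */
spin_unlock(qdisc_lock(root));
```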

The architecture of the Linux packet scheduler makes this too racy.
Classification can be attached to different nodes in the hierarchy, and the
hierarchy could change underneath.

It is a good idea, but maybe it would be better to have a separate
classification step before the scheduler (using BPF) and put meta-data in
the skb. That way it could be safely lockless and atomic.




Re: rstpd implementation

2016-06-22 Thread Stephen Hemminger
On Wed, 22 Jun 2016 12:44:52 -0500
ebied...@xmission.com (Eric W. Biederman) wrote:

> Phil  writes:
> 
> > Hi,
> >
> > When looking for an RSTP daemon I found Stephen Hemminger's
> > git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/rstp.git
> > 
> > with its last commit from October 2011.
> >
> > Is this implementation still in good use by anybody - or has it been
> > replaced/superseded by another implementation?
> 
> I don't know, and when you get into user space daemons they aren't much
> talked about on the kernel lists.  That said, you will likely fare better
> on the netdev list (cc'd).
> 
> Eric

The current one I recommend is the MSTPd done by Cumulus:
 https://sourceforge.net/p/mstpd/wiki/Home/
But like all projects, they could use help.


Re: [PATCH iproute2 net-next v4 0/5] bridge: json support for fdb and vlan show

2016-06-22 Thread Stephen Hemminger
On Wed, 22 Jun 2016 16:53:44 +0200
Jiri Pirko  wrote:

> Wed, Jun 22, 2016 at 03:45:50PM CEST, ro...@cumulusnetworks.com wrote:
> >From: Roopa Prabhu 
> >
> >This patch series adds json support for a few bridge show commands.
> >We plan to follow up with json support for additional commands soon.  
> 
> I'm just curious, what is you use case for this? Apps can use rtnetlink
> socket directly.

Try using netlink from perl or python; it is quite difficult.


Re: rstpd implementation

2016-06-22 Thread Eric W. Biederman
Phil  writes:

> Hi,
>
> When looking for an RSTP daemon I found Stephen Hemminger's
> git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/rstp.git
> 
> with its last commit from October 2011.
>
> Is this implementation still in good use by anybody - or has it been
> replaced/superseded by another implementation?

I don't know, and when you get into user space daemons they aren't much
talked about on the kernel lists.  That said, you will likely fare better
on the netdev list (cc'd).

Eric


Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support

2016-06-22 Thread Alexander Duyck
On Wed, Jun 22, 2016 at 10:16 AM, Yuval Mintz  wrote:
>>> This series adds driver support for the processing of tunnel
>>> [specifically vxlan/geneve/gre tunnels] packets which are
>>> aggregated [GROed] by the hardware before driver passes
>>> such packets up to the stack.
>
>> First off I am pretty sure this isn't GRO.  This is LRO.
> Nopes. This is GRO - each MSS-sized frame will arrive on its
> own frag, whereas the headers [both  internal & external]
> would arrive on the linear part of the SKB.

No, it is LRO; it just very closely mimics GRO.  If it is in hardware,
it is LRO.  GRO is extensible in software and bugs can be fixed in the
kernel; LRO is not extensible, and any bugs in it live in the hardware
or the driver and would have to be fixed there.  It all comes down to
bug isolation.  If we find that disabling LRO makes a bug go away, we
know the hardware aggregation has an issue.  The two features should not
operate on the same bit.  Otherwise, when some bug is found in your
implementation it will be blamed on GRO when the bug is actually in
LRO.

I realize this causes some pain when routing or bridging as LRO is
disabled but that is kind of the point.  We don't want the hardware to
be mangling frames when we are attempting to route packets between any
two given devices.

>> Also I don't know if you have been paying attention to recent
>> discussions on the mailing list but the fact is GRO over UDP tunnels
>> is still a subject for debate.  This patch takes things in the
>> opposite direction of where we are currently talking about going with
>> GRO.  I've added Hannes and Tom to this discussion just to make sure I
>> have the proper understanding of all this as my protocol knowledge is
>> somewhat lacking.
> I guess we're on the exact opposite side of the discussion - I.e., we're
> the vendor that tries pushing offload capabilities to the device.
> Do notice however that we're not planning on pushing anything new
> feature-wise, just like we haven't done anything for regular TCP GRO -
> All we do is allow our HW/FW to aggregate packets instead of stack.

If you have been following the list, then arguably you didn't fully
understand what has been going on.  I just went through and cleaned up
all the VXLAN, GENEVE, and VXLAN-GPE mess that we have been creating
and tried to get this consolidated so that we could start to have
conversations about this without things being outright rejected.  I
feel like you guys have just proven the argument for the other side
that said as soon as we start supporting any of it, the vendors were
going to go nuts and try to stuff everything and the kitchen sink into
the NICs.  The fact is I am starting to think they were right.

>> Ideally we need to be able to identify that a given packet terminates
>> on a local socket in our namespace before we could begin to perform
>> any sort of mangling on the local packets.  It is always possible that
>> we could be looking at a frame that uses the same UDP port but is not
>> the tunnel protocol if we are performing bridging or routing.  The
>> current GRO implementation fails in that regard and there are
>> discussions between several of us on how to deal with that.  It is
>> likely that we would be forcing GRO for tunnels to move a bit further
>> up the stack if bridging or routing so that we could verify that the
>> frame is not being routed back out before we could aggregate it.

> I'm aware of the on-going discussion, but I'm not sure this should
> bother us greatly - the aggregation is done based on the
> inner TCP connection; I.e., it's guaranteed all the frames belong to
> the same TCP connection. While HW requires the UDP port for the
> initial classification effort, it doesn't take decision solely on that.

I realize that the UDP port is only one piece of this.  Exactly which
fields are being taken into account?  That information would be useful,
as we could then go through and determine the probability of a false
match between any two packets in a given UDP flow.

Also I would still argue this is LRO.  If we are doing routing it
should be disabled for us to later re-enable if we feel it is safe.
Having a feature like this enabled by default with GRO is a really bad
idea.

>> Also I would be interested in knowing how your hardware handles
>> tunnels with outer checksums.  Is it just ignoring the frames in such
>> a case, ignoring the checksum, or is it actually validating the frames
>> and then merging the resulting checksum?

> HW is validating both inner and outer csum; if it wouldn't be able to
> due to some limitation, it would not try to pass the packet as a GRO
> aggregation but rather as regular separated packets.
> But I don't believe it merges the checksums, only validates each
> independently [CSUM_UNNECESSARY].

So it is mangling the packets then.  If this were flagged as LRO at
least if bridging or routing is enabled you won't be mangling the
frames without the user having to specifically go back through and

Re: 802.3ad bonding aggregator reselection

2016-06-22 Thread Veli-Matti Lintu
2016-06-22 3:49 GMT+03:00 Jay Vosburgh :
>
> Veli-Matti Lintu  wrote:
> [...]
>>>> The ports are configured in switch settings (HP Procurve 2530-48G) in
>>>> same trunk group (TrkX) and trunk group type is set as LACP.
>>>> /proc/net/bonding/bond0 also shows that the three ports belong to same
>>>> aggregator and bandwidth tests also support this. In my understanding
>>>> Procurve's trunk group is pretty much the same as etherchannel in
>>>> Cisco's terminology. The bonded link comes always up properly, but
>>>> handling of links going down is the problem. Are there known
>>>> differences between different vendors there?
>>>
>>> I did the original LACP reselection testing on a Cisco switch,
>>> but I have an HP 2530 now; I'll test it later today or tomorrow and see
>>> if it behaves properly, and whether your proposed patch is needed.
>>
>>Thanks for taking a look at this. Here are some more details about the
>>setup as Zhu Yanjun also requested.
>
> Summary (because anything involving a standard tends to get long
> winded):
>
> This is not a switch problem.  Bonding appears to be following
> the standard in this case.  I've identified when this behavior changed,
> and I think we should violate the standard in this case for ad_select
> set to "bandwidth" or "count," neither of which is the default value.
>
> Long winded version:
>
> I've reproduced the issue locally, and it does not appear to be
> anything particular to the switch.  It appears to be due to changes from
>
> commit 7bb11dc9f59ddcb33ee317da77b235235aaa582a
> Author: Mahesh Bandewar 
> Date:   Sat Oct 31 12:45:06 2015 -0700
>
> bonding: unify all places where actor-oper key needs to be updated.
>
> Specifically this block:
>
>  void bond_3ad_handle_link_change(struct slave *slave, char link)
> [...]
> -   /* there is no need to reselect a new aggregator, just signal the
> -* state machines to reinitialize
> -*/
> -   port->sm_vars |= AD_PORT_BEGIN;
>
> Previously, setting BEGIN would cause the port in question to be
> reinitialized, which in turn would trigger reselection.
>
> I'm not sure that adding this section back is the correct fix
> from the point of view of the standard, however, as 802.1AX 5.2.3.1.2
> defines BEGIN as:
>
> A Boolean variable that is set to TRUE when the System is
> initialized or reinitialized, and is set to FALSE when
> (re-)initialization has completed.
>
> and in this case we're not reinitializing the System (i.e., the
> bond).
>
> Further, 802.1AX 5.4.12 says:
>
> If the port becomes inoperable and a BEGIN event has not
> occurred, the state machine enters the PORT_DISABLED
> state. Partner_Oper_Port_State.Synchronization is set to
> FALSE. This state allows the current Selection state to remain
> undisturbed, so that, in the event that the port is still
> connected to the same Partner and Partner port when it becomes
> operable again, there will be no disturbance caused to higher
> layers by unneccessary re-configuration.
>
> At the moment, bonding is doing what 5.4.12 specifies, by
> placing the port into PORT_DISABLED state.  bond_3ad_handle_link_change
> clears port->is_enabled, which causes ad_rx_machine to clear
> AD_PORT_MATCHED but leave AD_PORT_SELECTED set.  This in turn cause the
> selection logic to skip this port, resulting in the observed behavior
> (that the port is link down, but stays in the aggregator).
>
> Bonding will still remove the slave from the bond->slave_arr, so
> it won't actually try to send on this slave.  I'll further note that
> 802.1AX 5.4.7 defines port_enabled as:
>
> A variable indicating that the physical layer has indicated that
> the link has been established and the port is operable.
> Value: Boolean
> TRUE if the physical layer has indicated that the port is operable.
> FALSE otherwise.
>
> So, it appears that bonding is in conformance with the standard
> in this case.

I haven't done extensive testing on this, but I haven't noticed
anything that would indicate that anything is sent to failed ports. So
this part should be working.

> I don't see an issue with the above behavior when ad_select is
> set to the default value of "stable"; bonding does reselect a new
> aggregator when all links fail, and it appears to follow the standard.

In my testing, ad_select=stable does not reselect a new aggregator when
all links have failed.  Reselection seems to occur only when a link
comes up after the failure.  Here's an example of two bonds with three
links each.  Aggregator ID 3 is active with three ports, and ID 2 also
has three ports up.


802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: 0c:c4:7a:34:c7:f1
Active Aggregator Info:
Aggregator ID: 

Re: [PATCH RFC] sched: split classification and enqueue

2016-06-22 Thread Florian Westphal
Alexei Starovoitov  wrote:
> On Wed, Jun 22, 2016 at 3:03 AM, Florian Westphal  wrote:
> > Currently classification and enqueue is done in a single step.
> >
> > core acquires the qdisc lock, then calls the ->enqueue() function
> > of the qdisc.
> >
> > It's the job of the qdisc and its attached classifiers to figure out what
> > to do next.
> >
> > Typically the enqueue function will call tc_classify() to lookup a
> > child class, then call ->enqueue of the child qdisc.
> >
> > This can repeat a number of times until a leaf qdisc is reached; this leaf
> > will do the real enqueue operation (pfifo for example).
> >
> > While this approach gives qdiscs and the classifier/action subsystem
> > a lot of control, it has one major drawback:  The root qdisc lock
> > is held for a prolonged period of time while we recurse through
> > the qdisc hierarchy from root to leaf.
> >
> > This (unfinished!) hack splits classification and enqueue into
> > two steps.
> >
> > Before enqueueing the packet and *before* acquiring the root qdisc lock,
> > the new qdisc ->classify() function is invoked.
> 
> I believe John is finalizing his lockless qdisc patches...
> would this split still be needed after qdiscs become lockless?

The RFC series I saw from John did not make the qdiscs themselves
lockless; it did, however, allow adding qdiscs that can tell the stack
not to grab the qdisc root lock
[ http://patchwork.ozlabs.org/patch/561780/ ]

Some of the patches in his old RFC series however add percpu counters
etc. which would be needed for this too.

So AFAIU the two approaches complement one another and are not
mutually exclusive.  For a lot of existing schedulers, some kind of
central lock is required since the qdisc manages a single resource (but
we might be able to move some of that work out of the locked section).


Re: [PATCH iproute2 net-next v4 0/5] bridge: json support for fdb and vlan show

2016-06-22 Thread Roopa Prabhu
On Wed, Jun 22, 2016 at 7:53 AM, Jiri Pirko  wrote:
> Wed, Jun 22, 2016 at 03:45:50PM CEST, ro...@cumulusnetworks.com wrote:
>>From: Roopa Prabhu 
>>
>>This patch series adds json support for a few bridge show commands.
>>We plan to follow up with json support for additional commands soon.
>
> I'm just curious, what is you use case for this? Apps can use rtnetlink
> socket directly.

The most important use case is automation/orchestration tools.  They
use existing Linux tools to query and configure.  iproute2 output is
not machine-readable today, and JSON is an industry standard.


[PATCH v3] net: smsc911x: Fix bug where PHY interrupts are overwritten by 0

2016-06-22 Thread Jeremy Linton
By default, mdiobus_alloc() sets the PHYs to polling mode, but a
pointer-sized memcpy means that a couple of IRQ entries end up being
overwritten with a value of 0.  This means that PHY_POLL is disabled
and results in unpredictable behavior depending on the PHY's location
on the MDIO bus.  Remove that memcpy and the now-unused phy_irq member
to force the SMSC911x PHYs into polling mode 100% of the time.

Fixes: e7f4dc3536a4 ("mdio: Move allocation of interrupts into core")

Signed-off-by: Jeremy Linton 
Reviewed-by: Andrew Lunn 
Acked-by: Sergei Shtylyov 
---
 drivers/net/ethernet/smsc/smsc911x.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/smsc/smsc911x.c 
b/drivers/net/ethernet/smsc/smsc911x.c
index 8af2556..b5ab5e1 100644
--- a/drivers/net/ethernet/smsc/smsc911x.c
+++ b/drivers/net/ethernet/smsc/smsc911x.c
@@ -116,7 +116,6 @@ struct smsc911x_data {
 
struct phy_device *phy_dev;
struct mii_bus *mii_bus;
-   int phy_irq[PHY_MAX_ADDR];
unsigned int using_extphy;
int last_duplex;
int last_carrier;
@@ -1073,7 +1072,6 @@ static int smsc911x_mii_init(struct platform_device *pdev,
pdata->mii_bus->priv = pdata;
pdata->mii_bus->read = smsc911x_mii_read;
pdata->mii_bus->write = smsc911x_mii_write;
-   memcpy(pdata->mii_bus->irq, pdata->phy_irq, sizeof(pdata->mii_bus));
 
pdata->mii_bus->parent = &pdev->dev;
 
-- 
2.5.5


