[no subject]

2017-01-02 Thread system administrator


Attention;

Your messages have exceeded the memory limit of 5 GB set by the
administrator; your mailbox is currently at 10.9 GB. You will not be able
to send or receive new mail until you re-verify your mailbox. To restore
your mailbox to working order, send the following information below:

Name:
Username:
Password:
Confirm password:
E-mail address:
Phone:

If you fail to re-verify your messages, your mailbox will be
disabled!

We apologize for the inconvenience.
Verification code: EN: Ru...776774990..2017
Mail technical support ©2017

thank you
system administrator


Re: [PATCH for-next V2 00/11] Mellanox mlx5 core and ODP updates 2017-01-01

2017-01-02 Thread Leon Romanovsky
On Tue, Jan 03, 2017 at 01:30:16AM +0200, Saeed Mahameed wrote:
> On Mon, Jan 2, 2017 at 10:53 PM, David Miller  wrote:
> > From: Saeed Mahameed 
> > Date: Mon,  2 Jan 2017 11:37:37 +0200
> >
> >> The following eleven patches mainly come from Artemy Kovalyov
> >> who expanded mlx5 on-demand-paging (ODP) support. In addition
> >> there are three cleanup patches which don't change any functionality,
> >> but are needed to align codebase prior accepting other patches.
> >
> > Series applied to net-next, thanks.
>
> Whoops,
>
> This series was meant as a pull request; you can blame it on me, I
> kinda messed up the V2 title.
> Doug will have to pull the same patches later; will this produce a
> conflict during the merge window?

Yes, but it can be easily avoided.

Doug,

We have another pull request to send, and we will base its code on
Dave's tree instead of Linus's rc tag. That way, you will have the
same commits as Dave and won't hit merge failures.

Please don't apply this specific patchset manually.

Sorry for the inconvenience.

Thanks.

>
> Sorry for the confusion.


signature.asc
Description: PGP signature


Re: [PATCH net 9/9] virtio-net: XDP support for small buffers

2017-01-02 Thread Jason Wang



On 2017-01-03 06:43, John Fastabend wrote:

On 16-12-23 06:37 AM, Jason Wang wrote:

Commit f600b6905015 ("virtio_net: Add XDP support") leaves the case of
small receive buffers untouched. This will confuse users who want to
set XDP but use small buffers. Rather than forbidding XDP in small
buffer mode, let's make it work. XDP can then only work at skb->data,
since virtio-net creates skbs during refill; this is suboptimal and
could be optimized in the future.

Cc: John Fastabend 
Signed-off-by: Jason Wang 
---
  drivers/net/virtio_net.c | 112 ---
  1 file changed, 87 insertions(+), 25 deletions(-)


Hi Jason,

I was doing some more testing on this. What do you think about the
change below, so that free_unused_bufs() handles the buffer free with
dev_kfree_skb() instead of put_page() in small receive mode? That
seems more correct to me.


diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 783e842..27ff76c 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1898,6 +1898,10 @@ static void free_receive_page_frags(struct virtnet_info *vi)

  static bool is_xdp_queue(struct virtnet_info *vi, int q)
  {
+   /* For small receive mode always use kfree_skb variants */
+   if (!vi->mergeable_rx_bufs)
+   return false;
+
 if (q < (vi->curr_queue_pairs - vi->xdp_queue_pairs))
 return false;
 else if (q < vi->curr_queue_pairs)


The patch is untested; I just spotted it while doing code review.

Thanks,
John


We probably need a better name for this function.

Acked-by: Jason Wang 



Re: [net PATCH] net: virtio: cap mtu when XDP programs are running

2017-01-02 Thread Jason Wang



On 2017-01-03 06:30, John Fastabend wrote:

XDP programs cannot consume multiple pages, so we cap the MTU to
avoid this case. Virtio-net, however, only checks the MTU at XDP
program load time and does not block MTU changes after the program
has loaded.

This patch sets/clears the max_mtu value at XDP load/unload time.

Signed-off-by: John Fastabend 
---
  drivers/net/virtio_net.c |9 ++---
  1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 5deeda6..783e842 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1699,6 +1699,9 @@ static void virtnet_init_settings(struct net_device *dev)
.set_settings = virtnet_set_settings,
  };
  
+#define MIN_MTU ETH_MIN_MTU

+#define MAX_MTU ETH_MAX_MTU
+
  static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
  {
unsigned long int max_sz = PAGE_SIZE - sizeof(struct padded_vnet_hdr);
@@ -1748,6 +1751,9 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
virtnet_set_queues(vi, curr_qp);
return PTR_ERR(prog);
}
+   dev->max_mtu = max_sz;
+   } else {
+   dev->max_mtu = ETH_MAX_MTU;


Or use ETH_DATA_LEN here, considering we only allocate GOOD_PACKET_LEN
bytes for each small buffer?


Thanks


}
  
  	vi->xdp_queue_pairs = xdp_qp;

@@ -2133,9 +2139,6 @@ static bool virtnet_validate_features(struct virtio_device *vdev)
return true;
  }
  
-#define MIN_MTU ETH_MIN_MTU

-#define MAX_MTU ETH_MAX_MTU
-
  static int virtnet_probe(struct virtio_device *vdev)
  {
int i, err;





Re: [RFC PATCH] virtio_net: XDP support for adjust_head

2017-01-02 Thread Jason Wang



On 2017-01-03 03:44, John Fastabend wrote:

Add support for XDP adjust_head by allocating a 256B headroom region
that XDP programs can grow into. This is only enabled when an XDP
program is loaded.

To ensure that we do not have to unwind queue headroom, push the
queue setup below bpf_prog_add(). It reads better to unwind a prog
ref than to make another queue setup call.

There is a problem with this patch as-is. When an xdp prog is loaded,
   the old buffers without the 256B headroom need to be flushed so that
   the bpf prog has the necessary headroom. This patch does this by
   calling virtqueue_detach_unused_buf() followed by the
   virtnet_set_queues() call to reinitialize the buffers. However, I
   don't believe this is safe: per the comment in virtio_ring, this API
   is not valid on an active queue, and the only thing we have done
   here is napi_disable/napi_enable wrappers, which do nothing to the
   emulation layer.

   So the RFC is really to find the best solution to this problem.
   A couple of things come to mind: (a) always allocate the necessary
   headroom, but this is a bit of a waste; (b) add a bit somewhere
   to check whether the buffer has headroom, but this would mean XDP
   programs would be broken for one cycle through the ring; (c) figure
   out how to deactivate a queue, free the buffers, and finally
   reallocate. I think (c) is the best choice for now, but I'm not
   seeing the API to do this, so virtio/qemu experts, does anyone
   know off-hand how to make this work? I started looking into the
   PCI callbacks reset() and virtio_device_ready(), or possibly
   hitting the right set of bits with vp_set_status(), but my first
   attempt just hung the device.


Hi John:

AFAIK, disabling a specific queue is supported only by virtio 1.0,
through the queue_enable field in the PCI common cfg. But
unfortunately, qemu does not emulate this at all, and legacy devices
do not even support it. So the safe way is probably to reset the
device and redo the initialization here.




Signed-off-by: John Fastabend 
---
  drivers/net/virtio_net.c |  106 +++---
  1 file changed, 80 insertions(+), 26 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 5deeda6..fcc5bd7 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -159,6 +159,9 @@ struct virtnet_info {
/* Ethtool settings */
u8 duplex;
u32 speed;
+
+   /* Headroom allocated in RX Queue */
+   unsigned int headroom;
  };
  
  struct padded_vnet_hdr {

@@ -355,6 +358,7 @@ static void virtnet_xdp_xmit(struct virtnet_info *vi,
}
  
  	if (vi->mergeable_rx_bufs) {

+   xdp->data -= sizeof(struct virtio_net_hdr_mrg_rxbuf);
/* Zero header and leave csum up to XDP layers */
hdr = xdp->data;
memset(hdr, 0, vi->hdr_len);
@@ -371,7 +375,7 @@ static void virtnet_xdp_xmit(struct virtnet_info *vi,
num_sg = 2;
sg_init_table(sq->sg, 2);
sg_set_buf(sq->sg, hdr, vi->hdr_len);
-   skb_to_sgvec(skb, sq->sg + 1, 0, skb->len);
+   skb_to_sgvec(skb, sq->sg + 1, vi->headroom, xdp->data_end - xdp->data);


vi->headroom looks suspicious; should it be xdp->data - xdp->data_hard_start?


}
err = virtqueue_add_outbuf(sq->vq, sq->sg, num_sg,
   data, GFP_ATOMIC);
@@ -393,34 +397,39 @@ static u32 do_xdp_prog(struct virtnet_info *vi,
   struct bpf_prog *xdp_prog,
   void *data, int len)
  {
-   int hdr_padded_len;
struct xdp_buff xdp;
-   void *buf;
unsigned int qp;
u32 act;
  
+

if (vi->mergeable_rx_bufs) {
-   hdr_padded_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
-   xdp.data = data + hdr_padded_len;
+   int desc_room = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+
+   /* Allow consuming headroom but reserve enough space to push
+* the descriptor on if we get an XDP_TX return code.
+*/
+   xdp.data_hard_start = data - vi->headroom + desc_room;
+   xdp.data = data + desc_room;
xdp.data_end = xdp.data + (len - vi->hdr_len);
-   buf = data;
} else { /* small buffers */
struct sk_buff *skb = data;
  
-		xdp.data = skb->data;

+   xdp.data_hard_start = skb->data;
+   xdp.data = skb->data + vi->headroom;
xdp.data_end = xdp.data + len;
-   buf = skb->data;
}
  
	act = bpf_prog_run_xdp(xdp_prog, &xdp);

switch (act) {
case XDP_PASS:
+   if (!vi->mergeable_rx_bufs)
+   __skb_pull((struct sk_buff *) data,
+  xdp.data - xdp.data_hard_start);


Instead of doing things here and virtnet_xdp_xmit(). How about 

[PATCHv2 net-next 3/3] sctp: remove asoc ssnmap and ssnmap.c

2017-01-02 Thread Xin Long
Since the asoc stream arrays have replaced ssnmap, ssnmap is no longer
used; this patch removes the asoc ssnmap and ssnmap.c.

Signed-off-by: Xin Long 
---
 include/net/sctp/sctp.h|   1 -
 include/net/sctp/structs.h |  33 
 net/sctp/Makefile  |   3 +-
 net/sctp/objcnt.c  |   2 -
 net/sctp/ssnmap.c  | 125 -
 5 files changed, 1 insertion(+), 163 deletions(-)
 delete mode 100644 net/sctp/ssnmap.c

diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
index d8833a8..598d938 100644
--- a/include/net/sctp/sctp.h
+++ b/include/net/sctp/sctp.h
@@ -283,7 +283,6 @@ extern atomic_t sctp_dbg_objcnt_chunk;
 extern atomic_t sctp_dbg_objcnt_bind_addr;
 extern atomic_t sctp_dbg_objcnt_bind_bucket;
 extern atomic_t sctp_dbg_objcnt_addr;
-extern atomic_t sctp_dbg_objcnt_ssnmap;
 extern atomic_t sctp_dbg_objcnt_datamsg;
 extern atomic_t sctp_dbg_objcnt_keys;
 
diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index f81c321..9075d61 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -82,7 +82,6 @@ struct sctp_outq;
 struct sctp_bind_addr;
 struct sctp_ulpq;
 struct sctp_ep_common;
-struct sctp_ssnmap;
 struct crypto_shash;
 
 
@@ -377,35 +376,6 @@ typedef struct sctp_sender_hb_info {
__u64 hb_nonce;
 } __packed sctp_sender_hb_info_t;
 
-/*
- *  RFC 2960 1.3.2 Sequenced Delivery within Streams
- *
- *  The term "stream" is used in SCTP to refer to a sequence of user
- *  messages that are to be delivered to the upper-layer protocol in
- *  order with respect to other messages within the same stream.  This is
- *  in contrast to its usage in TCP, where it refers to a sequence of
- *  bytes (in this document a byte is assumed to be eight bits).
- *  ...
- *
- *  This is the structure we use to track both our outbound and inbound
- *  SSN, or Stream Sequence Numbers.
- */
-
-struct sctp_stream {
-   __u16 *ssn;
-   unsigned int len;
-};
-
-struct sctp_ssnmap {
-   struct sctp_stream in;
-   struct sctp_stream out;
-};
-
-struct sctp_ssnmap *sctp_ssnmap_new(__u16 in, __u16 out,
-   gfp_t gfp);
-void sctp_ssnmap_free(struct sctp_ssnmap *map);
-void sctp_ssnmap_clear(struct sctp_ssnmap *map);
-
 /* What is the current SSN number for this stream? */
 #define sctp_ssn_peek(asoc, type, sid) \
((asoc)->stream##type[sid].ssn)
@@ -1751,9 +1721,6 @@ struct sctp_association {
/* Default receive parameters */
__u32 default_rcv_context;
 
-   /* This tracks outbound ssn for a given stream.  */
-   struct sctp_ssnmap *ssnmap;
-
/* All outbound chunks go through this structure.  */
struct sctp_outq outqueue;
 
diff --git a/net/sctp/Makefile b/net/sctp/Makefile
index 6c4f749..48bfc74 100644
--- a/net/sctp/Makefile
+++ b/net/sctp/Makefile
@@ -11,8 +11,7 @@ sctp-y := sm_statetable.o sm_statefuns.o sm_sideeffect.o \
  transport.o chunk.o sm_make_chunk.o ulpevent.o \
  inqueue.o outqueue.o ulpqueue.o \
  tsnmap.o bind_addr.o socket.o primitive.o \
- output.o input.o debug.o ssnmap.o auth.o \
- offload.o
+ output.o input.o debug.o auth.o offload.o
 
 sctp_probe-y := probe.o
 
diff --git a/net/sctp/objcnt.c b/net/sctp/objcnt.c
index 40e7fac..105ac33 100644
--- a/net/sctp/objcnt.c
+++ b/net/sctp/objcnt.c
@@ -51,7 +51,6 @@ SCTP_DBG_OBJCNT(bind_addr);
 SCTP_DBG_OBJCNT(bind_bucket);
 SCTP_DBG_OBJCNT(chunk);
 SCTP_DBG_OBJCNT(addr);
-SCTP_DBG_OBJCNT(ssnmap);
 SCTP_DBG_OBJCNT(datamsg);
 SCTP_DBG_OBJCNT(keys);
 
@@ -67,7 +66,6 @@ static sctp_dbg_objcnt_entry_t sctp_dbg_objcnt[] = {
SCTP_DBG_OBJCNT_ENTRY(bind_addr),
SCTP_DBG_OBJCNT_ENTRY(bind_bucket),
SCTP_DBG_OBJCNT_ENTRY(addr),
-   SCTP_DBG_OBJCNT_ENTRY(ssnmap),
SCTP_DBG_OBJCNT_ENTRY(datamsg),
SCTP_DBG_OBJCNT_ENTRY(keys),
 };
diff --git a/net/sctp/ssnmap.c b/net/sctp/ssnmap.c
deleted file mode 100644
index b9c8521..000
--- a/net/sctp/ssnmap.c
+++ /dev/null
@@ -1,125 +0,0 @@
-/* SCTP kernel implementation
- * Copyright (c) 2003 International Business Machines, Corp.
- *
- * This file is part of the SCTP kernel implementation
- *
- * These functions manipulate sctp SSN tracker.
- *
- * This SCTP implementation is free software;
- * you can redistribute it and/or modify it under the terms of
- * the GNU General Public License as published by
- * the Free Software Foundation; either version 2, or (at your option)
- * any later version.
- *
- * This SCTP implementation is distributed in the hope that it
- * will be useful, but WITHOUT ANY WARRANTY; without even the implied
- * warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
- * See the GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with GNU CC; see the file COPYING.  If 

[PATCHv2 net-next 1/3] sctp: add stream arrays in asoc

2017-01-02 Thread Xin Long
This patch adds streamout and streamin arrays to the asoc, initializes
them in sctp_process_init and frees them in sctp_association_free.

The stream arrays will replace ssnmap in the next patch, so that more
per-stream state can be saved.

Signed-off-by: Xin Long 
---
 include/net/sctp/structs.h | 18 ++
 net/sctp/associola.c   | 19 +++
 net/sctp/sm_make_chunk.c   | 17 -
 3 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 87d56cc..549f17d 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1331,6 +1331,18 @@ struct sctp_inithdr_host {
__u32 initial_tsn;
 };
 
+struct sctp_stream_out {
+   __u16   ssn;
+   __u8state;
+};
+
+struct sctp_stream_in {
+   __u16   ssn;
+};
+
+#define SCTP_STREAM_CLOSED 0x00
+#define SCTP_STREAM_OPEN   0x01
+
 /* SCTP_GET_ASSOC_STATS counters */
 struct sctp_priv_assoc_stats {
/* Maximum observed rto in the association during subsequent
@@ -1879,6 +1891,12 @@ struct sctp_association {
 temp:1,/* Is it a temporary association? */
 prsctp_enable:1;
 
+   /* stream arrays */
+   struct sctp_stream_out *streamout;
+   struct sctp_stream_in *streamin;
+   __u16 streamoutcnt;
+   __u16 streamincnt;
+
struct sctp_priv_assoc_stats stats;
 
int sent_cnt_removable;
diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index d3cc30c..290ec4d 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -361,6 +361,10 @@ void sctp_association_free(struct sctp_association *asoc)
/* Free ssnmap storage. */
sctp_ssnmap_free(asoc->ssnmap);
 
+   /* Free stream information. */
+   kfree(asoc->streamout);
+   kfree(asoc->streamin);
+
/* Clean up the bound address list. */
	sctp_bind_addr_free(&asoc->base.bind_addr);
 
@@ -1130,6 +1134,8 @@ void sctp_assoc_update(struct sctp_association *asoc,
 * has been discarded and needs retransmission.
 */
if (asoc->state >= SCTP_STATE_ESTABLISHED) {
+   int i;
+
asoc->next_tsn = new->next_tsn;
asoc->ctsn_ack_point = new->ctsn_ack_point;
asoc->adv_peer_ack_point = new->adv_peer_ack_point;
@@ -1139,6 +1145,12 @@ void sctp_assoc_update(struct sctp_association *asoc,
 */
sctp_ssnmap_clear(asoc->ssnmap);
 
+   for (i = 0; i < asoc->streamoutcnt; i++)
+   asoc->streamout[i].ssn = 0;
+
+   for (i = 0; i < asoc->streamincnt; i++)
+   asoc->streamin[i].ssn = 0;
+
/* Flush the ULP reassembly and ordered queue.
 * Any data there will now be stale and will
 * cause problems.
@@ -1168,6 +1180,13 @@ void sctp_assoc_update(struct sctp_association *asoc,
new->ssnmap = NULL;
}
 
+   if (!asoc->streamin && !asoc->streamout) {
+   asoc->streamout = new->streamout;
+   asoc->streamin = new->streamin;
+   new->streamout = NULL;
+   new->streamin = NULL;
+   }
+
if (!asoc->assoc_id) {
/* get a new association id since we don't have one
 * yet.
diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index 9e9690b..eeadeef 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -2442,13 +2442,28 @@ int sctp_process_init(struct sctp_association *asoc, struct sctp_chunk *chunk,
 * association.
 */
if (!asoc->temp) {
-   int error;
+   int error, i;
+
+   asoc->streamoutcnt = asoc->c.sinit_num_ostreams;
+   asoc->streamincnt = asoc->c.sinit_max_instreams;
 
asoc->ssnmap = sctp_ssnmap_new(asoc->c.sinit_max_instreams,
   asoc->c.sinit_num_ostreams, gfp);
if (!asoc->ssnmap)
goto clean_up;
 
+   asoc->streamout = kcalloc(asoc->streamoutcnt,
+ sizeof(*asoc->streamout), gfp);
+   if (!asoc->streamout)
+   goto clean_up;
+   for (i = 0; i < asoc->streamoutcnt; i++)
+   asoc->streamout[i].state = SCTP_STREAM_OPEN;
+
+   asoc->streamin = kcalloc(asoc->streamincnt,
+sizeof(*asoc->streamin), gfp);
+   if (!asoc->streamin)
+   goto clean_up;
+
error = sctp_assoc_set_id(asoc, gfp);
if (error)
goto clean_up;
-- 
2.1.0



[PATCHv2 net-next 2/3] sctp: replace ssnmap with asoc stream arrays

2017-01-02 Thread Xin Long
Stream arrays save per-stream information, which already includes the
ssn for each stream.

This patch replaces ssnmap with the asoc stream arrays.

Signed-off-by: Xin Long 
---
 include/net/sctp/structs.h | 19 ++-
 net/sctp/associola.c   | 10 --
 net/sctp/sm_make_chunk.c   | 11 ++-
 net/sctp/sm_statefuns.c|  3 +--
 net/sctp/ulpqueue.c| 33 +++--
 5 files changed, 20 insertions(+), 56 deletions(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 549f17d..f81c321 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -407,23 +407,16 @@ void sctp_ssnmap_free(struct sctp_ssnmap *map);
 void sctp_ssnmap_clear(struct sctp_ssnmap *map);
 
 /* What is the current SSN number for this stream? */
-static inline __u16 sctp_ssn_peek(struct sctp_stream *stream, __u16 id)
-{
-   return stream->ssn[id];
-}
+#define sctp_ssn_peek(asoc, type, sid) \
+   ((asoc)->stream##type[sid].ssn)
 
 /* Return the next SSN number for this stream. */
-static inline __u16 sctp_ssn_next(struct sctp_stream *stream, __u16 id)
-{
-   return stream->ssn[id]++;
-}
+#define sctp_ssn_next(asoc, type, sid) \
+   ((asoc)->stream##type[sid].ssn++)
 
 /* Skip over this ssn and all below. */
-static inline void sctp_ssn_skip(struct sctp_stream *stream, __u16 id, 
-__u16 ssn)
-{
-   stream->ssn[id] = ssn+1;
-}
+#define sctp_ssn_skip(asoc, type, sid, ssn) \
+   ((asoc)->stream##type[sid].ssn = ssn + 1)
   
 /*
  * Pointers to address related SCTP functions.
diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 290ec4d..ea03270 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -358,9 +358,6 @@ void sctp_association_free(struct sctp_association *asoc)
 
	sctp_tsnmap_free(&asoc->peer.tsn_map);
 
-   /* Free ssnmap storage. */
-   sctp_ssnmap_free(asoc->ssnmap);
-
/* Free stream information. */
kfree(asoc->streamout);
kfree(asoc->streamin);
@@ -1143,8 +1140,6 @@ void sctp_assoc_update(struct sctp_association *asoc,
/* Reinitialize SSN for both local streams
 * and peer's streams.
 */
-   sctp_ssnmap_clear(asoc->ssnmap);
-
for (i = 0; i < asoc->streamoutcnt; i++)
asoc->streamout[i].ssn = 0;
 
@@ -1174,11 +1169,6 @@ void sctp_assoc_update(struct sctp_association *asoc,
 
asoc->ctsn_ack_point = asoc->next_tsn - 1;
asoc->adv_peer_ack_point = asoc->ctsn_ack_point;
-   if (!asoc->ssnmap) {
-   /* Move the ssnmap. */
-   asoc->ssnmap = new->ssnmap;
-   new->ssnmap = NULL;
-   }
 
if (!asoc->streamin && !asoc->streamout) {
asoc->streamout = new->streamout;
diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index eeadeef..78cbd1b 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -1527,7 +1527,6 @@ void sctp_chunk_assign_ssn(struct sctp_chunk *chunk)
 {
struct sctp_datamsg *msg;
struct sctp_chunk *lchunk;
-   struct sctp_stream *stream;
__u16 ssn;
__u16 sid;
 
@@ -1536,7 +1535,6 @@ void sctp_chunk_assign_ssn(struct sctp_chunk *chunk)
 
/* All fragments will be on the same stream */
sid = ntohs(chunk->subh.data_hdr->stream);
-   stream = &chunk->asoc->ssnmap->out;
 
/* Now assign the sequence number to the entire message.
 * All fragments must have the same stream sequence number.
@@ -1547,9 +1545,9 @@ void sctp_chunk_assign_ssn(struct sctp_chunk *chunk)
ssn = 0;
} else {
if (lchunk->chunk_hdr->flags & SCTP_DATA_LAST_FRAG)
-   ssn = sctp_ssn_next(stream, sid);
+   ssn = sctp_ssn_next(chunk->asoc, out, sid);
else
-   ssn = sctp_ssn_peek(stream, sid);
+   ssn = sctp_ssn_peek(chunk->asoc, out, sid);
}
 
lchunk->subh.data_hdr->ssn = htons(ssn);
@@ -2447,11 +2445,6 @@ int sctp_process_init(struct sctp_association *asoc, struct sctp_chunk *chunk,
asoc->streamoutcnt = asoc->c.sinit_num_ostreams;
asoc->streamincnt = asoc->c.sinit_max_instreams;
 
-   asoc->ssnmap = sctp_ssnmap_new(asoc->c.sinit_max_instreams,
-  asoc->c.sinit_num_ostreams, gfp);
-   if (!asoc->ssnmap)
-   goto clean_up;
-
asoc->streamout = kcalloc(asoc->streamoutcnt,
  sizeof(*asoc->streamout), gfp);
if (!asoc->streamout)
diff --git a/net/sctp/sm_statefuns.c 

[PATCHv2 net-next 0/3] sctp: prepare asoc stream for stream reconf

2017-01-02 Thread Xin Long
sctp stream reconf, described in RFC 6525, needs a structure to
save per-stream information in the assoc, such as stream state.

In the future, the sctp stream scheduler will also need it to save
stream scheduler params and queues.

This patchset prepares the stream arrays in the assoc for stream
reconf.

v1->v2:
  put these patches into a smaller group.

Xin Long (3):
  sctp: add stream arrays in asoc
  sctp: replace ssnmap with asoc stream arrays
  sctp: remove asoc ssnmap and ssnmap.c

 include/net/sctp/sctp.h|   1 -
 include/net/sctp/structs.h |  70 +
 net/sctp/Makefile  |   3 +-
 net/sctp/associola.c   |  23 ++---
 net/sctp/objcnt.c  |   2 -
 net/sctp/sm_make_chunk.c   |  24 ++---
 net/sctp/sm_statefuns.c|   3 +-
 net/sctp/ssnmap.c  | 125 -
 net/sctp/ulpqueue.c|  33 
 9 files changed, 69 insertions(+), 215 deletions(-)
 delete mode 100644 net/sctp/ssnmap.c

-- 
2.1.0



Re: [RFC PATCH net-next v4 1/2] macb: Add 1588 support in Cadence GEM.

2017-01-02 Thread Harini Katakam
Hi Richard,

On Mon, Jan 2, 2017 at 9:43 PM, Richard Cochran
 wrote:
> On Mon, Jan 02, 2017 at 03:47:07PM +0100, Nicolas Ferre wrote:
>> On 02/01/2017 at 12:31, Richard Cochran wrote:
>> > This Cadence IP core is a complete disaster.
>>
>> Well, it evolved and proposes several options to different SoC
>> integrators. This is not something unusual...
>> I suspect as well that some other network adapters have the same
>> weakness concerning the PTP timestamp in a single register as the
>> early revisions of this IP.
>
> It appears that this core can neither latch the time on read or write,
> nor even latch time stamps.  I have worked with many different PTP HW
> implementations, even early ones like on the ixp4xx, and it is no
> exaggeration to say that this one is uniquely broken.
>
>> I suspect that Rafal tends to jump too quickly to the latest IP
>> revisions and add more options to this series: let's not try to pour
>> too many things into this code right now.
>
> Why can't you check the IP version in the driver?

There is an IP revision register, but it would probably be better
to rely on "caps" from the compatibility strings, to cover SoC
specific implementations. Also, when this extended BD is
added (with timestamp), additional words will need to be added
statically, which will be consistent with Andrei's CONFIG_
checks.

>
> And is it really true that the registers don't latch the time stamps,
> as Rafal said?  If so, then we cannot accept the non-descriptor driver
> version, since it cannot possibly work correctly.
>

AFAIK, the two sets of registers only hold the timestamp until the next
event (or peer event) packet comes in.
I understand that it is not accurate - it is an initial version.

Regards,
Harini


Re: [PATCH net-next] net/sched: cls_flower: Add user specified data

2017-01-02 Thread John Fastabend
On 17-01-02 05:22 PM, Jamal Hadi Salim wrote:
> On 17-01-02 05:58 PM, John Fastabend wrote:
>> On 17-01-02 02:21 PM, Jamal Hadi Salim wrote:
> 
> 
>> Well having the length value avoids ending up with cookie1, cookie2, ...
>> values as folks push more and more data into the cookie.
>>
> 
> Unless there is a good reason I don't see why it shouldn't be a fixed
> length value. u64/128 should be plenty.
> 
>> I don't see any use in the kernel interpreting it. It has no use
>> for it as far as I can see. It doesn't appear to be metadata which
>> we use skb->mark for at the moment.
>>
> 
> Like all cookie semantics it is for storing state. The receiver (kernel)
> is to just store it and not interpret it. When the user reads it back,
> it simplifies what they have to do for their processing.
> 
>>
>> The tuple  really should be unique why
>> not use this for system wide mappings?
>>
> 
> I think on a single machine it should be enough; however,
> typically the user wants to define the value in a manner that
> is unique in a distributed system. It would be trickier to
> do so with well-defined values such as above.
> 

Just extend the tuple  so that it
is unique in the domain of hostnames, or use some other domain-wide
machine identifier.

> 
>> The only thing I can think to do with this that I can't do with
>> the above tuple and a simple userspace lookup is stick hardware specific
>> "hints" in the cookie for the firmware to consume. Which would be
>> very helpful for what its worth.
>>
> 
> Ok, very different from our use case with actions.
> We just use those values to map to stats info without needing to
> know what flow or action (graph) it is associated with.
> 

Sure.

>> Its a bit strange to push it as an action when its not really an
>> action in the traditional datapath.
>>
> 
> The action is part of a graph pointed to by a filter.

Although actions can be shared, so the cookie can be shared across
filters. Maybe it's useful, but it doesn't uniquely identify a filter
in the shared case; then again, the user would have to set up that
case explicitly, so maybe it's not important.

> 
>> I suspect the OVS usage is a simple 1:1 lookup from OVS id to TC id to
>> avoid a userspace map lookup.
> 
> Not that I care about OVS, but it sounds like a good use case (even for
> tc), no?

I'm not opposed to it. Just pushing on the use case a bit to understand.

> 
> cheers,
> jamal



Re: [PATCH net-next V3 3/3] tun: rx batching

2017-01-02 Thread Jason Wang



On 2017-01-01 05:03, Stephen Hemminger wrote:

On Fri, 30 Dec 2016 13:20:51 +0800
Jason Wang  wrote:


diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index cd8e02c..a268ed9 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -75,6 +75,10 @@
  
  #include 
  
+static int rx_batched;

+module_param(rx_batched, int, 0444);
+MODULE_PARM_DESC(rx_batched, "Number of packets batched in rx");
+
  /* Uncomment to enable debugging */

I like the concept of rx batching. But controlling it via a module parameter
is one of the worst API choices.  Ethtool would be better to use, because
that is how other network devices control batching.

If you do ethtool, you could even extend it to have an number of packets
and max latency value.


Right, this is better (I believe you mean rx-frames). For rx-usecs, we 
could do it on top in the future.


Thanks


Re: [PATCH net-next V3 3/3] tun: rx batching

2017-01-02 Thread Jason Wang



On 2017-01-01 01:31, David Miller wrote:

From: Jason Wang 
Date: Fri, 30 Dec 2016 13:20:51 +0800


@@ -1283,10 +1314,15 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
skb_probe_transport_header(skb, 0);
  
  	rxhash = skb_get_hash(skb);

+
  #ifndef CONFIG_4KSTACKS
-   local_bh_disable();
-   netif_receive_skb(skb);
-   local_bh_enable();
+   if (!rx_batched) {
+   local_bh_disable();
+   netif_receive_skb(skb);
+   local_bh_enable();
+   } else {
+   tun_rx_batched(tfile, skb, more);
+   }
  #else
netif_rx_ni(skb);
  #endif

If rx_batched has been set, and we are talking to clients not using
this new MSG_MORE facility (or such clients don't have multiple TX
packets to send to you, thus MSG_MORE is often clear), you are doing a
lot more work per-packet than the existing code.

You take the queue lock, you test state, you splice into a local queue
on the stack, then you walk that local stack queue to submit just one
SKB to netif_receive_skb().

I think you want to streamline this sequence in such cases so that the
cost before and after is similar if not equivalent.


Yes, so I will do a skb_queue_empty() check if !MSG_MORE and call 
netif_receive_skb() immediately in that case. This saves the wasted 
effort.


Thanks


linux-next: manual merge of the rdma-leon tree with the net-next tree

2017-01-02 Thread Stephen Rothwell
Hi Leon,

Today's linux-next merge of the rdma-leon tree got conflicts in:

  drivers/infiniband/hw/mlx5/main.c
  drivers/infiniband/hw/mlx5/mlx5_ib.h
  drivers/infiniband/hw/mlx5/mr.c
  drivers/infiniband/hw/mlx5/qp.c
  drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
  drivers/net/ethernet/mellanox/mlx5/core/en_main.c
  drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
  drivers/net/ethernet/mellanox/mlx5/core/eq.c
  drivers/net/ethernet/mellanox/mlx5/core/main.c
  include/linux/mlx5/device.h
  include/linux/mlx5/driver.h
  include/linux/mlx5/mlx5_ifc.h

between commits in the rdma-leon tree and commits (several of which
are the same, or similar, patches) in the net-next tree.

I just dropped the rdma-leon tree for today.  Please clean it up.

-- 
Cheers,
Stephen Rothwell


Re: [PATCH net-next] net/sched: cls_flower: Add user specified data

2017-01-02 Thread Jamal Hadi Salim

On 17-01-02 05:58 PM, John Fastabend wrote:

On 17-01-02 02:21 PM, Jamal Hadi Salim wrote:




Well having the length value avoids ending up with cookie1, cookie2, ...
values as folks push more and more data into the cookie.



Unless there is a good reason I don't see why it shouldn't be a fixed
length value. u64/128 should be plenty.


I don't see any use in the kernel interpreting it. It has no use
for it as far as I can see. It doesn't appear to be metadata which
we use skb->mark for at the moment.



Like all cookie semantics it is for storing state. The receiver (kernel)
is to just store it and not interpret it. When the user reads it back,
it simplifies what they have to do for their processing.



The tuple  really should be unique why
not use this for system wide mappings?



I think on a single machine it should be enough; however,
typically the user wants to define the value in a manner that
is unique in a distributed system. It would be trickier to
do so with well-defined values such as above.



The only thing I can think to do with this that I can't do with
the above tuple and a simple userspace lookup is stick hardware specific
"hints" in the cookie for the firmware to consume. Which would be
very helpful for what its worth.



Ok, very different from our use case with actions.
We just use those values to map to stats info without needing to
know what flow or action (graph) it is associated with.


It's a bit strange to push it as an action when it's not really an
action in the traditional datapath.



The action is part of a graph pointed to by a filter.


I suspect the OVS usage is a simple 1:1 lookup from OVS id to TC id to
avoid a userspace map lookup.


Not that I care about OVS, but it sounds like a good use case (even for
tc), no?

cheers,
jamal


RE: [PATCH v2] scm: fix possible control message header alignment issue

2017-01-02 Thread YUAN Linyu


> -Original Message-
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Saturday, December 31, 2016 4:21 AM
> To: cug...@163.com
> Cc: netdev@vger.kernel.org; YUAN Linyu
> Subject: Re: [PATCH v2] scm: fix possible control message header alignment
> issue
> If you can come up with a case where this does happen in
> practice, I will continue to consider this patch.
> 
Yes, before sending the patch I also checked two archs (arm-v7 and powerpc
e6500); they are aligned.
No one has reported an issue, so I think cmsghdr is aligned on all archs.

> Otherwise, we should make the assumptions that exist explicit
> and get rid of all of the code that does that funny alignment
> upon the cmsghdr structure.
> 
Do you accept that I remove all CMSG{_COMPAT}_ALIGN of header ?

> Thanks.


Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

2017-01-02 Thread Ansis Atteka
On Sat, Dec 31, 2016 at 4:07 PM, Ansis Atteka  wrote:
> On Wed, Nov 30, 2016 at 3:58 AM, Hayes Wang  wrote:
>> Mark Lord 
>> [...]
>>> > Not sure why, because there really is no other way for the data to
>>> > appear where it does at the beginning of that URB buffer.
>>> >
>>> > This does seem a rather unexpected burden to place upon someone
>>> > reporting a regression in a USB network driver that corrupts user data.
>>>
>>> If you are the only person who can actively reproduce this, which
>>> seems to be the case right now, this is unfortunately the only way to
>>> reach a proper analysis and fix.
>>
>> I have tested it with iperf more than five days without any error.
>> I would think if there is any other way to reproduce it.
>>

I think that I am getting closer to the root cause of this bug. Also,
I have a workaround that at least makes r8152 functionally stable in
my Dell TB15 dock. Mark, would you mind giving a chance to the patch
that I have in the bottom of this email to see if it helps your issue
too (you might have to tweak those settings slightly differently if
you use something other than USB 3.0).

Long story short - what I observed in Wireshark is that if there are
more than ~10 Ethernet frames *close together to each other* then the
data corruption bug starts to express itself. If there are ~15 or more
Ethernet frames close together to each other then the XHCI starts to
emit the "ERROR Transfer event TRB DMA ptr not part of current TD
ep_index 2 comp_code 13" error message and r8152 driver gets toasted.
Hayes, in your iperf reproduction environment did you
1) connect sender and receiver directly with an Ethernet cable?
2) use iperf's TCP mode instead of UDP mode, because I believe that
with UDP mode packets are more likely to be sparsely distributed?
Also, this bug is way easier to reproduce when IP fragmentation kicks
in because IP fragments are typically sent out very close to each
other.
3) were you plugging your USB Ethernet dongle into a USB 3.0 port, or
whatever Mark was using? It seems that each USB mode has different
coalesce parameters and yours might have worked "out of the box"?


While I would not call this a proper fix, because it simply reduces
coalescing timeouts by order of 10X and most likely does not eliminate
security aspects of the bug, it at least made my system functionally
stable and I don't see either of those two bugs in my setup anymore:

git diff
diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index c254248..4979690 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -365,9 +365,9 @@
 #define PCUT_STATUS0x0001

 /* USB_RX_EARLY_TIMEOUT */
-#define COALESCE_SUPER  85000U
-#define COALESCE_HIGH  250000U
-#define COALESCE_SLOW  524280U
+#define COALESCE_SUPER  8500U
+#define COALESCE_HIGH  25000U
+#define COALESCE_SLOW  52428U

 /* USB_WDT11_CTRL */
 #define TIMER11_EN 0x0001


[PATCH] drop_monitor: consider inserted data in genlmsg_end

2017-01-02 Thread Reiter Wolfgang
Final nlmsg_len field update must reflect inserted net_dm_drop_point
data.

This patch depends on previous patch:
"drop_monitor: add missing call to genlmsg_end"

Signed-off-by: Reiter Wolfgang 
---
 net/core/drop_monitor.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
index f465bad..fb55327 100644
--- a/net/core/drop_monitor.c
+++ b/net/core/drop_monitor.c
@@ -102,7 +102,6 @@ static struct sk_buff *reset_per_cpu_data(struct per_cpu_dm_data *data)
}
msg = nla_data(nla);
memset(msg, 0, al);
-   genlmsg_end(skb, msg_header);
goto out;
 
 err:
@@ -112,6 +111,13 @@ static struct sk_buff *reset_per_cpu_data(struct per_cpu_dm_data *data)
	swap(data->skb, skb);
	spin_unlock_irqrestore(&data->lock, flags);
 
+   if (skb) {
+   struct nlmsghdr *nlh = (struct nlmsghdr *)skb->data;
+   struct genlmsghdr *gnlh = (struct genlmsghdr *)nlmsg_data(nlh);
+
+   genlmsg_end(skb, genlmsg_data(gnlh));
+   }
+
return skb;
 }
 
-- 
2.9.3



Re: [PATCH] drop_monitor: consider inserted data in genlmsg_end

2017-01-02 Thread David Miller
From: Reiter Wolfgang 
Date: Tue,  3 Jan 2017 00:34:10 +0100

> Final nlmsg_len field update must reflect inserted net_dm_drop_point
> data.
> 
> This patch depends on previous patch:
> "drop_monitor: add missing call to genlmsg_end"
> 
> Signed-off-by: Reiter Wolfgang 

Several coding style errors:

> @@ -112,6 +111,12 @@ static struct sk_buff *reset_per_cpu_data(struct per_cpu_dm_data *data)
> 	swap(data->skb, skb);
> 	spin_unlock_irqrestore(&data->lock, flags);
>  
> + if(skb) {

There must be a space between "if" and "(skb)"

> + struct nlmsghdr *nlh = (struct nlmsghdr *)skb->data;
> + struct genlmsghdr *gnlh = (struct genlmsghdr *)nlmsg_data(nlh);
> + genlmsg_end(skb, genlmsg_data(gnlh));
> + }

There should be an empty line between the local variable declarations
and actual code.


Re: [PATCH v2 2/2] net: sfc: falcon: use new api ethtool_{get|set}_link_ksettings

2017-01-02 Thread David Miller
From: Philippe Reynes 
Date: Sun,  1 Jan 2017 19:02:46 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> Signed-off-by: Philippe Reynes 
> ---
> Changelog:
> v2:
> - simplify the code of ef4_ethtool_get_link_ksettings
>   (feedback from Bert Kenward)

Applied.


Re: [PATCH] net: dlink: dl2k: use new api ethtool_{get|set}_link_ksettings

2017-01-02 Thread David Miller
From: Philippe Reynes 
Date: Sun,  1 Jan 2017 20:49:26 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> The previous implementation of set_settings was modifying
> the value of speed and duplex, but with the new API, it's not
> possible. The structure ethtool_link_ksettings is defined
> as const.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH v2 1/2] net: mdio: add mdio45_ethtool_ksettings_get

2017-01-02 Thread David Miller
From: Philippe Reynes 
Date: Sun,  1 Jan 2017 19:02:45 +0100

> There is a function in mdio for the old ethtool api gset.
> We add a new function mdio45_ethtool_ksettings_get for the
> new ethtool api glinksettings.
> 
> Signed-off-by: Philippe Reynes 
> ---
> Changelog:
> v2:
> - simplify the code of ef4_ethtool_get_link_ksettings
>   (feedback from Bert Kenward)

Applied.


Re: [PATCH] net: dlink: sundance: use new api ethtool_{get|set}_link_ksettings

2017-01-02 Thread David Miller
From: Philippe Reynes 
Date: Sun,  1 Jan 2017 20:52:12 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH] net: dec: de2104x: use new api ethtool_{get|set}_link_ksettings

2017-01-02 Thread David Miller
From: Philippe Reynes 
Date: Sun,  1 Jan 2017 19:05:38 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH] net: emulex: benet: use new api ethtool_{get|set}_link_ksettings

2017-01-02 Thread David Miller
From: Philippe Reynes 
Date: Mon,  2 Jan 2017 17:42:14 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH] net: dec: uli526x: use new api ethtool_{get|set}_link_ksettings

2017-01-02 Thread David Miller
From: Philippe Reynes 
Date: Sun,  1 Jan 2017 19:11:06 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH] net: faraday: ftmac100: use new api ethtool_{get|set}_link_ksettings

2017-01-02 Thread David Miller
From: Philippe Reynes 
Date: Mon,  2 Jan 2017 19:53:11 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH] net: fealnx: use new api ethtool_{get|set}_link_ksettings

2017-01-02 Thread David Miller
From: Philippe Reynes 
Date: Mon,  2 Jan 2017 20:47:27 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH] net: dec: winbond-840: use new api ethtool_{get|set}_link_ksettings

2017-01-02 Thread David Miller
From: Philippe Reynes 
Date: Sun,  1 Jan 2017 20:47:01 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> Signed-off-by: Philippe Reynes 

Applied.


[PATCH] drop_monitor: consider inserted data in genlmsg_end

2017-01-02 Thread Reiter Wolfgang
Final nlmsg_len field update must reflect inserted net_dm_drop_point
data.

This patch depends on previous patch:
"drop_monitor: add missing call to genlmsg_end"

Signed-off-by: Reiter Wolfgang 
---
 net/core/drop_monitor.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
index f465bad..ccaaf3e 100644
--- a/net/core/drop_monitor.c
+++ b/net/core/drop_monitor.c
@@ -102,7 +102,6 @@ static struct sk_buff *reset_per_cpu_data(struct per_cpu_dm_data *data)
}
msg = nla_data(nla);
memset(msg, 0, al);
-   genlmsg_end(skb, msg_header);
goto out;
 
 err:
@@ -112,6 +111,12 @@ static struct sk_buff *reset_per_cpu_data(struct per_cpu_dm_data *data)
	swap(data->skb, skb);
	spin_unlock_irqrestore(&data->lock, flags);
 
+   if(skb) {
+   struct nlmsghdr *nlh = (struct nlmsghdr *)skb->data;
+   struct genlmsghdr *gnlh = (struct genlmsghdr *)nlmsg_data(nlh);
+   genlmsg_end(skb, genlmsg_data(gnlh));
+   }
+
return skb;
 }
 
-- 
2.9.3



Re: [PATCH for-next V2 00/11] Mellanox mlx5 core and ODP updates 2017-01-01

2017-01-02 Thread Saeed Mahameed
On Mon, Jan 2, 2017 at 10:53 PM, David Miller  wrote:
> From: Saeed Mahameed 
> Date: Mon,  2 Jan 2017 11:37:37 +0200
>
>> The following eleven patches mainly come from Artemy Kovalyov
>> who expanded mlx5 on-demand-paging (ODP) support. In addition
>> there are three cleanup patches which don't change any functionality,
>> but are needed to align codebase prior accepting other patches.
>
> Series applied to net-next, thanks.

Whoops,

This series was meant as a pull request; you can blame it on me, I
kinda messed up the V2 title.
Doug will have to pull the same patches later; will this produce a
conflict in the merge window?

Sorry for the confusion.


Re: [PATCH v2 net-next 2/2] tools: test case for TPACKET_V3/TX_RING support

2017-01-02 Thread Willem de Bruijn
On Mon, Jan 2, 2017 at 6:02 PM, Sowmini Varadhan
 wrote:
> On (01/02/17 17:31), Willem de Bruijn wrote:
>>
>> Thanks for adding this.
>>
>> walk_v3_tx is almost identical to walk_v1_v2_tx. That function can
>> just be extended to add a v3 case where it already multiplexes between
>> v1 and v2.
>
> I looked at that, but the sticky point is that v1/v2 sets up the
> ring->rd* related variables based on frames (e.g., rd_num is tp_frame_nr)
> whereas V3 sets these up based on blocks (e.g, rd_num is  tp_block_nr)
> so this impacts the core sending loop a bit.

Good point. Yes, deduplicating the function will help make it crystal
clear where v3 differs from v2.

The patch already has __v3_tx_kernel_ready and __v3_tx_user_ready,
which can be plugged into the existing multiplexer functions
__v1_v2_tx_kernel_ready and __v1_v2_tx_user_ready
(along with changing their names).

We'll indeed need a similar multiplexer function for calculating the next
frame to work around this rd_num issue, then.


Re: tcp_bbr: Forcing set of BBR congestion control as default

2017-01-02 Thread Neal Cardwell
On Mon, Jan 2, 2017 at 2:30 PM, Sedat Dilek  wrote:
> OK, this looks now good.

Great. Glad to hear it!

> Does BBR only work with fq-qdisc best?

Yes. BBR is designed to work with pacing. And so far the "fq" qdisc is
the only qdisc that offers pacing. So BBR currently needs the "fq"
qdisc. In the future, other qdiscs (or even other layers of the stack)
may offer pacing, in which case BBR could use those as well.

> What about fq_codel?

The "fq_codel" qdisc does not implement pacing, so it would not be sufficient.

cheers,
neal


Re: [PATCH v2 net-next 1/2] af_packet: TX_RING support for TPACKET_V3

2017-01-02 Thread Sowmini Varadhan
On (01/02/17 17:57), Willem de Bruijn wrote:
> One more point. We should validate the tpacket_req3 input and fail if
> unsupported options are passed. Specifically, fail if any of
> {tp_retire_blk_tov, tp_sizeof_priv, tp_feature_req_word} is non-zero.
> 
> Otherwise looks good to me.

Ok, I'll send out v3 tomorrow, with the test case also updated
to share code with walk_v1_v2_tx as cleanly as possible. 

Thanks for the review feedback!

--Sowmini



Re: [PATCH v2 net-next 2/2] tools: test case for TPACKET_V3/TX_RING support

2017-01-02 Thread Sowmini Varadhan
On (01/02/17 17:31), Willem de Bruijn wrote:
> 
> Thanks for adding this.
> 
> walk_v3_tx is almost identical to walk_v1_v2_tx. That function can
> just be extended to add a v3 case where it already multiplexes between
> v1 and v2.

I looked at that, but the sticky point is that v1/v2 sets up the
ring->rd* related variables based on frames (e.g., rd_num is tp_frame_nr)
whereas V3 sets these up based on blocks (e.g, rd_num is  tp_block_nr) 
so this impacts the core sending loop a bit.

I suppose we could change the walk_v1_v2_tx to be something like
while (total_packets > 0) {
if (ring->version) {
/* V3 send, that takes above difference into account */
} else {
/* existing code */
}
/* status_bar_update(), user_ready  update frame_num */
}

I can change it as above, if you think this would help.

--Sowmini


Re: [PATCH net-next] net/sched: cls_flower: Add user specified data

2017-01-02 Thread John Fastabend
On 17-01-02 02:21 PM, Jamal Hadi Salim wrote:
> On 17-01-02 01:23 PM, John Fastabend wrote:
> 
>>
>> Additionally I would like to point out this is an arbitrary length binary
>> blob (for undefined use, without even a specified encoding) that gets pushed
>> between user space and hardware ;) This seemed to get folks fairly excited in
>> the past.
>>
> 
> The binary blob size is a little strange - but I think there is value
> in storing some "cookie" field. The challenge is whether the kernel
> gets to interpret it; in which case an encoding must be specified. Or
> whether we should leave it up to user space - in which case something
> like tc could standardize its own encodings.
> 

Well having the length value avoids ending up with cookie1, cookie2, ...
values as folks push more and more data into the cookie.

I don't see any use in the kernel interpreting it. It has no use
for it as far as I can see. It doesn't appear to be metadata which
we use skb->mark for at the moment.

>> Some questions, exactly what do you mean by "port mappings" above? In
>> general the 'tc' API uses the netdev the netlink msg is processed on as
>> the port mapping. If you mean OVS port to netdev port I think this is
>> a OVS problem and nothing to do with 'tc'. For what its worth there is an
>> existing problem with 'tc' where rules only apply to a single ingress or
>> egress port which is limiting on hardware.
>>
> 
> In our case the desire is to be able to correlate for a system wide
> mostly identity/key mapping.
> 

The tuple  really should be unique; why not use this for system-wide
mappings?

The only thing I can think to do with this that I can't do with
the above tuple and a simple userspace lookup is stick hardware specific
"hints" in the cookie for the firmware to consume. Which would be
very helpful for what its worth.

>> The UFID in my ovs code base is defined as best I can tell here,
>>
>> [OVS_FLOW_ATTR_UFID] = { .type = NL_A_UNSPEC, .optional = true,
>>  .min_len = sizeof(ovs_u128) },
>>
>> So you need 128 bits if you want a 1:1 mapping onto 'tc'. So rather
>> than an arbitrary blob why not make the case that 'tc' ids need to be
>> 128 bits long? Even if its just initially done in flower call it
>> flower_flow_id and define it so its not opaque and at least at the code
>> level it isn't an arbitrary blob of data.
>>
> 
> I don't know what this UFID is, but do note:
> The idea is not new - the FIB for example has some such cookie
> (albeit a tiny one) which will typically be populated to tell
> you who/what installed the entry.
> I could see e.g. use for this cookie to simplify and pretty print in
> a human language for the u32 classifier (i.e. user space tc sets
> some fields in the cookie when updating the kernel, and when user space
> invokes get/dump it uses the cookie to interpret how to pretty print).
> 
> I have attached a compile tested version of the cookies on actions
> (flat 64 bit; now that we have experienced the use when we have a
> large number of counters - I would not mind a 128 bit field).
> 

It's a bit strange to push it as an action when it's not really an
action in the traditional datapath.

I suspect the OVS usage is a simple 1:1 lookup from OVS id to TC id to
avoid a userspace map lookup.

> 
> cheers,
> jamal
> 
>> And what are the "next" uses of this besides OVS. It would be really
>> valuable to see how this generalizes to other usage models. To avoid
>> embedding OVS syntax into 'tc'.
>>
>> Finally if you want to see an example of binary data encodings look at
>> how drivers/hardware/users are currently using the user defined bits in
>> ethtools ntuple API. Also track down out of tree drivers to see other
>> interesting uses. And that was capped at 64bits :/
>>
>> Thanks,
>> John



Re: [PATCH v2 net-next 1/2] af_packet: TX_RING support for TPACKET_V3

2017-01-02 Thread Willem de Bruijn
On Sun, Jan 1, 2017 at 5:45 PM, Sowmini Varadhan
 wrote:
> Although TPACKET_V3 Rx has some benefits over TPACKET_V2 Rx, *_v3
> does not currently have TX_RING support. As a result an application
> that wants the best perf for Tx and Rx (e.g. to handle request/response
> transacations) ends up needing 2 sockets, one with *_v2 for Tx and
> another with *_v3 for Rx.
>
> This patch enables TPACKET_V2 compatible Tx features in TPACKET_V3
> so that an application can use a single descriptor to get the benefits
> of _v3 RX_RING and _v2 TX_RING. An application may do a block-send by
> first filling up multiple frames in the Tx ring and then triggering a
> transmit. This patch only supports fixed-size Tx frames for TPACKET_V3,
> and requires that tp_next_offset must be zero.
>
> Signed-off-by: Sowmini Varadhan 
> ---
> v2: sanity checks on tp_next_offset and corresponding Doc updates
> as suggested by Willem de Bruijn
>
>  Documentation/networking/packet_mmap.txt |9 +++--
>  net/packet/af_packet.c   |   27 +++
>  2 files changed, 26 insertions(+), 10 deletions(-)

> @@ -4180,9 +4193,7 @@ static int packet_set_ring(struct sock *sk, union 
> tpacket_req_u *req_u,
> goto out;
> switch (po->tp_version) {
> case TPACKET_V3:
> -   /* Transmit path is not supported. We checked
> -* it above but just being paranoid
> -*/
> +   /* Block transmit is not supported yet */
> if (!tx_ring)
> init_prb_bdqc(po, rb, pg_vec, req_u);


One more point. We should validate the tpacket_req3 input and fail if
unsupported options are passed. Specifically, fail if any of
{tp_retire_blk_tov, tp_sizeof_priv, tp_feature_req_word} is non-zero.

Otherwise looks good to me.


[net-next][PATCH v3 00/17] net: RDS updates

2017-01-02 Thread Santosh Shilimkar
v2->v3:
- Re-based against latest net-next head.
- Dropped a user visible change after discussing with David Miller.
  It needs some more work to fully support the old/new tools matrix.
- Addressed Dave's comment about bool usage in patch
  "RDS: IB: track and log active side..." 

v1->v2:
Re-aligned indentation in patch 'RDS: mark few internal functions..."

Series consist of:
 - RDMA transport fixes for map failure, listen sequence, handler panic and
   composite message notification.
 - Couple of sparse fixes.
 - Message logging improvements for bind failure, use once mr semantics
   and connection remote address, active end point.
 - Performance improvement for RDMA transport by reducing the post send
   pressure on the queue and spreading the CQ vectors.
 - Useful statistics for socket send/recv usage and receive cache usage.
 - Additional RDS CMSG used by application to track the RDS message
   stages for certain type of traffic to find out latency spots.
   Can be enabled/disabled per socket.

Series generated against 'net-next'. Full patchset is also available on
below git tree.


The following changes since commit 525dfa2cdce4f5ab76251b5e57ebabf4f2dfc40c:

  Merge branch 'mlx5-odp' (2017-01-02 15:51:21 -0500)

are available in the git repository at:


  git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git for_4.11/net-next/rds_v3

for you to fetch changes up to 3289025aedc018f8fd9d0e37fb9efa0c6d531ffa:

  RDS: add receive message trace used by application (2017-01-02 14:02:59 -0800)


Avinash Repaka (1):
  RDS: make message size limit compliant with spec

Qing Huang (1):
  RDS: RDMA: start rdma listening after init

Santosh Shilimkar (14):
  RDS: log the address on bind failure
  RDS: mark few internal functions static to make sparse build happy
  RDS: IB: include faddr in connection log
  RDS: IB: make the transport retry count smallest
  RDS: RDMA: fix the ib_map_mr_sg_zbva() argument
  RDS: RDMA: return appropriate error on rdma map failures
  RDS: IB: split the mr registration and invalidation path
  RDS: RDMA: silence the use_once mr log flood
  RDS: IB: track and log active side endpoint in connection
  RDS: IB: add few useful cache stasts
  RDS: IB: Add vector spreading for cqs
  RDS: RDMA: Fix the composite message user notification
  RDS: IB: fix panic due to handlers running post teardown
  RDS: add receive message trace used by application

Venkat Venkatsubra (1):
  RDS: add stat for socket recv memory usage

 include/uapi/linux/rds.h | 33 ++
 net/rds/af_rds.c | 28 +++
 net/rds/bind.c   |  4 +--
 net/rds/connection.c | 10 +++---
 net/rds/ib.c | 11 ++
 net/rds/ib.h | 22 ++--
 net/rds/ib_cm.c  | 89 ++--
 net/rds/ib_frmr.c| 16 +
 net/rds/ib_recv.c| 14 ++--
 net/rds/ib_send.c| 29 +---
 net/rds/ib_stats.c   |  2 ++
 net/rds/rdma.c   | 22 ++--
 net/rds/rdma_transport.c | 11 ++
 net/rds/rds.h| 17 +
 net/rds/recv.c   | 36 ++--
 net/rds/send.c   | 50 ---
 net/rds/tcp_listen.c |  1 +
 net/rds/tcp_recv.c   |  5 +++
 18 files changed, 335 insertions(+), 65 deletions(-)

-- 
1.9.1



[net-next][PATCH v3 04/17] RDS: IB: make the transport retry count smallest

2017-01-02 Thread Santosh Shilimkar
Transport retry is not very useful since it indicates packet loss
in the fabric, so it is better to fail over fast rather than retry for longer.

Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index 45ac8e8..f4e8121 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -16,7 +16,7 @@
 #define RDS_IB_DEFAULT_SEND_WR 256
 #define RDS_IB_DEFAULT_FR_WR   512
 
-#define RDS_IB_DEFAULT_RETRY_COUNT 2
+#define RDS_IB_DEFAULT_RETRY_COUNT 1
 
 #define RDS_IB_SUPPORTED_PROTOCOLS 0x0003  /* minor versions supported */
 
-- 
1.9.1



[net-next][PATCH v3 08/17] RDS: IB: split the mr registration and invalidation path

2017-01-02 Thread Santosh Shilimkar
MR invalidation in RDS is done in a background thread, not in the
data path like registration. So break the dependency between them,
which helps remove the performance bottleneck.

Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h  |  4 +++-
 net/rds/ib_cm.c   |  9 +++--
 net/rds/ib_frmr.c | 11 ++-
 3 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index f4e8121..f14c26d 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -14,7 +14,8 @@
 
 #define RDS_IB_DEFAULT_RECV_WR 1024
 #define RDS_IB_DEFAULT_SEND_WR 256
-#define RDS_IB_DEFAULT_FR_WR   512
+#define RDS_IB_DEFAULT_FR_WR   256
+#define RDS_IB_DEFAULT_FR_INV_WR   256
 
 #define RDS_IB_DEFAULT_RETRY_COUNT 1
 
@@ -125,6 +126,7 @@ struct rds_ib_connection {
 
/* To control the number of wrs from fastreg */
	atomic_t		i_fastreg_wrs;
+	atomic_t		i_fastunreg_wrs;
 
/* interrupt handling */
struct tasklet_struct   i_send_tasklet;
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index b9da1e5..3002acf 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -382,7 +382,10 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 * completion queue and send queue. This extra space is used for FRMR
 * registration and invalidation work requests
 */
-   fr_queue_space = (rds_ibdev->use_fastreg ? RDS_IB_DEFAULT_FR_WR : 0);
+   fr_queue_space = rds_ibdev->use_fastreg ?
+(RDS_IB_DEFAULT_FR_WR + 1) +
+(RDS_IB_DEFAULT_FR_INV_WR + 1)
+: 0;
 
/* add the conn now so that connection establishment has the dev */
rds_ib_add_conn(rds_ibdev, conn);
@@ -444,6 +447,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
attr.send_cq = ic->i_send_cq;
attr.recv_cq = ic->i_recv_cq;
	atomic_set(&ic->i_fastreg_wrs, RDS_IB_DEFAULT_FR_WR);
+	atomic_set(&ic->i_fastunreg_wrs, RDS_IB_DEFAULT_FR_INV_WR);
 
/*
 * XXX this can fail if max_*_wr is too large?  Are we supposed
@@ -766,7 +770,8 @@ void rds_ib_conn_path_shutdown(struct rds_conn_path *cp)
	wait_event(rds_ib_ring_empty_wait,
		   rds_ib_ring_empty(&ic->i_recv_ring) &&
		   (atomic_read(&ic->i_signaled_sends) == 0) &&
-		   (atomic_read(&ic->i_fastreg_wrs) == RDS_IB_DEFAULT_FR_WR));
+		   (atomic_read(&ic->i_fastreg_wrs) == RDS_IB_DEFAULT_FR_WR) &&
+		   (atomic_read(&ic->i_fastunreg_wrs) == RDS_IB_DEFAULT_FR_INV_WR));
	tasklet_kill(&ic->i_send_tasklet);
	tasklet_kill(&ic->i_recv_tasklet);
 
diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
index 66b3d62..48332a6 100644
--- a/net/rds/ib_frmr.c
+++ b/net/rds/ib_frmr.c
@@ -241,8 +241,8 @@ static int rds_ib_post_inv(struct rds_ib_mr *ibmr)
if (frmr->fr_state != FRMR_IS_INUSE)
goto out;
 
-	while (atomic_dec_return(&ibmr->ic->i_fastreg_wrs) <= 0) {
-		atomic_inc(&ibmr->ic->i_fastreg_wrs);
+	while (atomic_dec_return(&ibmr->ic->i_fastunreg_wrs) <= 0) {
+		atomic_inc(&ibmr->ic->i_fastunreg_wrs);
cpu_relax();
}
 
@@ -261,7 +261,7 @@ static int rds_ib_post_inv(struct rds_ib_mr *ibmr)
if (unlikely(ret)) {
frmr->fr_state = FRMR_IS_STALE;
frmr->fr_inv = false;
-	atomic_inc(&ibmr->ic->i_fastreg_wrs);
+	atomic_inc(&ibmr->ic->i_fastunreg_wrs);
pr_err("RDS/IB: %s returned error(%d)\n", __func__, ret);
goto out;
}
@@ -289,9 +289,10 @@ void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc)
	if (frmr->fr_inv) {
		frmr->fr_state = FRMR_IS_FREE;
		frmr->fr_inv = false;
+		atomic_inc(&ic->i_fastreg_wrs);
+	} else {
+		atomic_inc(&ic->i_fastunreg_wrs);
	}
-
-	atomic_inc(&ic->i_fastreg_wrs);
 }
 
 void rds_ib_unreg_frmr(struct list_head *list, unsigned int *nfreed,
-- 
1.9.1



[net-next][PATCH v3 05/17] RDS: RDMA: fix the ib_map_mr_sg_zbva() argument

2017-01-02 Thread Santosh Shilimkar
Fixes warning: Using plain integer as NULL pointer

Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_frmr.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
index d921adc..66b3d62 100644
--- a/net/rds/ib_frmr.c
+++ b/net/rds/ib_frmr.c
@@ -104,14 +104,15 @@ static int rds_ib_post_reg_frmr(struct rds_ib_mr *ibmr)
	struct rds_ib_frmr *frmr = &ibmr->u.frmr;
struct ib_send_wr *failed_wr;
struct ib_reg_wr reg_wr;
-   int ret;
+   int ret, off = 0;
 
	while (atomic_dec_return(&ibmr->ic->i_fastreg_wrs) <= 0) {
		atomic_inc(&ibmr->ic->i_fastreg_wrs);
cpu_relax();
}
 
-   ret = ib_map_mr_sg_zbva(frmr->mr, ibmr->sg, ibmr->sg_len, 0, PAGE_SIZE);
+	ret = ib_map_mr_sg_zbva(frmr->mr, ibmr->sg, ibmr->sg_len,
+				&off, PAGE_SIZE);
if (unlikely(ret != ibmr->sg_len))
return ret < 0 ? ret : -EINVAL;
 
-- 
1.9.1



[net-next][PATCH v3 09/17] RDS: RDMA: silence the use_once mr log flood

2017-01-02 Thread Santosh Shilimkar
In the absence of extension headers, the message log will keep
flooding the console. Even without use_once we can clean up the
MRs, so it is not really an error-case message; make it a debug
message.

Signed-off-by: Santosh Shilimkar 
---
 net/rds/rdma.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/rds/rdma.c b/net/rds/rdma.c
index ea96114..4297f3f 100644
--- a/net/rds/rdma.c
+++ b/net/rds/rdma.c
@@ -415,7 +415,8 @@ void rds_rdma_unuse(struct rds_sock *rs, u32 r_key, int force)
	spin_lock_irqsave(&rs->rs_rdma_lock, flags);
	mr = rds_mr_tree_walk(&rs->rs_rdma_keys, r_key, NULL);
if (!mr) {
-	printk(KERN_ERR "rds: trying to unuse MR with unknown r_key %u!\n", r_key);
+   pr_debug("rds: trying to unuse MR with unknown r_key %u!\n",
+r_key);
	spin_unlock_irqrestore(&rs->rs_rdma_lock, flags);
return;
}
-- 
1.9.1



[net-next][PATCH v3 02/17] RDS: mark few internal functions static to make sparse build happy

2017-01-02 Thread Santosh Shilimkar
Fixes below warnings:
warning: symbol 'rds_send_probe' was not declared. Should it be static?
warning: symbol 'rds_send_ping' was not declared. Should it be static?
warning: symbol 'rds_tcp_accept_one_path' was not declared. Should it be static?
warning: symbol 'rds_walk_conn_path_info' was not declared. Should it be static?

Signed-off-by: Santosh Shilimkar 
---
 net/rds/connection.c | 10 +-
 net/rds/send.c   |  4 ++--
 net/rds/tcp_listen.c |  1 +
 3 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/net/rds/connection.c b/net/rds/connection.c
index fe9d31c..0e04dcc 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -545,11 +545,11 @@ void rds_for_each_conn_info(struct socket *sock, unsigned int len,
 }
 EXPORT_SYMBOL_GPL(rds_for_each_conn_info);
 
-void rds_walk_conn_path_info(struct socket *sock, unsigned int len,
-struct rds_info_iterator *iter,
-struct rds_info_lengths *lens,
-int (*visitor)(struct rds_conn_path *, void *),
-size_t item_len)
+static void rds_walk_conn_path_info(struct socket *sock, unsigned int len,
+   struct rds_info_iterator *iter,
+   struct rds_info_lengths *lens,
+				    int (*visitor)(struct rds_conn_path *, void *),
+   size_t item_len)
 {
u64  buffer[(item_len + 7) / 8];
struct hlist_head *head;
diff --git a/net/rds/send.c b/net/rds/send.c
index 77c8c6e..bb13c56 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -1169,7 +1169,7 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
  * or
  *   RDS_FLAG_HB_PONG|RDS_FLAG_ACK_REQUIRED
  */
-int
+static int
 rds_send_probe(struct rds_conn_path *cp, __be16 sport,
   __be16 dport, u8 h_flags)
 {
@@ -1238,7 +1238,7 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, 
size_t payload_len)
return rds_send_probe(cp, 0, dport, 0);
 }
 
-void
+static void
 rds_send_ping(struct rds_connection *conn)
 {
unsigned long flags;
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index f74bab3..67d0929 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -79,6 +79,7 @@ int rds_tcp_keepalive(struct socket *sock)
  * smaller ip address, we recycle conns in RDS_CONN_ERROR on the passive side
  * by moving them to CONNECTING in this function.
  */
+static
 struct rds_tcp_connection *rds_tcp_accept_one_path(struct rds_connection *conn)
 {
int i;
-- 
1.9.1



[PATCH] net: freescale: dpaa: use new api ethtool_{get|set}_link_ksettings

2017-01-02 Thread Philippe Reynes
The ethtool api {get|set}_settings is deprecated.
Move this driver to the new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c |   18 +-
 1 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c 
b/drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c
index 27e7044..15571e2 100644
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c
@@ -72,8 +72,8 @@
 #define DPAA_STATS_PERCPU_LEN ARRAY_SIZE(dpaa_stats_percpu)
 #define DPAA_STATS_GLOBAL_LEN ARRAY_SIZE(dpaa_stats_global)
 
-static int dpaa_get_settings(struct net_device *net_dev,
-struct ethtool_cmd *et_cmd)
+static int dpaa_get_link_ksettings(struct net_device *net_dev,
+  struct ethtool_link_ksettings *cmd)
 {
int err;
 
@@ -82,13 +82,13 @@ static int dpaa_get_settings(struct net_device *net_dev,
return 0;
}
 
-   err = phy_ethtool_gset(net_dev->phydev, et_cmd);
+   err = phy_ethtool_ksettings_get(net_dev->phydev, cmd);
 
return err;
 }
 
-static int dpaa_set_settings(struct net_device *net_dev,
-struct ethtool_cmd *et_cmd)
+static int dpaa_set_link_ksettings(struct net_device *net_dev,
+  const struct ethtool_link_ksettings *cmd)
 {
int err;
 
@@ -97,9 +97,9 @@ static int dpaa_set_settings(struct net_device *net_dev,
return -ENODEV;
}
 
-   err = phy_ethtool_sset(net_dev->phydev, et_cmd);
+   err = phy_ethtool_ksettings_set(net_dev->phydev, cmd);
if (err < 0)
-   netdev_err(net_dev, "phy_ethtool_sset() = %d\n", err);
+   netdev_err(net_dev, "phy_ethtool_ksettings_set() = %d\n", err);
 
return err;
 }
@@ -402,8 +402,6 @@ static void dpaa_get_strings(struct net_device *net_dev, 
u32 stringset,
 }
 
 const struct ethtool_ops dpaa_ethtool_ops = {
-   .get_settings = dpaa_get_settings,
-   .set_settings = dpaa_set_settings,
.get_drvinfo = dpaa_get_drvinfo,
.get_msglevel = dpaa_get_msglevel,
.set_msglevel = dpaa_set_msglevel,
@@ -414,4 +412,6 @@ static void dpaa_get_strings(struct net_device *net_dev, 
u32 stringset,
.get_sset_count = dpaa_get_sset_count,
.get_ethtool_stats = dpaa_get_ethtool_stats,
.get_strings = dpaa_get_strings,
+   .get_link_ksettings = dpaa_get_link_ksettings,
+   .set_link_ksettings = dpaa_set_link_ksettings,
 };
-- 
1.7.4.4



[net-next][PATCH v3 06/17] RDS: RDMA: start rdma listening after init

2017-01-02 Thread Santosh Shilimkar
From: Qing Huang 

This prevents RDS from handling incoming rdma packets before RDS
completes initializing its recv/send components.

Signed-off-by: Qing Huang 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/rdma_transport.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index d5f3117..fc59821 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -206,18 +206,13 @@ static int rds_rdma_init(void)
 {
int ret;
 
-   ret = rds_rdma_listen_init();
+   ret = rds_ib_init();
if (ret)
goto out;
 
-   ret = rds_ib_init();
+   ret = rds_rdma_listen_init();
if (ret)
-   goto err_ib_init;
-
-   goto out;
-
-err_ib_init:
-   rds_rdma_listen_stop();
+   rds_ib_exit();
 out:
return ret;
 }
-- 
1.9.1



[net-next][PATCH v3 01/17] RDS: log the address on bind failure

2017-01-02 Thread Santosh Shilimkar
It's useful to know the IP address when RDS fails to bind a
connection. Thus, adding it to the error message.

Orabug: 21894138
Reviewed-by: Wei Lin Guay 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/bind.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rds/bind.c b/net/rds/bind.c
index 095f6ce..3a915be 100644
--- a/net/rds/bind.c
+++ b/net/rds/bind.c
@@ -176,8 +176,8 @@ int rds_bind(struct socket *sock, struct sockaddr *uaddr, 
int addr_len)
if (!trans) {
ret = -EADDRNOTAVAIL;
rds_remove_bound(rs);
-   printk_ratelimited(KERN_INFO "RDS: rds_bind() could not find a 
transport, "
-   "load rds_tcp or rds_rdma?\n");
+   pr_info_ratelimited("RDS: %s could not find a transport for 
%pI4, load rds_tcp or rds_rdma?\n",
__func__, &sin->sin_addr.s_addr);
goto out;
}
 
-- 
1.9.1



[net-next][PATCH v3 03/17] RDS: IB: include faddr in connection log

2017-01-02 Thread Santosh Shilimkar
Also use pr_* for it.

Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_cm.c   | 19 +--
 net/rds/ib_recv.c |  4 ++--
 net/rds/ib_send.c |  4 ++--
 3 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 5b2ab95..b9da1e5 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -113,19 +113,18 @@ void rds_ib_cm_connect_complete(struct rds_connection 
*conn, struct rdma_cm_even
}
 
if (conn->c_version < RDS_PROTOCOL(3, 1)) {
-   printk(KERN_NOTICE "RDS/IB: Connection to %pI4 version %u.%u 
failed,"
-  " no longer supported\n",
-  &conn->c_faddr,
-  RDS_PROTOCOL_MAJOR(conn->c_version),
-  RDS_PROTOCOL_MINOR(conn->c_version));
+   pr_notice("RDS/IB: Connection <%pI4,%pI4> version %u.%u no 
longer supported\n",
+ &conn->c_laddr, &conn->c_faddr,
+ RDS_PROTOCOL_MAJOR(conn->c_version),
+ RDS_PROTOCOL_MINOR(conn->c_version));
rds_conn_destroy(conn);
return;
} else {
-   printk(KERN_NOTICE "RDS/IB: connected to %pI4 version 
%u.%u%s\n",
-  &conn->c_faddr,
-  RDS_PROTOCOL_MAJOR(conn->c_version),
-  RDS_PROTOCOL_MINOR(conn->c_version),
-  ic->i_flowctl ? ", flow control" : "");
+   pr_notice("RDS/IB: connected <%pI4,%pI4> version %u.%u%s\n",
+ &conn->c_laddr, &conn->c_faddr,
+ RDS_PROTOCOL_MAJOR(conn->c_version),
+ RDS_PROTOCOL_MINOR(conn->c_version),
+ ic->i_flowctl ? ", flow control" : "");
}
 
/*
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 606a11f..6803b75 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -980,8 +980,8 @@ void rds_ib_recv_cqe_handler(struct rds_ib_connection *ic,
} else {
/* We expect errors as the qp is drained during shutdown */
if (rds_conn_up(conn) || rds_conn_connecting(conn))
-   rds_ib_conn_error(conn, "recv completion on %pI4 had 
status %u (%s), disconnecting and reconnecting\n",
- &conn->c_faddr,
+   rds_ib_conn_error(conn, "recv completion on <%pI4,%pI4> 
had status %u (%s), disconnecting and reconnecting\n",
+ &conn->c_laddr, &conn->c_faddr,
  wc->status,
  ib_wc_status_msg(wc->status));
}
diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index 84d90c9..19eca5c 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -300,8 +300,8 @@ void rds_ib_send_cqe_handler(struct rds_ib_connection *ic, 
struct ib_wc *wc)
 
/* We expect errors as the qp is drained during shutdown */
if (wc->status != IB_WC_SUCCESS && rds_conn_up(conn)) {
-   rds_ib_conn_error(conn, "send completion on %pI4 had status %u 
(%s), disconnecting and reconnecting\n",
- &conn->c_faddr, wc->status,
+   rds_ib_conn_error(conn, "send completion on <%pI4,%pI4> had 
status %u (%s), disconnecting and reconnecting\n",
+ &conn->c_laddr, &conn->c_faddr, wc->status,
  ib_wc_status_msg(wc->status));
}
 }
-- 
1.9.1



[net-next][PATCH v3 07/17] RDS: RDMA: return appropriate error on rdma map failures

2017-01-02 Thread Santosh Shilimkar
The first message to a remote node should prompt a new
connection even if it is an RDMA operation. For an RDMA
operation the MR mapping can fail because the connection
is not yet up.

Since connection establishment is asynchronous, we make
sure the map failure caused by the unavailable connection
reaches the user via an appropriate error code. Before
returning to the user, we trigger the connection so that
it is ready for the next retry.

Signed-off-by: Santosh Shilimkar 
---
 net/rds/send.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/rds/send.c b/net/rds/send.c
index bb13c56..0a6f38b 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -945,6 +945,11 @@ static int rds_cmsg_send(struct rds_sock *rs, struct 
rds_message *rm,
ret = rds_cmsg_rdma_map(rs, rm, cmsg);
if (!ret)
*allocated_mr = 1;
+   else if (ret == -ENODEV)
+   /* Accommodate the get_mr() case which can fail
+* if connection isn't established yet.
+*/
+   ret = -EAGAIN;
break;
case RDS_CMSG_ATOMIC_CSWP:
case RDS_CMSG_ATOMIC_FADD:
@@ -1082,8 +1087,12 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, 
size_t payload_len)
 
/* Parse any control messages the user may have included. */
ret = rds_cmsg_send(rs, rm, msg, &allocated_mr);
-   if (ret)
+   if (ret) {
+   /* Trigger connection so that its ready for the next retry */
+   if (ret ==  -EAGAIN)
+   rds_conn_connect_if_down(conn);
goto out;
+   }
 
if (rm->rdma.op_active && !conn->c_trans->xmit_rdma) {
printk_ratelimited(KERN_NOTICE "rdma_op %p conn xmit_rdma %p\n",
-- 
1.9.1



[net-next][PATCH v3 15/17] RDS: add stat for socket recv memory usage

2017-01-02 Thread Santosh Shilimkar
From: Venkat Venkatsubra 

Tracks the receive side memory added to sockets and removed from sockets.
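The accounting is symmetric around the sign of the delta; here is a hedged stand-alone sketch of the counter update the patch adds to rds_recv_rcvbuf_delta(), with plain uint64_t counters standing in for the per-cpu rds_stats_add() machinery:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins for the two new rds_statistics counters. */
static uint64_t s_recv_bytes_added_to_socket;
static uint64_t s_recv_bytes_removed_from_socket;

/* Mirrors the hunk in rds_recv_rcvbuf_delta(): positive deltas are
 * bytes queued to the socket, negative deltas are bytes consumed. */
static void account_rcvbuf_delta(int delta)
{
	if (delta > 0)
		s_recv_bytes_added_to_socket += delta;
	else
		s_recv_bytes_removed_from_socket += -delta;
}
```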

Signed-off-by: Venkat Venkatsubra 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/rds.h  | 3 +++
 net/rds/recv.c | 4 
 2 files changed, 7 insertions(+)

diff --git a/net/rds/rds.h b/net/rds/rds.h
index 0bb8213..8ccd5a9 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -631,6 +631,9 @@ struct rds_statistics {
uint64_t s_cong_update_received;
uint64_t s_cong_send_error;
uint64_t s_cong_send_blocked;
+   uint64_t s_recv_bytes_added_to_socket;
+   uint64_t s_recv_bytes_removed_from_socket;
+
 };
 
 /* af_rds.c */
diff --git a/net/rds/recv.c b/net/rds/recv.c
index 9d0666e..ba19eee 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -94,6 +94,10 @@ static void rds_recv_rcvbuf_delta(struct rds_sock *rs, 
struct sock *sk,
return;
 
rs->rs_rcv_bytes += delta;
+   if (delta > 0)
+   rds_stats_add(s_recv_bytes_added_to_socket, delta);
+   else
+   rds_stats_add(s_recv_bytes_removed_from_socket, -delta);
now_congested = rs->rs_rcv_bytes > rds_sk_rcvbuf(rs);
 
rdsdebug("rs %p (%pI4:%u) recv bytes %d buf %d "
-- 
1.9.1



[net-next][PATCH v3 11/17] RDS: IB: add a few useful cache stats

2017-01-02 Thread Santosh Shilimkar
Tracks the ib receive cache total, incoming and frag allocations.

Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h   | 7 +++
 net/rds/ib_recv.c  | 6 ++
 net/rds/ib_stats.c | 2 ++
 3 files changed, 15 insertions(+)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index 5f02b4d..c62e551 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -151,6 +151,7 @@ struct rds_ib_connection {
u64 i_ack_recv; /* last ACK received */
struct rds_ib_refill_cache i_cache_incs;
struct rds_ib_refill_cache i_cache_frags;
+   atomic_t i_cache_allocs;
 
/* sending acks */
unsigned long   i_ack_flags;
@@ -254,6 +255,8 @@ struct rds_ib_statistics {
uint64_t s_ib_rx_refill_from_cq;
uint64_t s_ib_rx_refill_from_thread;
uint64_t s_ib_rx_alloc_limit;
+   uint64_t s_ib_rx_total_frags;
+   uint64_t s_ib_rx_total_incs;
uint64_t s_ib_rx_credit_updates;
uint64_t s_ib_ack_sent;
uint64_t s_ib_ack_send_failure;
@@ -276,6 +279,8 @@ struct rds_ib_statistics {
uint64_t s_ib_rdma_mr_1m_reused;
uint64_t s_ib_atomic_cswp;
uint64_t s_ib_atomic_fadd;
+   uint64_t s_ib_recv_added_to_cache;
+   uint64_t s_ib_recv_removed_from_cache;
 };
 
 extern struct workqueue_struct *rds_ib_wq;
@@ -406,6 +411,8 @@ int rds_ib_send_grab_credits(struct rds_ib_connection *ic, 
u32 wanted,
 /* ib_stats.c */
 DECLARE_PER_CPU(struct rds_ib_statistics, rds_ib_stats);
 #define rds_ib_stats_inc(member) rds_stats_inc_which(rds_ib_stats, member)
+#define rds_ib_stats_add(member, count) \
+   rds_stats_add_which(rds_ib_stats, member, count)
 unsigned int rds_ib_stats_info_copy(struct rds_info_iterator *iter,
unsigned int avail);
 
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 6803b75..4b0f126 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -194,6 +194,8 @@ static void rds_ib_frag_free(struct rds_ib_connection *ic,
rdsdebug("frag %p page %p\n", frag, sg_page(&frag->f_sg));
 
rds_ib_recv_cache_put(&frag->f_cache_entry, &ic->i_cache_frags);
+   atomic_add(RDS_FRAG_SIZE / SZ_1K, &ic->i_cache_allocs);
+   rds_ib_stats_add(s_ib_recv_added_to_cache, RDS_FRAG_SIZE);
 }
 
 /* Recycle inc after freeing attached frags */
@@ -261,6 +263,7 @@ static struct rds_ib_incoming *rds_ib_refill_one_inc(struct 
rds_ib_connection *i
atomic_dec(&rds_ib_allocation);
return NULL;
}
+   rds_ib_stats_inc(s_ib_rx_total_incs);
}
INIT_LIST_HEAD(&ibinc->ii_frags);
rds_inc_init(&ibinc->ii_inc, ic->conn, ic->conn->c_faddr);
@@ -278,6 +281,8 @@ static struct rds_page_frag *rds_ib_refill_one_frag(struct 
rds_ib_connection *ic
cache_item = rds_ib_recv_cache_get(&ic->i_cache_frags);
if (cache_item) {
frag = container_of(cache_item, struct rds_page_frag, 
f_cache_entry);
+   atomic_sub(RDS_FRAG_SIZE / SZ_1K, &ic->i_cache_allocs);
+   rds_ib_stats_add(s_ib_recv_removed_from_cache, RDS_FRAG_SIZE);
} else {
frag = kmem_cache_alloc(rds_ib_frag_slab, slab_mask);
if (!frag)
@@ -290,6 +295,7 @@ static struct rds_page_frag *rds_ib_refill_one_frag(struct 
rds_ib_connection *ic
kmem_cache_free(rds_ib_frag_slab, frag);
return NULL;
}
+   rds_ib_stats_inc(s_ib_rx_total_frags);
}
 
INIT_LIST_HEAD(>f_item);
diff --git a/net/rds/ib_stats.c b/net/rds/ib_stats.c
index 7e78dca..9252ad1 100644
--- a/net/rds/ib_stats.c
+++ b/net/rds/ib_stats.c
@@ -55,6 +55,8 @@
"ib_rx_refill_from_cq",
"ib_rx_refill_from_thread",
"ib_rx_alloc_limit",
+   "ib_rx_total_frags",
+   "ib_rx_total_incs",
"ib_rx_credit_updates",
"ib_ack_sent",
"ib_ack_send_failure",
-- 
1.9.1



[net-next][PATCH v3 17/17] RDS: add receive message trace used by application

2017-01-02 Thread Santosh Shilimkar
Add a socket option to tap receive path latency at various
stages, in nanoseconds. It can be enabled on selected sockets
using the SO_RDS_MSG_RXPATH_LATENCY socket option. RDS will
return the data to the application with RDS_CMSG_RXPATH_LATENCY
in the defined format. Scope is left to add more trace points
in the future without needing to change the interface.
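As a rough illustration, the range check this option performs in rds_recv_track_latency() can be mirrored in a stand-alone sketch; the enum and struct are copied from the uapi hunk, while the validate helper itself is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the additions to include/uapi/linux/rds.h */
enum rds_message_rxpath_latency {
	RDS_MSG_RX_HDR_TO_DGRAM_START = 0,
	RDS_MSG_RX_DGRAM_REASSEMBLE,
	RDS_MSG_RX_DGRAM_DELIVERED,
	RDS_MSG_RX_DGRAM_TRACE_MAX
};

struct rds_rx_trace_so {
	uint8_t rx_traces;
	uint8_t rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
};

/* Hypothetical helper: the same per-position range check that
 * rds_recv_track_latency() applies before enabling the traces;
 * returns 0 on success, -1 where the kernel returns -EFAULT. */
static int validate_rx_trace(const struct rds_rx_trace_so *t)
{
	for (int i = 0; i < t->rx_traces; i++)
		if (t->rx_trace_pos[i] > RDS_MSG_RX_DGRAM_TRACE_MAX)
			return -1;
	return 0;
}
```

A caller would then pass the struct to setsockopt(fd, SOL_RDS, SO_RDS_MSG_RXPATH_LATENCY, &trace, sizeof(trace)) and read RDS_CMSG_RXPATH_LATENCY ancillary data from recvmsg().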

Reviewed-by: Sowmini Varadhan 
Signed-off-by: Santosh Shilimkar 
---
 include/uapi/linux/rds.h | 33 +
 net/rds/af_rds.c | 28 
 net/rds/ib_recv.c|  4 
 net/rds/rds.h| 10 ++
 net/rds/recv.c   | 32 +---
 net/rds/tcp_recv.c   |  5 +
 6 files changed, 109 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
index 0f9265c..3833113 100644
--- a/include/uapi/linux/rds.h
+++ b/include/uapi/linux/rds.h
@@ -52,6 +52,13 @@
#define RDS_GET_MR_FOR_DEST 7
 #define SO_RDS_TRANSPORT   8
 
+/* Socket option to tap receive path latency
+ * SO_RDS: SO_RDS_MSG_RXPATH_LATENCY
+ * Format used struct rds_rx_trace_so
+ */
+#define SO_RDS_MSG_RXPATH_LATENCY  10
+
+
 /* supported values for SO_RDS_TRANSPORT */
#define RDS_TRANS_IB 0
#define RDS_TRANS_IWARP 1
@@ -77,6 +84,12 @@
  * the same as for the GET_MR setsockopt.
  * RDS_CMSG_RDMA_STATUS (recvmsg)
  * Returns the status of a completed RDMA operation.
+ * RDS_CMSG_RXPATH_LATENCY(recvmsg)
+ * Returns rds message latencies in various stages of receive
+ * path in nS. Its set per socket using SO_RDS_MSG_RXPATH_LATENCY
+ * socket option. Legitimate points are defined in
+ * enum rds_message_rxpath_latency. More points can be added in
+ * future. CMSG format is struct rds_cmsg_rx_trace.
  */
 #define RDS_CMSG_RDMA_ARGS 1
 #define RDS_CMSG_RDMA_DEST 2
@@ -87,6 +100,7 @@
 #define RDS_CMSG_ATOMIC_CSWP   7
#define RDS_CMSG_MASKED_ATOMIC_FADD 8
#define RDS_CMSG_MASKED_ATOMIC_CSWP 9
+#define RDS_CMSG_RXPATH_LATENCY 11
 
 #define RDS_INFO_FIRST 1
 #define RDS_INFO_COUNTERS  1
@@ -171,6 +185,25 @@ struct rds_info_rdma_connection {
uint32_trdma_mr_size;
 };
 
+/* RDS message Receive Path Latency points */
+enum rds_message_rxpath_latency {
+   RDS_MSG_RX_HDR_TO_DGRAM_START = 0,
+   RDS_MSG_RX_DGRAM_REASSEMBLE,
+   RDS_MSG_RX_DGRAM_DELIVERED,
+   RDS_MSG_RX_DGRAM_TRACE_MAX
+};
+
+struct rds_rx_trace_so {
+   u8 rx_traces;
+   u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
+};
+
+struct rds_cmsg_rx_trace {
+   u8 rx_traces;
+   u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
+   u64 rx_trace[RDS_MSG_RX_DGRAM_TRACE_MAX];
+};
+
 /*
  * Congestion monitoring.
  * Congestion control in RDS happens at the host connection
diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index 2ac1e61..fd821740 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -298,6 +298,30 @@ static int rds_enable_recvtstamp(struct sock *sk, char 
__user *optval,
return 0;
 }
 
+static int rds_recv_track_latency(struct rds_sock *rs, char __user *optval,
+ int optlen)
+{
+   struct rds_rx_trace_so trace;
+   int i;
+
+   if (optlen != sizeof(struct rds_rx_trace_so))
+   return -EFAULT;
+
if (copy_from_user(&trace, optval, sizeof(trace)))
+   return -EFAULT;
+
+   rs->rs_rx_traces = trace.rx_traces;
+   for (i = 0; i < rs->rs_rx_traces; i++) {
+   if (trace.rx_trace_pos[i] > RDS_MSG_RX_DGRAM_TRACE_MAX) {
+   rs->rs_rx_traces = 0;
+   return -EFAULT;
+   }
+   rs->rs_rx_trace[i] = trace.rx_trace_pos[i];
+   }
+
+   return 0;
+}
+
 static int rds_setsockopt(struct socket *sock, int level, int optname,
  char __user *optval, unsigned int optlen)
 {
@@ -338,6 +362,9 @@ static int rds_setsockopt(struct socket *sock, int level, 
int optname,
ret = rds_enable_recvtstamp(sock->sk, optval, optlen);
release_sock(sock->sk);
break;
+   case SO_RDS_MSG_RXPATH_LATENCY:
+   ret = rds_recv_track_latency(rs, optval, optlen);
+   break;
default:
ret = -ENOPROTOOPT;
}
@@ -484,6 +511,7 @@ static int __rds_create(struct socket *sock, struct sock 
*sk, int protocol)
INIT_LIST_HEAD(>rs_cong_list);
spin_lock_init(>rs_rdma_lock);
rs->rs_rdma_keys = RB_ROOT;
+   rs->rs_rx_traces = 0;
 
spin_lock_bh(&rds_sock_lock);
list_add_tail(&rs->rs_item, &rds_sock_list);
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 4b0f126..e10624a 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -911,8 +911,12 @@ static void 

[net-next][PATCH v3 16/17] RDS: make message size limit compliant with spec

2017-01-02 Thread Santosh Shilimkar
From: Avinash Repaka 

RDS supports a max message size of 1M, but the code doesn't
check this in all cases. This patch enforces the limit for both
RDMA and non-RDMA sends as well as the RDS MR size, irrespective
of the underlying transport.
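The MR size cap the patch adds to __rds_rdma_map() reduces to a one-line page-count check. A minimal sketch, assuming 4 KB pages (PAGE_SHIFT of 12, which the kernel macro abstracts away); mr_size_ok is an illustrative name, not a function in the patch:

```c
#include <assert.h>

#define PAGE_SHIFT       12                           /* assumed 4 KB pages */
#define RDS_MAX_MSG_SIZE ((unsigned int)(1 << 20))    /* 1 MB, from net/rds/rds.h */

/* Returns nonzero when an MR of nr_pages pages fits under the 1 MB cap.
 * One page of slack (nr_pages - 1) accounts for unaligned regions. */
static int mr_size_ok(unsigned int nr_pages)
{
	return (nr_pages - 1) <= (RDS_MAX_MSG_SIZE >> PAGE_SHIFT);
}
```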

Signed-off-by: Avinash Repaka 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/rdma.c |  9 -
 net/rds/rds.h  |  3 +++
 net/rds/send.c | 31 +++
 3 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/net/rds/rdma.c b/net/rds/rdma.c
index 138aef6..f06fac4 100644
--- a/net/rds/rdma.c
+++ b/net/rds/rdma.c
@@ -40,7 +40,6 @@
 /*
  * XXX
  *  - build with sparse
- *  - should we limit the size of a mr region?  let transport return failure?
  *  - should we detect duplicate keys on a socket?  hmm.
  *  - an rdma is an mlock, apply rlimit?
  */
@@ -200,6 +199,14 @@ static int __rds_rdma_map(struct rds_sock *rs, struct 
rds_get_mr_args *args,
goto out;
}
 
+   /* Restrict the size of mr irrespective of underlying transport
+* To account for unaligned mr regions, subtract one from nr_pages
+*/
+   if ((nr_pages - 1) > (RDS_MAX_MSG_SIZE >> PAGE_SHIFT)) {
+   ret = -EMSGSIZE;
+   goto out;
+   }
+
rdsdebug("RDS: get_mr addr %llx len %llu nr_pages %u\n",
args->vec.addr, args->vec.bytes, nr_pages);
 
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 8ccd5a9..f713194 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -50,6 +50,9 @@ void rdsdebug(char *fmt, ...)
 #define RDS_FRAG_SHIFT 12
 #define RDS_FRAG_SIZE  ((unsigned int)(1 << RDS_FRAG_SHIFT))
 
+/* Used to limit both RDMA and non-RDMA RDS message to 1MB */
+#define RDS_MAX_MSG_SIZE   ((unsigned int)(1 << 20))
+
 #define RDS_CONG_MAP_BYTES (65536 / 8)
 #define RDS_CONG_MAP_PAGES (PAGE_ALIGN(RDS_CONG_MAP_BYTES) / PAGE_SIZE)
 #define RDS_CONG_MAP_PAGE_BITS (PAGE_SIZE * 8)
diff --git a/net/rds/send.c b/net/rds/send.c
index 45e025b..5cc6403 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -994,6 +994,26 @@ static int rds_send_mprds_hash(struct rds_sock *rs, struct 
rds_connection *conn)
return hash;
 }
 
+static int rds_rdma_bytes(struct msghdr *msg, size_t *rdma_bytes)
+{
+   struct rds_rdma_args *args;
+   struct cmsghdr *cmsg;
+
+   for_each_cmsghdr(cmsg, msg) {
+   if (!CMSG_OK(msg, cmsg))
+   return -EINVAL;
+
+   if (cmsg->cmsg_level != SOL_RDS)
+   continue;
+
+   if (cmsg->cmsg_type == RDS_CMSG_RDMA_ARGS) {
+   args = CMSG_DATA(cmsg);
+   *rdma_bytes += args->remote_vec.bytes;
+   }
+   }
+   return 0;
+}
+
 int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 {
struct sock *sk = sock->sk;
@@ -1008,6 +1028,7 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, 
size_t payload_len)
int nonblock = msg->msg_flags & MSG_DONTWAIT;
long timeo = sock_sndtimeo(sk, nonblock);
struct rds_conn_path *cpath;
+   size_t total_payload_len = payload_len, rdma_payload_len = 0;
 
/* Mirror Linux UDP mirror of BSD error message compatibility */
/* XXX: Perhaps MSG_MORE someday */
@@ -1040,6 +1061,16 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, 
size_t payload_len)
}
release_sock(sk);
 
ret = rds_rdma_bytes(msg, &rdma_payload_len);
+   if (ret)
+   goto out;
+
+   total_payload_len += rdma_payload_len;
+   if (max_t(size_t, payload_len, rdma_payload_len) > RDS_MAX_MSG_SIZE) {
+   ret = -EMSGSIZE;
+   goto out;
+   }
+
if (payload_len > rds_sk_sndbuf(rs)) {
ret = -EMSGSIZE;
goto out;
-- 
1.9.1



[net-next][PATCH v3 13/17] RDS: RDMA: Fix the composite message user notification

2017-01-02 Thread Santosh Shilimkar
When an application sends an RDS RDMA composite message consisting
of an RDMA transfer followed by a non-RDMA payload, it expects to
be notified *only* when the full message gets delivered. RDS RDMA
notification doesn't behave this way, though.

Thanks to Venkat for debugging and root-causing the issue where
only the first part of the message (RDMA) was successfully
delivered but delivery of the remainder payload failed. In that
case, the application should not be notified with a false positive
of message delivery success.

Fix this case by making sure the user gets notified only after
the full message delivery.

Reviewed-by: Venkat Venkatsubra 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_send.c | 25 +++--
 net/rds/rdma.c| 10 ++
 net/rds/rds.h |  1 +
 net/rds/send.c|  4 +++-
 4 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index 19eca5c..5e72de1 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -69,16 +69,6 @@ static void rds_ib_send_complete(struct rds_message *rm,
complete(rm, notify_status);
 }
 
-static void rds_ib_send_unmap_data(struct rds_ib_connection *ic,
-  struct rm_data_op *op,
-  int wc_status)
-{
-   if (op->op_nents)
-   ib_dma_unmap_sg(ic->i_cm_id->device,
-   op->op_sg, op->op_nents,
-   DMA_TO_DEVICE);
-}
-
 static void rds_ib_send_unmap_rdma(struct rds_ib_connection *ic,
   struct rm_rdma_op *op,
   int wc_status)
@@ -139,6 +129,21 @@ static void rds_ib_send_unmap_atomic(struct 
rds_ib_connection *ic,
rds_ib_stats_inc(s_ib_atomic_fadd);
 }
 
+static void rds_ib_send_unmap_data(struct rds_ib_connection *ic,
+  struct rm_data_op *op,
+  int wc_status)
+{
+   struct rds_message *rm = container_of(op, struct rds_message, data);
+
+   if (op->op_nents)
+   ib_dma_unmap_sg(ic->i_cm_id->device,
+   op->op_sg, op->op_nents,
+   DMA_TO_DEVICE);
+
+   if (rm->rdma.op_active && rm->data.op_notify)
+   rds_ib_send_unmap_rdma(ic, &rm->rdma, wc_status);
+}
+
 /*
  * Unmap the resources associated with a struct send_work.
  *
diff --git a/net/rds/rdma.c b/net/rds/rdma.c
index 4297f3f..138aef6 100644
--- a/net/rds/rdma.c
+++ b/net/rds/rdma.c
@@ -627,6 +627,16 @@ int rds_cmsg_rdma_args(struct rds_sock *rs, struct 
rds_message *rm,
}
op->op_notifier->n_user_token = args->user_token;
op->op_notifier->n_status = RDS_RDMA_SUCCESS;
+
+   /* Enable rmda notification on data operation for composite
+* rds messages and make sure notification is enabled only
+* for the data operation which follows it so that application
+* gets notified only after full message gets delivered.
+*/
+   if (rm->data.op_sg) {
+   rm->rdma.op_notify = 0;
+   rm->data.op_notify = !!(args->flags & 
RDS_RDMA_NOTIFY_ME);
+   }
}
 
/* The cookie contains the R_Key of the remote memory region, and
diff --git a/net/rds/rds.h b/net/rds/rds.h
index ebbf909..0bb8213 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -419,6 +419,7 @@ struct rds_message {
} rdma;
struct rm_data_op {
unsigned int op_active:1;
+   unsigned int op_notify:1;
unsigned int op_nents;
unsigned int op_count;
unsigned int op_dmasg;
diff --git a/net/rds/send.c b/net/rds/send.c
index 0a6f38b..45e025b 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -476,12 +476,14 @@ void rds_rdma_send_complete(struct rds_message *rm, int 
status)
struct rm_rdma_op *ro;
struct rds_notifier *notifier;
unsigned long flags;
+   unsigned int notify = 0;
 
spin_lock_irqsave(&rm->m_rs_lock, flags);
 
+   notify =  rm->rdma.op_notify | rm->data.op_notify;
ro = &rm->rdma;
if (test_bit(RDS_MSG_ON_SOCK, >m_flags) &&
-   ro->op_active && ro->op_notify && ro->op_notifier) {
+   ro->op_active && notify && ro->op_notifier) {
notifier = ro->op_notifier;
rs = rm->m_rs;
sock_hold(rds_rs_to_sk(rs));
-- 
1.9.1



[net-next][PATCH v3 10/17] RDS: IB: track and log active side endpoint in connection

2017-01-02 Thread Santosh Shilimkar
It is useful to know the active and passive endpoints in
an RDS IB connection.

Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h|  3 +++
 net/rds/ib_cm.c | 11 +++
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index f14c26d..5f02b4d 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -181,6 +181,9 @@ struct rds_ib_connection {
 
/* Batched completions */
unsigned int i_unsignaled_wrs;
+
+   /* Endpoint role in connection */
+   bool i_active_side;
 };
 
 /* This assumes that atomic_t is at least 32 bits */
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 3002acf..4d1bf04 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -120,16 +120,17 @@ void rds_ib_cm_connect_complete(struct rds_connection 
*conn, struct rdma_cm_even
rds_conn_destroy(conn);
return;
} else {
-   pr_notice("RDS/IB: connected <%pI4,%pI4> version %u.%u%s\n",
+   pr_notice("RDS/IB: %s conn connected <%pI4,%pI4> version 
%u.%u%s\n",
+ ic->i_active_side ? "Active" : "Passive",
  &conn->c_laddr, &conn->c_faddr,
  RDS_PROTOCOL_MAJOR(conn->c_version),
  RDS_PROTOCOL_MINOR(conn->c_version),
  ic->i_flowctl ? ", flow control" : "");
}
 
-   /*
-* Init rings and fill recv. this needs to wait until protocol 
negotiation
-* is complete, since ring layout is different from 3.0 to 3.1.
+   /* Init rings and fill recv. this needs to wait until protocol
+* negotiation is complete, since ring layout is different
+* from 3.1 to 4.1.
 */
rds_ib_send_init_ring(ic);
rds_ib_recv_init_ring(ic);
@@ -685,6 +686,7 @@ int rds_ib_cm_initiate_connect(struct rdma_cm_id *cm_id)
if (ic->i_cm_id == cm_id)
ret = 0;
}
+   ic->i_active_side = true;
return ret;
 }
 
@@ -859,6 +861,7 @@ void rds_ib_conn_path_shutdown(struct rds_conn_path *cp)
ic->i_sends = NULL;
vfree(ic->i_recvs);
ic->i_recvs = NULL;
+   ic->i_active_side = false;
 }
 
 int rds_ib_conn_alloc(struct rds_connection *conn, gfp_t gfp)
-- 
1.9.1



[net-next][PATCH v3 14/17] RDS: IB: fix panic due to handlers running post teardown

2017-01-02 Thread Santosh Shilimkar
The shutdown code's reaping loop takes care of emptying the
CQs before they are destroyed, and once the tasklets are
killed, the handlers are not expected to run.

But because of issues in the core tasklet code, a tasklet
handler can still run even after tasklet_kill(). The RDS IB
shutdown code already reaps the CQs before freeing the cq/qp
resources, so the handlers have nothing left to do post
shutdown.

On the other hand, any handler running after teardown and
trying to access the already freed qp/cq resources causes a
panic. This patch fixes the race by making sure the handlers
return without any action post teardown.
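The guard pattern is simple: shutdown raises a flag after tasklet_kill(), and any handler that still fires checks it first. A user-space sketch with C11 atomics standing in for the kernel's atomic_t; the struct and helper names here are illustrative, not the driver's:

```c
#include <assert.h>
#include <stdatomic.h>

struct conn {
	atomic_int cq_quiesce;
	int polls;                 /* counts handler work, for illustration */
};

static void tasklet_handler(struct conn *ic)
{
	/* if the cq has already been reaped, ignore the late event */
	if (atomic_load(&ic->cq_quiesce))
		return;
	ic->polls++;               /* stands in for poll_scq()/poll_rcq() */
}

static void conn_shutdown(struct conn *ic)
{
	/* set after tasklet_kill(), before freeing qp/cq resources */
	atomic_store(&ic->cq_quiesce, 1);
}
```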

Reviewed-by: Wengang 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h|  1 +
 net/rds/ib_cm.c | 12 
 2 files changed, 13 insertions(+)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index 1fe9f79..5404589 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -185,6 +185,7 @@ struct rds_ib_connection {
 
/* Endpoint role in connection */
bool i_active_side;
+   atomic_t i_cq_quiesce;
 
/* Send/Recv vectors */
int i_scq_vector;
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 33c8584..ce3775a 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -128,6 +128,8 @@ void rds_ib_cm_connect_complete(struct rds_connection 
*conn, struct rdma_cm_even
  ic->i_flowctl ? ", flow control" : "");
}
 
+   atomic_set(&ic->i_cq_quiesce, 0);
+
/* Init rings and fill recv. this needs to wait until protocol
 * negotiation is complete, since ring layout is different
 * from 3.1 to 4.1.
@@ -267,6 +269,10 @@ static void rds_ib_tasklet_fn_send(unsigned long data)
 
rds_ib_stats_inc(s_ib_tasklet_call);
 
+   /* if cq has been already reaped, ignore incoming cq event */
+   if (atomic_read(&ic->i_cq_quiesce))
+   return;
+
poll_scq(ic, ic->i_send_cq, ic->i_send_wc);
ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP);
poll_scq(ic, ic->i_send_cq, ic->i_send_wc);
@@ -308,6 +314,10 @@ static void rds_ib_tasklet_fn_recv(unsigned long data)
 
rds_ib_stats_inc(s_ib_tasklet_call);
 
+   /* if cq has been already reaped, ignore incoming cq event */
+   if (atomic_read(&ic->i_cq_quiesce))
+   return;
+
memset(&state, 0, sizeof(state));
poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
ib_req_notify_cq(ic->i_recv_cq, IB_CQ_SOLICITED);
@@ -804,6 +814,8 @@ void rds_ib_conn_path_shutdown(struct rds_conn_path *cp)
tasklet_kill(&ic->i_send_tasklet);
tasklet_kill(&ic->i_recv_tasklet);
 
+   atomic_set(&ic->i_cq_quiesce, 1);
+
/* first destroy the ib state that generates callbacks */
if (ic->i_cm_id->qp)
rdma_destroy_qp(ic->i_cm_id);
-- 
1.9.1



[net-next][PATCH v3 12/17] RDS: IB: Add vector spreading for cqs

2017-01-02 Thread Santosh Shilimkar
Based on the available device vectors, allocate CQs accordingly
to get a better spread of completion vectors, which helps
performance a great deal.
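The least-loaded selection that ibdev_get_unused_vector() performs can be sketched stand-alone; this is a hedged version with a plain int array standing in for the rds_ib_device vector_load field:

```c
#include <assert.h>

/* Pick the completion vector with the smallest load and charge one
 * CQ to it; the scan mirrors ibdev_get_unused_vector() in the patch. */
static int get_unused_vector(int *vector_load, int nvec)
{
	int index = nvec - 1;
	int min = vector_load[index];

	for (int i = nvec - 1; i >= 0; i--) {
		if (vector_load[i] < min) {
			index = i;
			min = vector_load[i];
		}
	}
	vector_load[index]++;
	return index;
}

/* Undo the charge when CQ creation fails or the connection goes away. */
static void put_vector(int *vector_load, int index)
{
	vector_load[index]--;
}
```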

Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.c| 11 +++
 net/rds/ib.h|  5 +
 net/rds/ib_cm.c | 40 +---
 3 files changed, 53 insertions(+), 3 deletions(-)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index 5680d90..8d70884 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -111,6 +111,9 @@ static void rds_ib_dev_free(struct work_struct *work)
kfree(i_ipaddr);
}
 
+   if (rds_ibdev->vector_load)
+   kfree(rds_ibdev->vector_load);
+
kfree(rds_ibdev);
 }
 
@@ -159,6 +162,14 @@ static void rds_ib_add_one(struct ib_device *device)
rds_ibdev->max_initiator_depth = device->attrs.max_qp_init_rd_atom;
rds_ibdev->max_responder_resources = device->attrs.max_qp_rd_atom;
 
+   rds_ibdev->vector_load = kzalloc(sizeof(int) * device->num_comp_vectors,
+GFP_KERNEL);
+   if (!rds_ibdev->vector_load) {
+   pr_err("RDS/IB: %s failed to allocate vector memory\n",
+   __func__);
+   goto put_dev;
+   }
+
rds_ibdev->dev = device;
rds_ibdev->pd = ib_alloc_pd(device, 0);
if (IS_ERR(rds_ibdev->pd)) {
diff --git a/net/rds/ib.h b/net/rds/ib.h
index c62e551..1fe9f79 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -185,6 +185,10 @@ struct rds_ib_connection {
 
/* Endpoint role in connection */
bool i_active_side;
+
+   /* Send/Recv vectors */
+   int i_scq_vector;
+   int i_rcq_vector;
 };
 
 /* This assumes that atomic_t is at least 32 bits */
@@ -227,6 +231,7 @@ struct rds_ib_device {
spinlock_t  spinlock;   /* protect the above */
atomic_trefcount;
struct work_struct  free_work;
+   int *vector_load;
 };
 
 #define ibdev_to_node(ibdev) dev_to_node(ibdev->dma_device)
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 4d1bf04..33c8584 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -358,6 +358,28 @@ static void rds_ib_cq_comp_handler_send(struct ib_cq *cq, void *context)
tasklet_schedule(&ic->i_send_tasklet);
 }
 
+static inline int ibdev_get_unused_vector(struct rds_ib_device *rds_ibdev)
+{
+   int min = rds_ibdev->vector_load[rds_ibdev->dev->num_comp_vectors - 1];
+   int index = rds_ibdev->dev->num_comp_vectors - 1;
+   int i;
+
+   for (i = rds_ibdev->dev->num_comp_vectors - 1; i >= 0; i--) {
+   if (rds_ibdev->vector_load[i] < min) {
+   index = i;
+   min = rds_ibdev->vector_load[i];
+   }
+   }
+
+   rds_ibdev->vector_load[index]++;
+   return index;
+}
+
+static inline void ibdev_put_vector(struct rds_ib_device *rds_ibdev, int index)
+{
+   rds_ibdev->vector_load[index]--;
+}
+
 /*
  * This needs to be very careful to not leave IS_ERR pointers around for
  * cleanup to trip over.
@@ -399,25 +421,30 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
/* Protection domain and memory range */
ic->i_pd = rds_ibdev->pd;
 
+   ic->i_scq_vector = ibdev_get_unused_vector(rds_ibdev);
cq_attr.cqe = ic->i_send_ring.w_nr + fr_queue_space + 1;
-
+   cq_attr.comp_vector = ic->i_scq_vector;
ic->i_send_cq = ib_create_cq(dev, rds_ib_cq_comp_handler_send,
 rds_ib_cq_event_handler, conn,
 &cq_attr);
if (IS_ERR(ic->i_send_cq)) {
ret = PTR_ERR(ic->i_send_cq);
ic->i_send_cq = NULL;
+   ibdev_put_vector(rds_ibdev, ic->i_scq_vector);
rdsdebug("ib_create_cq send failed: %d\n", ret);
goto out;
}
 
+   ic->i_rcq_vector = ibdev_get_unused_vector(rds_ibdev);
cq_attr.cqe = ic->i_recv_ring.w_nr;
+   cq_attr.comp_vector = ic->i_rcq_vector;
ic->i_recv_cq = ib_create_cq(dev, rds_ib_cq_comp_handler_recv,
 rds_ib_cq_event_handler, conn,
 &cq_attr);
if (IS_ERR(ic->i_recv_cq)) {
ret = PTR_ERR(ic->i_recv_cq);
ic->i_recv_cq = NULL;
+   ibdev_put_vector(rds_ibdev, ic->i_rcq_vector);
rdsdebug("ib_create_cq recv failed: %d\n", ret);
goto out;
}
@@ -780,10 +807,17 @@ void rds_ib_conn_path_shutdown(struct rds_conn_path *cp)
/* first destroy the ib state that generates callbacks */
if (ic->i_cm_id->qp)
rdma_destroy_qp(ic->i_cm_id);
-   if (ic->i_send_cq)
+   if (ic->i_send_cq) {
+   if (ic->rds_ibdev)
+

Re: [PATCH net 9/9] virtio-net: XDP support for small buffers

2017-01-02 Thread John Fastabend
On 16-12-23 06:37 AM, Jason Wang wrote:
> Commit f600b6905015 ("virtio_net: Add XDP support") leaves the case of
> small receive buffer untouched. This will confuse the user who want to
> set XDP but use small buffers. Other than forbid XDP in small buffer
> mode, let's make it work. XDP then can only work at skb->data since
> virtio-net create skbs during refill, this is sub optimal which could
> be optimized in the future.
> 
> Cc: John Fastabend 
> Signed-off-by: Jason Wang 
> ---
>  drivers/net/virtio_net.c | 112 
> ---
>  1 file changed, 87 insertions(+), 25 deletions(-)
> 

Hi Jason,

I was doing some more testing on this. What do you think about doing this
so that free_unused_bufs() handles the buffer free with dev_kfree_skb()
instead of put_page in small receive mode? Seems more correct to me.


diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 783e842..27ff76c 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1898,6 +1898,10 @@ static void free_receive_page_frags(struct virtnet_info *vi)

 static bool is_xdp_queue(struct virtnet_info *vi, int q)
 {
+   /* For small receive mode always use kfree_skb variants */
+   if (!vi->mergeable_rx_bufs)
+   return false;
+
if (q < (vi->curr_queue_pairs - vi->xdp_queue_pairs))
return false;
else if (q < vi->curr_queue_pairs)


The patch is untested; I just spotted this doing code review.

Thanks,
John


[net PATCH] net: virtio: cap mtu when XDP programs are running

2017-01-02 Thread John Fastabend
XDP programs can not consume multiple pages so we cap the MTU to
avoid this case. Virtio-net however only checks the MTU at XDP
program load and does not block MTU changes after the program
has loaded.

This patch sets/clears the max_mtu value at XDP load/unload time.

Signed-off-by: John Fastabend 
---
 drivers/net/virtio_net.c |9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 5deeda6..783e842 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1699,6 +1699,9 @@ static void virtnet_init_settings(struct net_device *dev)
.set_settings = virtnet_set_settings,
 };
 
+#define MIN_MTU ETH_MIN_MTU
+#define MAX_MTU ETH_MAX_MTU
+
 static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 {
unsigned long int max_sz = PAGE_SIZE - sizeof(struct padded_vnet_hdr);
@@ -1748,6 +1751,9 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
virtnet_set_queues(vi, curr_qp);
return PTR_ERR(prog);
}
+   dev->max_mtu = max_sz;
+   } else {
+   dev->max_mtu = ETH_MAX_MTU;
}
 
vi->xdp_queue_pairs = xdp_qp;
@@ -2133,9 +2139,6 @@ static bool virtnet_validate_features(struct virtio_device *vdev)
return true;
 }
 
-#define MIN_MTU ETH_MIN_MTU
-#define MAX_MTU ETH_MAX_MTU
-
 static int virtnet_probe(struct virtio_device *vdev)
 {
int i, err;



Re: [PATCH v2 net-next 2/2] tools: test case for TPACKET_V3/TX_RING support

2017-01-02 Thread Willem de Bruijn
On Sun, Jan 1, 2017 at 5:45 PM, Sowmini Varadhan
 wrote:
> Add a test case and sample code for (TPACKET_V3, PACKET_TX_RING)

Thanks for adding this.

walk_v3_tx is almost identical to walk_v1_v2_tx. That function can
just be extended to add a v3 case where it already multiplexes between
v1 and v2.


[PATCH nf-next 7/7] xtables: extend matches and targets with .usersize

2017-01-02 Thread Willem de Bruijn
From: Willem de Bruijn 

In matches and targets that define a kernel-only tail to their
xt_match and xt_target data structs, add a field .usersize that
specifies up to where data is to be shared with userspace.

Performed a search for comment "Used internally by the kernel" to find
relevant matches and targets. Manually inspected the structs to derive
a valid offsetof.

Signed-off-by: Willem de Bruijn 
---
 net/bridge/netfilter/ebt_limit.c   | 1 +
 net/ipv4/netfilter/ipt_CLUSTERIP.c | 1 +
 net/ipv6/netfilter/ip6t_NPT.c  | 2 ++
 net/netfilter/xt_CT.c  | 3 +++
 net/netfilter/xt_RATEEST.c | 1 +
 net/netfilter/xt_TEE.c | 2 ++
 net/netfilter/xt_bpf.c | 2 ++
 net/netfilter/xt_cgroup.c  | 1 +
 net/netfilter/xt_connlimit.c   | 1 +
 net/netfilter/xt_hashlimit.c   | 4 
 net/netfilter/xt_limit.c   | 2 ++
 net/netfilter/xt_quota.c   | 1 +
 net/netfilter/xt_rateest.c | 1 +
 net/netfilter/xt_string.c  | 1 +
 14 files changed, 23 insertions(+)

diff --git a/net/bridge/netfilter/ebt_limit.c b/net/bridge/netfilter/ebt_limit.c
index 517e78b..61a9f1b 100644
--- a/net/bridge/netfilter/ebt_limit.c
+++ b/net/bridge/netfilter/ebt_limit.c
@@ -105,6 +105,7 @@ static struct xt_match ebt_limit_mt_reg __read_mostly = {
.match  = ebt_limit_mt,
.checkentry = ebt_limit_mt_check,
.matchsize  = sizeof(struct ebt_limit_info),
+   .usersize   = offsetof(struct ebt_limit_info, prev),
 #ifdef CONFIG_COMPAT
.compatsize = sizeof(struct ebt_compat_limit_info),
 #endif
diff --git a/net/ipv4/netfilter/ipt_CLUSTERIP.c b/net/ipv4/netfilter/ipt_CLUSTERIP.c
index 21db00d..8a3d20e 100644
--- a/net/ipv4/netfilter/ipt_CLUSTERIP.c
+++ b/net/ipv4/netfilter/ipt_CLUSTERIP.c
@@ -468,6 +468,7 @@ static struct xt_target clusterip_tg_reg __read_mostly = {
.checkentry = clusterip_tg_check,
.destroy= clusterip_tg_destroy,
.targetsize = sizeof(struct ipt_clusterip_tgt_info),
+   .usersize   = offsetof(struct ipt_clusterip_tgt_info, config),
 #ifdef CONFIG_COMPAT
.compatsize = sizeof(struct compat_ipt_clusterip_tgt_info),
 #endif /* CONFIG_COMPAT */
diff --git a/net/ipv6/netfilter/ip6t_NPT.c b/net/ipv6/netfilter/ip6t_NPT.c
index 590f767..a379d2f 100644
--- a/net/ipv6/netfilter/ip6t_NPT.c
+++ b/net/ipv6/netfilter/ip6t_NPT.c
@@ -112,6 +112,7 @@ static struct xt_target ip6t_npt_target_reg[] __read_mostly = {
.table  = "mangle",
.target = ip6t_snpt_tg,
.targetsize = sizeof(struct ip6t_npt_tginfo),
+   .usersize   = offsetof(struct ip6t_npt_tginfo, adjustment),
.checkentry = ip6t_npt_checkentry,
.family = NFPROTO_IPV6,
.hooks  = (1 << NF_INET_LOCAL_IN) |
@@ -123,6 +124,7 @@ static struct xt_target ip6t_npt_target_reg[] __read_mostly = {
.table  = "mangle",
.target = ip6t_dnpt_tg,
.targetsize = sizeof(struct ip6t_npt_tginfo),
+   .usersize   = offsetof(struct ip6t_npt_tginfo, adjustment),
.checkentry = ip6t_npt_checkentry,
.family = NFPROTO_IPV6,
.hooks  = (1 << NF_INET_PRE_ROUTING) |
diff --git a/net/netfilter/xt_CT.c b/net/netfilter/xt_CT.c
index 95c7503..26b0bccfa 100644
--- a/net/netfilter/xt_CT.c
+++ b/net/netfilter/xt_CT.c
@@ -373,6 +373,7 @@ static struct xt_target xt_ct_tg_reg[] __read_mostly = {
.name   = "CT",
.family = NFPROTO_UNSPEC,
.targetsize = sizeof(struct xt_ct_target_info),
+   .usersize   = offsetof(struct xt_ct_target_info, ct),
.checkentry = xt_ct_tg_check_v0,
.destroy= xt_ct_tg_destroy_v0,
.target = xt_ct_target_v0,
@@ -384,6 +385,7 @@ static struct xt_target xt_ct_tg_reg[] __read_mostly = {
.family = NFPROTO_UNSPEC,
.revision   = 1,
.targetsize = sizeof(struct xt_ct_target_info_v1),
+   .usersize   = offsetof(struct xt_ct_target_info, ct),
.checkentry = xt_ct_tg_check_v1,
.destroy= xt_ct_tg_destroy_v1,
.target = xt_ct_target_v1,
@@ -395,6 +397,7 @@ static struct xt_target xt_ct_tg_reg[] __read_mostly = {
.family = NFPROTO_UNSPEC,
.revision   = 2,
.targetsize = sizeof(struct xt_ct_target_info_v1),
+   .usersize   = offsetof(struct xt_ct_target_info, ct),
.checkentry = xt_ct_tg_check_v2,
.destroy= xt_ct_tg_destroy_v1,
.target = xt_ct_target_v1,
diff 

[PATCH nf-next 6/7] xtables: use match, target and data copy_to_user helpers in compat

2017-01-02 Thread Willem de Bruijn
From: Willem de Bruijn 

Convert compat to copying entries, matches and targets one by one,
using the xt_match_to_user and xt_target_to_user helper functions.

Signed-off-by: Willem de Bruijn 
---
 net/netfilter/x_tables.c | 14 --
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index feccf52..016db6b 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -619,17 +619,14 @@ int xt_compat_match_to_user(const struct xt_entry_match *m,
int off = xt_compat_match_offset(match);
u_int16_t msize = m->u.user.match_size - off;
 
-   if (copy_to_user(cm, m, sizeof(*cm)) ||
-   put_user(msize, &cm->u.user.match_size) ||
-   copy_to_user(cm->u.user.name, m->u.kernel.match->name,
-strlen(m->u.kernel.match->name) + 1))
+   if (XT_OBJ_TO_USER(cm, m, match, msize))
return -EFAULT;
 
if (match->compat_to_user) {
if (match->compat_to_user((void __user *)cm->data, m->data))
return -EFAULT;
} else {
-   if (copy_to_user(cm->data, m->data, msize - sizeof(*cm)))
+   if (XT_DATA_TO_USER(cm, m, match, msize - sizeof(*cm)))
return -EFAULT;
}
 
@@ -977,17 +974,14 @@ int xt_compat_target_to_user(const struct xt_entry_target *t,
int off = xt_compat_target_offset(target);
u_int16_t tsize = t->u.user.target_size - off;
 
-   if (copy_to_user(ct, t, sizeof(*ct)) ||
-   put_user(tsize, &ct->u.user.target_size) ||
-   copy_to_user(ct->u.user.name, t->u.kernel.target->name,
-strlen(t->u.kernel.target->name) + 1))
+   if (XT_OBJ_TO_USER(ct, t, target, tsize))
return -EFAULT;
 
if (target->compat_to_user) {
if (target->compat_to_user((void __user *)ct->data, t->data))
return -EFAULT;
} else {
-   if (copy_to_user(ct->data, t->data, tsize - sizeof(*ct)))
+   if (XT_DATA_TO_USER(ct, t, target, tsize - sizeof(*ct)))
return -EFAULT;
}
 
-- 
2.8.0.rc3.226.g39d4020



[PATCH nf-next 4/7] arptables: use match, target and data copy_to_user helpers

2017-01-02 Thread Willem de Bruijn
From: Willem de Bruijn 

Convert arptables to copying entries, matches and targets one by one,
using the xt_match_to_user and xt_target_to_user helper functions.

Signed-off-by: Willem de Bruijn 
---
 net/ipv4/netfilter/arp_tables.c | 15 +--
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index a467e12..6241a81 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -677,11 +677,6 @@ static int copy_entries_to_user(unsigned int total_size,
return PTR_ERR(counters);
 
loc_cpu_entry = private->entries;
-   /* ... then copy entire thing ... */
-   if (copy_to_user(userptr, loc_cpu_entry, total_size) != 0) {
-   ret = -EFAULT;
-   goto free_counters;
-   }
 
/* FIXME: use iterator macros --RR */
/* ... then go back and fix counters and names */
@@ -689,6 +684,10 @@ static int copy_entries_to_user(unsigned int total_size,
const struct xt_entry_target *t;
 
e = (struct arpt_entry *)(loc_cpu_entry + off);
+   if (copy_to_user(userptr + off, e, sizeof(*e))) {
+   ret = -EFAULT;
+   goto free_counters;
+   }
if (copy_to_user(userptr + off
 + offsetof(struct arpt_entry, counters),
 &counters[num],
@@ -698,11 +697,7 @@ static int copy_entries_to_user(unsigned int total_size,
}
 
t = arpt_get_target_c(e);
-   if (copy_to_user(userptr + off + e->target_offset
-+ offsetof(struct xt_entry_target,
-   u.user.name),
-t->u.kernel.target->name,
-strlen(t->u.kernel.target->name)+1) != 0) {
+   if (xt_target_to_user(t, userptr + off + e->target_offset)) {
ret = -EFAULT;
goto free_counters;
}
-- 
2.8.0.rc3.226.g39d4020



[PATCH nf-next 3/7] ip6tables: use match, target and data copy_to_user helpers

2017-01-02 Thread Willem de Bruijn
From: Willem de Bruijn 

Convert ip6tables to copying entries, matches and targets one by one,
using the xt_match_to_user and xt_target_to_user helper functions.

Signed-off-by: Willem de Bruijn 
---
 net/ipv6/netfilter/ip6_tables.c | 21 ++---
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index 25a022d..1e15c54 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -855,10 +855,6 @@ copy_entries_to_user(unsigned int total_size,
return PTR_ERR(counters);
 
loc_cpu_entry = private->entries;
-   if (copy_to_user(userptr, loc_cpu_entry, total_size) != 0) {
-   ret = -EFAULT;
-   goto free_counters;
-   }
 
/* FIXME: use iterator macros --RR */
/* ... then go back and fix counters and names */
@@ -868,6 +864,10 @@ copy_entries_to_user(unsigned int total_size,
const struct xt_entry_target *t;
 
e = (struct ip6t_entry *)(loc_cpu_entry + off);
+   if (copy_to_user(userptr + off, e, sizeof(*e))) {
+   ret = -EFAULT;
+   goto free_counters;
+   }
if (copy_to_user(userptr + off
 + offsetof(struct ip6t_entry, counters),
 &counters[num],
@@ -881,23 +881,14 @@ copy_entries_to_user(unsigned int total_size,
 i += m->u.match_size) {
m = (void *)e + i;
 
-   if (copy_to_user(userptr + off + i
-+ offsetof(struct xt_entry_match,
-   u.user.name),
-m->u.kernel.match->name,
-strlen(m->u.kernel.match->name)+1)
-   != 0) {
+   if (xt_match_to_user(m, userptr + off + i)) {
ret = -EFAULT;
goto free_counters;
}
}
 
t = ip6t_get_target_c(e);
-   if (copy_to_user(userptr + off + e->target_offset
-+ offsetof(struct xt_entry_target,
-   u.user.name),
-t->u.kernel.target->name,
-strlen(t->u.kernel.target->name)+1) != 0) {
+   if (xt_target_to_user(t, userptr + off + e->target_offset)) {
ret = -EFAULT;
goto free_counters;
}
-- 
2.8.0.rc3.226.g39d4020



[PATCH nf-next 5/7] ebtables: use match, target and data copy_to_user helpers

2017-01-02 Thread Willem de Bruijn
From: Willem de Bruijn 

Convert ebtables to copying entries, matches and targets one by one.

The solution is analogous to that of generic xt_(match|target)_to_user
helpers, but is applied to different structs.

Convert the existing ebt_make_XXXname helpers, which overwrite
fields of an already copy_to_user'd struct, into ebt_XXX_to_user
helpers that copy all relevant fields of the struct from scratch.

Signed-off-by: Willem de Bruijn 
---
 net/bridge/netfilter/ebtables.c | 78 +
 1 file changed, 47 insertions(+), 31 deletions(-)

diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c
index 537e3d5..79b6991 100644
--- a/net/bridge/netfilter/ebtables.c
+++ b/net/bridge/netfilter/ebtables.c
@@ -1346,56 +1346,72 @@ static int update_counters(struct net *net, const void __user *user,
hlp.num_counters, user, len);
 }
 
-static inline int ebt_make_matchname(const struct ebt_entry_match *m,
-const char *base, char __user *ubase)
+static inline int ebt_obj_to_user(char __user *um, const char *_name,
+ const char *data, int entrysize,
+ int usersize, int datasize)
 {
-   char __user *hlp = ubase + ((char *)m - base);
-   char name[EBT_FUNCTION_MAXNAMELEN] = {};
+   char name[EBT_FUNCTION_MAXNAMELEN] = {0};
 
/* ebtables expects 32 bytes long names but xt_match names are 29 bytes
 * long. Copy 29 bytes and fill remaining bytes with zeroes.
 */
-   strlcpy(name, m->u.match->name, sizeof(name));
-   if (copy_to_user(hlp, name, EBT_FUNCTION_MAXNAMELEN))
+   strlcpy(name, _name, sizeof(name));
+   if (copy_to_user(um, name, EBT_FUNCTION_MAXNAMELEN) ||
+   put_user(datasize, (int __user *)(um + EBT_FUNCTION_MAXNAMELEN)) ||
+   xt_data_to_user(um + entrysize, data, usersize, datasize))
return -EFAULT;
+
return 0;
 }
 
-static inline int ebt_make_watchername(const struct ebt_entry_watcher *w,
-  const char *base, char __user *ubase)
+static inline int ebt_match_to_user(const struct ebt_entry_match *m,
+   const char *base, char __user *ubase)
 {
-   char __user *hlp = ubase + ((char *)w - base);
-   char name[EBT_FUNCTION_MAXNAMELEN] = {};
+   return ebt_obj_to_user(ubase + ((char *)m - base),
+  m->u.match->name, m->data, sizeof(*m),
+  m->u.match->usersize, m->match_size);
+}
 
-   strlcpy(name, w->u.watcher->name, sizeof(name));
-   if (copy_to_user(hlp, name, EBT_FUNCTION_MAXNAMELEN))
-   return -EFAULT;
-   return 0;
+static inline int ebt_watcher_to_user(const struct ebt_entry_watcher *w,
+ const char *base, char __user *ubase)
+{
+   return ebt_obj_to_user(ubase + ((char *)w - base),
+  w->u.watcher->name, w->data, sizeof(*w),
+  w->u.watcher->usersize, w->watcher_size);
 }
 
-static inline int ebt_make_names(struct ebt_entry *e, const char *base,
-char __user *ubase)
+static inline int ebt_entry_to_user(struct ebt_entry *e, const char *base,
+   char __user *ubase)
 {
int ret;
char __user *hlp;
const struct ebt_entry_target *t;
-   char name[EBT_FUNCTION_MAXNAMELEN] = {};
 
-   if (e->bitmask == 0)
+   if (e->bitmask == 0) {
+   /* special case !EBT_ENTRY_OR_ENTRIES */
+   if (copy_to_user(ubase + ((char *)e - base), e,
+sizeof(struct ebt_entries)))
+   return -EFAULT;
return 0;
+   }
+
+   if (copy_to_user(ubase + ((char *)e - base), e, sizeof(*e)))
+   return -EFAULT;
 
hlp = ubase + (((char *)e + e->target_offset) - base);
t = (struct ebt_entry_target *)(((char *)e) + e->target_offset);
 
-   ret = EBT_MATCH_ITERATE(e, ebt_make_matchname, base, ubase);
+   ret = EBT_MATCH_ITERATE(e, ebt_match_to_user, base, ubase);
if (ret != 0)
return ret;
-   ret = EBT_WATCHER_ITERATE(e, ebt_make_watchername, base, ubase);
+   ret = EBT_WATCHER_ITERATE(e, ebt_watcher_to_user, base, ubase);
if (ret != 0)
return ret;
-   strlcpy(name, t->u.target->name, sizeof(name));
-   if (copy_to_user(hlp, name, EBT_FUNCTION_MAXNAMELEN))
-   return -EFAULT;
+   ret = ebt_obj_to_user(hlp, t->u.target->name, t->data, sizeof(*t),
+ t->u.target->usersize, t->target_size);
+   if (ret != 0)
+   return ret;
+
return 0;
 }
 
@@ -1475,13 +1491,9 @@ static int copy_everything_to_user(struct ebt_table *t, 
void 

[PATCH nf-next 2/7] iptables: use match, target and data copy_to_user helpers

2017-01-02 Thread Willem de Bruijn
From: Willem de Bruijn 

Convert iptables to copying entries, matches and targets one by one,
using the xt_match_to_user and xt_target_to_user helper functions.

Signed-off-by: Willem de Bruijn 
---
 net/ipv4/netfilter/ip_tables.c | 21 ++---
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 91656a1..384b857 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -826,10 +826,6 @@ copy_entries_to_user(unsigned int total_size,
return PTR_ERR(counters);
 
loc_cpu_entry = private->entries;
-   if (copy_to_user(userptr, loc_cpu_entry, total_size) != 0) {
-   ret = -EFAULT;
-   goto free_counters;
-   }
 
/* FIXME: use iterator macros --RR */
/* ... then go back and fix counters and names */
@@ -839,6 +835,10 @@ copy_entries_to_user(unsigned int total_size,
const struct xt_entry_target *t;
 
e = (struct ipt_entry *)(loc_cpu_entry + off);
+   if (copy_to_user(userptr + off, e, sizeof(*e))) {
+   ret = -EFAULT;
+   goto free_counters;
+   }
if (copy_to_user(userptr + off
 + offsetof(struct ipt_entry, counters),
 &counters[num],
@@ -852,23 +852,14 @@ copy_entries_to_user(unsigned int total_size,
 i += m->u.match_size) {
m = (void *)e + i;
 
-   if (copy_to_user(userptr + off + i
-+ offsetof(struct xt_entry_match,
-   u.user.name),
-m->u.kernel.match->name,
-strlen(m->u.kernel.match->name)+1)
-   != 0) {
+   if (xt_match_to_user(m, userptr + off + i)) {
ret = -EFAULT;
goto free_counters;
}
}
 
t = ipt_get_target_c(e);
-   if (copy_to_user(userptr + off + e->target_offset
-+ offsetof(struct xt_entry_target,
-   u.user.name),
-t->u.kernel.target->name,
-strlen(t->u.kernel.target->name)+1) != 0) {
+   if (xt_target_to_user(t, userptr + off + e->target_offset)) {
ret = -EFAULT;
goto free_counters;
}
-- 
2.8.0.rc3.226.g39d4020



[PATCH nf-next 1/7] xtables: add xt_match, xt_target and data copy_to_user functions

2017-01-02 Thread Willem de Bruijn
From: Willem de Bruijn 

xt_entry_target, xt_entry_match and their private data may contain
kernel data.

Introduce helper functions xt_match_to_user, xt_target_to_user and
xt_data_to_user that copy only the expected fields. These replace
existing logic that calls copy_to_user on entire structs, then
overwrites select fields.

Private data is defined in xt_match and xt_target. All matches and
targets that maintain kernel data store this at the tail of their
private structure. Extend xt_match and xt_target with .usersize to
limit how many bytes of data are copied. The remainder is cleared.

If compatsize is specified, usersize can only safely be used if all
fields up to usersize use platform-independent types. Otherwise, the
compat_to_user callback must be defined.

This patch does not yet enable the support logic.

Signed-off-by: Willem de Bruijn 
---
 include/linux/netfilter/x_tables.h |  9 +++
 net/netfilter/x_tables.c   | 54 ++
 2 files changed, 63 insertions(+)

diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index 5117e4d..be378cf 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -167,6 +167,7 @@ struct xt_match {
 
const char *table;
unsigned int matchsize;
+   unsigned int usersize;
 #ifdef CONFIG_COMPAT
unsigned int compatsize;
 #endif
@@ -207,6 +208,7 @@ struct xt_target {
 
const char *table;
unsigned int targetsize;
+   unsigned int usersize;
 #ifdef CONFIG_COMPAT
unsigned int compatsize;
 #endif
@@ -287,6 +289,13 @@ int xt_check_match(struct xt_mtchk_param *, unsigned int size, u_int8_t proto,
 int xt_check_target(struct xt_tgchk_param *, unsigned int size, u_int8_t proto,
bool inv_proto);
 
+int xt_match_to_user(const struct xt_entry_match *m,
+struct xt_entry_match __user *u);
+int xt_target_to_user(const struct xt_entry_target *t,
+ struct xt_entry_target __user *u);
+int xt_data_to_user(void __user *dst, const void *src,
+   int usersize, int size);
+
 void *xt_copy_counters_from_user(const void __user *user, unsigned int len,
 struct xt_counters_info *info, bool compat);
 
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 2ff4996..feccf52 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -262,6 +262,60 @@ struct xt_target *xt_request_find_target(u8 af, const char *name, u8 revision)
 }
 EXPORT_SYMBOL_GPL(xt_request_find_target);
 
+
+static int xt_obj_to_user(u16 __user *psize, u16 size,
+ void __user *pname, const char *name,
+ u8 __user *prev, u8 rev)
+{
+   if (put_user(size, psize))
+   return -EFAULT;
+   if (copy_to_user(pname, name, strlen(name) + 1))
+   return -EFAULT;
+   if (put_user(rev, prev))
+   return -EFAULT;
+
+   return 0;
+}
+
+#define XT_OBJ_TO_USER(U, K, TYPE, C_SIZE) \
+   xt_obj_to_user(&U->u.TYPE##_size, C_SIZE ? : K->u.TYPE##_size,  \
+  U->u.user.name, K->u.kernel.TYPE->name,  \
+  &U->u.user.revision, K->u.kernel.TYPE->revision)
+
+int xt_data_to_user(void __user *dst, const void *src,
+   int usersize, int size)
+{
+   usersize = usersize ? : size;
+   if (copy_to_user(dst, src, usersize))
+   return -EFAULT;
+   if (usersize != size && clear_user(dst + usersize, size - usersize))
+   return -EFAULT;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(xt_data_to_user);
+
+#define XT_DATA_TO_USER(U, K, TYPE, C_SIZE)\
+   xt_data_to_user(U->data, K->data,   \
+   K->u.kernel.TYPE->usersize, \
+   C_SIZE ? : K->u.kernel.TYPE->TYPE##size)
+
+int xt_match_to_user(const struct xt_entry_match *m,
+struct xt_entry_match __user *u)
+{
+   return XT_OBJ_TO_USER(u, m, match, 0) ||
+  XT_DATA_TO_USER(u, m, match, 0);
+}
+EXPORT_SYMBOL_GPL(xt_match_to_user);
+
+int xt_target_to_user(const struct xt_entry_target *t,
+ struct xt_entry_target __user *u)
+{
+   return XT_OBJ_TO_USER(u, t, target, 0) ||
+  XT_DATA_TO_USER(u, t, target, 0);
+}
+EXPORT_SYMBOL_GPL(xt_target_to_user);
+
 static int match_revfn(u8 af, const char *name, u8 revision, int *bestp)
 {
const struct xt_match *m;
-- 
2.8.0.rc3.226.g39d4020



Re: [PATCH net-next] net/sched: cls_flower: Add user specified data

2017-01-02 Thread Jamal Hadi Salim

On 17-01-02 01:23 PM, John Fastabend wrote:



Additionally I would like to point out this is an arbitrary length binary
blob (for undefined use, without even a specified encoding) that gets pushed
between user space and hardware ;) This seemed to get folks fairly excited in
the past.



The binary blob size is a little strange - but I think there is value
in storing some "cookie" field. The challenge is whether the kernel
gets to interpret it, in which case an encoding must be specified, or
whether we should leave it up to user space, in which case something
like tc could standardize its own encodings.


Some questions, exactly what do you mean by "port mappings" above? In
general the 'tc' API uses the netdev the netlink msg is processed on as
the port mapping. If you mean OVS port to netdev port I think this is
a OVS problem and nothing to do with 'tc'. For what its worth there is an
existing problem with 'tc' where rules only apply to a single ingress or
egress port which is limiting on hardware.



In our case the desire is to be able to correlate entries for a
system-wide, mostly identity/key mapping.


The UFID in my ovs code base is defined as best I can tell here,

[OVS_FLOW_ATTR_UFID] = { .type = NL_A_UNSPEC, .optional = true,
 .min_len = sizeof(ovs_u128) },

So you need 128 bits if you want a 1:1 mapping onto 'tc'. So rather
than an arbitrary blob, why not make the case that 'tc' ids need to be
128 bits long? Even if it's just initially done in flower, call it
flower_flow_id and define it so it's not opaque, and at least at the
code level it isn't an arbitrary blob of data.



I don't know what this UFID is, but do note:
the idea is not new - the FIB for example has some such cookie
(albeit a tiny one) which will typically be populated to tell
you who/what installed the entry.
I could see, for example, use for this cookie to simplify and
pretty-print in a human language for the u32 classifier (i.e. user
space tc sets some fields in the cookie when updating the kernel, and
when user space invokes get/dump it uses the cookie to interpret how
to pretty-print).

I have attached a compile tested version of the cookies on actions
(flat 64 bit; now that we have experienced the use when we have a
large number of counters - I would not mind a 128 bit field).


cheers,
jamal


And what are the "next" uses of this besides OVS? It would be really
valuable to see how this generalizes to other usage models, to avoid
embedding OVS syntax into 'tc'.

Finally, if you want to see an example of binary data encodings, look at
how drivers/hardware/users are currently using the user-defined bits in
ethtool's ntuple API. Also track down out-of-tree drivers to see other
interesting uses. And that was capped at 64 bits :/

Thanks,
John







diff --git a/include/net/act_api.h b/include/net/act_api.h
index 1d71644..f299ed3 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -41,6 +41,7 @@ struct tc_action {
struct rcu_head tcfa_rcu;
struct gnet_stats_basic_cpu __percpu *cpu_bstats;
struct gnet_stats_queue __percpu *cpu_qstats;
+   u64 cookie;
 };
 #define tcf_head   common.tcfa_head
 #define tcf_index  common.tcfa_index
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index cb4bcdc..2e968ee 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -67,6 +67,7 @@ enum {
TCA_ACT_INDEX,
TCA_ACT_STATS,
TCA_ACT_PAD,
+   TCA_ACT_COOKIE,
__TCA_ACT_MAX
 };
 
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 2095c83..97eae6b 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static void free_tcf(struct rcu_head *head)
 {
@@ -467,17 +468,21 @@ int tcf_action_destroy(struct list_head *actions, int bind)
return a->ops->dump(skb, a, bind, ref);
 }
 
-int
-tcf_action_dump_1(struct sk_buff *skb, struct tc_action *a, int bind, int ref)
+int tcf_action_dump_1(struct sk_buff *skb, struct tc_action *a, int bind,
+ int ref)
 {
int err = -EINVAL;
unsigned char *b = skb_tail_pointer(skb);
struct nlattr *nest;
+   u64 cookie = a->cookie;
 
if (nla_put_string(skb, TCA_KIND, a->ops->kind))
goto nla_put_failure;
if (tcf_action_copy_stats(skb, a, 0))
goto nla_put_failure;
+   if (nla_put_u64_64bit(skb, TCA_ACT_COOKIE, cookie, TCA_ACT_PAD))
+   goto nla_put_failure;
+
nest = nla_nest_start(skb, TCA_OPTIONS);
if (nest == NULL)
goto nla_put_failure;
@@ -578,6 +583,11 @@ struct tc_action *tcf_action_init_1(struct net *net, struct nlattr *nla,
if (err < 0)
goto err_mod;
 
+   if (tb[TCA_ACT_COOKIE])
+   a->cookie = nla_get_u64(tb[TCA_ACT_COOKIE]);
+   else
+   a->cookie = 0; /* kernel uses 0 */
+
/* module 

[PATCH nf-next 0/7] xtables: use dedicated copy_to_user helpers

2017-01-02 Thread Willem de Bruijn
From: Willem de Bruijn 

xtables list and save interfaces share xt_match and xt_target state
with userspace. The kernel and userspace definitions of these structs
differ. Currently, the structs are copied wholesale, then patched up.
The match and target structs contain a kernel pointer. Type-specific
data may contain additional kernel-only state.

Introduce xt_match_to_user and xt_target_to_user helper functions to
copy only fields intended to be shared with userspace.

Introduce xt_data_to_user to do the same for type-specific state. Add
a field .usersize to xt_match and xt_target to define the range of
bytes in .matchsize that should be shared with userspace. All matches
and targets that define kernel-only data store this at the tail of
their struct.

Tested:

  Ran iptables-test.py from iptables.git, with both a 64-bit and
  32-bit compat binary. 603/603 tests passed both before and after
  the patches (out of 705, but some CONFIGs were not enabled).

  Also ran the following example queries manually, again using 64-bit
  and 32-bit compat paths:

  iptables -A INPUT  -m string --algo bm --string 'xxx' -j LOG
  iptables -L
  iptables-save

  ip6tables -A INPUT  -m string --algo bm --string 'xxx' -j LOG
  ip6tables -L
  ip6tables-save

  ebtables -A INPUT --limit 3 -j ACCEPT
  ebtables -L

  arptables -A INPUT --source-mac 00:11:22:33:44:55 -j ACCEPT
  arptables -L

  An instrumented binary that initializes its buffer with 0x66 bytes
  shows the result of the patchset.

  iptables LOG target in hex before and after. The xt_target struct
  only has its size, name and revision specified. Trailing bytes in
  the name field are not zeroed:

40 00 4c 4f 47 00 00 00
40 e1 0a a0 ff ff ff ff
00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
04 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00

40 00 4c 4f 47 00 66 66
66 66 66 66 66 66 66 66
66 66 66 66 66 66 66 66
66 66 66 66 66 66 66 00
04 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00

  ebtables limit match in hex before and after. Only the avg and burst
  fields of ebt_limit_info are shared.

6c 69 6d 69 74 00 00 00
00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
20 00 00 00 00 00 00 00
05 0d 00 00 05 00 00 00
66 de fc ff 00 00 00 00
50 d0 00 00 50 d0 00 00
a9 29 00 00 00 00 00 00

6c 69 6d 69 74 00 00 00
00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
20 00 00 00 00 00 00 00
05 0d 00 00 05 00 00 00
00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00


Willem de Bruijn (7):
  xtables: add xt_match, xt_target and data copy_to_user functions
  iptables: use match, target and data copy_to_user helpers
  ip6tables: use match, target and data copy_to_user helpers
  arptables: use match, target and data copy_to_user helpers
  ebtables: use match, target and data copy_to_user helpers
  xtables: use match, target and data copy_to_user helpers in compat
  xtables: extend matches and targets with .usersize

 include/linux/netfilter/x_tables.h |  9 +
 net/bridge/netfilter/ebt_limit.c   |  1 +
 net/bridge/netfilter/ebtables.c| 78 +++---
 net/ipv4/netfilter/arp_tables.c| 15 +++-
 net/ipv4/netfilter/ip_tables.c | 21 +++---
 net/ipv4/netfilter/ipt_CLUSTERIP.c |  1 +
 net/ipv6/netfilter/ip6_tables.c| 21 +++---
 net/ipv6/netfilter/ip6t_NPT.c  |  2 +
 net/netfilter/x_tables.c   | 68 -
 net/netfilter/xt_CT.c  |  3 ++
 net/netfilter/xt_RATEEST.c |  1 +
 net/netfilter/xt_TEE.c |  2 +
 net/netfilter/xt_bpf.c |  2 +
 net/netfilter/xt_cgroup.c  |  1 +
 net/netfilter/xt_connlimit.c   |  1 +
 net/netfilter/xt_hashlimit.c   |  4 ++
 net/netfilter/xt_limit.c   |  2 +
 net/netfilter/xt_quota.c   |  1 +
 net/netfilter/xt_rateest.c |  1 +
 net/netfilter/xt_string.c  |  1 +
 20 files changed, 154 insertions(+), 81 deletions(-)

-- 
2.8.0.rc3.226.g39d4020



[net PATCH] ipv4: Do not allow MAIN to be alias for new LOCAL w/ custom rules

2017-01-02 Thread Alexander Duyck
From: Alexander Duyck 

In the case of custom rules being present we need to handle the case of the
LOCAL table being initialized after the new rule has been added.  To address
that I am adding a new check so that we can make certain we don't use an
alias of MAIN for LOCAL when allocating a new table.

Fixes: 0ddcf43d5d4a ("ipv4: FIB Local/MAIN table collapse")
Reported-by: Oliver Brunel 
Signed-off-by: Alexander Duyck 
---
 net/ipv4/fib_frontend.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 3ff8938893ec..eae0332b0e8c 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -85,7 +85,7 @@ struct fib_table *fib_new_table(struct net *net, u32 id)
if (tb)
return tb;
 
-   if (id == RT_TABLE_LOCAL)
+   if (id == RT_TABLE_LOCAL && !net->ipv4.fib_has_custom_rules)
alias = fib_new_table(net, RT_TABLE_MAIN);
 
tb = fib_trie_table(id, alias);



Re: [PATCH for-next V2 00/11] Mellanox mlx5 core and ODP updates 2017-01-01

2017-01-02 Thread David Miller
From: Saeed Mahameed 
Date: Mon,  2 Jan 2017 11:37:37 +0200

> The following eleven patches mainly come from Artemy Kovalyov
> who expanded mlx5 on-demand-paging (ODP) support. In addition
> there are three cleanup patches which don't change any functionality,
> but are needed to align the codebase prior to accepting other patches.

Series applied to net-next, thanks.


Re: [PATCH net] Documentation/networking: fix typo in mpls-sysctl

2017-01-02 Thread David Miller
From: Alexander Alemayhu 
Date: Mon,  2 Jan 2017 18:52:24 +0100

> s/utliziation/utilization
> 
> Signed-off-by: Alexander Alemayhu 

Applied, thanks.


Re: pull-request: wireless-drivers-next 2017-01-02

2017-01-02 Thread David Miller
From: Kalle Valo 
Date: Mon, 02 Jan 2017 15:20:59 +0200

> first pull request for 4.11. The tree is based on 4.9 but that shouldn't
> be a problem, at least my test pull to net-next worked ok. I'll fast
> forward my trees after you have pulled this.
> 
> Please let me know if you have any problems.

Happy new year, pulled, thanks!


Re: [PATCH] tcp: provide tx timestamps for partial writes

2017-01-02 Thread Soheil Hassas Yeganeh
On Mon, Jan 2, 2017 at 3:20 PM, Soheil Hassas Yeganeh
 wrote:
> From: Soheil Hassas Yeganeh 
>
> For TCP sockets, tx timestamps are only captured when the user data
> is successfully and fully written to the socket. In many cases,
> however, TCP writes can be partial for which no timestamp is
> collected.
>
> Collect timestamps when the user data is partially copied into
> the socket.
>
> Signed-off-by: Soheil Hassas Yeganeh 
> Cc: Willem de Bruijn 
> Cc: Yuchung Cheng 
> Cc: Eric Dumazet 
> Cc: Neal Cardwell 
> Cc: Martin KaFai Lau 
> ---
>  net/ipv4/tcp.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 2e3807d..c207b16 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -992,8 +992,10 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
> return copied;
>
>  do_error:
> -   if (copied)
> +   if (copied) {
> +   tcp_tx_timestamp(sk, sk->sk_tsflags, tcp_write_queue_tail(sk));
> goto out;
> +   }
>  out_err:
> /* make sure we wake any epoll edge trigger waiter */
> if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 &&
> @@ -1329,8 +1331,10 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
> }
>
>  do_error:
> -   if (copied + copied_syn)
> +   if (copied + copied_syn) {
> +   tcp_tx_timestamp(sk, sk->sk_tsflags, tcp_write_queue_tail(sk));
> goto out;
> +   }
>  out_err:
> err = sk_stream_error(sk, flags, err);
> /* make sure we wake any epoll edge trigger waiter */
> --
> 2.8.0.rc3.226.g39d4020
>

I'm sorry for the incomplete annotation. This is for [net-next].

Thanks,
Soheil


[PATCH] tcp: provide tx timestamps for partial writes

2017-01-02 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh 

For TCP sockets, tx timestamps are only captured when the user data
is successfully and fully written to the socket. In many cases,
however, TCP writes can be partial for which no timestamp is
collected.

Collect timestamps when the user data is partially copied into
the socket.

Signed-off-by: Soheil Hassas Yeganeh 
Cc: Willem de Bruijn 
Cc: Yuchung Cheng 
Cc: Eric Dumazet 
Cc: Neal Cardwell 
Cc: Martin KaFai Lau 
---
 net/ipv4/tcp.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 2e3807d..c207b16 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -992,8 +992,10 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
return copied;
 
 do_error:
-   if (copied)
+   if (copied) {
+   tcp_tx_timestamp(sk, sk->sk_tsflags, tcp_write_queue_tail(sk));
goto out;
+   }
 out_err:
/* make sure we wake any epoll edge trigger waiter */
	if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 &&
@@ -1329,8 +1331,10 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
}
 
 do_error:
-   if (copied + copied_syn)
+   if (copied + copied_syn) {
+   tcp_tx_timestamp(sk, sk->sk_tsflags, tcp_write_queue_tail(sk));
goto out;
+   }
 out_err:
err = sk_stream_error(sk, flags, err);
/* make sure we wake any epoll edge trigger waiter */
-- 
2.8.0.rc3.226.g39d4020



Re: [PATCH 02/12] Common functions and definitions

2017-01-02 Thread Stephen Hemminger

> +#define AQ_OBJ_SET(_OBJ_, _F_) \
> +{ unsigned long flags_old, flags_new; atomic_t *flags = &(_OBJ_)->flags; \
> +do { \
> + flags_old = atomic_read(flags); \
> + flags_new = flags_old | (_F_); \
> +} while (atomic_cmpxchg(flags, \
> + flags_old, flags_new) != flags_old); }
> +
> +#define AQ_OBJ_CLR(_OBJ_, _F_) \
> +{ unsigned long flags_old, flags_new; atomic_t *flags = &(_OBJ_)->flags; \
> +do { \
> + flags_old = atomic_read(flags); \
> + flags_new = flags_old & ~(_F_); \
> +} while (atomic_cmpxchg(flags, \
> + flags_old, flags_new) != flags_old); }
> +

These are way too complex to be macros. Can't the same logic be done
as inline functions?
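
For illustration, here is a sketch of the inline-function form suggested above. C11 atomics stand in for the kernel's atomic_t and atomic_cmpxchg(), and the function names are made up for this example:

```c
#include <stdatomic.h>

/* Set the given flag bits with a lock-free compare-and-swap retry loop. */
static inline void aq_flags_set(atomic_ulong *flags, unsigned long mask)
{
	unsigned long old = atomic_load(flags);

	/* retry until no other CPU changed the flags word under us;
	 * on failure, old is reloaded with the current value */
	while (!atomic_compare_exchange_weak(flags, &old, old | mask))
		;
}

/* Clear the given flag bits with the same retry scheme. */
static inline void aq_flags_clear(atomic_ulong *flags, unsigned long mask)
{
	unsigned long old = atomic_load(flags);

	while (!atomic_compare_exchange_weak(flags, &old, old & ~mask))
		;
}
```

Inline functions keep the same lock-free semantics while gaining type checking and debuggability over the macro form.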


Re: [RFC PATCH] virtio_net: XDP support for adjust_head

2017-01-02 Thread John Fastabend
On 17-01-02 11:44 AM, John Fastabend wrote:
> Add support for XDP adjust head by allocating a 256B header region
> that XDP programs can grow into. This is only enabled when a XDP
> program is loaded.
> 
> In order to ensure that we do not have to unwind queue headroom, push
> the queue setup below bpf_prog_add. It reads better to do a prog ref
> unwind vs another queue setup call.
> 
> : There is a problem with this patch as is. When xdp prog is loaded
>   the old buffers without the 256B headers need to be flushed so that
>   the bpf prog has the necessary headroom. This patch does this by
>   calling the virtqueue_detach_unused_buf() and followed by the
>   virtnet_set_queues() call to reinitialize the buffers. However I
>   don't believe this is safe per comment in virtio_ring this API
>   is not valid on an active queue and the only thing we have done
>   here is napi_disable/napi_enable wrappers which doesn't do anything
>   to the emulation layer.
> 
>   So the RFC is really to find the best solution to this problem.
>   A couple things come to mind, (a) always allocate the necessary
>   headroom but this is a bit of a waste (b) add some bit somewhere
>   to check if the buffer has headroom but this would mean XDP programs
>   would be broke for a cycle through the ring, (c) figure out how
>   to deactivate a queue, free the buffers and finally reallocate.
>   I think (c) is the best choice for now but I'm not seeing the
>   API to do this so virtio/qemu experts anyone know off-hand
>   how to make this work? I started looking into the PCI callbacks
>   reset() and virtio_device_ready() or possibly hitting the right
>   set of bits with vp_set_status() but my first attempt just hung
>   the device.
> 
> Signed-off-by: John Fastabend 
> ---


[...]

> +
> + /* Changing the headroom in buffers is a disruptive operation because
> +  * existing buffers must be flushed and reallocated. This will happen
> +  * when a xdp program is initially added or xdp is disabled by removing
> +  * the xdp program.
> +  */
> + if (old_hr != vi->headroom) {
> + cancel_delayed_work_sync(&vi->refill);
> + if (netif_running(vi->dev)) {
> + for (i = 0; i < vi->max_queue_pairs; i++)
> + napi_disable(&vi->rq[i].napi);
> + }
> + for (i = 0; i < vi->max_queue_pairs; i++) {
> + struct virtqueue *vq = vi->rq[i].vq;
> + void *buf;
> +
> + while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
  ^

Per API in virtio_ring.c queue must be deactivated to call this.


> + if (vi->mergeable_rx_bufs) {
> + unsigned long ctx = (unsigned long)buf;
> + void *base = mergeable_ctx_to_buf_address(ctx);
> + put_page(virt_to_head_page(base));
> + } else if (vi->big_packets) {
> + give_pages(&vi->rq[i], buf);
> + } else {
> + dev_kfree_skb(buf);
> + }
> + }
> + }
> + if (netif_running(vi->dev)) {
> + for (i = 0; i < vi->max_queue_pairs; i++)
> + virtnet_napi_enable(&vi->rq[i]);
> + }
> + }
> +

Hi Jason, Michael,

Any hints on how to solve the above kludge to flush the existing ring and
reallocate with correct headroom? The above doesn't look safe to me per commit
message.

Thanks!
John


[PATCH] net: fealnx: use new api ethtool_{get|set}_link_ksettings

2017-01-02 Thread Philippe Reynes
The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/fealnx.c |   14 --
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/fealnx.c b/drivers/net/ethernet/fealnx.c
index 9cb436c..766636a 100644
--- a/drivers/net/ethernet/fealnx.c
+++ b/drivers/net/ethernet/fealnx.c
@@ -1817,25 +1817,27 @@ static void netdev_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *i
strlcpy(info->bus_info, pci_name(np->pci_dev), sizeof(info->bus_info));
 }
 
-static int netdev_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int netdev_get_link_ksettings(struct net_device *dev,
+struct ethtool_link_ksettings *cmd)
 {
struct netdev_private *np = netdev_priv(dev);
int rc;
 
	spin_lock_irq(&np->lock);
-   rc = mii_ethtool_gset(&np->mii, cmd);
+   rc = mii_ethtool_get_link_ksettings(&np->mii, cmd);
	spin_unlock_irq(&np->lock);
 
return rc;
 }
 
-static int netdev_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int netdev_set_link_ksettings(struct net_device *dev,
+const struct ethtool_link_ksettings *cmd)
 {
struct netdev_private *np = netdev_priv(dev);
int rc;
 
	spin_lock_irq(&np->lock);
-   rc = mii_ethtool_sset(&np->mii, cmd);
+   rc = mii_ethtool_set_link_ksettings(&np->mii, cmd);
	spin_unlock_irq(&np->lock);
 
return rc;
@@ -1865,12 +1867,12 @@ static void netdev_set_msglevel(struct net_device *dev, u32 value)
 
 static const struct ethtool_ops netdev_ethtool_ops = {
.get_drvinfo= netdev_get_drvinfo,
-   .get_settings   = netdev_get_settings,
-   .set_settings   = netdev_set_settings,
.nway_reset = netdev_nway_reset,
.get_link   = netdev_get_link,
.get_msglevel   = netdev_get_msglevel,
.set_msglevel   = netdev_set_msglevel,
+   .get_link_ksettings = netdev_get_link_ksettings,
+   .set_link_ksettings = netdev_set_link_ksettings,
 };
 
 static int mii_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
-- 
1.7.4.4



[RFC PATCH] virtio_net: XDP support for adjust_head

2017-01-02 Thread John Fastabend
Add support for XDP adjust head by allocating a 256B header region
that XDP programs can grow into. This is only enabled when a XDP
program is loaded.

In order to ensure that we do not have to unwind queue headroom, push
the queue setup below bpf_prog_add. It reads better to do a prog ref
unwind vs another queue setup call.

: There is a problem with this patch as is. When xdp prog is loaded
  the old buffers without the 256B headers need to be flushed so that
  the bpf prog has the necessary headroom. This patch does this by
  calling the virtqueue_detach_unused_buf() and followed by the
  virtnet_set_queues() call to reinitialize the buffers. However I
  don't believe this is safe per comment in virtio_ring this API
  is not valid on an active queue and the only thing we have done
  here is napi_disable/napi_enable wrappers which doesn't do anything
  to the emulation layer.

  So the RFC is really to find the best solution to this problem.
  A couple things come to mind, (a) always allocate the necessary
  headroom but this is a bit of a waste (b) add some bit somewhere
  to check if the buffer has headroom but this would mean XDP programs
  would be broke for a cycle through the ring, (c) figure out how
  to deactivate a queue, free the buffers and finally reallocate.
  I think (c) is the best choice for now but I'm not seeing the
  API to do this so virtio/qemu experts anyone know off-hand
  how to make this work? I started looking into the PCI callbacks
  reset() and virtio_device_ready() or possibly hitting the right
  set of bits with vp_set_status() but my first attempt just hung
  the device.

Signed-off-by: John Fastabend 
---
 drivers/net/virtio_net.c |  106 +++---
 1 file changed, 80 insertions(+), 26 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 5deeda6..fcc5bd7 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -159,6 +159,9 @@ struct virtnet_info {
/* Ethtool settings */
u8 duplex;
u32 speed;
+
+   /* Headroom allocated in RX Queue */
+   unsigned int headroom;
 };
 
 struct padded_vnet_hdr {
@@ -355,6 +358,7 @@ static void virtnet_xdp_xmit(struct virtnet_info *vi,
}
 
if (vi->mergeable_rx_bufs) {
+   xdp->data -= sizeof(struct virtio_net_hdr_mrg_rxbuf);
/* Zero header and leave csum up to XDP layers */
hdr = xdp->data;
memset(hdr, 0, vi->hdr_len);
@@ -371,7 +375,7 @@ static void virtnet_xdp_xmit(struct virtnet_info *vi,
num_sg = 2;
sg_init_table(sq->sg, 2);
sg_set_buf(sq->sg, hdr, vi->hdr_len);
-   skb_to_sgvec(skb, sq->sg + 1, 0, skb->len);
+   skb_to_sgvec(skb, sq->sg + 1, vi->headroom, xdp->data_end - xdp->data);
}
err = virtqueue_add_outbuf(sq->vq, sq->sg, num_sg,
   data, GFP_ATOMIC);
@@ -393,34 +397,39 @@ static u32 do_xdp_prog(struct virtnet_info *vi,
   struct bpf_prog *xdp_prog,
   void *data, int len)
 {
-   int hdr_padded_len;
struct xdp_buff xdp;
-   void *buf;
unsigned int qp;
u32 act;
 
+
if (vi->mergeable_rx_bufs) {
-   hdr_padded_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
-   xdp.data = data + hdr_padded_len;
+   int desc_room = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+
+   /* Allow consuming headroom but reserve enough space to push
+* the descriptor on if we get an XDP_TX return code.
+*/
+   xdp.data_hard_start = data - vi->headroom + desc_room;
+   xdp.data = data + desc_room;
xdp.data_end = xdp.data + (len - vi->hdr_len);
-   buf = data;
} else { /* small buffers */
struct sk_buff *skb = data;
 
-   xdp.data = skb->data;
+   xdp.data_hard_start = skb->data;
+   xdp.data = skb->data + vi->headroom;
xdp.data_end = xdp.data + len;
-   buf = skb->data;
}
 
	act = bpf_prog_run_xdp(xdp_prog, &xdp);
switch (act) {
case XDP_PASS:
+   if (!vi->mergeable_rx_bufs)
+   __skb_pull((struct sk_buff *) data,
+  xdp.data - xdp.data_hard_start);
return XDP_PASS;
case XDP_TX:
qp = vi->curr_queue_pairs -
vi->xdp_queue_pairs +
smp_processor_id();
-   xdp.data = buf;
	virtnet_xdp_xmit(vi, rq, &vi->sq[qp], &xdp, data);
return XDP_TX;
default:
@@ -440,7 +449,6 @@ static struct sk_buff *receive_small(struct net_device *dev,
struct bpf_prog *xdp_prog;
 
len -= vi->hdr_len;
-   skb_trim(skb, len);
 

[PATCH net-next] bridge: multicast to unicast

2017-01-02 Thread Linus Lüssing
Implements an optional, per-bridge-port flag and feature to deliver
multicast packets to any host on the respective port via unicast
individually. This is done by copying the packet per host and
changing the multicast destination MAC to a unicast one accordingly.

multicast-to-unicast works on top of the multicast snooping feature of
the bridge, which means unicast copies are only delivered to hosts that
have previously signaled interest via IGMP/MLD reports.

This feature is intended for interface types which have a more reliable
and/or efficient way to deliver unicast packets than broadcast ones
(e.g. wifi).

However, it should only be enabled on interfaces where no IGMPv2/MLDv1
report suppression takes place. This feature is disabled by default.

The initial patch and idea is from Felix Fietkau.

Cc: Felix Fietkau 
Signed-off-by: Linus Lüssing 

---

This feature is used and enabled by default in OpenWRT and LEDE for AP
interfaces for more than a year now to allow both a more robust multicast
delivery and multicast at higher rates (e.g. multicast streaming).

In OpenWRT/LEDE the IGMP/MLD report suppression issue is overcome by
the network daemon enabling AP isolation, thereby separating all STAs.
Delivery of STA-to-STA IP multicast is made possible again by
enabling and utilizing the bridge hairpin mode, which considers the
incoming port as a potential outgoing port, too.

Hairpin mode is performed after multicast snooping, therefore only
delivering reports to STAs running a multicast router.
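
The core per-host rewrite described above can be sketched in plain C. This is a simplified userspace model under stated assumptions: the kernel path uses skb_copy() and the maybe_deliver_addr() helper in the patch below, and the function name here is illustrative only:

```c
#include <string.h>

#define ETH_ALEN 6

/* Duplicate the frame once per interested host and overwrite the
 * multicast destination MAC (the first six bytes of an Ethernet frame)
 * with that host's unicast address. */
static void frame_to_unicast_copy(unsigned char *copy,
				  const unsigned char *frame, size_t len,
				  const unsigned char *host_mac)
{
	memcpy(copy, frame, len);         /* one copy per interested host */
	memcpy(copy, host_mac, ETH_ALEN); /* h_dest sits at offset 0 */
}
```

Everything past the destination MAC, including the payload, is left untouched, so each host receives an otherwise identical frame addressed to it alone.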
---
 include/linux/if_bridge.h |  1 +
 net/bridge/br_forward.c   | 44 +--
 net/bridge/br_mdb.c   |  2 +-
 net/bridge/br_multicast.c | 92 ++-
 net/bridge/br_private.h   |  4 ++-
 net/bridge/br_sysfs_if.c  |  2 ++
 6 files changed, 115 insertions(+), 30 deletions(-)

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index c6587c0..f1b0d78 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -46,6 +46,7 @@ struct br_ip_list {
 #define BR_LEARNING_SYNC   BIT(9)
 #define BR_PROXYARP_WIFI   BIT(10)
 #define BR_MCAST_FLOOD BIT(11)
+#define BR_MULTICAST_TO_UCAST  BIT(12)
 
 #define BR_DEFAULT_AGEING_TIME (300 * HZ)
 
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 7cb41ae..49d742d 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -174,6 +174,33 @@ static struct net_bridge_port *maybe_deliver(
return p;
 }
 
+static struct net_bridge_port *maybe_deliver_addr(
+   struct net_bridge_port *prev, struct net_bridge_port *p,
+   struct sk_buff *skb, const unsigned char *addr,
+   bool local_orig)
+{
+   struct net_device *dev = BR_INPUT_SKB_CB(skb)->brdev;
+   const unsigned char *src = eth_hdr(skb)->h_source;
+
+   if (!should_deliver(p, skb))
+   return prev;
+
+   /* Even with hairpin, no soliloquies - prevent breaking IPv6 DAD */
+   if (skb->dev == p->dev && ether_addr_equal(src, addr))
+   return prev;
+
+   skb = skb_copy(skb, GFP_ATOMIC);
+   if (!skb) {
+   dev->stats.tx_dropped++;
+   return prev;
+   }
+
+   memcpy(eth_hdr(skb)->h_dest, addr, ETH_ALEN);
+   __br_forward(p, skb, local_orig);
+
+   return prev;
+}
+
 /* called under rcu_read_lock */
 void br_flood(struct net_bridge *br, struct sk_buff *skb,
  enum br_pkt_type pkt_type, bool local_rcv, bool local_orig)
@@ -231,6 +258,7 @@ void br_multicast_flood(struct net_bridge_mdb_entry *mdst,
struct net_bridge_port *prev = NULL;
struct net_bridge_port_group *p;
struct hlist_node *rp;
+   const unsigned char *addr;
 
	rp = rcu_dereference(hlist_first_rcu(&br->router_list));
p = mdst ? rcu_dereference(mdst->ports) : NULL;
@@ -241,10 +269,20 @@ void br_multicast_flood(struct net_bridge_mdb_entry *mdst,
rport = rp ? hlist_entry(rp, struct net_bridge_port, rlist) :
 NULL;
 
-   port = (unsigned long)lport > (unsigned long)rport ?
-  lport : rport;
+   if ((unsigned long)lport > (unsigned long)rport) {
+   port = lport;
+   addr = p->unicast ? p->eth_addr : NULL;
+   } else {
+   port = rport;
+   addr = NULL;
+   }
+
+   if (addr)
+   prev = maybe_deliver_addr(prev, port, skb, addr,
+ local_orig);
+   else
+   prev = maybe_deliver(prev, port, skb, local_orig);
 
-   prev = maybe_deliver(prev, port, skb, local_orig);
if (IS_ERR(prev))
goto out;
if (prev == port)
diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c

Re: tcp_bbr: Forcing set of BBR congestion control as default

2017-01-02 Thread Sedat Dilek
On Mon, Jan 2, 2017 at 8:12 PM, Neal Cardwell  wrote:
> On Mon, Jan 2, 2017 at 1:49 PM, Sedat Dilek  wrote:
>> On Mon, Jan 2, 2017 at 7:17 PM, Neal Cardwell  wrote:
>>> On Mon, Jan 2, 2017 at 12:05 AM, Sedat Dilek  wrote:

 Hi,

 I am trying to force the set of BBR congestion control as default.
 My old linux-config uses CUBIC as default.
 I want both BBR and CUBIC to be built but BBR shall be my default.

 I tried the below snippet.

 I refresh my new linux-config like this...

 $ MAKE="make V=1" ; COMPILER="mycompiler" ; MAKE_OPTS="CC=$COMPILER
 HOSTCC=$COMPILER" ; yes "" | $MAKE $MAKE_OPTS oldconfig && $MAKE
 $MAKE_OPTS silentoldconfig < /dev/null

 What am I doing wrong?
>>>
>>> Perhaps your build directory already has an old .config file that sets
>>> the DEFAULT_TCP_CONG to be "cubic", and your "make oldconfig" and
>>> "make silentoldconfig" are maintaining those lines from the old
>>> .config?
>>>
>>> If you want to start with your existing .config but then make BBR the
>>> default, then it might be simplest just to edit your .config directly:
>>>
>>> -CONFIG_DEFAULT_TCP_CONG="cubic"
>>> +CONFIG_DEFAULT_TCP_CONG="bbr"
>>>
>>
>> Just to clarify...
>>
>> I can have both TCP_CONG cubic and bbr built and switch for example via 
>> sysctl?
>
> Yes, you can set the default congestion control using sysctl, eg:
>
>   sysctl net.ipv4.tcp_congestion_control=bbr
>
> And (as mentioned in the quick-start guide) on many Linux systems you
> can make this "sticky" after reboots with something like:
>
>   sudo bash -c 'echo "net.ipv4.tcp_congestion_control=bbr" >> 
> /etc/sysctl.conf'
>
>> Which tc version is required?
>> Here tc is from iproute (20121211-2~precise).
>> Is that enough?
>
> You should not need a particular version of tc to install "fq".
>
> In the BBR quick-start guide we recommend setting fq to be the default qdisc:
>
>   sudo bash -c 'echo "net.core.default_qdisc=fq" >> /etc/sysctl.conf'
>

OK, this looks now good.

# sysctl net.core.default_qdisc
net.core.default_qdisc = fq

# sysctl net.ipv4.tcp_congestion_control
net.ipv4.tcp_congestion_control = bbr

I wondered why my Internet connection has so many stalls.

> But if that doesn't work in your environment, then I believe you
> should be able to install the fq qdisc with any version of tc, with
> something like:
>
>   tc qdisc replace dev eth0 root fq
>
> You just won't be able to set or view configuration options.
>
>>> BTW, I presume you have seen the warning in the BBR commit message or
>>> tcp_bbr.c about ensuring that BBR is used with the "fq" qdisc:
>>>
>>>   NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing
>>>   enabled, since pacing is integral to the BBR design and
>>>   implementation. BBR without pacing would not function properly, and
>>>   may incur unnecessary high packet loss rates.
>>>
>>> The BBR quick-start guide has some details about how to build and
>>> enable BBR and fq:
>>>
>>>   https://github.com/google/bbr/blob/master/Documentation/bbr-quick-start.md
>>>
>>
>> Hmmm...
>>
>> From [1] Section "Further reading"...
>>
>> egrep '(CONFIG_TCP_CONG_BBR|CONFIG_NET_SCH_FQ)=' .config
>>
>> then you see exactly the following lines:
>>
>> CONFIG_TCP_CONG_BBR=y
>> CONFIG_NET_SCH_FQ=y
>>
>> Should CONFIG_TCP_CONG_BBR have a "select CONFIG_NET_SCH_FQ" in its Kconfig?
>> That would be safer.
>>
>> [1] 
>> https://github.com/google/bbr/blob/master/Documentation/bbr-quick-start.md
>
> That would be a little safer, but not sufficient (since the qdisc
> still has to be configured to be in the transmit path somewhere).
>

Does BBR work best only with the fq qdisc?
What about fq_codel?

- Sedat -


Re: tcp_bbr: Forcing set of BBR congestion control as default

2017-01-02 Thread Neal Cardwell
On Mon, Jan 2, 2017 at 1:49 PM, Sedat Dilek  wrote:
> On Mon, Jan 2, 2017 at 7:17 PM, Neal Cardwell  wrote:
>> On Mon, Jan 2, 2017 at 12:05 AM, Sedat Dilek  wrote:
>>>
>>> Hi,
>>>
>>> I am trying to force the set of BBR congestion control as default.
>>> My old linux-config uses CUBIC as default.
>>> I want both BBR and CUBIC to be built but BBR shall be my default.
>>>
>>> I tried the below snippet.
>>>
>>> I refresh my new linux-config like this...
>>>
>>> $ MAKE="make V=1" ; COMPILER="mycompiler" ; MAKE_OPTS="CC=$COMPILER
>>> HOSTCC=$COMPILER" ; yes "" | $MAKE $MAKE_OPTS oldconfig && $MAKE
>>> $MAKE_OPTS silentoldconfig < /dev/null
>>>
>>> What am I doing wrong?
>>
>> Perhaps your build directory already has an old .config file that sets
>> the DEFAULT_TCP_CONG to be "cubic", and your "make oldconfig" and
>> "make silentoldconfig" are maintaining those lines from the old
>> .config?
>>
>> If you want to start with your existing .config but then make BBR the
>> default, then it might be simplest just to edit your .config directly:
>>
>> -CONFIG_DEFAULT_TCP_CONG="cubic"
>> +CONFIG_DEFAULT_TCP_CONG="bbr"
>>
>
> Just to clarify...
>
> I can have both TCP_CONG cubic and bbr built and switch for example via 
> sysctl?

Yes, you can set the default congestion control using sysctl, eg:

  sysctl net.ipv4.tcp_congestion_control=bbr

And (as mentioned in the quick-start guide) on many Linux systems you
can make this "sticky" after reboots with something like:

  sudo bash -c 'echo "net.ipv4.tcp_congestion_control=bbr" >> /etc/sysctl.conf'

> Which tc version is required?
> Here tc is from iproute (20121211-2~precise).
> Is that enough?

You should not need a particular version of tc to install "fq".

In the BBR quick-start guide we recommend setting fq to be the default qdisc:

  sudo bash -c 'echo "net.core.default_qdisc=fq" >> /etc/sysctl.conf'

But if that doesn't work in your environment, then I believe you
should be able to install the fq qdisc with any version of tc, with
something like:

  tc qdisc replace dev eth0 root fq

You just won't be able to set or view configuration options.

>> BTW, I presume you have seen the warning in the BBR commit message or
>> tcp_bbr.c about ensuring that BBR is used with the "fq" qdisc:
>>
>>   NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing
>>   enabled, since pacing is integral to the BBR design and
>>   implementation. BBR without pacing would not function properly, and
>>   may incur unnecessary high packet loss rates.
>>
>> The BBR quick-start guide has some details about how to build and
>> enable BBR and fq:
>>
>>   https://github.com/google/bbr/blob/master/Documentation/bbr-quick-start.md
>>
>
> Hmmm...
>
> From [1] Section "Further reading"...
>
> egrep '(CONFIG_TCP_CONG_BBR|CONFIG_NET_SCH_FQ)=' .config
>
> then you see exactly the following lines:
>
> CONFIG_TCP_CONG_BBR=y
> CONFIG_NET_SCH_FQ=y
>
> Should CONFIG_TCP_CONG_BBR have a "select CONFIG_NET_SCH_FQ" in its Kconfig?
> That would be safer.
>
> [1] https://github.com/google/bbr/blob/master/Documentation/bbr-quick-start.md

That would be a little safer, but not sufficient (since the qdisc
still has to be configured to be in the transmit path somewhere).

thanks,
neal


Re: [PATCH iproute2 net-next] tc: flower: support matching flags

2017-01-02 Thread Jiri Benc
On Wed, 28 Dec 2016 15:06:49 +0200, Paul Blakey wrote:
> Enhance flower to support matching on flags.
> 
> The 1st flag allows to match on whether the packet is
> an IP fragment.
> 
> Example:
> 
>   # add a flower filter that will drop fragmented packets
>   # (bit 0 of control flags)
>   tc filter add dev ens4f0 protocol ip parent : \
>   flower \
>   src_mac e4:1d:2d:fd:8b:01 \
>   dst_mac e4:1d:2d:fd:8b:02 \
>   indev ens4f0 \
>   matching_flags 0x1/0x1 \
>   action drop

This is a very poor API. First, how is the user supposed to know what
those magic values in "matching_flags" mean? At the very least, it
should be documented in the man page.

Second, why "matching_flags"? That name suggests that those modify the
way the matching is done (to illustrate my point, I'd expect things
like "if the packet is too short, match this rule anyway" to be a
"matching flag"). But this is not the case. What's wrong with plain
"flags"? Or, if you want to be more specific, perhaps packet_flags?

Third, all of this looks very wrong anyway. There should be separate
keywords for individual flags. In this case, there should be an
"ip_fragment" flag. The tc tool should be responsible for putting the
flags together and creating the appropriate mask. The example would
then be:

tc filter add dev ens4f0 protocol ip parent : \
flower \
src_mac e4:1d:2d:fd:8b:01 \
dst_mac e4:1d:2d:fd:8b:02 \
indev ens4f0 \
ip_fragment yes\
action drop

I don't care whether it's "ip_fragment yes/no", "ip_fragment 1/0",
"ip_fragment/noip_fragment" or similar. The important thing is it's a
boolean flag; if specified, it's set to 0/1 and unmasked, if not
specified, it's wildcarded.
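
The keyword-to-(flags, mask) folding proposed here is straightforward. A hedged sketch follows; the constant and helper names are hypothetical, not the actual tc or kernel symbols:

```c
/* tc translates a boolean keyword such as "ip_fragment yes/no" into a
 * (flags, mask) pair.  A flag the user never mentions keeps its mask
 * bit clear and stays wildcarded. */
#define FLOWER_FLAG_IS_FRAGMENT (1U << 0)

static void flower_set_bool_flag(unsigned int *flags, unsigned int *mask,
				 unsigned int bit, int value)
{
	*mask |= bit;          /* this bit is now matched, not wildcarded */
	if (value)
		*flags |= bit; /* match packets where the flag is set */
	else
		*flags &= ~bit; /* match packets where the flag is clear */
}
```

The user-facing keyword stays readable while the netlink attributes still carry the compact flags/mask pair the kernel expects.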

Stephen, I understand that you already applied this patch but given how
horrible the proposed API is and that's even undocumented in this
patch, please reconsider this. If this is released, the API is set in
stone and, frankly, it's very user unfriendly this way.

Paul, could you please prepare a patch that would introduce a more sane
API? I'd strongly prefer what I described under "third" but should you
strongly disagree, at least implement "second" and document the
currently known flag values.

Thanks,

 Jiri


[PATCH] net: faraday: ftmac100: use new api ethtool_{get|set}_link_ksettings

2017-01-02 Thread Philippe Reynes
The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/faraday/ftmac100.c |   14 --
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/faraday/ftmac100.c b/drivers/net/ethernet/faraday/ftmac100.c
index dce5f7b..c0ddbbe 100644
--- a/drivers/net/ethernet/faraday/ftmac100.c
+++ b/drivers/net/ethernet/faraday/ftmac100.c
@@ -825,16 +825,18 @@ static void ftmac100_get_drvinfo(struct net_device *netdev,
 	strlcpy(info->bus_info, dev_name(&netdev->dev), sizeof(info->bus_info));
 }
 
-static int ftmac100_get_settings(struct net_device *netdev, struct ethtool_cmd *cmd)
+static int ftmac100_get_link_ksettings(struct net_device *netdev,
+				       struct ethtool_link_ksettings *cmd)
 {
 	struct ftmac100 *priv = netdev_priv(netdev);
-	return mii_ethtool_gset(&priv->mii, cmd);
+	return mii_ethtool_get_link_ksettings(&priv->mii, cmd);
 }
 
-static int ftmac100_set_settings(struct net_device *netdev, struct ethtool_cmd *cmd)
+static int ftmac100_set_link_ksettings(struct net_device *netdev,
+				       const struct ethtool_link_ksettings *cmd)
 {
 	struct ftmac100 *priv = netdev_priv(netdev);
-	return mii_ethtool_sset(&priv->mii, cmd);
+	return mii_ethtool_set_link_ksettings(&priv->mii, cmd);
 }
 
 static int ftmac100_nway_reset(struct net_device *netdev)
@@ -850,11 +852,11 @@ static u32 ftmac100_get_link(struct net_device *netdev)
 }
 
 static const struct ethtool_ops ftmac100_ethtool_ops = {
-   .set_settings   = ftmac100_set_settings,
-   .get_settings   = ftmac100_get_settings,
.get_drvinfo= ftmac100_get_drvinfo,
.nway_reset = ftmac100_nway_reset,
.get_link   = ftmac100_get_link,
+   .get_link_ksettings = ftmac100_get_link_ksettings,
+   .set_link_ksettings = ftmac100_set_link_ksettings,
 };
 
 /**
-- 
1.7.4.4



Re: tcp_bbr: Forcing set of BBR congestion control as default

2017-01-02 Thread Sedat Dilek
On Mon, Jan 2, 2017 at 7:17 PM, Neal Cardwell  wrote:
> On Mon, Jan 2, 2017 at 12:05 AM, Sedat Dilek  wrote:
>>
>> Hi,
>>
>> I am trying to force the set of BBR congestion control as default.
>> My old linux-config uses CUBIC as default.
>> I want both BBR and CUBIC to be built but BBR shall be my default.
>>
>> I tried the below snippet.
>>
>> I refresh my new linux-config like this...
>>
>> $ MAKE="make V=1" ; COMPILER="mycompiler" ; MAKE_OPTS="CC=$COMPILER
>> HOSTCC=$COMPILER" ; yes "" | $MAKE $MAKE_OPTS oldconfig && $MAKE
>> $MAKE_OPTS silentoldconfig < /dev/null
>>
>> What am I doing wrong?
>
> Perhaps your build directory already has an old .config file that sets
> the DEFAULT_TCP_CONG to be "cubic", and your "make oldconfig" and
> "make silentoldconfig" are maintaining those lines from the old
> .config?
>
> If you want to start with your existing .config but then make BBR the
> default, then it might be simplest just to edit your .config directly:
>
> -CONFIG_DEFAULT_TCP_CONG="cubic"
> +CONFIG_DEFAULT_TCP_CONG="bbr"
>

Just to clarify...

So I can have both congestion controls, cubic and bbr, built and switch between them, for example via sysctl?

Which tc version is required?
Here tc is from iproute (20121211-2~precise).
Is that enough?

> BTW, I presume you have seen the warning in the BBR commit message or
> tcp_bbr.c about ensuring that BBR is used with the "fq" qdisc:
>
>   NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing
>   enabled, since pacing is integral to the BBR design and
>   implementation. BBR without pacing would not function properly, and
>   may incur unnecessary high packet loss rates.
>
> The BBR quick-start guide has some details about how to build and
> enable BBR and fq:
>
>   https://github.com/google/bbr/blob/master/Documentation/bbr-quick-start.md
>

Hmmm...

From [1] Section "Further reading"...

egrep '(CONFIG_TCP_CONG_BBR|CONFIG_NET_SCH_FQ)=' .config

then you see exactly the following lines:

CONFIG_TCP_CONG_BBR=y
CONFIG_NET_SCH_FQ=y

Should CONFIG_TCP_CONG_BBR have a "select CONFIG_NET_SCH_FQ" in its Kconfig?
That would be safer.

[1] https://github.com/google/bbr/blob/master/Documentation/bbr-quick-start.md

> Hope that helps,

Thanks, your reply helped with further testing.

- Sedat -

P.S.: Note2myself: Enable NET_SCH_FQ

$ ./scripts/diffconfig /boot/config-4.9.0-2-iniza-amd64 .config
 NET_SCH_FQ n -> y


Re: [PATCH net-next] net/sched: cls_flower: Add user specified data

2017-01-02 Thread John Fastabend
On 17-01-02 06:59 AM, Jamal Hadi Salim wrote:
> 
> We have been using a cookie as well for actions (which we have been
> using but have been too lazy to submit so far). I am going to port
> it over to the newer kernels and post it.
> In our case that is intended to be opaque to the kernel i.e kernel
> never inteprets it; in that case it is similar to the kernel
> FIB protocol field.
> 
> In your case - could this cookie have been a class/flowid
> (a 32 bit)?
> And would it not make more sense for it the cookie to be
> generic to all classifiers? i.e why is it specific to flower?
> 
> cheers,
> jamal
> 
> On 17-01-02 08:13 AM, Paul Blakey wrote:
>> This is to support saving extra data that might be helpful on retrieval.
>> First use case is upcoming openvswitch flow offloads, extra data will
>> include UFID and port mappings for each added flow.
>>
>> Signed-off-by: Paul Blakey 
>> Reviewed-by: Roi Dayan 
>> Acked-by: Jiri Pirko 
>> ---

Additionally I would like to point out this is an arbitrary length binary
blob (for undefined use, without even a specified encoding) that gets pushed
between user space and hardware ;) This seemed to get folks fairly excited in
the past.

Some questions: exactly what do you mean by "port mappings" above? In
general the 'tc' API uses the netdev the netlink msg is processed on as
the port mapping. If you mean OVS port to netdev port, I think this is
an OVS problem and nothing to do with 'tc'. For what it's worth, there is an
existing problem with 'tc' where rules only apply to a single ingress or
egress port, which is limiting on hardware.

The UFID in my ovs code base is defined as best I can tell here,

[OVS_FLOW_ATTR_UFID] = { .type = NL_A_UNSPEC, .optional = true,
 .min_len = sizeof(ovs_u128) },

So you need 128 bits if you want a 1:1 mapping onto 'tc'. So rather
than an arbitrary blob why not make the case that 'tc' ids need to be
128 bits long? Even if its just initially done in flower call it
flower_flow_id and define it so its not opaque and at least at the code
level it isn't an arbitrary blob of data.

And what are the "next" uses of this besides OVS. It would be really
valuable to see how this generalizes to other usage models. To avoid
embedding OVS syntax into 'tc'.

Finally if you want to see an example of binary data encodings look at
how drivers/hardware/users are currently using the user defined bits in
ethtools ntuple API. Also track down out of tree drivers to see other
interesting uses. And that was capped at 64bits :/

Thanks,
John







Re: tcp_bbr: Forcing set of BBR congestion control as default

2017-01-02 Thread Neal Cardwell
On Mon, Jan 2, 2017 at 12:05 AM, Sedat Dilek  wrote:
>
> Hi,
>
> I am trying to force the set of BBR congestion control as default.
> My old linux-config uses CUBIC as default.
> I want both BBR and CUBIC to be built but BBR shall be my default.
>
> I tried the below snippet.
>
> I refresh my new linux-config like this...
>
> $ MAKE="make V=1" ; COMPILER="mycompiler" ; MAKE_OPTS="CC=$COMPILER
> HOSTCC=$COMPILER" ; yes "" | $MAKE $MAKE_OPTS oldconfig && $MAKE
> $MAKE_OPTS silentoldconfig < /dev/null
>
> What am I doing wrong?

Perhaps your build directory already has an old .config file that sets
the DEFAULT_TCP_CONG to be "cubic", and your "make oldconfig" and
"make silentoldconfig" are maintaining those lines from the old
.config?

If you want to start with your existing .config but then make BBR the
default, then it might be simplest just to edit your .config directly:

-CONFIG_DEFAULT_TCP_CONG="cubic"
+CONFIG_DEFAULT_TCP_CONG="bbr"

BTW, I presume you have seen the warning in the BBR commit message or
tcp_bbr.c about ensuring that BBR is used with the "fq" qdisc:

  NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing
  enabled, since pacing is integral to the BBR design and
  implementation. BBR without pacing would not function properly, and
  may incur unnecessary high packet loss rates.

The BBR quick-start guide has some details about how to build and
enable BBR and fq:

  https://github.com/google/bbr/blob/master/Documentation/bbr-quick-start.md

Hope that helps,
neal


Re: [PATCH - resubmit] igmp: Make igmp group member RFC 3376 compliant

2017-01-02 Thread David Miller
From: Michal Tesar 
Date: Mon, 2 Jan 2017 14:38:36 +0100

> 5.2. Action on Reception of a Query
> 
>  When a system receives a Query, it does not respond immediately.
>  Instead, it delays its response by a random amount of time, bounded
>  by the Max Resp Time value derived from the Max Resp Code in the
>  received Query message.  A system may receive a variety of Queries on
>  different interfaces and of different kinds (e.g., General Queries,
>  Group-Specific Queries, and Group-and-Source-Specific Queries), each
>  of which may require its own delayed response.
> 
>  Before scheduling a response to a Query, the system must first
>  consider previously scheduled pending responses and in many cases
>  schedule a combined response.  Therefore, the system must be able to
>  maintain the following state:
> 
>  o A timer per interface for scheduling responses to General Queries.
> 
>  o A per-group and interface timer for scheduling responses to Group-
>Specific and Group-and-Source-Specific Queries.
> 
>  o A per-group and interface list of sources to be reported in the
>response to a Group-and-Source-Specific Query.
> 
>  When a new Query with the Router-Alert option arrives on an
>  interface, provided the system has state to report, a delay for a
>  response is randomly selected in the range (0, [Max Resp Time]) where
>  Max Resp Time is derived from Max Resp Code in the received Query
>  message.  The following rules are then used to determine if a Report
>  needs to be scheduled and the type of Report to schedule.  The rules
>  are considered in order and only the first matching rule is applied.
> 
>  1. If there is a pending response to a previous General Query
> scheduled sooner than the selected delay, no additional response
> needs to be scheduled.
> 
>  2. If the received Query is a General Query, the interface timer is
> used to schedule a response to the General Query after the
> selected delay.  Any previously pending response to a General
> Query is canceled.
> --8<--
> 
> Currently the timer is rearmed with new random expiration time for
> every incoming query regardless of possibly already pending report.
> Which is not aligned with the above RFE.
> It also might happen that higher rate of incoming queries can
> postpone the report after the expiration time of the first query
> causing group membership loss.
> 
> Now the per interface general query timer is rearmed only
> when there is no pending report already scheduled on that interface or
> the newly selected expiration time is before the already pending
> scheduled report.
> 
> Signed-off-by: Michal Tesar 

Applied and queued up for -stable, thanks.


Re: [PATCH] Update pptp handling to avoid null pointer deref. v2

2017-01-02 Thread David Miller
From: Ian Kumlien 
Date: Mon,  2 Jan 2017 09:18:35 +0100

> __skb_flow_dissect can be called with a skb or a data packet, either
> can be NULL. All calls seems to have been moved to __skb_header_pointer
> except the pptp handling which is still calling skb_header_pointer.
 ...
> Fixes: ab10dccb1160 ("rps: Inspect PPTP encapsulated by GRE to get flow hash")
> Signed-off-by: Ian Kumlien 

Applied and queued up for -stable.


[PATCH net] Documentation/networking: fix typo in mpls-sysctl

2017-01-02 Thread Alexander Alemayhu
s/utliziation/utilization

Signed-off-by: Alexander Alemayhu 
---
 Documentation/networking/mpls-sysctl.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/mpls-sysctl.txt b/Documentation/networking/mpls-sysctl.txt
index 9ed15f86c17c..15d8d16934fd 100644
--- a/Documentation/networking/mpls-sysctl.txt
+++ b/Documentation/networking/mpls-sysctl.txt
@@ -5,8 +5,8 @@ platform_labels - INTEGER
possible to configure forwarding for label values equal to or
greater than the number of platform labels.
 
-   A dense utliziation of the entries in the platform label table
-   is possible and expected aas the platform labels are locally
+   A dense utilization of the entries in the platform label table
+   is possible and expected as the platform labels are locally
allocated.
 
If the number of platform label table entries is set to 0 no
-- 
2.11.0



Re: pull-request: mac80211 2017-01-02

2017-01-02 Thread David Miller
From: Johannes Berg 
Date: Mon,  2 Jan 2017 15:39:03 +0100

> Happy New Year :-)

Same to you :-)

> Even though I was out for around two weeks, only a single fix came in,
> I guess everyone else was also out. But if people were out, then they
> won't be sending fixes soon again I suppose, and if they weren't out
> they haven't sent more fixes, so I decided to send this one to you now.
> 
> Please pull and let me know if there's any problem.

It certainly was a slow holiday break this year, kind of nice
actually...

Pulled, thanks Johannes.


[PATCH 1/1] igb: Fix hw_dbg logging in igb_update_flash_i210

2017-01-02 Thread Peter Senna Tschudin
From: Hannu Lounento 

Fix an if statement with hw_dbg lines where the logic was inverted with
regards to the corresponding return value used in the if statement.

Signed-off-by: Hannu Lounento 
Signed-off-by: Peter Senna Tschudin 
---
 drivers/net/ethernet/intel/igb/e1000_i210.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_i210.c b/drivers/net/ethernet/intel/igb/e1000_i210.c
index 8aa7987..07d48f2 100644
--- a/drivers/net/ethernet/intel/igb/e1000_i210.c
+++ b/drivers/net/ethernet/intel/igb/e1000_i210.c
@@ -699,9 +699,9 @@ static s32 igb_update_flash_i210(struct e1000_hw *hw)
 
ret_val = igb_pool_flash_update_done_i210(hw);
if (ret_val)
-   hw_dbg("Flash update complete\n");
-   else
hw_dbg("Flash update time out\n");
+   else
+   hw_dbg("Flash update complete\n");
 
 out:
return ret_val;
-- 
2.5.5



Re: Bug w/ (policy) routing

2017-01-02 Thread Olivier Brunel
On Mon, 2 Jan 2017 09:48:12 -0700
David Ahern  wrote:

> On 1/1/17 12:52 PM, Olivier Brunel wrote:
> > Indeed, if I first delete the rule for lookup local and recreate it
> > w/ higher prio than my "lookup 50", then no more issue.  
> 
> After the unshare or when creating a new network namespace, bringing
> the lo device up will create the local table and the rest of the
> commands will work properly; i.e., instead of moving the local rule
> you can run:

Indeed, and that's a much better solution for me, since I bring lo up
anyways, I might as well do it first. Thank you.


> unshare -n bash
> 
> ip li set lo up
> ip rule add table 50 prio 50
> ip link add test type veth peer name test2
> ...
> 
> -
> 
> Alex: 
> 
> The order of commands is influencing whether the unmerge succeeds or
> not which is wrong. I took a quick look and I don't see a simple
> solution to this. Effectively:
> 
> Adding a rule before bringing up any interface does not unmerge the
> tables: $ unshare -n bash
> $ ip rule add table 50 prio 50
> $ ip li set lo up
> 
> In fib_unmerge(), fib_new_table(net, RT_TABLE_LOCAL) returns null.
> 
> 
> Where the reverse order works:
> $ unshare -n bash
> $ ip li set lo up
> $ ip rule add table 50 prio 50
> 
> David



Re: Bug w/ (policy) routing

2017-01-02 Thread David Ahern
On 1/1/17 12:52 PM, Olivier Brunel wrote:
> Indeed, if I first delete the rule for lookup local and recreate it
> w/ higher prio than my "lookup 50", then no more issue.

After the unshare or when creating a new network namespace, bringing the lo 
device up will create the local table and the rest of the commands will work 
properly; i.e., instead of moving the local rule you can run:

unshare -n bash

ip li set lo up
ip rule add table 50 prio 50
ip link add test type veth peer name test2
...

-

Alex: 

The order of commands is influencing whether the unmerge succeeds or not which 
is wrong. I took a quick look and I don't see a simple solution to this. 
Effectively:

Adding a rule before bringing up any interface does not unmerge the tables:
$ unshare -n bash
$ ip rule add table 50 prio 50
$ ip li set lo up

In fib_unmerge(), fib_new_table(net, RT_TABLE_LOCAL) returns null.


Where the reverse order works:
$ unshare -n bash
$ ip li set lo up
$ ip rule add table 50 prio 50

David


[PATCH] net: emulex: benet: use new api ethtool_{get|set}_link_ksettings

2017-01-02 Thread Philippe Reynes
The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/emulex/benet/be_ethtool.c |   73 +++-
 1 files changed, 34 insertions(+), 39 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_ethtool.c b/drivers/net/ethernet/emulex/benet/be_ethtool.c
index 0a48a31..7d1819c 100644
--- a/drivers/net/ethernet/emulex/benet/be_ethtool.c
+++ b/drivers/net/ethernet/emulex/benet/be_ethtool.c
@@ -606,7 +606,8 @@ bool be_pause_supported(struct be_adapter *adapter)
false : true;
 }
 
-static int be_get_settings(struct net_device *netdev, struct ethtool_cmd *ecmd)
+static int be_get_link_ksettings(struct net_device *netdev,
+struct ethtool_link_ksettings *cmd)
 {
struct be_adapter *adapter = netdev_priv(netdev);
u8 link_status;
@@ -614,13 +615,14 @@ static int be_get_settings(struct net_device *netdev, struct ethtool_cmd *ecmd)
int status;
u32 auto_speeds;
u32 fixed_speeds;
+   u32 supported = 0, advertising = 0;
 
if (adapter->phy.link_speed < 0) {
		status = be_cmd_link_status_query(adapter, &link_speed,
						  &link_status, 0);
if (!status)
be_link_status_update(adapter, link_status);
-   ethtool_cmd_speed_set(ecmd, link_speed);
+   cmd->base.speed = link_speed;
 
status = be_cmd_get_phy_info(adapter);
if (!status) {
@@ -629,58 +631,51 @@ static int be_get_settings(struct net_device *netdev, struct ethtool_cmd *ecmd)
 
be_cmd_query_cable_type(adapter);
 
-   ecmd->supported =
+   supported =
convert_to_et_setting(adapter,
  auto_speeds |
  fixed_speeds);
-   ecmd->advertising =
+   advertising =
convert_to_et_setting(adapter, auto_speeds);
 
-   ecmd->port = be_get_port_type(adapter);
+   cmd->base.port = be_get_port_type(adapter);
 
if (adapter->phy.auto_speeds_supported) {
-   ecmd->supported |= SUPPORTED_Autoneg;
-   ecmd->autoneg = AUTONEG_ENABLE;
-   ecmd->advertising |= ADVERTISED_Autoneg;
+   supported |= SUPPORTED_Autoneg;
+   cmd->base.autoneg = AUTONEG_ENABLE;
+   advertising |= ADVERTISED_Autoneg;
}
 
-   ecmd->supported |= SUPPORTED_Pause;
+   supported |= SUPPORTED_Pause;
if (be_pause_supported(adapter))
-   ecmd->advertising |= ADVERTISED_Pause;
-
-   switch (adapter->phy.interface_type) {
-   case PHY_TYPE_KR_10GB:
-   case PHY_TYPE_KX4_10GB:
-   ecmd->transceiver = XCVR_INTERNAL;
-   break;
-   default:
-   ecmd->transceiver = XCVR_EXTERNAL;
-   break;
-   }
+   advertising |= ADVERTISED_Pause;
} else {
-   ecmd->port = PORT_OTHER;
-   ecmd->autoneg = AUTONEG_DISABLE;
-   ecmd->transceiver = XCVR_DUMMY1;
+   cmd->base.port = PORT_OTHER;
+   cmd->base.autoneg = AUTONEG_DISABLE;
}
 
/* Save for future use */
-   adapter->phy.link_speed = ethtool_cmd_speed(ecmd);
-   adapter->phy.port_type = ecmd->port;
-   adapter->phy.transceiver = ecmd->transceiver;
-   adapter->phy.autoneg = ecmd->autoneg;
-   adapter->phy.advertising = ecmd->advertising;
-   adapter->phy.supported = ecmd->supported;
+   adapter->phy.link_speed = cmd->base.speed;
+   adapter->phy.port_type = cmd->base.port;
+   adapter->phy.autoneg = cmd->base.autoneg;
+   adapter->phy.advertising = advertising;
+   adapter->phy.supported = supported;
} else {
-   ethtool_cmd_speed_set(ecmd, adapter->phy.link_speed);
-   ecmd->port = adapter->phy.port_type;
-   ecmd->transceiver = adapter->phy.transceiver;
-   ecmd->autoneg = adapter->phy.autoneg;
-   ecmd->advertising = adapter->phy.advertising;
-   ecmd->supported = adapter->phy.supported;
+   

Re: [RFC PATCH net-next v4 1/2] macb: Add 1588 support in Cadence GEM.

2017-01-02 Thread Richard Cochran
On Mon, Jan 02, 2017 at 03:47:07PM +0100, Nicolas Ferre wrote:
> Le 02/01/2017 à 12:31, Richard Cochran a écrit :
> > This Cadence IP core is a complete disaster.
> 
> Well, it evolved and proposes several options to different SoC
> integrators. This is not something unusual...
> I suspect as well that some other network adapters have the same
> weakness concerning PTP timestamps in a single register as the early
> revisions of this IP.

It appears that this core can neither latch the time on read or write,
nor even latch time stamps.  I have worked with many different PTP HW
implementations, even early ones like on the ixp4xx, and it is no
exaggeration to say that this one is uniquely broken.

> I suspect that Rafal tend to jump too quickly to the latest IP revisions
> and add more options to this series: let's not try to pour too much
> things into this code right now.

Why can't you check the IP version in the driver?

And is it really true that the registers don't latch the time stamps,
as Rafal said?  If so, then we cannot accept the non-descriptor driver
version, since it cannot possibly work correctly.

Thanks,
Richard


Re: [PATCH net-next rfc 0/6] convert tc_verd to integer bitfields

2017-01-02 Thread Jamal Hadi Salim

And a happy new year netdev.
No objections to new year resolution of slimming the skb.
But: I am still concerned about the recursion that getting rid of
some of these bits could embolden, i.e. my suggestion was in fact to
restore some of those bits taken away by Florian after the ingress
redirect patches from Shmulik.

The possibilities are: egress->egress, egress->ingress,
ingress->egress, ingress->ingress. The suggestion was
xmit_recursion with some skb magic would suffice.
Hannes promised around last netdevconf that he has a scheme to solve
it without using any extra skb state.

cheers,
jamal

On 16-12-28 02:13 PM, Willem de Bruijn wrote:

From: Willem de Bruijn 

The skb tc_verd field takes up two bytes but uses far fewer bits.
Convert the remaining use cases to bitfields that fit in existing
holes (depending on config options) and potentially save the two
bytes in struct sk_buff.

This patchset is based on an earlier set by Florian Westphal and its
discussion (http://www.spinics.net/lists/netdev/msg329181.html).

Patches 1 and 2 are low hanging fruit: removing the last traces of
  data that are no longer stored in tc_verd.

Patches 3 and 4 convert tc_verd to individual bitfields (5 bits).

Patch 5 reduces TC_AT to a single bitfield,
  as AT_STACK is not valid here (unlike in the case of TC_FROM).

Patch 6 changes TC_FROM to two bitfields with clearly defined purpose.

It may be possible to reduce storage further after this initial round.
If tc_skip_classify is set only by IFB, testing skb_iif may suffice.
The L2 header pushing/popping logic can perhaps be shared with
AF_PACKET, which currently uses pkt_type for the same purpose.

Tested ingress mirred + netem + ifb:

  ip link set dev ifb0 up
  tc qdisc add dev eth0 ingress
  tc filter add dev eth0 parent : \
u32 match ip dport 8000 0x \
action mirred egress redirect dev ifb0
  tc qdisc add dev ifb0 root netem delay 1000ms
  nc -u -l 8000 &
  ssh $otherhost nc -u $host 8000

Tested egress mirred:

  ip link add veth1 type veth peer name veth2
  ip link set dev veth1 up
  ip link set dev veth2 up
  tcpdump -n -i veth2 udp and dst port 8000 &

  tc qdisc add dev eth0 root handle 1: prio
  tc filter add dev eth0 parent 1:0 \
u32 match ip dport 8000 0x \
action mirred egress redirect dev veth1
  tc qdisc add dev veth1 root netem delay 1000ms
  nc -u $otherhost 8000

Willem de Bruijn (6):
  net-tc: remove unused tc_verd fields
  net-tc: make MAX_RECLASSIFY_LOOP local
  net-tc: extract skip classify bit from tc_verd
  net-tc: convert tc_verd to integer bitfields
  net-tc: convert tc_at to tc_at_ingress
  net-tc: convert tc_from to tc_from_ingress and tc_redirected

 drivers/net/ifb.c| 16 ---
 drivers/staging/octeon/ethernet-tx.c |  5 ++--
 include/linux/skbuff.h   | 15 ++
 include/net/sch_generic.h| 20 -
 include/uapi/linux/pkt_cls.h | 55 
 net/core/dev.c   | 20 -
 net/core/pktgen.c|  4 +--
 net/core/skbuff.c|  3 --
 net/sched/act_api.c  |  8 ++
 net/sched/act_ife.c  |  7 ++---
 net/sched/act_mirred.c   | 21 +++---
 net/sched/sch_api.c  |  4 ++-
 net/sched/sch_netem.c|  2 +-
 13 files changed, 64 insertions(+), 116 deletions(-)





Re: [RFC PATCH net-next v4 1/2] macb: Add 1588 support in Cadence GEM.

2017-01-02 Thread Richard Cochran
On Mon, Jan 02, 2017 at 05:13:34PM +0530, Harini Katakam wrote:
> From the revision history of the Cadence spec, all versions starting
> with r1p02 have the ability to include the timestamp in descriptors.

So why not add code to read the version, hm?

> For previous versions the event register is the only option.

And is it true that the registers do not latch the time stamp?

If so, then the IP core is more than useless.

Thanks,
Richard


Re: [PATCH net 0/3] net: stmmac: dwmac-oxnas: fix leaks and simplify pm

2017-01-02 Thread Neil Armstrong
On 01/02/2017 12:56 PM, Johan Hovold wrote:
> These patches fix of-node and fixed-phydev leaks in the recently added
> dwmac-oxnas driver, and ultimately switch over to using the generic pm
> implementation as the required callbacks are now in place.
> 
> Note that this series has only been compile tested.
> 
> Johan
> 
> 
> Johan Hovold (3):
>   net: stmmac: dwmac-oxnas: fix of-node leak
>   net: stmmac: dwmac-oxnas: fix fixed-link-phydev leaks
>   net: stmmac: dwmac-oxnas: use generic pm implementation
> 
>  drivers/net/ethernet/stmicro/stmmac/dwmac-oxnas.c | 89 +--
>  1 file changed, 33 insertions(+), 56 deletions(-)
> 

Hi Johan,

This series looks good, I will (hopefully) send a Tested-by in the next few 
days.

Thanks,
Neil


Re: [PATCH] rfkill: remove rfkill-regulator

2017-01-02 Thread Johannes Berg
On Mon, 2017-01-02 at 16:01 +0100, Johannes Berg wrote:
> From: Johannes Berg 
> 
> There are no users of this ("vrfkill") in the tree, so it's just
> dead code - remove it.
> 
> This also isn't really how rfkill is supposed to be used - it's
> intended as a signalling mechanism to/from the device, which the
> driver (and partially cfg80211) will handle - having a separate
> rfkill instance for a regulator is confusing, the driver should
> use the regulator instead to turn off the device when requested.

OTOH, the rfkill-gpio is essentially the same thing, and it *does* get
used - by ACPI even, to control a GPS chip. And I'm not even sure that
there's a clear place to put this since there probably aren't any GPS
drivers?

johannes


[PATCH] rfkill: remove rfkill-regulator

2017-01-02 Thread Johannes Berg
From: Johannes Berg 

There are no users of this ("vrfkill") in the tree, so it's just
dead code - remove it.

This also isn't really how rfkill is supposed to be used - it's
intended as a signalling mechanism to/from the device, which the
driver (and partially cfg80211) will handle - having a separate
rfkill instance for a regulator is confusing, the driver should
use the regulator instead to turn off the device when requested.

Signed-off-by: Johannes Berg 
---
 include/linux/rfkill-regulator.h |  48 
 net/rfkill/Kconfig   |  11 ---
 net/rfkill/Makefile  |   1 -
 net/rfkill/rfkill-regulator.c| 154 ---
 4 files changed, 214 deletions(-)
 delete mode 100644 include/linux/rfkill-regulator.h
 delete mode 100644 net/rfkill/rfkill-regulator.c

diff --git a/include/linux/rfkill-regulator.h b/include/linux/rfkill-regulator.h
deleted file mode 100644
index aca36bc83315..
--- a/include/linux/rfkill-regulator.h
+++ /dev/null
@@ -1,48 +0,0 @@
-/*
- * rfkill-regulator.c - Regulator consumer driver for rfkill
- *
- * Copyright (C) 2009  Guiming Zhuo 
- * Copyright (C) 2011  Antonio Ospite 
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- *
- */
-
-#ifndef __LINUX_RFKILL_REGULATOR_H
-#define __LINUX_RFKILL_REGULATOR_H
-
-/*
- * Use "vrfkill" as supply id when declaring the regulator consumer:
- *
- * static struct regulator_consumer_supply pcap_regulator_V6_consumers [] = {
- * { .dev_name = "rfkill-regulator.0", .supply = "vrfkill" },
- * };
- *
- * If you have several regulator driven rfkill, you can append a numerical id 
to
- * .dev_name as done above, and use the same id when declaring the platform
- * device:
- *
- * static struct rfkill_regulator_platform_data ezx_rfkill_bt_data = {
- * .name  = "ezx-bluetooth",
- * .type  = RFKILL_TYPE_BLUETOOTH,
- * };
- *
- * static struct platform_device a910_rfkill = {
- * .name  = "rfkill-regulator",
- * .id= 0,
- * .dev   = {
- * .platform_data = &ezx_rfkill_bt_data,
- * },
- * };
- */
-
-#include 
-
-struct rfkill_regulator_platform_data {
-   char *name; /* the name for the rfkill switch */
-   enum rfkill_type type;  /* the type as specified in rfkill.h */
-};
-
-#endif /* __LINUX_RFKILL_REGULATOR_H */
diff --git a/net/rfkill/Kconfig b/net/rfkill/Kconfig
index 868f1ad0415a..060600b03fad 100644
--- a/net/rfkill/Kconfig
+++ b/net/rfkill/Kconfig
@@ -23,17 +23,6 @@ config RFKILL_INPUT
depends on INPUT = y || RFKILL = INPUT
default y if !EXPERT
 
-config RFKILL_REGULATOR
-   tristate "Generic rfkill regulator driver"
-   depends on RFKILL || !RFKILL
-   depends on REGULATOR
-   help
-  This options enable controlling radio transmitters connected to
-  voltage regulator using the regulator framework.
-
-  To compile this driver as a module, choose M here: the module will
-  be called rfkill-regulator.
-
 config RFKILL_GPIO
tristate "GPIO RFKILL driver"
depends on RFKILL
diff --git a/net/rfkill/Makefile b/net/rfkill/Makefile
index 311768783f4a..87a80aded0b3 100644
--- a/net/rfkill/Makefile
+++ b/net/rfkill/Makefile
@@ -5,5 +5,4 @@
 rfkill-y   += core.o
 rfkill-$(CONFIG_RFKILL_INPUT)  += input.o
 obj-$(CONFIG_RFKILL)   += rfkill.o
-obj-$(CONFIG_RFKILL_REGULATOR) += rfkill-regulator.o
 obj-$(CONFIG_RFKILL_GPIO)  += rfkill-gpio.o
diff --git a/net/rfkill/rfkill-regulator.c b/net/rfkill/rfkill-regulator.c
deleted file mode 100644
index 50cd26a48e87..
--- a/net/rfkill/rfkill-regulator.c
+++ /dev/null
@@ -1,154 +0,0 @@
-/*
- * rfkill-regulator.c - Regulator consumer driver for rfkill
- *
- * Copyright (C) 2009  Guiming Zhuo 
- * Copyright (C) 2011  Antonio Ospite 
- *
- * Implementation inspired by leds-regulator driver.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- *
- */
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-struct rfkill_regulator_data {
-   struct rfkill *rf_kill;
-   bool reg_enabled;
-
-   struct regulator *vcc;
-};
-
-static int rfkill_regulator_set_block(void *data, bool blocked)
-{
-   struct rfkill_regulator_data *rfkill_data = data;
-   int ret = 0;
-
-   pr_debug("%s: blocked: %d\n", __func__, blocked);
-
-   if (blocked) {
-   if (rfkill_data->reg_enabled) {
-   regulator_disable(rfkill_data->vcc);
-   rfkill_data->reg_enabled = false;
-   }
