date:20161210

Re: Misalignment, MIPS, and ip_hdr(skb)->version

2016-12-10 Thread Greg KH

On Sat, Dec 10, 2016 at 11:18:14PM +0100, Dan Lüdtke wrote:
> 
> > On 8 Dec 2016, at 05:34, Daniel Kahn Gillmor  wrote:
> > 
> > On Wed 2016-12-07 19:30:34 -0500, Hannes Frederic Sowa wrote:
> >> Your custom protocol should be designed in a way you get an aligned ip
> >> header. Most protocols of the IETF follow this mantra and it is always
> >> possible to e.g. pad options so you end up on aligned boundaries for the
> >> next header.
> > 
> > fwiw, i'm not convinced that "most protocols of the IETF follow this
> > mantra".  we've had multiple discussions in different protocol groups
> > about shaving or bloating by a few bytes here or there in different
> > protocols, and i don't think anyone has brought up memory alignment as
> > an argument in any of the discussions i've followed.
> > 
> 
> If the trade-off is between 1 padding byte and 2 byte alignment versus
> 3 padding bytes and 4 byte alignment I would definitely opt for 3
> padding bytes. I know how that waste feels like to a protocol
> designer, but I think it is worth it. Maybe the padding/reserved will
> be useful some day for an additional feature.

Note, if you do do this (hint, I think it is a good idea), require that
these reserved/pad fields always be set to 0 for now, so that no one puts
garbage in them. Otherwise, if you later want to use them, it will be a
mess.

thanks,

greg k-h
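
A minimal sketch of the advice above, assuming a hypothetical 8-byte header (`struct demo_hdr` and both helpers are made up for illustration and belong to no real protocol): the sender zeroes the reserved bytes and the receiver rejects anything else, so the bytes stay free for a later extension.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header, for illustration only: one type byte plus three
 * reserved bytes, so the 32-bit field after it sits on a 4-byte boundary. */
struct demo_hdr {
	uint8_t  type;
	uint8_t  reserved[3];	/* must be zero on the wire */
	uint32_t session_id;
};

/* Sender side: clear the whole header first so no garbage leaks into the
 * reserved bytes. */
static void demo_hdr_init(struct demo_hdr *h, uint8_t type, uint32_t id)
{
	memset(h, 0, sizeof(*h));
	h->type = type;
	h->session_id = id;
}

/* Receiver side: refuse headers whose reserved bytes are not all zero. */
static bool demo_hdr_valid(const struct demo_hdr *h)
{
	return h->reserved[0] == 0 && h->reserved[1] == 0 &&
	       h->reserved[2] == 0;
}
```

Rejecting non-zero reserved bytes on receive is one possible policy; it is what keeps old senders from filling the field with values that a future version would misinterpret.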

Re: [Patch net] e1000: use disable_hardirq() for e1000_netpoll()

2016-12-10 Thread David Miller

From: Cong Wang 
Date: Sat, 10 Dec 2016 14:22:42 -0800

> In commit 02cea3958664 ("genirq: Provide disable_hardirq()")
> Peter introduced disable_hardirq() for netpoll, but it is forgotten
> to use it for e1000.
> 
> This patch changes disable_irq() to disable_hardirq() for e1000.
> 
> Reported-by: Dave Jones 
> Suggested-by: Sabrina Dubroca 
> Cc: Peter Zijlstra (Intel) 
> Cc: Jeff Kirsher 
> Signed-off-by: Cong Wang 

Applied.

Re: [PATCH] i40e: don't truncate match_method assignment

2016-12-10 Thread David Miller

From: Jacob Keller 
Date: Fri,  9 Dec 2016 13:39:21 -0800

> The .match_method field is a u8, so we shouldn't be casting to a u16,
> and because it is only one byte, we do not need to byte swap anything.
> Just assign the value directly. This avoids issues on Big Endian
> architectures which would have byte swapped and then incorrectly
> truncated the value.
> 
> Signed-off-by: Jacob Keller 

Applied.
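
The truncation described in this commit message can be reproduced in plain C. This is only a sketch: `demo_swap16()` models what a 16-bit CPU-to-little-endian conversion does on a big-endian CPU, and the two `demo_store_*` helpers are hypothetical names, not i40e code.

```c
#include <assert.h>
#include <stdint.h>

/* Models a 16-bit cpu-to-le conversion on a big-endian CPU: the two bytes
 * are swapped. */
static uint16_t demo_swap16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

/* Buggy pattern: treat the value as 16 bits wide, swap it, then let the
 * assignment to a one-byte field truncate the result. */
static uint8_t demo_store_via_u16_on_be(uint8_t value)
{
	/* After the swap the payload sits in the high byte, so the
	 * truncation keeps only the zero low byte. */
	return (uint8_t)demo_swap16(value);
}

/* Fixed pattern: a one-byte field has no byte order, so assign directly. */
static uint8_t demo_store_direct(uint8_t value)
{
	return value;
}
```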

Re: [PATCH v2] net: ethernet: ti: netcp: add support of cpts

2016-12-10 Thread David Miller

From: Grygorii Strashko 
Date: Thu, 8 Dec 2016 16:21:56 -0600

> From: WingMan Kwok 
> 
> This patch adds support of the cpts device found in the
> gbe and 10gbe ethernet switches on the keystone 2 SoCs
> (66AK2E/L/Hx, 66AK2Gx).
> 
> Cc: Richard Cochran 
> Signed-off-by: WingMan Kwok 
> Signed-off-by: Grygorii Strashko 
> ---
> changes in v2:
>  - dropped bindings changes
> 
> link on v1:
>  https://lkml.org/lkml/2016/11/28/781

Applied.

Re: [PATCH] [v4] net: phy: phy drivers should not set SUPPORTED_[Asym_]Pause

2016-12-10 Thread David Miller

From: Timur Tabi 
Date: Wed,  7 Dec 2016 13:20:51 -0600

> Instead of having individual PHY drivers set the SUPPORTED_Pause and
> SUPPORTED_Asym_Pause flags, phylib itself should set those flags,
> unless there is a hardware erratum or other special case.  During
> autonegotiation, the PHYs will determine whether to enable pause
> frame support.
> 
> Pause frames are a feature that is supported by the MAC.  It is the MAC
> that generates the frames and that processes them.  The PHY can only be
> configured to allow them to pass through.
> 
> This commit also effectively reverts the recently applied c7a61319
> ("net: phy: dp83848: Support ethernet pause frames").
> 
> So the new process is:
> 
> 1) Unless the PHY driver overrides it, phylib sets the SUPPORTED_Pause
> and SUPPORTED_Asym_Pause bits in phydev->supported.  This indicates that
> the PHY supports pause frames.
> 
> 2) The MAC driver checks phydev->supported before it calls phy_start().
> If (SUPPORTED_Pause | SUPPORTED_Asym_Pause) is set, then the MAC driver
> sets those bits in phydev->advertising, if it wants to enable pause
> frame support.
> 
> 3) When the link state changes, the MAC driver checks phydev->pause and
> phydev->asym_pause.  If the bits are set, then it enables the corresponding
> features in the MAC.  The algorithm is:
> 
>   if (phydev->pause)
>           The MAC should be programmed to receive and honor
>           pause frames it receives, i.e. enable receive flow control.
> 
>   if (phydev->pause != phydev->asym_pause)
>           The MAC should be programmed to transmit pause
>           frames when needed, i.e. enable transmit flow control.
> 
> Signed-off-by: Timur Tabi 

Applied.
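
Step 3 of the commit message above can be sketched as a small pure function. The struct and function names here are hypothetical stand-ins (the real bits live in struct phy_device), and the sketch only encodes the two rules as the commit message states them.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two negotiated bits reported by phylib. */
struct demo_phy_state {
	bool pause;
	bool asym_pause;
};

/* What the MAC driver would program after a link change. */
struct demo_flow_ctrl {
	bool rx;	/* receive and honor pause frames */
	bool tx;	/* transmit pause frames when needed */
};

/* Applies the two rules from the commit message:
 *   rx is enabled when pause is set,
 *   tx is enabled when pause and asym_pause differ. */
static struct demo_flow_ctrl demo_resolve_flow_ctrl(struct demo_phy_state s)
{
	struct demo_flow_ctrl fc;

	fc.rx = s.pause;
	fc.tx = (s.pause != s.asym_pause);
	return fc;
}
```

Keeping the resolution in one function like this makes it easy to check all four bit combinations before touching any MAC registers.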

Re: [PATCH net-next 1/3] net: l2tp: export debug flags to UAPI

2016-12-10 Thread David Miller

From: Asbjoern Sloth Toennesen 
Date: Sun, 11 Dec 2016 00:18:57 +

> Move the L2TP_MSG_* definitions to UAPI, as it is part of
> the netlink API.
> 
> Signed-off-by: Asbjoern Sloth Toennesen 

Applied.

Re: [PATCH net-next 3/3] net: l2tp: ppp: change PPPOL2TP_MSG_* => L2TP_MSG_*

2016-12-10 Thread David Miller

From: Asbjoern Sloth Toennesen 
Date: Sun, 11 Dec 2016 00:18:59 +

> Signed-off-by: Asbjoern Sloth Toennesen 

Applied.

Re: [PATCH net-next 2/3] net: l2tp: deprecate PPPOL2TP_MSG_* in favour of L2TP_MSG_*

2016-12-10 Thread David Miller

From: Asbjoern Sloth Toennesen 
Date: Sun, 11 Dec 2016 00:18:58 +

> PPPOL2TP_MSG_* and L2TP_MSG_* are duplicates, and are being used
> interchangeably in the kernel, so let's standardize on L2TP_MSG_*
> internally, and keep PPPOL2TP_MSG_* defined in UAPI for compatibility.
> 
> Signed-off-by: Asbjoern Sloth Toennesen 

Applied.

Re: Remove private tx queue locks

2016-12-10 Thread David Miller

From: Lino Sanfilippo 
Date: Fri,  9 Dec 2016 00:55:41 +0100

> this patch series removes unnecessary private locks in the sxgbe and the
> stmmac driver.
> 
> v2:
> - adjust commit message

Series applied to net-next, thanks.

Re: [PATCH net 0/3] net: bridge: fast ageing on topology change

2016-12-10 Thread David Miller

From: Vivien Didelot 
Date: Sat, 10 Dec 2016 13:44:26 -0500

> 802.1D [1] specifies that the bridges in a network must use a short
> value to age out dynamic entries in the Filtering Database for a period,
> once a topology change has been communicated by the root bridge.
> 
> This patchset fixes this for the in-kernel STP implementation.
 ...

Series applied.

Re: [PATCH 09/10] vsock/virtio: fix src/dst cid format

2016-12-10 Thread Michael S. Tsirkin

On Wed, Dec 07, 2016 at 12:31:56PM +0800, Jason Wang wrote:
> 
> 
> On 2016年12月06日 23:41, Michael S. Tsirkin wrote:
> > These fields are 64 bit, using le32_to_cpu and friends
> > on these will not do the right thing.
> > Fix this up.
> > 
> > Cc: sta...@vger.kernel.org
> > Signed-off-by: Michael S. Tsirkin 
> > ---
> >   net/vmw_vsock/virtio_transport_common.c | 14 +++---
> >   1 file changed, 7 insertions(+), 7 deletions(-)
> > 
> > diff --git a/net/vmw_vsock/virtio_transport_common.c 
> > b/net/vmw_vsock/virtio_transport_common.c
> > index 6120384..22e99c4 100644
> > --- a/net/vmw_vsock/virtio_transport_common.c
> > +++ b/net/vmw_vsock/virtio_transport_common.c
> > @@ -606,9 +606,9 @@ static int virtio_transport_reset_no_sock(struct 
> > virtio_vsock_pkt *pkt)
> > return 0;
> > pkt = virtio_transport_alloc_pkt(, 0,
> > -le32_to_cpu(pkt->hdr.dst_cid),
> > +le64_to_cpu(pkt->hdr.dst_cid),
> >  le32_to_cpu(pkt->hdr.dst_port),
> > -le32_to_cpu(pkt->hdr.src_cid),
> > +le64_to_cpu(pkt->hdr.src_cid),
> >  le32_to_cpu(pkt->hdr.src_port));
> 
> Looking at sockaddr_vm, svm_cid is "unsigned int", do we really want 64 bit
> here?

Can't change the protocol at this point.


> > if (!pkt)
> > return -ENOMEM;
> > @@ -823,7 +823,7 @@ virtio_transport_send_response(struct vsock_sock *vsk,
> > struct virtio_vsock_pkt_info info = {
> > .op = VIRTIO_VSOCK_OP_RESPONSE,
> > .type = VIRTIO_VSOCK_TYPE_STREAM,
> > -   .remote_cid = le32_to_cpu(pkt->hdr.src_cid),
> > +   .remote_cid = le64_to_cpu(pkt->hdr.src_cid),
> > .remote_port = le32_to_cpu(pkt->hdr.src_port),
> > .reply = true,
> > };
> > @@ -863,9 +863,9 @@ virtio_transport_recv_listen(struct sock *sk, struct 
> > virtio_vsock_pkt *pkt)
> > child->sk_state = SS_CONNECTED;
> > vchild = vsock_sk(child);
> > -   vsock_addr_init(>local_addr, le32_to_cpu(pkt->hdr.dst_cid),
> > +   vsock_addr_init(>local_addr, le64_to_cpu(pkt->hdr.dst_cid),
> > le32_to_cpu(pkt->hdr.dst_port));
> > -   vsock_addr_init(>remote_addr, le32_to_cpu(pkt->hdr.src_cid),
> > +   vsock_addr_init(>remote_addr, le64_to_cpu(pkt->hdr.src_cid),
> > le32_to_cpu(pkt->hdr.src_port));
> > vsock_insert_connected(vchild);
> > @@ -904,9 +904,9 @@ void virtio_transport_recv_pkt(struct virtio_vsock_pkt 
> > *pkt)
> > struct sock *sk;
> > bool space_available;
> > -   vsock_addr_init(, le32_to_cpu(pkt->hdr.src_cid),
> > +   vsock_addr_init(, le64_to_cpu(pkt->hdr.src_cid),
> > le32_to_cpu(pkt->hdr.src_port));
> > -   vsock_addr_init(, le32_to_cpu(pkt->hdr.dst_cid),
> > +   vsock_addr_init(, le64_to_cpu(pkt->hdr.dst_cid),
> > le32_to_cpu(pkt->hdr.dst_port));
> > trace_virtio_transport_recv_pkt(src.svm_cid, src.svm_port,
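
To illustrate why the conversion width has to match the field width, here is a small userspace sketch. `demo_get_le32()` and `demo_get_le64()` are portable stand-ins written for this example, not the kernel's le32_to_cpu()/le64_to_cpu(); they decode a little-endian field from its wire bytes regardless of host byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a 32-bit little-endian field from four wire bytes. */
static uint32_t demo_get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode a 64-bit little-endian field from eight wire bytes. */
static uint64_t demo_get_le64(const uint8_t *p)
{
	return (uint64_t)demo_get_le32(p) |
	       ((uint64_t)demo_get_le32(p + 4) << 32);
}
```

A 32-bit decode of a 64-bit field only ever looks at four of the eight bytes, so any CID that does not fit in 32 bits would be misread.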

[PATCH net-next 3/3] net: l2tp: ppp: change PPPOL2TP_MSG_* => L2TP_MSG_*

2016-12-10 Thread Asbjoern Sloth Toennesen

Signed-off-by: Asbjoern Sloth Toennesen 
---
 net/l2tp/l2tp_ppp.c | 54 ++---
 1 file changed, 27 insertions(+), 27 deletions(-)

diff --git a/net/l2tp/l2tp_ppp.c b/net/l2tp/l2tp_ppp.c
index 2ddfec1..36cc56f 100644
--- a/net/l2tp/l2tp_ppp.c
+++ b/net/l2tp/l2tp_ppp.c
@@ -231,14 +231,14 @@ static void pppol2tp_recv(struct l2tp_session *session, 
struct sk_buff *skb, int
if (sk->sk_state & PPPOX_BOUND) {
struct pppox_sock *po;
 
-   l2tp_dbg(session, PPPOL2TP_MSG_DATA,
+   l2tp_dbg(session, L2TP_MSG_DATA,
 "%s: recv %d byte data frame, passing to ppp\n",
 session->name, data_len);
 
po = pppox_sk(sk);
ppp_input(>chan, skb);
} else {
-   l2tp_dbg(session, PPPOL2TP_MSG_DATA,
+   l2tp_dbg(session, L2TP_MSG_DATA,
 "%s: recv %d byte data frame, passing to L2TP 
socket\n",
 session->name, data_len);
 
@@ -251,7 +251,7 @@ static void pppol2tp_recv(struct l2tp_session *session, 
struct sk_buff *skb, int
return;
 
 no_sock:
-   l2tp_info(session, PPPOL2TP_MSG_DATA, "%s: no socket\n", session->name);
+   l2tp_info(session, L2TP_MSG_DATA, "%s: no socket\n", session->name);
kfree_skb(skb);
 }
 
@@ -773,7 +773,7 @@ static int pppol2tp_connect(struct socket *sock, struct 
sockaddr *uservaddr,
/* This is how we get the session context from the socket. */
sk->sk_user_data = session;
sk->sk_state = PPPOX_CONNECTED;
-   l2tp_info(session, PPPOL2TP_MSG_CONTROL, "%s: created\n",
+   l2tp_info(session, L2TP_MSG_CONTROL, "%s: created\n",
  session->name);
 
 end:
@@ -827,7 +827,7 @@ static int pppol2tp_session_create(struct net *net, u32 
tunnel_id, u32 session_i
ps = l2tp_session_priv(session);
ps->tunnel_sock = tunnel->sock;
 
-   l2tp_info(session, PPPOL2TP_MSG_CONTROL, "%s: created\n",
+   l2tp_info(session, L2TP_MSG_CONTROL, "%s: created\n",
  session->name);
 
error = 0;
@@ -989,7 +989,7 @@ static int pppol2tp_session_ioctl(struct l2tp_session 
*session,
struct l2tp_tunnel *tunnel = session->tunnel;
struct pppol2tp_ioc_stats stats;
 
-   l2tp_dbg(session, PPPOL2TP_MSG_CONTROL,
+   l2tp_dbg(session, L2TP_MSG_CONTROL,
 "%s: pppol2tp_session_ioctl(cmd=%#x, arg=%#lx)\n",
 session->name, cmd, arg);
 
@@ -1009,7 +1009,7 @@ static int pppol2tp_session_ioctl(struct l2tp_session 
*session,
if (copy_to_user((void __user *) arg, , sizeof(struct 
ifreq)))
break;
 
-   l2tp_info(session, PPPOL2TP_MSG_CONTROL, "%s: get mtu=%d\n",
+   l2tp_info(session, L2TP_MSG_CONTROL, "%s: get mtu=%d\n",
  session->name, session->mtu);
err = 0;
break;
@@ -1025,7 +1025,7 @@ static int pppol2tp_session_ioctl(struct l2tp_session 
*session,
 
session->mtu = ifr.ifr_mtu;
 
-   l2tp_info(session, PPPOL2TP_MSG_CONTROL, "%s: set mtu=%d\n",
+   l2tp_info(session, L2TP_MSG_CONTROL, "%s: set mtu=%d\n",
  session->name, session->mtu);
err = 0;
break;
@@ -1039,7 +1039,7 @@ static int pppol2tp_session_ioctl(struct l2tp_session 
*session,
if (put_user(session->mru, (int __user *) arg))
break;
 
-   l2tp_info(session, PPPOL2TP_MSG_CONTROL, "%s: get mru=%d\n",
+   l2tp_info(session, L2TP_MSG_CONTROL, "%s: get mru=%d\n",
  session->name, session->mru);
err = 0;
break;
@@ -1054,7 +1054,7 @@ static int pppol2tp_session_ioctl(struct l2tp_session 
*session,
break;
 
session->mru = val;
-   l2tp_info(session, PPPOL2TP_MSG_CONTROL, "%s: set mru=%d\n",
+   l2tp_info(session, L2TP_MSG_CONTROL, "%s: set mru=%d\n",
  session->name, session->mru);
err = 0;
break;
@@ -1064,7 +1064,7 @@ static int pppol2tp_session_ioctl(struct l2tp_session 
*session,
if (put_user(ps->flags, (int __user *) arg))
break;
 
-   l2tp_info(session, PPPOL2TP_MSG_CONTROL, "%s: get flags=%d\n",
+   l2tp_info(session, L2TP_MSG_CONTROL, "%s: get flags=%d\n",
  session->name, ps->flags);
err = 0;
break;
@@ -1074,7 +1074,7 @@ static int pppol2tp_session_ioctl(struct l2tp_session 
*session,
if (get_user(val, (int __user *) arg))
break;
ps->flags = val;
-   l2tp_info(session, PPPOL2TP_MSG_CONTROL, "%s: set flags=%d\n",
+

[PATCH net-next 1/3] net: l2tp: export debug flags to UAPI

2016-12-10 Thread Asbjoern Sloth Toennesen

Move the L2TP_MSG_* definitions to UAPI, as it is part of
the netlink API.

Signed-off-by: Asbjoern Sloth Toennesen 
---
 include/uapi/linux/l2tp.h | 17 -
 net/l2tp/l2tp_core.h  | 10 --
 2 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/l2tp.h b/include/uapi/linux/l2tp.h
index 5daa48e..85ddb74 100644
--- a/include/uapi/linux/l2tp.h
+++ b/include/uapi/linux/l2tp.h
@@ -108,7 +108,7 @@ enum {
L2TP_ATTR_VLAN_ID,  /* u16 */
L2TP_ATTR_COOKIE,   /* 0, 4 or 8 bytes */
L2TP_ATTR_PEER_COOKIE,  /* 0, 4 or 8 bytes */
-   L2TP_ATTR_DEBUG,/* u32 */
+   L2TP_ATTR_DEBUG,/* u32, enum l2tp_debug_flags */
L2TP_ATTR_RECV_SEQ, /* u8 */
L2TP_ATTR_SEND_SEQ, /* u8 */
L2TP_ATTR_LNS_MODE, /* u8 */
@@ -175,6 +175,21 @@ enum l2tp_seqmode {
L2TP_SEQ_ALL = 2,
 };
 
+/**
+ * enum l2tp_debug_flags - debug message categories for L2TP tunnels/sessions
+ *
+ * @L2TP_MSG_DEBUG: verbose debug (if compiled in)
+ * @L2TP_MSG_CONTROL: userspace - kernel interface
+ * @L2TP_MSG_SEQ: sequence numbers
+ * @L2TP_MSG_DATA: data packets
+ */
+enum l2tp_debug_flags {
+   L2TP_MSG_DEBUG  = (1 << 0),
+   L2TP_MSG_CONTROL= (1 << 1),
+   L2TP_MSG_SEQ= (1 << 2),
+   L2TP_MSG_DATA   = (1 << 3),
+};
+
 /*
  * NETLINK_GENERIC related info
  */
diff --git a/net/l2tp/l2tp_core.h b/net/l2tp/l2tp_core.h
index 2599af6..8f560f7 100644
--- a/net/l2tp/l2tp_core.h
+++ b/net/l2tp/l2tp_core.h
@@ -23,16 +23,6 @@
 #define L2TP_HASH_BITS_2   8
 #define L2TP_HASH_SIZE_2   (1 << L2TP_HASH_BITS_2)
 
-/* Debug message categories for the DEBUG socket option */
-enum {
-   L2TP_MSG_DEBUG  = (1 << 0), /* verbose debug (if
-* compiled in) */
-   L2TP_MSG_CONTROL= (1 << 1), /* userspace - kernel
-* interface */
-   L2TP_MSG_SEQ= (1 << 2), /* sequence numbers */
-   L2TP_MSG_DATA   = (1 << 3), /* data packets */
-};
-
 struct sk_buff;
 
 struct l2tp_stats {
-- 
2.10.2

[PATCH net-next 2/3] net: l2tp: deprecate PPPOL2TP_MSG_* in favour of L2TP_MSG_*

2016-12-10 Thread Asbjoern Sloth Toennesen

PPPOL2TP_MSG_* and L2TP_MSG_* are duplicates, and are being used
interchangeably in the kernel, so let's standardize on L2TP_MSG_*
internally, and keep PPPOL2TP_MSG_* defined in UAPI for compatibility.

Signed-off-by: Asbjoern Sloth Toennesen 
---
 Documentation/networking/l2tp.txt |  8 
 include/uapi/linux/if_pppol2tp.h  | 13 ++---
 2 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/Documentation/networking/l2tp.txt 
b/Documentation/networking/l2tp.txt
index 4650a00..9bc271c 100644
--- a/Documentation/networking/l2tp.txt
+++ b/Documentation/networking/l2tp.txt
@@ -177,10 +177,10 @@ setsockopt on the PPPoX socket to set a debug mask.
 
 The following debug mask bits are available:
 
-PPPOL2TP_MSG_DEBUGverbose debug (if compiled in)
-PPPOL2TP_MSG_CONTROL  userspace - kernel interface
-PPPOL2TP_MSG_SEQ  sequence numbers handling
-PPPOL2TP_MSG_DATA data packets
+L2TP_MSG_DEBUGverbose debug (if compiled in)
+L2TP_MSG_CONTROL  userspace - kernel interface
+L2TP_MSG_SEQ  sequence numbers handling
+L2TP_MSG_DATA data packets
 
 If enabled, files under a l2tp debugfs directory can be used to dump
 kernel state about L2TP tunnels and sessions. To access it, the
diff --git a/include/uapi/linux/if_pppol2tp.h b/include/uapi/linux/if_pppol2tp.h
index 4bd1f55..6418c4d 100644
--- a/include/uapi/linux/if_pppol2tp.h
+++ b/include/uapi/linux/if_pppol2tp.h
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* Structure used to connect() the socket to a particular tunnel UDP
  * socket over IPv4.
@@ -90,14 +91,12 @@ enum {
PPPOL2TP_SO_REORDERTO   = 5,
 };
 
-/* Debug message categories for the DEBUG socket option */
+/* Debug message categories for the DEBUG socket option (deprecated) */
 enum {
-   PPPOL2TP_MSG_DEBUG  = (1 << 0), /* verbose debug (if
-* compiled in) */
-   PPPOL2TP_MSG_CONTROL= (1 << 1), /* userspace - kernel
-* interface */
-   PPPOL2TP_MSG_SEQ= (1 << 2), /* sequence numbers */
-   PPPOL2TP_MSG_DATA   = (1 << 3), /* data packets */
+   PPPOL2TP_MSG_DEBUG  = L2TP_MSG_DEBUG,
+   PPPOL2TP_MSG_CONTROL= L2TP_MSG_CONTROL,
+   PPPOL2TP_MSG_SEQ= L2TP_MSG_SEQ,
+   PPPOL2TP_MSG_DATA   = L2TP_MSG_DATA,
 };
 
 
-- 
2.10.2
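
A short userspace sketch of what the two patches above give callers. The enum values are copied from the diffs; `demo_debug_enabled()` is a hypothetical helper added only for this example. Because the deprecated PPPOL2TP_MSG_* names are defined in terms of L2TP_MSG_*, a mask built with either set of names tests the same bits.

```c
#include <assert.h>

/* Values as defined in include/uapi/linux/l2tp.h by patch 1/3. */
enum l2tp_debug_flags {
	L2TP_MSG_DEBUG		= (1 << 0),
	L2TP_MSG_CONTROL	= (1 << 1),
	L2TP_MSG_SEQ		= (1 << 2),
	L2TP_MSG_DATA		= (1 << 3),
};

/* Deprecated aliases as defined by patch 2/3. */
enum {
	PPPOL2TP_MSG_DEBUG	= L2TP_MSG_DEBUG,
	PPPOL2TP_MSG_CONTROL	= L2TP_MSG_CONTROL,
	PPPOL2TP_MSG_SEQ	= L2TP_MSG_SEQ,
	PPPOL2TP_MSG_DATA	= L2TP_MSG_DATA,
};

/* Hypothetical helper: nonzero when a message category is set in the mask. */
static int demo_debug_enabled(unsigned int mask, unsigned int category)
{
	return (mask & category) != 0;
}
```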

Re: [iproute2 net-next 1/8] lib bpf: Add support for BPF_PROG_ATTACH and BPF_PROG_DETACH

2016-12-10 Thread Daniel Borkmann


On 12/10/2016 11:15 PM, David Ahern wrote:
> On 12/10/16 2:21 PM, Daniel Borkmann wrote:
>>>
>>> Please name it bpf_prog_create() then, it would be consistent to
>>> bpf_map_create() and shorter as well.
>>
>> Sorry, lack of coffee, scratch that.
>>
>> Can't the current bpf_prog_attach() stay as is, and you name the above new
>> functions bpf_prog_attach_fd() and bpf_prog_detach_fd()? I think that would
>> be better.
>
> ok. no concerns about consistency with libbpf in the kernel repo?
>
> Seems like making iproute2 and the kernel version the same will allow samples
> and code to move between them much easier.

I think the lib/bpf.c code is quite different anyway, so I don't think it's
much of a concern or even requirement to look exactly the same as the samples
code (it was also never designed with such requirement). But besides that,
it's also trivial enough from reading the code due to the BPF_PROG_ATTACH
and BPF_PROG_DETACH anyway.

Re: [PATCH] net: nicvf: use new api ethtool_{get|set}_link_ksettings

2016-12-10 Thread David Miller

From: Philippe Reynes 
Date: Sat, 10 Dec 2016 15:00:48 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> Signed-off-by: Philippe Reynes 

Applied.

Re: [PATCH 1/5] net: ethernet: ti: cpsw: improve re-split policy

2016-12-10 Thread David Miller

From: Ivan Khoronzhuk 
Date: Sat, 10 Dec 2016 14:23:45 +0200

> These patches add several simplifications and improvements to set
> maximum rate for channels, taking into account switch and dual emac mode.
> 
> Don't re-split res in the following cases:
> - speed of phys is not changed
> - speed of phys is changed and no rate limited channels
> - speed of phys is changed and all channels are rate limited
> - phy is unlinked while dev is open
> - phy is linked back but speed is not changed
> 
> The maximum speed is the sum of the "linked" phys, so res are split taking
> both interfaces into account, in dual emac mode as well as in
> switch mode.
> 
> Tested on am572x
> 
> Based on net-next/master

Applied.

Re: [PATCH 3/5] net: ethernet: ti: cpsw: combine budget and weight split and check

2016-12-10 Thread David Miller

From: Ivan Khoronzhuk 
Date: Sat, 10 Dec 2016 14:23:48 +0200

> Re-split weight along with budget. It simplifies the code a little
> and updates state after every rate change. Also, it's necessary
> to move argument checks to this combined function. Replace
> maximum rate check for an interface on maximum possible rate.
> 
> Signed-off-by: Ivan Khoronzhuk 

Applied.

Re: [PATCH 4/5] net: ethernet: ti: cpsw: re-split res only when speed is changed

2016-12-10 Thread David Miller

From: Ivan Khoronzhuk 
Date: Sat, 10 Dec 2016 14:23:49 +0200

> Don't re-split res in the following cases:
> - speed of phys is not changed
> - speed of phys is changed and no rate limited channels
> - speed of phys is changed and all channels are rate limited
> - phy is unlinked while dev is open
> - phy is linked back but speed is not changed
> 
> The maximum speed is the sum of the "linked" phys, so res are split taking
> both interfaces into account, in dual emac mode as well as in
> switch mode.
> 
> Signed-off-by: Ivan Khoronzhuk 

Applied.

Re: [PATCH 5/5] net: ethernet: ti: cpsw: sync rates for channels in dual emac mode

2016-12-10 Thread David Miller

From: Ivan Khoronzhuk 
Date: Sat, 10 Dec 2016 14:23:50 +0200

> The channels are common for both ndevs in dual emac mode. Hence, keep
> in sync their rates.
> 
> Signed-off-by: Ivan Khoronzhuk 

Applied.

Re: [PATCH 2/5] net: ethernet: ti: cpsw: don't start queue twice

2016-12-10 Thread David Miller

From: Ivan Khoronzhuk 
Date: Sat, 10 Dec 2016 14:23:47 +0200

> No need to start queues after cpsw is started, as it will be done
> in cpsw_adjust_link(), after phy connection.
> 
> Signed-off-by: Ivan Khoronzhuk 

Applied.

Re: [PATCH net-next] net: mvneta: select GENERIC_ALLOCATOR

2016-12-10 Thread David Miller

From: Arnd Bergmann 
Date: Sat, 10 Dec 2016 11:38:32 +0100

> We previously relied on GENERIC_ALLOCATOR to be selected by CONFIG_ARM,
> but now we can compile-test the driver on other architectures that
> don't select it:
> 
> drivers/net/built-in.o: In function `mvneta_bm_remove':
> mvneta_bm.c:(.text+0x4ee35): undefined reference to `gen_pool_free'
> 
> This adds an explicit select for the part of the driver that has
> the dependency.
> 
> Fixes: a0627f776a45 ("net: marvell: Allow drivers to be built with 
> COMPILE_TEST")
> Signed-off-by: Arnd Bergmann 

Applied.

Re: [PATCH] net: socket: removed an unnecessary newline

2016-12-10 Thread David Miller

From: kushwah...@samsung.com
Date: Sat, 10 Dec 2016 11:14:47 +0530

> From: Amit Kushwaha 
> 
> This patch removes a newline which was added
> in socket.c file in net-next
> 
> Signed-off-by: Amit Kushwaha 

Applied.

Re: [Patch net-next] netlink: use blocking notifier

2016-12-10 Thread David Miller

From: Cong Wang 
Date: Fri,  9 Dec 2016 21:10:59 -0800

> netlink_chain is called in ->release(), which is apparently
> a process context, so we don't have to use an atomic notifier
> here.
> 
> Signed-off-by: Cong Wang 

Applied.

[Patch net] e1000: use disable_hardirq() for e1000_netpoll()

2016-12-10 Thread Cong Wang

In commit 02cea3958664 ("genirq: Provide disable_hardirq()")
Peter introduced disable_hardirq() for netpoll, but it is forgotten
to use it for e1000.

This patch changes disable_irq() to disable_hardirq() for e1000.

Reported-by: Dave Jones 
Suggested-by: Sabrina Dubroca 
Cc: Peter Zijlstra (Intel) 
Cc: Jeff Kirsher 
Signed-off-by: Cong Wang 
---
 drivers/net/ethernet/intel/e1000/e1000_main.c | 4 ++--
 drivers/net/ethernet/intel/e1000e/netdev.c| 8 
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c 
b/drivers/net/ethernet/intel/e1000/e1000_main.c
index f42129d..164c3bb 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -5257,8 +5257,8 @@ static void e1000_netpoll(struct net_device *netdev)
 {
struct e1000_adapter *adapter = netdev_priv(netdev);
 
-   disable_irq(adapter->pdev->irq);
-   e1000_intr(adapter->pdev->irq, netdev);
+   if (disable_hardirq(adapter->pdev->irq))
+   e1000_intr(adapter->pdev->irq, netdev);
enable_irq(adapter->pdev->irq);
 }
 #endif
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index 7017281..9a0be77 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -6762,13 +6762,13 @@ static void e1000_netpoll(struct net_device *netdev)
e1000_intr_msix(adapter->pdev->irq, netdev);
break;
case E1000E_INT_MODE_MSI:
-   disable_irq(adapter->pdev->irq);
-   e1000_intr_msi(adapter->pdev->irq, netdev);
+   if (disable_hardirq(adapter->pdev->irq))
+   e1000_intr_msi(adapter->pdev->irq, netdev);
enable_irq(adapter->pdev->irq);
break;
default:/* E1000E_INT_MODE_LEGACY */
-   disable_irq(adapter->pdev->irq);
-   e1000_intr(adapter->pdev->irq, netdev);
+   if (disable_hardirq(adapter->pdev->irq))
+   e1000_intr(adapter->pdev->irq, netdev);
enable_irq(adapter->pdev->irq);
break;
}
-- 
2.5.5

Re: Misalignment, MIPS, and ip_hdr(skb)->version

2016-12-10 Thread Dan Lüdtke

> On 8 Dec 2016, at 05:34, Daniel Kahn Gillmor  wrote:
> 
> On Wed 2016-12-07 19:30:34 -0500, Hannes Frederic Sowa wrote:
>> Your custom protocol should be designed in a way you get an aligned ip
>> header. Most protocols of the IETF follow this mantra and it is always
>> possible to e.g. pad options so you end up on aligned boundaries for the
>> next header.
> 
> fwiw, i'm not convinced that "most protocols of the IETF follow this
> mantra".  we've had multiple discussions in different protocol groups
> about shaving or bloating by a few bytes here or there in different
> protocols, and i don't think anyone has brought up memory alignment as
> an argument in any of the discussions i've followed.
> 

If the trade-off is between 1 padding byte and 2 byte alignment versus 3 
padding bytes and 4 byte alignment I would definitely opt for 3 padding bytes. 
I know how that waste feels like to a protocol designer, but I think it is 
worth it. Maybe the padding/reserved will be useful some day for an additional 
feature.

I remember alignment being discussed and taken very seriously in 6man a couple 
of times. Often, though, protocol designers did align without much discussion. 
Implementing unaligned protocols is a pain I've experienced first hand.
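
The trade-off mentioned above (1 padding byte for 2-byte alignment versus 3 padding bytes for 4-byte alignment) comes down to a small calculation. The helper below is a hypothetical sketch, not code from any protocol implementation; it assumes the alignment is a power of two.

```c
#include <assert.h>
#include <stddef.h>

/* Number of padding bytes needed after `offset` bytes so that the next
 * header starts on an `align`-byte boundary. `align` must be a power of
 * two. */
static size_t demo_pad_len(size_t offset, size_t align)
{
	return (align - (offset & (align - 1))) & (align - 1);
}
```

For a preceding header that ends at byte offset 13, `demo_pad_len(13, 2)` yields 1 and `demo_pad_len(13, 4)` yields 3, which is exactly the 1-versus-3 byte choice discussed in the thread.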

Re: [iproute2 net-next 1/8] lib bpf: Add support for BPF_PROG_ATTACH and BPF_PROG_DETACH

2016-12-10 Thread David Ahern

On 12/10/16 2:21 PM, Daniel Borkmann wrote:
>>
>> Please name it bpf_prog_create() then, it would be consistent to
>> bpf_map_create() and shorter as well.
> 
> Sorry, lack of coffee, scratch that.
> 
> Can't the current bpf_prog_attach() stay as is, and you name the above new
> functions bpf_prog_attach_fd() and bpf_prog_detach_fd()? I think that would
> be better.

ok. no concerns about consistency with libbpf in the kernel repo?

Seems like making iproute2 and the kernel version the same will allow samples 
and code to move between them much easier.

Re: [iproute2 net-next 3/8] Add libbpf.h header with BPF_ macros

2016-12-10 Thread Daniel Borkmann


On 12/10/2016 09:32 PM, David Ahern wrote:
> Based on version in kernel repo, samples/bpf/libbpf.h
>
> Signed-off-by: David Ahern
> ---
>   include/libbpf.h | 184 +++
>   1 file changed, 184 insertions(+)
>   create mode 100644 include/libbpf.h
>
> diff --git a/include/libbpf.h b/include/libbpf.h
> new file mode 100644
> index ..37951f509a10
> --- /dev/null
> +++ b/include/libbpf.h
> @@ -0,0 +1,184 @@
> +/* eBPF mini library */
> +#ifndef __LIBBPF_H
> +#define __LIBBPF_H

Creating include/libbpf.h is a bit confusing, since all the function
declarations of the current bpf lib code are located in include/bpf_util.h.
Please add all this there as well instead of creating a new file.

Re: [PATCH] sh_eth: add wake-on-lan support via magic packet

2016-12-10 Thread Sergei Shtylyov


Hello!

On 12/08/2016 05:56 PM, Niklas Söderlund wrote:


  You only enable the WOL support for the R-Car gen2 chips but never say that
explicitly, neither in the subject nor here.


Signed-off-by: Niklas Söderlund 
---
 drivers/net/ethernet/renesas/sh_eth.c | 120 +++---
 drivers/net/ethernet/renesas/sh_eth.h |   4 ++
 2 files changed, 116 insertions(+), 8 deletions(-)



diff --git a/drivers/net/ethernet/renesas/sh_eth.c 
b/drivers/net/ethernet/renesas/sh_eth.c
index 05b0dc5..3974046 100644
--- a/drivers/net/ethernet/renesas/sh_eth.c
+++ b/drivers/net/ethernet/renesas/sh_eth.c

[...]

@@ -1657,6 +1658,10 @@ static irqreturn_t sh_eth_interrupt(int irq, void 
*netdev)
goto out;

if (!likely(mdp->irq_enabled)) {


   Oops, I guess unlikely(!mdp->irq_enabled) was meant here...


I can correct this in a separate patch if you wish.


   I'll look into this myself, I think.


+   /* Handle MagicPacket interrupt */
+   if (sh_eth_read(ndev, ECSR) & ECSR_MPD)


   What if it wasn't enabled ATM?

[...]

@@ -3111,6 +3150,10 @@ static int sh_eth_drv_probe(struct platform_device *pdev)
if (ret)
goto out_napi_del;

+   mdp->wol_enabled = false;


   No need, the '*mdp' was kzalloc'ed.


OK, i prefer to explicitly set for easier reading of the code. But if
you wish I will remove this in v2.


   Yes, remove it please.


@@ -3150,15 +3193,71 @@ static int sh_eth_drv_remove(struct platform_device 
*pdev)

 #ifdef CONFIG_PM
 #ifdef CONFIG_PM_SLEEP
+static int sh_eth_wol_setup(struct net_device *ndev)
+{
+   struct sh_eth_private *mdp = netdev_priv(ndev);
+
+   /* Only allow ECI interrupts */
+   mdp->irq_enabled = false;


   Why 'false' if you enable IRQs below?


I mask all interrupts except MagicPacket (ECSIPR_MPDIP) interrupts from
the ECI (DMAC_M_ECI) and by setting irq_enabled to false the interrupt
handler will only ack any residue interrupt.


   I don't see where it ack's anything, it just clears EESIPR and returns in 
this case.



This is how it's done in
other parts of the driver when disabling interrupts.


   Not in all parts of the driver that disable EESIPR interrupts... I must 
confess that I never liked that 'mdp->irq_enabled' flag and still suspect we 
can get things done without it... I need to look at this code again, sigh...



This is also why I only check for MagicPacket interrupts if irq_enabled
is false.


  I would have preferred that this was done with the other EMAC interrupts, 
in sh_eth_error().



+   synchronize_irq(ndev->irq);
+   napi_disable(>napi);
+   sh_eth_write(ndev, DMAC_M_ECI, EESIPR);
+
+   /* Enable ECI MagicPacket interrupt */
+   sh_eth_write(ndev, ECSIPR_MPDIP, ECSIPR);


   I'd prefer if it was always enabled via 'ecsipr_value'.


+
+   /* Enable MagicPacket */
+   sh_eth_modify(ndev, ECMR, 0, ECMR_PMDE);
+
+   /* Increased clock usage so device won't be suspended */
+   clk_enable(mdp->clk);


   Hum, intermixing runtime PM with clock API doesn't look good...


I agree it looks weird but I need a way to increment the usage count for
the clock otherwise the PM code will disable the module clock and WoL
will not work.


   How will it do it if you don't call sh_eth_close() in this case?


Note that this call will not enable the clock just
increase the usage count so it won't be disabled when the PM code
decrease it after the sh_eth suspend function is run.


   You mean that the PM code calls RPM or clk API on its own? That's strange...


If you know of a different way of ensuring that the clock is not turned
off I'd be happy to look at it. I did some investigation into this and
calling clk_enable() directly is for example what happens in the
enable_irq_wake() call path to ensure the clock for the irq_chip is not
turned off if it is a wakeup source, see for example
gpio_rcar_irq_set_wake() in drivers/gpio/gpio-rcar.c.


   Thanks, will look into it...

[...]

MBR, Sergei
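
On the `!likely(mdp->irq_enabled)` remark in the review above: a small userspace sketch, assuming GCC or Clang for `__builtin_expect`. The `demo_` macros are simplified stand-ins for the kernel's likely()/unlikely(), and both functions are hypothetical. The two spellings evaluate to the same truth value; the second one simply states the intent (the disabled case is the rare one) directly.

```c
#include <assert.h>

/* Simplified stand-ins for the kernel's branch hints (GCC/Clang builtin). */
#define demo_likely(x)   __builtin_expect(!!(x), 1)
#define demo_unlikely(x) __builtin_expect(!!(x), 0)

/* The condition as written in the driver: hint on the flag, then negate. */
static int demo_as_written(int irq_enabled)
{
	return !demo_likely(irq_enabled);
}

/* The condition as the reviewer suggests: hint on the negated flag. */
static int demo_as_suggested(int irq_enabled)
{
	return demo_unlikely(!irq_enabled);
}
```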

Re: [iproute2 net-next 2/8] bpf: export bpf_prog_load

2016-12-10 Thread Daniel Borkmann


On 12/10/2016 09:32 PM, David Ahern wrote:
> Code move only; no functional change intended.
>
> Signed-off-by: David Ahern
> ---
>   include/bpf_util.h |  3 +++
>   lib/bpf.c  | 40 
>   2 files changed, 23 insertions(+), 20 deletions(-)
>
> diff --git a/include/bpf_util.h b/include/bpf_util.h
> index 49b96bbc208f..dcbdca6978d6 100644
> --- a/include/bpf_util.h
> +++ b/include/bpf_util.h
> @@ -75,6 +75,9 @@ int bpf_trace_pipe(void);
>
>   void bpf_print_ops(FILE *f, struct rtattr *bpf_ops, __u16 len);
>
> +int bpf_prog_load(enum bpf_prog_type type, const struct bpf_insn *insns,
> + size_t size_insns, const char *license, char *log,
> + size_t size_log);

Just a really minor nit: please add a newline here.

>   int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type);
>   int bpf_prog_detach(int target_fd, enum bpf_attach_type type);

Re: [iproute2 net-next 1/8] lib bpf: Add support for BPF_PROG_ATTACH and BPF_PROG_DETACH

2016-12-10 Thread Daniel Borkmann


On 12/10/2016 10:16 PM, Daniel Borkmann wrote:

On 12/10/2016 09:32 PM, David Ahern wrote:

For consistency with other bpf commands, the functions are named
bpf_prog_attach and bpf_prog_detach. The existing bpf_prog_attach is
renamed to bpf_prog_load_and_report since it calls bpf_prog_load and
bpf_prog_report.

Signed-off-by: David Ahern 
---
  include/bpf_util.h |  3 +++
  lib/bpf.c  | 31 ++-
  2 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/include/bpf_util.h b/include/bpf_util.h
index 05baeecda57f..49b96bbc208f 100644
--- a/include/bpf_util.h
+++ b/include/bpf_util.h
@@ -75,6 +75,9 @@ int bpf_trace_pipe(void);

  void bpf_print_ops(FILE *f, struct rtattr *bpf_ops, __u16 len);

+int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type);
+int bpf_prog_detach(int target_fd, enum bpf_attach_type type);
+
  #ifdef HAVE_ELF
  int bpf_send_map_fds(const char *path, const char *obj);
  int bpf_recv_map_fds(const char *path, int *fds, struct bpf_map_aux *aux,
diff --git a/lib/bpf.c b/lib/bpf.c
index 2a8cd51d4dae..103fc1ef0593 100644
--- a/lib/bpf.c
+++ b/lib/bpf.c
@@ -850,6 +850,27 @@ int bpf_graft_map(const char *map_path, uint32_t *key, int 
argc, char **argv)
  return ret;
  }

+int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
+{
+union bpf_attr attr = {
+.target_fd = target_fd,
+.attach_bpf_fd = prog_fd,
+.attach_type = type,
+};


Please make this consistent with the other bpf(2) cmds we
have in the current lib code. There were some gcc issues in
the past, see:

https://git.kernel.org/cgit/linux/kernel/git/shemminger/iproute2.git/commit/?id=67584e3ab289a22eb9a2e51f90d23e2ced2e76b0

F.e. bpf_map_create() currently looks like:

 union bpf_attr attr = {};

 attr.map_type = type;
 attr.key_size = size_key;
 attr.value_size = size_value;
 attr.max_entries = max_elem;
 attr.map_flags = flags;


+return bpf(BPF_PROG_ATTACH, &attr, sizeof(attr));
+}
+
+int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
+{
+union bpf_attr attr = {
+.target_fd = target_fd,
+.attach_type = type,
+};


Ditto.


+return bpf(BPF_PROG_DETACH, &attr, sizeof(attr));
+}
+
  #ifdef HAVE_ELF
  struct bpf_elf_prog {
  enum bpf_prog_type type;
@@ -1262,9 +1283,9 @@ static void bpf_prog_report(int fd, const char *section,
  bpf_dump_error(ctx, "Verifier analysis:\n\n");
  }

-static int bpf_prog_attach(const char *section,
-   const struct bpf_elf_prog *prog,
-   struct bpf_elf_ctx *ctx)
+static int bpf_prog_load_and_report(const char *section,
+const struct bpf_elf_prog *prog,
+struct bpf_elf_ctx *ctx)
  {


Please name it bpf_prog_create() then, it would be consistent with
bpf_map_create() and shorter as well.


Sorry, lack of coffee, scratch that.

Can't the current bpf_prog_attach() stay as is, and you name the above new
functions bpf_prog_attach_fd() and bpf_prog_detach_fd()? I think that would
be better.


  int tries = 0, fd;
  retry:
@@ -1656,7 +1677,7 @@ static int bpf_fetch_prog(struct bpf_elf_ctx *ctx, const 
char *section,
  prog.size= data.sec_data->d_size;
  prog.license = ctx->license;

-fd = bpf_prog_attach(section, &prog, ctx);
+fd = bpf_prog_load_and_report(section, &prog, ctx);
  if (fd < 0)
  return fd;

@@ -1755,7 +1776,7 @@ static int bpf_fetch_prog_relo(struct bpf_elf_ctx *ctx, 
const char *section,
  prog.size= data_insn.sec_data->d_size;
  prog.license = ctx->license;

-fd = bpf_prog_attach(section, &prog, ctx);
+fd = bpf_prog_load_and_report(section, &prog, ctx);
  if (fd < 0) {
  *lderr = true;
  return fd;

Re: [iproute2 net-next 1/8] lib bpf: Add support for BPF_PROG_ATTACH and BPF_PROG_DETACH

2016-12-10 Thread Daniel Borkmann


On 12/10/2016 09:32 PM, David Ahern wrote:

For consistency with other bpf commands, the functions are named
bpf_prog_attach and bpf_prog_detach. The existing bpf_prog_attach is
renamed to bpf_prog_load_and_report since it calls bpf_prog_load and
bpf_prog_report.

Signed-off-by: David Ahern 
---
  include/bpf_util.h |  3 +++
  lib/bpf.c  | 31 ++-
  2 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/include/bpf_util.h b/include/bpf_util.h
index 05baeecda57f..49b96bbc208f 100644
--- a/include/bpf_util.h
+++ b/include/bpf_util.h
@@ -75,6 +75,9 @@ int bpf_trace_pipe(void);

  void bpf_print_ops(FILE *f, struct rtattr *bpf_ops, __u16 len);

+int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type);
+int bpf_prog_detach(int target_fd, enum bpf_attach_type type);
+
  #ifdef HAVE_ELF
  int bpf_send_map_fds(const char *path, const char *obj);
  int bpf_recv_map_fds(const char *path, int *fds, struct bpf_map_aux *aux,
diff --git a/lib/bpf.c b/lib/bpf.c
index 2a8cd51d4dae..103fc1ef0593 100644
--- a/lib/bpf.c
+++ b/lib/bpf.c
@@ -850,6 +850,27 @@ int bpf_graft_map(const char *map_path, uint32_t *key, int 
argc, char **argv)
return ret;
  }

+int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
+{
+   union bpf_attr attr = {
+   .target_fd = target_fd,
+   .attach_bpf_fd = prog_fd,
+   .attach_type = type,
+   };


Please make this consistent with the other bpf(2) cmds we
have in the current lib code. There were some gcc issues in
the past, see:

https://git.kernel.org/cgit/linux/kernel/git/shemminger/iproute2.git/commit/?id=67584e3ab289a22eb9a2e51f90d23e2ced2e76b0

F.e. bpf_map_create() currently looks like:

union bpf_attr attr = {};

attr.map_type = type;
attr.key_size = size_key;
attr.value_size = size_value;
attr.max_entries = max_elem;
attr.map_flags = flags;
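
The pattern asked for here (zero the whole union, then assign only the
fields in use) can be sketched outside the library as follows. This is a
minimal illustration, not the lib/bpf.c code: union demo_attr and
demo_attr_fill_attach() are hypothetical stand-ins for union bpf_attr and
the attach helper, and memset() is used in place of "= {}" purely to keep
the sketch portable.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for union bpf_attr: two anonymous structs
 * that overlay the same storage. */
union demo_attr {
	struct {
		uint32_t map_type;
		uint32_t key_size;
		uint32_t value_size;
		uint32_t max_entries;
	};
	struct {
		uint32_t target_fd;
		uint32_t attach_bpf_fd;
		uint32_t attach_type;
	};
};

/* Zero every byte of the union first, then set only the fields the
 * attach command uses. Bytes belonging to other members stay zero. */
static void demo_attr_fill_attach(union demo_attr *attr, int prog_fd,
				  int target_fd, uint32_t type)
{
	memset(attr, 0, sizeof(*attr));
	attr->target_fd = (uint32_t)target_fd;
	attr->attach_bpf_fd = (uint32_t)prog_fd;
	attr->attach_type = type;
}
```

The point of the ordering is that no byte of the union is left
uninitialized, whichever member the caller ends up filling in.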


+   return bpf(BPF_PROG_ATTACH, &attr, sizeof(attr));
+}
+
+int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
+{
+   union bpf_attr attr = {
+   .target_fd = target_fd,
+   .attach_type = type,
+   };


Ditto.


+   return bpf(BPF_PROG_DETACH, &attr, sizeof(attr));
+}
+
  #ifdef HAVE_ELF
  struct bpf_elf_prog {
enum bpf_prog_type  type;
@@ -1262,9 +1283,9 @@ static void bpf_prog_report(int fd, const char *section,
bpf_dump_error(ctx, "Verifier analysis:\n\n");
  }

-static int bpf_prog_attach(const char *section,
-  const struct bpf_elf_prog *prog,
-  struct bpf_elf_ctx *ctx)
+static int bpf_prog_load_and_report(const char *section,
+   const struct bpf_elf_prog *prog,
+   struct bpf_elf_ctx *ctx)
  {


Please name it bpf_prog_create() then, it would be consistent with
bpf_map_create() and shorter as well.


int tries = 0, fd;
  retry:
@@ -1656,7 +1677,7 @@ static int bpf_fetch_prog(struct bpf_elf_ctx *ctx, const 
char *section,
prog.size= data.sec_data->d_size;
prog.license = ctx->license;

-   fd = bpf_prog_attach(section, &prog, ctx);
+   fd = bpf_prog_load_and_report(section, &prog, ctx);
if (fd < 0)
return fd;

@@ -1755,7 +1776,7 @@ static int bpf_fetch_prog_relo(struct bpf_elf_ctx *ctx, 
const char *section,
prog.size= data_insn.sec_data->d_size;
prog.license = ctx->license;

-   fd = bpf_prog_attach(section, &prog, ctx);
+   fd = bpf_prog_load_and_report(section, &prog, ctx);
if (fd < 0) {
*lderr = true;
return fd;

Re: [PATCH net v3] ibmveth: set correct gso_size and gso_type

2016-12-10 Thread David Miller

From: Thomas Falcon 
Date: Sat, 10 Dec 2016 12:39:48 -0600

> v3: include a check for non-zero mss when calculating gso_segs
> 
> v2: calculate gso_segs after Eric Dumazet's comments on the earlier patch
> and make sure everyone is included on CC

I already applied v1 which made it all the way even to Linus's
tree.  So you'll have to send me relative fixups if there are
things to fix or change since v1.

You must always generate patches against the current 'net' tree.

Re: Misalignment, MIPS, and ip_hdr(skb)->version

2016-12-10 Thread Felix Fietkau

On 2016-12-10 21:32, Måns Rullgård wrote:
> Felix Fietkau  writes:
> 
>> On 2016-12-10 14:25, Måns Rullgård wrote:
>>> Felix Fietkau  writes:
>>> 
 On 2016-12-07 19:54, Jason A. Donenfeld wrote:
> On Wed, Dec 7, 2016 at 7:51 PM, David Miller  wrote:
>> It's so much better to analyze properly where the misalignment comes from
>> and address it at the source, as we have for various cases that trip up
>> Sparc too.
> 
> That's sort of my attitude too, hence starting this thread. Any
> pointers you have about this would be most welcome, so as not to
> perpetuate what already seems like an issue in other parts of the
> stack.
 Hi Jason,

 I'm the author of that hackish LEDE/OpenWrt patch that works around the
 misalignment issues. Here's some context regarding that patch:

 I intentionally put it in the target specific patches for only one of
 our MIPS targets. There are a few ar71xx devices where the misalignment
 cannot be fixed, because the Ethernet MAC has a 4-byte DMA alignment
 requirement, and does not support inserting 2 bytes of padding to
 correct the IP header misalignment.

 With these limitations the choice was between this ugly network stack
 patch or inserting a very expensive memmove in the data path (which is
 better than taking the mis-alignment traps, but still hurts routing
 performance significantly).
>>> 
>>> I solved this problem in an Ethernet driver by copying the initial part
>>> of the packet to an aligned skb and appending the remainder using
>>> skb_add_rx_frag().  The kernel network stack only cares about the
>>> headers, so the alignment of the packet payload doesn't matter.
>>
>> I considered that as well, but it's bad for routing performance if the
>> ethernet MAC does not support scatter/gather for xmit.
>> Unfortunately that limitation is quite common on embedded hardware.
> 
> Yes, I can see that being an issue.  However, if you're doing zero-copy
> routing, the header part of the original buffer should still be there,
> unused, so you could presumably copy the header of the outgoing packet
> there and then do dma as usual.  Maybe there's something in the network
> stack that makes this impossible though.
That still puts more pressure on the ridiculously small dcache sizes
that are typical for embedded MIPS routers.

- Felix
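
For readers following the thread, the two ideas discussed above (reading a
field without assuming alignment, and copying only the header part of a
packet into aligned storage) can be sketched in plain userspace C. This is
an illustration under stated assumptions, not kernel or driver code:
load_u32_unaligned() and copy_headers_aligned() are hypothetical helper
names, and no skb handling is shown.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load a 32-bit value from any address.
 * memcpy() leaves the choice of access width to the compiler, so
 * this is safe on strict-alignment CPUs, unlike a cast to
 * uint32_t * followed by a dereference. */
static uint32_t load_u32_unaligned(const void *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

/* Hypothetical helper: copy at most hdr_len bytes of headers from a
 * possibly misaligned packet into an aligned destination buffer.
 * The payload is left where it is. Returns the bytes copied. */
static size_t copy_headers_aligned(void *dst, size_t dst_len,
				   const unsigned char *pkt,
				   size_t pkt_len, size_t hdr_len)
{
	size_t n = hdr_len;

	if (n > pkt_len)
		n = pkt_len;
	if (n > dst_len)
		n = dst_len;
	memcpy(dst, pkt, n);
	return n;
}
```

The second helper is the cheaper variant of the memmove approach: only
the headers move, which is the part the stack inspects.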

[iproute2 net-next 8/8] Introduce ip vrf command

2016-12-10 Thread David Ahern

'ip vrf' follows the user semantics established by 'ip netns'.

The 'ip vrf' subcommand supports 3 usages:

1. Run a command against a given vrf:
   ip vrf exec NAME CMD

   Uses the recently committed cgroup/sock BPF option. A vrf directory
   is added to the cgroup2 mount, and individual vrfs are created under
   it. A BPF filter is attached to the vrf/NAME cgroup2 to set
   sk_bound_dev_if to the VRF device index. From there the current
   process (ip's pid) is added to the cgroup.procs file and the given
   command is executed. In doing so all AF_INET/AF_INET6 (ipv4/ipv6)
   sockets are automatically bound to the VRF domain.

   The association is inherited parent to child allowing the command to
   be a shell from which other commands are run relative to the VRF.

2. Show the VRF a process is bound to:
   ip vrf id
   This command essentially looks at /proc/pid/cgroup for a "::/vrf/"
   entry with the VRF name following.

3. Show process ids bound to a VRF
   ip vrf pids NAME
   This command dumps the file MNT/vrf/NAME/cgroup.procs since that file
   shows the process ids in the particular vrf cgroup.
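
The lookup described in usage 2 can be sketched as a small string search.
This is a minimal sketch, not the code from ip/ipvrf.c: the helper name
vrf_name_from_cgroup() is hypothetical, and it assumes the caller has
already read the contents of /proc/pid/cgroup into a buffer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: find the "::/vrf/" entry in the text of
 * /proc/<pid>/cgroup and copy the VRF name that follows it into
 * name. Returns 0 on success, -1 if there is no such entry or the
 * name does not fit in len bytes. */
static int vrf_name_from_cgroup(const char *text, char *name, size_t len)
{
	static const char marker[] = "::/vrf/";
	const char *p = strstr(text, marker);
	size_t n;

	if (!p || len == 0)
		return -1;

	p += sizeof(marker) - 1;
	/* The name ends at the next path separator or end of line. */
	n = strcspn(p, "/\n");
	if (n == 0 || n >= len)
		return -1;

	memcpy(name, p, n);
	name[n] = '\0';
	return 0;
}
```

Given the text "0::/vrf/red\n" the helper stores "red"; a process outside
any vrf cgroup yields -1.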

Signed-off-by: David Ahern 
---
 ip/Makefile   |   3 +-
 ip/ip.c   |   4 +-
 ip/ip_common.h|   2 +
 ip/ipvrf.c| 289 ++
 man/man8/ip-vrf.8 |  88 +
 5 files changed, 384 insertions(+), 2 deletions(-)
 create mode 100644 ip/ipvrf.c
 create mode 100644 man/man8/ip-vrf.8

diff --git a/ip/Makefile b/ip/Makefile
index c8e6c6172741..1928489e7f90 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -7,7 +7,8 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o 
ipnetns.o \
 iplink_vxlan.o tcp_metrics.o iplink_ipoib.o ipnetconf.o link_ip6tnl.o \
 link_iptnl.o link_gre6.o iplink_bond.o iplink_bond_slave.o iplink_hsr.o \
 iplink_bridge.o iplink_bridge_slave.o ipfou.o iplink_ipvlan.o \
-iplink_geneve.o iplink_vrf.o iproute_lwtunnel.o ipmacsec.o ipila.o
+iplink_geneve.o iplink_vrf.o iproute_lwtunnel.o ipmacsec.o ipila.o \
+ipvrf.o
 
 RTMONOBJ=rtmon.o
 
diff --git a/ip/ip.c b/ip/ip.c
index cb3adcb3f57d..07050b07592a 100644
--- a/ip/ip.c
+++ b/ip/ip.c
@@ -51,7 +51,8 @@ static void usage(void)
 "   ip [ -force ] -batch filename\n"
 "where  OBJECT := { link | address | addrlabel | route | rule | neigh | ntable 
|\n"
 "   tunnel | tuntap | maddress | mroute | mrule | monitor | 
xfrm |\n"
-"   netns | l2tp | fou | macsec | tcp_metrics | token | 
netconf | ila }\n"
+"   netns | l2tp | fou | macsec | tcp_metrics | token | 
netconf | ila |\n"
+"   vrf }\n"
 "   OPTIONS := { -V[ersion] | -s[tatistics] | -d[etails] | -r[esolve] |\n"
 "-h[uman-readable] | -iec |\n"
 "-f[amily] { inet | inet6 | ipx | dnet | mpls | bridge | 
link } |\n"
@@ -99,6 +100,7 @@ static const struct cmd {
{ "mrule",  do_multirule },
{ "netns",  do_netns },
{ "netconf",do_ipnetconf },
+   { "vrf",do_ipvrf},
{ "help",   do_help },
{ 0 }
 };
diff --git a/ip/ip_common.h b/ip/ip_common.h
index 3162f1ca5b2c..28763e81e4a4 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -57,6 +57,8 @@ extern int do_ipila(int argc, char **argv);
 int do_tcp_metrics(int argc, char **argv);
 int do_ipnetconf(int argc, char **argv);
 int do_iptoken(int argc, char **argv);
+int do_ipvrf(int argc, char **argv);
+
 int iplink_get(unsigned int flags, char *name, __u32 filt_mask);
 
 static inline int rtm_get_table(struct rtmsg *r, struct rtattr **tb)
diff --git a/ip/ipvrf.c b/ip/ipvrf.c
new file mode 100644
index ..c4f0e53532e2
--- /dev/null
+++ b/ip/ipvrf.c
@@ -0,0 +1,289 @@
+/*
+ * ipvrf.c "ip vrf"
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors: David Ahern
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "rt_names.h"
+#include "utils.h"
+#include "ip_common.h"
+#include "libbpf.h"
+#include "bpf_util.h"
+
+#define CGRP_PROC_FILE  "/cgroup.procs"
+
+static void usage(void)
+{
+   fprintf(stderr, "Usage: ip vrf exec [NAME] cmd ...\n");
+   fprintf(stderr, "   ip vrf identify [PID]\n");
+   fprintf(stderr, "   ip vrf pids [NAME]\n");
+
+   exit(-1);
+}
+
+static int ipvrf_identify(int argc, char **argv)
+{
+   char path[PATH_MAX];
+   char buf[4096];
+   char *vrf, *end;
+   int fd, rc = -1;
+   unsigned int pid;
+   ssize_t n;
+
+   if (argc < 1)
+   pid = getpid();
+   else

[iproute2 net-next 3/8] Add libbpf.h header with BPF_ macros

2016-12-10 Thread David Ahern

Based on version in kernel repo, samples/bpf/libbpf.h

Signed-off-by: David Ahern 
---
 include/libbpf.h | 184 +++
 1 file changed, 184 insertions(+)
 create mode 100644 include/libbpf.h

diff --git a/include/libbpf.h b/include/libbpf.h
new file mode 100644
index ..37951f509a10
--- /dev/null
+++ b/include/libbpf.h
@@ -0,0 +1,184 @@
+/* eBPF mini library */
+#ifndef __LIBBPF_H
+#define __LIBBPF_H
+
+/* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
+
+#define BPF_ALU64_REG(OP, DST, SRC)\
+   ((struct bpf_insn) {\
+   .code  = BPF_ALU64 | BPF_OP(OP) | BPF_X,\
+   .dst_reg = DST, \
+   .src_reg = SRC, \
+   .off   = 0, \
+   .imm   = 0 })
+
+#define BPF_ALU32_REG(OP, DST, SRC)\
+   ((struct bpf_insn) {\
+   .code  = BPF_ALU | BPF_OP(OP) | BPF_X,  \
+   .dst_reg = DST, \
+   .src_reg = SRC, \
+   .off   = 0, \
+   .imm   = 0 })
+
+/* ALU ops on immediates, bpf_add|sub|...: dst_reg += imm32 */
+
+#define BPF_ALU64_IMM(OP, DST, IMM)\
+   ((struct bpf_insn) {\
+   .code  = BPF_ALU64 | BPF_OP(OP) | BPF_K,\
+   .dst_reg = DST, \
+   .src_reg = 0,   \
+   .off   = 0, \
+   .imm   = IMM })
+
+#define BPF_ALU32_IMM(OP, DST, IMM)\
+   ((struct bpf_insn) {\
+   .code  = BPF_ALU | BPF_OP(OP) | BPF_K,  \
+   .dst_reg = DST, \
+   .src_reg = 0,   \
+   .off   = 0, \
+   .imm   = IMM })
+
+/* Short form of mov, dst_reg = src_reg */
+
+#define BPF_MOV64_REG(DST, SRC)\
+   ((struct bpf_insn) {\
+   .code  = BPF_ALU64 | BPF_MOV | BPF_X,   \
+   .dst_reg = DST, \
+   .src_reg = SRC, \
+   .off   = 0, \
+   .imm   = 0 })
+
+#define BPF_MOV32_REG(DST, SRC)\
+   ((struct bpf_insn) {\
+   .code  = BPF_ALU | BPF_MOV | BPF_X, \
+   .dst_reg = DST, \
+   .src_reg = SRC, \
+   .off   = 0, \
+   .imm   = 0 })
+
+/* Short form of mov, dst_reg = imm32 */
+
+#define BPF_MOV64_IMM(DST, IMM)\
+   ((struct bpf_insn) {\
+   .code  = BPF_ALU64 | BPF_MOV | BPF_K,   \
+   .dst_reg = DST, \
+   .src_reg = 0,   \
+   .off   = 0, \
+   .imm   = IMM })
+
+#define BPF_MOV32_IMM(DST, IMM)\
+   ((struct bpf_insn) {\
+   .code  = BPF_ALU | BPF_MOV | BPF_K, \
+   .dst_reg = DST, \
+   .src_reg = 0,   \
+   .off   = 0, \
+   .imm   = IMM })
+
+/* BPF_LD_IMM64 macro encodes single 'load 64-bit immediate' insn */
+#define BPF_LD_IMM64(DST, IMM) \
+   BPF_LD_IMM64_RAW(DST, 0, IMM)
+
+#define BPF_LD_IMM64_RAW(DST, SRC, IMM)\
+   ((struct bpf_insn) {\
+   .code  = BPF_LD | BPF_DW | BPF_IMM, \
+   .dst_reg = DST, \
+   .src_reg = SRC, \
+   .off   = 0, \
+   .imm   = (__u32) (IMM) }),  \
+   ((struct bpf_insn) {\
+   .code  = 0, /* zero is reserved opcode */   \
+   .dst_reg = 0,

[iproute2 net-next 6/8] change name_is_vrf to return index

2016-12-10 Thread David Ahern

index of 0 means name is not a valid vrf.

Signed-off-by: David Ahern 
---
 ip/ip_common.h  |  2 +-
 ip/iplink_vrf.c | 15 +--
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/ip/ip_common.h b/ip/ip_common.h
index 0147f45a7a31..3162f1ca5b2c 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -91,7 +91,7 @@ struct link_util *get_link_kind(const char *kind);
 void br_dump_bridge_id(const struct ifla_bridge_id *id, char *buf, size_t len);
 
 __u32 ipvrf_get_table(const char *name);
-bool name_is_vrf(const char *name);
+int name_is_vrf(const char *name);
 
 #ifndef INFINITY_LIFE_TIME
 #define INFINITY_LIFE_TIME  0xU
diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c
index a238b2906805..c101ed770f87 100644
--- a/ip/iplink_vrf.c
+++ b/ip/iplink_vrf.c
@@ -159,7 +159,7 @@ __u32 ipvrf_get_table(const char *name)
return tb_id;
 }
 
-bool name_is_vrf(const char *name)
+int name_is_vrf(const char *name)
 {
struct {
struct nlmsghdr n;
@@ -187,24 +187,27 @@ bool name_is_vrf(const char *name)
addattr_l(, sizeof(req), IFLA_IFNAME, name, strlen(name) + 1);
 
if (rtnl_talk(, , , sizeof(answer)) < 0)
-   return false;
+   return 0;
 
ifi = NLMSG_DATA();
len = answer.n.nlmsg_len - NLMSG_LENGTH(sizeof(*ifi));
if (len < 0) {
fprintf(stderr, "BUG: Invalid response to link query.\n");
-   return false;
+   return 0;
}
 
parse_rtattr(tb, IFLA_MAX, IFLA_RTA(ifi), len);
 
if (!tb[IFLA_LINKINFO])
-   return false;
+   return 0;
 
parse_rtattr_nested(li, IFLA_INFO_MAX, tb[IFLA_LINKINFO]);
 
if (!li[IFLA_INFO_KIND])
-   return false;
+   return 0;
+
+   if (strcmp(RTA_DATA(li[IFLA_INFO_KIND]), "vrf"))
+   return 0;
 
-   return strcmp(RTA_DATA(li[IFLA_INFO_KIND]), "vrf") == 0;
+   return ifi->ifi_index;
 }
-- 
2.1.4

[iproute2 net-next 1/8] lib bpf: Add support for BPF_PROG_ATTACH and BPF_PROG_DETACH

2016-12-10 Thread David Ahern

For consistency with other bpf commands, the functions are named
bpf_prog_attach and bpf_prog_detach. The existing bpf_prog_attach is
renamed to bpf_prog_load_and_report since it calls bpf_prog_load and
bpf_prog_report.

Signed-off-by: David Ahern 
---
 include/bpf_util.h |  3 +++
 lib/bpf.c  | 31 ++-
 2 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/include/bpf_util.h b/include/bpf_util.h
index 05baeecda57f..49b96bbc208f 100644
--- a/include/bpf_util.h
+++ b/include/bpf_util.h
@@ -75,6 +75,9 @@ int bpf_trace_pipe(void);
 
 void bpf_print_ops(FILE *f, struct rtattr *bpf_ops, __u16 len);
 
+int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type);
+int bpf_prog_detach(int target_fd, enum bpf_attach_type type);
+
 #ifdef HAVE_ELF
 int bpf_send_map_fds(const char *path, const char *obj);
 int bpf_recv_map_fds(const char *path, int *fds, struct bpf_map_aux *aux,
diff --git a/lib/bpf.c b/lib/bpf.c
index 2a8cd51d4dae..103fc1ef0593 100644
--- a/lib/bpf.c
+++ b/lib/bpf.c
@@ -850,6 +850,27 @@ int bpf_graft_map(const char *map_path, uint32_t *key, int 
argc, char **argv)
return ret;
 }
 
+int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
+{
+   union bpf_attr attr = {
+   .target_fd = target_fd,
+   .attach_bpf_fd = prog_fd,
+   .attach_type = type,
+   };
+
+   return bpf(BPF_PROG_ATTACH, &attr, sizeof(attr));
+}
+
+int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
+{
+   union bpf_attr attr = {
+   .target_fd = target_fd,
+   .attach_type = type,
+   };
+
+   return bpf(BPF_PROG_DETACH, &attr, sizeof(attr));
+}
+
 #ifdef HAVE_ELF
 struct bpf_elf_prog {
enum bpf_prog_type  type;
@@ -1262,9 +1283,9 @@ static void bpf_prog_report(int fd, const char *section,
bpf_dump_error(ctx, "Verifier analysis:\n\n");
 }
 
-static int bpf_prog_attach(const char *section,
-  const struct bpf_elf_prog *prog,
-  struct bpf_elf_ctx *ctx)
+static int bpf_prog_load_and_report(const char *section,
+   const struct bpf_elf_prog *prog,
+   struct bpf_elf_ctx *ctx)
 {
int tries = 0, fd;
 retry:
@@ -1656,7 +1677,7 @@ static int bpf_fetch_prog(struct bpf_elf_ctx *ctx, const 
char *section,
prog.size= data.sec_data->d_size;
prog.license = ctx->license;
 
-   fd = bpf_prog_attach(section, &prog, ctx);
+   fd = bpf_prog_load_and_report(section, &prog, ctx);
if (fd < 0)
return fd;
 
@@ -1755,7 +1776,7 @@ static int bpf_fetch_prog_relo(struct bpf_elf_ctx *ctx, 
const char *section,
prog.size= data_insn.sec_data->d_size;
prog.license = ctx->license;
 
-   fd = bpf_prog_attach(section, &prog, ctx);
+   fd = bpf_prog_load_and_report(section, &prog, ctx);
if (fd < 0) {
*lderr = true;
return fd;
-- 
2.1.4

[iproute2 net-next 7/8] libnetlink: Add variant of rtnl_talk that does not display RTNETLINK answers error

2016-12-10 Thread David Ahern

iplink_vrf has 2 functions used to validate that a user-given device
name is a VRF device and to return the table id. If the user string is
not a device name, ip commands with a vrf keyword show a confusing error
message: "RTNETLINK answers: No such device".

Add a variant of rtnl_talk that does not display the "RTNETLINK answers"
message and update iplink_vrf to use it.
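
The shape of the change (one shared implementation with a flag, plus two
thin public wrappers) can be sketched in isolation. This is an
illustration only: demo_talk_impl(), demo_talk() and demo_talk_quiet()
are hypothetical names, and the message is written to a caller-supplied
buffer instead of stderr so the behaviour is easy to check.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical shared implementation: err is the error code the
 * peer answered with (0 for success). The message is produced only
 * when show_err is true; the return value is the same either way. */
static int demo_talk_impl(int err, bool show_err, char *msg, size_t len)
{
	if (len > 0)
		msg[0] = '\0';
	if (err == 0)
		return 0;
	if (show_err)
		snprintf(msg, len, "RTNETLINK answers: %s", strerror(err));
	return -1;
}

/* Existing behaviour: report the error text. */
static int demo_talk(int err, char *msg, size_t len)
{
	return demo_talk_impl(err, true, msg, len);
}

/* New variant: same result, but no error text. */
static int demo_talk_quiet(int err, char *msg, size_t len)
{
	return demo_talk_impl(err, false, msg, len);
}
```

Callers that want to print their own, more specific message use the quiet
variant and inspect the return value (and errno in the real code).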

Signed-off-by: David Ahern 
---
 include/libnetlink.h |  3 +++
 ip/iplink_vrf.c  | 14 +++---
 lib/libnetlink.c | 20 +---
 3 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/include/libnetlink.h b/include/libnetlink.h
index 751ebf186dd4..bd0267dfcc02 100644
--- a/include/libnetlink.h
+++ b/include/libnetlink.h
@@ -81,6 +81,9 @@ int rtnl_dump_filter_nc(struct rtnl_handle *rth,
 int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
  struct nlmsghdr *answer, size_t len)
__attribute__((warn_unused_result));
+int rtnl_talk_suppress_rtnl_errmsg(struct rtnl_handle *rtnl, struct nlmsghdr 
*n,
+  struct nlmsghdr *answer, size_t len)
+   __attribute__((warn_unused_result));
 int rtnl_send(struct rtnl_handle *rth, const void *buf, int)
__attribute__((warn_unused_result));
 int rtnl_send_check(struct rtnl_handle *rth, const void *buf, int)
diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c
index c101ed770f87..917630e85337 100644
--- a/ip/iplink_vrf.c
+++ b/ip/iplink_vrf.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "rt_names.h"
 #include "utils.h"
@@ -126,8 +127,14 @@ __u32 ipvrf_get_table(const char *name)
 
addattr_l(, sizeof(req), IFLA_IFNAME, name, strlen(name) + 1);
 
-   if (rtnl_talk(, , , sizeof(answer)) < 0)
-   return 0;
+   if (rtnl_talk_suppress_rtnl_errmsg(, ,
+  , sizeof(answer)) < 0) {
+   /* special case "default" vrf to be the main table */
+   if (errno == ENODEV && !strcmp(name, "default"))
+   rtnl_rttable_a2n(&tb_id, "main");
+
+   return tb_id;
+   }
 
ifi = NLMSG_DATA();
len = answer.n.nlmsg_len - NLMSG_LENGTH(sizeof(*ifi));
@@ -186,7 +193,8 @@ int name_is_vrf(const char *name)
 
addattr_l(, sizeof(req), IFLA_IFNAME, name, strlen(name) + 1);
 
-   if (rtnl_talk(, , , sizeof(answer)) < 0)
+   if (rtnl_talk_suppress_rtnl_errmsg(, ,
+  , sizeof(answer)) < 0)
return 0;
 
ifi = NLMSG_DATA();
diff --git a/lib/libnetlink.c b/lib/libnetlink.c
index a5db168e50eb..9d7e89aebbd0 100644
--- a/lib/libnetlink.c
+++ b/lib/libnetlink.c
@@ -12,6 +12,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -397,8 +398,9 @@ int rtnl_dump_filter_nc(struct rtnl_handle *rth,
return rtnl_dump_filter_l(rth, a);
 }
 
-int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
- struct nlmsghdr *answer, size_t maxlen)
+static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
+  struct nlmsghdr *answer, size_t maxlen,
+  bool show_rtnl_err)
 {
int status;
unsigned int seq;
@@ -485,7 +487,7 @@ int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
return 0;
}
 
-   if (rtnl->proto != NETLINK_SOCK_DIAG)
+   if (rtnl->proto != NETLINK_SOCK_DIAG && 
show_rtnl_err)
fprintf(stderr,
"RTNETLINK answers: %s\n",
strerror(-err->error));
@@ -517,6 +519,18 @@ int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
}
 }
 
+int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
+ struct nlmsghdr *answer, size_t maxlen)
+{
+   return __rtnl_talk(rtnl, n, answer, maxlen, true);
+}
+
+int rtnl_talk_suppress_rtnl_errmsg(struct rtnl_handle *rtnl, struct nlmsghdr 
*n,
+  struct nlmsghdr *answer, size_t maxlen)
+{
+   return __rtnl_talk(rtnl, n, answer, maxlen, false);
+}
+
 int rtnl_listen_all_nsid(struct rtnl_handle *rth)
 {
unsigned int on = 1;
-- 
2.1.4

Re: Misalignment, MIPS, and ip_hdr(skb)->version

2016-12-10 Thread Måns Rullgård

Felix Fietkau  writes:

> On 2016-12-10 14:25, Måns Rullgård wrote:
>> Felix Fietkau  writes:
>> 
>>> On 2016-12-07 19:54, Jason A. Donenfeld wrote:
 On Wed, Dec 7, 2016 at 7:51 PM, David Miller  wrote:
> It's so much better to analyze properly where the misalignment comes from
> and address it at the source, as we have for various cases that trip up
> Sparc too.
 
 That's sort of my attitude too, hence starting this thread. Any
 pointers you have about this would be most welcome, so as not to
 perpetuate what already seems like an issue in other parts of the
 stack.
>>> Hi Jason,
>>>
>>> I'm the author of that hackish LEDE/OpenWrt patch that works around the
>>> misalignment issues. Here's some context regarding that patch:
>>>
>>> I intentionally put it in the target specific patches for only one of
>>> our MIPS targets. There are a few ar71xx devices where the misalignment
>>> cannot be fixed, because the Ethernet MAC has a 4-byte DMA alignment
>>> requirement, and does not support inserting 2 bytes of padding to
>>> correct the IP header misalignment.
>>>
>>> With these limitations the choice was between this ugly network stack
>>> patch or inserting a very expensive memmove in the data path (which is
>>> better than taking the mis-alignment traps, but still hurts routing
>>> performance significantly).
>> 
>> I solved this problem in an Ethernet driver by copying the initial part
>> of the packet to an aligned skb and appending the remainder using
>> skb_add_rx_frag().  The kernel network stack only cares about the
>> headers, so the alignment of the packet payload doesn't matter.
>
> I considered that as well, but it's bad for routing performance if the
> ethernet MAC does not support scatter/gather for xmit.
> Unfortunately that limitation is quite common on embedded hardware.

Yes, I can see that being an issue.  However, if you're doing zero-copy
routing, the header part of the original buffer should still be there,
unused, so you could presumably copy the header of the outgoing packet
there and then do dma as usual.  Maybe there's something in the network
stack that makes this impossible though.

-- 
Måns Rullgård

[iproute2 net-next 2/8] bpf: export bpf_prog_load

2016-12-10 Thread David Ahern

Code move only; no functional change intended.

Signed-off-by: David Ahern 
---
 include/bpf_util.h |  3 +++
 lib/bpf.c  | 40 
 2 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/include/bpf_util.h b/include/bpf_util.h
index 49b96bbc208f..dcbdca6978d6 100644
--- a/include/bpf_util.h
+++ b/include/bpf_util.h
@@ -75,6 +75,9 @@ int bpf_trace_pipe(void);
 
 void bpf_print_ops(FILE *f, struct rtattr *bpf_ops, __u16 len);
 
+int bpf_prog_load(enum bpf_prog_type type, const struct bpf_insn *insns,
+ size_t size_insns, const char *license, char *log,
+ size_t size_log);
 int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type);
 int bpf_prog_detach(int target_fd, enum bpf_attach_type type);
 
diff --git a/lib/bpf.c b/lib/bpf.c
index 103fc1ef0593..b04c3a678b9c 100644
--- a/lib/bpf.c
+++ b/lib/bpf.c
@@ -871,6 +871,26 @@ int bpf_prog_detach(int target_fd, enum bpf_attach_type 
type)
return bpf(BPF_PROG_DETACH, &attr, sizeof(attr));
 }
 
+int bpf_prog_load(enum bpf_prog_type type, const struct bpf_insn *insns,
+ size_t size_insns, const char *license, char *log,
+ size_t size_log)
+{
+   union bpf_attr attr = {};
+
+   attr.prog_type = type;
+   attr.insns = bpf_ptr_to_u64(insns);
+   attr.insn_cnt = size_insns / sizeof(struct bpf_insn);
+   attr.license = bpf_ptr_to_u64(license);
+
+   if (size_log > 0) {
+   attr.log_buf = bpf_ptr_to_u64(log);
+   attr.log_size = size_log;
+   attr.log_level = 1;
+   }
+
+   return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
+}
+
 #ifdef HAVE_ELF
 struct bpf_elf_prog {
enum bpf_prog_type  type;
@@ -988,26 +1008,6 @@ static int bpf_map_create(enum bpf_map_type type, 
uint32_t size_key,
return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
 }
 
-static int bpf_prog_load(enum bpf_prog_type type, const struct bpf_insn *insns,
-size_t size_insns, const char *license, char *log,
-size_t size_log)
-{
-   union bpf_attr attr = {};
-
-   attr.prog_type = type;
-   attr.insns = bpf_ptr_to_u64(insns);
-   attr.insn_cnt = size_insns / sizeof(struct bpf_insn);
-   attr.license = bpf_ptr_to_u64(license);
-
-   if (size_log > 0) {
-   attr.log_buf = bpf_ptr_to_u64(log);
-   attr.log_size = size_log;
-   attr.log_level = 1;
-   }
-
-   return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
-}
-
 static int bpf_obj_pin(int fd, const char *pathname)
 {
union bpf_attr attr = {};
-- 
2.1.4

[iproute2 net-next 4/8] move cmd_exec to lib utils

2016-12-10 Thread David Ahern

Signed-off-by: David Ahern 
---
 include/utils.h |  2 ++
 ip/ipnetns.c| 34 --
 lib/Makefile|  2 +-
 lib/exec.c  | 41 +
 4 files changed, 44 insertions(+), 35 deletions(-)
 create mode 100644 lib/exec.c

diff --git a/include/utils.h b/include/utils.h
index 26c970daa5d0..ac4517a3bde1 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -256,4 +256,6 @@ char *int_to_str(int val, char *buf);
 int get_guid(__u64 *guid, const char *arg);
 int get_real_family(int rtm_type, int rtm_family);
 
+int cmd_exec(const char *cmd, char **argv, bool do_fork);
+
 #endif /* __UTILS_H__ */
diff --git a/ip/ipnetns.c b/ip/ipnetns.c
index bd1e9013706c..db9a541769f1 100644
--- a/ip/ipnetns.c
+++ b/ip/ipnetns.c
@@ -357,40 +357,6 @@ static int netns_list(int argc, char **argv)
return 0;
 }
 
-static int cmd_exec(const char *cmd, char **argv, bool do_fork)
-{
-   fflush(stdout);
-   if (do_fork) {
-   int status;
-   pid_t pid;
-
-   pid = fork();
-   if (pid < 0) {
-   perror("fork");
-   exit(1);
-   }
-
-   if (pid != 0) {
-   /* Parent  */
-   if (waitpid(pid, &status, 0) < 0) {
-   perror("waitpid");
-   exit(1);
-   }
-
-   if (WIFEXITED(status)) {
-   return WEXITSTATUS(status);
-   }
-
-   exit(1);
-   }
-   }
-
-   if (execvp(cmd, argv)  < 0)
-   fprintf(stderr, "exec of \"%s\" failed: %s\n",
-   cmd, strerror(errno));
-   _exit(1);
-}
-
 static int on_netns_exec(char *nsname, void *arg)
 {
char **argv = arg;
diff --git a/lib/Makefile b/lib/Makefile
index 5b7ec169048a..749073261c49 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -8,7 +8,7 @@ CFLAGS += -fPIC
 
 UTILOBJ = utils.o rt_names.o ll_types.o ll_proto.o ll_addr.o \
inet_proto.o namespace.o json_writer.o \
-   names.o color.o bpf.o
+   names.o color.o bpf.o exec.o
 
 NLOBJ=libgenl.o ll_map.o libnetlink.o
 
diff --git a/lib/exec.c b/lib/exec.c
new file mode 100644
index 000000000000..96edbc422e84
--- /dev/null
+++ b/lib/exec.c
@@ -0,0 +1,41 @@
+#define _ATFILE_SOURCE
+#include 
+#include 
+#include 
+#include 
+
+#include "utils.h"
+
+int cmd_exec(const char *cmd, char **argv, bool do_fork)
+{
+   fflush(stdout);
+   if (do_fork) {
+   int status;
+   pid_t pid;
+
+   pid = fork();
+   if (pid < 0) {
+   perror("fork");
+   exit(1);
+   }
+
+   if (pid != 0) {
+   /* Parent  */
+   if (waitpid(pid, &status, 0) < 0) {
+   perror("waitpid");
+   exit(1);
+   }
+
+   if (WIFEXITED(status)) {
+   return WEXITSTATUS(status);
+   }
+
+   exit(1);
+   }
+   }
+
+   if (execvp(cmd, argv)  < 0)
+   fprintf(stderr, "exec of \"%s\" failed: %s\n",
+   cmd, strerror(errno));
+   _exit(1);
+}
-- 
2.1.4

[iproute2 net-next 5/8] Add filesystem APIs to lib

2016-12-10 Thread David Ahern

Add make_path to recursively call mkdir as needed to create a given
path with the given mode.

Add find_cgroup2_mount to lookup path where cgroup2 is mounted. If it
is not already mounted, cgroup2 is mounted under /var/run/cgroup2 for
use by iproute2.

Signed-off-by: David Ahern 
---
 include/utils.h |   2 +
 lib/Makefile|   2 +-
 lib/fs.c| 143 
 3 files changed, 146 insertions(+), 1 deletion(-)
 create mode 100644 lib/fs.c

diff --git a/include/utils.h b/include/utils.h
index ac4517a3bde1..dc1d6b9607dd 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -257,5 +257,7 @@ int get_guid(__u64 *guid, const char *arg);
 int get_real_family(int rtm_type, int rtm_family);
 
 int cmd_exec(const char *cmd, char **argv, bool do_fork);
+int make_path(const char *path, mode_t mode);
+char *find_cgroup2_mount(void);
 
 #endif /* __UTILS_H__ */
diff --git a/lib/Makefile b/lib/Makefile
index 749073261c49..0c57662b4f8f 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -8,7 +8,7 @@ CFLAGS += -fPIC
 
 UTILOBJ = utils.o rt_names.o ll_types.o ll_proto.o ll_addr.o \
inet_proto.o namespace.o json_writer.o \
-   names.o color.o bpf.o exec.o
+   names.o color.o bpf.o exec.o fs.o
 
 NLOBJ=libgenl.o ll_map.o libnetlink.o
 
diff --git a/lib/fs.c b/lib/fs.c
new file mode 100644
index 000000000000..39cc96dccca9
--- /dev/null
+++ b/lib/fs.c
@@ -0,0 +1,143 @@
+/*
+ * fs.c filesystem APIs
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:David Ahern 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "utils.h"
+
+#define CGROUP2_FS_NAME "cgroup2"
+
+/* if not already mounted cgroup2 is mounted here for iproute2's use */
+#define MNT_CGRP2_PATH  "/var/run/cgroup2"
+
+/* return mount path of first occurrence of given fstype */
+static char *find_fs_mount(const char *fs_to_find)
+{
+   char path[4096];
+   char fstype[128];/* max length of any filesystem name */
+   char *mnt = NULL;
+   FILE *fp;
+
+   fp = fopen("/proc/mounts", "r");
+   if (!fp) {
+   fprintf(stderr,
+   "Failed to open mounts file: %s\n", strerror(errno));
+   return NULL;
+   }
+
+   while (fscanf(fp, "%*s %4096s %127s %*s %*d %*d\n",
+ path, fstype) == 2) {
+   if (strcmp(fstype, fs_to_find) == 0) {
+   mnt = strdup(path);
+   break;
+   }
+   }
+
+   fclose(fp);
+
+   return mnt;
+}
+
+/* caller needs to free string returned */
+char *find_cgroup2_mount(void)
+{
+   char *mnt = find_fs_mount(CGROUP2_FS_NAME);
+
+   if (mnt)
+   return mnt;
+
+   mnt = strdup(MNT_CGRP2_PATH);
+   if (!mnt) {
+   fprintf(stderr, "Failed to allocate memory for cgroup2 path\n");
+   return NULL;
+
+   }
+
+   if (make_path(mnt, 0755)) {
+   fprintf(stderr, "Failed to setup vrf cgroup2 directory\n");
+   free(mnt);
+   return NULL;
+   }
+
+   if (mount("none", mnt, CGROUP2_FS_NAME, 0, NULL)) {
+   /* EBUSY means already mounted */
+   if (errno != EBUSY) {
+   fprintf(stderr,
+   "Failed to mount cgroup2. Are CGROUPS enabled in your kernel?\n");
+   free(mnt);
+   return NULL;
+   }
+   }
+   return mnt;
+}
+
+int make_path(const char *path, mode_t mode)
+{
+   char *dir, *delim;
+   struct stat sbuf;
+   int rc = -1;
+
+   delim = dir = strdup(path);
+   if (dir == NULL) {
+   fprintf(stderr, "strdup failed copying path");
+   return -1;
+   }
+
+   /* skip '/' -- it had better exist */
+   if (*delim == '/')
+   delim++;
+
+   while (1) {
+   delim = strchr(delim, '/');
+   if (delim)
+   *delim = '\0';
+
+   if (stat(dir, &sbuf) != 0) {
+   if (errno != ENOENT) {
+   fprintf(stderr,
+   "stat failed for %s: %s\n",
+   dir, strerror(errno));
+   goto out;
+   }
+
+   if (mkdir(dir, mode) != 0) {
+   fprintf(stderr,
+   "mkdir failed for %s: %s",
+   dir, strerror(errno));
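
The patch text is cut off at this point in the archive. What follows is
not the remainder of the patch, only a minimal standalone sketch of the
"create each missing path component" idea from the commit message. The
name mkdir_p_sketch() and the ENOTDIR/EEXIST handling are assumptions
made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper (not make_path() from the patch): create every
 * missing component of 'path' with 'mode'.
 * Returns 0 on success, -1 on error with errno set.
 */
static int mkdir_p_sketch(const char *path, mode_t mode)
{
	struct stat sbuf;
	char *dir, *p;
	int rc = -1;

	assert(path != NULL);

	dir = strdup(path);
	if (!dir)
		return -1;

	/* start after a leading '/' so the root itself is never created */
	p = dir + (dir[0] == '/');

	for (;;) {
		p = strchr(p, '/');
		if (p)
			*p = '\0';	/* cut the string at this component */

		if (stat(dir, &sbuf) != 0) {
			if (errno != ENOENT)
				goto out;
			/* EEXIST: someone else created it in the meantime */
			if (mkdir(dir, mode) != 0 && errno != EEXIST)
				goto out;
		} else if (!S_ISDIR(sbuf.st_mode)) {
			errno = ENOTDIR;
			goto out;
		}

		if (!p)
			break;		/* last component handled */
		*p++ = '/';		/* restore separator, continue */
	}
	rc = 0;
out:
	free(dir);
	return rc;
}
```

Calling it twice on the same path is harmless, since components that
already exist as directories are skipped.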

[iproute2 v2 net-next 0/8] Add support for vrf helper

2016-12-10 Thread David Ahern

This series adds support to iproute2 to run a command against a specific
VRF. The user semantics are similar to 'ip netns'.

The 'ip vrf' subcommand supports 3 usages:

1. Run a command against a given vrf:
   ip vrf exec NAME CMD

   Uses the recently committed cgroup/sock BPF option. vrf directory
   is added to cgroup2 mount. Individual vrfs are created under it. BPF
   filter is attached to vrf/NAME cgroup2 to set sk_bound_dev_if to the
   device index of the VRF. From there the current process (ip's pid) is
   added to the cgroup.procs file and the given command is executed. In
   doing so all AF_INET/AF_INET6 (ipv4/ipv6) sockets are automatically
   bound to the VRF domain.

   The association is inherited parent to child allowing the command to
   be a shell from which other commands are run relative to the VRF.

2. Show the VRF a process is bound to:
   ip vrf id
   This command essentially looks at /proc/pid/cgroup for a "::/vrf/"
   entry.

3. Show process ids bound to a VRF
   ip vrf pids NAME
   This command dumps the file MNT/vrf/NAME/cgroup.procs since that file
   shows the process ids in the particular vrf cgroup.
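
To make usage 2 concrete, here is a rough sketch of the lookup it
describes: scan /proc/<pid>/cgroup for a "::/vrf/" entry and take what
follows as the VRF name. This is not the code from patch 8/8; the
function names and the exact parsing rules are assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: if 'line' (one line of /proc/<pid>/cgroup) has a
 * "::/vrf/" entry, copy the VRF name into 'name' and return 1.
 * Returns 0 if there is no such entry or the name does not fit.
 */
static int vrf_name_from_cgroup_line(const char *line, char *name, size_t len)
{
	static const char marker[] = "::/vrf/";
	const char *p;
	size_t n;

	assert(line != NULL && name != NULL);

	p = strstr(line, marker);
	if (!p)
		return 0;

	p += sizeof(marker) - 1;
	n = strcspn(p, "/\n");	/* name ends at next '/' or end of line */
	if (n == 0 || n >= len)
		return 0;

	memcpy(name, p, n);
	name[n] = '\0';
	return 1;
}

/* Hypothetical wrapper: returns 1 if 'pid' is in a vrf cgroup, 0 if not,
 * -1 if the cgroup file cannot be opened.
 */
static int vrf_of_pid(int pid, char *name, size_t len)
{
	char path[64], line[512];
	int found = 0;
	FILE *fp;

	snprintf(path, sizeof(path), "/proc/%d/cgroup", pid);
	fp = fopen(path, "r");
	if (!fp)
		return -1;

	while (!found && fgets(line, sizeof(line), fp))
		found = vrf_name_from_cgroup_line(line, name, len);

	fclose(fp);
	return found;
}
```

Usage 3 would be the reverse direction: read MNT/vrf/NAME/cgroup.procs
line by line and print each pid.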

v2
- updated subject of patch 3 to avoid spam filters on vger

David Ahern (8):
  lib bpf: Add support for BPF_PROG_ATTACH and BPF_PROG_DETACH
  bpf: export bpf_prog_load
  Add libbpf.h header with BPF_ macros
  move cmd_exec to lib utils
  Add filesystem APIs to lib
  change name_is_vrf to return index
  libnetlink: Add variant of rtnl_talk that does not display RTNETLINK
answers error
  Introduce ip vrf command

 include/bpf_util.h   |   6 ++
 include/libbpf.h | 184 
 include/libnetlink.h |   3 +
 include/utils.h  |   4 +
 ip/Makefile  |   3 +-
 ip/ip.c  |   4 +-
 ip/ip_common.h   |   4 +-
 ip/iplink_vrf.c  |  29 --
 ip/ipnetns.c |  34 --
 ip/ipvrf.c   | 289 +++
 lib/Makefile |   2 +-
 lib/bpf.c|  71 -
 lib/exec.c   |  41 
 lib/fs.c | 143 +
 lib/libnetlink.c |  20 +++-
 man/man8/ip-vrf.8|  88 
 16 files changed, 850 insertions(+), 75 deletions(-)
 create mode 100644 include/libbpf.h
 create mode 100644 ip/ipvrf.c
 create mode 100644 lib/exec.c
 create mode 100644 lib/fs.c
 create mode 100644 man/man8/ip-vrf.8

-- 
2.1.4

Re: Misalignment, MIPS, and ip_hdr(skb)->version

2016-12-10 Thread Felix Fietkau

On 2016-12-10 14:25, Måns Rullgård wrote:
> Felix Fietkau  writes:
> 
>> On 2016-12-07 19:54, Jason A. Donenfeld wrote:
>>> On Wed, Dec 7, 2016 at 7:51 PM, David Miller  wrote:
>>>> It's so much better to analyze properly where the misalignment comes from
>>>> and address it at the source, as we have for various cases that trip up
>>>> Sparc too.
>>> 
>>> That's sort of my attitude too, hence starting this thread. Any
>>> pointers you have about this would be most welcome, so as not to
>>> perpetuate what already seems like an issue in other parts of the
>>> stack.
>> Hi Jason,
>>
>> I'm the author of that hackish LEDE/OpenWrt patch that works around the
>> misalignment issues. Here's some context regarding that patch:
>>
>> I intentionally put it in the target specific patches for only one of
>> our MIPS targets. There are a few ar71xx devices where the misalignment
>> cannot be fixed, because the Ethernet MAC has a 4-byte DMA alignment
>> requirement, and does not support inserting 2 bytes of padding to
>> correct the IP header misalignment.
>>
>> With these limitations the choice was between this ugly network stack
>> patch or inserting a very expensive memmove in the data path (which is
>> better than taking the mis-alignment traps, but still hurts routing
>> performance significantly).
> 
> I solved this problem in an Ethernet driver by copying the initial part
> of the packet to an aligned skb and appending the remainder using
> skb_add_rx_frag().  The kernel network stack only cares about the
> headers, so the alignment of the packet payload doesn't matter.
I considered that as well, but it's bad for routing performance if the
ethernet MAC does not support scatter/gather for xmit.
Unfortunately that limitation is quite common on embedded hardware.

- Felix
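
For readers following the thread, the arithmetic behind the "2 bytes of
padding" remark can be sketched as below. The Ethernet header is 14
bytes, so when the frame starts on a 4-byte boundary the IP header ends
up 2 bytes off. The macro and function names are made up for this
illustration and are not kernel identifiers.

```c
#include <assert.h>
#include <stddef.h>

#define ETH_HDR_LEN_SKETCH 14	/* dst MAC (6) + src MAC (6) + EtherType (2) */

/* Hypothetical helper: how far the IP header sits past an 'align'-byte
 * boundary when the Ethernet frame starts 'frame_off' bytes into an
 * aligned buffer. 0 means the IP header is aligned.
 */
static size_t ip_hdr_misalignment(size_t frame_off, size_t align)
{
	assert(align > 0);
	return (frame_off + ETH_HDR_LEN_SKETCH) % align;
}
```

With a MAC that forces frame_off to 0, ip_hdr_misalignment(0, 4) is 2;
with 2 bytes of leading padding, ip_hdr_misalignment(2, 4) is 0. That
is why hardware which cannot insert the padding leaves only the
workarounds discussed above.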

[PATCH net 1/3] net: bridge: add helper to offload ageing time

2016-12-10 Thread Vivien Didelot

The SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME switchdev attr is actually set
when initializing a bridge port, and when configuring the bridge ageing
time from ioctl/netlink/sysfs.

Add a __set_ageing_time helper to offload the ageing time to physical
switches, and add the SWITCHDEV_F_DEFER flag since it can be called
under bridge lock.

Signed-off-by: Vivien Didelot 
---
 net/bridge/br_private.h |  1 +
 net/bridge/br_stp.c | 28 
 net/bridge/br_stp_if.c  | 12 +++-
 3 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 1b63177..3c294b4 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -992,6 +992,7 @@ void __br_set_forward_delay(struct net_bridge *br, unsigned long t);
 int br_set_forward_delay(struct net_bridge *br, unsigned long x);
 int br_set_hello_time(struct net_bridge *br, unsigned long x);
 int br_set_max_age(struct net_bridge *br, unsigned long x);
+int __set_ageing_time(struct net_device *dev, unsigned long t);
 int br_set_ageing_time(struct net_bridge *br, clock_t ageing_time);
 
 
diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
index 9258b8e..6ebe2a0 100644
--- a/net/bridge/br_stp.c
+++ b/net/bridge/br_stp.c
@@ -562,6 +562,24 @@ int br_set_max_age(struct net_bridge *br, unsigned long val)
 
 }
 
+/* called under bridge lock */
+int __set_ageing_time(struct net_device *dev, unsigned long t)
+{
+   struct switchdev_attr attr = {
+   .orig_dev = dev,
+   .id = SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME,
+   .flags = SWITCHDEV_F_SKIP_EOPNOTSUPP | SWITCHDEV_F_DEFER,
+   .u.ageing_time = jiffies_to_clock_t(t),
+   };
+   int err;
+
+   err = switchdev_port_attr_set(dev, &attr);
+   if (err && err != -EOPNOTSUPP)
+   return err;
+
+   return 0;
+}
+
 /* Set time interval that dynamic forwarding entries live
  * For pure software bridge, allow values outside the 802.1
  * standard specification for special cases:
@@ -572,17 +590,11 @@ int br_set_max_age(struct net_bridge *br, unsigned long val)
  */
 int br_set_ageing_time(struct net_bridge *br, clock_t ageing_time)
 {
-   struct switchdev_attr attr = {
-   .orig_dev = br->dev,
-   .id = SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME,
-   .flags = SWITCHDEV_F_SKIP_EOPNOTSUPP,
-   .u.ageing_time = ageing_time,
-   };
unsigned long t = clock_t_to_jiffies(ageing_time);
int err;
 
-   err = switchdev_port_attr_set(br->dev, &attr);
-   if (err && err != -EOPNOTSUPP)
+   err = __set_ageing_time(br->dev, t);
+   if (err)
return err;
 
br->ageing_time = t;
diff --git a/net/bridge/br_stp_if.c b/net/bridge/br_stp_if.c
index d8ad73b..2efbba5 100644
--- a/net/bridge/br_stp_if.c
+++ b/net/bridge/br_stp_if.c
@@ -36,12 +36,6 @@ static inline port_id br_make_port_id(__u8 priority, __u16 port_no)
 /* called under bridge lock */
 void br_init_port(struct net_bridge_port *p)
 {
-   struct switchdev_attr attr = {
-   .orig_dev = p->dev,
-   .id = SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME,
-   .flags = SWITCHDEV_F_SKIP_EOPNOTSUPP | SWITCHDEV_F_DEFER,
-   .u.ageing_time = jiffies_to_clock_t(p->br->ageing_time),
-   };
int err;
 
p->port_id = br_make_port_id(p->priority, p->port_no);
@@ -50,9 +44,9 @@ void br_init_port(struct net_bridge_port *p)
p->topology_change_ack = 0;
p->config_pending = 0;
 
-   err = switchdev_port_attr_set(p->dev, &attr);
-   if (err && err != -EOPNOTSUPP)
-   netdev_err(p->dev, "failed to set HW ageing time\n");
+   err = __set_ageing_time(p->dev, p->br->ageing_time);
+   if (err)
+   netdev_err(p->dev, "failed to offload ageing time\n");
 }
 
 /* NO locks held */
-- 
2.10.2

[PATCH net 2/3] net: bridge: add helper to set topology change

2016-12-10 Thread Vivien Didelot

Add a __br_set_topology_change helper to set the topology change value.

This can be later extended to add actions when the topology change flag
is set or cleared.

Signed-off-by: Vivien Didelot 
---
 net/bridge/br_private_stp.h |  1 +
 net/bridge/br_stp.c | 10 --
 net/bridge/br_stp_if.c  |  2 +-
 net/bridge/br_stp_timer.c   |  2 +-
 4 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/net/bridge/br_private_stp.h b/net/bridge/br_private_stp.h
index 2fe910c..3f7543a 100644
--- a/net/bridge/br_private_stp.h
+++ b/net/bridge/br_private_stp.h
@@ -61,6 +61,7 @@ void br_received_tcn_bpdu(struct net_bridge_port *p);
 void br_transmit_config(struct net_bridge_port *p);
 void br_transmit_tcn(struct net_bridge *br);
 void br_topology_change_detection(struct net_bridge *br);
+void __br_set_topology_change(struct net_bridge *br, unsigned char val);
 
 /* br_stp_bpdu.c */
 void br_send_config_bpdu(struct net_bridge_port *, struct br_config_bpdu *);
diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
index 6ebe2a0..8d7b4c7 100644
--- a/net/bridge/br_stp.c
+++ b/net/bridge/br_stp.c
@@ -234,7 +234,7 @@ static void br_record_config_timeout_values(struct net_bridge *br,
br->max_age = bpdu->max_age;
br->hello_time = bpdu->hello_time;
br->forward_delay = bpdu->forward_delay;
-   br->topology_change = bpdu->topology_change;
+   __br_set_topology_change(br, bpdu->topology_change);
 }
 
 /* called under bridge lock */
@@ -344,7 +344,7 @@ void br_topology_change_detection(struct net_bridge *br)
isroot ? "propagating" : "sending tcn bpdu");
 
if (isroot) {
-   br->topology_change = 1;
+   __br_set_topology_change(br, 1);
mod_timer(&br->topology_change_timer, jiffies
  + br->bridge_forward_delay + br->bridge_max_age);
} else if (!br->topology_change_detected) {
@@ -603,6 +603,12 @@ int br_set_ageing_time(struct net_bridge *br, clock_t ageing_time)
return 0;
 }
 
+/* called under bridge lock */
+void __br_set_topology_change(struct net_bridge *br, unsigned char val)
+{
+   br->topology_change = val;
+}
+
 void __br_set_forward_delay(struct net_bridge *br, unsigned long t)
 {
br->bridge_forward_delay = t;
diff --git a/net/bridge/br_stp_if.c b/net/bridge/br_stp_if.c
index 2efbba5..6c1e214 100644
--- a/net/bridge/br_stp_if.c
+++ b/net/bridge/br_stp_if.c
@@ -81,7 +81,7 @@ void br_stp_disable_bridge(struct net_bridge *br)
 
}
 
-   br->topology_change = 0;
+   __br_set_topology_change(br, 0);
br->topology_change_detected = 0;
spin_unlock_bh(&br->lock);
 
diff --git a/net/bridge/br_stp_timer.c b/net/bridge/br_stp_timer.c
index da058b8..7ddb38e 100644
--- a/net/bridge/br_stp_timer.c
+++ b/net/bridge/br_stp_timer.c
@@ -125,7 +125,7 @@ static void br_topology_change_timer_expired(unsigned long arg)
br_debug(br, "topo change timer expired\n");
spin_lock(&br->lock);
br->topology_change_detected = 0;
-   br->topology_change = 0;
+   __br_set_topology_change(br, 0);
spin_unlock(&br->lock);
 }
 
-- 
2.10.2

[PATCH net 3/3] net: bridge: shorten ageing time on topology change

2016-12-10 Thread Vivien Didelot

802.1D [1] specifies that the bridges must use a short value to age out
dynamic entries in the Filtering Database for a period, once a topology
change has been communicated by the root bridge.

Add a bridge_ageing_time member in the net_bridge structure to store the
bridge ageing time value configured by the user (ioctl/netlink/sysfs).

If we are using in-kernel STP, shorten the ageing time value to twice
the forward delay used by the topology when the topology change flag is
set. When the flag is cleared, restore the configured ageing time.

[1] "8.3.5 Notifying topology changes",
http://profesores.elo.utfsm.cl/~agv/elo309/doc/802.1D-1998.pdf

Signed-off-by: Vivien Didelot 
---
 net/bridge/br_device.c  |  2 +-
 net/bridge/br_private.h |  3 ++-
 net/bridge/br_stp.c | 27 +++
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index 89a687f..207318a 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -409,7 +409,7 @@ void br_dev_setup(struct net_device *dev)
br->bridge_max_age = br->max_age = 20 * HZ;
br->bridge_hello_time = br->hello_time = 2 * HZ;
br->bridge_forward_delay = br->forward_delay = 15 * HZ;
-   br->ageing_time = BR_DEFAULT_AGEING_TIME;
+   br->bridge_ageing_time = br->ageing_time = BR_DEFAULT_AGEING_TIME;
 
br_netfilter_rtable_init(br);
br_stp_timer_init(br);
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 3c294b4..43efeb9 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -300,10 +300,11 @@ struct net_bridge
unsigned long   max_age;
unsigned long   hello_time;
unsigned long   forward_delay;
-   unsigned long   bridge_max_age;
unsigned long   ageing_time;
+   unsigned long   bridge_max_age;
unsigned long   bridge_hello_time;
unsigned long   bridge_forward_delay;
+   unsigned long   bridge_ageing_time;
 
u8  group_addr[ETH_ALEN];
boolgroup_addr_set;
diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
index 8d7b4c7..71fd1a4 100644
--- a/net/bridge/br_stp.c
+++ b/net/bridge/br_stp.c
@@ -597,7 +597,11 @@ int br_set_ageing_time(struct net_bridge *br, clock_t ageing_time)
if (err)
return err;
 
+   spin_lock_bh(&br->lock);
+   br->bridge_ageing_time = t;
br->ageing_time = t;
+   spin_unlock_bh(&br->lock);
+
mod_timer(&br->gc_timer, jiffies);
 
return 0;
@@ -606,6 +610,29 @@ int br_set_ageing_time(struct net_bridge *br, clock_t ageing_time)
 /* called under bridge lock */
 void __br_set_topology_change(struct net_bridge *br, unsigned char val)
 {
+   unsigned long t;
+   int err;
+
+   if (br->stp_enabled == BR_KERNEL_STP && br->topology_change != val) {
+   /* On topology change, set the bridge ageing time to twice the
+* forward delay. Otherwise, restore its default ageing time.
+*/
+
+   if (val) {
+   t = 2 * br->forward_delay;
+   br_debug(br, "decreasing ageing time to %lu\n", t);
+   } else {
+   t = br->bridge_ageing_time;
+   br_debug(br, "restoring ageing time to %lu\n", t);
+   }
+
+   err = __set_ageing_time(br->dev, t);
+   if (err)
+   br_warn(br, "error offloading ageing time\n");
+   else
+   br->ageing_time = t;
+   }
+
br->topology_change = val;
 }
 
-- 
2.10.2

[PATCH net 0/3] net: bridge: fast ageing on topology change

2016-12-10 Thread Vivien Didelot

802.1D [1] specifies that the bridges in a network must use a short
value to age out dynamic entries in the Filtering Database for a period,
once a topology change has been communicated by the root bridge.

This patchset fixes this for the in-kernel STP implementation.

Once the topology change flag is set in a net_bridge instance, the
ageing time value is shortened to twice the forward delay used by the
topology.

When the topology change flag is cleared, the ageing time configured for
the bridge is restored.

To accomplish that, a new bridge_ageing_time member is added to the
net_bridge structure, to store the user configured bridge ageing time.

Two helpers are added to offload the ageing time and set the topology
change flag in the net_bridge instance. Then the required logic is added
in the topology change helper if in-kernel STP is used.

This has been tested on the following topology:

+--+
| root bridge  |
|  1  2  3  4  |
+--+--+--+--+--+
   |  |  |  |  ++
   |  |  |  +--| laptop |
   |  |  | ++
+--+--+--+-+
|  1  2  3 |
| slave bridge |
+--+

When unplugging/replugging the laptop, the slave bridge (under test)
gets the topology change flag sent by the root bridge, and fast ageing
is triggered on the bridges. Once the topology change timer of the root
bridge expires, the topology change flag is cleared and the configured
ageing time is restored on the bridges.

A similar test has been done between two bridges under test.
When changing the forward delay of the root bridge with:

# echo 3000 > /sys/class/net/br0/bridge/forward_delay

the ageing time correctly changes on both bridges from 300s to 60s while
the TOPOLOGY_CHANGE flag is present.
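
The numbers in this test can be reproduced with a small sketch of the
selection rule. It assumes the sysfs forward_delay value is given in
hundredths of a second, so 3000 means 30s; that reading is consistent
with the 60s observed above. The function name is hypothetical and the
units are seconds for readability, whereas the kernel code works in
jiffies.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: ageing time (seconds) a bridge would use.
 * While in-kernel STP is active and the topology change flag is set,
 * use twice the forward delay; otherwise use the configured value.
 */
static unsigned long effective_ageing_time(int kernel_stp, int topology_change,
					   unsigned long forward_delay,
					   unsigned long configured_ageing)
{
	assert(forward_delay <= ULONG_MAX / 2);	/* 2 * forward_delay fits */

	if (kernel_stp && topology_change)
		return 2 * forward_delay;

	return configured_ageing;
}
```

With forward_delay = 30 and a configured ageing time of 300, the result
is 60 while the flag is set and 300 once it is cleared. Without
in-kernel STP the configured value is always used.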

[1] "8.3.5 Notifying topology changes",
http://profesores.elo.utfsm.cl/~agv/elo309/doc/802.1D-1998.pdf

No change since RFC: https://lkml.org/lkml/2016/10/19/828

Vivien Didelot (3):
  net: bridge: add helper to offload ageing time
  net: bridge: add helper to set topology change
  net: bridge: shorten ageing time on topology change

 net/bridge/br_device.c  |  2 +-
 net/bridge/br_private.h |  4 ++-
 net/bridge/br_private_stp.h |  1 +
 net/bridge/br_stp.c | 65 ++---
 net/bridge/br_stp_if.c  | 14 +++---
 net/bridge/br_stp_timer.c   |  2 +-
 6 files changed, 65 insertions(+), 23 deletions(-)

-- 
2.10.2

[PATCH net v3] ibmveth: set correct gso_size and gso_type

2016-12-10 Thread Thomas Falcon

This patch is based on an earlier one submitted
by Jon Maxwell with the following commit message:

"We recently encountered a bug where a few customers using ibmveth on the
same LPAR hit an issue where a TCP session hung when large receive was
enabled. Closer analysis revealed that the session was stuck because the
one side was advertising a zero window repeatedly.

We narrowed this down to the fact the ibmveth driver did not set gso_size
which is translated by TCP into the MSS later up the stack. The MSS is
used to calculate the TCP window size and as that was abnormally large,
it was calculating a zero window, even although the sockets receive buffer
was completely empty."

We rely on the Virtual I/O Server partition in a pseries
environment to provide the MSS through the TCP header checksum
field. The stipulation is that users should not disable checksum
offloading if rx packet aggregation is enabled through VIOS.

Some firmware offerings provide the MSS in the RX buffer.
This is signalled by a bit in the RX queue descriptor.

Reviewed-by: Brian King 
Reviewed-by: Pradeep Satyanarayana 
Reviewed-by: Marcelo Ricardo Leitner 
Reviewed-by: Jonathan Maxwell 
Reviewed-by: David Dai 
Signed-off-by: Thomas Falcon 
---
v3: include a check for non-zero mss when calculating gso_segs

v2: calculate gso_segs after Eric Dumazet's comments on the earlier patch
and make sure everyone is included on CC
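
For reference, the segment count mentioned in the v3 note boils down to
a rounded-up division of the TCP payload by the MSS, skipped when the
MSS is zero. The following is a userspace sketch of that arithmetic
only. The names are hypothetical, and the precondition that the packet
is at least as long as its headers is an assumption added here, not
something the patch checks.

```c
#include <assert.h>
#include <stdint.h>

#define DIV_ROUND_UP_SKETCH(n, d) (((n) + (d) - 1) / (d))

/* Hypothetical helper: number of MSS-sized segments in an aggregated
 * packet of 'pkt_len' bytes with 'hdr_len' bytes of IP + TCP headers.
 * Returns 0 when no MSS is known, so the caller leaves gso_segs unset.
 */
static uint32_t gso_segs_sketch(uint32_t pkt_len, uint32_t hdr_len, uint16_t mss)
{
	assert(pkt_len >= hdr_len);	/* assumed precondition */

	if (mss == 0)
		return 0;		/* avoid dividing by zero */

	return DIV_ROUND_UP_SKETCH(pkt_len - hdr_len, (uint32_t)mss);
}
```

For example, 40 bytes of headers plus 4344 bytes of payload with an MSS
of 1448 gives 3 segments, and one more payload byte gives 4.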
---
 drivers/net/ethernet/ibm/ibmveth.c | 72 --
 drivers/net/ethernet/ibm/ibmveth.h |  1 +
 2 files changed, 71 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
index ebe6071..6dc24a1 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -58,7 +58,7 @@
 
 static const char ibmveth_driver_name[] = "ibmveth";
 static const char ibmveth_driver_string[] = "IBM Power Virtual Ethernet Driver";
-#define ibmveth_driver_version "1.05"
+#define ibmveth_driver_version "1.06"
 
 MODULE_AUTHOR("Santiago Leon ");
 MODULE_DESCRIPTION("IBM Power Virtual Ethernet Driver");
@@ -137,6 +137,11 @@ static inline int ibmveth_rxq_frame_offset(struct ibmveth_adapter *adapter)
return ibmveth_rxq_flags(adapter) & IBMVETH_RXQ_OFF_MASK;
 }
 
+static inline int ibmveth_rxq_large_packet(struct ibmveth_adapter *adapter)
+{
+   return ibmveth_rxq_flags(adapter) & IBMVETH_RXQ_LRG_PKT;
+}
+
 static inline int ibmveth_rxq_frame_length(struct ibmveth_adapter *adapter)
 {
return be32_to_cpu(adapter->rx_queue.queue_addr[adapter->rx_queue.index].length);
@@ -1174,6 +1179,52 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
goto retry_bounce;
 }
 
+static void ibmveth_rx_mss_helper(struct sk_buff *skb, u16 mss, int lrg_pkt)
+{
+   struct tcphdr *tcph;
+   int offset = 0;
+   int hdr_len;
+
+   /* only TCP packets will be aggregated */
+   if (skb->protocol == htons(ETH_P_IP)) {
+   struct iphdr *iph = (struct iphdr *)skb->data;
+
+   if (iph->protocol == IPPROTO_TCP) {
+   offset = iph->ihl * 4;
+   skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
+   } else {
+   return;
+   }
+   } else if (skb->protocol == htons(ETH_P_IPV6)) {
+   struct ipv6hdr *iph6 = (struct ipv6hdr *)skb->data;
+
+   if (iph6->nexthdr == IPPROTO_TCP) {
+   offset = sizeof(struct ipv6hdr);
+   skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
+   } else {
+   return;
+   }
+   } else {
+   return;
+   }
+   /* if mss is not set through Large Packet bit/mss in rx buffer,
+* expect that the mss will be written to the tcp header checksum.
+*/
+   tcph = (struct tcphdr *)(skb->data + offset);
+   hdr_len = offset + tcph->doff * 4;
+   if (lrg_pkt) {
+   skb_shinfo(skb)->gso_size = mss;
+   } else if (offset) {
+   skb_shinfo(skb)->gso_size = ntohs(tcph->check);
+   tcph->check = 0;
+   }
+
+   if (skb_shinfo(skb)->gso_size)
+   skb_shinfo(skb)->gso_segs =
+   DIV_ROUND_UP(skb->len - hdr_len,
+skb_shinfo(skb)->gso_size);
+}
+
 static int ibmveth_poll(struct napi_struct *napi, int budget)
 {
struct ibmveth_adapter *adapter =
@@ -1182,6 +1233,7 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
int frames_processed = 0;
unsigned long lpar_rc;
struct iphdr *iph;
+   u16 mss = 0;
 
 restart_poll:
while (frames_processed < budget) {
@@ -1199,9 +1251,21 @@ static int

[iproute2 net-next 2/8] bpf: export bpf_prog_load

2016-12-10 Thread David Ahern

Code move only; no functional change intended.

Signed-off-by: David Ahern 
---
 include/bpf_util.h |  3 +++
 lib/bpf.c  | 40 
 2 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/include/bpf_util.h b/include/bpf_util.h
index 49b96bbc208f..dcbdca6978d6 100644
--- a/include/bpf_util.h
+++ b/include/bpf_util.h
@@ -75,6 +75,9 @@ int bpf_trace_pipe(void);
 
 void bpf_print_ops(FILE *f, struct rtattr *bpf_ops, __u16 len);
 
+int bpf_prog_load(enum bpf_prog_type type, const struct bpf_insn *insns,
+ size_t size_insns, const char *license, char *log,
+ size_t size_log);
 int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type);
 int bpf_prog_detach(int target_fd, enum bpf_attach_type type);
 
diff --git a/lib/bpf.c b/lib/bpf.c
index 103fc1ef0593..b04c3a678b9c 100644
--- a/lib/bpf.c
+++ b/lib/bpf.c
@@ -871,6 +871,26 @@ int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
return bpf(BPF_PROG_DETACH, &attr, sizeof(attr));
 }
 
+int bpf_prog_load(enum bpf_prog_type type, const struct bpf_insn *insns,
+ size_t size_insns, const char *license, char *log,
+ size_t size_log)
+{
+   union bpf_attr attr = {};
+
+   attr.prog_type = type;
+   attr.insns = bpf_ptr_to_u64(insns);
+   attr.insn_cnt = size_insns / sizeof(struct bpf_insn);
+   attr.license = bpf_ptr_to_u64(license);
+
+   if (size_log > 0) {
+   attr.log_buf = bpf_ptr_to_u64(log);
+   attr.log_size = size_log;
+   attr.log_level = 1;
+   }
+
+   return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
+}
+
 #ifdef HAVE_ELF
 struct bpf_elf_prog {
enum bpf_prog_type  type;
@@ -988,26 +1008,6 @@ static int bpf_map_create(enum bpf_map_type type, uint32_t size_key,
return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
 }
 
-static int bpf_prog_load(enum bpf_prog_type type, const struct bpf_insn *insns,
-size_t size_insns, const char *license, char *log,
-size_t size_log)
-{
-   union bpf_attr attr = {};
-
-   attr.prog_type = type;
-   attr.insns = bpf_ptr_to_u64(insns);
-   attr.insn_cnt = size_insns / sizeof(struct bpf_insn);
-   attr.license = bpf_ptr_to_u64(license);
-
-   if (size_log > 0) {
-   attr.log_buf = bpf_ptr_to_u64(log);
-   attr.log_size = size_log;
-   attr.log_level = 1;
-   }
-
-   return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
-}
-
 static int bpf_obj_pin(int fd, const char *pathname)
 {
union bpf_attr attr = {};
-- 
2.1.4

[iproute2 net-next 4/8] move cmd_exec to lib utils

2016-12-10 Thread David Ahern

Signed-off-by: David Ahern 
---
 include/utils.h |  2 ++
 ip/ipnetns.c| 34 --
 lib/Makefile|  2 +-
 lib/exec.c  | 41 +
 4 files changed, 44 insertions(+), 35 deletions(-)
 create mode 100644 lib/exec.c

diff --git a/include/utils.h b/include/utils.h
index 26c970daa5d0..ac4517a3bde1 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -256,4 +256,6 @@ char *int_to_str(int val, char *buf);
 int get_guid(__u64 *guid, const char *arg);
 int get_real_family(int rtm_type, int rtm_family);
 
+int cmd_exec(const char *cmd, char **argv, bool do_fork);
+
 #endif /* __UTILS_H__ */
diff --git a/ip/ipnetns.c b/ip/ipnetns.c
index bd1e9013706c..db9a541769f1 100644
--- a/ip/ipnetns.c
+++ b/ip/ipnetns.c
@@ -357,40 +357,6 @@ static int netns_list(int argc, char **argv)
return 0;
 }
 
-static int cmd_exec(const char *cmd, char **argv, bool do_fork)
-{
-   fflush(stdout);
-   if (do_fork) {
-   int status;
-   pid_t pid;
-
-   pid = fork();
-   if (pid < 0) {
-   perror("fork");
-   exit(1);
-   }
-
-   if (pid != 0) {
-   /* Parent  */
-   if (waitpid(pid, &status, 0) < 0) {
-   perror("waitpid");
-   exit(1);
-   }
-
-   if (WIFEXITED(status)) {
-   return WEXITSTATUS(status);
-   }
-
-   exit(1);
-   }
-   }
-
-   if (execvp(cmd, argv)  < 0)
-   fprintf(stderr, "exec of \"%s\" failed: %s\n",
-   cmd, strerror(errno));
-   _exit(1);
-}
-
 static int on_netns_exec(char *nsname, void *arg)
 {
char **argv = arg;
diff --git a/lib/Makefile b/lib/Makefile
index 5b7ec169048a..749073261c49 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -8,7 +8,7 @@ CFLAGS += -fPIC
 
 UTILOBJ = utils.o rt_names.o ll_types.o ll_proto.o ll_addr.o \
inet_proto.o namespace.o json_writer.o \
-   names.o color.o bpf.o
+   names.o color.o bpf.o exec.o
 
 NLOBJ=libgenl.o ll_map.o libnetlink.o
 
diff --git a/lib/exec.c b/lib/exec.c
new file mode 100644
index 000000000000..96edbc422e84
--- /dev/null
+++ b/lib/exec.c
@@ -0,0 +1,41 @@
+#define _ATFILE_SOURCE
+#include 
+#include 
+#include 
+#include 
+
+#include "utils.h"
+
+int cmd_exec(const char *cmd, char **argv, bool do_fork)
+{
+   fflush(stdout);
+   if (do_fork) {
+   int status;
+   pid_t pid;
+
+   pid = fork();
+   if (pid < 0) {
+   perror("fork");
+   exit(1);
+   }
+
+   if (pid != 0) {
+   /* Parent  */
+   if (waitpid(pid, &status, 0) < 0) {
+   perror("waitpid");
+   exit(1);
+   }
+
+   if (WIFEXITED(status)) {
+   return WEXITSTATUS(status);
+   }
+
+   exit(1);
+   }
+   }
+
+   if (execvp(cmd, argv)  < 0)
+   fprintf(stderr, "exec of \"%s\" failed: %s\n",
+   cmd, strerror(errno));
+   _exit(1);
+}
-- 
2.1.4
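
The fork/wait/exec flow that cmd_exec() implements can be tried on its
own. What follows is a minimal sketch under stated assumptions, not the
iproute2 code: the name run_and_wait() is hypothetical, it always forks
(there is no do_fork argument), and it returns -1 on fork/wait failure
instead of calling exit(). The 127 exit code for a failed exec is a
shell convention chosen for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork, exec cmd in the child, and return the
 * child's exit status in the parent. Returns -1 if fork/waitpid fails
 * or if the child did not exit normally (e.g. killed by a signal).
 */
static int run_and_wait(const char *cmd, char *const argv[])
{
	int status;
	pid_t pid;

	assert(cmd && argv);

	/* flush first so buffered output is not emitted twice after fork */
	fflush(stdout);

	pid = fork();
	if (pid < 0) {
		perror("fork");
		return -1;
	}

	if (pid == 0) {
		/* Child: execvp() only returns on failure */
		execvp(cmd, argv);
		fprintf(stderr, "exec of \"%s\" failed: %s\n",
			cmd, strerror(errno));
		_exit(127);
	}

	/* Parent */
	if (waitpid(pid, &status, 0) < 0) {
		perror("waitpid");
		return -1;
	}

	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller passes a NULL-terminated argv whose first element is the
command name, for example { "sh", "-c", "exit 3", NULL }, and gets the
child's exit code back.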

[iproute2 net-next 6/8] change name_is_vrf to return index

2016-12-10 Thread David Ahern

index of 0 means name is not a valid vrf.

Signed-off-by: David Ahern 
---
 ip/ip_common.h  |  2 +-
 ip/iplink_vrf.c | 15 +--
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/ip/ip_common.h b/ip/ip_common.h
index 0147f45a7a31..3162f1ca5b2c 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -91,7 +91,7 @@ struct link_util *get_link_kind(const char *kind);
 void br_dump_bridge_id(const struct ifla_bridge_id *id, char *buf, size_t len);
 
 __u32 ipvrf_get_table(const char *name);
-bool name_is_vrf(const char *name);
+int name_is_vrf(const char *name);
 
 #ifndef INFINITY_LIFE_TIME
 #define INFINITY_LIFE_TIME  0xFFFFFFFFU
diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c
index a238b2906805..c101ed770f87 100644
--- a/ip/iplink_vrf.c
+++ b/ip/iplink_vrf.c
@@ -159,7 +159,7 @@ __u32 ipvrf_get_table(const char *name)
return tb_id;
 }
 
-bool name_is_vrf(const char *name)
+int name_is_vrf(const char *name)
 {
struct {
struct nlmsghdr n;
@@ -187,24 +187,27 @@ bool name_is_vrf(const char *name)
addattr_l(&req.n, sizeof(req), IFLA_IFNAME, name, strlen(name) + 1);
 
if (rtnl_talk(&rth, &req.n, &answer.n, sizeof(answer)) < 0)
-   return false;
+   return 0;
 
ifi = NLMSG_DATA(&answer.n);
len = answer.n.nlmsg_len - NLMSG_LENGTH(sizeof(*ifi));
if (len < 0) {
fprintf(stderr, "BUG: Invalid response to link query.\n");
-   return false;
+   return 0;
}
 
parse_rtattr(tb, IFLA_MAX, IFLA_RTA(ifi), len);
 
if (!tb[IFLA_LINKINFO])
-   return false;
+   return 0;
 
parse_rtattr_nested(li, IFLA_INFO_MAX, tb[IFLA_LINKINFO]);
 
if (!li[IFLA_INFO_KIND])
-   return false;
+   return 0;
+
+   if (strcmp(RTA_DATA(li[IFLA_INFO_KIND]), "vrf"))
+   return 0;
 
-   return strcmp(RTA_DATA(li[IFLA_INFO_KIND]), "vrf") == 0;
+   return ifi->ifi_index;
 }
-- 
2.1.4

[iproute2 net-next 8/8] Introduce ip vrf command

2016-12-10 Thread David Ahern

'ip vrf' follows the user semantics established by 'ip netns'.

The 'ip vrf' subcommand supports 3 usages:

1. Run a command against a given vrf:
   ip vrf exec NAME CMD

   Uses the recently committed cgroup/sock BPF option. vrf directory
   is added to cgroup2 mount. Individual vrfs are created under it. BPF
   filter attached to vrf/NAME cgroup2 to set sk_bound_dev_if to the VRF
   device index. From there the current process (ip's pid) is added to
   the cgroup.procs file and the given command is executed. In doing so
   all AF_INET/AF_INET6 (ipv4/ipv6) sockets are automatically bound to
   the VRF domain.

   The association is inherited parent to child allowing the command to
   be a shell from which other commands are run relative to the VRF.

2. Show the VRF a process is bound to:
   ip vrf id
   This command essentially looks at /proc/pid/cgroup for a "::/vrf/"
   entry with the VRF name following.

3. Show process ids bound to a VRF
   ip vrf pids NAME
   This command dumps the file MNT/vrf/NAME/cgroup.procs since that file
   shows the process ids in the particular vrf cgroup.

Signed-off-by: David Ahern 
---
 ip/Makefile   |   3 +-
 ip/ip.c   |   4 +-
 ip/ip_common.h|   2 +
 ip/ipvrf.c| 289 ++
 man/man8/ip-vrf.8 |  88 +
 5 files changed, 384 insertions(+), 2 deletions(-)
 create mode 100644 ip/ipvrf.c
 create mode 100644 man/man8/ip-vrf.8

diff --git a/ip/Makefile b/ip/Makefile
index c8e6c6172741..1928489e7f90 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -7,7 +7,8 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o 
ipnetns.o \
 iplink_vxlan.o tcp_metrics.o iplink_ipoib.o ipnetconf.o link_ip6tnl.o \
 link_iptnl.o link_gre6.o iplink_bond.o iplink_bond_slave.o iplink_hsr.o \
 iplink_bridge.o iplink_bridge_slave.o ipfou.o iplink_ipvlan.o \
-iplink_geneve.o iplink_vrf.o iproute_lwtunnel.o ipmacsec.o ipila.o
+iplink_geneve.o iplink_vrf.o iproute_lwtunnel.o ipmacsec.o ipila.o \
+ipvrf.o
 
 RTMONOBJ=rtmon.o
 
diff --git a/ip/ip.c b/ip/ip.c
index cb3adcb3f57d..07050b07592a 100644
--- a/ip/ip.c
+++ b/ip/ip.c
@@ -51,7 +51,8 @@ static void usage(void)
 "   ip [ -force ] -batch filename\n"
 "where  OBJECT := { link | address | addrlabel | route | rule | neigh | ntable 
|\n"
 "   tunnel | tuntap | maddress | mroute | mrule | monitor | 
xfrm |\n"
-"   netns | l2tp | fou | macsec | tcp_metrics | token | 
netconf | ila }\n"
+"   netns | l2tp | fou | macsec | tcp_metrics | token | 
netconf | ila |\n"
+"   vrf }\n"
 "   OPTIONS := { -V[ersion] | -s[tatistics] | -d[etails] | -r[esolve] |\n"
 "-h[uman-readable] | -iec |\n"
 "-f[amily] { inet | inet6 | ipx | dnet | mpls | bridge | 
link } |\n"
@@ -99,6 +100,7 @@ static const struct cmd {
{ "mrule",  do_multirule },
{ "netns",  do_netns },
{ "netconf",do_ipnetconf },
+   { "vrf",do_ipvrf},
{ "help",   do_help },
{ 0 }
 };
diff --git a/ip/ip_common.h b/ip/ip_common.h
index 3162f1ca5b2c..28763e81e4a4 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -57,6 +57,8 @@ extern int do_ipila(int argc, char **argv);
 int do_tcp_metrics(int argc, char **argv);
 int do_ipnetconf(int argc, char **argv);
 int do_iptoken(int argc, char **argv);
+int do_ipvrf(int argc, char **argv);
+
 int iplink_get(unsigned int flags, char *name, __u32 filt_mask);
 
 static inline int rtm_get_table(struct rtmsg *r, struct rtattr **tb)
diff --git a/ip/ipvrf.c b/ip/ipvrf.c
new file mode 100644
index 000000000000..c4f0e53532e2
--- /dev/null
+++ b/ip/ipvrf.c
@@ -0,0 +1,289 @@
+/*
+ * ipvrf.c "ip vrf"
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:David Ahern 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "rt_names.h"
+#include "utils.h"
+#include "ip_common.h"
+#include "libbpf.h"
+#include "bpf_util.h"
+
+#define CGRP_PROC_FILE  "/cgroup.procs"
+
+static void usage(void)
+{
+   fprintf(stderr, "Usage: ip vrf exec [NAME] cmd ...\n");
+   fprintf(stderr, "   ip vrf identify [PID]\n");
+   fprintf(stderr, "   ip vrf pids [NAME]\n");
+
+   exit(-1);
+}
+
+static int ipvrf_identify(int argc, char **argv)
+{
+   char path[PATH_MAX];
+   char buf[4096];
+   char *vrf, *end;
+   int fd, rc = -1;
+   unsigned int pid;
+   ssize_t n;
+
+   if (argc < 1)
+   pid = getpid();
+   else

[iproute2 net-next 7/8] libnetlink: Add variant of rtnl_talk that does not display RTNETLINK answers error

2016-12-10 Thread David Ahern

iplink_vrf has 2 functions used to validate a user given device name is
a VRF device and to return the table id. If the user string is not a
device name ip commands with a vrf keyword show a confusing error
message: "RTNETLINK answers: No such device".

Add a variant of rtnl_talk that does not display the "RTNETLINK answers"
message and update iplink_vrf to use it.

Signed-off-by: David Ahern 
---
 include/libnetlink.h |  3 +++
 ip/iplink_vrf.c  | 14 +++---
 lib/libnetlink.c | 20 +---
 3 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/include/libnetlink.h b/include/libnetlink.h
index 751ebf186dd4..bd0267dfcc02 100644
--- a/include/libnetlink.h
+++ b/include/libnetlink.h
@@ -81,6 +81,9 @@ int rtnl_dump_filter_nc(struct rtnl_handle *rth,
 int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
  struct nlmsghdr *answer, size_t len)
__attribute__((warn_unused_result));
+int rtnl_talk_suppress_rtnl_errmsg(struct rtnl_handle *rtnl, struct nlmsghdr 
*n,
+  struct nlmsghdr *answer, size_t len)
+   __attribute__((warn_unused_result));
 int rtnl_send(struct rtnl_handle *rth, const void *buf, int)
__attribute__((warn_unused_result));
 int rtnl_send_check(struct rtnl_handle *rth, const void *buf, int)
diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c
index c101ed770f87..917630e85337 100644
--- a/ip/iplink_vrf.c
+++ b/ip/iplink_vrf.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "rt_names.h"
 #include "utils.h"
@@ -126,8 +127,14 @@ __u32 ipvrf_get_table(const char *name)
 
addattr_l(&req.n, sizeof(req), IFLA_IFNAME, name, strlen(name) + 1);
 
-   if (rtnl_talk(&rth, &req.n, &answer.n, sizeof(answer)) < 0)
-   return 0;
+   if (rtnl_talk_suppress_rtnl_errmsg(&rth, &req.n,
+  &answer.n, sizeof(answer)) < 0) {
+   /* special case "default" vrf to be the main table */
+   if (errno == ENODEV && !strcmp(name, "default"))
+   rtnl_rttable_a2n(&tb_id, "main");
+
+   return tb_id;
+   }
 
ifi = NLMSG_DATA(&answer.n);
len = answer.n.nlmsg_len - NLMSG_LENGTH(sizeof(*ifi));
@@ -186,7 +193,8 @@ int name_is_vrf(const char *name)
 
addattr_l(&req.n, sizeof(req), IFLA_IFNAME, name, strlen(name) + 1);
 
-   if (rtnl_talk(&rth, &req.n, &answer.n, sizeof(answer)) < 0)
+   if (rtnl_talk_suppress_rtnl_errmsg(&rth, &req.n,
+  &answer.n, sizeof(answer)) < 0)
return 0;
 
ifi = NLMSG_DATA(&answer.n);
diff --git a/lib/libnetlink.c b/lib/libnetlink.c
index a5db168e50eb..9d7e89aebbd0 100644
--- a/lib/libnetlink.c
+++ b/lib/libnetlink.c
@@ -12,6 +12,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -397,8 +398,9 @@ int rtnl_dump_filter_nc(struct rtnl_handle *rth,
return rtnl_dump_filter_l(rth, a);
 }
 
-int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
- struct nlmsghdr *answer, size_t maxlen)
+static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
+  struct nlmsghdr *answer, size_t maxlen,
+  bool show_rtnl_err)
 {
int status;
unsigned int seq;
@@ -485,7 +487,7 @@ int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
return 0;
}
 
-   if (rtnl->proto != NETLINK_SOCK_DIAG)
+   if (rtnl->proto != NETLINK_SOCK_DIAG && 
show_rtnl_err)
fprintf(stderr,
"RTNETLINK answers: %s\n",
strerror(-err->error));
@@ -517,6 +519,18 @@ int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
}
 }
 
+int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
+ struct nlmsghdr *answer, size_t maxlen)
+{
+   return __rtnl_talk(rtnl, n, answer, maxlen, true);
+}
+
+int rtnl_talk_suppress_rtnl_errmsg(struct rtnl_handle *rtnl, struct nlmsghdr 
*n,
+  struct nlmsghdr *answer, size_t maxlen)
+{
+   return __rtnl_talk(rtnl, n, answer, maxlen, false);
+}
+
 int rtnl_listen_all_nsid(struct rtnl_handle *rth)
 {
unsigned int on = 1;
-- 
2.1.4
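
The shape of this change (one internal worker taking a flag, plus two
thin public wrappers) can be shown without netlink. This is only a
sketch: do_talk(), talk() and talk_quiet() are hypothetical names, the
"request" is reduced to an error code passed in by the caller, and the
messages_printed counter exists purely so the example can be checked.
Setting errno before returning is an assumption here; it is what lets a
caller test for ENODEV the way the ipvrf_get_table() hunk above does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static int messages_printed;	/* bookkeeping for the example only */

/* Hypothetical worker: err is the error the "kernel" answered with,
 * 0 meaning success. The message is printed only when show_err is set.
 */
static int do_talk(int err, bool show_err)
{
	assert(err >= 0);

	if (err == 0)
		return 0;

	if (show_err) {
		fprintf(stderr, "RTNETLINK answers: %s\n", strerror(err));
		messages_printed++;
	}

	errno = err;	/* the caller can still inspect the reason */
	return -1;
}

/* Default behaviour: report the error to the user */
static int talk(int err)
{
	return do_talk(err, true);
}

/* Quiet variant: same result, no message */
static int talk_quiet(int err)
{
	return do_talk(err, false);
}
```

Existing callers keep using the default wrapper unchanged; only callers
that want to print their own, more specific message switch to the quiet
one.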

[iproute2 net-next 1/8] lib bpf: Add support for BPF_PROG_ATTACH and BPF_PROG_DETACH

2016-12-10 Thread David Ahern

For consistency with other bpf commands, the functions are named
bpf_prog_attach and bpf_prog_detach. The existing bpf_prog_attach is
renamed to bpf_prog_load_and_report since it calls bpf_prog_load and
bpf_prog_report.

Signed-off-by: David Ahern 
---
 include/bpf_util.h |  3 +++
 lib/bpf.c  | 31 ++-
 2 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/include/bpf_util.h b/include/bpf_util.h
index 05baeecda57f..49b96bbc208f 100644
--- a/include/bpf_util.h
+++ b/include/bpf_util.h
@@ -75,6 +75,9 @@ int bpf_trace_pipe(void);
 
 void bpf_print_ops(FILE *f, struct rtattr *bpf_ops, __u16 len);
 
+int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type);
+int bpf_prog_detach(int target_fd, enum bpf_attach_type type);
+
 #ifdef HAVE_ELF
 int bpf_send_map_fds(const char *path, const char *obj);
 int bpf_recv_map_fds(const char *path, int *fds, struct bpf_map_aux *aux,
diff --git a/lib/bpf.c b/lib/bpf.c
index 2a8cd51d4dae..103fc1ef0593 100644
--- a/lib/bpf.c
+++ b/lib/bpf.c
@@ -850,6 +850,27 @@ int bpf_graft_map(const char *map_path, uint32_t *key, int 
argc, char **argv)
return ret;
 }
 
+int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
+{
+   union bpf_attr attr = {
+   .target_fd = target_fd,
+   .attach_bpf_fd = prog_fd,
+   .attach_type = type,
+   };
+
+   return bpf(BPF_PROG_ATTACH, &attr, sizeof(attr));
+}
+
+int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
+{
+   union bpf_attr attr = {
+   .target_fd = target_fd,
+   .attach_type = type,
+   };
+
+   return bpf(BPF_PROG_DETACH, &attr, sizeof(attr));
+}
+
 #ifdef HAVE_ELF
 struct bpf_elf_prog {
enum bpf_prog_type  type;
@@ -1262,9 +1283,9 @@ static void bpf_prog_report(int fd, const char *section,
bpf_dump_error(ctx, "Verifier analysis:\n\n");
 }
 
-static int bpf_prog_attach(const char *section,
-  const struct bpf_elf_prog *prog,
-  struct bpf_elf_ctx *ctx)
+static int bpf_prog_load_and_report(const char *section,
+   const struct bpf_elf_prog *prog,
+   struct bpf_elf_ctx *ctx)
 {
int tries = 0, fd;
 retry:
@@ -1656,7 +1677,7 @@ static int bpf_fetch_prog(struct bpf_elf_ctx *ctx, const 
char *section,
prog.size= data.sec_data->d_size;
prog.license = ctx->license;
 
-   fd = bpf_prog_attach(section, &prog, ctx);
+   fd = bpf_prog_load_and_report(section, &prog, ctx);
if (fd < 0)
return fd;
 
@@ -1755,7 +1776,7 @@ static int bpf_fetch_prog_relo(struct bpf_elf_ctx *ctx, 
const char *section,
prog.size= data_insn.sec_data->d_size;
prog.license = ctx->license;
 
-   fd = bpf_prog_attach(section, &prog, ctx);
+   fd = bpf_prog_load_and_report(section, &prog, ctx);
if (fd < 0) {
*lderr = true;
return fd;
-- 
2.1.4

[iproute2 net-next 0/8] Add support for vrf helper

2016-12-10 Thread David Ahern

This series adds support to iproute2 to run a command against a specific
VRF. The user semantics are similar to 'ip netns'.

The 'ip vrf' subcommand supports 3 usages:

1. Run a command against a given vrf:
   ip vrf exec NAME CMD

   Uses the recently committed cgroup/sock BPF option. vrf directory
   is added to cgroup2 mount. Individual vrfs are created under it. BPF
   filter is attached to vrf/NAME cgroup2 to set sk_bound_dev_if to the
   device index of the VRF. From there the current process (ip's pid) is
   added to the cgroup.procs file and the given command is executed. In
   doing so all AF_INET/AF_INET6 (ipv4/ipv6) sockets are automatically
   bound to the VRF domain.

   The association is inherited parent to child allowing the command to
   be a shell from which other commands are run relative to the VRF.

2. Show the VRF a process is bound to:
   ip vrf id
   This command essentially looks at /proc/pid/cgroup for a "::/vrf/"
   entry.

3. Show process ids bound to a VRF
   ip vrf pids NAME
   This command dumps the file MNT/vrf/NAME/cgroup.procs since that file
   shows the process ids in the particular vrf cgroup.

David Ahern (8):
  lib bpf: Add support for BPF_PROG_ATTACH and BPF_PROG_DETACH
  bpf: export bpf_prog_load
  Add libbpf.h header with BPF_ macros
  move cmd_exec to lib utils
  Add filesystem APIs to lib
  change name_is_vrf to return index
  libnetlink: Add variant of rtnl_talk that does not display RTNETLINK
answers error
  Introduce ip vrf command

 include/bpf_util.h   |   6 ++
 include/libbpf.h | 184 
 include/libnetlink.h |   3 +
 include/utils.h  |   4 +
 ip/Makefile  |   3 +-
 ip/ip.c  |   4 +-
 ip/ip_common.h   |   4 +-
 ip/iplink_vrf.c  |  29 --
 ip/ipnetns.c |  34 --
 ip/ipvrf.c   | 289 +++
 lib/Makefile |   2 +-
 lib/bpf.c|  71 -
 lib/exec.c   |  41 
 lib/fs.c | 143 +
 lib/libnetlink.c |  20 +++-
 man/man8/ip-vrf.8|  88 
 16 files changed, 850 insertions(+), 75 deletions(-)
 create mode 100644 include/libbpf.h
 create mode 100644 ip/ipvrf.c
 create mode 100644 lib/exec.c
 create mode 100644 lib/fs.c
 create mode 100644 man/man8/ip-vrf.8

-- 
2.1.4
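
The lookup described under usage 2 amounts to scanning each line of
/proc/pid/cgroup for the "::/vrf/" marker and taking what follows it.
A minimal sketch of that string handling is below. It is not the code
from ip/ipvrf.c: the helper name vrf_from_cgroup_line() is hypothetical,
and ending the name at a '/' or newline is an assumption made for the
example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: if line (one line of /proc/PID/cgroup) contains
 * "::/vrf/", copy the name that follows into name and return 0.
 * Returns -1 if the marker is absent, the name is empty, or the name
 * does not fit in len bytes including the terminating NUL.
 */
static int vrf_from_cgroup_line(const char *line, char *name, size_t len)
{
	static const char marker[] = "::/vrf/";
	const char *p;
	size_t n;

	assert(line && name && len > 0);

	p = strstr(line, marker);
	if (!p)
		return -1;
	p += sizeof(marker) - 1;

	/* assumed: the name ends at a nested path, a newline, or the end */
	n = strcspn(p, "/\n");
	if (n == 0 || n >= len)
		return -1;

	memcpy(name, p, n);
	name[n] = '\0';
	return 0;
}
```

Given the line "0::/vrf/red\n" the helper stores "red"; a line such as
"0::/user.slice\n" is rejected.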

Re: [PATCH/RFC net-next] net: fec: allow "mini jumbo" frames

2016-12-10 Thread Vivien Didelot

Hi Nikita,

Nikita Yushchenko  writes:

> This adds support for MTU slightly larger than default, on modern
> FEC flavours.
>
> Currently FEC driver uses single hardware Rx buffer per frame. On most
> FEC flavours, size of single buffer is limited by 11-bit field, and
> has to be multiple of 64 (in the worst case). Thus maximum usable Rx
> buffer size is 1984 bytes.
>
> Of those:
> - 2 bytes are used for IP header alignment,
> - 14 bytes are used by ethhdr,
> - up to 8 bytes are needed for VLAN and/or DSA tags,
> - 4 bytes are needed for CRC.
>
> Thus maximum MTU possible within current RX architecture is 1956.
>
> This patch allows exactly that. For further increase, Rx architecture
> change is needed.
>
> Use of MTU=1956 gives about 1.5% throughput improvement between two Vybrid
> boards, compared to default MTU=1500.
>
> Signed-off-by: Nikita Yushchenko 

For what it's worth, I have tested your patch on my ZII Rev B boards
(see vf610-zii-dev-rev-b.dts) which have a FEC as the master net device
of their DSA trees. They still work as expected.

Tested-by: Vivien Didelot 

Thanks,

Vivien
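
The buffer-size arithmetic in the quoted commit message can be checked
mechanically. The sketch below assumes the 11-bit field holds a byte
count that is rounded down to the 64-byte granularity; the constant and
function names are hypothetical and are not the FEC driver's macros.

```c
#include <assert.h>

enum {
	RX_LEN_BITS	= 11,	/* width of the Rx buffer size field */
	RX_GRANULARITY	= 64,	/* worst-case size multiple */
	IP_ALIGN_PAD	= 2,	/* IP header alignment */
	ETH_HDR_LEN	= 14,	/* ethhdr */
	TAG_LEN_MAX	= 8,	/* VLAN and/or DSA tags */
	CRC_LEN		= 4,	/* frame check sequence */
};

/* Largest buffer size that fits the field and is a multiple of 64 */
static unsigned int max_rx_buf(void)
{
	unsigned int field_max = (1u << RX_LEN_BITS) - 1;	/* 2047 */

	return field_max - field_max % RX_GRANULARITY;
}

/* What is left for the payload once per-frame overhead is removed */
static unsigned int max_mtu(void)
{
	unsigned int overhead = IP_ALIGN_PAD + ETH_HDR_LEN +
				TAG_LEN_MAX + CRC_LEN;	/* 28 */

	assert(max_rx_buf() > overhead);
	return max_rx_buf() - overhead;
}
```

With these inputs max_rx_buf() is 1984 and max_mtu() is 1956, matching
the figures given in the commit message.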

Re: [PATCH net-next] netfilter: nft_counter: rework atomic dump and reset

2016-12-10 Thread Eric Dumazet

On Sat, 2016-12-10 at 15:25 +0100, Pablo Neira Ayuso wrote:
> On Sat, Dec 10, 2016 at 03:16:55PM +0100, Pablo Neira Ayuso wrote:
=
>  
> -   nft_counter_fetch(priv, &total, reset);
> +   nft_counter_fetch(priv, &total);
> +   if (reset)
> +   nft_counter_reset(priv, &total);
>  
> if (nla_put_be64(skb, NFTA_COUNTER_BYTES,
> cpu_to_be64(total.bytes),
>  NFTA_COUNTER_PAD) ||

Might be nitpicking, but you might reset the stats only if the
nla_put_be64() succeeded.

But regardless of this detail, patch looks good and is very close to the
one I cooked and was about to send this morning.

Thanks Pablo !

Re: [PATCH net-next] netfilter: nft_counter: rework atomic dump and reset

2016-12-10 Thread Pablo Neira Ayuso

On Sat, Dec 10, 2016 at 03:16:55PM +0100, Pablo Neira Ayuso wrote:
> On Sat, Dec 10, 2016 at 03:05:41PM +0100, Pablo Neira Ayuso wrote:
> [...]
> > -static void nft_counter_reset(struct nft_counter_percpu __percpu *counter,
> > - struct nft_counter *total)
> > -{
> > -   struct nft_counter_percpu *cpu_stats;
> > -   u64 bytes, packets;
> > -   unsigned int seq;
> > -   int cpu;
> > -
> > -   memset(total, 0, sizeof(*total));
> > -   for_each_possible_cpu(cpu) {
> > -   bytes = packets = 0;
> > -
> > -   cpu_stats = per_cpu_ptr(counter, cpu);
> > -   do {
> > -   seq = u64_stats_fetch_begin_irq(&cpu_stats->syncp);
> > -   packets += __nft_counter_reset(&cpu_stats->counter.packets);
> > -   bytes   += __nft_counter_reset(&cpu_stats->counter.bytes);
> > -   } while (u64_stats_fetch_retry_irq(&cpu_stats->syncp, seq));
> > -
> > -   total->packets += packets;
> > -   total->bytes += bytes;
> > +   seq = read_seqcount_begin(myseq);
> > +   bytes   = this_cpu->bytes;
> > +   packets = this_cpu->packets;
> > +   } while (read_seqcount_retry(myseq, seq));
> > +
> > +   total->bytes+= bytes;
> > +   total->packets  += packets;
> > +
> > +   if (reset) {
> > +   local_bh_disable();
> > +   this_cpu->packets -= packets;
> > +   this_cpu->bytes -= bytes;
> > +   local_bh_enable();
> > +   }
> 
> Actually this is not right either, Eric proposed this instead:
> 
> static void nft_counter_reset(struct nft_counter_percpu __percpu *counter,
>   struct nft_counter *total)
> {
>   struct nft_counter_percpu *cpu_stats;
> 
>   local_bh_disable();
>   cpu_stats = this_cpu_ptr(counter);
>   cpu_stats->counter.packets -= total->packets;
>   cpu_stats->counter.bytes -= total->bytes;
>   local_bh_enable();
> }
> 
> The cpu that is running the reset code is guaranteed to own these
> stats exclusively, but this is not guaranteed by my patch.
> 
> I'm going to send a v2. I think I need to turn packet and byte
> counters into s64, otherwise a sufficiently large total->packets may
> underflow and confuse stats.

So my plan is to fold this incremental change on this patch and send a
v2.

diff --git a/net/netfilter/nft_counter.c b/net/netfilter/nft_counter.c
index c37983d0a141..5647feb43f43 100644
--- a/net/netfilter/nft_counter.c
+++ b/net/netfilter/nft_counter.c
@@ -18,8 +18,8 @@
 #include 
 
 struct nft_counter {
-   u64 bytes;
-   u64 packets;
+   s64 bytes;
+   s64 packets;
 };
 
 struct nft_counter_percpu_priv {
@@ -102,8 +102,20 @@ static void nft_counter_obj_destroy(struct nft_object *obj)
nft_counter_do_destroy(priv);
 }
 
+static void nft_counter_reset(struct nft_counter_percpu_priv __percpu *priv,
+ struct nft_counter *total)
+{
+   struct nft_counter *this_cpu;
+
+   local_bh_disable();
+   this_cpu = this_cpu_ptr(priv->counter);
+   this_cpu->packets -= total->packets;
+   this_cpu->bytes -= total->bytes;
+   local_bh_enable();
+}
+
 static void nft_counter_fetch(struct nft_counter_percpu_priv *priv,
- struct nft_counter *total, bool reset)
+ struct nft_counter *total)
 {
struct nft_counter *this_cpu;
const seqcount_t *myseq;
@@ -123,13 +135,6 @@ static void nft_counter_fetch(struct 
nft_counter_percpu_priv *priv,
 
total->bytes+= bytes;
total->packets  += packets;
-
-   if (reset) {
-   local_bh_disable();
-   this_cpu->packets -= packets;
-   this_cpu->bytes -= bytes;
-   local_bh_enable();
-   }
}
 }

@@ -139,7 +144,9 @@ static int nft_counter_do_dump(struct sk_buff
*skb,
 {
struct nft_counter total;
 
-   nft_counter_fetch(priv, &total, reset);
+   nft_counter_fetch(priv, &total);
+   if (reset)
+   nft_counter_reset(priv, &total);
 
if (nla_put_be64(skb, NFTA_COUNTER_BYTES,
cpu_to_be64(total.bytes),
 NFTA_COUNTER_PAD) ||
@@ -218,7 +225,7 @@ static int nft_counter_clone(struct nft_expr *dst,
const struct nft_expr *src)
struct nft_counter *this_cpu;
struct nft_counter total;
 
-   nft_counter_fetch(priv, &total, false);
+   nft_counter_fetch(priv, &total);
 
cpu_stats = alloc_percpu_gfp(struct nft_counter, GFP_ATOMIC);
if (cpu_stats == NULL)

Re: [PATCH net-next] netfilter: nft_counter: rework atomic dump and reset

2016-12-10 Thread Pablo Neira Ayuso

On Sat, Dec 10, 2016 at 03:05:41PM +0100, Pablo Neira Ayuso wrote:
[...]
> -static void nft_counter_reset(struct nft_counter_percpu __percpu *counter,
> -   struct nft_counter *total)
> -{
> - struct nft_counter_percpu *cpu_stats;
> - u64 bytes, packets;
> - unsigned int seq;
> - int cpu;
> -
> - memset(total, 0, sizeof(*total));
> - for_each_possible_cpu(cpu) {
> - bytes = packets = 0;
> -
> - cpu_stats = per_cpu_ptr(counter, cpu);
> - do {
> - seq = u64_stats_fetch_begin_irq(&cpu_stats->syncp);
> - packets += __nft_counter_reset(&cpu_stats->counter.packets);
> - bytes   += __nft_counter_reset(&cpu_stats->counter.bytes);
> - } while (u64_stats_fetch_retry_irq(&cpu_stats->syncp, seq));
> -
> - total->packets += packets;
> - total->bytes += bytes;
> + seq = read_seqcount_begin(myseq);
> + bytes   = this_cpu->bytes;
> + packets = this_cpu->packets;
> + } while (read_seqcount_retry(myseq, seq));
> +
> + total->bytes+= bytes;
> + total->packets  += packets;
> +
> + if (reset) {
> + local_bh_disable();
> + this_cpu->packets -= packets;
> + this_cpu->bytes -= bytes;
> + local_bh_enable();
> + }

Actually this is not right either, Eric proposed this instead:

static void nft_counter_reset(struct nft_counter_percpu __percpu *counter,
  struct nft_counter *total)
{
  struct nft_counter_percpu *cpu_stats;

  local_bh_disable();
  cpu_stats = this_cpu_ptr(counter);
  cpu_stats->counter.packets -= total->packets;
  cpu_stats->counter.bytes -= total->bytes;
  local_bh_enable();
}

The cpu that is running the reset code is guaranteed to own these
stats exclusively, but this is not guaranteed by my patch.

I'm going to send a v2. I think I need to turn packet and byte
counters into s64, otherwise a sufficiently large total->packets may
underflow and confuse stats.
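
To illustrate the fetch-then-subtract idea, here is a single-threaded
userspace model. It is a sketch only: the names are hypothetical, a
plain array stands in for the per-cpu counters, and all of the
synchronisation discussed in this thread (local_bh_disable(), the
seqcount) is deliberately left out. It shows that subtracting the
fetched total from one slot can leave that slot negative while the sum
over all slots stays correct, which is why signed fields are used in
the model.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NR_SLOTS 4	/* stands in for the number of possible CPUs */

struct counter {
	int64_t bytes;
	int64_t packets;
};

/* "fetch": sum every per-slot counter into total */
static void counter_fetch(const struct counter *slots, struct counter *total)
{
	int i;

	assert(slots && total);

	memset(total, 0, sizeof(*total));
	for (i = 0; i < NR_SLOTS; i++) {
		total->bytes += slots[i].bytes;
		total->packets += slots[i].packets;
	}
}

/* "reset": subtract the fetched total from a single slot. That slot
 * may become negative, but the sum over all slots drops by exactly
 * the amount that was reported.
 */
static void counter_reset(struct counter *slot, const struct counter *total)
{
	assert(slot && total);

	slot->bytes -= total->bytes;
	slot->packets -= total->packets;
}
```

Traffic counted on any slot after the fetch is not lost: a later fetch
reports only what arrived since the reset.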

[PATCH net-next] netfilter: nft_counter: rework atomic dump and reset

2016-12-10 Thread Pablo Neira Ayuso

Dump and reset doesn't work unless cmpxchg64() is used from both
packet and control plane paths. This approach is going to be slow
though. Instead, use a percpu seqcount to fetch counters consistently,
then subtract bytes and packets in case a reset was requested.

This patch is based on original sketch from Eric Dumazet.

Suggested-by: Eric Dumazet 
Fixes: 43da04a593d8 ("netfilter: nf_tables: atomic dump and reset for stateful 
objects")
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nft_counter.c | 128 ++--
 1 file changed, 51 insertions(+), 77 deletions(-)

diff --git a/net/netfilter/nft_counter.c b/net/netfilter/nft_counter.c
index f6a02c5071c2..c37983d0a141 100644
--- a/net/netfilter/nft_counter.c
+++ b/net/netfilter/nft_counter.c
@@ -22,27 +22,29 @@ struct nft_counter {
u64 packets;
 };
 
-struct nft_counter_percpu {
-   struct nft_counter  counter;
-   struct u64_stats_sync   syncp;
-};
-
 struct nft_counter_percpu_priv {
-   struct nft_counter_percpu __percpu *counter;
+   struct nft_counter __percpu *counter;
 };
 
+static DEFINE_PER_CPU(seqcount_t, nft_counter_seq);
+
 static inline void nft_counter_do_eval(struct nft_counter_percpu_priv *priv,
   struct nft_regs *regs,
   const struct nft_pktinfo *pkt)
 {
-   struct nft_counter_percpu *this_cpu;
+   struct nft_counter *this_cpu;
+   seqcount_t *myseq;
 
local_bh_disable();
this_cpu = this_cpu_ptr(priv->counter);
-   u64_stats_update_begin(&this_cpu->syncp);
-   this_cpu->counter.bytes += pkt->skb->len;
-   this_cpu->counter.packets++;
-   u64_stats_update_end(&this_cpu->syncp);
+   myseq = this_cpu_ptr(&nft_counter_seq);
+
+   write_seqcount_begin(myseq);
+
+   this_cpu->bytes += pkt->skb->len;
+   this_cpu->packets++;
+
+   write_seqcount_end(myseq);
local_bh_enable();
 }
 
@@ -58,21 +60,21 @@ static inline void nft_counter_obj_eval(struct nft_object 
*obj,
 static int nft_counter_do_init(const struct nlattr * const tb[],
   struct nft_counter_percpu_priv *priv)
 {
-   struct nft_counter_percpu __percpu *cpu_stats;
-   struct nft_counter_percpu *this_cpu;
+   struct nft_counter __percpu *cpu_stats;
+   struct nft_counter *this_cpu;
 
-   cpu_stats = netdev_alloc_pcpu_stats(struct nft_counter_percpu);
+   cpu_stats = alloc_percpu(struct nft_counter);
if (cpu_stats == NULL)
return -ENOMEM;
 
preempt_disable();
this_cpu = this_cpu_ptr(cpu_stats);
if (tb[NFTA_COUNTER_PACKETS]) {
-   this_cpu->counter.packets =
+   this_cpu->packets =
be64_to_cpu(nla_get_be64(tb[NFTA_COUNTER_PACKETS]));
}
if (tb[NFTA_COUNTER_BYTES]) {
-   this_cpu->counter.bytes =
+   this_cpu->bytes =
be64_to_cpu(nla_get_be64(tb[NFTA_COUNTER_BYTES]));
}
preempt_enable();
@@ -100,74 +102,44 @@ static void nft_counter_obj_destroy(struct nft_object 
*obj)
nft_counter_do_destroy(priv);
 }
 
-static void nft_counter_fetch(struct nft_counter_percpu __percpu *counter,
- struct nft_counter *total)
+static void nft_counter_fetch(struct nft_counter_percpu_priv *priv,
+ struct nft_counter *total, bool reset)
 {
-   struct nft_counter_percpu *cpu_stats;
+   struct nft_counter *this_cpu;
+   const seqcount_t *myseq;
u64 bytes, packets;
unsigned int seq;
int cpu;
 
memset(total, 0, sizeof(*total));
for_each_possible_cpu(cpu) {
-   cpu_stats = per_cpu_ptr(counter, cpu);
+   myseq = per_cpu_ptr(&nft_counter_seq, cpu);
+   this_cpu = per_cpu_ptr(priv->counter, cpu);
do {
-   seq = u64_stats_fetch_begin_irq(&cpu_stats->syncp);
-   bytes   = cpu_stats->counter.bytes;
-   packets = cpu_stats->counter.packets;
-   } while (u64_stats_fetch_retry_irq(&cpu_stats->syncp, seq));
-
-   total->packets += packets;
-   total->bytes += bytes;
-   }
-}
-
-static u64 __nft_counter_reset(u64 *counter)
-{
-   u64 ret, old;
-
-   do {
-   old = *counter;
-   ret = cmpxchg64(counter, old, 0);
-   } while (ret != old);
-
-   return ret;
-}
-
-static void nft_counter_reset(struct nft_counter_percpu __percpu *counter,
- struct nft_counter *total)
-{
-   struct nft_counter_percpu *cpu_stats;
-   u64 bytes, packets;
-   unsigned int seq;
-   int cpu;
-
-   memset(total, 0, sizeof(*total));
-   for_each_possible_cpu(cpu) {
-   bytes = packets = 0;
-
-   cpu_stats =

[PATCH] net: nicvf: use new api ethtool_{get|set}_link_ksettings

2016-12-10 Thread Philippe Reynes

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes 
---
 .../net/ethernet/cavium/thunder/nicvf_ethtool.c|   56 +++-
 1 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c 
b/drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c
index b048241..2e74bba 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c
@@ -116,33 +116,34 @@ struct nicvf_stat {
 static const unsigned int nicvf_n_drv_stats = ARRAY_SIZE(nicvf_drv_stats);
 static const unsigned int nicvf_n_queue_stats = ARRAY_SIZE(nicvf_queue_stats);
 
-static int nicvf_get_settings(struct net_device *netdev,
- struct ethtool_cmd *cmd)
+static int nicvf_get_link_ksettings(struct net_device *netdev,
+   struct ethtool_link_ksettings *cmd)
 {
struct nicvf *nic = netdev_priv(netdev);
+   u32 supported, advertising;
 
-   cmd->supported = 0;
-   cmd->transceiver = XCVR_EXTERNAL;
+   supported = 0;
+   advertising = 0;
 
if (!nic->link_up) {
-   cmd->duplex = DUPLEX_UNKNOWN;
-   ethtool_cmd_speed_set(cmd, SPEED_UNKNOWN);
+   cmd->base.duplex = DUPLEX_UNKNOWN;
+   cmd->base.speed = SPEED_UNKNOWN;
return 0;
}
 
switch (nic->speed) {
case SPEED_1000:
-   cmd->port = PORT_MII | PORT_TP;
-   cmd->autoneg = AUTONEG_ENABLE;
-   cmd->supported |= SUPPORTED_MII | SUPPORTED_TP;
-   cmd->supported |= SUPPORTED_1000baseT_Full |
+   cmd->base.port = PORT_MII | PORT_TP;
+   cmd->base.autoneg = AUTONEG_ENABLE;
+   supported |= SUPPORTED_MII | SUPPORTED_TP;
+   supported |= SUPPORTED_1000baseT_Full |
  SUPPORTED_1000baseT_Half |
  SUPPORTED_100baseT_Full  |
  SUPPORTED_100baseT_Half  |
  SUPPORTED_10baseT_Full   |
  SUPPORTED_10baseT_Half;
-   cmd->supported |= SUPPORTED_Autoneg;
-   cmd->advertising |= ADVERTISED_1000baseT_Full |
+   supported |= SUPPORTED_Autoneg;
+   advertising |= ADVERTISED_1000baseT_Full |
ADVERTISED_1000baseT_Half |
ADVERTISED_100baseT_Full  |
ADVERTISED_100baseT_Half  |
@@ -151,24 +152,29 @@ static int nicvf_get_settings(struct net_device *netdev,
break;
case SPEED_10000:
if (nic->mac_type == BGX_MODE_RXAUI) {
-   cmd->port = PORT_TP;
-   cmd->supported |= SUPPORTED_TP;
+   cmd->base.port = PORT_TP;
+   supported |= SUPPORTED_TP;
} else {
-   cmd->port = PORT_FIBRE;
-   cmd->supported |= SUPPORTED_FIBRE;
+   cmd->base.port = PORT_FIBRE;
+   supported |= SUPPORTED_FIBRE;
}
-   cmd->autoneg = AUTONEG_DISABLE;
-   cmd->supported |= SUPPORTED_10000baseT_Full;
+   cmd->base.autoneg = AUTONEG_DISABLE;
+   supported |= SUPPORTED_10000baseT_Full;
break;
case SPEED_40000:
-   cmd->port = PORT_FIBRE;
-   cmd->autoneg = AUTONEG_DISABLE;
-   cmd->supported |= SUPPORTED_FIBRE;
-   cmd->supported |= SUPPORTED_40000baseCR4_Full;
+   cmd->base.port = PORT_FIBRE;
+   cmd->base.autoneg = AUTONEG_DISABLE;
+   supported |= SUPPORTED_FIBRE;
+   supported |= SUPPORTED_40000baseCR4_Full;
break;
}
-   cmd->duplex = nic->duplex;
-   ethtool_cmd_speed_set(cmd, nic->speed);
+   cmd->base.duplex = nic->duplex;
+   cmd->base.speed = nic->speed;
+
+   ethtool_convert_legacy_u32_to_link_mode(cmd->link_modes.supported,
+   supported);
+   ethtool_convert_legacy_u32_to_link_mode(cmd->link_modes.advertising,
+   advertising);
 
return 0;
 }
@@ -770,7 +776,6 @@ static int nicvf_set_pauseparam(struct net_device *dev,
 }
 
 static const struct ethtool_ops nicvf_ethtool_ops = {
-   .get_settings   = nicvf_get_settings,
.get_link   = nicvf_get_link,
.get_drvinfo    = nicvf_get_drvinfo,
.get_msglevel   = nicvf_get_msglevel,
@@ -793,6 +798,7 @@ static int nicvf_set_pauseparam(struct net_device *dev,
.get_pauseparam = nicvf_get_pauseparam,

Re: [PATCH V2 18/22] bnxt_re: Support for DCB

2016-12-10 Thread Or Gerlitz

On Fri, Dec 9, 2016 at 8:48 AM, Selvin Xavier
 wrote:
> This patch queries the configured RoCE APP Priority on the host
> using the dcbnl API and programs the RoCE FW with the corresponding
> Traffic Class(es) for the priority.

> +#define BNXT_RE_ROCE_V1_ETH_TYPE   0x8915
> +#define BNXT_RE_ROCE_V2_PORT_NO    4791

I believe these two are defined already, try # git grep on each under include

Re: [PATCH net-next 2/5] liquidio VF vxlan

2016-12-10 Thread Or Gerlitz

On Fri, Dec 9, 2016 at 12:42 AM, Vatsavayi, Raghu
 wrote:
>> From: Or Gerlitz [mailto:gerlitz...@gmail.com]
>> On Thu, Dec 8, 2016 at 11:00 PM, Raghu Vatsavayi
>>  wrote:

>>> Adds VF vxlan offload support.

>> What's the use case for that? a VM running a VTEP, isn't that part needs to
>> run @ the host?

> Our HW can support offloads for VF which is required if we load it on 
> Hypervisor.


+   nctrl.ncmd.u64 = 0;
+   nctrl.ncmd.s.cmd = command;
+   nctrl.ncmd.s.more = vxlan_cmd_bit;
+   nctrl.ncmd.s.param1 = vxlan_port;
+   nctrl.iq_no = lio->linfo.txpciq[0].s.q_no;
+   nctrl.wait_time = 100;
+   nctrl.netpndev = (u64)netdev;
+   nctrl.cb_fn = liquidio_link_ctrl_cmd_completion;
+
+   ret = octnet_send_nic_ctrl_pkt(lio->oct_dev, &nctrl);

1. What happens if more than one VF runs this code, each with a different
port? Who wins? Is the result well defined?

2. Does octnet_send_nic_ctrl_pkt() go to sleep? That is disallowed here.

Or.

Re: Misalignment, MIPS, and ip_hdr(skb)->version

2016-12-10 Thread Måns Rullgård

Felix Fietkau  writes:

> On 2016-12-07 19:54, Jason A. Donenfeld wrote:
>> On Wed, Dec 7, 2016 at 7:51 PM, David Miller  wrote:
>>> It's so much better to analyze properly where the misalignment comes from
>>> and address it at the source, as we have for various cases that trip up
>>> Sparc too.
>> 
>> That's sort of my attitude too, hence starting this thread. Any
>> pointers you have about this would be most welcome, so as not to
>> perpetuate what already seems like an issue in other parts of the
>> stack.
> Hi Jason,
>
> I'm the author of that hackish LEDE/OpenWrt patch that works around the
> misalignment issues. Here's some context regarding that patch:
>
> I intentionally put it in the target specific patches for only one of
> our MIPS targets. There are a few ar71xx devices where the misalignment
> cannot be fixed, because the Ethernet MAC has a 4-byte DMA alignment
> requirement, and does not support inserting 2 bytes of padding to
> correct the IP header misalignment.
>
> With these limitations the choice was between this ugly network stack
> patch or inserting a very expensive memmove in the data path (which is
> better than taking the mis-alignment traps, but still hurts routing
> performance significantly).

I solved this problem in an Ethernet driver by copying the initial part
of the packet to an aligned skb and appending the remainder using
skb_add_rx_frag().  The kernel network stack only cares about the
headers, so the alignment of the packet payload doesn't matter.

-- 
Måns Rullgård
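
A userspace model of the header-copy approach described above might look like the sketch below. The `demo_*` names, the 64-byte copy length and the buffer layout are assumptions made for illustration; a real driver would use skb allocation and skb_add_rx_frag() instead of raw pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ETH_HLEN 14	/* Ethernet header length */
#define DEMO_IP_ALIGN 2		/* pad so the IP header lands on a 4-byte boundary */
#define DEMO_HDR_COPY 64	/* assumed number of leading bytes worth copying */

struct demo_pkt {
	_Alignas(4) uint8_t head[DEMO_IP_ALIGN + DEMO_HDR_COPY];
	size_t head_len;	/* bytes copied into head */
	const uint8_t *frag;	/* remainder, left where the hardware wrote it */
	size_t frag_len;
};

/* Start of the copied headers inside the aligned buffer. */
static uint8_t *demo_head_data(struct demo_pkt *pkt)
{
	return pkt->head + DEMO_IP_ALIGN;
}

/* Copy only the leading bytes; reference the payload without moving it. */
static void demo_split_rx(struct demo_pkt *pkt, const uint8_t *rx, size_t len)
{
	pkt->head_len = len < DEMO_HDR_COPY ? len : DEMO_HDR_COPY;
	memcpy(demo_head_data(pkt), rx, pkt->head_len);
	pkt->frag = rx + pkt->head_len;
	pkt->frag_len = len - pkt->head_len;
}
```

The cost is a fixed small copy per packet rather than a memmove of the whole frame, which is the trade-off the message above relies on.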

Re: [PATCH] ARM: add cmpxchg64 helper for ARMv7-M

2016-12-10 Thread Pablo Neira Ayuso

Hi Arnd,

On Sat, Dec 10, 2016 at 11:36:34AM +0100, Arnd Bergmann wrote:
> A change to the netfilter code in net-next introduced the first caller of
> cmpxchg64 that can get built on ARMv7-M, leading to an error from the
> assembler that points out the lack of 64-bit atomics on this architecture:
> 
> /tmp/ccMe7djj.s: Assembler messages:
> /tmp/ccMe7djj.s:367: Error: selected processor does not support `ldrexd 
> r0,r1,[lr]' in Thumb mode
> /tmp/ccMe7djj.s:371: Error: selected processor does not support `strexd 
> ip,r2,r3,[lr]' in Thumb mode
> /tmp/ccMe7djj.s:389: Error: selected processor does not support `ldrexd 
> r8,r9,[r7]' in Thumb mode
> /tmp/ccMe7djj.s:393: Error: selected processor does not support `strexd 
> lr,r0,r1,[r7]' in Thumb mode
> scripts/Makefile.build:299: recipe for target 'net/netfilter/nft_counter.o' 
> failed
> 
> This makes ARMv7-M use the same emulation from asm-generic/cmpxchg-local.h
> that we use on architectures earlier than ARMv6K, to fix the build. The
> 32-bit atomics are available on ARMv7-M and we keep using them there.
> This ARM specific change is probably something we should do regardless
> of the netfilter code.
> 
> However, looking at the new nft_counter_reset() function in nft_counter.c,
> this looks incorrect to me not just on ARMv7-M but also on other
> architectures, with at least the following possible race:

Right, Eric Dumazet already spotted this problem. I'm preparing a
patch that doesn't require cmpxchg64(). Will keep you on Cc. Thanks.

Re: Misalignment, MIPS, and ip_hdr(skb)->version

2016-12-10 Thread Felix Fietkau

On 2016-12-07 19:54, Jason A. Donenfeld wrote:
> On Wed, Dec 7, 2016 at 7:51 PM, David Miller  wrote:
>> It's so much better to analyze properly where the misalignment comes from
>> and address it at the source, as we have for various cases that trip up
>> Sparc too.
> 
> That's sort of my attitude too, hence starting this thread. Any
> pointers you have about this would be most welcome, so as not to
> perpetuate what already seems like an issue in other parts of the
> stack.
Hi Jason,

I'm the author of that hackish LEDE/OpenWrt patch that works around the
misalignment issues. Here's some context regarding that patch:

I intentionally put it in the target specific patches for only one of
our MIPS targets. There are a few ar71xx devices where the misalignment
cannot be fixed, because the Ethernet MAC has a 4-byte DMA alignment
requirement, and does not support inserting 2 bytes of padding to
correct the IP header misalignment.

With these limitations the choice was between this ugly network stack
patch or inserting a very expensive memmove in the data path (which is
better than taking the mis-alignment traps, but still hurts routing
performance significantly).

There are a lot of places in the network stack that assume full 32 bit
alignment, and you only get to see those once you start using more of
netfilter, play with various tunnel encapsulations, etc.

I think you have 3 options to deal with this properly:
1. add 3 bytes of padding
2. allocate a separate skb for decryption (might be more expensive)
3. save the header and decrypt to the start of the packet data
(overwriting the misaligned header).

I'm not sure what the performance impact of 2 and 3 is, so it's probably
best to stick with the padding.

I've taken a quick look at the wireguard message headers, and my
recommendation would be to insert the 3-byte padding in struct
message_header and remove __packed from your structs.
This will also remove misalignment of your own protocol fields.

- Felix
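
The effect of the padding recommendation can be seen in a small layout comparison. The structs below are hypothetical and are not WireGuard's actual message definitions; `__attribute__((packed))` assumes GCC or Clang. The zero check follows the advice given elsewhere in this thread to keep reserved bytes at 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packed layout: a 1-byte type pushes later fields off alignment. */
struct demo_msg_packed {
	uint8_t  type;
	uint32_t receiver_index;
	uint64_t counter;
} __attribute__((packed));

/* Same fields with 3 reserved bytes and no packing attribute. */
struct demo_msg_padded {
	uint8_t  type;
	uint8_t  reserved[3];	/* expected to be zero */
	uint32_t receiver_index;
	uint64_t counter;
};

_Static_assert(offsetof(struct demo_msg_packed, receiver_index) == 1,
	       "packed layout leaves the 32-bit field misaligned");
_Static_assert(offsetof(struct demo_msg_padded, receiver_index) == 4,
	       "padded layout aligns the 32-bit field");
_Static_assert(offsetof(struct demo_msg_padded, counter) == 8,
	       "padded layout aligns the 64-bit field");

/* Receiver-side check that the reserved bytes are all zero. */
static int demo_reserved_is_zero(const struct demo_msg_padded *m)
{
	return !(m->reserved[0] | m->reserved[1] | m->reserved[2]);
}
```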

[PATCH 1/5] net: ethernet: ti: cpsw: use same macros to get active slave

2016-12-10 Thread Ivan Khoronzhuk

Use the same, more convenient macro to get the active slave.

Signed-off-by: Ivan Khoronzhuk 
---
 drivers/net/ethernet/ti/cpsw.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index b62d958..c45f7d2 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -1624,10 +1624,7 @@ static void cpsw_hwtstamp_v2(struct cpsw_priv *priv)
struct cpsw_common *cpsw = priv->cpsw;
u32 ctrl, mtype;
 
-   if (cpsw->data.dual_emac)
-   slave = &cpsw->slaves[priv->emac_port];
-   else
-   slave = &cpsw->slaves[cpsw->data.active_slave];
+   slave = &cpsw->slaves[cpsw_slave_index(cpsw, priv)];
 
ctrl = slave_read(slave, CPSW2_CONTROL);
switch (cpsw->version) {
-- 
2.7.4

[PATCH 2/5] net: ethernet: ti: cpsw: don't start queue twice

2016-12-10 Thread Ivan Khoronzhuk

There is no need to start the queues after cpsw is started, as that is
done in cpsw_adjust_link() after the phy connection.

Signed-off-by: Ivan Khoronzhuk 
---
 drivers/net/ethernet/ti/cpsw.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index c45f7d2..23213a3 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -1506,8 +1506,6 @@ static int cpsw_ndo_open(struct net_device *ndev)
if (cpsw->data.dual_emac)
cpsw->slaves[priv->emac_port].open_stat = true;
 
-   netif_tx_start_all_queues(ndev);
-
return 0;
 
 err_cleanup:
-- 
2.7.4

[PATCH 3/5] net: ethernet: ti: cpsw: combine budget and weight split and check

2016-12-10 Thread Ivan Khoronzhuk

Re-split weight along with budget. This simplifies the code a little
and updates state after every rate change. It is also necessary
to move the argument checks to this combined function. Replace the
per-interface maximum rate check with a check against the maximum
possible rate.
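
The proportional split described above can be modelled in userspace as in the sketch below. It follows the arithmetic visible in this patch but is not the driver code: the `demo_*` names are hypothetical, the poll weight of 64 is an assumption, and the handing of any leftover budget to the fastest channel is omitted.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_POLL_WEIGHT 64	/* assumed stand-in for CPSW_POLL_WEIGHT */

struct demo_chan {
	uint32_t rate;		/* 0 means the channel is not rate limited */
	int budget;
	int weight;
};

/* Split budget and weight over n channels; max_rate is the link rate. */
static void demo_split_res(struct demo_chan *ch, int n, uint32_t max_rate)
{
	uint32_t consumed = 0;
	int i, rlim = 0, free_budget = 0;

	assert(n > 0);

	for (i = 0; i < n; i++) {
		if (!ch[i].rate)
			continue;
		rlim++;
		consumed += ch[i].rate;
	}

	if (rlim == n) {
		/* all channels limited: split relative to their sum */
		max_rate = consumed;
	} else if (!rlim) {
		/* no channel limited: equal shares */
		free_budget = DEMO_POLL_WEIGHT / n;
	} else {
		/* mixed: unlimited channels share what the limited ones leave */
		free_budget = (int)((consumed * DEMO_POLL_WEIGHT) / max_rate);
		free_budget = (DEMO_POLL_WEIGHT - free_budget) / (n - rlim);
	}

	for (i = 0; i < n; i++) {
		if (ch[i].rate) {
			ch[i].budget = (int)((ch[i].rate * DEMO_POLL_WEIGHT) / max_rate);
			if (!ch[i].budget)
				ch[i].budget = 1;
			ch[i].weight = (int)((ch[i].rate * 100) / max_rate);
			if (!ch[i].weight)
				ch[i].weight = 1;
		} else {
			ch[i].budget = free_budget;
			ch[i].weight = 0;
		}
	}
}
```

With one channel limited to a tenth of the link rate and two unlimited channels, the limited channel gets a budget of 6 and a weight of 10, and the other two get 29 each.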

Signed-off-by: Ivan Khoronzhuk 
---
 drivers/net/ethernet/ti/cpsw.c | 107 +
 1 file changed, 34 insertions(+), 73 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 23213a3..a2c2c06 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -753,27 +753,18 @@ static void cpsw_rx_handler(void *token, int len, int 
status)
dev_kfree_skb_any(new_skb);
 }
 
-/* split budget depending on channel rates */
-static void cpsw_split_budget(struct net_device *ndev)
+static void cpsw_split_res(struct net_device *ndev)
 {
struct cpsw_priv *priv = netdev_priv(ndev);
+   u32 consumed_rate = 0, bigest_rate = 0;
struct cpsw_common *cpsw = priv->cpsw;
struct cpsw_vector *txv = cpsw->txv;
-   u32 consumed_rate, bigest_rate = 0;
+   int i, ch_weight, rlim_ch_num = 0;
int budget, bigest_rate_ch = 0;
struct cpsw_slave *slave;
-   int i, rlim_ch_num = 0;
u32 ch_rate, max_rate;
int ch_budget = 0;
 
-   if (cpsw->data.dual_emac)
-   slave = &cpsw->slaves[priv->emac_port];
-   else
-   slave = &cpsw->slaves[cpsw->data.active_slave];
-
-   max_rate = slave->phy->speed * 1000;
-
-   consumed_rate = 0;
for (i = 0; i < cpsw->tx_ch_num; i++) {
ch_rate = cpdma_chan_get_rate(txv[i].ch);
if (!ch_rate)
@@ -785,7 +776,14 @@ static void cpsw_split_budget(struct net_device *ndev)
 
if (cpsw->tx_ch_num == rlim_ch_num) {
max_rate = consumed_rate;
+   } else if (!rlim_ch_num) {
+   ch_budget = CPSW_POLL_WEIGHT / cpsw->tx_ch_num;
+   bigest_rate = 0;
+   max_rate = consumed_rate;
} else {
+   slave = &cpsw->slaves[cpsw_slave_index(cpsw, priv)];
+   max_rate = slave->phy->speed * 1000;
+
ch_budget = (consumed_rate * CPSW_POLL_WEIGHT) / max_rate;
ch_budget = (CPSW_POLL_WEIGHT - ch_budget) /
(cpsw->tx_ch_num - rlim_ch_num);
@@ -793,22 +791,28 @@ static void cpsw_split_budget(struct net_device *ndev)
  (cpsw->tx_ch_num - rlim_ch_num);
}
 
-   /* split tx budget */
+   /* split tx weight/budget */
budget = CPSW_POLL_WEIGHT;
for (i = 0; i < cpsw->tx_ch_num; i++) {
ch_rate = cpdma_chan_get_rate(txv[i].ch);
if (ch_rate) {
txv[i].budget = (ch_rate * CPSW_POLL_WEIGHT) / max_rate;
if (!txv[i].budget)
-   txv[i].budget = 1;
+   txv[i].budget++;
if (ch_rate > bigest_rate) {
bigest_rate_ch = i;
bigest_rate = ch_rate;
}
+
+   ch_weight = (ch_rate * 100) / max_rate;
+   if (!ch_weight)
+   ch_weight++;
+   cpdma_chan_set_weight(cpsw->txv[i].ch, ch_weight);
} else {
txv[i].budget = ch_budget;
if (!bigest_rate_ch)
bigest_rate_ch = i;
+   cpdma_chan_set_weight(cpsw->txv[i].ch, 0);
}
 
budget -= txv[i].budget;
@@ -1017,7 +1021,7 @@ static void cpsw_adjust_link(struct net_device *ndev)
for_each_slave(priv, _cpsw_adjust_link, priv, &link);
 
if (link) {
-   cpsw_split_budget(priv->ndev);
+   cpsw_split_res(priv->ndev);
netif_carrier_on(ndev);
if (netif_running(ndev))
netif_tx_wake_all_queues(ndev);
@@ -1962,64 +1966,25 @@ static int cpsw_ndo_vlan_rx_kill_vid(struct net_device 
*ndev,
 static int cpsw_ndo_set_tx_maxrate(struct net_device *ndev, int queue, u32 
rate)
 {
struct cpsw_priv *priv = netdev_priv(ndev);
-   int tx_ch_num = ndev->real_num_tx_queues;
-   u32 consumed_rate, min_rate, max_rate;
struct cpsw_common *cpsw = priv->cpsw;
-   struct cpsw_slave *slave;
-   int ret, i, weight;
-   int rlim_num = 0;
+   u32 min_rate;
u32 ch_rate;
+   int ret;
 
ch_rate = netdev_get_tx_queue(ndev, queue)->tx_maxrate;
if (ch_rate == rate)
return 0;
 
-   if (cpsw->data.dual_emac)
-   slave = &cpsw->slaves[priv->emac_port];
-   else
-   slave = &cpsw->slaves[cpsw->data.active_slave];
-   max_rate = slave->phy->speed;
-
-   consumed_rate = 0;
-   for (i = 0; i

[PATCH 1/5] net: ethernet: ti: cpsw: improve re-split policy

2016-12-10 Thread Ivan Khoronzhuk

This series adds several simplifications and improvements to setting the
maximum rate for channels, taking into account switch and dual emac mode.

Don't re-split res in the following cases:
- speed of phys is not changed
- speed of phys is changed and no rate limited channels
- speed of phys is changed and all channels are rate limited
- phy is unlinked while dev is open
- phy is linked back but speed is not changed

The maximum speed is the sum of the speeds of the "linked" phys, so res
are split taking both interfaces into account, in dual emac mode as well
as in switch mode.

Tested on am572x

Based on net-next/master

Ivan Khoronzhuk (5):
  net: ethernet: ti: cpsw: use same macros to get active slave
  net: ethernet: ti: cpsw: don't start queue twice
  net: ethernet: ti: cpsw: combine budget and weight split and check
  net: ethernet: ti: cpsw: re-split res only when speed is changed
  net: ethernet: ti: cpsw: sync rates for channels in dual emac mode

 drivers/net/ethernet/ti/cpsw.c | 178 +++--
 1 file changed, 99 insertions(+), 79 deletions(-)

-- 
2.7.4

[PATCH 4/5] net: ethernet: ti: cpsw: re-split res only when speed is changed

2016-12-10 Thread Ivan Khoronzhuk

Don't re-split res in the following cases:
- speed of phys is not changed
- speed of phys is changed and no rate limited channels
- speed of phys is changed and all channels are rate limited
- phy is unlinked while dev is open
- phy is linked back but speed is not changed

The maximum speed is the sum of the speeds of the "linked" phys, so res
are split taking both interfaces into account, in dual emac mode as well
as in switch mode.

Signed-off-by: Ivan Khoronzhuk 
---
 drivers/net/ethernet/ti/cpsw.c | 64 ++
 1 file changed, 59 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index a2c2c06..7ccfa63 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -394,6 +394,7 @@ struct cpsw_common {
u32 irqs_table[IRQ_NUM];
struct cpts *cpts;
int rx_ch_num, tx_ch_num;
+   int speed;
 };
 
 struct cpsw_priv {
@@ -761,7 +762,6 @@ static void cpsw_split_res(struct net_device *ndev)
struct cpsw_vector *txv = cpsw->txv;
int i, ch_weight, rlim_ch_num = 0;
int budget, bigest_rate_ch = 0;
-   struct cpsw_slave *slave;
u32 ch_rate, max_rate;
int ch_budget = 0;
 
@@ -781,8 +781,16 @@ static void cpsw_split_res(struct net_device *ndev)
bigest_rate = 0;
max_rate = consumed_rate;
} else {
-   slave = &cpsw->slaves[cpsw_slave_index(cpsw, priv)];
-   max_rate = slave->phy->speed * 1000;
+   max_rate = cpsw->speed * 1000;
+
+   /* if max_rate is less than expected due to reduced link speed,
+* split proportionally according to the next potential max speed
+*/
+   if (max_rate < consumed_rate)
+   max_rate *= 10;
+
+   if (max_rate < consumed_rate)
+   max_rate *= 10;
 
ch_budget = (consumed_rate * CPSW_POLL_WEIGHT) / max_rate;
ch_budget = (CPSW_POLL_WEIGHT - ch_budget) /
@@ -1013,15 +1021,56 @@ static void _cpsw_adjust_link(struct cpsw_slave *slave,
slave->mac_control = mac_control;
 }
 
+static int cpsw_get_common_speed(struct cpsw_common *cpsw)
+{
+   int i, speed;
+
+   for (i = 0, speed = 0; i < cpsw->data.slaves; i++)
+   if (cpsw->slaves[i].phy && cpsw->slaves[i].phy->link)
+   speed += cpsw->slaves[i].phy->speed;
+
+   return speed;
+}
+
+static int cpsw_need_resplit(struct cpsw_common *cpsw)
+{
+   int i, rlim_ch_num;
+   int speed, ch_rate;
+
+   /* re-split resources only in case speed was changed */
+   speed = cpsw_get_common_speed(cpsw);
+   if (speed == cpsw->speed || !speed)
+   return 0;
+
+   cpsw->speed = speed;
+
+   for (i = 0, rlim_ch_num = 0; i < cpsw->tx_ch_num; i++) {
+   ch_rate = cpdma_chan_get_rate(cpsw->txv[i].ch);
+   if (!ch_rate)
+   break;
+
+   rlim_ch_num++;
+   }
+
+   /* cases not dependent on speed */
+   if (!rlim_ch_num || rlim_ch_num == cpsw->tx_ch_num)
+   return 0;
+
+   return 1;
+}
+
 static void cpsw_adjust_link(struct net_device *ndev)
 {
struct cpsw_priv    *priv = netdev_priv(ndev);
+   struct cpsw_common  *cpsw = priv->cpsw;
bool                link = false;
 
for_each_slave(priv, _cpsw_adjust_link, priv, &link);
 
if (link) {
-   cpsw_split_res(priv->ndev);
+   if (cpsw_need_resplit(cpsw))
+   cpsw_split_res(ndev);
+
netif_carrier_on(ndev);
if (netif_running(ndev))
netif_tx_wake_all_queues(ndev);
@@ -1538,6 +1587,10 @@ static int cpsw_ndo_stop(struct net_device *ndev)
cpsw_ale_stop(cpsw->ale);
}
for_each_slave(priv, cpsw_slave_stop, cpsw);
+
+   if (cpsw_need_resplit(cpsw))
+   cpsw_split_res(ndev);
+
pm_runtime_put_sync(cpsw->dev);
if (cpsw->data.dual_emac)
cpsw->slaves[priv->emac_port].open_stat = false;
@@ -1983,7 +2036,7 @@ static int cpsw_ndo_set_tx_maxrate(struct net_device 
*ndev, int queue, u32 rate)
return -EINVAL;
}
 
-   if (rate > 2000) {
+   if (rate > cpsw->speed) {
dev_err(priv->dev, "The channel rate cannot be more than 2Gbps");
return -EINVAL;
}
@@ -2998,6 +3051,7 @@ static int cpsw_probe(struct platform_device *pdev)
ndev->ethtool_ops = &cpsw_ethtool_ops;
netif_napi_add(ndev, &cpsw->napi_rx, cpsw_rx_poll, CPSW_POLL_WEIGHT);
netif_tx_napi_add(ndev, &cpsw->napi_tx, cpsw_tx_poll, CPSW_POLL_WEIGHT);
+   cpsw_split_res(ndev);
 
/* register the network device */
SET_NETDEV_DEV(ndev, &pdev->dev);
-- 
2.7.4

[PATCH 5/5] net: ethernet: ti: cpsw: sync rates for channels in dual emac mode

2016-12-10 Thread Ivan Khoronzhuk

The channels are common to both ndevs in dual emac mode. Hence, keep
their rates in sync.

Signed-off-by: Ivan Khoronzhuk 
---
 drivers/net/ethernet/ti/cpsw.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 7ccfa63..b203143 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -2020,9 +2020,10 @@ static int cpsw_ndo_set_tx_maxrate(struct net_device 
*ndev, int queue, u32 rate)
 {
struct cpsw_priv *priv = netdev_priv(ndev);
struct cpsw_common *cpsw = priv->cpsw;
+   struct cpsw_slave *slave;
u32 min_rate;
u32 ch_rate;
-   int ret;
+   int i, ret;
 
ch_rate = netdev_get_tx_queue(ndev, queue)->tx_maxrate;
if (ch_rate == rate)
@@ -2053,6 +2054,15 @@ static int cpsw_ndo_set_tx_maxrate(struct net_device 
*ndev, int queue, u32 rate)
if (ret)
return ret;
 
+   /* update rates for slaves tx queues */
+   for (i = 0; i < cpsw->data.slaves; i++) {
+   slave = &cpsw->slaves[i];
+   if (!slave->ndev)
+   continue;
+
+   netdev_get_tx_queue(slave->ndev, queue)->tx_maxrate = rate;
+   }
+
cpsw_split_res(ndev);
return ret;
 }
-- 
2.7.4

Re: [PATCH 37/50] netfilter: nf_tables: atomic dump and reset for stateful objects

2016-12-10 Thread Pablo Neira Ayuso

On Fri, Dec 09, 2016 at 07:22:06AM -0800, Eric Dumazet wrote:
> On Fri, 2016-12-09 at 06:24 -0800, Eric Dumazet wrote:
> 
> > It looks that you want a seqcount, even on 64bit arches,
> > so that CPU 2 can restart its loop, and more importantly you need
> > to not accumulate the values you read, because they might be old/invalid.
> 
> Untested patch to give general idea. I can polish it a bit later today.

I'm preparing a patch now, so you can review it.

Eric, thanks a lot for reviewing and proposing a working approach!

[PATCH net-next] net: mvneta: select GENERIC_ALLOCATOR

2016-12-10 Thread Arnd Bergmann

We previously relied on GENERIC_ALLOCATOR to be selected by CONFIG_ARM,
but now we can compile-test the driver on other architectures that
don't select it:

drivers/net/built-in.o: In function `mvneta_bm_remove':
mvneta_bm.c:(.text+0x4ee35): undefined reference to `gen_pool_free'

This adds an explicit select for the part of the driver that has
the dependency.

Fixes: a0627f776a45 ("net: marvell: Allow drivers to be built with 
COMPILE_TEST")
Signed-off-by: Arnd Bergmann 
---
 drivers/net/ethernet/marvell/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/marvell/Kconfig 
b/drivers/net/ethernet/marvell/Kconfig
index 3b8f11fe5e13..f4b7cf18fb0f 100644
--- a/drivers/net/ethernet/marvell/Kconfig
+++ b/drivers/net/ethernet/marvell/Kconfig
@@ -76,6 +76,7 @@ config MVNETA_BM
default y if MVNETA=y && MVNETA_BM_ENABLE!=n
default MVNETA_BM_ENABLE
select HWBM
+   select GENERIC_ALLOCATOR
help
  MVNETA_BM must not be 'm' if MVNETA=y, so this symbol ensures
  that all dependencies are met.
-- 
2.9.0

[PATCH] ARM: add cmpxchg64 helper for ARMv7-M

2016-12-10 Thread Arnd Bergmann

A change to the netfilter code in net-next introduced the first caller of
cmpxchg64 that can get built on ARMv7-M, leading to an error from the
assembler that points out the lack of 64-bit atomics on this architecture:

/tmp/ccMe7djj.s: Assembler messages:
/tmp/ccMe7djj.s:367: Error: selected processor does not support `ldrexd 
r0,r1,[lr]' in Thumb mode
/tmp/ccMe7djj.s:371: Error: selected processor does not support `strexd 
ip,r2,r3,[lr]' in Thumb mode
/tmp/ccMe7djj.s:389: Error: selected processor does not support `ldrexd 
r8,r9,[r7]' in Thumb mode
/tmp/ccMe7djj.s:393: Error: selected processor does not support `strexd 
lr,r0,r1,[r7]' in Thumb mode
scripts/Makefile.build:299: recipe for target 'net/netfilter/nft_counter.o' 
failed

This makes ARMv7-M use the same emulation from asm-generic/cmpxchg-local.h
that we use on architectures earlier than ARMv6K, to fix the build. The
32-bit atomics are available on ARMv7-M and we keep using them there.
This ARM specific change is probably something we should do regardless
of the netfilter code.
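
For readers unfamiliar with the generic fallback, its semantics amount to a plain compare-and-store. The sketch below is a userspace model with a hypothetical name; the kernel helper additionally disables local interrupts around the sequence, which is what makes it safe against the local CPU only.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of a "local" 64-bit compare-and-exchange: not atomic with respect
 * to other CPUs. In the kernel version, interrupts are disabled around
 * the read-compare-write sequence.
 */
static uint64_t demo_cmpxchg64_local(uint64_t *ptr, uint64_t old, uint64_t new)
{
	uint64_t prev = *ptr;

	if (prev == old)
		*ptr = new;

	return prev;	/* caller compares this with old to detect success */
}
```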

However, looking at the new nft_counter_reset() function in nft_counter.c,
this looks incorrect to me not just on ARMv7-M but also on other
architectures, with at least the following possible race:

CPU A                                   CPU B
u64_stats_fetch_begin_irq
                                        u64_stats_update_begin
                                        fetch(upper 32 bits)
fetch(old)
cmpxchg64(counter, old, 0);
                                        fetch(lower 32 bits)
u64_stats_fetch_retry_irq == true
                                        store(upper 32 bits)
fetch(old)
cmpxchg64(counter, old, 0);
                                        store(lower 32 bits)
                                        u64_stats_update_end
u64_stats_fetch_retry_irq == true
fetch(old)
cmpxchg64(counter, old, 0);
u64_stats_fetch_retry_irq == false

In this example, the data returned by __nft_counter_reset() is zero
as we overwrite the per-cpu counter value during the retries.
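
The loss can be replayed deterministically in a single-threaded model. This is only an illustration under stated assumptions: the `demo_*` names are hypothetical, and `forced_retries` stands in for u64_stats_fetch_retry_irq() returning true that many times; no real concurrency is involved.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the cmpxchg64-based reset: return the old value, store 0. */
static uint64_t demo_fetch_and_zero(uint64_t *counter)
{
	uint64_t old = *counter;

	*counter = 0;
	return old;
}

/*
 * Model of the retry loop around the reset. Each pass overwrites ret,
 * so whatever an earlier pass removed from the counter is discarded.
 */
static uint64_t demo_reset_with_retries(uint64_t *counter, int forced_retries)
{
	uint64_t ret;

	do {
		ret = demo_fetch_and_zero(counter);
	} while (forced_retries-- > 0);

	return ret;
}
```

With no retry the caller gets the full count back; with a single forced retry the counter is already zero on the second pass, so the caller sees 0 and the earlier value is lost.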

Fixes: 43da04a593d8 ("netfilter: nf_tables: atomic dump and reset for stateful 
objects")
Signed-off-by: Arnd Bergmann 
---
 arch/arm/include/asm/cmpxchg.h | 13 +
 1 file changed, 13 insertions(+)

diff --git a/arch/arm/include/asm/cmpxchg.h b/arch/arm/include/asm/cmpxchg.h
index 97882f9bad12..12215515ba02 100644
--- a/arch/arm/include/asm/cmpxchg.h
+++ b/arch/arm/include/asm/cmpxchg.h
@@ -240,6 +240,7 @@ static inline unsigned long __cmpxchg_local(volatile void 
*ptr,
sizeof(*(ptr)));\
 })
 
+#ifndef CONFIG_CPU_V7M
 static inline unsigned long long __cmpxchg64(unsigned long long *ptr,
 unsigned long long old,
 unsigned long long new)
@@ -273,6 +274,18 @@ static inline unsigned long long __cmpxchg64(unsigned long 
long *ptr,
 
 #define cmpxchg64_local(ptr, o, n) cmpxchg64_relaxed((ptr), (o), (n))
 
+#else
+
+/* ARMv7-M has 32-bit ldrex/strex but no ldrexd/strexd */
+
+#define cmpxchg64(ptr, o, n)   __cmpxchg64_local_generic((ptr), (o), (n))
+#define cmpxchg64_relaxed(ptr, o, n)   __cmpxchg64_local_generic((ptr), (o), (n))
+#define cmpxchg64_local(ptr, o, n) __cmpxchg64_local_generic((ptr), (o), (n))
+
+#include <asm-generic/cmpxchg-local.h>
+
+#endif
+
 #endif /* __LINUX_ARM_ARCH__ >= 6 */
 
 #endif /* __ASM_ARM_CMPXCHG_H */
-- 
2.9.0
