date:20170424

RE: [PATCH 5/7] IB/hfi1: use pcie_flr instead of duplicating it

2017-04-24 Thread Byczkowski, Jakub

Tested-by: Jakub Byczkowski 

-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Christoph Hellwig
Sent: Friday, April 14, 2017 9:11 PM
To: Bjorn Helgaas ; Cabiddu, Giovanni 
; Benedetto, Salvatore 
; Marciniszyn, Mike 
; Dalessandro, Dennis 
; Derek Chickles 
; Satanand Burla 
; Felix Manlunas 
; Raghu Vatsavayi 
; Kirsher, Jeffrey T 

Cc: linux-...@vger.kernel.org; qat-linux ; 
linux-cry...@vger.kernel.org; linux-r...@vger.kernel.org; 
netdev@vger.kernel.org; linux-ker...@vger.kernel.org
Subject: [PATCH 5/7] IB/hfi1: use pcie_flr instead of duplicating it

Signed-off-by: Christoph Hellwig 
---
 drivers/infiniband/hw/hfi1/chip.c |  4 ++--  drivers/infiniband/hw/hfi1/hfi.h  
|  1 -  drivers/infiniband/hw/hfi1/pcie.c | 30 --
 3 files changed, 2 insertions(+), 33 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/chip.c 
b/drivers/infiniband/hw/hfi1/chip.c
index 121a4c920f1b..d037f72e4d96 100644
--- a/drivers/infiniband/hw/hfi1/chip.c
+++ b/drivers/infiniband/hw/hfi1/chip.c
@@ -13610,14 +13610,14 @@ static void init_chip(struct hfi1_devdata *dd)
dd_dev_info(dd, "Resetting CSRs with FLR\n");
 
/* do the FLR, the DC reset will remain */
-   hfi1_pcie_flr(dd);
+   pcie_flr(dd->pcidev);
 
/* restore command and BARs */
restore_pci_variables(dd);
 
if (is_ax(dd)) {
dd_dev_info(dd, "Resetting CSRs with FLR\n");
-   hfi1_pcie_flr(dd);
+   pcie_flr(dd->pcidev);
restore_pci_variables(dd);
}
} else {
diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index 0808e3c3ba39..40d7559fa723 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -1764,7 +1764,6 @@ int hfi1_pcie_init(struct pci_dev *, const struct 
pci_device_id *);  void hfi1_pcie_cleanup(struct pci_dev *);  int 
hfi1_pcie_ddinit(struct hfi1_devdata *, struct pci_dev *);  void 
hfi1_pcie_ddcleanup(struct hfi1_devdata *); -void hfi1_pcie_flr(struct 
hfi1_devdata *);  int pcie_speeds(struct hfi1_devdata *);  void 
request_msix(struct hfi1_devdata *, u32 *, struct hfi1_msix_entry *);  void 
hfi1_enable_intx(struct pci_dev *); diff --git 
a/drivers/infiniband/hw/hfi1/pcie.c b/drivers/infiniband/hw/hfi1/pcie.c
index 0829fce06172..c81556e84831 100644
--- a/drivers/infiniband/hw/hfi1/pcie.c
+++ b/drivers/infiniband/hw/hfi1/pcie.c
@@ -240,36 +240,6 @@ void hfi1_pcie_ddcleanup(struct hfi1_devdata *dd)
iounmap(dd->piobase);
 }
 
-/*
- * Do a Function Level Reset (FLR) on the device.
- * Based on static function drivers/pci/pci.c:pcie_flr().
- */
-void hfi1_pcie_flr(struct hfi1_devdata *dd) -{
-   int i;
-   u16 status;
-
-   /* no need to check for the capability - we know the device has it */
-
-   /* wait for Transaction Pending bit to clear, at most a few ms */
-   for (i = 0; i < 4; i++) {
-   if (i)
-   msleep((1 << (i - 1)) * 100);
-
-   pcie_capability_read_word(dd->pcidev, PCI_EXP_DEVSTA, );
-   if (!(status & PCI_EXP_DEVSTA_TRPND))
-   goto clear;
-   }
-
-   dd_dev_err(dd, "Transaction Pending bit is not clearing, proceeding 
with reset anyway\n");
-
-clear:
-   pcie_capability_set_word(dd->pcidev, PCI_EXP_DEVCTL,
-PCI_EXP_DEVCTL_BCR_FLR);
-   /* PCIe spec requires the function to be back within 100ms */
-   msleep(100);
-}
-
 static void msix_setup(struct hfi1_devdata *dd, int pos, u32 *msixcnt,
   struct hfi1_msix_entry *hfi1_msix_entry)  {
--
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the 
body of a message to majord...@vger.kernel.org More majordomo info at  
http://vger.kernel.org/majordomo-info.html


Intel Technology Poland sp. z o.o.
ul. Slowackiego 173 | 80-298 Gdansk | Sad Rejonowy Gdansk Polnoc | VII Wydzial 
Gospodarczy Krajowego Rejestru Sadowego - KRS 101882 | NIP 957-07-52-316 | 
Kapital zakladowy 200.000 PLN.

Ta wiadomosc wraz z zalacznikami jest przeznaczona dla okreslonego adresata i 
moze zawierac informacje poufne. W razie przypadkowego otrzymania tej 
wiadomosci, prosimy o powiadomienie nadawcy oraz trwale jej usuniecie; 
jakiekolwiek
przegladanie lub rozpowszechnianie jest zabronione.
This e-mail and any

[PATCH] net: bridge: suppress broadcast when multicast flood is disabled

2017-04-24 Thread Mike Manning

Flood suppression for packets that are not unicast needs to be handled
consistently by also not flooding broadcast packets. As broadcast is a
special case of multicast, the same kernel parameter should be used to
suppress flooding for both of these packet types.

Fixes: b6cb5ac8331b ("net: bridge: add per-port multicast flood flag")
Cc: Nikolay Aleksandrov 
Signed-off-by: Mike Manning 
---
 net/bridge/br_forward.c | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 902af6b..a61c7ad 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -183,13 +183,16 @@ void br_flood(struct net_bridge *br, struct sk_buff *skb,
struct net_bridge_port *p;
 
list_for_each_entry_rcu(p, >port_list, list) {
-   /* Do not flood unicast traffic to ports that turn it off */
-   if (pkt_type == BR_PKT_UNICAST && !(p->flags & BR_FLOOD))
-   continue;
-   /* Do not flood if mc off, except for traffic we originate */
-   if (pkt_type == BR_PKT_MULTICAST &&
-   !(p->flags & BR_MCAST_FLOOD) && skb->dev != br->dev)
-   continue;
+   /* Do not flood unicast traffic to ports that turn it off, nor
+* other traffic if mc flood off except for traffic we originate
+*/
+   if (pkt_type == BR_PKT_UNICAST) {
+   if (!(p->flags & BR_FLOOD))
+   continue;
+   } else {
+   if (!(p->flags & BR_MCAST_FLOOD) && skb->dev != br->dev)
+   continue;
+   }
 
/* Do not flood to ports that enable proxy ARP */
if (p->flags & BR_PROXYARP)
-- 
2.1.4

Re: [PATCH v2 net] net: ipv6: regenerate host route if moved to gc list

2017-04-24 Thread Andrey Konovalov

On Sat, Apr 22, 2017 at 6:40 PM, David Ahern  wrote:
> Taking down the loopback device wreaks havoc on IPv6 routes. By
> extension, taking a VRF device wreaks havoc on its table.
>
> Dmitry and Andrey both reported heap out-of-bounds reports in the IPv6
> FIB code while running syzkaller fuzzer. The root cause is a dead dst
> that is on the garbage list gets reinserted into the IPv6 FIB. While on
> the gc (or perhaps when it gets added to the gc list) the dst->next is
> set to an IPv4 dst. A subsequent walk of the ipv6 tables causes the
> out-of-bounds access.
>
> Andrey's reproducer was the key to getting to the bottom of this.
>
> With IPv6, host routes for an address have the dst->dev set to the
> loopback device. When the 'lo' device is taken down, rt6_ifdown initiates
> a walk of the fib evicting routes with the 'lo' device which means all
> host routes are removed. That process moves the dst which is attached to
> an inet6_ifaddr to the gc list and marks it as dead.
>
> The recent change to keep global IPv6 addresses added a new function
> fixup_permanent_addr that is called on admin up. That function restarts
> dad for an inet6_ifaddr and when it completes the host route attached
> to it is inserted into the fib. Since the route was marked dead and
> moved to the gc list, we get the reported out-of-bounds accesses. If
> the device with the address is taken down or the address is removed, the
> WARN_ON in fib6_del is triggered.
>
> All of those faults are fixed by regenerating the host route of the
> existing one has been moved to the gc list, something that can be
> determined by checking if the rt6i_ref counter is 0.
>
> Fixes: f1705ec197e7 ("net: ipv6: Make address flushing on ifdown optional")
> Reported-by: Dmitry Vyukov 
> Reported-by: Andrey Konovalov 
> Signed-off-by: David Ahern 
> ---
> v2
> - change ifp->rt under spinlock vs cmpxchg
> - add comment about rt6i_ref == 0
>
> Dmitry / Andrey: can you guys add this patch to your tree and run
> syzkaller tests? I'd like to confirm that all of the fib traces
> are fixed. Thanks.

Tried fuzzing with your patch, I don't see any fib6 related reports
any more, so it should be good. I'll let you know if I see some
possibly related crashes.

Thanks!

>
>  net/ipv6/addrconf.c | 14 --
>  1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 08f9e8ea7a81..97e86158bbcb 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -3303,14 +3303,24 @@ static void addrconf_gre_config(struct net_device 
> *dev)
>  static int fixup_permanent_addr(struct inet6_dev *idev,
> struct inet6_ifaddr *ifp)
>  {
> -   if (!ifp->rt) {
> -   struct rt6_info *rt;
> +   /* rt6i_ref == 0 means the host route was removed from the
> +* FIB, for example, if 'lo' device is taken down. In that
> +* case regenerate the host route.
> +*/
> +   if (!ifp->rt || !atomic_read(>rt->rt6i_ref)) {
> +   struct rt6_info *rt, *prev;
>
> rt = addrconf_dst_alloc(idev, >addr, false);
> if (unlikely(IS_ERR(rt)))
> return PTR_ERR(rt);
>
> +   spin_lock(>lock);
> +   prev = ifp->rt;
> ifp->rt = rt;
> +   spin_unlock(>lock);
> +
> +   if (prev)
> +   ip6_rt_put(prev);
> }
>
> if (!(ifp->flags & IFA_F_NOPREFIXROUTE)) {
> --
> 2.1.4
>

[PATCH net-next 1/3] samples/bpf: add -Wno-unknown-warning-option to clang

2017-04-24 Thread Alexander Alemayhu

I was initially going to remove '-Wno-address-of-packed-member' because I
thought it was not supposed to be there but Daniel suggested using
'-Wno-unknown-warning-option'. 

This silences several warnings similiar to the one below

warning: unknown warning option '-Wno-address-of-packed-member' 
[-Wunknown-warning-option]
1 warning generated.
clang  -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/6.3.1/include 
-I./arch/x86/include -I./arch/x86/include/generated/uapi 
-I./arch/x86/include/generated  -I./include
 -I./arch/x86/include/uapi -I./include/uapi -I./include/generated/uapi -include 
./include/linux/kconfig.h  \
-D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \
-Wno-compare-distinct-pointer-types \
-Wno-gnu-variable-sized-type-not-at-end \
-Wno-address-of-packed-member -Wno-tautological-compare \
-O2 -emit-llvm -c samples/bpf/xdp_tx_iptunnel_kern.c -o -| llc 
-march=bpf -filetype=obj -o samples/bpf/xdp_tx_iptunnel_kern.o

$ clang --version

 clang version 3.9.1 (tags/RELEASE_391/final)
 Target: x86_64-unknown-linux-gnu
 Thread model: posix
 InstalledDir: /usr/bin

Signed-off-by: Alexander Alemayhu 
---
 samples/bpf/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index d42b495b0992..6c7468eb3684 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -189,4 +189,5 @@ $(obj)/%.o: $(src)/%.c
-Wno-compare-distinct-pointer-types \
-Wno-gnu-variable-sized-type-not-at-end \
-Wno-address-of-packed-member -Wno-tautological-compare \
+   -Wno-unknown-warning-option \
-O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@
-- 
2.9.3

[PATCH net-next 2/3] samples/bpf: add static to function with no prototype

2017-04-24 Thread Alexander Alemayhu

Fixes the following warning

samples/bpf/cookie_uid_helper_example.c: At top level:
samples/bpf/cookie_uid_helper_example.c:276:6: warning: no previous prototype 
for ‘finish’ [-Wmissing-prototypes]
 void finish(int ret)
  ^~
  HOSTLD  samples/bpf/per_socket_stats_example

Signed-off-by: Alexander Alemayhu 
---
 samples/bpf/cookie_uid_helper_example.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/samples/bpf/cookie_uid_helper_example.c 
b/samples/bpf/cookie_uid_helper_example.c
index ad5afedf2e70..9ce55840d61d 100644
--- a/samples/bpf/cookie_uid_helper_example.c
+++ b/samples/bpf/cookie_uid_helper_example.c
@@ -273,7 +273,7 @@ static int usage(void)
return 1;
 }
 
-void finish(int ret)
+static void finish(int ret)
 {
test_finish = true;
 }
-- 
2.9.3

[PATCH net-next 3/3] samples/bpf: check before defining offsetof

2017-04-24 Thread Alexander Alemayhu

Fixes the following warning

samples/bpf/test_lru_dist.c:28:0: warning: "offsetof" redefined
 #define offsetof(TYPE, MEMBER) ((size_t)&((TYPE *)0)->MEMBER)

In file included from ./tools/lib/bpf/bpf.h:25:0,
 from samples/bpf/libbpf.h:5,
 from samples/bpf/test_lru_dist.c:24:
/usr/lib/gcc/x86_64-redhat-linux/6.3.1/include/stddef.h:417:0: note: this is 
the location of the previous definition
 #define offsetof(TYPE, MEMBER) __builtin_offsetof (TYPE, MEMBER)

Signed-off-by: Alexander Alemayhu 
---
 samples/bpf/test_lru_dist.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/samples/bpf/test_lru_dist.c b/samples/bpf/test_lru_dist.c
index d96dc88d3b04..73c357142268 100644
--- a/samples/bpf/test_lru_dist.c
+++ b/samples/bpf/test_lru_dist.c
@@ -25,7 +25,9 @@
 #include "bpf_util.h"
 
 #define min(a, b) ((a) < (b) ? (a) : (b))
-#define offsetof(TYPE, MEMBER) ((size_t)&((TYPE *)0)->MEMBER)
+#ifndef offsetof
+# define offsetof(TYPE, MEMBER)((size_t)&((TYPE *)0)->MEMBER)
+#endif
 #define container_of(ptr, type, member) ({ \
const typeof( ((type *)0)->member ) *__mptr = (ptr);\
(type *)( (char *)__mptr - offsetof(type,member) );})
-- 
2.9.3

[PATCH net-next 0/3] Misc BPF cleanup

2017-04-24 Thread Alexander Alemayhu

Hei,

while looking into making the Makefile in samples/bpf better handle O= I saw
several warnings when running `make clean && make samples/bpf/`. This series
reduces those warnings.

Thanks.

Alexander Alemayhu (3):
  samples/bpf: add -Wno-unknown-warning-option to clang
  samples/bpf: add static to function with no prototype
  samples/bpf: check before defining offsetof

 samples/bpf/Makefile| 1 +
 samples/bpf/cookie_uid_helper_example.c | 2 +-
 samples/bpf/test_lru_dist.c | 4 +++-
 3 files changed, 5 insertions(+), 2 deletions(-)

-- 
2.9.3

Re: [PATCH v4 net-next RFC] net: Generic XDP

2017-04-24 Thread Jesper Dangaard Brouer


On Thu, 20 Apr 2017 16:30:34 +0200 Jesper Dangaard Brouer  
wrote:

> On Wed, 19 Apr 2017 10:29:03 -0400
> Andy Gospodarek  wrote:
> 
> > I ran this on top of a card that uses the bnxt_en driver on a desktop
> > class system with an i7-6700 CPU @ 3.40GHz, sending a single stream of
> > UDP traffic with flow control disabled and saw the following (all stats
> > in Million PPS).
> > 
> > xdp1xdp2xdp_tx_tunnel
> > Generic XDP  7.85.5 (1.3 actual) 4.6 (1.1 actual)
> > Optimized XDP   11.7 9.7  4.6
> > 
> > One thing to note is that the Generic XDP case shows some different
> > results for reported by the application vs actual (seen on the wire).  I
> > did not debug where the drops are happening and what counter needs to be
> > incremented to note this -- I'll add that to my TODO list.  The
> > Optimized XDP case does not have a difference in reported vs actual
> > frames on the wire.  
> 
> The reported application vs actual (seen on the wire) number sound scary.
> How do you evaluate/measure "seen on the wire"?
> 
> Perhaps you could use ethtool -S stats to see if anything is fishy?
> I recommend using my tool[1] like:
> 
>  ~/git/network-testing/bin/ethtool_stats.pl --dev mlx5p2 --sec 2
> 
> [1] 
> https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> 
> I'm evaluating this patch on a mlx5 NIC, and something is not right...
> I'm seeing:
> 
>  Ethtool(mlx5p2) stat: 349599 (349,599) <= tx_multicast_phy /sec
>  Ethtool(mlx5p2) stat:4940185 (  4,940,185) <= tx_packets /sec
>  Ethtool(mlx5p2) stat: 349596 (349,596) <= tx_packets_phy /sec
>  [...]
>  Ethtool(mlx5p2) stat:  36898 ( 36,898) <= rx_cache_busy /sec
>  Ethtool(mlx5p2) stat:  36898 ( 36,898) <= rx_cache_full /sec
>  Ethtool(mlx5p2) stat:4903287 (  4,903,287) <= rx_cache_reuse /sec
>  Ethtool(mlx5p2) stat:4940185 (  4,940,185) <= rx_csum_complete /sec
>  Ethtool(mlx5p2) stat:4940185 (  4,940,185) <= rx_packets /sec
> 
> Something is wrong... when I tcpdump on the generator machine, I see
> garbled packets with IPv6 multicast addresses.
> 
> And it looks like I'm only sending 349,596 tx_packets_phy/sec on the "wire".
> 

Not seeing packets on the TX wire was caused by the NIC HW dropping the
packets, because the ethernet MAC-addr were not changed/swapped.

Fixed this XDP_TX bug in my test program xdp_bench01_mem_access_cost.
https://github.com/netoptimizer/prototype-kernel/commit/85f7ba2f0ea2

Even added a new option --swapmac for creating another test option for
modifying the packet.
https://github.com/netoptimizer/prototype-kernel/commit/fe080e6f3ccf

I will shortly publish a full report of testing this patch.
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[PATCH net v1 2/2] tipc: fix socket flow control accounting error at tipc_recv_stream

2017-04-24 Thread Parthasarathy Bhuvaragan

Until now in tipc_recv_stream(), we update the received
unacknowledged bytes based on a stack variable and not based on the
actual message size.
If the user buffer passed at tipc_recv_stream() is smaller than the
received skb, the size variable in stack differs from the actual
message size in the skb. This leads to a flow control accounting
error causing permanent congestion.

In this commit, we fix this accounting error by always using the
size of the incoming message.

Fixes: 10724cc7bb78 ("tipc: redesign connection-level flow control")
Signed-off-by: Parthasarathy Bhuvaragan 
Reviewed-by: Jon Maloy 
---
 net/tipc/socket.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index b28e94f1c739..566906795c8c 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1484,7 +1484,7 @@ static int tipc_recv_stream(struct socket *sock, struct 
msghdr *m,
if (unlikely(flags & MSG_PEEK))
goto exit;
 
-   tsk->rcv_unacked += tsk_inc(tsk, hlen + sz);
+   tsk->rcv_unacked += tsk_inc(tsk, hlen + msg_data_sz(msg));
if (unlikely(tsk->rcv_unacked >= (tsk->rcv_win / 4)))
tipc_sk_send_ack(tsk);
tsk_advance_rx_queue(sk);
-- 
2.1.4

[PATCH v2] brcmfmac: Make skb header writable before use

2017-04-24 Thread James Hughes

The driver was making changes to the skb_header without
ensuring it was writable (i.e. uncloned).
This patch also removes some boiler plate header size
checking/adjustment code as that is also handled by the
skb_cow_header function used to make header writable.

This patch depends on
brcmfmac: Ensure pointer correctly set if skb data location changes

Signed-off-by: James Hughes 
---
Changes in v2
  Makes the _cow_ call at the entry point of the skb in to the
  stack, means only needs to be done once, and error handling
  is easier.
  Split a separate minor bug fix off to a separate patch (which
  this patch depends on)



 .../net/wireless/broadcom/brcm80211/brcmfmac/core.c   | 19 ---
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c 
b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
index 9b7c19a508ac..88f8675a94c2 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
@@ -211,22 +211,11 @@ static netdev_tx_t brcmf_netdev_start_xmit(struct sk_buff 
*skb,
goto done;
}
 
-   /* Make sure there's enough room for any header */
-   if (skb_headroom(skb) < drvr->hdrlen) {
-   struct sk_buff *skb2;
-
-   brcmf_dbg(INFO, "%s: insufficient headroom\n",
- brcmf_ifname(ifp));
-   drvr->bus_if->tx_realloc++;
-   skb2 = skb_realloc_headroom(skb, drvr->hdrlen);
+   /* Make sure there's enough room for any header, and make it writable */
+   if (skb_cow_head(skb, drvr->hdrlen)) {
dev_kfree_skb(skb);
-   skb = skb2;
-   if (skb == NULL) {
-   brcmf_err("%s: skb_realloc_headroom failed\n",
- brcmf_ifname(ifp));
-   ret = -ENOMEM;
-   goto done;
-   }
+   ret = -ENOMEM;
+   goto done;
}
 
/* validate length for ether packet */
-- 
2.11.0

[PATCH net v1 1/2] tipc: fix socket flow control accounting error at tipc_send_stream

2017-04-24 Thread Parthasarathy Bhuvaragan

Until now in tipc_send_stream(), we return -1 when the socket
encounters link congestion even if the socket had successfully
sent partial data. This is incorrect as the application resends
the same the partial data leading to data corruption at
receiver's end.

In this commit, we return the partially sent bytes as the return
value at link congestion.

Fixes: 10724cc7bb78 ("tipc: redesign connection-level flow control")
Signed-off-by: Parthasarathy Bhuvaragan 
Reviewed-by: Jon Maloy 
---
 net/tipc/socket.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 7130e73bd42c..b28e94f1c739 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1083,7 +1083,7 @@ static int __tipc_sendstream(struct socket *sock, struct 
msghdr *m, size_t dlen)
}
} while (sent < dlen && !rc);
 
-   return rc ? rc : sent;
+   return sent ? sent : rc;
 }
 
 /**
-- 
2.1.4

[PATCH v2] net/packet: initialize val in packet_getsockopt()

2017-04-24 Thread Alexander Potapenko

In the case getsockopt() is called with PACKET_HDRLEN and optlen < 4
|val| remains uninitialized and the syscall may behave differently
depending on its value. This doesn't have security consequences (as the
uninit bytes aren't copied back), but it's still cleaner to initialize
|val| and ensure optlen is not less than sizeof(int).

This bug has been detected with KMSAN.

Signed-off-by: Alexander Potapenko 
---
v2: - if len < sizeof(int), make it 0

KMSAN report below:

==
BUG: KMSAN: use of unitialized memory in packet_getsockopt+0xb9b/0xbe0
inter: 0
CPU: 0 PID: 1036 Comm: probe Tainted: GB   4.11.0-rc5+ #2444
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:16
 dump_stack+0x143/0x1b0 lib/dump_stack.c:52
 kmsan_report+0x16b/0x1e0 mm/kmsan/kmsan.c:1078
 __kmsan_warning_32+0x5c/0xa0 mm/kmsan/kmsan_instr.c:510
 packet_getsockopt+0xb9b/0xbe0 net/packet/af_packet.c:3839
 SYSC_getsockopt+0x495/0x540 net/socket.c:1829
 SyS_getsockopt+0xb0/0xd0 net/socket.c:1811
 entry_SYSCALL_64_fastpath+0x13/0x94 arch/x86/entry/entry_64.S:204
RIP: 0033:0x436d8a
RSP: 002b:7ffce54e52c8 EFLAGS: 0203 ORIG_RAX: 0037
RAX: ffda RBX:  RCX: 00436d8a
RDX: 000b RSI: 0107 RDI: 0003
RBP: 7ffce54e52b0 R08: 7ffce54e52d8 R09: 0004
R10: 7ffce54e52d4 R11: 0203 R12: 7ffce54e53c8
R13: 7ffce54e53d8 R14: 0002 R15: 
origin description: val@packet_getsockopt (origin=f6600052)
local variable created at:
 packet_getsockopt+0xcd/0xbe0 net/packet/af_packet.c:3789
 SYSC_getsockopt+0x495/0x540 net/socket.c:1829
==
---
 net/packet/af_packet.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 8489beff5c25..dfc762df5da9 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -3787,7 +3787,7 @@ static int packet_getsockopt(struct socket *sock, int 
level, int optname,
 char __user *optval, int __user *optlen)
 {
int len;
-   int val, lv = sizeof(val);
+   int val = 0, lv = sizeof(val);
struct sock *sk = sock->sk;
struct packet_sock *po = pkt_sk(sk);
void *data = 
@@ -3836,6 +3836,8 @@ static int packet_getsockopt(struct socket *sock, int 
level, int optname,
case PACKET_HDRLEN:
if (len > sizeof(int))
len = sizeof(int);
+   if (len < sizeof(int))
+   len = 0;
if (copy_from_user(, optval, len))
return -EFAULT;
switch (val) {
-- 
2.12.2.816.g281164-goog

Re: [PATCH v4 13/18] arm64: allwinner: sun50i-a64: add dwmac-sun8i Ethernet driver

2017-04-24 Thread Chen-Yu Tsai

On Mon, Apr 24, 2017 at 8:24 PM, Corentin Labbe
 wrote:
> On Wed, Apr 12, 2017 at 02:41:53PM +0200, Maxime Ripard wrote:
>> On Wed, Apr 12, 2017 at 01:13:55PM +0200, Corentin Labbe wrote:
>> > The dwmac-sun8i is an Ethernet MAC that supports 10/100/1000 Mbit
>> > connections. It is very similar to the device found in the Allwinner
>> > H3, but lacks the internal 100 Mbit PHY and its associated control
>> > bits.
>> > This adds the necessary bits to the Allwinner A64 SoC .dtsi, but keeps
>> > it disabled at this level.
>> >
>> > Signed-off-by: Corentin Labbe 
>> > ---
>> >  arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi | 37 
>> > +++
>> >  1 file changed, 37 insertions(+)
>> >
>> > diff --git a/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi 
>> > b/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi
>> > index 0b0f4ab..2569827 100644
>> > --- a/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi
>> > +++ b/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi
>> > @@ -287,6 +287,23 @@
>> > bias-pull-up;
>> > };
>> >
>> > +   rmii_pins: rmii_pins {
>> > +   pins = "PD10", "PD11", "PD13", "PD14",
>> > +   "PD17", "PD18", "PD19", "PD20",
>> > +   "PD22", "PD23";
>>
>> Please align the wrapped lines on the first pin.
>>
>
> OK
>
>> > +   function = "emac";
>> > +   drive-strength = <40>;
>>
>> Do you actually need that for all the boards, or only a few of them?
>
> I have tried to use lower value without success on some boards. (opipc/pine64 
> in my memory)

FYI we need them for all the boards that use RGMII.
The signals at gigabit speed run at 125 MHz DDR.

For RMII we probably don't need it. Even at 100 Mbps,
it's only 50 MHz SDR. drive-strength = <30> should be
enough.

ChenYu

Re: [PATCH net-next v4 1/2] net sched actions: dump more than TCA_ACT_MAX_PRIO actions per batch

2017-04-24 Thread Jamal Hadi Salim


On 17-04-24 05:27 AM, Simon Horman wrote:

On Fri, Apr 21, 2017 at 02:11:00PM -0400, Jamal Hadi Salim wrote:

On 17-04-21 12:12 PM, David Miller wrote:




From my PoV, for #d user-space has to either get smart or fail.

Creating new tc involved work in order to support the new feature.
Part of that work would be a decision weather or not to provide
compatibility for old kernel or to bail out gracefully.



But not that much work as is being ascribed now.
This is a big change to user space code. Are we planning to
change all netlink apps (everything in iproute2) to now discover
by testing whether their flags are accepted or not? One extra
crossing to the kernel just to test the waters.

I think there's a middle ground and see which idea takes off.
Refer to the last patch I sent which implements the idea below.

cheers,
jamal


There is a simpler approach that would work going forward.
How about letting the user choose their fate? Set something maybe
in the netlink header to tell the kernel "if you dont understand
something I am asking for - please ignore it and do what you can".
This would maintain current behavior but would force the user to
explicitly state so.

cheers,
jamal

Re: [PATCH net-next v5 1/2] net sched actions: dump more than TCA_ACT_MAX_PRIO actions per batch

2017-04-24 Thread Jamal Hadi Salim


On 17-04-24 05:14 AM, Simon Horman wrote:
[..]


Jamal, I am confused about why are you so concerned about the space
consumed by this attribute, it's per-message, right? Is it the bigger
picture you are worried about - a similar per-entry flag at some point in
the future?



To me the two worries are one and the same.

Jiri strongly believes (from a big picture view) we must use
TLVs for extensibility.
While I agree with him in general i have strong reservations
in this case because i can get both extensibility and
build for performance with using a flag bitmask as the
content of the TLV.

A TLV consumes 64 bits minimum. It doesnt matter if we decide
to use a u8 or a u16, we are still sending 64 bits on that
TLV with the rest being PADding. Not to be melodramatic, but
the worst case scenario of putting everything in a TLV for 32
flags is using about 30x more space than using a bitmask.

Yes, space is important and if i can express upto 32 flags
with one TLV rather than 32 TLVs i choose one TLV.
I am always looking for ways to filter out crap i dont need
when i do stats collection. I have numerous wounds from fdb
entries which decided to use a TLV per flag.

The design approach we have used in netlink is: flags start
as a bitmap (whether they are on main headers or TLVs); they may be
complemented with a bitmask/selector (refer to IFLINK messages).

Lets look at this specific patch I have sending. I have already
changed it 3 times and involved a churn of 3 different flags.
If you asked me in the beggining i wouldve scratched my head
thinking for a near term use for bit #3, #4 etc,

I am fine with the counter-Postel view of having the kernel
validate that appropriate bits are set as long as we dont make
user space to now start learning how to play acrobatics.

cheers,
jamal

Re: [PATCH v4 13/18] arm64: allwinner: sun50i-a64: add dwmac-sun8i Ethernet driver

2017-04-24 Thread Corentin Labbe

On Wed, Apr 12, 2017 at 02:41:53PM +0200, Maxime Ripard wrote:
> On Wed, Apr 12, 2017 at 01:13:55PM +0200, Corentin Labbe wrote:
> > The dwmac-sun8i is an Ethernet MAC that supports 10/100/1000 Mbit
> > connections. It is very similar to the device found in the Allwinner
> > H3, but lacks the internal 100 Mbit PHY and its associated control
> > bits.
> > This adds the necessary bits to the Allwinner A64 SoC .dtsi, but keeps
> > it disabled at this level.
> > 
> > Signed-off-by: Corentin Labbe 
> > ---
> >  arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi | 37 
> > +++
> >  1 file changed, 37 insertions(+)
> > 
> > diff --git a/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi 
> > b/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi
> > index 0b0f4ab..2569827 100644
> > --- a/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi
> > +++ b/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi
> > @@ -287,6 +287,23 @@
> > bias-pull-up;
> > };
> >  
> > +   rmii_pins: rmii_pins {
> > +   pins = "PD10", "PD11", "PD13", "PD14",
> > +   "PD17", "PD18", "PD19", "PD20",
> > +   "PD22", "PD23";
> 
> Please align the wrapped lines on the first pin.
> 

OK

> > +   function = "emac";
> > +   drive-strength = <40>;
> 
> Do you actually need that for all the boards, or only a few of them?

I have tried to use lower value without success on some boards. (opipc/pine64 
in my memory)

Regards
Corentin Labbe

[PATCH net] ipv6: move stub initialization after ipv6 setup completion

2017-04-24 Thread Paolo Abeni

The ipv6 stub pointer is currently initialized before the ipv6
routing subsystem: a 3rd party can access and use such stub
before the routing data is ready.
Moreover, such pointer is not cleared in case of initialization
error, possibly leading to dangling pointers usage.

This change addresses the above moving the stub initialization
at the end of ipv6 init code.

Fixes: 5f81bd2e5d80 ("ipv6: export a stub for IPv6 symbols used by vxlan")
Signed-off-by: Paolo Abeni 
---
 net/ipv6/af_inet6.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index a9a9553..e82e59f 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -933,8 +933,6 @@ static int __init inet6_init(void)
if (err)
goto igmp_fail;
 
-   ipv6_stub = _stub_impl;
-
err = ipv6_netfilter_init();
if (err)
goto netfilter_fail;
@@ -1010,6 +1008,10 @@ static int __init inet6_init(void)
if (err)
goto sysctl_fail;
 #endif
+
+   /* ensure that ipv6 stubs are visible only after ipv6 is ready */
+   wmb();
+   ipv6_stub = _stub_impl;
 out:
return err;
 
-- 
2.9.3

[PATCH net-next 2/2] l2tp: define "l2tpeth" device type

2017-04-24 Thread Guillaume Nault

Export type of l2tpeth interfaces to userspace
(/sys/class/net//uevent).

Signed-off-by: Guillaume Nault 
---
 net/l2tp/l2tp_eth.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/net/l2tp/l2tp_eth.c b/net/l2tp/l2tp_eth.c
index 5e44b3cc1212..59aba8aeac03 100644
--- a/net/l2tp/l2tp_eth.c
+++ b/net/l2tp/l2tp_eth.c
@@ -130,8 +130,13 @@ static const struct net_device_ops l2tp_eth_netdev_ops = {
.ndo_set_mac_address= eth_mac_addr,
 };
 
+static struct device_type l2tpeth_type = {
+   .name = "l2tpeth",
+};
+
 static void l2tp_eth_dev_setup(struct net_device *dev)
 {
+   SET_NETDEV_DEVTYPE(dev, _type);
ether_setup(dev);
dev->priv_flags &= ~IFF_TX_SKB_SHARING;
dev->features   |= NETIF_F_LLTX;
-- 
2.11.0

[PATCH net-next 1/2] l2tp: set name_assign_type for devices created by l2tp_eth.c

2017-04-24 Thread Guillaume Nault

Export naming scheme used when creating l2tpeth interfaces
(/sys/class/net//name_assign_type). This let userspace know if
the device's name has been generated automatically or defined manually.

Signed-off-by: Guillaume Nault 
---
 net/l2tp/l2tp_eth.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/l2tp/l2tp_eth.c b/net/l2tp/l2tp_eth.c
index b722d559c544..5e44b3cc1212 100644
--- a/net/l2tp/l2tp_eth.c
+++ b/net/l2tp/l2tp_eth.c
@@ -258,6 +258,7 @@ static void l2tp_eth_adjust_mtu(struct l2tp_tunnel *tunnel,
 
 static int l2tp_eth_create(struct net *net, u32 tunnel_id, u32 session_id, u32 
peer_session_id, struct l2tp_session_cfg *cfg)
 {
+   unsigned char name_assign_type;
struct net_device *dev;
char name[IFNAMSIZ];
struct l2tp_tunnel *tunnel;
@@ -281,8 +282,11 @@ static int l2tp_eth_create(struct net *net, u32 tunnel_id, 
u32 session_id, u32 p
goto out;
}
strlcpy(name, cfg->ifname, IFNAMSIZ);
-   } else
+   name_assign_type = NET_NAME_USER;
+   } else {
strcpy(name, L2TP_ETH_DEV_NAME);
+   name_assign_type = NET_NAME_ENUM;
+   }
 
session = l2tp_session_create(sizeof(*spriv), tunnel, session_id,
  peer_session_id, cfg);
@@ -291,7 +295,7 @@ static int l2tp_eth_create(struct net *net, u32 tunnel_id, 
u32 session_id, u32 p
goto out;
}
 
-   dev = alloc_netdev(sizeof(*priv), name, NET_NAME_UNKNOWN,
+   dev = alloc_netdev(sizeof(*priv), name, name_assign_type,
   l2tp_eth_dev_setup);
if (!dev) {
rc = -ENOMEM;
-- 
2.11.0

[PATCH net-next 0/2] l2tp: add informations about l2tpeth interfaces in /sys

2017-04-24 Thread Guillaume Nault

Patch #1 lets userspace retrieve the naming scheme of an l2tpeth
interface, using /sys/class/net//name_assign_type.

Patch #2 adds the DEVTYPE field in /sys/class/net//uevent so
that userspace can reliably know if a device is an l2tpeth interface.


Guillaume Nault (2):
  l2tp: set name_assign_type for devices created by l2tp_eth.c
  l2tp: define "l2tpeth" device type

 net/l2tp/l2tp_eth.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

-- 
2.11.0

Re: [PATCH] macsec: avoid heap overflow in skb_to_sgvec

2017-04-24 Thread Jason A. Donenfeld

On Mon, Apr 24, 2017 at 1:02 PM, David Laight  wrote:
> ...
>
> Shouldn't skb_to_sgvec() be checking the number of fragments against
> the size of the sg list?
> The callers would then all need auditing to allow for failure.

This has never been done before, since this is one of those operations
that simply _shouldn't fail_ this late in the driver's path. There's
an easy way to use a fixed size array of MAX_SKB_FRAGS+1, and then
just not specify FRAGLIST as a device feature. Then the function
succeeds every time, rather than dropping packets. Alternatively, if
the array is being allocated dynamically (kmalloc), a call to
skb_cow_data returns the number of fragments needed; since usually
people using scattergather are going to be modifying the skb anyway, I
believe this function should be being called anyway...

It would be possible to do as you suggest, though, by using sg_is_last
in skb_to_sgvec. In this case we'd need to change every call site of
skb_to_sgvec to ensure the return value is being checked as well as
making sure that the sglist is initialized with sg_init_table to
ensure the last frag is properly marked. I wouldn't be opposed to
this, though it is potentially error prone work.

In any case, this patch here follows the pattern of the entire rest of
the present-day kernel, so it ought to be merged as-is.

Re: net: cleanup_net is slow

2017-04-24 Thread Florian Westphal

Andrey Konovalov  wrote:
> On Fri, Apr 21, 2017 at 9:45 PM, Florian Westphal  wrote:
> > Florian Westphal  wrote:
> >> Indeed.  Setting net.netfilter.nf_conntrack_default_on=0 cuts time
> >> cleanup time by 2/3 ...
> >>
> >> nf unregister is way too happy to issue synchronize_net(), I'll work on
> >> a fix.
> >
> > I'll test this patch as a start.  Maybe we can also leverage exit_batch
> > more on netfilter side.
> 
> Hi Florian,
> 
> Your patch improves fuzzing speed at least twice, which is a great start!

Great.  I'll try to push the patches i have to nf-next ASAP.

Re: [PATCH net] bridge: shutdown bridge device before removing it

2017-04-24 Thread Nikolay Aleksandrov

On 24/04/17 14:01, Nikolay Aleksandrov wrote:
> On 24/04/17 10:25, Xin Long wrote:
>> During removing a bridge device, if the bridge is still up, a new mdb entry
>> still can be added in br_multicast_add_group() after all mdb entries are
>> removed in br_multicast_dev_del(). Like the path:
>>
>>   mld_ifc_timer_expire ->
>> mld_sendpack -> ...
>>   br_multicast_rcv ->
>> br_multicast_add_group
>>
>> The new mp's timer will be set up. If the timer expires after the bridge
>> is freed, it may cause use-after-free panic in br_multicast_group_expired.
>> This can happen when ip link remove a bridge or destroy a netns with a
>> bridge device inside.
>>
>> As we can see in br_del_bridge, brctl is also supposed to remove a bridge
>> device after it's shutdown.
>>
>> This patch is to call dev_close at the beginning of br_dev_delete so that
>> netif_running check in br_multicast_add_group can avoid this issue. But
>> to keep consistent with before, it will not remove the IFF_UP check in
>> br_del_bridge for brctl.
>>
>> Reported-by: Jianwen Ji 
>> Signed-off-by: Xin Long 
>> ---
>>  net/bridge/br_if.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
> 
> +CC bridge maintainers
> 
> I can see how this could happen, could you also provide the traceback ?
> 
> The patch looks good to me, actually I think it fixes another issue with
> mcast stats where the percpu pointer can be accessed after it's freed if
> an mcast packet can get sent via br->dev after the br_multicast_dev_del() 
> call.
> This is definitely stable material, if I'm not mistaken the issue is there 
> since
> the introduction of br_dev_delete:
> commit e10177abf842
> Author: Satish Ashok 
> Date:   Wed Jul 15 07:16:51 2015 -0700
> 
> bridge: multicast: fix handling of temp and perm entries
> 
> 
> 
> Acked-by: Nikolay Aleksandrov 
> 

Actually I have a better idea for a fix because dev_close() for a single device 
is rather heavy.
Why don't you move the mdb flush logic in the bridge's ndo_uninit() callback ?
That should have the same effect and be much faster.

By the way I just noticed that there's also a memory leak - the mdb hash is 
reallocated
and not freed due to the mdb rehash, here's also kmemleak's object:

unreferenced object 0x8800540ba800 (size 2048):
  comm "softirq", pid 0, jiffies 4520588901 (age 5787.284s)
  hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
  backtrace:
[] kmemleak_alloc+0x67/0xc0
[] __kmalloc+0x1ba/0x3e0
[] br_mdb_rehash+0x5e/0x340 [bridge]
[] br_multicast_new_group+0x43f/0x6e0 [bridge]
[] br_multicast_add_group+0x203/0x260 [bridge]
[] br_multicast_rcv+0x945/0x11d0 [bridge]
[] br_dev_xmit+0x180/0x470 [bridge]
[] dev_hard_start_xmit+0xbb/0x3d0
[] __dev_queue_xmit+0xb13/0xc10
[] dev_queue_xmit+0x10/0x20
[] ip6_finish_output2+0x5ca/0xac0 [ipv6]
[] ip6_finish_output+0x126/0x2c0 [ipv6]
[] ip6_output+0xe5/0x390 [ipv6]
[] NF_HOOK.constprop.44+0x6c/0x240 [ipv6]
[] mld_sendpack+0x216/0x3e0 [ipv6]
[] mld_ifc_timer_expire+0x18b/0x2b0 [ipv6]

Re: [PATCH RFC] ptr_ring: add ptr_ring_unconsume

2017-04-24 Thread Michael S. Tsirkin

On Mon, Apr 24, 2017 at 07:54:18PM +0800, Jason Wang wrote:
> 
> 
> On 2017年04月24日 07:28, Michael S. Tsirkin wrote:
> > On Tue, Apr 18, 2017 at 11:07:42AM +0800, Jason Wang wrote:
> > > 
> > > On 2017年04月17日 07:19, Michael S. Tsirkin wrote:
> > > > Applications that consume a batch of entries in one go
> > > > can benefit from ability to return some of them back
> > > > into the ring.
> > > > 
> > > > Add an API for that - assuming there's space. If there's no space
> > > > naturally we can't do this and have to drop entries, but this implies
> > > > ring is full so we'd likely drop some anyway.
> > > > 
> > > > Signed-off-by: Michael S. Tsirkin 
> > > > ---
> > > > 
> > > > Jason, in my mind the biggest issue with your batching patchset is the
> > > > backet drops on disconnect.  This API will help avoid that in the common
> > > > case.
> > > Ok, I will rebase the series on top of this. (Though I don't think we care
> > > the packet loss).
> > E.g. I care - I often start sending packets to VM before it's
> > fully booted. Several vhost resets might follow.
> 
> Ok.
> 
> > 
> > > > I would still prefer that we understand what's going on,
> > > I try to reply in another thread, does it make sense?
> > > 
> > > >and I would
> > > > like to know what's the smallest batch size that's still helpful,
> > > Yes, I've replied in another thread, the result is:
> > > 
> > > 
> > > no batching   1.88Mpps
> > > RX_BATCH=11.93Mpps
> > > RX_BATCH=42.11Mpps
> > > RX_BATCH=16   2.14Mpps
> > > RX_BATCH=64   2.25Mpps
> > > RX_BATCH=256  2.18Mpps
> > Essentially 4 is enough, other stuf looks more like noise
> > to me. What about 2?
> 
> The numbers are pretty stable, so probably not noise. Retested on top of
> batch zeroing:
> 
> no  1.97Mpps
> 1   2.09Mpps
> 2   2.11Mpps
> 4   2.16Mpps
> 8   2.19Mpps
> 16  2.21Mpps
> 32  2.25Mpps
> 64  2.30Mpps
> 128 2.21Mpps
> 256 2.21Mpps
> 
> 64 performs best.
> 
> Thanks

OK but it might be e.g. a function of the ring size, host cache size or
whatever. As we don't really understand the why, if we just optimize for
your setup we risk regressions in others.  64 entries is a lot, it
increases the queue size noticeably.  Could this be part of the effect?
Could you try changing the queue size to see what happens?

> > 
> > > >but
> > > > I'm not going to block the patch on these grounds assuming packet drops
> > > > are fixed.
> > > Thanks a lot.
> > > 
> > > > Lightly tested - this is on top of consumer batching patches.
> > > > 
> > > > Thanks!
> > > > 
> > > >include/linux/ptr_ring.h | 57 
> > > > 
> > > >1 file changed, 57 insertions(+)
> > > > 
> > > > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> > > > index 783e7f5..5fbeab4 100644
> > > > --- a/include/linux/ptr_ring.h
> > > > +++ b/include/linux/ptr_ring.h
> > > > @@ -457,6 +457,63 @@ static inline int ptr_ring_init(struct ptr_ring 
> > > > *r, int size, gfp_t gfp)
> > > > return 0;
> > > >}
> > > > +/*
> > > > + * Return entries into ring. Destroy entries that don't fit.
> > > > + *
> > > > + * Note: this is expected to be a rare slow path operation.
> > > > + *
> > > > + * Note: producer lock is nested within consumer lock, so if you
> > > > + * resize you must make sure all uses nest correctly.
> > > > + * In particular if you consume ring in interrupt or BH context, you 
> > > > must
> > > > + * disable interrupts/BH when doing so.
> > > > + */
> > > > +static inline void ptr_ring_unconsume(struct ptr_ring *r, void 
> > > > **batch, int n,
> > > > + void (*destroy)(void *))
> > > > +{
> > > > +   unsigned long flags;
> > > > +   int head;
> > > > +
> > > > +   spin_lock_irqsave(&(r)->consumer_lock, flags);
> > > > +   spin_lock(&(r)->producer_lock);
> > > > +
> > > > +   if (!r->size)
> > > > +   goto done;
> > > > +
> > > > +   /*
> > > > +* Clean out buffered entries (for simplicity). This way 
> > > > following code
> > > > +* can test entries for NULL and if not assume they are valid.
> > > > +*/
> > > > +   head = r->consumer_head - 1;
> > > > +   while (likely(head >= r->consumer_tail))
> > > > +   r->queue[head--] = NULL;
> > > > +   r->consumer_tail = r->consumer_head;
> > > > +
> > > > +   /*
> > > > +* Go over entries in batch, start moving head back and copy 
> > > > entries.
> > > > +* Stop when we run into previously unconsumed entries.
> > > > +*/
> > > > +   while (n--) {
> > > > +   head = r->consumer_head - 1;
> > > > +   if (head < 0)
> > > > +   head = r->size - 1;
> > > > +   if (r->queue[head]) {
> > > > +   /* This batch entry will have to be destroyed. 
> > > > */
> > > > +   ++n;
> > > > +   goto done;
> > > > +

Re: net: cleanup_net is slow

2017-04-24 Thread Andrey Konovalov

On Fri, Apr 21, 2017 at 9:45 PM, Florian Westphal  wrote:
> Florian Westphal  wrote:
>> Indeed.  Setting net.netfilter.nf_conntrack_default_on=0 cuts time
>> cleanup time by 2/3 ...
>>
>> nf unregister is way too happy to issue synchronize_net(), I'll work on
>> a fix.
>
> I'll test this patch as a start.  Maybe we can also leverage exit_batch
> more on netfilter side.

Hi Florian,

Your patch improves fuzzing speed at least twice, which is a great start!

Thanks!

>
> diff --git a/net/netfilter/core.c b/net/netfilter/core.c
> index a87a6f8a74d8..08fe1f526265 100644
> --- a/net/netfilter/core.c
> +++ b/net/netfilter/core.c
> @@ -126,14 +126,15 @@ int nf_register_net_hook(struct net *net, const struct 
> nf_hook_ops *reg)
>  }
>  EXPORT_SYMBOL(nf_register_net_hook);
>
> -void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
> +static struct nf_hook_entry *
> +__nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
>  {
> struct nf_hook_entry __rcu **pp;
> struct nf_hook_entry *p;
>
> pp = nf_hook_entry_head(net, reg);
> if (WARN_ON_ONCE(!pp))
> -   return;
> +   return NULL;
>
> mutex_lock(_hook_mutex);
> for (; (p = nf_entry_dereference(*pp)) != NULL; pp = >next) {
> @@ -145,7 +146,7 @@ void nf_unregister_net_hook(struct net *net, const struct 
> nf_hook_ops *reg)
> mutex_unlock(_hook_mutex);
> if (!p) {
> WARN(1, "nf_unregister_net_hook: hook not found!\n");
> -   return;
> +   return NULL;
> }
>  #ifdef CONFIG_NETFILTER_INGRESS
> if (reg->pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_INGRESS)
> @@ -154,6 +155,17 @@ void nf_unregister_net_hook(struct net *net, const 
> struct nf_hook_ops *reg)
>  #ifdef HAVE_JUMP_LABEL
> static_key_slow_dec(_hooks_needed[reg->pf][reg->hooknum]);
>  #endif
> +
> +   return p;
> +}
> +
> +void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
> +{
> +   struct nf_hook_entry *p = __nf_unregister_net_hook(net, reg);
> +
> +   if (!p)
> +   return;
> +
> synchronize_net();
> nf_queue_nf_hook_drop(net, p);
> /* other cpu might still process nfqueue verdict that used reg */
> @@ -183,10 +195,36 @@ int nf_register_net_hooks(struct net *net, const struct 
> nf_hook_ops *reg,
>  EXPORT_SYMBOL(nf_register_net_hooks);
>
>  void nf_unregister_net_hooks(struct net *net, const struct nf_hook_ops *reg,
> -unsigned int n)
> +unsigned int hookcount)
>  {
> -   while (n-- > 0)
> -   nf_unregister_net_hook(net, [n]);
> +   struct nf_hook_entry *to_free[16];
> +   unsigned int i, n;
> +
> +   WARN_ON_ONCE(hookcount > ARRAY_SIZE(to_free));
> +
> + next_round:
> +   n = min_t(unsigned int, hookcount, ARRAY_SIZE(to_free));
> +
> +   for (i = 0; i < n; i++)
> +   to_free[i] = __nf_unregister_net_hook(net, [i]);
> +
> +   synchronize_net();
> +
> +   for (i = 0; i < n; i++) {
> +   if (to_free[i])
> +   nf_queue_nf_hook_drop(net, to_free[i]);
> +   }
> +
> +   synchronize_net();
> +
> +   for (i = 0; i < n; i++)
> +   kfree(to_free[i]);
> +
> +   if (n < hookcount) {
> +   hookcount -= n;
> +   reg += n;
> +   goto next_round;
> +   }
>  }
>  EXPORT_SYMBOL(nf_unregister_net_hooks);
>

Re: [PATCH RFC] ptr_ring: add ptr_ring_unconsume

2017-04-24 Thread Jason Wang




On 2017年04月24日 07:28, Michael S. Tsirkin wrote:

On Tue, Apr 18, 2017 at 11:07:42AM +0800, Jason Wang wrote:


On 2017年04月17日 07:19, Michael S. Tsirkin wrote:

Applications that consume a batch of entries in one go
can benefit from ability to return some of them back
into the ring.

Add an API for that - assuming there's space. If there's no space
naturally we can't do this and have to drop entries, but this implies
ring is full so we'd likely drop some anyway.

Signed-off-by: Michael S. Tsirkin 
---

Jason, in my mind the biggest issue with your batching patchset is the
backet drops on disconnect.  This API will help avoid that in the common
case.

Ok, I will rebase the series on top of this. (Though I don't think we care
the packet loss).

E.g. I care - I often start sending packets to VM before it's
fully booted. Several vhost resets might follow.


Ok.




I would still prefer that we understand what's going on,

I try to reply in another thread, does it make sense?


   and I would
like to know what's the smallest batch size that's still helpful,

Yes, I've replied in another thread, the result is:


no batching   1.88Mpps
RX_BATCH=11.93Mpps
RX_BATCH=42.11Mpps
RX_BATCH=16   2.14Mpps
RX_BATCH=64   2.25Mpps
RX_BATCH=256  2.18Mpps

Essentially 4 is enough, other stuf looks more like noise
to me. What about 2?


The numbers are pretty stable, so probably not noise. Retested on top of 
batch zeroing:


no  1.97Mpps
1   2.09Mpps
2   2.11Mpps
4   2.16Mpps
8   2.19Mpps
16  2.21Mpps
32  2.25Mpps
64  2.30Mpps
128 2.21Mpps
256 2.21Mpps

64 performs best.

Thanks




   but
I'm not going to block the patch on these grounds assuming packet drops
are fixed.

Thanks a lot.


Lightly tested - this is on top of consumer batching patches.

Thanks!

   include/linux/ptr_ring.h | 57 

   1 file changed, 57 insertions(+)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 783e7f5..5fbeab4 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -457,6 +457,63 @@ static inline int ptr_ring_init(struct ptr_ring *r, int 
size, gfp_t gfp)
return 0;
   }
+/*
+ * Return entries into ring. Destroy entries that don't fit.
+ *
+ * Note: this is expected to be a rare slow path operation.
+ *
+ * Note: producer lock is nested within consumer lock, so if you
+ * resize you must make sure all uses nest correctly.
+ * In particular if you consume ring in interrupt or BH context, you must
+ * disable interrupts/BH when doing so.
+ */
+static inline void ptr_ring_unconsume(struct ptr_ring *r, void **batch, int n,
+ void (*destroy)(void *))
+{
+   unsigned long flags;
+   int head;
+
+   spin_lock_irqsave(&(r)->consumer_lock, flags);
+   spin_lock(&(r)->producer_lock);
+
+   if (!r->size)
+   goto done;
+
+   /*
+* Clean out buffered entries (for simplicity). This way following code
+* can test entries for NULL and if not assume they are valid.
+*/
+   head = r->consumer_head - 1;
+   while (likely(head >= r->consumer_tail))
+   r->queue[head--] = NULL;
+   r->consumer_tail = r->consumer_head;
+
+   /*
+* Go over entries in batch, start moving head back and copy entries.
+* Stop when we run into previously unconsumed entries.
+*/
+   while (n--) {
+   head = r->consumer_head - 1;
+   if (head < 0)
+   head = r->size - 1;
+   if (r->queue[head]) {
+   /* This batch entry will have to be destroyed. */
+   ++n;
+   goto done;
+   }
+   r->queue[head] = batch[n];
+   r->consumer_tail = r->consumer_head = head;
+   }
+
+done:
+   /* Destroy all entries left in the batch. */
+   while (n--) {
+   destroy(batch[n]);
+   }
+   spin_unlock(&(r)->producer_lock);
+   spin_unlock_irqrestore(&(r)->consumer_lock, flags);
+}
+
   static inline void **__ptr_ring_swap_queue(struct ptr_ring *r, void **queue,
   int size, gfp_t gfp,
   void (*destroy)(void *))

[PATCH v2] brcmfmac: Ensure pointer correctly set if skb data location changes

2017-04-24 Thread James Hughes

The incoming skb header may be resized if header space is
insufficient, which might change the data adddress in the skb.
Ensure that a cached pointer to that data is correctly set by
moving assignment to after any possible changes.

Signed-off-by: James Hughes 

Acked-by: Arend van Spriel 
---
Changes in v2

Moved the assignment below the len check - saves a few cycles 
if the length check fails. 


 drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c 
b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
index 5eaac13e2317..9b7c19a508ac 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
@@ -198,7 +198,7 @@ static netdev_tx_t brcmf_netdev_start_xmit(struct sk_buff 
*skb,
int ret;
struct brcmf_if *ifp = netdev_priv(ndev);
struct brcmf_pub *drvr = ifp->drvr;
-   struct ethhdr *eh = (struct ethhdr *)(skb->data);
+   struct ethhdr *eh;
 
brcmf_dbg(DATA, "Enter, bsscfgidx=%d\n", ifp->bsscfgidx);
 
@@ -236,6 +236,8 @@ static netdev_tx_t brcmf_netdev_start_xmit(struct sk_buff 
*skb,
goto done;
}
 
+   eh = (struct ethhdr *)(skb->data);
+
if (eh->h_proto == htons(ETH_P_PAE))
atomic_inc(>pend_8021x_cnt);
 
-- 
2.11.0

Re: [PATCH net] bridge: shutdown bridge device before removing it

2017-04-24 Thread Nikolay Aleksandrov

On 24/04/17 14:01, Nikolay Aleksandrov wrote:
> On 24/04/17 10:25, Xin Long wrote:
>> During removing a bridge device, if the bridge is still up, a new mdb entry
>> still can be added in br_multicast_add_group() after all mdb entries are
>> removed in br_multicast_dev_del(). Like the path:
>>
>>   mld_ifc_timer_expire ->
>> mld_sendpack -> ...
>>   br_multicast_rcv ->
>> br_multicast_add_group
>>
>> The new mp's timer will be set up. If the timer expires after the bridge
>> is freed, it may cause use-after-free panic in br_multicast_group_expired.
>> This can happen when ip link remove a bridge or destroy a netns with a
>> bridge device inside.
>>
>> As we can see in br_del_bridge, brctl is also supposed to remove a bridge
>> device after it's shutdown.
>>
>> This patch is to call dev_close at the beginning of br_dev_delete so that
>> netif_running check in br_multicast_add_group can avoid this issue. But
>> to keep consistent with before, it will not remove the IFF_UP check in
>> br_del_bridge for brctl.
>>
>> Reported-by: Jianwen Ji 
>> Signed-off-by: Xin Long 
>> ---
>>  net/bridge/br_if.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
> 
> +CC bridge maintainers
> 
> I can see how this could happen, could you also provide the traceback ?
> 
> The patch looks good to me, actually I think it fixes another issue with
> mcast stats where the percpu pointer can be accessed after it's freed if
> an mcast packet can get sent via br->dev after the br_multicast_dev_del() 
> call.

Never mind the another issue part, Ido's recent ndo_uninit() patch fixed it.

> This is definitely stable material, if I'm not mistaken the issue is there 
> since
> the introduction of br_dev_delete:
> commit e10177abf842
> Author: Satish Ashok 
> Date:   Wed Jul 15 07:16:51 2015 -0700
> 
> bridge: multicast: fix handling of temp and perm entries
> 
> 
> 
> Acked-by: Nikolay Aleksandrov 
> 
> 
> 
> 
> 
> 
>

Re: [PATCH] brcmfmac: Ensure pointer correctly set if skb data location changes

2017-04-24 Thread Arend van Spriel


On 4/24/2017 12:52 PM, James Hughes wrote:

The incoming skb header may be resized if header space is
insufficient, which might change the data adddress in the skb.
Ensure that a cached pointer to that data is correctly set by
moving assignment to after any possible changes.


Thanks, James

Minor nit below...

You may add my acknowledgement:

Acked-by: Arend van Spriel 

Signed-off-by: James Hughes 
---
  drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c 
b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c


[...]


@@ -229,6 +229,8 @@ static netdev_tx_t brcmf_netdev_start_xmit(struct sk_buff 
*skb,
}
}
  
+	eh = (struct ethhdr *)(skb->data);

+


Please move after the length validation below.

Regards,
Arend


/* validate length for ether packet */
if (skb->len < sizeof(*eh)) {
ret = -EINVAL;

Re: [PATCH net] team: fix memory leaks

2017-04-24 Thread Jiri Pirko

Mon, Apr 24, 2017 at 12:29:16PM CEST, bianpan2...@163.com wrote:
>In functions team_nl_send_port_list_get() and
>team_nl_send_options_get(), pointer skb keeps the return value of
>nlmsg_new(). When the call to genlmsg_put() fails, the memory is not
>freed(). This will result in memory leak bugs.
>
>Fixes: 9b00cf2d1024 ("team: implement multipart netlink messages for options 
>transfers")
>Signed-off-by: Pan Bian 

Acked-by: Jiri Pirko

Re: [PATCH iproute2 net 3/8] tc/pedit: Introduce 'add' operation

2017-04-24 Thread Or Gerlitz

On Mon, Apr 24, 2017 at 10:52 AM, Amir Vadai  wrote:
> On Sun, Apr 23, 2017 at 01:44:51PM -0400, Jamal Hadi Salim wrote:
>> Thanks for the excellent work.

sure, it's Amir, you know..

>> On 17-04-23 08:53 AM, Amir Vadai wrote:

>> Mostly curious about hardware handling.

> As to hardware handling, Or did the implementation so he will answer
> better. AFAIK, current implementation doesn't allow partial field
> setting/adding, so such a rule can't be offloaded. I think it is only to
> make things simple and because there was no use case for that.
> To my knowledge, hardware should support such thing if needed.

yeah (currently no partial setting) and yeah (in the future yes, will
support partial setting of offset into
field and partial length to set from there) and no for partial addition.

The reason we decided to disallow partial addition in the FW API was
what do you do with the carry, e.g you
add a value to the 2nd nibble of IP address and there are carry bits,
where do you take them? this becomes
even more ugly if you are dealing with mac addresses. I guess once
there's use case for partial add offload,
we will revisit that.

Or.

RE: [PATCH] macsec: avoid heap overflow in skb_to_sgvec

2017-04-24 Thread David Laight

From: Jason A. Donenfeld
> Sent: 21 April 2017 22:15
> While this may appear as a humdrum one line change, it's actually quite
> important. An sk_buff stores data in three places:
> 
> 1. A linear chunk of allocated memory in skb->data. This is the easiest
>one to work with, but it precludes using scatterdata since the memory
>must be linear.
> 2. The array skb_shinfo(skb)->frags, which is of maximum length
>MAX_SKB_FRAGS. This is nice for scattergather, since these fragments
>can point to different pages.
> 3. skb_shinfo(skb)->frag_list, which is a pointer to another sk_buff,
>which in turn can have data in either (1) or (2).
> 
> The first two are rather easy to deal with, since they're of a fixed
> maximum length, while the third one is not, since there can be
> potentially limitless chains of fragments. Fortunately dealing with
> frag_list is opt-in for drivers, so drivers don't actually have to deal
> with this mess. For whatever reason, macsec decided it wanted pain, and
> so it explicitly specified NETIF_F_FRAGLIST.
> 
> Because dealing with (1), (2), and (3) is insane, most users of sk_buff
> doing any sort of crypto or paging operation calls a convenient function
> called skb_to_sgvec (which happens to be recursive if (3) is in use!).
> This takes a sk_buff as input, and writes into its output pointer an
> array of scattergather list items. Sometimes people like to declare a
> fixed size scattergather list on the stack; othertimes people like to
> allocate a fixed size scattergather list on the heap. However, if you're
> doing it in a fixed-size fashion, you really shouldn't be using
> NETIF_F_FRAGLIST too (unless you're also ensuring the sk_buff and its
> frag_list children arent't shared and then you check the number of
> fragments in total required.)
> 
> Macsec specifically does this:
> 
> size += sizeof(struct scatterlist) * (MAX_SKB_FRAGS + 1);
> tmp = kmalloc(size, GFP_ATOMIC);
> *sg = (struct scatterlist *)(tmp + sg_offset);
>   ...
> sg_init_table(sg, MAX_SKB_FRAGS + 1);
> skb_to_sgvec(skb, sg, 0, skb->len);
> 
> Specifying MAX_SKB_FRAGS + 1 is the right answer usually, but not if you're
> using NETIF_F_FRAGLIST, in which case the call to skb_to_sgvec will
> overflow the heap, and disaster ensues.
...

Shouldn't skb_to_sgvec() be checking the number of fragments against
the size of the sg list?
The callers would then all need auditing to allow for failure.

David

Re: [PATCH net] bridge: shutdown bridge device before removing it

2017-04-24 Thread Nikolay Aleksandrov

On 24/04/17 10:25, Xin Long wrote:
> During removing a bridge device, if the bridge is still up, a new mdb entry
> still can be added in br_multicast_add_group() after all mdb entries are
> removed in br_multicast_dev_del(). Like the path:
> 
>   mld_ifc_timer_expire ->
> mld_sendpack -> ...
>   br_multicast_rcv ->
> br_multicast_add_group
> 
> The new mp's timer will be set up. If the timer expires after the bridge
> is freed, it may cause use-after-free panic in br_multicast_group_expired.
> This can happen when ip link remove a bridge or destroy a netns with a
> bridge device inside.
> 
> As we can see in br_del_bridge, brctl is also supposed to remove a bridge
> device after it's shutdown.
> 
> This patch is to call dev_close at the beginning of br_dev_delete so that
> netif_running check in br_multicast_add_group can avoid this issue. But
> to keep consistent with before, it will not remove the IFF_UP check in
> br_del_bridge for brctl.
> 
> Reported-by: Jianwen Ji 
> Signed-off-by: Xin Long 
> ---
>  net/bridge/br_if.c | 2 ++
>  1 file changed, 2 insertions(+)
> 

+CC bridge maintainers

I can see how this could happen, could you also provide the traceback ?

The patch looks good to me, actually I think it fixes another issue with
mcast stats where the percpu pointer can be accessed after it's freed if
an mcast packet can get sent via br->dev after the br_multicast_dev_del() call.
This is definitely stable material, if I'm not mistaken the issue is there since
the introduction of br_dev_delete:
commit e10177abf842
Author: Satish Ashok 
Date:   Wed Jul 15 07:16:51 2015 -0700

bridge: multicast: fix handling of temp and perm entries



Acked-by: Nikolay Aleksandrov

Re: PROBLEM: IPVS incorrectly reverse-NATs traffic to LVS host

2017-04-24 Thread Nick Moriarty

Hi Julian,

Many thanks for your prompt fix!  I've now tested this patch against
the 4.11.0-rc8 kernel on Ubuntu, and I can confirm that my check
script is no longer seeing incorrect addresses in its responses.

Could you please keep me posted as this is merged?

Thanks again

On 22 April 2017 at 18:06, Julian Anastasov  wrote:
>
> Hello,
>
> On Wed, 12 Apr 2017, Nick Moriarty wrote:
>
>> Hi,
>>
>> I've experienced a problem in how traffic returning to an LVS host is
>> handled in certain circumstances.  Please find a bug report below - if
>> there's any further information you'd like, please let me know.
>>
>> [1.] One line summary of the problem:
>> IPVS incorrectly reverse-NATs traffic to LVS host
>>
>> [2.] Full description of the problem/report:
>> When using IPVS in direct-routing mode, normal traffic from the LVS
>> host to a back-end server is sometimes incorrectly NATed on the way
>> back into the LVS host.  Using tcpdump shows that the return packets
>> have the correct source IP, but by the time it makes it back to the
>> application, it's been changed.
>>
>> To reproduce this, a configuration such as the following will work:
>> - Set up an LVS system with a VIP serving UDP to a backend DNS server
>> using the direct-routing method in IPVS
>> - Make an outgoing UDP request to the VIP from the LVS system itself
>> (this causes a connection to be added to the IPVS connection table)
>> - The request should succeed as normal
>> - Note the UDP source port used
>> - Within 5 minutes (before the UDP connection entry expires), make an
>> outgoing UDP request directly to the backend DNS server
>> - The request will fail as the reply is incorrectly modified on its
>> way back and appears to return from the VIP
>>
>> Monitoring the above sequence with tcpdump verifies that the returned
>> packet (as it enters the host) is from the DNS IP, even though the
>> application sees the VIP.
>>
>> If an outgoing request direct to the DNS server is made from a port
>> not in the connection table, everything is fine.
>
> Thanks for the detailed report! I think, I fixed the
> problem. Let me know if you are able to test the appended fix.
>
>> I expect that somewhere, something (e.g. functionality for IPVS MASQ
>> responses) is applying IPVS connection
>> information to incoming traffic, matching a DROUTE rule, and treating
>> it as NAT traffic.
>
> Yep, that is what happens.
>
> 
>
> [PATCH net] ipvs: SNAT packet replies only for NATed connections
>
> We do not check if packet from real server is for NAT
> connection before performing SNAT. This causes problems
> for setups that use DR/TUN and allow local clients to
> access the real server directly, for example:
>
> - local client in director creates IPVS-DR/TUN connection
> CIP->VIP and the request packets are routed to RIP.
> Talks are finished but IPVS connection is not expired yet.
>
> - second local client creates non-IPVS connection CIP->RIP
> with same reply tuple RIP->CIP and when replies are received
> on LOCAL_IN we wrongly assign them for the first client
> connection because RIP->CIP matches the reply direction.
>
> The problem is more visible to local UDP clients but in rare
> cases it can happen also for TCP or remote clients when the
> real server sends the reply traffic via the director.
>
> So, better to be more precise for the reply traffic.
> As replies are not expected for DR/TUN connections, better
> to not touch them.
>
> Reported-by: Nick Moriarty 
> Signed-off-by: Julian Anastasov 
> ---
>  net/netfilter/ipvs/ip_vs_core.c | 19 ++-
>  1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> index db40050..ee44ed5 100644
> --- a/net/netfilter/ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -849,10 +849,8 @@ static int handle_response_icmp(int af, struct sk_buff 
> *skb,
>  {
> unsigned int verdict = NF_DROP;
>
> -   if (IP_VS_FWD_METHOD(cp) != 0) {
> -   pr_err("shouldn't reach here, because the box is on the "
> -  "half connection in the tun/dr module.\n");
> -   }
> +   if (IP_VS_FWD_METHOD(cp) != IP_VS_CONN_F_MASQ)
> +   goto ignore_cp;
>
> /* Ensure the checksum is correct */
> if (!skb_csum_unnecessary(skb) && ip_vs_checksum_complete(skb, ihl)) {
> @@ -886,6 +884,8 @@ static int handle_response_icmp(int af, struct sk_buff 
> *skb,
> ip_vs_notrack(skb);
> else
> ip_vs_update_conntrack(skb, cp, 0);
> +
> +ignore_cp:
> verdict = NF_ACCEPT;
>
>  out:
> @@ -1385,8 +1385,11 @@ ip_vs_out(struct netns_ipvs *ipvs, unsigned int 
> hooknum, struct sk_buff *skb, in
>  */
> cp = pp->conn_out_get(ipvs, af, skb, );
>
> -   if (likely(cp))
> +   if (likely(cp))

[PATCH] brcmfmac: Ensure pointer correctly set if skb data location changes

2017-04-24 Thread James Hughes

The incoming skb header may be resized if header space is
insufficient, which might change the data adddress in the skb.
Ensure that a cached pointer to that data is correctly set by
moving assignment to after any possible changes.

Signed-off-by: James Hughes 
---
 drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c 
b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
index 5eaac13e2317..934fe00e28a0 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
@@ -198,7 +198,7 @@ static netdev_tx_t brcmf_netdev_start_xmit(struct sk_buff 
*skb,
int ret;
struct brcmf_if *ifp = netdev_priv(ndev);
struct brcmf_pub *drvr = ifp->drvr;
-   struct ethhdr *eh = (struct ethhdr *)(skb->data);
+   struct ethhdr *eh;
 
brcmf_dbg(DATA, "Enter, bsscfgidx=%d\n", ifp->bsscfgidx);
 
@@ -229,6 +229,8 @@ static netdev_tx_t brcmf_netdev_start_xmit(struct sk_buff 
*skb,
}
}
 
+   eh = (struct ethhdr *)(skb->data);
+
/* validate length for ether packet */
if (skb->len < sizeof(*eh)) {
ret = -EINVAL;
-- 
2.11.0

Re: bluetooth 6lowpan interfaces are not virtual anymore

2017-04-24 Thread Luiz Augusto von Dentz

Hi Alex,

On Wed, Apr 19, 2017 at 8:43 PM, Alexander Aring  wrote:
> Hi,
>
> at first I want to clarify what my definition of virtual and non virtual
> interface is:
>
> - virtual interfaces: has no queue(s)
> - non virtual interfaces: has a queue(s)
>
> I did some "big" ASCII graphic what I think it will do now.
> At first the virtual interface:
>
> --- snip ---
>
> Transmit side only! Case of virtual interface.
>
> IPv6 Stack
>
> +-+
> |IPv6 Packet (ND, etc)|
> +--+--+
>|
>   -+-
> 6LoWPAN Interface  |
>|
> +--v--+
> | Adaptation (IPv6 to 6Lo)|
> +--+--+
> +--+
> |
> | ---
> Link|Layer (802.15.4/hci_dev/etc)
> |
> |  SKB Queue (With kind of discipline)
> |   ++---+-+
> +--->  SKB1  | SKBN  | ... ++
> ++---+-+|
> |
> |
>     |
> Real Transeiver Hardware|
>++
>|
>   +v---+
>   |Framebuffers (YOUR hardware resource)   |
>   ++
>
> --- snap ---
>
> The non virtual interface:
>
> --- snip ---
>
> Transmit side only! Case of non virtual interface.
>
> IPv6 Stack
>
> +-+
> |IPv6 Packet (ND, etc)|
> +--+--+
>|
>   -+-
> 6LoWPAN Interface  |
>|
> +--+
> |  /> Will be 
> queued if you stop the queue
> |  SKB Queue (With kind of discipline)  /--   Because 
> Link Layer is full/software limitation
> |   ++---+-+
> +--->  SKB1  | SKBN  | ... ++
> ++---+-+|
> |
>++
>|
> +--v--+
> | Adaptation (IPv6 to 6Lo)|
> +--+--+
> +--+
> |
> | ---
> Link|Layer (802.15.4/hci_dev/etc)
> |  -> 
> software limitation (tx credits?)
> |  SKB Queue (With kind of discipline) ---/   Right 
> position to make different handling (for virtual case).
> |   ++---+/+
> +--->  SKB1  | SKBN  | ... ++
> ++---+-+|
> |
> |
>     |
> Real Transeiver Hardware|
>++
>|
>   +v---+
>   |Framebuffers (YOUR hardware resource)   |
>   ++
>
>
> --- snap ---
>
> Adding a queue to a interface which does a protocol translation only is
> in my opinion: wrong.
>
> At least if you want to try to make a more efficient qdisc handling,
> means dropping skb's in queue when userspace sends too much. You will
> change it back, control two queues seems to be difficult.

It is wrong only if you look at 6LoWPAN interface as being owned by
6lowpan modules, but it doesn't, in fact it won't anything by itself
so it basically acts as a library to that perform 6LoWPAN operation on
the packet level, thus why it

[PATCH net] team: fix memory leaks

2017-04-24 Thread Pan Bian

In functions team_nl_send_port_list_get() and
team_nl_send_options_get(), pointer skb keeps the return value of
nlmsg_new(). When the call to genlmsg_put() fails, the memory is not
freed(). This will result in memory leak bugs.

Fixes: 9b00cf2d1024 ("team: implement multipart netlink messages for options 
transfers")
Signed-off-by: Pan Bian 
---
 drivers/net/team/team.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index f8c81f1..85c0124 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -2361,8 +2361,10 @@ static int team_nl_send_options_get(struct team *team, 
u32 portid, u32 seq,
 
hdr = genlmsg_put(skb, portid, seq, _nl_family, flags | 
NLM_F_MULTI,
  TEAM_CMD_OPTIONS_GET);
-   if (!hdr)
+   if (!hdr) {
+   nlmsg_free(skb);
return -EMSGSIZE;
+   }
 
if (nla_put_u32(skb, TEAM_ATTR_TEAM_IFINDEX, team->dev->ifindex))
goto nla_put_failure;
@@ -2634,8 +2636,10 @@ static int team_nl_send_port_list_get(struct team *team, 
u32 portid, u32 seq,
 
hdr = genlmsg_put(skb, portid, seq, _nl_family, flags | 
NLM_F_MULTI,
  TEAM_CMD_PORT_LIST_GET);
-   if (!hdr)
+   if (!hdr) {
+   nlmsg_free(skb);
return -EMSGSIZE;
+   }
 
if (nla_put_u32(skb, TEAM_ATTR_TEAM_IFINDEX, team->dev->ifindex))
goto nla_put_failure;
-- 
1.9.1

Re: [PATCH net v2 1/3] net: hns: support deferred probe when can not obtain irq

2017-04-24 Thread Matthias Brugger


On 21/04/17 09:44, Yankejian wrote:

From: lipeng 

In the hip06 and hip07 SoCs, the interrupt lines from the
DSAF controllers are connected to mbigen hw module.
The mbigen module is probed with module_init, and, as such,
is not guaranteed to probe before the HNS driver. So we need
to support deferred probe.

We check for probe deferral in the hw layer probe, so we not
probe into the main layer and memories, etc., to later learn
that we need to defer the probe.



Why? This looks like a hack.
From what I see, we can handle EPROBE_DEFER easily inside hns_ppe_init 
checking the return value of hns_rcb_get_cfg. Like you do in 2/3 of this 
series.


Regards,
Matthias


Signed-off-by: lipeng 
Reviewed-by: Yisen Zhuang 
---
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c 
b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
index 403ea9d..2da5b42 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
@@ -2971,6 +2971,18 @@ static int hns_dsaf_probe(struct platform_device *pdev)
struct dsaf_device *dsaf_dev;
int ret;

+   /*
+* Check if we should defer the probe before we probe the
+* dsaf, as it's hard to defer later on.
+*/
+   ret = platform_get_irq(pdev, 0);
+   if (ret < 0) {
+   if (ret != -EPROBE_DEFER)
+   dev_err(>dev, "Cannot obtain irq\n");
+
+   return ret;
+   }
+
dsaf_dev = hns_dsaf_alloc_dev(>dev, sizeof(struct dsaf_drv_priv));
if (IS_ERR(dsaf_dev)) {
ret = PTR_ERR(dsaf_dev);

Re: [PATCH] ipvs: explicitly forbid ipv6 service/dest creation if ipv6 mod is disabled

2017-04-24 Thread Simon Horman

On Mon, Apr 24, 2017 at 10:21:30AM +0300, Julian Anastasov wrote:
> 
>   Hello,
> 
> On Mon, 24 Apr 2017, Paolo Abeni wrote:
> 
> > Hi,
> > 
> > The problem with the patched code is that it tries to resolve ipv6
> > addresses that are not created/validated by the kernel.
> 
>   OK. Simon, please apply to ipvs tree.
> 
> Acked-by: Julian Anastasov 

Thanks, done.

Re: [PATCH] net: ipv6: check route protocol when deleting routes

2017-04-24 Thread Lorenzo Colitti

On Fri, Dec 16, 2016 at 5:30 PM, Mantas Mikulėnas  wrote:
> The protocol field is checked when deleting IPv4 routes, but ignored for
> IPv6, which causes problems with routing daemons accidentally deleting
> externally set routes (observed by multiple bird6 users).
>
> This can be verified using `ip -6 route del  proto something`.

I think this change might have broken userspace deleting routes that
were created by RAs. This is because the rtm_protocol returned to
userspace is actually synthesized only at route dump time by
rt6_fill_node:

else if (rt->rt6i_flags & RTF_ADDRCONF) {
if (rt->rt6i_flags & (RTF_DEFAULT | RTF_ROUTEINFO))
rtm->rtm_protocol = RTPROT_RA;
else
rtm->rtm_protocol = RTPROT_KERNEL;
}

but rt6_add_dflt_router and rt6_add_route_info add the route with a
protocol of 0, and 0 is silently upgraded to RTPROT_BOOT by
ip6_route_info_create.

if (cfg->fc_protocol == RTPROT_UNSPEC)
cfg->fc_protocol = RTPROT_BOOT;
rt->rt6i_protocol = cfg->fc_protocol;

So an app that was previously trying to delete routes looking at
rtm_proto, and issuing a delete with whatever rtm_proto was returned
by netlink, will result in this check failing because its passed-in
protocol (RTPROT_RA or RTPROT_KERNEL) will not match the actual FIB
value, which is RTPROT_BOOT.

I can't easily test on a vanilla kernel, but on a system running a
slightly modified 4.4.63, I see the code fail like this:

# ip -6 route show
2001:db8:64::/64 dev nettest100  proto kernel  metric 256  expires 291sec
fe80::/64 dev nettest100  proto kernel  metric 256
default via fe80::6400 dev nettest100  proto ra  metric 1024  expires 291sec
# ip -6 route flush
Failed to send flush request: No such process
# ip -6 route show
default via fe80::6400 dev nettest100  proto ra  metric 1024  expires 286sec

If so, it seems unfortunate to have brought this into -stable.

For non-stable kernels, it seems that the proper fix would be:

1. Ensure that when an RA creates a route, it properly sets
rtm_protocol at time of route creation.
2. When we dump routes to userspace, we don't overwrite the rtm_protocol.

I can try to write that up, but I'm not sure why the code doesn't do
this already. Perhaps there's a good reason for it. Yoshifuji, Hannes,
any thoughts?

Re: [net 1/1] team: fix memory leaks

2017-04-24 Thread Jiri Pirko

Plus since you have only one patch, please do not do "1/1" in the
email subject.  Thanks.


Mon, Apr 24, 2017 at 11:36:52AM CEST, bianpan2...@163.com wrote:
>In functions team_nl_send_port_list_get() and
>team_nl_send_options_get(), pointer skb keeps the return value of
>nlmsg_new(). When the call to genlmsg_put() fails, the memory is not
>freed(). This will result in memory leak bugs.
>
>Fixes: 9b00cf2d1024 ("team: implement multipart netlink messages for
>options transfers")
>Signed-off-by: Pan Bian 
>---
> drivers/net/team/team.c | 8 ++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
>index f8c81f1..85c0124 100644
>--- a/drivers/net/team/team.c
>+++ b/drivers/net/team/team.c
>@@ -2361,8 +2361,10 @@ static int team_nl_send_options_get(struct team *team, 
>u32 portid, u32 seq,
> 
>   hdr = genlmsg_put(skb, portid, seq, _nl_family, flags | 
> NLM_F_MULTI,
> TEAM_CMD_OPTIONS_GET);
>-  if (!hdr)
>+  if (!hdr) {
>+  nlmsg_free(skb);
>   return -EMSGSIZE;
>+  }
> 
>   if (nla_put_u32(skb, TEAM_ATTR_TEAM_IFINDEX, team->dev->ifindex))
>   goto nla_put_failure;
>@@ -2634,8 +2636,10 @@ static int team_nl_send_port_list_get(struct team 
>*team, u32 portid, u32 seq,
> 
>   hdr = genlmsg_put(skb, portid, seq, _nl_family, flags | 
> NLM_F_MULTI,
> TEAM_CMD_PORT_LIST_GET);
>-  if (!hdr)
>+  if (!hdr) {
>+  nlmsg_free(skb);
>   return -EMSGSIZE;
>+  }
> 
>   if (nla_put_u32(skb, TEAM_ATTR_TEAM_IFINDEX, team->dev->ifindex))
>   goto nla_put_failure;
>-- 
>1.9.1
>
>

Re: [net 1/1] team: fix memory leaks

2017-04-24 Thread Jiri Pirko

Interesting. In last reply, I put the "[]" prefix example exactly as it
should be, yet you managed to have it wrong...


Mon, Apr 24, 2017 at 11:36:52AM CEST, bianpan2...@163.com wrote:
>In functions team_nl_send_port_list_get() and
>team_nl_send_options_get(), pointer skb keeps the return value of
>nlmsg_new(). When the call to genlmsg_put() fails, the memory is not
>freed(). This will result in memory leak bugs.
>
>Fixes: 9b00cf2d1024 ("team: implement multipart netlink messages for
>options transfers")

No linewraps here, please.



>Signed-off-by: Pan Bian 
>---
> drivers/net/team/team.c | 8 ++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
>index f8c81f1..85c0124 100644
>--- a/drivers/net/team/team.c
>+++ b/drivers/net/team/team.c
>@@ -2361,8 +2361,10 @@ static int team_nl_send_options_get(struct team *team, 
>u32 portid, u32 seq,
> 
>   hdr = genlmsg_put(skb, portid, seq, _nl_family, flags | 
> NLM_F_MULTI,
> TEAM_CMD_OPTIONS_GET);
>-  if (!hdr)
>+  if (!hdr) {
>+  nlmsg_free(skb);
>   return -EMSGSIZE;
>+  }
> 
>   if (nla_put_u32(skb, TEAM_ATTR_TEAM_IFINDEX, team->dev->ifindex))
>   goto nla_put_failure;
>@@ -2634,8 +2636,10 @@ static int team_nl_send_port_list_get(struct team 
>*team, u32 portid, u32 seq,
> 
>   hdr = genlmsg_put(skb, portid, seq, _nl_family, flags | 
> NLM_F_MULTI,
> TEAM_CMD_PORT_LIST_GET);
>-  if (!hdr)
>+  if (!hdr) {
>+  nlmsg_free(skb);
>   return -EMSGSIZE;
>+  }
> 
>   if (nla_put_u32(skb, TEAM_ATTR_TEAM_IFINDEX, team->dev->ifindex))
>   goto nla_put_failure;
>-- 
>1.9.1
>
>

Re: [PATCH v2 net-next] net: ipv6: send unsolicited NA if enabled for all interfaces

2017-04-24 Thread Simon Horman

On Sat, Apr 22, 2017 at 09:10:13AM -0700, David Ahern wrote:
> When arp_notify is set to 1 for either a specific interface or for 'all'
> interfaces, gratuitous arp requests are sent. Since ndisc_notify is the
> ipv6 equivalent to arp_notify, it should follow the same semantics.
> Commit 4a6e3c5def13 ("net: ipv6: send unsolicited NA on admin up") sends
> the NA on admin up. The final piece is checking devconf_all->ndisc_notify
> in addition to the per device setting. Add it.
> 
> Fixes: 5cb04436eef6 ("ipv6: add knob to send unsolicited ND on link-layer 
> address change")
> Signed-off-by: David Ahern 

Reviewed-by: Simon Horman 

> ---
> v2
> - update commit message with subject of commit 4a6e3c5def13 per comment
>   from Sergei
> 
>  net/ipv6/ndisc.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
> index b23822e64228..d310dc41209a 100644
> --- a/net/ipv6/ndisc.c
> +++ b/net/ipv6/ndisc.c
> @@ -1753,7 +1753,8 @@ static int ndisc_netdev_event(struct notifier_block 
> *this, unsigned long event,
>   idev = in6_dev_get(dev);
>   if (!idev)
>   break;
> - if (idev->cnf.ndisc_notify)
> + if (idev->cnf.ndisc_notify ||
> + net->ipv6.devconf_all->ndisc_notify)
>   ndisc_send_unsol_na(dev);
>   in6_dev_put(idev);
>   break;
> -- 
> 2.1.4
>

[net 1/1] team: fix memory leaks

2017-04-24 Thread Pan Bian

In functions team_nl_send_port_list_get() and
team_nl_send_options_get(), pointer skb keeps the return value of
nlmsg_new(). When the call to genlmsg_put() fails, the memory is not
freed(). This will result in memory leak bugs.

Fixes: 9b00cf2d1024 ("team: implement multipart netlink messages for
options transfers")
Signed-off-by: Pan Bian 
---
 drivers/net/team/team.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index f8c81f1..85c0124 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -2361,8 +2361,10 @@ static int team_nl_send_options_get(struct team *team, 
u32 portid, u32 seq,
 
hdr = genlmsg_put(skb, portid, seq, _nl_family, flags | 
NLM_F_MULTI,
  TEAM_CMD_OPTIONS_GET);
-   if (!hdr)
+   if (!hdr) {
+   nlmsg_free(skb);
return -EMSGSIZE;
+   }
 
if (nla_put_u32(skb, TEAM_ATTR_TEAM_IFINDEX, team->dev->ifindex))
goto nla_put_failure;
@@ -2634,8 +2636,10 @@ static int team_nl_send_port_list_get(struct team *team, 
u32 portid, u32 seq,
 
hdr = genlmsg_put(skb, portid, seq, _nl_family, flags | 
NLM_F_MULTI,
  TEAM_CMD_PORT_LIST_GET);
-   if (!hdr)
+   if (!hdr) {
+   nlmsg_free(skb);
return -EMSGSIZE;
+   }
 
if (nla_put_u32(skb, TEAM_ATTR_TEAM_IFINDEX, team->dev->ifindex))
goto nla_put_failure;
-- 
1.9.1

Re: [PATCH net-next v4 1/2] net sched actions: dump more than TCA_ACT_MAX_PRIO actions per batch

2017-04-24 Thread Simon Horman

On Fri, Apr 21, 2017 at 02:11:00PM -0400, Jamal Hadi Salim wrote:
> On 17-04-21 12:12 PM, David Miller wrote:
> 
> >Yes for existing attributes we are stuck in the mud because of how
> >we've handled things in the past.  I'm not saying we should change
> >behavior for existing attributes.
> >
> >I'm talking about any newly added attribute from here on out, and
> >that we need to require checks for them.
> >
> 
> Please bear with me. I want to make sure to get this right.
> 
> Lets say I updated the kernel today to reject transactions with
> bits it didnt understand. Lets call this "old kernel". A tc that
> understands/sets these bits and nothing else. Call it "old tc".
> 3 months later:
> I add one more bit setting to introduce a new feature in a new
> kernel version. Lets call this new "kernel". I update to
> understand new bits. Call it "new tc".
> 
> The possibilities:
> a) old tc + old kernel combo. No problem
> b) new tc + new kernel combo. No problem.
> c) old tc + new kernel combo. No problem.
> d) new tc + old kernel. Rejection.
> 
> For #d if i have a smart tc it would retry with a new combination
> which restores its behavior to old tc level. Of course this means
> apps would have to be rewritten going forward to understand these
> mechanics.
> Alternative is to request for capabilities first then doing a
> lowest common denominator request.
> But even that is a lot more code and crossing user/kernel twice.

>From my PoV, for #d user-space has to either get smart or fail.

Creating new tc involved work in order to support the new feature.
Part of that work would be a decision weather or not to provide
compatibility for old kernel or to bail out gracefully.

> There is a simpler approach that would work going forward.
> How about letting the user choose their fate? Set something maybe
> in the netlink header to tell the kernel "if you dont understand
> something I am asking for - please ignore it and do what you can".
> This would maintain current behavior but would force the user to
> explicitly state so.
> 
> cheers,
> jamal
>

Re: [PATCH net-next v5 1/2] net sched actions: dump more than TCA_ACT_MAX_PRIO actions per batch

2017-04-24 Thread Simon Horman

On Thu, Apr 20, 2017 at 04:24:53PM +0200, Jiri Pirko wrote:
> Thu, Apr 20, 2017 at 04:18:50PM CEST, j...@mojatatu.com wrote:
> >On 17-04-20 09:59 AM, Jiri Pirko wrote:
> >> Thu, Apr 20, 2017 at 03:06:21PM CEST, j...@mojatatu.com wrote:
> >> > From: Jamal Hadi Salim 

...

> >> > +if (tcaa[TCAA_ACT_FLAGS])
> >> > +act_flags = nla_get_u32(tcaa[TCAA_ACT_FLAGS]);
> >> 
> >> I still believe this is wrong. Should be a separate attr per flag.
> >> For user experience breakage reasons:
> >> 2 kernels should not behave differently on the exact same value passed
> >> from userspace:
> >> User passes 0x2. Now the kernel will ignore the set bit, the next kernel
> >> will recognize it as a valid flag and do something.
> >> Please let the discussion reach a consensus before pushing this again.
> >> 
> >> 
> >
> >Jiri - I dont agree. There is no such breakage. Refer to my previous
> >email. Lets just move on.
> 
> Anyone else has opinion on this?

At the risk of jumping into a hornets nest, yes, I have an opinion:

* A agree with Jiri that a separate attribute per flag seems to be
  the cleanest option, however;
* I think it would be reasonable from a UABI PoV to permit currently unused
  bits of TCAA_ACT_FLAGS to be re-uses so long as the kernel checks that
  they are zero until they are designated to have some use. I believe this
  implies that the default value for any future uses of these bits would be
  zero.

Jamal, I am confused about why are you so concerned about the space
consumed by this attribute, it's per-message, right? Is it the bigger
picture you are worried about - a similar per-entry flag at some point in
the future?

Re: [PATCH v4 net-next] mdio_bus: Issue GPIO RESET to PHYs.

2017-04-24 Thread Roger Quadros

On 24/04/17 02:35, Andrew Lunn wrote:
> On Fri, Apr 21, 2017 at 03:31:09PM +0200, Lars-Peter Clausen wrote:
>> On 04/21/2017 03:15 PM, Roger Quadros wrote:
>>> diff --git a/Documentation/devicetree/bindings/net/mdio.txt 
>>> b/Documentation/devicetree/bindings/net/mdio.txt
>>> new file mode 100644
>>> index 000..4ffbbac
>>> --- /dev/null
>>> +++ b/Documentation/devicetree/bindings/net/mdio.txt
>>> @@ -0,0 +1,33 @@
>>> +Common MDIO bus properties.
>>> +
>>> +These are generic properties that can apply to any MDIO bus.
>>> +
>>> +Optional properties:
>>> +- reset-gpios: List of one or more GPIOs that control the RESET lines
>>> +  of the PHYs on that MDIO bus.
>>> +- reset-delay-us: RESET pulse width in microseconds as per PHY datasheet.
>>> +
>>> +A list of child nodes, one per device on the bus is expected. These
>>> +should follow the generic phy.txt, or a device specific binding document.
>>> +
>>> +Example :
>>> +This example shows these optional properties, plus other properties
>>> +required for the TI Davinci MDIO driver.
>>> +
>>> +   davinci_mdio: ethernet@0x5c03 {
>>> +   compatible = "ti,davinci_mdio";
>>> +   reg = <0x5c03 0x1000>;
>>> +   #address-cells = <1>;
>>> +   #size-cells = <0>;
>>> +
>>> +   reset-gpios = < 5 GPIO_ACTIVE_LOW>;
>>> +   reset-delay-us = <2>;   /* PHY datasheet states 1us min */
>>
>> If this is the reset line of the PHY shouldn't it be a property of the PHY
>> node rather than of the MDIO controller node (which might have a reset on
>> its own)?
>>> +
>>> +   ethphy0: ethernet-phy@1 {
>>> +   reg = <1>;
>>> +   };
>>> +
>>> +   ethphy1: ethernet-phy@3 {
>>> +   reg = <3>;
>>> +   };
> 
> Hi Lars-Peter
> 
> We discussed this when the first proposal was made. There are two
> cases, to consider.
> 
> 1) Here, one GPIO line resets all PHYs on the same MDIO bus. In this
> example, two PHYs.
> 
> 2) There is one GPIO line per PHY. That is a separate case, and as you
> say, the reset line should probably be considered a PHY property, not
> an MDIO property. However, it can be messy, since in order to probe
> the MDIO bus, you probably need to take the PHY out of reset.
> 
> Anyway, this patch addresses the first case, so should be accepted. If
> anybody wants to address the second case, they are free to do so.

Thanks for the explanation Andrew.

For the second case, even if the RESET GPIO property is specified
in the PHY node, the RESET *will* have to be done by the MDIO bus driver
else the PHY might not be probed at all.

Whether we need additional code to just to make the DT look prettier is
questionable and if required can come as a separate patch.

cheers,
-roger

Re: [RFC PATCH 3/7] net: add option to get information about timestamped packets

2017-04-24 Thread Miroslav Lichvar

On Thu, Apr 13, 2017 at 12:16:09PM -0400, Willem de Bruijn wrote:
> On Thu, Apr 13, 2017 at 11:18 AM, Miroslav Lichvar  
> wrote:
> > On Thu, Apr 13, 2017 at 10:37:07AM -0400, Willem de Bruijn wrote:
> >> Why is this L2 length needed?
> >
> > It's needed for incoming packets to allow converting of preamble
> > timestamps to trailer timestamps.
> 
> Receiving the mac length of a packet sounds like a feature independent
> from timestamping.

I agree, but so far nobody suggested another use for this information.
Do you have any suggestions?

The idea was that if it is useful only with HW timestamping, it would
be better to save it only with the timestamp, so there is no
performance impact in the more common case when HW timestamping is
disabled. Am I overly cautious here?

> Either an ioctl similar to SIOCGIFMTU or, if it may
> vary due to existince of vlan headers, a new independent cmsg at the
> SOL_SOCKET layer.

It's not just the VLAN headers. The length of the IP header may vary
with IP options, so the offset of the UDP data in the packet cannot be
assumed to be constant.

Now I'm wondering if it's actually necessary to save the original
value of skb->mac_len + skb->len. Would "skb->data - skb->head -
skb->mac_header + skb->len" always work as the L2 length for received
packets at the time when the cmsg is prepared?

As for the original ifindex, it seems to me it does need to be saved
to a new field since __netif_receive_skb_core() intentionally
overwrites skb->skb_iif. What would be the best place for it, sk_buff
or skb_shared_info?

And would it really be acceptable to save it for all packets in
__netif_receive_skb_core(), even when HW timestamping is disabled?
Seeing how the code and the data structures were optimized over time,
I have a feeling it would not be accepted.

-- 
Miroslav Lichvar

Re: [PATCH] brcm80211: brcmfmac: Ensure that incoming skb's are writable

2017-04-24 Thread James Hughes

On 23 April 2017 at 20:34, Arend Van Spriel
 wrote:
> On 21-4-2017 11:22, James Hughes wrote:
>> On 20 April 2017 at 20:48, Arend van Spriel
>>  wrote:
>>> + linux-wireless
>>>
>>> On 4/20/2017 1:16 PM, James Hughes wrote:

 The driver was adding header information to incoming skb
 without ensuring the head was uncloned and hence writable.

 skb_cow_head has been used to ensure they are writable, however,
 this required some changes to error handling to ensure that
 if skb_cow_head failed it was not ignored.

 This really needs to be reviewed by someone who is more familiar
 with this code base to ensure any deallocation of skb's is
 still correct.

 Signed-off-by: James Hughes 
 ---
   .../wireless/broadcom/brcm80211/brcmfmac/bcdc.c| 15 --
   .../wireless/broadcom/brcm80211/brcmfmac/core.c| 23 +---
   .../broadcom/brcm80211/brcmfmac/fwsignal.c | 32
 +-
   .../wireless/broadcom/brcm80211/brcmfmac/sdio.c|  7 -
   4 files changed, 51 insertions(+), 26 deletions(-)

>>>
>>> [...]
>>>
 diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
 b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
 index 5eaac13..08272e8 100644
 --- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
 +++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
 @@ -198,7 +198,7 @@ static netdev_tx_t brcmf_netdev_start_xmit(struct
 sk_buff *skb,
 int ret;
 struct brcmf_if *ifp = netdev_priv(ndev);
 struct brcmf_pub *drvr = ifp->drvr;
 -   struct ethhdr *eh = (struct ethhdr *)(skb->data);
 +   struct ethhdr *eh;
 brcmf_dbg(DATA, "Enter, bsscfgidx=%d\n", ifp->bsscfgidx);
   @@ -212,23 +212,14 @@ static netdev_tx_t brcmf_netdev_start_xmit(struct
 sk_buff *skb,
 }
 /* Make sure there's enough room for any header */
 -   if (skb_headroom(skb) < drvr->hdrlen) {
 -   struct sk_buff *skb2;
 -
 -   brcmf_dbg(INFO, "%s: insufficient headroom\n",
 - brcmf_ifname(ifp));
 -   drvr->bus_if->tx_realloc++;
 -   skb2 = skb_realloc_headroom(skb, drvr->hdrlen);
 -   dev_kfree_skb(skb);
 -   skb = skb2;
 -   if (skb == NULL) {
 -   brcmf_err("%s: skb_realloc_headroom failed\n",
 - brcmf_ifname(ifp));
 -   ret = -ENOMEM;
 -   goto done;
 -   }
>>>
>>>
>>> What you are throwing away here is code that assures there is sufficient
>>> headroom for protocol and bus layer in the tx path, because that is
>>> determined by drvr->hdrlen. This is where the skb is handed to the driver so
>>> if you could leave the functionality above *and* assure it is writeable that
>>> would be the best solution as there is no need for all the other changes
>>> down the tx path.
>>
>> The skb_cow_head function takes the required headroom as a parameter
>> and will ensure that there is enough space, so I don't think this code
>> segment is
>> required or have I misunderstood what you mean here?
>
> I looked more into what skb_cow_head() and skb_realloc_headroom()
> actually do and it seems you are right.
>
>> Is it safe to rely on the _cow_ being done here and not further down
>> in the stack?
>> Or at least checked further down in the stack. Previous comments from
>> another patch
>> requested that the _cow_ be done close to the actual addition of the
>> header. I presume
>> it is unlikely/impossible that the functions that add header
>> information  down the stack
>> will be called without the above being done first?
>
> It is safe. During probe sequence each layer in the driver stack
> increments drvr->hdrlen with headroom it needs. So drvr->hdrlen will
> indicate the total headroom needed when .start_xmit() is called. And
> yes, the .start_xmit() is the single point of entry in the transmit
> path. So the other locations where skb_push() is done are safe.
>

OK, I will redo the patch (making it v3 + changelogs), taking this in
to account.


>>>
 +   ret = skb_cow_head(skb, drvr->hdrlen);
 +   if (ret) {
>>>
>>>
>>> So move the realloc code above here instead of simply freeing the skb.
>>>
 +   dev_kfree_skb_any(skb);
 +   goto done;
 }
   + eh = (struct ethhdr *)(skb->data);
>>>
>>>
>>> Now this is actually a separate fix so I would like a separate patch for it.
>>>
>>
>> No problem, but see final paragraph below.
>
> Let me see...

I'll split this fix out, but should I issue it in the same patch set
or as a separate patch?
And for a different patch,

Re: [PATCH v2 2/2] iproute2: add support for invisible qdisc dumping

2017-04-24 Thread Jiri Pirko

Wed, Mar 08, 2017 at 01:04:42PM CET, ji...@kernel.org wrote:
>From: Jiri Kosina 
>
>Support the new TCA_DUMP_INVISIBLE netlink attribute that allows asking 
>kernel to perform 'full qdisc dump', as for historical reasons some of the 
>default qdiscs are being hidden by the kernel.
>
>The command syntax is being extended by voluntary 'invisible' argument to
>'tc qdisc show'.
>
>Signed-off-by: Jiri Kosina 
>---

[...]


>@@ -325,7 +328,25 @@ static int tc_qdisc_list(int argc, char **argv)
>   filter_ifindex = t.tcm_ifindex;
>   }
> 
>-  if (rtnl_dump_request(, RTM_GETQDISC, , sizeof(t)) < 0) {
>+  if (dump_invisible) {
>+  struct {
>+  struct nlmsghdr n;
>+  struct tcmsg t;
>+  char buf[256];
>+  } req = {
>+  .n.nlmsg_type = RTM_GETQDISC,
>+  .n.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcmsg)),
>+  };
>+
>+  req.t.tcm_family = AF_UNSPEC;
>+
>+  addattr(, 256, TCA_DUMP_INVISIBLE);
>+  if (rtnl_dump_request_n(, ) < 0) {

Jirko, you ignore the values that were previously set in "t". I think
that you can change this to do rtnl_dump_request_n always and simplify
the code a bit.


>+  perror("Cannot send dump request");
>+  return 1;
>+  }
>+
>+  } else if (rtnl_dump_request(, RTM_GETQDISC, , sizeof(t)) < 0) {
>   perror("Cannot send dump request");
>   return 1;
>   }
>-- 
>Jiri Kosina
>SUSE Labs
>

Re: [Intel-wired-lan] [PATCH 1/2] e1000e: Don't return uninitialized stats

2017-04-24 Thread Paul Menzel


Dear Benjamin,


Thank you for your fix.

On 04/21/17 23:20, Benjamin Poirier wrote:

Some statistics passed to ethtool are garbage because e1000e_get_stats64()
doesn't write them, for example: tx_heartbeat_errors. This leaks kernel
memory to userspace and confuses users.


Could you please give specific examples to reproduce the issue? That way 
your fix can also be tested.


[…]


Kind regards,

Paul

Re: [PATCH net-next] net: dsa: LAN9303: add i2c dependency

2017-04-24 Thread Juergen Borleis

Hi Tobias,

On Monday 24 April 2017 10:07:37 Tobias Regnery wrote:
> The Kconfig symbol for the I2C mode of the LAN9303 driver selects
> REGMAP_I2C. This symbol depends on I2C but the driver doesen't has a
> dependency on I2C.
>
> With CONFIG_I2C=n kconfig fails to select REGMAP_I2C and we get the
> following warning:
>
> warning: (NET_DSA_SMSC_LAN9303_I2C) selects REGMAP_I2C which has unmet
> direct dependencies (I2C)
>
> This results in a build failure because the regmap-i2c functions aren't
> available:
>
> drivers/base/regmap/regmap-i2c.c: In function 'regmap_smbus_byte_reg_read':
> drivers/base/regmap/regmap-i2c.c:29:8: error: implicit declaration of 
> function 'i2c_smbus_read_byte_data' [-Werror=implicit-function-declaration]
>
> drivers/base/regmap/regmap-i2c.c: In function 'regmap_smbus_byte_reg_write':
> drivers/base/regmap/regmap-i2c.c:47:9: error: implicit declaration of 
> function 'i2c_smbus_write_byte_data' [-Werror=implicit-function-declaration]
> ... 
>
> Fix this by adding a kconfig dependency on the I2C subsystem.
>
> Fixes: be4e119f9914 ("net: dsa: LAN9303: add I2C managed mode support")
> Signed-off-by: Tobias Regnery 
> ---
>  drivers/net/dsa/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/dsa/Kconfig b/drivers/net/dsa/Kconfig
> index 131a5b1cbfc8..862ee22303c2 100644
> --- a/drivers/net/dsa/Kconfig
> +++ b/drivers/net/dsa/Kconfig
> @@ -59,7 +59,7 @@ config NET_DSA_SMSC_LAN9303
>
>  config NET_DSA_SMSC_LAN9303_I2C
>   tristate "SMSC/Microchip LAN9303 3-ports 10/100 ethernet switch in I2C 
> managed mode"
> - depends on NET_DSA 
> + depends on NET_DSA && I2C
>   select NET_DSA_SMSC_LAN9303
>   select REGMAP_I2C
>   ---help---

Thanks, but already found and fixed by Arnd Bergmann.

jb

-- 
Pengutronix e.K.                             | Juergen Borleis             |
Industrial Linux Solutions                   | http://www.pengutronix.de/  |

Re: [Intel-wired-lan] [PATCH 1/2] e1000e: Don't return uninitialized stats

2017-04-24 Thread Neftin, Sasha

On 4/23/2017 15:53, Neftin, Sasha wrote:

-Original Message-
From: Intel-wired-lan [mailto:intel-wired-lan-boun...@lists.osuosl.org] On 
Behalf Of Benjamin Poirier
Sent: Saturday, April 22, 2017 00:20
To: Kirsher, Jeffrey T 
Cc: netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org; Stefan Priebe 

Subject: [Intel-wired-lan] [PATCH 1/2] e1000e: Don't return uninitialized stats

Some statistics passed to ethtool are garbage because e1000e_get_stats64() 
doesn't write them, for example: tx_heartbeat_errors. This leaks kernel memory 
to userspace and confuses users.

Do like ixgbe and use dev_get_stats() which first zeroes out rtnl_link_stats64.

Reported-by: Stefan Priebe 
Signed-off-by: Benjamin Poirier 
---
  drivers/net/ethernet/intel/e1000e/ethtool.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/e1000e/ethtool.c 
b/drivers/net/ethernet/intel/e1000e/ethtool.c
index 7aff68a4a4df..f117b90cdc2f 100644
--- a/drivers/net/ethernet/intel/e1000e/ethtool.c
+++ b/drivers/net/ethernet/intel/e1000e/ethtool.c
@@ -2063,7 +2063,7 @@ static void e1000_get_ethtool_stats(struct net_device 
*netdev,

  	pm_runtime_get_sync(netdev->dev.parent);

-	e1000e_get_stats64(netdev, _stats);

+   dev_get_stats(netdev, _stats);

  	pm_runtime_put_sync(netdev->dev.parent);

--

2.12.2

___
Intel-wired-lan mailing list
intel-wired-...@lists.osuosl.org
http://lists.osuosl.org/mailman/listinfo/intel-wired-lan

Hello,

We would like to not accept this patch. Suggested generic method 
'*dev_get_stats' (net/core/dev.c) calls 'ops->ndo_get_stats64' method 
which eventually calls e1000e_get_stats64 (netdev.c) - so there is same 
functionality. Also, see that 'e1000e_get_stats64' method in netdev.c 
(line 5928) calls 'memset' with 0's before update statistics.  Local 
sanity check in our lab shows 'tx_heartbeat_errors' counter reported as 0.

Thanks,

Sasha

[PATCH net-next] net: dsa: LAN9303: add i2c dependency

2017-04-24 Thread Tobias Regnery

The Kconfig symbol for the I2C mode of the LAN9303 driver selects
REGMAP_I2C. This symbol depends on I2C but the driver doesen't has a
dependency on I2C.

With CONFIG_I2C=n kconfig fails to select REGMAP_I2C and we get the
following warning:

warning: (NET_DSA_SMSC_LAN9303_I2C) selects REGMAP_I2C which has unmet direct 
dependencies (I2C)

This results in a build failure because the regmap-i2c functions aren't
available:

drivers/base/regmap/regmap-i2c.c: In function 'regmap_smbus_byte_reg_read':
drivers/base/regmap/regmap-i2c.c:29:8: error: implicit declaration of function 
'i2c_smbus_read_byte_data' [-Werror=implicit-function-declaration]

drivers/base/regmap/regmap-i2c.c: In function 'regmap_smbus_byte_reg_write':
drivers/base/regmap/regmap-i2c.c:47:9: error: implicit declaration of function 
'i2c_smbus_write_byte_data' [-Werror=implicit-function-declaration]
...

Fix this by adding a kconfig dependency on the I2C subsystem.

Fixes: be4e119f9914 ("net: dsa: LAN9303: add I2C managed mode support")
Signed-off-by: Tobias Regnery 
---
 drivers/net/dsa/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/dsa/Kconfig b/drivers/net/dsa/Kconfig
index 131a5b1cbfc8..862ee22303c2 100644
--- a/drivers/net/dsa/Kconfig
+++ b/drivers/net/dsa/Kconfig
@@ -59,7 +59,7 @@ config NET_DSA_SMSC_LAN9303
 
 config NET_DSA_SMSC_LAN9303_I2C
tristate "SMSC/Microchip LAN9303 3-ports 10/100 ethernet switch in I2C 
managed mode"
-   depends on NET_DSA
+   depends on NET_DSA && I2C
select NET_DSA_SMSC_LAN9303
select REGMAP_I2C
---help---
-- 
2.11.0

Re: [PATCH net] xfrm: fix stack access out of bounds with CONFIG_XFRM_SUB_POLICY

2017-04-24 Thread Herbert Xu

On Fri, Apr 21, 2017 at 02:05:31PM +0200, Sabrina Dubroca wrote:
> 
> You're right. I had a note about that but it got lost.  The bug
> (ignoring what flavor of flowi is passed) is older than that (from the
> introduction of subpolicies, I suspect, but I would have to dig more
> into the history), but this commit removed the last uses of
> origin/partner.
> 
> Looking into raw_sendmsg(), this code may have been safe before
> 9d6ec938019c ("ipv4: Use flowi4 in public route lookup interfaces."),
> since full flowi were used.
> 
> If we want a fix for kernels that don't have ca116922afa8, we would
> probably need to take the size of the flowi* struct depending on the
> address family passed to xfrm_resolve_and_create_bundle().
> 
> 
> I'm not sure how to proceed here, any advice?

I think the Fixes header should simply refer to the commit that
introduced the bug.  You're instead using it as a hint for how
to backport the patch.  That is beyond the scope of the Fixes
header.

So if the bug started with 9d6ec938019c then just use that in
your Fixes header.  If you want to help the backporter with more
information then you can put ca116922afa8 in the body of the commit
description.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: macvlan: Fix device ref leak when purging bc_queue

2017-04-24 Thread Herbert Xu

On Fri, Apr 21, 2017 at 02:40:50PM +, joe.gha...@dell.com wrote:
> That's not true. macvlan_dellink() unregisters the queue, and 
> macvlan_process_broadcast() will never get called. Please note that I'm not 
> speculating. I have traced enabled on the dev_put and dev_hold, and I'm 
> reporting a real, reproducible issue.

The only thing that can stop macvlan_process_broadcast from getting
called is macvlan_port_destroy.  Nothing else can stop the work
queue, unless of course the work queue mechanism itself is broken.

So if you're sure macvlan_port_destroy is never even called in
your case, then you'll need to start debugging the kernel work
queue mechanism to see why macvlan_process_broadcast is not getting
called.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH iproute2 net 3/8] tc/pedit: Introduce 'add' operation

2017-04-24 Thread Amir Vadai

On Sun, Apr 23, 2017 at 01:44:51PM -0400, Jamal Hadi Salim wrote:
> 
> Thanks for the excellent work.
> 
> On 17-04-23 08:53 AM, Amir Vadai wrote:
> > This command could be useful to increase/decrease fields value.
> > 
> 
> Does this contradict the "retain" feature? Example rule to
> retain the second nibble but set the first to 0xA
> (essentially it X-ORs):
> tc filter add dev lo parent : protocol ip prio 10 \
> u32 match ip dst 127.0.0.1/32 flowid 1:1 \
> action pedit munge offset 0 u8 set 0x0A retain 0xF0

You mean, for example increasing a single nible in an ipv4 address?
If so, then it should work:
# tc filter add dev veth0 parent :  u32 \
match ip dport 23 0x \
action pedit ex \
  munge ip src add 1.0.0.0 retain 0xff00

# tc -s filter show dev veth0 parent :
filter protocol all pref 49151 u32
filter protocol all pref 49151 u32 fh 801: ht divisor 1
filter protocol all pref 49151 u32 fh 801::800 order 2048 key ht 801 bkt 0 
terminal flowid ???  (rule hit 0 success 0)
  match 0017/ at 20 (success 0 )
action order 1:  pedit action pass keys 1
 index 48 ref 1 bind 1 installed 1 sec used 1 sec
 key #0  at ipv4+12: add 0100 mask 00ff
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

> 
> Mostly curious about hardware handling.
As to hardware handling, Or did the implementation so he will answer
better. AFAIK, current implementation doesn't allow partial field
setting/adding, so such a rule can't be offloaded. I think it is only to
make things simple and because there was no use case for that.
To my knowledge, hardware should support such thing if needed.

Amir

> 
> cheers,
> jamal
>

[PATCH net] xfrm: do the garbage collection after flushing policy

2017-04-24 Thread Xin Long

Now xfrm garbage collection can be triggered by 'ip xfrm policy del'.
These is no reason not to do it after flushing policies, especially
considering that 'garbage collection deferred' is only triggered
when it reaches gc_thresh.

It's no good that the policy is gone but the xdst still hold there.
The worse thing is that xdst->route/orig_dst is also hold and can
not be released even if the orig_dst is already expired.

This patch is to do the garbage collection if there is any policy
removed in xfrm_policy_flush.

Signed-off-by: Xin Long 
---
 net/xfrm/xfrm_policy.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 236cbbc..dfc77b9 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1006,6 +1006,10 @@ int xfrm_policy_flush(struct net *net, u8 type, bool 
task_valid)
err = -ESRCH;
 out:
spin_unlock_bh(>xfrm.xfrm_policy_lock);
+
+   if (cnt)
+   xfrm_garbage_collect(net);
+
return err;
 }
 EXPORT_SYMBOL(xfrm_policy_flush);
-- 
2.1.0

[PATCH net] bridge: shutdown bridge device before removing it

2017-04-24 Thread Xin Long

During removing a bridge device, if the bridge is still up, a new mdb entry
still can be added in br_multicast_add_group() after all mdb entries are
removed in br_multicast_dev_del(). Like the path:

  mld_ifc_timer_expire ->
mld_sendpack -> ...
  br_multicast_rcv ->
br_multicast_add_group

The new mp's timer will be set up. If the timer expires after the bridge
is freed, it may cause use-after-free panic in br_multicast_group_expired.
This can happen when ip link remove a bridge or destroy a netns with a
bridge device inside.

As we can see in br_del_bridge, brctl is also supposed to remove a bridge
device after it's shutdown.

This patch is to call dev_close at the beginning of br_dev_delete so that
netif_running check in br_multicast_add_group can avoid this issue. But
to keep consistent with before, it will not remove the IFF_UP check in
br_del_bridge for brctl.

Reported-by: Jianwen Ji 
Signed-off-by: Xin Long 
---
 net/bridge/br_if.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 56a2a72..8175f13 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -305,6 +305,8 @@ void br_dev_delete(struct net_device *dev, struct list_head 
*head)
struct net_bridge *br = netdev_priv(dev);
struct net_bridge_port *p, *n;
 
+   dev_close(br->dev);
+
list_for_each_entry_safe(p, n, >port_list, list) {
del_nbp(p);
}
-- 
2.1.0

Re: [PATCH] ipvs: explicitly forbid ipv6 service/dest creation if ipv6 mod is disabled

2017-04-24 Thread Julian Anastasov


Hello,

On Mon, 24 Apr 2017, Paolo Abeni wrote:

> Hi,
> 
> The problem with the patched code is that it tries to resolve ipv6
> addresses that are not created/validated by the kernel.

OK. Simon, please apply to ipvs tree.

Acked-by: Julian Anastasov 

Regards

--
Julian Anastasov

Re: [PATCH 2/2] net: team: fix memory leak in team_nl_send_options_get

2017-04-24 Thread Jiri Pirko

Mon, Apr 24, 2017 at 09:04:55AM CEST, bianpan2...@163.com wrote:
>In function team_nl_send_options_get(), pointer skb keeps the return
>value of function nlmsg_new(). When the call to genlmsg_put() fails, the
>control flow directly returns and does not free skb. This will result in
>a memory leak bug. This patch fixes it.
>
>Fixes: 8ea7fd0d8792 ("team: fix memory leak")

test1:~/net-next$ git log 8ea7fd0d8792
fatal: ambiguous argument '8ea7fd0d8792': unknown revision or path not in the 
working tree.
Use '--' to separate paths from revisions, like this:
'git  [...] -- [...]'

Please look up the commit that introduces the issue. Also no newline
in between "fixes" and "signed off".

Also. The subject should be:
[PATCH net 2/2] team: fix memory leak in team_nl_send_options_get

You can see this right away if you look in the mailing list archive...

>
>Signed-off-by: Pan Bian 
>---
> drivers/net/team/team.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
>index dd3a2e9..85c0124 100644
>--- a/drivers/net/team/team.c
>+++ b/drivers/net/team/team.c
>@@ -2361,8 +2361,10 @@ static int team_nl_send_options_get(struct team *team, 
>u32 portid, u32 seq,
> 
>   hdr = genlmsg_put(skb, portid, seq, _nl_family, flags | 
> NLM_F_MULTI,
> TEAM_CMD_OPTIONS_GET);
>-  if (!hdr)
>+  if (!hdr) {
>+  nlmsg_free(skb);
>   return -EMSGSIZE;
>+  }
> 
>   if (nla_put_u32(skb, TEAM_ATTR_TEAM_IFINDEX, team->dev->ifindex))
>   goto nla_put_failure;
>-- 
>1.9.1
>
>

Re: [PATCH net-next 2/2] cls_flower: add support for matching MPLS fields (v2)

2017-04-24 Thread Jiri Pirko

Sat, Apr 22, 2017 at 10:52:47PM CEST, benjamin.laha...@netronome.com wrote:
>Add support to the tc flower classifier to match based on fields in MPLS
>labels (TTL, Bottom of Stack, TC field, Label).
>
>Signed-off-by: Benjamin LaHaise 
>Signed-off-by: Benjamin LaHaise 
>Reviewed-by: Jakub Kicinski 

Acked-by: Jiri Pirko

Re: [PATCH net-next 1/2] flow_dissector: add mpls support (v2)

2017-04-24 Thread Jiri Pirko

Sat, Apr 22, 2017 at 10:52:46PM CEST, benjamin.laha...@netronome.com wrote:
>Add support for parsing MPLS flows to the flow dissector in preparation for
>adding MPLS match support to cls_flower.
>
>Signed-off-by: Benjamin LaHaise 
>Signed-off-by: Benjamin LaHaise 

This looks odd :)


>Reviewed-by: Jakub Kicinski 

Acked-by: Jiri Pirko

[PATCH 2/2] net: team: fix memory leak in team_nl_send_options_get

2017-04-24 Thread Pan Bian

In function team_nl_send_options_get(), pointer skb keeps the return
value of function nlmsg_new(). When the call to genlmsg_put() fails, the
control flow directly returns and does not free skb. This will result in
a memory leak bug. This patch fixes it.

Fixes: 8ea7fd0d8792 ("team: fix memory leak")

Signed-off-by: Pan Bian 
---
 drivers/net/team/team.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index dd3a2e9..85c0124 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -2361,8 +2361,10 @@ static int team_nl_send_options_get(struct team *team, 
u32 portid, u32 seq,
 
hdr = genlmsg_put(skb, portid, seq, _nl_family, flags | 
NLM_F_MULTI,
  TEAM_CMD_OPTIONS_GET);
-   if (!hdr)
+   if (!hdr) {
+   nlmsg_free(skb);
return -EMSGSIZE;
+   }
 
if (nla_put_u32(skb, TEAM_ATTR_TEAM_IFINDEX, team->dev->ifindex))
goto nla_put_failure;
-- 
1.9.1

[PATCH 1/2] net: team: fix memory leak in team_nl_send_port_list_get

2017-04-24 Thread Pan Bian

In function team_nl_send_port_list_get(), pointer skb keeps the return
value of nlmsg_new(). When the call to genlmsg_put() fails, the memory
is not freed. This will result in a memory leak bug. This patch fixes
it.

Fixes: fbd69cda90e7 ("team: fix memory leak")

Signed-off-by: Pan Bian 
---
 drivers/net/team/team.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index f8c81f1..dd3a2e9 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -2634,8 +2634,10 @@ static int team_nl_send_port_list_get(struct team *team, 
u32 portid, u32 seq,
 
hdr = genlmsg_put(skb, portid, seq, _nl_family, flags | 
NLM_F_MULTI,
  TEAM_CMD_PORT_LIST_GET);
-   if (!hdr)
+   if (!hdr) {
+   nlmsg_free(skb);
return -EMSGSIZE;
+   }
 
if (nla_put_u32(skb, TEAM_ATTR_TEAM_IFINDEX, team->dev->ifindex))
goto nla_put_failure;
-- 
1.9.1

Re: [PATCH] ipvs: explicitly forbid ipv6 service/dest creation if ipv6 mod is disabled

2017-04-24 Thread Paolo Abeni

Hi,

On Sat, 2017-04-22 at 14:16 +0300, Julian Anastasov wrote:
> On Thu, 20 Apr 2017, Paolo Abeni wrote:
> 
> > When creating a new ipvs service, ipv6 addresses are always accepted
> > if CONFIG_IP_VS_IPV6 is enabled. On dest creation the address family
> > is not explicitly checked.
> > 
> > This allows the user-space to configure ipvs services even if the
> > system is booted with ipv6.disable=1. On specific configuration, ipvs
> > can try to call ipv6 routing code at setup time, causing the kernel to
> > oops due to fib6_rules_ops being NULL.
> > 
> > This change addresses the issue adding a check for the ipv6
> > module being enabled while validating ipv6 service operations and
> > adding the same validation for dest operations.
> > 
> > According to git history, this issue is apparently present since
> > the introduction of ipv6 support, and the oops can be triggered
> > since commit 09571c7ae30865ad ("IPVS: Add function to determine
> > if IPv6 address is local")
> > 
> > Fixes: 09571c7ae30865ad ("IPVS: Add function to determine if IPv6 address 
> > is local")
> > Signed-off-by: Paolo Abeni 
> 
>   Looks good to me but I see two places that can benefit
> from such check:

I'm sorry for the lag, I was delayed by other notorious issues ;-)
I'm unable to trigger any crash with the patched kernel.

> - in ip_vs_genl_new_daemon() if we do not want to create IPv6 sockets
> for the sync protocol in make_send_sock() and make_receive_sock().
> Not sure if this can lead to crashes.

This one is, AFAICS, safe, because ip_vs_genl_new_daemon() calls
start_sync_thread(), which tries to create a socket of the specified
address family before doing any real action. Such operation will fail
gracefully, and overall we will get an - expected - error to userspace.

> - in ip_vs_proc_sync_conn() if we do not want backup server to accept 
> IPv6 conns because they may be created even when dests are missing.
> We may use retc = 10 there. Not fatal but may eat memory for
> conns that will not be used.

If I read the above correct correctly, even that should be safe, for
the same reason: ipv6 socket creation will fail if ipv6 is disabled.

The problem with the patched code is that it tries to resolve ipv6
addresses that are not created/validated by the kernel.

Cheers,

Paolo

Re: [PATCH v3 07/29] x86: bpf_jit, use ENTRY+ENDPROC

2017-04-24 Thread Jiri Slaby

On 04/21/2017, 09:32 PM, Alexei Starovoitov wrote:
> On Fri, Apr 21, 2017 at 04:12:43PM +0200, Jiri Slaby wrote:
>> Do not use a custom macro FUNC for starts of the global functions, use
>> ENTRY instead.
>>
>> And while at it, annotate also ends of the functions by ENDPROC.
>>
>> Signed-off-by: Jiri Slaby 
>> Cc: "David S. Miller" 
>> Cc: Alexey Kuznetsov 
>> Cc: James Morris 
>> Cc: Hideaki YOSHIFUJI 
>> Cc: Patrick McHardy 
>> Cc: Thomas Gleixner 
>> Cc: Ingo Molnar 
>> Cc: "H. Peter Anvin" 
>> Cc: x...@kernel.org
>> Cc: netdev@vger.kernel.org
>> ---
>>  arch/x86/net/bpf_jit.S | 32 ++--
>>  1 file changed, 18 insertions(+), 14 deletions(-)
>>
>> diff --git a/arch/x86/net/bpf_jit.S b/arch/x86/net/bpf_jit.S
>> index f2a7faf4706e..762c29fb8832 100644
>> --- a/arch/x86/net/bpf_jit.S
>> +++ b/arch/x86/net/bpf_jit.S
>> @@ -23,16 +23,12 @@
>>  32 /* space for rbx,r13,r14,r15 */ + \
>>  8 /* space for skb_copy_bits */)
>>  
>> -#define FUNC(name) \
>> -.globl name; \
>> -.type name, @function; \
>> -name:
>> -
>> -FUNC(sk_load_word)
>> +ENTRY(sk_load_word)
>>  test%esi,%esi
>>  js  bpf_slow_path_word_neg
>> +ENDPROC(sk_load_word)
> 
> this doens't look right.
> It will add alignment nops in critical paths of these pseudo functions.
> I'm also not sure whether it will still work afterwards.
> Was it tested?
> I'd prefer if this code kept as-is.

It cannot stay as-is simply because we want to know where the functions
end to inject debuginfo properly. The code above does not warrant for
any exception.

Executing a nop takes a little and having externally-callable functions
aligned can actually help performance (no, I haven't measured nor tested
the code). But sure, the tool is generic, so I can introduce a local
macros to avoid alignments in the functions:

#define BPF_FUNC_START_LOCAL(name) \
SYM_START(name, SYM_V_LOCAL, SYM_A_NONE)
#define BPF_FUNC_START(name) \
SYM_START(name, SYM_V_GLOBAL, SYM_A_NONE)

#define BPF_FUNC_END(name) SYM_FUNC_END(name)

thanks,
-- 
js
suse labs

Re: [ISSUE: sky2 - rx error] Link stops working under heavy traffic load connected to a mv88e6176

2017-04-24 Thread Rafa Corvillo


I resend the mail with the schema fixed. Sorry for the inconvenience.

We are working in an ARMv7 embedded system running kernel 4.9 (LEDE build).
It is an imx6 board with 2 ethernet interfaces. One of them is connected to
a Marvell switch.

The schema of the system is the following:

+---+ eth0
 |   +--+
 |   |  |
 | Embedded system   +--+
 |   |
 |  ARMv7|
 |   | Marvell 88E8057(sky2) +-+
 |   +--+ +--+ +--+ eth1
 |   |  +-+ | |  +--+
 |   +--+  CPU port   +--+ mv88e6176  +--+
 +--+--+-+ | |
emulated|  | | |
GPIO+--+ +--+ +--+ eth2
MDIO  +---+ | |  +--+
  MDIO +--+ +--+
+-+

There is a bridge (br-lan) which includes eth0/eth1/eth2

If I connect the eth1/eth2, the link is up and I can do ping through it. 
But, once
I start sending a heavy traffic load the link fails and the kernel sends 
the

following messages:

[   48.557140] sky2 :04:00.0 marvell: rx error, status 0x5f20010 
length 1518
[   48.564964] sky2 :04:00.0 marvell: rx error, status 0x5f20010 
length 1518
[   48.572110] sky2 :04:00.0 marvell: rx error, status 0x5f20010 
length 1518
[   48.579263] sky2 :04:00.0 marvell: rx error, status 0x5f20010 
length 1518
[   48.586417] sky2 :04:00.0 marvell: rx error, status 0x5f20010 
length 1518
[   48.593573] sky2 :04:00.0 marvell: rx error, status 0x5f20010 
length 1518
[   48.600718] sky2 :04:00.0 marvell: rx error, status 0x5f20010 
length 1518

[   54.877567] net_ratelimit: 6 callbacks suppressed
[   54.882293] sky2 :04:00.0 marvell: rx error, status 0x5f20010 
length 1518
[   61.413552] sky2 :04:00.0 marvell: rx error, status 0x5f20010 
length 1518


I have a modified device-tree of imx6 which includes the mdio and dsa 
nodes. This
device-tree works in a kernel 4.1.6, but I know that these parts of the 
kernel have
a lot of changes. The changes included for mdio and dsa in the 
device-tree are the
following (diff arch/arm/boot/dts/imx6qdl-gw53xx.dtsi 
arch/arm/boot/dts/imx6qdl-gw53xx-mdio.dtsi):


16a17,18
> can0 = 
> ethernet0 = 
21a24
> sky2 = 
24a28,29
> usdhc2 = 
> mdio-gpio0 = 
62a68,125
>   mdio0: mdio {
> compatible = "virtual,mdio-gpio";
> #address-cells = <1>;
> #size-cells = <0>;
> /* MDC = gpio-17, MDIO = gpio-20 */
> gpios =  < 17 1
>  20 0>;
> ethernet-phy@0  {
>   compatible = "marvell,dsa";
> };
>   };
>
>   dsa {
> compatible = "marvell,dsa";
> #address-cells = <2>;
> #size-cells = <0>;
>
> interrupts = <10>;
> dsa,ethernet = <>;
> dsa,mii-bus = <>;
>
> switch@0 {
>   #address-cells = <1>;
>   #size-cells = <0>;
>   reg = <0 0>;  /* MDIO address 0, switch 0 in tree */
>
>   port@0 {
> reg = <0>;
> label = "cpu";
>   };
>
>   port@1 {
> reg = <1>;
> label = "eth1";
>   };
>
>   port@2 {
> reg = <2>;
> label = "eth2";
>   };
>
>   port@3 {
> reg = <3>;
> label = "eth3";
>   };
>
>   port@4 {
> reg = <4>;
> label = "eth4";
>   };
> };
>   };
>
361a425,430
>  {
>   pinctrl-names = "default";
>   pinctrl-0 = <_mdio>;
>   status = "okay";
> };
>
363c432
<   imx6qdl-gw53xx {
---
>   imx6qdl-gw53xx-mdio {
448a518,524
>   >;
> };
>
> pinctrl_mdio: mdiogrp {
>   fsl,pins = <
> MX6QDL_PAD_SD1_DAT1__GPIO1_IO17   0x1b0b9
> MX6QDL_PAD_SD1_CLK__GPIO1_IO200x1b0b9

Do you know of any possible reason why this could be happening?

Thanks in advance.

Rafa

< 1 2 3

201 - 268 of 268 matches

Mail list logo