date:20180129

Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-01-29 Thread Michal Hocko

On Mon 29-01-18 23:35:22, Florian Westphal wrote:
> Kirill A. Shutemov  wrote:
[...]
> > I hate what I'm saying, but I guess we need some tunable here.
> > Not sure what exactly.
> 
> Would memcg help?

That really depends. I would have to check whether vmalloc path obeys
__GFP_ACCOUNT (I suspect it does except for page tables allocations but
that shouldn't be a big deal). But then the other potential problem is
the life time of the xt_table_info (or other potentially large) data
structures. Are they bound to any process life time. Because if they are
not then the OOM killer will not help. The OOM panic earlier in this
thread suggests it doesn't because the test case managed to eat all the
available memory and killed all the eligible tasks which didn't help.

So in some sense the memcg would help to stop the excessive allocation,
but it wouldn't resolve it other than kill all tasks in the affected
memcg/container. Whether this is sufficient or not, I dunno. It sounds
quite suboptimal to me. But it is true this would be less tricky then
adding a global knob...
-- 
Michal Hocko
SUSE Labs

Re: [PATCH net-next] ptr_ring: fix integer overflow

2018-01-29 Thread Jason Wang

On 2018年01月30日 01:01, David Miller wrote:

From: Jason Wang 
Date: Thu, 25 Jan 2018 15:31:42 +0800

We try to allocate one more entry for lockless peeking. The adding
operation may overflow which causes zero to be passed to kmalloc().
In this case, it returns ZERO_SIZE_PTR without any notice by ptr
ring. Try to do producing or consuming on such ring will lead NULL
dereference. Fix this detect and fail early.

Fixes: bcecb4bbf88a ("net: ptr_ring: otherwise safe empty checks can overrun array 
bounds")
Reported-by: syzbot+87678bcf753b44c39...@syzkaller.appspotmail.com
Cc: John Fastabend 
Signed-off-by: Jason Wang 

I'm dropping this because I am to understand that Michael Tsirkin's patch
series covers this issue.

Yes.

Let me know if I still need to apply this.

Thanks.

No need for this.

Thanks

RE: [RFC crypto v3 8/9] chtls: Register the ULP

2018-01-29 Thread Atul Gupta

@Dave Watson, Did you get chance to look at my response?

What I was referring is that passing "tls" ulp type in setsockopt may be 
insufficient to make the decision when multi HW assist Inline TLS solution 
exists.
Some HW may go beyond defining sendmsg/sendpage of the prot and require 
additional info to setup the env? Also, we need to keep vendor specific code 
out of tls_main.c i.e anything other than base/sw_tx prot perhaps go to hw 
driver.

Sabrina echoed similar concern early last week, can we discuss or have thoughts 
how to address this?

Thanks
Atul

-Original Message-
From: Atul Gupta 
Sent: Sunday, January 28, 2018 11:26 AM
To: 'Dave Watson' 
Cc: herb...@gondor.apana.org.au; linux-cry...@vger.kernel.org; 
ganes...@chelsio.co; netdev@vger.kernel.org; da...@davemloft.net; Boris 
Pismenny ; Ilya Lesokhin 
Subject: RE: [RFC crypto v3 8/9] chtls: Register the ULP

-Original Message-
From: Dave Watson [mailto:davejwat...@fb.com] 
Sent: Friday, January 26, 2018 2:39 AM
To: Atul Gupta 
Cc: herb...@gondor.apana.org.au; linux-cry...@vger.kernel.org; 
ganes...@chelsio.co; netdev@vger.kernel.org; da...@davemloft.net; Boris 
Pismenny ; Ilya Lesokhin 
Subject: Re: [RFC crypto v3 8/9] chtls: Register the ULP

<1513769897-26945-1-git-send-email-atul.gu...@chelsio.com>

On 12/20/17 05:08 PM, Atul Gupta wrote:
> +static void __init chtls_init_ulp_ops(void) {
> + chtls_base_prot = tcp_prot;
> + chtls_base_prot.hash= chtls_hash;
> + chtls_base_prot.unhash  = chtls_unhash;
> + chtls_base_prot.close   = chtls_lsk_close;
> +
> + chtls_cpl_prot  = chtls_base_prot;
> + chtls_init_rsk_ops(_cpl_prot, _rsk_ops,
> +_prot, PF_INET);
> + chtls_cpl_prot.close= chtls_close;
> + chtls_cpl_prot.disconnect   = chtls_disconnect;
> + chtls_cpl_prot.destroy  = chtls_destroy_sock;
> + chtls_cpl_prot.shutdown = chtls_shutdown;
> + chtls_cpl_prot.sendmsg  = chtls_sendmsg;
> + chtls_cpl_prot.recvmsg  = chtls_recvmsg;
> + chtls_cpl_prot.sendpage = chtls_sendpage;
> + chtls_cpl_prot.setsockopt   = chtls_setsockopt;
> + chtls_cpl_prot.getsockopt   = chtls_getsockopt;
> +}

Much of this file should go in tls_main.c, reusing as much as possible. For 
example it doesn't look like the get/set sockopts have changed at all for chtls.

Agree, should common code and anything other than TLS_BASE_TX/TLS_SW_TX prot 
should go in vendor specific file/driver. Since, prot require redefinition for 
hardware the code is kept in chtls_main.c

> +
> +static int __init chtls_register(void) {
> + chtls_init_ulp_ops();
> + register_listen_notifier(_notifier);
> + cxgb4_register_uld(CXGB4_ULD_TLS, _uld_info);
> + tcp_register_ulp(_chtls_ulp_ops);
> + return 0;
> +}
> +
> +static void __exit chtls_unregister(void) {
> + unregister_listen_notifier(_notifier);
> + tcp_unregister_ulp(_chtls_ulp_ops);
> + chtls_free_all_uld();
> + cxgb4_unregister_uld(CXGB4_ULD_TLS);
> +}

The idea with ULP is that there is one ULP hook per protocol, not per driver.  

One thought is that apps/lib calling setsockopt pass the required ulp type [tls 
or chtls or xtls], this enables any HW assist to define base_prot as required 
and keep common code [tls_main] independent of underlying HW. 
If we are to have single TLS ULP hook [good from user point] then need a way to 
determine which Inline tls hw is used? System with multiple Inline TLS capable 
hw and differing functionality would require checks in tls_main to exercise 
that specific functionality/callback?

bpf-next is closed too. Was: net-next is CLOSED...

2018-01-29 Thread Alexei Starovoitov

On Sun, Jan 28, 2018 at 09:20:25PM -0500, David Miller wrote:
> 
> With the release of 4.15 final, the net-next tree is closed.

bpf-next is closed too.
There are several bpf patches queued in patch work.
We will review and apply them to bpf tree when net-next gets pulled
into Linus's tree and merged back into net -> bpf.

Please continue finding and fixing bugs. Thanks!

[PATCH bpf] tools/bpf: permit selftests/bpf to be built in a different directory

2018-01-29 Thread Yonghong Song

Fix a couple of issues at tools/testing/selftests/bpf/Makefile so
the following command
   make -C tools/testing/selftests/bpf OUTPUT=/home/yhs/tmp
can put the built results into a different directory.

Also add the built binary test_tcpbpf_user in the .gitignore file.

Fixes: 6882804c916b ("selftests/bpf: add a test for overlapping packet range 
checks")
Fixes: 9d1f15941967 ("bpf: move cgroup_helpers from samples/bpf/ to 
tools/testing/selftesting/bpf/")
Signed-off-by: Yonghong Song 
---
 tools/testing/selftests/bpf/.gitignore | 1 +
 tools/testing/selftests/bpf/Makefile   | 4 ++--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/.gitignore 
b/tools/testing/selftests/bpf/.gitignore
index 1e09d77..cc15af2 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -8,5 +8,6 @@ fixdep
 test_align
 test_dev_cgroup
 test_progs
+test_tcpbpf_user
 test_verifier_log
 feature
diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index bf05bc5..566d6ad 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -27,7 +27,7 @@ TEST_PROGS := test_kmod.sh test_xdp_redirect.sh 
test_xdp_meta.sh \
 
 include ../lib.mk
 
-BPFOBJ := $(OUTPUT)/libbpf.a $(OUTPUT)/cgroup_helpers.c
+BPFOBJ := $(OUTPUT)/libbpf.a cgroup_helpers.c
 
 $(TEST_GEN_PROGS): $(BPFOBJ)
 
@@ -58,7 +58,7 @@ CLANG_FLAGS = -I. -I./include/uapi -I../../../include/uapi \
 $(OUTPUT)/test_l4lb_noinline.o: CLANG_FLAGS += -fno-inline
 $(OUTPUT)/test_xdp_noinline.o: CLANG_FLAGS += -fno-inline
 
-%.o: %.c
+$(OUTPUT)/%.o: %.c
$(CLANG) $(CLANG_FLAGS) \
 -O2 -target bpf -emit-llvm -c $< -o - |  \
$(LLC) -march=bpf -mcpu=$(CPU) -filetype=obj -o $@
-- 
2.9.5

Re: [RFC] iproute2: add json and color support

2018-01-29 Thread David Ahern

On 1/29/18 1:29 PM, Stephen Hemminger wrote:
> This patch makes ip route command have json and color output in
> a similar fashion of ip link and address.
> 
> The printing is split into functions and duplicate code
> removed.
> 
> Probably incomplete, and has formatting glitches.
> 
> ---
>  include/utils.h |   4 +
>  ip/iproute.c| 764 
> ++--
>  2 files changed, 469 insertions(+), 299 deletions(-)

good idea. 469 insert, 299 delete makes this really complicated.
Splitting this into smaller patches will be helpful.

Re: [PATCH v3 1/2] net: create skb_gso_validate_mac_len()

2018-01-29 Thread Daniel Axtens

Hi Eric,

>> skb_gso_transport_seglen(skb) is quite expensive (out of line)
>> 
>> It is unfortunate bnx2x seems to support 9600 MTU (
>> ETH_MAX_JUMBO_PACKET_SIZE ), because 100 bytes of headers can be too
>> small in some cases.
>> 
>> Presumably we could avoid calling the function for standard MTU <=
>> 9000

Yes, it is an expensive call. I will send another version where we only
call the expensive path in the jumbo frame case. I'll wait until
tomorrow in case there is any more feedback in the next 24 hours or so.

> Also are we sure about these bnx2x crashes being limited to TSO ?
>
> Maybe it will crash the same after GSO has segmented the packet and we
> provide a big (like 10,000 bytes) packet ? I do not believe upper stack
> will prevent this.

We did test with TSO off and that was sufficient to prevent the crash.

You are right that the upper stack sends down the 10k packet, so I don't
know why it prevents the crash. But we did definitely test it! I don't
have access to the firmware details but I imagine it's an issue with the
offloaded segmentation.

Thanks for your patience with the many versions of this patch set.

Regards,
Daniel

RE: macvlan devices and vlan interaction

2018-01-29 Thread Yuan, Linyu (NSB - CN/Shanghai)

https://www.spinics.net/lists/netdev/msg476083.html

I also have a macvlan device question, but get no answer.

But my original thought is in __netif_receive_skb_core() we should check packet 
destination mac address,
if it match macvlan device, change packet as receive from macvlan device, not 
lower device, then packet go to upper layer.

But I don't know how to process broadcast mac address. Do macvlan device can 
receive broadcast packet ?

> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of Keller, Jacob E
> Sent: Tuesday, January 30, 2018 7:02 AM
> To: netdev@vger.kernel.org
> Cc: Duyck, Alexander H
> Subject: macvlan devices and vlan interaction
> 
> Hi,
> 
> I'm currently investigating how macvlan devices behave in regards to vlan
> support, and found some interesting behavior that I am not sure how best to
> correct, or what the right path forward is.
> 
> If I create a macvlan device:
> 
> ip link add link ens0 name macvlan0 type macvlan:
> 
> and then add a VLAN to it:
> 
> ip link add link macvlan0 name vlan10 type vlan id 10
> 
> This works to pass VLAN 10 traffic over the macvlan device. This seems like
> expected behavior.
> 
> However, if I then also add vlan 10 to the lowerdev:
> 
> ip link add link ens0 name lowervlan10  type vlan id 10
> 
> Then traffic stops flowing to the VLAN on the macvlan device.
> 
> This happens, as far as I can tell, because of how the VLAN traffic is 
> filtered
> first, and then forwarded to the VLAN device, which doesn't know about how
> the macvlan device exists.
> 
> It seems, essentially, that vlan stacked on top of a macvlan shouldn't work.
> Because the vlan code basically expects each vlan to apply to every MAC
> address, and the macvlan device works by putting its MAC address into the
> unicast address list, there's no way for a device driver to know when or how 
> to
> apply the vlan.
> 
> This gets a bit more confusing when we add in the l2 fwd hardware offload.
> 
> Currently, at least for the Intel network parts, this isn't supported, 
> because of a
> bug in which the device drivers don't apply the VLANs to the macvlan
> accelerated addresses. If we fix this, at least for fm10k, the behavior is 
> slightly
> better, because of how the hardware filtering at the MAC address happens
> first, and we direct the traffic to the proper device regardless of VLAN.
> 
> In addition to this peculiarity of VLANs on both the macvlan and lowerdev, is
> that when a macvlan device adds a VLAN, the lowerdev gets an indication to
> add the vlan via its .ndo_vlan_rx_add_vid(), which doesn't distinguish between
> which addresses the VLAN might apply to. It thus simply, depending on
> hardware design, enables the VLAN for all its unicast and multicast addresses.
> Some hardware could theoretically support MAC+VLAN pairs, where it could
> distinguish that a VLAN should only be added for some subset of addresses.
> Other hardware might not be so lucky..
> 
> Unfortunately, this has the weird consequence that if we have the following
> stack of devices:
> 
> vlan10@macvlan0
> macvlan0@ens0
> ens0
> 
> Then ens0 will receive VLAN10 traffic on every address. So VLAN 10 traffic
> destined to the MAC of the lowerdev will be received, instead of dropped.
> 
> If we add VLAN 10 to the lowerdev so we have both the above stack and also
> 
> lowervlan10@ens0
> ens0 (mac gg:hh:ii:jj:kk)
> 
> then all vlan 10 traffic will be received on the lowerdev VLAN 10, without any
> being forwarded to the VLAN10 attached to the macvlan.
> 
> However, if we add two macvlans, and each add the vlan10, so we have the
> following:
> 
> avlan10@macvlan0
> macvlan0@ens0
> ens0
> 
> bvlan10@macvlan1
> macvlan1@ens0
> ens0
> 
> In this case, it does appear that traffic is sorted out correctly. It seems 
> that
> only if the lowerdev gets the VLAN does it end up breaking. If I remove 
> bvlan10
> from macvlan1, the traffic associated with vlan10 is still received by 
> macvlan1,
> even though in principle it should no longer be.
> 
> What is the correct behavior here? Should this just be "administrators should
> know better"? I don't think that's a great argument, and either way we're 
> still
> essentially leaking VLANs across the macvlan interfaces, which I don't think 
> is
> ideal.
> 
> I see two possible solutions:
> 
> 1) modify macvlan driver so that it is marked as VLAN_CHALLENGED, and thus
> indicate it cannot handle VLAN traffic on top of it.
>   a. In order to get the VLANs associated, administrator could instead add the
> VLAN first, and then add the macvlan on top. This I think is a better
> configuration.
>   b. that doesn't work in the offload case, unless/until we fix the VLAN
> interface to forward the l2_dfwd_add_station() along with a vid.
>   c. this could appear as loss of functionality, since in some cases these 
> VLAN
> on top of macvlan work today (with the interesting caveats listed above).
> 
> 2) modify how

Re: [PATCH v3 1/2] net: create skb_gso_validate_mac_len()

2018-01-29 Thread Eric Dumazet

On Mon, 2018-01-29 at 17:46 -0800, Eric Dumazet wrote:
> On Tue, 2018-01-30 at 12:14 +1100, Daniel Axtens wrote:
> > If you take a GSO skb, and split it into packets, will the MAC
> > length (L2 + L3 + L4 headers + payload) of those packets be small
> > enough to fit within a given length?
> > 
> > Move skb_gso_mac_seglen() to skbuff.h with other related functions
> > like skb_gso_network_seglen() so we can use it, and then create
> > skb_gso_validate_mac_len to do the full calculation.
> > 
> > Signed-off-by: Daniel Axtens 
> > ---
> >  include/linux/skbuff.h | 16 +
> >  net/core/skbuff.c  | 65 
> > +++---
> >  net/sched/sch_tbf.c| 10 
> >  3 files changed, 67 insertions(+), 24 deletions(-)
> > 
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index b8e0da6c27d6..242d6773c7c2 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -3287,6 +3287,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff 
> > *skb, int shiftlen);
> >  void skb_scrub_packet(struct sk_buff *skb, bool xnet);
> >  unsigned int skb_gso_transport_seglen(const struct sk_buff *skb);
> >  bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu);
> > +bool skb_gso_validate_mac_len(const struct sk_buff *skb, unsigned int len);
> >  struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t 
> > features);
> >  struct sk_buff *skb_vlan_untag(struct sk_buff *skb);
> >  int skb_ensure_writable(struct sk_buff *skb, int write_len);
> > @@ -4120,6 +4121,21 @@ static inline unsigned int 
> > skb_gso_network_seglen(const struct sk_buff *skb)
> > return hdr_len + skb_gso_transport_seglen(skb);
> >  }
> >  
> > +/**
> > + * skb_gso_mac_seglen - Return length of individual segments of a gso 
> > packet
> > + *
> > + * @skb: GSO skb
> > + *
> > + * skb_gso_mac_seglen is used to determine the real size of the
> > + * individual segments, including MAC/L2, Layer3 (IP, IPv6) and L4
> > + * headers (TCP/UDP).
> > + */
> > +static inline unsigned int skb_gso_mac_seglen(const struct sk_buff *skb)
> > +{
> > +   unsigned int hdr_len = skb_transport_header(skb) - skb_mac_header(skb);
> > +   return hdr_len + skb_gso_transport_seglen(skb);
> > +}
> 
> skb_gso_transport_seglen(skb) is quite expensive (out of line)
> 
> It is unfortunate bnx2x seems to support 9600 MTU (
> ETH_MAX_JUMBO_PACKET_SIZE ), because 100 bytes of headers can be too
> small in some cases.
> 
> Presumably we could avoid calling the function for standard MTU <= 9000

Also are we sure about these bnx2x crashes being limited to TSO ?

Maybe it will crash the same after GSO has segmented the packet and we
provide a big (like 10,000 bytes) packet ? I do not believe upper stack
will prevent this.

Re: [PATCH v3 1/2] net: create skb_gso_validate_mac_len()

2018-01-29 Thread Eric Dumazet

On Tue, 2018-01-30 at 12:14 +1100, Daniel Axtens wrote:
> If you take a GSO skb, and split it into packets, will the MAC
> length (L2 + L3 + L4 headers + payload) of those packets be small
> enough to fit within a given length?
> 
> Move skb_gso_mac_seglen() to skbuff.h with other related functions
> like skb_gso_network_seglen() so we can use it, and then create
> skb_gso_validate_mac_len to do the full calculation.
> 
> Signed-off-by: Daniel Axtens 
> ---
>  include/linux/skbuff.h | 16 +
>  net/core/skbuff.c  | 65 
> +++---
>  net/sched/sch_tbf.c| 10 
>  3 files changed, 67 insertions(+), 24 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index b8e0da6c27d6..242d6773c7c2 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -3287,6 +3287,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, 
> int shiftlen);
>  void skb_scrub_packet(struct sk_buff *skb, bool xnet);
>  unsigned int skb_gso_transport_seglen(const struct sk_buff *skb);
>  bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu);
> +bool skb_gso_validate_mac_len(const struct sk_buff *skb, unsigned int len);
>  struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features);
>  struct sk_buff *skb_vlan_untag(struct sk_buff *skb);
>  int skb_ensure_writable(struct sk_buff *skb, int write_len);
> @@ -4120,6 +4121,21 @@ static inline unsigned int 
> skb_gso_network_seglen(const struct sk_buff *skb)
>   return hdr_len + skb_gso_transport_seglen(skb);
>  }
>  
> +/**
> + * skb_gso_mac_seglen - Return length of individual segments of a gso packet
> + *
> + * @skb: GSO skb
> + *
> + * skb_gso_mac_seglen is used to determine the real size of the
> + * individual segments, including MAC/L2, Layer3 (IP, IPv6) and L4
> + * headers (TCP/UDP).
> + */
> +static inline unsigned int skb_gso_mac_seglen(const struct sk_buff *skb)
> +{
> + unsigned int hdr_len = skb_transport_header(skb) - skb_mac_header(skb);
> + return hdr_len + skb_gso_transport_seglen(skb);
> +}

skb_gso_transport_seglen(skb) is quite expensive (out of line)

It is unfortunate bnx2x seems to support 9600 MTU (
ETH_MAX_JUMBO_PACKET_SIZE ), because 100 bytes of headers can be too
small in some cases.

Presumably we could avoid calling the function for standard MTU <= 9000

[PATCH v3 1/2] net: create skb_gso_validate_mac_len()

2018-01-29 Thread Daniel Axtens

If you take a GSO skb, and split it into packets, will the MAC
length (L2 + L3 + L4 headers + payload) of those packets be small
enough to fit within a given length?

Move skb_gso_mac_seglen() to skbuff.h with other related functions
like skb_gso_network_seglen() so we can use it, and then create
skb_gso_validate_mac_len to do the full calculation.

Signed-off-by: Daniel Axtens 
---
 include/linux/skbuff.h | 16 +
 net/core/skbuff.c  | 65 +++---
 net/sched/sch_tbf.c| 10 
 3 files changed, 67 insertions(+), 24 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b8e0da6c27d6..242d6773c7c2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3287,6 +3287,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, 
int shiftlen);
 void skb_scrub_packet(struct sk_buff *skb, bool xnet);
 unsigned int skb_gso_transport_seglen(const struct sk_buff *skb);
 bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu);
+bool skb_gso_validate_mac_len(const struct sk_buff *skb, unsigned int len);
 struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features);
 struct sk_buff *skb_vlan_untag(struct sk_buff *skb);
 int skb_ensure_writable(struct sk_buff *skb, int write_len);
@@ -4120,6 +4121,21 @@ static inline unsigned int skb_gso_network_seglen(const 
struct sk_buff *skb)
return hdr_len + skb_gso_transport_seglen(skb);
 }
 
+/**
+ * skb_gso_mac_seglen - Return length of individual segments of a gso packet
+ *
+ * @skb: GSO skb
+ *
+ * skb_gso_mac_seglen is used to determine the real size of the
+ * individual segments, including MAC/L2, Layer3 (IP, IPv6) and L4
+ * headers (TCP/UDP).
+ */
+static inline unsigned int skb_gso_mac_seglen(const struct sk_buff *skb)
+{
+   unsigned int hdr_len = skb_transport_header(skb) - skb_mac_header(skb);
+   return hdr_len + skb_gso_transport_seglen(skb);
+}
+
 /* Local Checksum Offload.
  * Compute outer checksum based on the assumption that the
  * inner checksum will be offloaded later.
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 01e8285aea73..55d84ab7d093 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4914,36 +4914,73 @@ unsigned int skb_gso_transport_seglen(const struct 
sk_buff *skb)
 EXPORT_SYMBOL_GPL(skb_gso_transport_seglen);
 
 /**
- * skb_gso_validate_mtu - Return in case such skb fits a given MTU
+ * skb_gso_size_check - check the skb size, considering GSO_BY_FRAGS
  *
- * @skb: GSO skb
- * @mtu: MTU to validate against
+ * There are a couple of instances where we have a GSO skb, and we
+ * want to determine what size it would be after it is segmented.
  *
- * skb_gso_validate_mtu validates if a given skb will fit a wanted MTU
- * once split.
+ * We might want to check:
+ * -L3+L4+payload size (e.g. IP forwarding)
+ * - L2+L3+L4+payload size (e.g. sanity check before passing to driver)
+ *
+ * This is a helper to do that correctly considering GSO_BY_FRAGS.
+ *
+ * @seg_len: The segmented length (from skb_gso_*_seglen). In the
+ *   GSO_BY_FRAGS case this will be [header sizes + GSO_BY_FRAGS].
+ *
+ * @max_len: The maximum permissible length.
+ *
+ * Returns true if the segmented length <= max length.
  */
-bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu)
-{
+static inline bool skb_gso_size_check(const struct sk_buff *skb,
+ unsigned int seg_len,
+ unsigned int max_len) {
const struct skb_shared_info *shinfo = skb_shinfo(skb);
const struct sk_buff *iter;
-   unsigned int hlen;
-
-   hlen = skb_gso_network_seglen(skb);
 
if (shinfo->gso_size != GSO_BY_FRAGS)
-   return hlen <= mtu;
+   return seg_len <= max_len;
 
/* Undo this so we can re-use header sizes */
-   hlen -= GSO_BY_FRAGS;
+   seg_len -= GSO_BY_FRAGS;
 
skb_walk_frags(skb, iter) {
-   if (hlen + skb_headlen(iter) > mtu)
+   if (seg_len + skb_headlen(iter) > max_len)
return false;
}
 
return true;
 }
-EXPORT_SYMBOL_GPL(skb_gso_validate_mtu);
+
+/**
+ * skb_gso_validate_mtu - Return in case such skb fits a given MTU
+ *
+ * @skb: GSO skb
+ * @mtu: MTU to validate against
+ *
+ * skb_gso_validate_mtu validates if a given skb will fit a wanted MTU
+ * once split.
+ */
+bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu)
+{
+   return skb_gso_size_check(skb, skb_gso_network_seglen(skb), mtu);
+}
+EXPORT_SYMBOL_GPL(skb_gso_validate_network_len);
+
+/**
+ * skb_gso_validate_mac_len - Will a split GSO skb fit in a given length?
+ *
+ * @skb: GSO skb
+ * @len: length to validate against
+ *
+ * skb_gso_validate_mac_len validates if a given skb will fit a wanted
+ * length once split, including L2, L3 and L4 headers and the payload.
+ */
+bool

[PATCH v3 0/2] bnx2x: disable GSO on too-large packets

2018-01-29 Thread Daniel Axtens

We observed a case where a packet received on an ibmveth device had a
GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x
device, where it caused a firmware assert. This is described in detail
at [0].

Ultimately we want a fix in the core, but that is very tricky to
backport. So for now, just stop the bnx2x driver from crashing.

When net-next re-opens I will send the fix to the core and a revert
for this.

Marcelo: I have left renaming skb_gso_validate_mtu() for the
next series.

Thanks,
Daniel
   
[0]: https://patchwork.ozlabs.org/patch/859410/

Cc: manish.cho...@cavium.com
Cc: Jason Wang 
Cc: Pravin Shelar 
Cc: Marcelo Ricardo Leitner 

Daniel Axtens (2):
  net: create skb_gso_validate_mac_len()
  bnx2x: disable GSO where gso_size is too big for hardware

 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c |  9 
 include/linux/skbuff.h   | 16 ++
 net/core/skbuff.c| 65 +++-
 net/sched/sch_tbf.c  | 10 
 4 files changed, 76 insertions(+), 24 deletions(-)

-- 
2.14.1

[PATCH v3 2/2] bnx2x: disable GSO where gso_size is too big for hardware

2018-01-29 Thread Daniel Axtens

If a bnx2x card is passed a GSO packet with a gso_size larger than
~9700 bytes, it will cause a firmware error that will bring the card
down:

bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f0)]MC assert!
bnx2x: [bnx2x_mc_assert:720(enP24p1s0f0)]XSTORM_ASSERT_LIST_INDEX 0x2
bnx2x: [bnx2x_mc_assert:736(enP24p1s0f0)]XSTORM_ASSERT_INDEX 0x0 = 0x 
0x25e43e47 0x00463e01 0x00010052
bnx2x: [bnx2x_mc_assert:750(enP24p1s0f0)]Chip Revision: everest3, FW Version: 
7_13_1
... (dump of values continues) ...

Detect when the mac length of a GSO packet is greater than the maximum
packet size (9700 bytes) and disable GSO.

Signed-off-by: Daniel Axtens 
---

v3: use skb_gso_validate_mac_len to get the actual size, to allow
jumbo frames to work.

I don't have local access to a bnx2x card and sending this off to the
user who does would add about a month of round-trip time, so this
particular spin is build-tested only. (Earlier spins were tested on
real hardware.)

---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 7b08323e3f3d..fbb08e360625 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -12934,6 +12934,15 @@ static netdev_features_t bnx2x_features_check(struct 
sk_buff *skb,
  struct net_device *dev,
  netdev_features_t features)
 {
+   /*
+* A skb with gso_size + header length > 9700 will cause a
+* firmware panic. Drop GSO support.
+*
+* Eventually the upper layer should not pass these packets down.
+*/
+   if (unlikely(skb_is_gso(skb) && !skb_gso_validate_mac_len(skb, 9700)))
+   features &= ~NETIF_F_GSO_MASK;
+
features = vlan_features_check(skb, features);
return vxlan_features_check(skb, features);
 }
-- 
2.14.1

after adding > 200vlans to mlx nic no traffic

2018-01-29 Thread Paweł Staszewski

Weird thing with mellanox mlx5 (connectx-4) kernel 4.15-rc9 - from 
net-next davem tree




after:

ip link add link enp175s0f1 name vlan1538 type vlan id 1538

ip link set up dev vlan1538


traffic on vlan is working


But after

VID="1160 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 
1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 
1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 
1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 150
0 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 
1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 
1529 1530 1531 1532 1534 1535 1394 1393 1550 1500 1526 1536 1537 1538 
1539 1540 1542 1541 1543 1544 1801 1546 1547 1548 1
549 1735 3132 3143 3104 3125 3103 3115 3134 3105 3113 3141 4009 3144 
3130 1803 3146 3148 3109 1551 1552 1553 1554 1555 1556 1558 1559 1560 
1561 1562 1563 1564 1565 1567 1568 1569 1570 1571 1572 1573 1574 1575 
1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586
 1587 1588 1589 1591 1592 1593 1594 1595 1596 1597 1598 1599 1557 1545 
2001 250 4043 1806 1600 1602 1603 1604 1605 1606 1607 1608 1609 1610 
1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1625 1626 1627 
1628 1629 1630 1631 1632 1634 1635 1636 1640 1641 164
2 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 
1657 1658 1659 1660 1661 1662 1663 1664 1665 1601 1666 1667 1668 1669 
1670 1671 1672 1673 1674 1676 1677 1678 1680 1681 1682 1683 1684 1685 
1686 1687 1688 1689 1690 1691 1692 1693 1694 1696 1
697 1698 1712 1817 1869 1810 1814 1818 1855 1856 1857 1858 1859 1860 
1861 1862 1863 1864 1865 1866 1867 1868 1870 1871 1872 1873 1874 1875 
1876 1877 1878 1879 1880 1885 1890 1891 1892 1893 1894 1895 1898 1881 
2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2541

 2542 2543 2544 2545 2546 2547 2548 2549 2550 2290"
for i in $VID
do
    ip link add link enp175s0f1 name vlan$i type vlan id $i
done


And setting vlan 1538 up - there is no received traffic on this vlan.



So searching for broken things (last time same problem was with ixgbe)

ethtool -K enp175s0f1 rx-vlan-filter off


And all vlans attached to this device start working

Re: [Patch net-next v3 0/3] net_sched: reflect tx_queue_len change for pfifo_fast

2018-01-29 Thread Cong Wang

On Mon, Jan 29, 2018 at 9:43 AM, David Miller  wrote:
>
> Please follow up with John about making the queue allocation use
> a prepare + commit phase.

Will do it once net-next is re-open.


>
> And as for the TX queue state handling on change, I think it's
> fine to purge the whole queue.  That is definitely better than
> allowing the queue length to be visibly over the tx_queue_len
> setting.
>

OK. Thanks.

Re: [PATCH net-next v2 00/12] net: sched: propagate extack to cls offloads on destroy and only with skip_sw

2018-01-29 Thread Jakub Kicinski

On Sun, 28 Jan 2018 20:39:02 -0200, Marcelo Ricardo Leitner wrote:
> On Thu, Jan 25, 2018 at 02:57:17PM -0800, Jakub Kicinski wrote:
> > On Thu, 25 Jan 2018 13:11:57 -0200, Marcelo Ricardo Leitner wrote:  
> > > On Wed, Jan 24, 2018 at 12:54:12PM -0800, Jakub Kicinski wrote:  
> > > > Hi!
> > > > 
> > > > This series some of Jiri's comments and the fact that today drivers
> > > > may produce extack even if there is no skip_sw flag (meaning the
> > > > driver failure is not really a problem), and warning messages will
> > > > only confuse the users.
> > > 
> > > It's a fair point, but I think this is not the best solution. How will
> > > the user then know why it failed to install in hw? Will have to
> > > install a new rule, just with skip_sw, and hope that it fails with the
> > > same reason?
> > >
> > > Maybe it's better to let tc/ovs/etc only exhibit this information
> > > under a certain log/debug level.  
> > 
> > What you say does make sense in case of classifiers which are basically
> > HW offload vehicles.  But for cls_bpf, which people are actually using
> > heavily as a software solution, I don't want any warning to be produced
> > just because someone happened to run the command on a Netronome
> > card :(  Say someone swaps an old NIC for a NFP, and runs their old
> > cls_bpf scripts and suddenly there are warnings they don't care about
> > and have no way of silencing.  
> 
> (Sorry for the delay on replying, btw. I'm still traveling.)
> 
> Makes sense. I agree that at least it shouldn't be displayed in a way
> that may lead the user to think it was a big/fatal error.
> 
> > I do think skip_sw will fail for the same reason, unless someone adds
> > extacks for IO or memory allocation problems or other transient things.  
> 
> I don't really follow this one. Fail you mean, fail to report the
> actual reason? If so, ok, but that's something easily fixable I think,
> especially because with skip_sw, if such an error happens, it's fatal
> for the operation so the error reporting is consistent.

I was referring to your question: "Will have to install a new rule,
just with skip_sw, and hope that it fails with the same reason?"
So yes, currently the only way to get the extack would be to retry the
command with skip_sw.

> > add a "verbose offload" flag to the API or filter the bad extacks at
> > the user space level.  Although, again, my preference would be not to
> > filter at the user space level, because user space can't differentiate
> > between a probably-doesn't-matter-but-HW-offload-failed warning or legit
> > something-is-not-right-in-the-software-but-command-succeeded warning :S  
> 
> But userspace is the original requester. It should know what the rule
> is intended for and how to act upon it. For ovs, for example, it could
> just log a message and move on, while tc could report "hey, ok, but
> please note that the rule couldn't be offloaded".
> 
> > So if there is a major use for non-forced offload failure warnings I
> > would lean towards a new flag.  
> 
> I'm thinking about this, still undecided. In the end maybe a counter
> somewhere could be enough and such reporting is not needed. Thinking..

Hm, special counters for failure reasons or just for all failures to
offload?  FWIW user space can dump the filters and if the filter is
offloaded there will be an in_hw flag Or added a few releases back.  
Let us know what your thoughts are :)

Re: [PATCH net] ibmvnic: Wait for device response when changing MAC

2018-01-29 Thread David Miller

From: Thomas Falcon 
Date: Mon, 29 Jan 2018 13:45:05 -0600

> Wait for a response from the VNIC server before exiting after setting
> the MAC address. The resolves an issue with bonding a VNIC client in
> ALB or TLB modes. The bonding driver was changing the MAC address more
> rapidly than the device could respond, causing the following errors.
> 
> "bond0: the hw address of slave eth2 is in use by the bond;
> couldn't find a slave with a free hw address to give it
> (this should not have happened)"
> 
> If the function waits until the change is finalized, these errors are
> avoided.
> 
> Signed-off-by: Thomas Falcon 

Applied, thanks Thomas.

macvlan devices and vlan interaction

2018-01-29 Thread Keller, Jacob E

Hi,

I'm currently investigating how macvlan devices behave in regards to vlan 
support, and found some interesting behavior that I am not sure how best to 
correct, or what the right path forward is.

If I create a macvlan device:

ip link add link ens0 name macvlan0 type macvlan:

and then add a VLAN to it:

ip link add link macvlan0 name vlan10 type vlan id 10

This works to pass VLAN 10 traffic over the macvlan device. This seems like 
expected behavior.

However, if I then also add vlan 10 to the lowerdev:

ip link add link ens0 name lowervlan10  type vlan id 10

Then traffic stops flowing to the VLAN on the macvlan device.

This happens, as far as I can tell, because of how the VLAN traffic is filtered 
first, and then forwarded to the VLAN device, which doesn't know about how the 
macvlan device exists.

It seems, essentially, that vlan stacked on top of a macvlan shouldn't work. 
Because the vlan code basically expects each vlan to apply to every MAC 
address, and the macvlan device works by putting its MAC address into the 
unicast address list, there's no way for a device driver to know when or how to 
apply the vlan.

This gets a bit more confusing when we add in the l2 fwd hardware offload.

Currently, at least for the Intel network parts, this isn't supported, because 
of a bug in which the device drivers don't apply the VLANs to the macvlan 
accelerated addresses. If we fix this, at least for fm10k, the behavior is 
slightly better, because of how the hardware filtering at the MAC address 
happens first, and we direct the traffic to the proper device regardless of 
VLAN.

In addition to this peculiarity of VLANs on both the macvlan and lowerdev, is 
that when a macvlan device adds a VLAN, the lowerdev gets an indication to add 
the vlan via its .ndo_vlan_rx_add_vid(), which doesn't distinguish between 
which addresses the VLAN might apply to. It thus simply, depending on hardware 
design, enables the VLAN for all its unicast and multicast addresses. Some 
hardware could theoretically support MAC+VLAN pairs, where it could distinguish 
that a VLAN should only be added for some subset of addresses. Other hardware 
might not be so lucky..

Unfortunately, this has the weird consequence that if we have the following 
stack of devices:

vlan10@macvlan0
macvlan0@ens0
ens0

Then ens0 will receive VLAN10 traffic on every address. So VLAN 10 traffic 
destined to the MAC of the lowerdev will be received, instead of dropped.

If we add VLAN 10 to the lowerdev so we have both the above stack and also

lowervlan10@ens0
ens0 (mac gg:hh:ii:jj:kk)

then all vlan 10 traffic will be received on the lowerdev VLAN 10, without any 
being forwarded to the VLAN10 attached to the macvlan.

However, if we add two macvlans, and each add the vlan10, so we have the 
following:

avlan10@macvlan0
macvlan0@ens0
ens0

bvlan10@macvlan1
macvlan1@ens0
ens0

In this case, it does appear that traffic is sorted out correctly. It seems 
that only if the lowerdev gets the VLAN does it end up breaking. If I remove 
bvlan10 from macvlan1, the traffic associated with vlan10 is still received by 
macvlan1, even though in principle it should no longer be.

What is the correct behavior here? Should this just be "administrators should 
know better"? I don't think that's a great argument, and either way we're still 
essentially leaking VLANs across the macvlan interfaces, which I don't think is 
ideal.

I see two possible solutions:

1) modify macvlan driver so that it is marked as VLAN_CHALLENGED, and thus 
indicate it cannot handle VLAN traffic on top of it.
  a. In order to get the VLANs associated, administrator could instead add the 
VLAN first, and then add the macvlan on top. This I think is a better 
configuration.
  b. that doesn't work in the offload case, unless/until we fix the VLAN 
interface to forward the l2_dfwd_add_station() along with a vid.
  c. this could appear as loss of functionality, since in some cases these VLAN 
on top of macvlan work today (with the interesting caveats listed above).

2) modify how VLANs interact with MAC addresses, so that the lowerdev can 
explicitly be aware of which VLANs are tied to which address groups, in order 
to allow for the explicit configuration of which MAC+VLAN pairs are actually 
allowed.
  a. this is a much more invasive change to driver interface, and more 
difficult to get right
  b. possibly other configurations of stacked devices might have a similar 
problem, so we could solve more here? Or create more problems.. I'm not really 
certain.


I think the correct solution is (1) but I wasn't sure what others thought, and 
whether anyone else has encountered the problems I mention and outline above. I 
cc'd Alex who I discussed with offline when I first heard of and began 
investigating this, in case he has anything further to add.

Regards,
Jake

Re: [PATCH] qlcnic: fix deadlock bug

2018-01-29 Thread David Miller

From: Junxiao Bi 
Date: Mon, 29 Jan 2018 17:53:42 +0800

> The following soft lockup was caught. This is a deadlock caused by
> recusive locking.
> 
> Process kworker/u40:1:28016 was holding spin lock "mbx->queue_lock" in
> qlcnic_83xx_mailbox_worker(), while a softirq came in and ask the same spin
> lock in qlcnic_83xx_enqueue_mbx_cmd(). This lock should be hold by disable
> bh..
 ...
> Signed-off-by: Junxiao Bi 

Looks good, applied and queued up for -stable.

Re: [PATCH] tcp: release sk_frag.page in tcp_disconnect

2018-01-29 Thread David Miller

From: Li RongQing 
Date: Fri, 26 Jan 2018 16:40:41 +0800

> socket can be disconnected and gets transformed back to a listening
> socket, if sk_frag.page is not released, which will be cloned into
> a new socket by sk_clone_lock, but the reference count of this page
> is increased, lead to a use after free or double free issue
> 
> Signed-off-by: Li RongQing 
> Cc: Eric Dumazet 

Applied and queued up for -stable, thank you.

Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-01-29 Thread Florian Westphal

Kirill A. Shutemov  wrote:
> On Mon, Jan 29, 2018 at 05:57:22PM +0100, Florian Westphal wrote:
> > Kirill A. Shutemov  wrote:
> > > On Mon, Jan 29, 2018 at 08:23:57AM +0100, Florian Westphal wrote:
> > > > > vmalloc() once became killable by commit 5d17a73a2ebeb8d1 ("vmalloc: 
> > > > > back
> > > > > off when the current task is killed") but then became unkillable by 
> > > > > commit
> > > > > b8c8a338f75e052d ("Revert "vmalloc: back off when the current task is
> > > > > killed""). Therefore, we can't handle this problem from MM side.
> > > > > Please consider adding some limit from networking side.
> > > > 
> > > > I don't know what "some limit" would be.  I would prefer if there was
> > > > a way to supress OOM Killer in first place so we can just -ENOMEM user.
> > > 
> > > Just supressing OOM kill is a bad idea. We still leave a way to allocate
> > > arbitrary large buffer in kernel.
> > 
> > Isn't that what we do everywhere in network stack?
> > 
> > I think we should try to allocate whatever amount of memory is needed
> > for the given xtables ruleset, given that is what admin requested us to do.
> 
> Is it correct that "admin" in this case is root in random container?

Yes.

> I mean, can we get access to it with CLONE_NEWUSER|CLONE_NEWNET?

Yes.

> This can be fun.

Do we prevent "admin in random container" to insert 2m ipv6 routes
(alternatively: ipsec tunnels, interfaces etc etc)?

> > I also would not know what limit is sane -- I've seen setups with as much
> > as 100k iptables rules, and that was 5 years ago.
> > 
> > And even if we add a "Xk rules" limit, it might be too much for
> > low-memory systems, or not enough for whatever other use case there
> > might be.
> 
> I hate what I'm saying, but I guess we need some tunable here.
> Not sure what exactly.

Would memcg help?

(I don't buy the "run untrusted binaries on linux is safe" thing, so
 I would not know).

Re: [PATCH v3 bpf] bpf: introduce BPF_JIT_ALWAYS_ON config

2018-01-29 Thread Greg KH

On Mon, Jan 29, 2018 at 12:40:47AM +0100, Daniel Borkmann wrote:
> On 01/28/2018 03:45 PM, Greg KH wrote:
> > On Wed, Jan 24, 2018 at 11:10:50AM +0100, Daniel Borkmann wrote:
> >> On 01/24/2018 11:07 AM, David Woodhouse wrote:
> >>> On Tue, 2018-01-09 at 22:39 +0100, Daniel Borkmann wrote:
>  On 01/09/2018 07:04 PM, Alexei Starovoitov wrote:
> >
> > The BPF interpreter has been used as part of the spectre 2 attack 
> > CVE-2017-5715.
> >
> > A quote from goolge project zero blog:
> > "At this point, it would normally be necessary to locate gadgets in
> > the host kernel code that can be used to actually leak data by reading
> > from an attacker-controlled location, shifting and masking the result
> > appropriately and then using the result of that as offset to an
> > attacker-controlled address for a load. But piecing gadgets together
> > and figuring out which ones work in a speculation context seems 
> > annoying.
> > So instead, we decided to use the eBPF interpreter, which is built into
> > the host kernel - while there is no legitimate way to invoke it from 
> > inside
> > a VM, the presence of the code in the host kernel's text section is 
> > sufficient
> > to make it usable for the attack, just like with ordinary ROP gadgets."
> >
> > To make attacker job harder introduce BPF_JIT_ALWAYS_ON config
> > option that removes interpreter from the kernel in favor of JIT-only 
> > mode.
> > So far eBPF JIT is supported by:
> > x64, arm64, arm32, sparc64, s390, powerpc64, mips64
> >
> > The start of JITed program is randomized and code page is marked as 
> > read-only.
> > In addition "constant blinding" can be turned on with 
> > net.core.bpf_jit_harden
> >
> > v2->v3:
> > - move __bpf_prog_ret0 under ifdef (Daniel)
> >
> > v1->v2:
> > - fix init order, test_bpf and cBPF (Daniel's feedback)
> > - fix offloaded bpf (Jakub's feedback)
> > - add 'return 0' dummy in case something can invoke prog->bpf_func
> > - retarget bpf tree. For bpf-next the patch would need one extra hunk.
> >   It will be sent when the trees are merged back to net-next
> >
> > Considered doing:
> >   int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
> > but it seems better to land the patch as-is and in bpf-next remove
> > bpf_jit_enable global variable from all JITs, consolidate in one place
> > and remove this jit_init() function.
> >
> > Signed-off-by: Alexei Starovoitov 
> 
>  Applied to bpf tree, thanks Alexei!
> >>>
> >>> For stable too?
> >>
> >> Yes, this will go into stable as well; batch of backports will come 
> >> Thurs/Fri.
> > 
> > Any word on these?  Worse case, a simple list of git commit ids to
> > backport is all I need.
> 
> Sorry for the delay! There are various conflicts all over the place, so I had
> to backport manually. I just flushed out tested 4.14 batch, I'll see to get 
> 4.9
> out hopefully tonight as well, and the rest for 4.4 on Mon.

Not a problem at all, wanted to make sure I didn't miss them having be
posted somewhere I missed :)

If you need/want help for the 4.4 stuff, just let me know, and I'll be
glad to work on it.

thanks,

greg k-h

Re: [virtio-dev] Re: [RFC PATCH net-next v2 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-01-29 Thread Alexander Duyck

On Mon, Jan 29, 2018 at 10:24 AM, Michael S. Tsirkin  wrote:
> On Sun, Jan 28, 2018 at 08:26:53PM -0800, Alexander Duyck wrote:
>> >> > For live migration with advanced usecases that Siwei is suggesting, i
>> >> > think we need a new driver with a new device type that can track the
>> >> > VF specific feature settings even when the VF driver is unloaded.
>> >
>> > I see no added value of the 3 netdev model, there is no need for a bond
>> > device.
>>
>> I agree a full-blown bond isn't what is needed. However, just forking
>> traffic out from virtio to a VF doesn't really solve things either.
>>
>> One of the issues as I see it is the fact that the qdisc model with
>> the merged device gets to be pretty ugly. There is the fact that every
>> packet that goes to the VF has to pass through the qdisc code twice.
>> There is the dual nature of the 2 netdev solution that also introduces
>> the potential for head-of-line blocking since the virtio could put
>> back pressure on the upper qdisc layer which could stall the VF
>> traffic when switching over. I hope we could avoid issues like that by
>> maintaining qdiscs per device queue instead of on an upper device that
>> is half software interface and half not. Ideally the virtio-bond
>> device could operate without a qdisc and without needing any
>> additional locks so there shouldn't be head of line blocking occurring
>> between the two interfaces and overhead could be kept minimal.
>>
>> Also in the case of virtio there is support for in-driver XDP. As
>> Sridhar stated, when using the 2 netdev model "we cannot support XDP
>> in this model and it needs to be disabled". That sounds like a step
>> backwards instead of forwards. I would much rather leave the XDP
>> enabled at the lower dev level, and then if we want we can use the
>> generic XDP at the virtio-bond level to capture traffic on both
>> interfaces at the same time.
>
> I agree dropping XDP makes everything very iffy.
>
>> In the case of netvsc you have control of both sides of a given link
>> so you can match up the RSS tables, queue configuration and everything
>> is somewhat symmetric since you are running the PF and all the HyperV
>> subchannels. Most of the complexity is pushed down into the host and
>> your subchannel management is synchronized there if I am not mistaken.
>> We don't have this in the case of this virtio-bond setup. Instead a
>> single bit is set indicating "backup" without indicating what that
>> means to topology other than the fact that this virtio interface is
>> the backup for some other interface. We are essentially blind other
>> than having the link status for the VF and virtio and knowing that the
>> virtio is intended to be the backup.
>
> Would you be interested in posting at least a proof of concept
> patch for this approach? It's OK if there are some TODO stubs.
> It would be much easier to compare code to code rather than
> a high level description to code.

That is the general idea. I was hoping to go the bonding route last
week but I got too snarled up trying to untangle the features we
didn't need. I have some code I am working on but don't have an ETA
for when it will be done.

At this point I am hoping we can build something based on Sridhar's
original patch that can addresses the items I brought up and shifts to
more of a 3 netdev model. If I am not mistaken we might be able to do
it as a separate driver that has a pair of calls that allow for
adding/removing a virt-bond that is provided with a MAC address and a
device pointer so that we can do the bits necessary to get ourselves
swapped with the original virtio device and identify the virtio as the
backup channel.

Thanks.

- Alex

[RFC] iproute2: add json and color support

2018-01-29 Thread Stephen Hemminger

This patch makes ip route command have json and color output in
a similar fashion of ip link and address.

The printing is split into functions and duplicate code
removed.

Probably incomplete, and has formatting glitches.

---
 include/utils.h |   4 +
 ip/iproute.c| 764 ++--
 2 files changed, 469 insertions(+), 299 deletions(-)

diff --git a/include/utils.h b/include/utils.h
index 0394268e1276..e35ea32c1d3b 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -155,6 +155,10 @@ int af_byte_len(int af);
 
 const char *format_host_r(int af, int len, const void *addr,
   char *buf, int buflen);
+#define format_host_rta_r(af, rta, buf, buflen)\
+   format_host_r(af, RTA_PAYLOAD(rta), RTA_DATA(rta), \
+ buf, buflen)
+
 const char *format_host(int af, int lne, const void *addr);
 #define format_host_rta(af, rta) \
format_host(af, RTA_PAYLOAD(rta), RTA_DATA(rta))
diff --git a/ip/iproute.c b/ip/iproute.c
index bf886fda9d76..b07d1ac7b7c9 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -340,12 +340,346 @@ static void print_rtax_features(FILE *fp, unsigned int 
features)
unsigned int of = features;
 
if (features & RTAX_FEATURE_ECN) {
-   fprintf(fp, "ecn ");
+   print_null(PRINT_ANY, "ecn", "ecn ", NULL);
features &= ~RTAX_FEATURE_ECN;
}
 
if (features)
-   fprintf(fp, "0x%x ", of);
+   print_0xhex(PRINT_ANY,
+   "features", "0x%x ", of);
+}
+
+static void print_rt_flags(FILE *fp, unsigned int flags)
+{
+   open_json_array(PRINT_JSON,
+   is_json_context() ?  "flags" : "");
+
+   if (flags & RTNH_F_DEAD)
+   print_null(PRINT_ANY, "dead", "dead ", NULL);
+   if (flags & RTNH_F_ONLINK)
+   print_null(PRINT_ANY, "onlink", "onlink ", NULL);
+   if (flags & RTNH_F_PERVASIVE)
+   print_null(PRINT_ANY, "pervasive", "pervasive ", NULL);
+   if (flags & RTNH_F_OFFLOAD)
+   print_null(PRINT_ANY, "offload", "offload ", NULL);
+   if (flags & RTM_F_NOTIFY)
+   print_null(PRINT_ANY, "notify", "notify ", NULL);
+   if (flags & RTNH_F_LINKDOWN)
+   print_null(PRINT_ANY, "linkdown", "linkdown ", NULL);
+   if (flags & RTNH_F_UNRESOLVED)
+   print_null(PRINT_ANY, "unresolved", "unresolved ", NULL);
+
+   close_json_array(PRINT_JSON, NULL);
+}
+
+static void print_rt_metrics(FILE *fp, struct rtattr *rtm)
+{
+   struct rtattr *mxrta[RTAX_MAX+1];
+   unsigned int mxlock = 0;
+   int i;
+
+   open_json_array(PRINT_JSON, "metrics");
+
+   parse_rtattr(mxrta, RTAX_MAX, RTA_DATA(rtm),
+RTA_PAYLOAD(rtm));
+
+   if (mxrta[RTAX_LOCK])
+   mxlock = rta_getattr_u32(mxrta[RTAX_LOCK]);
+
+   for (i = 2; i <= RTAX_MAX; i++) {
+   __u32 val = 0U;
+
+   if (mxrta[i] == NULL && !(mxlock & (1 << i)))
+   continue;
+
+   if (mxrta[i] != NULL && i != RTAX_CC_ALGO)
+   val = rta_getattr_u32(mxrta[i]);
+
+   if (i == RTAX_HOPLIMIT && (int)val == -1)
+   continue;
+
+   if (!is_json_context()) {
+   if (i < sizeof(mx_names)/sizeof(char *) && mx_names[i])
+   fprintf(fp, "%s ", mx_names[i]);
+   else
+   fprintf(fp, "metric %d ", i);
+
+   if (mxlock & (1<= 1000)
+   fprintf(fp, "%gs ", val/1e3);
+   else
+   fprintf(fp, "%ums ", val);
+   }
+   break;
+   case RTAX_CC_ALGO:
+   print_string(PRINT_ANY, "congestion",
+"%s ", rta_getattr_str(mxrta[i]));
+   break;
+   }
+   }
+
+

[bpf-next PATCH v3 3/3] bpf: sockmap, fix leaking maps with attached but not detached progs

2018-01-29 Thread John Fastabend

When a program is attached to a map we increment the program refcnt
to ensure that the program is not removed while it is potentially
being referenced from sockmap side. However, if this same program
also references the map (this is a reasonably common pattern in
my programs) then the verifier will also increment the maps refcnt
from the verifier. This is to ensure the map doesn't get garbage
collected while the program has a reference to it.

So we are left in a state where the map holds the refcnt on the
program stopping it from being removed and releasing the map refcnt.
And vice versa the program holds a refcnt on the map stopping it
from releasing the refcnt on the prog.

All this is fine as long as users detach the program while the
map fd is still around. But, if the user omits this detach command
we are left with a dangling map we can no longer release.

To resolve this when the map fd is released decrement the program
references and remove any reference from the map to the program.
This fixes the issue with possibly dangling map and creates a
user side API constraint. That is, the map fd must be held open
for programs to be attached to a map.

Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
Signed-off-by: John Fastabend 
---
 kernel/bpf/sockmap.c |   19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 6c6c274..c8b25f1 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -608,11 +608,6 @@ static void sock_map_free(struct bpf_map *map)
}
rcu_read_unlock();
 
-   if (stab->bpf_verdict)
-   bpf_prog_put(stab->bpf_verdict);
-   if (stab->bpf_parse)
-   bpf_prog_put(stab->bpf_parse);
-
sock_map_remove_complete(stab);
 }
 
@@ -888,6 +883,19 @@ static int sock_map_update_elem(struct bpf_map *map,
return err;
 }
 
+static void sock_map_release(struct bpf_map *map, struct file *map_file)
+{
+   struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
+   struct bpf_prog *orig;
+
+   orig = xchg(>bpf_parse, NULL);
+   if (orig)
+   bpf_prog_put(orig);
+   orig = xchg(>bpf_verdict, NULL);
+   if (orig)
+   bpf_prog_put(orig);
+}
+
 const struct bpf_map_ops sock_map_ops = {
.map_alloc = sock_map_alloc,
.map_free = sock_map_free,
@@ -895,6 +903,7 @@ static int sock_map_update_elem(struct bpf_map *map,
.map_get_next_key = sock_map_get_next_key,
.map_update_elem = sock_map_update_elem,
.map_delete_elem = sock_map_delete_elem,
+   .map_release = sock_map_release,
 };
 
 BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, bpf_sock,

[bpf-next PATCH v3 2/3] bpf: sockmap, add sock close() hook to remove socks

2018-01-29 Thread John Fastabend

The selftests test_maps program was leaving dangling BPF sockmap
programs around because not all psock elements were removed from
the map. The elements in turn hold a reference on the BPF program
they are attached to causing BPF programs to stay open even after
test_maps has completed.

The original intent was that sk_state_change() would be called
when TCP socks went through TCP_CLOSE state. However, because
socks may be in SOCK_DEAD state or the sock may be a listening
socket the event is not always triggered.

To resolve this use the ULP infrastructure and register our own
proto close() handler. This fixes the above case.

Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
Reported-by: Prashant Bhole 
Signed-off-by: John Fastabend 
---
 include/net/tcp.h|2 +
 kernel/bpf/sockmap.c |  156 +-
 2 files changed, 91 insertions(+), 67 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index ba10ca7..e082101 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1985,6 +1985,7 @@ static inline void tcp_listendrop(const struct sock *sk)
 
 enum {
TCP_ULP_TLS,
+   TCP_ULP_BPF,
 };
 
 struct tcp_ulp_ops {
@@ -2003,6 +2004,7 @@ struct tcp_ulp_ops {
 int tcp_register_ulp(struct tcp_ulp_ops *type);
 void tcp_unregister_ulp(struct tcp_ulp_ops *type);
 int tcp_set_ulp(struct sock *sk, const char *name);
+int tcp_set_ulp_id(struct sock *sk, const int ulp);
 void tcp_get_available_ulp(char *buf, size_t len);
 void tcp_cleanup_ulp(struct sock *sk);
 
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 0314d17..6c6c274 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -86,9 +86,10 @@ struct smap_psock {
struct work_struct tx_work;
struct work_struct gc_work;
 
+   struct proto *sk_proto;
+   void (*save_close)(struct sock *sk, long timeout);
void (*save_data_ready)(struct sock *sk);
void (*save_write_space)(struct sock *sk);
-   void (*save_state_change)(struct sock *sk);
 };
 
 static inline struct smap_psock *smap_psock_sk(const struct sock *sk)
@@ -96,12 +97,90 @@ static inline struct smap_psock *smap_psock_sk(const struct 
sock *sk)
return rcu_dereference_sk_user_data(sk);
 }
 
+struct proto tcp_bpf_proto;
+static int bpf_tcp_init(struct sock *sk)
+{
+   struct smap_psock *psock;
+
+   psock = smap_psock_sk(sk);
+   if (unlikely(!psock))
+   return -EINVAL;
+
+   if (unlikely(psock->sk_proto))
+   return -EBUSY;
+
+   psock->save_close = sk->sk_prot->close;
+   psock->sk_proto = sk->sk_prot;
+   sk->sk_prot = _bpf_proto;
+   return 0;
+}
+
+static void bpf_tcp_release(struct sock *sk)
+{
+   struct smap_psock *psock = smap_psock_sk(sk);
+
+   if (likely(psock)) {
+   sk->sk_prot = psock->sk_proto;
+   psock->sk_proto = NULL;
+   }
+}
+
+static void smap_release_sock(struct smap_psock *psock, struct sock *sock);
+
+void bpf_tcp_close(struct sock *sk, long timeout)
+{
+   void (*close_fun)(struct sock *sk, long timeout);
+   struct smap_psock_map_entry *e, *tmp;
+   struct smap_psock *psock;
+   struct sock *osk;
+
+   rcu_read_lock();
+   psock = smap_psock_sk(sk);
+   if (unlikely(!psock))
+   return sk->sk_prot->close(sk, timeout);
+
+   /* The psock may be destroyed anytime after exiting the RCU critial
+* section so by the time we use close_fun the psock may no longer
+* be valid. However, bpf_tcp_close is called with the sock lock
+* held so the close hook and sk are still valid.
+*/
+   close_fun = psock->save_close;
+
+   write_lock_bh(>sk_callback_lock);
+   list_for_each_entry_safe(e, tmp, >maps, list) {
+   osk = cmpxchg(e->entry, sk, NULL);
+   if (osk == sk) {
+   list_del(>list);
+   smap_release_sock(psock, sk);
+   }
+   }
+   write_unlock_bh(>sk_callback_lock);
+   rcu_read_unlock();
+   close_fun(sk, timeout);
+}
+
 enum __sk_action {
__SK_DROP = 0,
__SK_PASS,
__SK_REDIRECT,
 };
 
+static struct tcp_ulp_ops bpf_tcp_ulp_ops __read_mostly = {
+   .name   = "bpf_tcp",
+   .uid= TCP_ULP_BPF,
+   .user_visible   = false,
+   .owner  = NULL,
+   .init   = bpf_tcp_init,
+   .release= bpf_tcp_release,
+};
+
+static int bpf_tcp_ulp_register(void)
+{
+   tcp_bpf_proto = tcp_prot;
+   tcp_bpf_proto.close = bpf_tcp_close;
+   return tcp_register_ulp(_tcp_ulp_ops);
+}
+
 static int smap_verdict_func(struct smap_psock *psock, struct sk_buff *skb)
 {
struct bpf_prog *prog = READ_ONCE(psock->bpf_verdict);
@@ -166,68 +245,6 @@ static void smap_report_sk_error(struct smap_psock *psock, 
int err)

[bpf-next PATCH v3 1/3] net: add a UID to use for ULP socket assignment

2018-01-29 Thread John Fastabend

Create a UID field and enum that can be used to assign ULPs to
sockets. This saves a set of string comparisons if the ULP id
is known.

For sockmap, which is added in the next patches, a ULP is used to
hook into TCP sockets close state. In this case the ULP being added
is done at map insert time and the ULP is known and done on the kernel
side. In this case the named lookup is not needed. Because we don't
want to expose psock internals to user space socket options a user
visible flag is also added. For TLS this is set for BPF it will be
cleared.

Alos remove pr_notice, user gets an error code back and should check
that rather than rely on logs.

Signed-off-by: John Fastabend 
---
 include/net/tcp.h  |6 +
 net/ipv4/tcp_ulp.c |   57 +++-
 net/tls/tls_main.c |2 ++
 3 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 093e967..ba10ca7 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1983,6 +1983,10 @@ static inline void tcp_listendrop(const struct sock *sk)
 #define TCP_ULP_MAX128
 #define TCP_ULP_BUF_MAX(TCP_ULP_NAME_MAX*TCP_ULP_MAX)
 
+enum {
+   TCP_ULP_TLS,
+};
+
 struct tcp_ulp_ops {
struct list_headlist;
 
@@ -1991,7 +1995,9 @@ struct tcp_ulp_ops {
/* cleanup ulp */
void (*release)(struct sock *sk);
 
+   int uid;
charname[TCP_ULP_NAME_MAX];
+   booluser_visible;
struct module   *owner;
 };
 int tcp_register_ulp(struct tcp_ulp_ops *type);
diff --git a/net/ipv4/tcp_ulp.c b/net/ipv4/tcp_ulp.c
index 6bb9e14..6d87e7a 100644
--- a/net/ipv4/tcp_ulp.c
+++ b/net/ipv4/tcp_ulp.c
@@ -29,6 +29,18 @@ static struct tcp_ulp_ops *tcp_ulp_find(const char *name)
return NULL;
 }
 
+static struct tcp_ulp_ops *tcp_ulp_find_id(const int ulp)
+{
+   struct tcp_ulp_ops *e;
+
+   list_for_each_entry_rcu(e, _ulp_list, list) {
+   if (e->uid == ulp)
+   return e;
+   }
+
+   return NULL;
+}
+
 static const struct tcp_ulp_ops *__tcp_ulp_find_autoload(const char *name)
 {
const struct tcp_ulp_ops *ulp = NULL;
@@ -51,6 +63,18 @@ static const struct tcp_ulp_ops 
*__tcp_ulp_find_autoload(const char *name)
return ulp;
 }
 
+static const struct tcp_ulp_ops *__tcp_ulp_lookup(const int uid)
+{
+   const struct tcp_ulp_ops *ulp = NULL;
+
+   rcu_read_lock();
+   ulp = tcp_ulp_find_id(uid);
+   if (!ulp || !try_module_get(ulp->owner))
+   ulp = NULL;
+   rcu_read_unlock();
+   return ulp;
+}
+
 /* Attach new upper layer protocol to the list
  * of available protocols.
  */
@@ -59,13 +83,10 @@ int tcp_register_ulp(struct tcp_ulp_ops *ulp)
int ret = 0;
 
spin_lock(_ulp_list_lock);
-   if (tcp_ulp_find(ulp->name)) {
-   pr_notice("%s already registered or non-unique name\n",
- ulp->name);
+   if (tcp_ulp_find(ulp->name))
ret = -EEXIST;
-   } else {
+   else
list_add_tail_rcu(>list, _ulp_list);
-   }
spin_unlock(_ulp_list_lock);
 
return ret;
@@ -124,6 +145,32 @@ int tcp_set_ulp(struct sock *sk, const char *name)
if (!ulp_ops)
return -ENOENT;
 
+   if (!ulp_ops->user_visible)
+   return -ENOENT;
+
+   err = ulp_ops->init(sk);
+   if (err) {
+   module_put(ulp_ops->owner);
+   return err;
+   }
+
+   icsk->icsk_ulp_ops = ulp_ops;
+   return 0;
+}
+
+int tcp_set_ulp_id(struct sock *sk, int ulp)
+{
+   struct inet_connection_sock *icsk = inet_csk(sk);
+   const struct tcp_ulp_ops *ulp_ops;
+   int err;
+
+   if (icsk->icsk_ulp_ops)
+   return -EEXIST;
+
+   ulp_ops = __tcp_ulp_lookup(ulp);
+   if (!ulp_ops)
+   return -ENOENT;
+
err = ulp_ops->init(sk);
if (err) {
module_put(ulp_ops->owner);
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 736719c..b0d5fce 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -484,6 +484,8 @@ static int tls_init(struct sock *sk)
 
 static struct tcp_ulp_ops tcp_tls_ulp_ops __read_mostly = {
.name   = "tls",
+   .uid= TCP_ULP_TLS,
+   .user_visible   = true,
.owner  = THIS_MODULE,
.init   = tls_init,
 };

[bpf-next PATCH v3 0/3] bpf: sockmap fixes

2018-01-29 Thread John Fastabend

A set of fixes for sockmap to resolve programs referencing sockmaps
and closing without deleting all entries in the map and/or not detaching
BPF programs attached to the map. Both leaving entries in the map and
not detaching programs may result in the map failing to be removed by
BPF infrastructure due to reference counts never reaching zero.

For this we pull in the ULP infrastructure to hook into the close()
hook of the sock layer. This seemed natural because we have additional
sockmap features (to add support for TX hooks) that will also use the
ULP infrastructure. This allows us to cleanup entries in the map when
socks are closed() and avoid trying to get the sk_state_change() hook
to fire in all cases.

The second issue resolved here occurs when users don't detach
programs. The gist is a refcnt issue resolved by implementing the
release callback. See patch for details.

For testing I ran both sample/sockmap and selftests bpf/test_maps.c.
Dave Watson ran TLS test suite on v1 version of the patches without
the put_module error path change.

v3 wrap psock reference in RCU
v2 changes rebased onto bpf-next with the following small updates

Patch 1/3 Dave Watson pointed out that we should do a module put
in case future users use this with another ULP other than sockmap.
And module_put is a nop when owner = NULL (sockmap case) so no
harm here and code is more robust.

Patch 2/3 Be explicit and set user_visible to false just to avoid
any reader confusion.

---

John Fastabend (3):
  net: add a UID to use for ULP socket assignment
  bpf: sockmap, add sock close() hook to remove socks
  bpf: sockmap, fix leaking maps with attached but not detached progs


 include/net/tcp.h|8 ++
 kernel/bpf/sockmap.c |  175 +-
 net/ipv4/tcp_ulp.c   |   57 +++-
 net/tls/tls_main.c   |2 +
 4 files changed, 165 insertions(+), 77 deletions(-)

--
Signature

Re: [PATCH v3 bpf] bpf: introduce BPF_JIT_ALWAYS_ON config

2018-01-29 Thread Daniel Borkmann

On 01/29/2018 06:36 PM, Greg KH wrote:
> On Mon, Jan 29, 2018 at 04:36:35PM +0100, Daniel Borkmann wrote:
>> On 01/29/2018 12:40 AM, Daniel Borkmann wrote:
>>> On 01/28/2018 03:45 PM, Greg KH wrote:
 On Wed, Jan 24, 2018 at 11:10:50AM +0100, Daniel Borkmann wrote:
> On 01/24/2018 11:07 AM, David Woodhouse wrote:
>> On Tue, 2018-01-09 at 22:39 +0100, Daniel Borkmann wrote:
>>> On 01/09/2018 07:04 PM, Alexei Starovoitov wrote:
>> [...]
>>> Applied to bpf tree, thanks Alexei!
>>
>> For stable too?
>
> Yes, this will go into stable as well; batch of backports will come 
> Thurs/Fri.

 Any word on these?  Worse case, a simple list of git commit ids to
 backport is all I need.
>>>
>>> Sorry for the delay! There are various conflicts all over the place, so I 
>>> had
>>> to backport manually. I just flushed out tested 4.14 batch, I'll see to get 
>>> 4.9
>>> out hopefully tonight as well, and the rest for 4.4 on Mon.
>>
>> While 4.14 and 4.9 BPF backports are tested and out since yesterday, and I
>> saw Greg queued them up (thanks!), it looks like plain 4.4.113 doesn't even
>> boot on my machine. While I can shortly see the kernel log, my screen turns
>> black shortly thereafter and nothing reacts anymore. No such problems with
>> 4.9 and 4.14 stables seen. (using x86_64, i7-6600U) Is this a known issue?
> 
> Not that I know of, sorry.  Odd graphics issue perhaps?
> 
> If you have some test programs I can run, I can look into doing the
> backports, I still have a laptop around here that runs 4.4 :)
> 
> There's always a virtual machine as well, have you tried that?

I've switched to an arm64 node now, that's working fine with 4.4, so patches
will come later tonight.

Thanks,
Daniel

Re: [PATCH v3 bpf] bpf: introduce BPF_JIT_ALWAYS_ON config

2018-01-29 Thread Greg KH

On Mon, Jan 29, 2018 at 04:36:35PM +0100, Daniel Borkmann wrote:
> On 01/29/2018 12:40 AM, Daniel Borkmann wrote:
> > On 01/28/2018 03:45 PM, Greg KH wrote:
> >> On Wed, Jan 24, 2018 at 11:10:50AM +0100, Daniel Borkmann wrote:
> >>> On 01/24/2018 11:07 AM, David Woodhouse wrote:
>  On Tue, 2018-01-09 at 22:39 +0100, Daniel Borkmann wrote:
> > On 01/09/2018 07:04 PM, Alexei Starovoitov wrote:
> [...]
> > Applied to bpf tree, thanks Alexei!
> 
>  For stable too?
> >>>
> >>> Yes, this will go into stable as well; batch of backports will come 
> >>> Thurs/Fri.
> >>
> >> Any word on these?  Worse case, a simple list of git commit ids to
> >> backport is all I need.
> > 
> > Sorry for the delay! There are various conflicts all over the place, so I 
> > had
> > to backport manually. I just flushed out tested 4.14 batch, I'll see to get 
> > 4.9
> > out hopefully tonight as well, and the rest for 4.4 on Mon.
> 
> While 4.14 and 4.9 BPF backports are tested and out since yesterday, and I
> saw Greg queued them up (thanks!), it looks like plain 4.4.113 doesn't even
> boot on my machine. While I can shortly see the kernel log, my screen turns
> black shortly thereafter and nothing reacts anymore. No such problems with
> 4.9 and 4.14 stables seen. (using x86_64, i7-6600U) Is this a known issue?

Not that I know of, sorry.  Odd graphics issue perhaps?

If you have some test programs I can run, I can look into doing the
backports, I still have a laptop around here that runs 4.4 :)

There's always a virtual machine as well, have you tried that?

thanks,

greg k-h

[PATCH net] ibmvnic: Wait for device response when changing MAC

2018-01-29 Thread Thomas Falcon

Wait for a response from the VNIC server before exiting after setting
the MAC address. The resolves an issue with bonding a VNIC client in
ALB or TLB modes. The bonding driver was changing the MAC address more
rapidly than the device could respond, causing the following errors.

"bond0: the hw address of slave eth2 is in use by the bond;
couldn't find a slave with a free hw address to give it
(this should not have happened)"

If the function waits until the change is finalized, these errors are
avoided.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index b65f5f3..d4e8733 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1543,15 +1543,19 @@ static int __ibmvnic_set_mac(struct net_device *netdev, 
struct sockaddr *p)
crq.change_mac_addr.first = IBMVNIC_CRQ_CMD;
crq.change_mac_addr.cmd = CHANGE_MAC_ADDR;
ether_addr_copy(_mac_addr.mac_addr[0], addr->sa_data);
+
+   init_completion(>fw_done);
ibmvnic_send_crq(adapter, );
+   wait_for_completion(>fw_done);
/* netdev->dev_addr is changed in handle_change_mac_rsp function */
-   return 0;
+   return adapter->fw_done_rc ? -EIO : 0;
 }
 
 static int ibmvnic_set_mac(struct net_device *netdev, void *p)
 {
struct ibmvnic_adapter *adapter = netdev_priv(netdev);
struct sockaddr *addr = p;
+   int rc;
 
if (adapter->state == VNIC_PROBED) {
memcpy(>desired.mac, addr, sizeof(struct sockaddr));
@@ -1559,9 +1563,9 @@ static int ibmvnic_set_mac(struct net_device *netdev, 
void *p)
return 0;
}
 
-   __ibmvnic_set_mac(netdev, addr);
+   rc = __ibmvnic_set_mac(netdev, addr);
 
-   return 0;
+   return rc;
 }
 
 /**
@@ -3558,8 +3562,8 @@ static void handle_error_indication(union ibmvnic_crq 
*crq,
ibmvnic_reset(adapter, VNIC_RESET_NON_FATAL);
 }
 
-static void handle_change_mac_rsp(union ibmvnic_crq *crq,
- struct ibmvnic_adapter *adapter)
+static int handle_change_mac_rsp(union ibmvnic_crq *crq,
+struct ibmvnic_adapter *adapter)
 {
struct net_device *netdev = adapter->netdev;
struct device *dev = >vdev->dev;
@@ -3568,10 +3572,13 @@ static void handle_change_mac_rsp(union ibmvnic_crq 
*crq,
rc = crq->change_mac_addr_rsp.rc.code;
if (rc) {
dev_err(dev, "Error %ld in CHANGE_MAC_ADDR_RSP\n", rc);
-   return;
+   goto out;
}
memcpy(netdev->dev_addr, >change_mac_addr_rsp.mac_addr[0],
   ETH_ALEN);
+out:
+   complete(>fw_done);
+   return rc;
 }
 
 static void handle_request_cap_rsp(union ibmvnic_crq *crq,
@@ -4031,7 +4038,7 @@ static void ibmvnic_handle_crq(union ibmvnic_crq *crq,
break;
case CHANGE_MAC_ADDR_RSP:
netdev_dbg(netdev, "Got MAC address change Response\n");
-   handle_change_mac_rsp(crq, adapter);
+   adapter->fw_done_rc = handle_change_mac_rsp(crq, adapter);
break;
case ERROR_INDICATION:
netdev_dbg(netdev, "Got Error Indication\n");
-- 
2.7.5

Re: [PATCH] ipv4: Get the address of interface correctly.

2018-01-29 Thread David Miller

From: Tonghao Zhang 
Date: Sun, 28 Jan 2018 03:38:58 -0800

> When using ioctl to get address of interface, we can't
> get it anymore. For example, the command is show as below.
> 
>   # ifconfig eth0
> 
> In the patch ("03aef17bb79b3"), the devinet_ioctl does not
> return a suitable value, even though we can find it in
> the kernel. Then fix it now.
> 
> Fixes: 03aef17bb79b3 ("devinet_ioctl(): take copyin/copyout to caller")
> Cc: Al Viro 
> Signed-off-by: Tonghao Zhang 

Applied, thank you.

Re: [PATCH] net: pxa168_eth: add netconsole support

2018-01-29 Thread David Miller

From: Alexander Monakov 
Date: Sat, 27 Jan 2018 23:29:07 +0300

> @@ -1362,6 +1362,14 @@ static int pxa168_eth_do_ioctl(struct net_device *dev, 
> struct ifreq *ifr,
>   return -EOPNOTSUPP;
>  }
>  
> +#ifdef CONFIG_NET_POLL_CONTROLLER
> +static void pxa168_eth_netpoll(struct net_device *dev)
> +{
> + struct pxa168_eth_private *pep = netdev_priv(dev);
> + napi_schedule(>napi);
> +}
> +#endif

This definitely is not sufficient.

Look at what other drivers do.

You have to invoke the interrupt handler with the device's interrupts disabled,
collect the events, and most importantly mask chip interrupts before scheduling
the NAPI instance.

Thank you.

Re: [PATCH net] net_sched: gen_estimator: fix lockdep splat

2018-01-29 Thread David Miller

From: Eric Dumazet 
Date: Sat, 27 Jan 2018 10:58:43 -0800

> From: Eric Dumazet 
> 
> syzbot reported a lockdep splat in gen_new_estimator() /
> est_fetch_counters() when attempting to lock est->stats_lock.
> 
> Since est_fetch_counters() is called from BH context from timer
> interrupt, we need to block BH as well when calling it from process
> context.
> 
> Most qdiscs use per cpu counters and are immune to the problem,
> but net/sched/act_api.c and net/netfilter/xt_RATEEST.c are using
> a spinlock to protect their data. They both call gen_new_estimator()
> while object is created and not yet alive, so this bug could
> not trigger a deadlock, only a lockdep splat.
> 
> Fixes: 1c0d32fde5bd ("net_sched: gen_estimator: complete rewrite of rate 
> estimators")
> Signed-off-by: Eric Dumazet 
> Reported-by: syzbot 

Applied, thanks.

Re: [PATCH v3] net: macb: Handle HRESP error

2018-01-29 Thread David Miller

From: 
Date: Sat, 27 Jan 2018 12:09:01 +0530

> From: Harini Katakam 
> 
> Handle HRESP error by doing a SW reset of RX and TX and
> re-initializing the descriptors, RX and TX queue pointers.
> 
> Signed-off-by: Harini Katakam 
> Signed-off-by: Michal Simek 

Applied, thanks.

Re: [PATCH net-next] net/mlx5e: IPoIB, Fix copy-paste bug in flow steering refactoring

2018-01-29 Thread David Miller

From: Saeed Mahameed 
Date: Fri, 26 Jan 2018 16:16:45 -0800

> From: Gal Pressman 
> 
> On TTC table creation, the indirection TIRs should be used instead of
> the inner indirection TIRs.
> 
> Fixes: 1ae1df3a1193 ("net/mlx5e: Refactor RSS related objects and code")
> Signed-off-by: Gal Pressman 
> Reviewed-by: Shalom Lagziel 
> Signed-off-by: Saeed Mahameed 

Applied, thanks.

Re: [PATCH net] ipv6: addrconf: break critical section in addrconf_verify_rtnl()

2018-01-29 Thread David Miller

From: Eric Dumazet 
Date: Fri, 26 Jan 2018 16:10:43 -0800

> From: Eric Dumazet 
> 
> Heiner reported a lockdep splat [1]
> 
> This is caused by attempting GFP_KERNEL allocation while RCU lock is
> held and BH blocked.
> 
> We believe that addrconf_verify_rtnl() could run for a long period,
> so instead of using GFP_ATOMIC here as Ido suggested, we should break
> the critical section and restart it after the allocation.
> 
> 
> [1]
 ...
> Fixes: f3d9832e56c4 ("ipv6: addrconf: cleanup locking in ipv6_add_addr")
> Signed-off-by: Eric Dumazet 
> Reported-by: Heiner Kallweit 

Applied and queued up for v4.15 -stable, thanks.

Re: [PATCH net] ipv6: change route cache aging logic

2018-01-29 Thread David Miller

From: Wei Wang 
Date: Fri, 26 Jan 2018 11:40:17 -0800

> From: Wei Wang 
> 
> In current route cache aging logic, if a route has both RTF_EXPIRE and
> RTF_GATEWAY set, the route will only be removed if the neighbor cache
> has no RTN_ROUTE flag. Otherwise, even if the route has expired, it
> won't get deleted.
> Fix this logic to always check if the route has expired first and then
> do the gateway neighbor cache check if previous check decide to not
> remove the exception entry.
> 
> Fixes: 1859bac04fb6 ("ipv6: remove from fib tree aged out RTF_CACHE dst")
> Signed-off-by: Wei Wang 
> Signed-off-by: Eric Dumazet 

Applied and queued up for -stable, thanks.

Re: [net] i40e/i40evf: Update DESC_NEEDED value to reflect larger value

2018-01-29 Thread David Miller

From: Jeff Kirsher 
Date: Fri, 26 Jan 2018 08:54:45 -0800

> From: Alexander Duyck 
> 
> When compared to ixgbe and other previous Intel drivers the i40e and i40evf
> drivers actually reserve 2 additional descriptors in maybe_stop_tx for
> cache line alignment. We need to update DESC_NEEDED to reflect this as
> otherwise we are more likely to return TX_BUSY which will cause issues with
> things like xmit_more.
> 
> Signed-off-by: Alexander Duyck 
> Tested-by: Andrew Bowers 
> Signed-off-by: Jeff Kirsher 

Applied, thanks.

Re: [PATCH net-next] bnxt_en: cleanup DIM work on device shutdown

2018-01-29 Thread David Miller

From: Andy Gospodarek 
Date: Fri, 26 Jan 2018 10:27:47 -0500

> From: Andy Gospodarek 
> 
> Make sure to cancel any pending work that might update driver coalesce
> settings when taking down an interface.
> 
> Fixes: 6a8788f25625 ("bnxt_en: add support for software dynamic interrupt 
> moderation")
> Signed-off-by: Andy Gospodarek 
> Cc: Michael Chan 

Applied, thank you.

Re: [PATCH net] net: ipv6: send unsolicited NA after DAD

2018-01-29 Thread David Miller

From: David Ahern 
Date: Thu, 25 Jan 2018 20:16:29 -0800

> Unsolicited IPv6 neighbor advertisements should be sent after DAD
> completes. Update ndisc_send_unsol_na to skip tentative, non-optimistic
> addresses and have those sent by addrconf_dad_completed after DAD.
> 
> Fixes: 4a6e3c5def13c ("net: ipv6: send unsolicited NA on admin up")
> Reported-by: Vivek Venkatraman 
> Signed-off-by: David Ahern 

Applied and queued up for -stable, thanks.

Re: [PATCH net] gianfar: prevent integer wrapping in the rx handler

2018-01-29 Thread David Miller

From: Andy Spencer 
Date: Thu, 25 Jan 2018 19:37:50 -0800

> When the frame check sequence (FCS) is split across the last two frames
> of a fragmented packet, part of the FCS gets counted twice, once when
> subtracting the FCS, and again when subtracting the previously received
> data.
> 
> For example, if 1602 bytes are received, and the first fragment contains
> the first 1600 bytes (including the first two bytes of the FCS), and the
> second fragment contains the last two bytes of the FCS:
> 
>   'skb->len == 1600' from the first fragment
> 
>   size  = lstatus & BD_LENGTH_MASK; # 1602
>   size -= ETH_FCS_LEN;  # 1598
>   size -= skb->len; # -2
> 
> Since the size is unsigned, it wraps around and causes a BUG later in
> the packet handling, as shown below:
> 
>   kernel BUG at ./include/linux/skbuff.h:2068!
>   Oops: Exception in kernel mode, sig: 5 [#1]
>   ...
>   NIP [c021ec60] skb_pull+0x24/0x44
>   LR [c01e2fbc] gfar_clean_rx_ring+0x498/0x690
>   Call Trace:
>   [df7edeb0] [c01e2c1c] gfar_clean_rx_ring+0xf8/0x690 (unreliable)
>   [df7edf20] [c01e33a8] gfar_poll_rx_sq+0x3c/0x9c
>   [df7edf40] [c023352c] net_rx_action+0x21c/0x274
>   [df7edf90] [c0329000] __do_softirq+0xd8/0x240
>   [df7edff0] [c000c108] call_do_irq+0x24/0x3c
>   [c0597e90] [c00041dc] do_IRQ+0x64/0xc4
>   [c0597eb0] [c000d920] ret_from_except+0x0/0x18
>   --- interrupt: 501 at arch_cpu_idle+0x24/0x5c
> 
> Change the size to a signed integer and then trim off any part of the
> FCS that was received prior to the last fragment.
> 
> Fixes: 6c389fc931bc ("gianfar: fix size of scatter-gathered frames")
> Signed-off-by: Andy Spencer 

Applied.

Re: iproute2 4.14.1 tc class add come to kernel-panic

2018-01-29 Thread Cong Wang

On Mon, Jan 29, 2018 at 9:03 AM, Cong Wang  wrote:
> On Mon, Jan 29, 2018 at 8:00 AM, Stephen Hemminger
>  wrote:
>> On Mon, 29 Jan 2018 16:18:07 +0100
>> "Roland Franke"  wrote:
>>
>>> Hello,
>>>
>>> > To: Roland Franke ; netdev@vger.kernel.org
>>> > Subject: Re: BUG: iproute2 4.14.1 tc class add come to kernel-panic
>>> >>
>>> >> tc qdisc add dev eth0 root handle 20: htb default 4 r2q 1
>>> >> tc class add dev eth0 parent 20: classid 20:7 htb rate 1kbit
>>> >> tc qdisc add dev eth0 parent 20:7 sfq perturb 10
>>> >> tc class add dev eth0 parent 20:7 classid 20:1 htb rate 200kbit ceil
>>> >> 1kbit prio 0
>>> >>
>>> >> I become an Kernel-panic with the following output:
>>> >> kern.err kernel: BUG: scheduling while atomic: tc/1036/0x0200
>>> >
>>>
>>> > Would you have a stack trace to share with us ?
>>>
>>> As i will be an absolute newby here, i will not know how to
>>> get the stack trace out.
>>> When i will get some information how to get this, i can try to
>>> give you this information.
>>> But by my last tests i made the first 3 commands on an console
>>> and had no error. Only by typing the last line i will get the error and
>>> here i get actally only the "kern.err kernel: BUG: ." message.
>>>
>>> Roland
>>
>> It generates this with lockdep (on 4.15)
>>
>> [  151.355076] HTB: quantum of class 27 is big. Consider r2q change.
>> [  151.371495] BUG: sleeping function called from invalid context at 
>> kernel/locking/mutex.c:747
>> [  151.371815] in_atomic(): 1, irqs_disabled(): 0, pid: 1135, name: tc
>> [  151.371875] 2 locks held by tc/1135:
>> [  151.371878]  #0:  (rtnl_mutex){+.+.}, at: [] 
>> rtnetlink_rcv_msg+0x23b/0x5f0
>> [  151.371899]  #1:  (_tx_lock){+...}, at: [] 
>> htb_change_class+0x55f/0x880 [sch_htb]
>> [  151.371918] CPU: 2 PID: 1135 Comm: tc Not tainted 4.14.15 #2
>> [  151.371921] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>> 1.10.2-1 04/01/2014
>> [  151.371924] Call Trace:
>> [  151.371934]  dump_stack+0x7c/0xbe
>> [  151.371944]  ___might_sleep+0x21e/0x250
>> [  151.371953]  __mutex_lock+0x59/0x9a0
>> [  151.371960]  ? lock_acquire+0xec/0x1e0
>> [  151.371973]  ? __raw_spin_lock_init+0x1c/0x50
>> [  151.371990]  ? _rcu_barrier+0x2f/0x170
>> [  151.371995]  ? __lockdep_init_map+0x5c/0x1d0
>> [  151.371998]  _rcu_barrier+0x2f/0x170
>> [  151.372006]  tcf_block_put+0x7f/0xa0
>> [  151.372013]  sfq_destroy+0x15/0x50 [sch_sfq]
>> [  151.372021]  qdisc_destroy+0x6c/0xe0
>> [  151.372028]  htb_change_class+0x704/0x880 [sch_htb]
>
>
> We hold qdisc tree spinlock but call rcu_barrier() in
> mini_qdisc_pair_swap()...

Well, not min_qdisc things, but it should be resolved by:

commit efbf78973978b0d25af59bc26c8013a942af6e64
Author: Cong Wang 
Date:   Mon Dec 4 10:48:18 2017 -0800

net_sched: get rid of rcu_barrier() in tcf_block_put_ext()

Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-01-29 Thread Jan Engelhardt


On Monday 2018-01-29 17:57, Florian Westphal wrote:
>> > > vmalloc() once became killable by commit 5d17a73a2ebeb8d1 ("vmalloc: back
>> > > off when the current task is killed") but then became unkillable by 
>> > > commit
>> > > b8c8a338f75e052d ("Revert "vmalloc: back off when the current task is
>> > > killed""). Therefore, we can't handle this problem from MM side.
>> > > Please consider adding some limit from networking side.
>> > 
>> > I don't know what "some limit" would be.  I would prefer if there was
>> > a way to supress OOM Killer in first place so we can just -ENOMEM user.
>> 
>> Just supressing OOM kill is a bad idea. We still leave a way to allocate
>> arbitrary large buffer in kernel.

At the very least, mm could limit that kind of "arbitrary". If the machine has
x GB (swap included) and the admin tries to make the kernel allocate space for
an x GB ruleset, no way is it going to be satisfiable _even with OOM_.

>I think we should try to allocate whatever amount of memory is needed
>for the given xtables ruleset, given that is what admin requested us to do.

Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-01-29 Thread Kirill A. Shutemov

On Mon, Jan 29, 2018 at 05:57:22PM +0100, Florian Westphal wrote:
> Kirill A. Shutemov  wrote:
> > On Mon, Jan 29, 2018 at 08:23:57AM +0100, Florian Westphal wrote:
> > > > vmalloc() once became killable by commit 5d17a73a2ebeb8d1 ("vmalloc: 
> > > > back
> > > > off when the current task is killed") but then became unkillable by 
> > > > commit
> > > > b8c8a338f75e052d ("Revert "vmalloc: back off when the current task is
> > > > killed""). Therefore, we can't handle this problem from MM side.
> > > > Please consider adding some limit from networking side.
> > > 
> > > I don't know what "some limit" would be.  I would prefer if there was
> > > a way to supress OOM Killer in first place so we can just -ENOMEM user.
> > 
> > Just supressing OOM kill is a bad idea. We still leave a way to allocate
> > arbitrary large buffer in kernel.
> 
> Isn't that what we do everywhere in network stack?
> 
> I think we should try to allocate whatever amount of memory is needed
> for the given xtables ruleset, given that is what admin requested us to do.

Is it correct that "admin" in this case is root in random container?
I mean, can we get access to it with CLONE_NEWUSER|CLONE_NEWNET?

This can be fun.

> I also would not know what limit is sane -- I've seen setups with as much
> as 100k iptables rules, and that was 5 years ago.
> 
> And even if we add a "Xk rules" limit, it might be too much for
> low-memory systems, or not enough for whatever other use case there
> might be.

I hate what I'm saying, but I guess we need some tunable here.
Not sure what exactly.

-- 
 Kirill A. Shutemov

Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-01-29 Thread Michal Hocko

On Mon 29-01-18 17:57:22, Florian Westphal wrote:
> Kirill A. Shutemov  wrote:
> > On Mon, Jan 29, 2018 at 08:23:57AM +0100, Florian Westphal wrote:
> > > > vmalloc() once became killable by commit 5d17a73a2ebeb8d1 ("vmalloc: 
> > > > back
> > > > off when the current task is killed") but then became unkillable by 
> > > > commit
> > > > b8c8a338f75e052d ("Revert "vmalloc: back off when the current task is
> > > > killed""). Therefore, we can't handle this problem from MM side.
> > > > Please consider adding some limit from networking side.
> > > 
> > > I don't know what "some limit" would be.  I would prefer if there was
> > > a way to supress OOM Killer in first place so we can just -ENOMEM user.
> > 
> > Just supressing OOM kill is a bad idea. We still leave a way to allocate
> > arbitrary large buffer in kernel.
> 
> Isn't that what we do everywhere in network stack?
> 
> I think we should try to allocate whatever amount of memory is needed
> for the given xtables ruleset, given that is what admin requested us to do.

If this is a root only thing then __GFP_NORETRY sounds like the most
straightforward way to go.
-- 
Michal Hocko
SUSE Labs

Re: [virtio-dev] Re: [RFC PATCH net-next v2 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-01-29 Thread Michael S. Tsirkin

On Sun, Jan 28, 2018 at 08:26:53PM -0800, Alexander Duyck wrote:
> >> > For live migration with advanced usecases that Siwei is suggesting, i
> >> > think we need a new driver with a new device type that can track the
> >> > VF specific feature settings even when the VF driver is unloaded.
> >
> > I see no added value of the 3 netdev model, there is no need for a bond
> > device.
> 
> I agree a full-blown bond isn't what is needed. However, just forking
> traffic out from virtio to a VF doesn't really solve things either.
> 
> One of the issues as I see it is the fact that the qdisc model with
> the merged device gets to be pretty ugly. There is the fact that every
> packet that goes to the VF has to pass through the qdisc code twice.
> There is the dual nature of the 2 netdev solution that also introduces
> the potential for head-of-line blocking since the virtio could put
> back pressure on the upper qdisc layer which could stall the VF
> traffic when switching over. I hope we could avoid issues like that by
> maintaining qdiscs per device queue instead of on an upper device that
> is half software interface and half not. Ideally the virtio-bond
> device could operate without a qdisc and without needing any
> additional locks so there shouldn't be head of line blocking occurring
> between the two interfaces and overhead could be kept minimal.
> 
> Also in the case of virtio there is support for in-driver XDP. As
> Sridhar stated, when using the 2 netdev model "we cannot support XDP
> in this model and it needs to be disabled". That sounds like a step
> backwards instead of forwards. I would much rather leave the XDP
> enabled at the lower dev level, and then if we want we can use the
> generic XDP at the virtio-bond level to capture traffic on both
> interfaces at the same time.

I agree dropping XDP makes everything very iffy.

> In the case of netvsc you have control of both sides of a given link
> so you can match up the RSS tables, queue configuration and everything
> is somewhat symmetric since you are running the PF and all the HyperV
> subchannels. Most of the complexity is pushed down into the host and
> your subchannel management is synchronized there if I am not mistaken.
> We don't have this in the case of this virtio-bond setup. Instead a
> single bit is set indicating "backup" without indicating what that
> means to topology other than the fact that this virtio interface is
> the backup for some other interface. We are essentially blind other
> than having the link status for the VF and virtio and knowing that the
> virtio is intended to be the backup.

Would you be interested in posting at least a proof of concept
patch for this approach? It's OK if there are some TODO stubs.
It would be much easier to compare code to code rather than
a high level description to code.

> We might be able to get to a 2 or maybe even a 1 netdev solution at
> some point in the future, but  I don't think that time is now. For now
> a virtio-bond type solution would allow us to address most of the use
> cases with minimal modification to the existing virtio and can deal
> with feature and/or resource asymmetry.
> 
> - Alex

Re: [Patch net-next v3 0/3] net_sched: reflect tx_queue_len change for pfifo_fast

2018-01-29 Thread David Miller

From: Cong Wang 
Date: Thu, 25 Jan 2018 18:26:21 -0800

> This pathcset restores the pfifo_fast qdisc behavior of dropping
> packets based on latest dev->tx_queue_len. Patch 1 introduces
> a helper, patch 2 introduces a new Qdisc ops which is called when
> we modify tx_queue_len, patch 3 implements this ops for pfifo_fast.
> 
> Please see each patch for details.
> 
> ---
> v3: use skb_array_resize_multiple()
> v2: handle error case for ->change_tx_queue_len()

Series applied, thanks Cong.

Please follow up with John about making the queue allocation use
a prepare + commit phase.

And as for the TX queue state handling on change, I think it's
fine to purge the whole queue.  That is definitely better than
allowing the queue length to be visibly over the tx_queue_len
setting.

Thank you.

Re: [PATCH net-next 0/2] Ease to follow an interface that moves to another netns

2018-01-29 Thread David Miller

From: Nicolas Dichtel 
Date: Thu, 25 Jan 2018 15:01:37 +0100

> The goal of this series is to ease the user to follow an interface that
> moves to another netns.
> 
> After this series, with a patched iproute2:
> 
> $ ip netns
> bar
> foo
> $ ip monitor link &
> $ ip link set dummy0 netns foo
> Deleted 5: dummy0:  mtu 1500 qdisc noop state DOWN group 
> default
> link/ether 6e:a7:82:35:96:46 brd ff:ff:ff:ff:ff:ff new-nsid 0 new-ifindex 
> 6
> 
> => new nsid: 0, new ifindex: 6 (was 5 in the previous netns)
> 
> $ ip link set eth1 netns bar
> Deleted 3: eth1:  mtu 1500 qdisc noop state DOWN group 
> default
> link/ether 52:54:01:12:34:57 brd ff:ff:ff:ff:ff:ff new-nsid 1 new-ifindex 
> 3
> 
> => new nsid: 1, new ifindex: 3 (same ifindex)
> 
> $ ip netns
> bar (id: 1)
> foo (id: 0)

Series applied, thanks Nicolas.

Re: [PATCH net] vhost_net: stop device during reset owner

2018-01-29 Thread David Miller

From: Jason Wang 
Date: Thu, 25 Jan 2018 22:03:52 +0800

> We don't stop device before reset owner, this means we could try to
> serve any virtqueue kick before reset dev->worker. This will result a
> warn since the work was pending at llist during owner resetting. Fix
> this by stopping device during owner reset.
> 
> Reported-by: syzbot+eb17c6162478cc506...@syzkaller.appspotmail.com
> Fixes: 3a4d5c94e9593 ("vhost_net: a kernel-level virtio server")
> Signed-off-by: Jason Wang 

Applied and queued up for -stable.

Re: [PATCH 3/3] Revert "e1000e: Do not read ICR in Other interrupt"

2018-01-29 Thread Alexander Duyck

On Sun, Jan 28, 2018 at 11:28 PM, Benjamin Poirier  wrote:
> On 2018/01/26 13:01, Alexander Duyck wrote:
>> On Fri, Jan 26, 2018 at 1:12 AM, Benjamin Poirier  wrote:
>> > This reverts commit 16ecba59bc333d6282ee057fb02339f77a880beb.
>> >
>> > It was reported that emulated e1000e devices in vmware esxi 6.5 Build
>> > 7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
>> > overrun interrupt bursts"). Some tracing shows that after
>> > e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
>> > on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
>> > icr=0x8004 (_INT_ASSERTED | _LSC) in the same situation.
>> >
>> > Some experimentation showed that this flaw in vmware e1000e emulation can
>> > be worked around by not setting Other in EIAC. This is how it was before
>> > commit 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt").
>> >
>> > Since the ICR read in the Other interrupt handler has already been
>> > restored, this patch effectively reverts the remainder of commit
>> > 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt").
>> >
>> > Fixes: 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
>> > Signed-off-by: Benjamin Poirier 
>> > ---
>> >  drivers/net/ethernet/intel/e1000e/netdev.c | 10 --
>> >  1 file changed, 8 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
>> > b/drivers/net/ethernet/intel/e1000e/netdev.c
>> > index ed103b9a8d3a..fffc1f0e3895 100644
>> > --- a/drivers/net/ethernet/intel/e1000e/netdev.c
>> > +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
>> > @@ -1916,6 +1916,13 @@ static irqreturn_t e1000_msix_other(int 
>> > __always_unused irq, void *data)
>> > struct e1000_hw *hw = >hw;
>> > u32 icr = er32(ICR);
>> >
>> > +   /* Certain events (such as RXO) which trigger Other do not set
>> > +* INT_ASSERTED. In that case, read to clear of icr does not take
>> > +* place.
>> > +*/
>> > +   if (!(icr & E1000_ICR_INT_ASSERTED))
>> > +   ew32(ICR, E1000_ICR_OTHER);
>> > +
>>
>> This piece doesn't make sense to me. Why are we clearing OTHER if
>> ICR_INT_ASSERTED is not set?
>
> Datasheet §10.2.4.1 ("Interrupt Cause Read Register") says that ICR read
> to clear only occurs if INT_ASSERTED is set. This corresponds to what I
> observed.
>
> However, while working on these issues, I noticed that when there is an rxo
> event, INT_ASSERTED is not always set even though the interrupt is raised. I
> think this is a hardware flaw.

I agree. I need to check with our silicon team to see what we can determine.

> For example, if doing
> ew32(ICS, E1000_ICS_LSC | E1000_ICS_OTHER);
> we enter e1000_msix_other() and two consecutive reads of ICR result in
> 0x8104
> 0x
>
> If doing
> ew32(ICS, E1000_ICS_RXO | E1000_ICS_OTHER);
> we enter e1000_msix_other() and two consecutive reads of ICR result in
> 0x0141
> 0x0141

This is interesting. So the ICR is doing the clear on read, so that
answers the question I had about the earlier patch.

One thought on this.. Is there any reason why you are limiting this to
only the OTHER bit? It seems like RXO and the other causes that aren't
supposed to be included in the mask should probably be cleared as
well, are they auto-cleared, ignored, or is there some advantage to
leaving them set?

> Consequently, we must clear OTHER manually from ICR, otherwise the
> interrupt is immediately re-raised after exiting the handler.
>
> These observations are the same whether the interrupt is triggered via a
> write to ICS or in hardware.
>
> Furthermore, I tested that this behavior is the same for other Other
> events (MDAC, SRPD, ACK, MNG). Those were tested via a write to ICS
> only, not in hardware.
>
> This is a version of the test patch that I used to trigger lsc and rxo in
> software and hardware. It applies over this patch series.

I plan to look into this some more over the next few days. Ideally if
we could mask these "OTHER" interrupts besides the LSC we could comply
with all the needed bits for MSI-X. My concern is that we are still
stuck reading the ICR at this point because of this and it is going to
make dealing with MSI-X challenging on 82574 since it seems like the
intention was that you weren't supposed to be reading the ICR when
MSI-X is enabled based on the list of current issues and HW errata.

At this point it seems like the interrupts is firing and the
INT_ASSERTED is all we really need to be checking for if I understand
this all correctly. Basically if LSC is set it will trigger OTHER and
INT_ASSERTED, if any of the other causes are set they are only setting
OTHER.

> diff --git a/drivers/net/ethernet/intel/e1000e/defines.h 
> b/drivers/net/ethernet/intel/e1000e/defines.h
> index 0641c0098738..f54e7ac9c934 100644
> --- a/drivers/net/ethernet/intel/e1000e/defines.h
> +++

Re: [PATCH v4] net: ethernet: cavium: Correct Cavium Thunderx NIC driver names accordingly to module name

2018-01-29 Thread David Miller

From: Vadim Lomovtsev 
Date: Thu, 25 Jan 2018 03:38:17 -0800

> From: Vadim Lomovtsev 
> 
> It was found that ethtool provides unexisting module name while
> it queries the specified network device for associated driver
> information. Then user tries to unload that module by provided
> module name and fails.
> 
> This happens because ethtool reads value of DRV_NAME macro,
> while module name is defined at the driver's Makefile.
> 
> This patch is to correct Cavium CN88xx Thunder NIC driver names
> (DRV_NAME macro) 'thunder-nicvf' to 'nicvf' and 'thunder-nic'
> to 'nicpf', sync bgx and xcv driver names accordingly to their
> module names.
> 
> Signed-off-by: Vadim Lomovtsev 

Applied to net-next, thank you.

Re: [PATCH net-next 0/3 V1] rtnetlink: enable IFLA_IF_NETNSID for RTM_{DEL,SET}LINK

2018-01-29 Thread Christian Brauner

On Mon, Jan 29, 2018 at 11:31:57AM -0500, David Miller wrote:
> From: Christian Brauner 
> Date: Wed, 24 Jan 2018 15:26:31 +0100
> 
> > Based on the previous discussion this enables passing a IFLA_IF_NETNSID
> > property along with RTM_SETLINK and RTM_DELLINK requests. The patch for
> > RTM_NEWLINK will be sent out in a separate patch since there are more
> > corner-cases to think about.
>  ...
> > Changelog 2018-01-24:
> > * Preserve old behavior and report -ENODEV when either ifindex or ifname is
> >   provided and IFLA_GROUP is set. Spotted by Wolfgang Bumiller.
> 
> Series applied, and this seems to be consistent with what Jiri envisioned
> in commit 79e1ad148c84 ("rtnetlink: use netnsid to query interface")
> 
> Thank you.

Thanks for applying!
Christian

[PATCH net-next 0/1] rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK

2018-01-29 Thread Christian Brauner

Hi,

Based on the previous discussion this enables passing a IFLA_IF_NETNSID
property along with RTM_NEWLINK requests. The latter patch was missing from my
previous series to allow for some more time to test this.

Best,
Christian

Christian Brauner (1):
  rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK

 net/core/rtnetlink.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

-- 
2.14.1

[PATCH net-next 1/1] rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK

2018-01-29 Thread Christian Brauner

- Backwards Compatibility:
  If userspace wants to determine whether RTM_NEWLINK supports the
  IFLA_IF_NETNSID property they should first send an RTM_GETLINK request
  with IFLA_IF_NETNSID on lo. If either EACCESS is returned or the reply
  does not include IFLA_IF_NETNSID userspace should assume that
  IFLA_IF_NETNSID is not supported on this kernel.
  If the reply does contain an IFLA_IF_NETNSID property userspace
  can send an RTM_NEWLINK with a IFLA_IF_NETNSID property. If they receive
  EOPNOTSUPP then the kernel does not support the IFLA_IF_NETNSID property
  with RTM_NEWLINK. Userpace should then fallback to other means.

- Security:
  Callers must have CAP_NET_ADMIN in the owning user namespace of the
  target network namespace.

Signed-off-by: Christian Brauner 
---
 net/core/rtnetlink.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index f111557958bb..889b34f78a44 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2954,14 +2954,10 @@ static int rtnl_newlink(struct sk_buff *skb, struct 
nlmsghdr *nlh,
name_assign_type = NET_NAME_ENUM;
}
 
-   dest_net = rtnl_link_get_net(net, tb);
+   dest_net = rtnl_link_get_net_capable(skb, net, tb, 
CAP_NET_ADMIN);
if (IS_ERR(dest_net))
return PTR_ERR(dest_net);
 
-   err = -EPERM;
-   if (!netlink_ns_capable(skb, dest_net->user_ns, CAP_NET_ADMIN))
-   goto out;
-
if (tb[IFLA_LINK_NETNSID]) {
int id = nla_get_s32(tb[IFLA_LINK_NETNSID]);
 
-- 
2.14.1

Re: iproute2 4.14.1 tc class add come to kernel-panic

2018-01-29 Thread Cong Wang

On Mon, Jan 29, 2018 at 8:00 AM, Stephen Hemminger
 wrote:
> On Mon, 29 Jan 2018 16:18:07 +0100
> "Roland Franke"  wrote:
>
>> Hello,
>>
>> > To: Roland Franke ; netdev@vger.kernel.org
>> > Subject: Re: BUG: iproute2 4.14.1 tc class add come to kernel-panic
>> >>
>> >> tc qdisc add dev eth0 root handle 20: htb default 4 r2q 1
>> >> tc class add dev eth0 parent 20: classid 20:7 htb rate 1kbit
>> >> tc qdisc add dev eth0 parent 20:7 sfq perturb 10
>> >> tc class add dev eth0 parent 20:7 classid 20:1 htb rate 200kbit ceil
>> >> 1kbit prio 0
>> >>
>> >> I become an Kernel-panic with the following output:
>> >> kern.err kernel: BUG: scheduling while atomic: tc/1036/0x0200
>> >
>>
>> > Would you have a stack trace to share with us ?
>>
>> As i will be an absolute newby here, i will not know how to
>> get the stack trace out.
>> When i will get some information how to get this, i can try to
>> give you this information.
>> But by my last tests i made the first 3 commands on an console
>> and had no error. Only by typing the last line i will get the error and
>> here i get actally only the "kern.err kernel: BUG: ." message.
>>
>> Roland
>
> It generates this with lockdep (on 4.15)
>
> [  151.355076] HTB: quantum of class 27 is big. Consider r2q change.
> [  151.371495] BUG: sleeping function called from invalid context at 
> kernel/locking/mutex.c:747
> [  151.371815] in_atomic(): 1, irqs_disabled(): 0, pid: 1135, name: tc
> [  151.371875] 2 locks held by tc/1135:
> [  151.371878]  #0:  (rtnl_mutex){+.+.}, at: [] 
> rtnetlink_rcv_msg+0x23b/0x5f0
> [  151.371899]  #1:  (_tx_lock){+...}, at: [] 
> htb_change_class+0x55f/0x880 [sch_htb]
> [  151.371918] CPU: 2 PID: 1135 Comm: tc Not tainted 4.14.15 #2
> [  151.371921] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.10.2-1 04/01/2014
> [  151.371924] Call Trace:
> [  151.371934]  dump_stack+0x7c/0xbe
> [  151.371944]  ___might_sleep+0x21e/0x250
> [  151.371953]  __mutex_lock+0x59/0x9a0
> [  151.371960]  ? lock_acquire+0xec/0x1e0
> [  151.371973]  ? __raw_spin_lock_init+0x1c/0x50
> [  151.371990]  ? _rcu_barrier+0x2f/0x170
> [  151.371995]  ? __lockdep_init_map+0x5c/0x1d0
> [  151.371998]  _rcu_barrier+0x2f/0x170
> [  151.372006]  tcf_block_put+0x7f/0xa0
> [  151.372013]  sfq_destroy+0x15/0x50 [sch_sfq]
> [  151.372021]  qdisc_destroy+0x6c/0xe0
> [  151.372028]  htb_change_class+0x704/0x880 [sch_htb]


We hold qdisc tree spinlock but call rcu_barrier() in
mini_qdisc_pair_swap()...



> [  151.372050]  tc_ctl_tclass+0x193/0x3c0
> [  151.372075]  rtnetlink_rcv_msg+0x270/0x5f0
> [  151.372088]  ? rtnetlink_put_metrics+0x1c0/0x1c0
> [  151.372096]  netlink_rcv_skb+0xde/0x110
> [  151.372109]  netlink_unicast+0x1e4/0x310
> [  151.372118]  netlink_sendmsg+0x2dc/0x3c0
> [  151.372136]  sock_sendmsg+0x30/0x40
> [  151.372142]  ___sys_sendmsg+0x2b9/0x2d0
> [  151.372171]  ? __handle_mm_fault+0x7f8/0x1120
> [  151.372189]  ? __do_page_fault+0x2a5/0x530
> [  151.372200]  ? __sys_sendmsg+0x51/0x90
> [  151.372205]  __sys_sendmsg+0x51/0x90
> [  151.372225]  entry_SYSCALL_64_fastpath+0x29/0xa0
> [  151.372230] RIP: 0033:0x7f7049397490
> [  151.372233] RSP: 002b:7ffd0c432f48 EFLAGS: 0246
> [  151.372445] BUG: scheduling while atomic: tc/1135/0x0202
> [  151.372524] 3 locks held by tc/1135:
> [  151.372527]  #0:  (rtnl_mutex){+.+.}, at: [] 
> rtnetlink_rcv_msg+0x23b/0x5f0
> [  151.372544]  #1:  (_tx_lock){+...}, at: [] 
> htb_change_class+0x55f/0x880 [sch_htb]
> [  151.372560]  #2:  (rcu_sched_state.barrier_mutex){+.+.}, at: 
> [] _rcu_barrier+0x2f/0x170
> [  151.372575] Modules linked in: sch_sfq sch_htb ppdev input_leds serio_raw 
> joydev parport_pc parport i2c_piix4 mac_hid ib_iser rdma_cm iw_cm ib_cm 
> ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs 
> zstd_decompress zstd_compress xxhash raid10 raid456 libcrc32c 
> async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 
> raid0 multipath linear qxl drm_kms_helper syscopyarea sysfillrect sysimgblt 
> fb_sys_fops ttm drm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel 
> aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse floppy 
> pata_acpi hid_generic usbhid hid
> [  151.372748] CPU: 2 PID: 1135 Comm: tc Tainted: GW   4.14.15 #2
> [  151.372751] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.10.2-1 04/01/2014
> [  151.372753] Call Trace:
> [  151.372759]  dump_stack+0x7c/0xbe
> [  151.372767]  __schedule_bug+0x62/0x90
> [  151.372772]  __schedule+0x7d9/0xb80
> [  151.372788]  schedule+0x39/0x90
> [  151.372793]  schedule_timeout+0x24d/0x5c0
> [  151.372813]  ? mark_held_locks+0x71/0xa0
> [  151.372825]  ? wait_for_completion+0x12e/0x1a0
> [  151.372829]  wait_for_completion+0x12e/0x1a0
> [  151.372835]  ? wake_up_q+0x60/0x60
> [  151.372846]  _rcu_barrier+0x11a/0x170
> [  151.372853]  tcf_block_put+0x7f/0xa0
> [  151.372860]

Re: [PATCH net-next 00/12] ptr_ring fixes

2018-01-29 Thread David Miller

From: Jason Wang 
Date: Mon, 29 Jan 2018 15:10:37 +0800

> 
> 
> On 2018年01月26日 07:36, Michael S. Tsirkin wrote:
>> This fixes a bunch of issues around ptr_ring use in net core.
>> One of these: "tap: fix use-after-free" is also needed on net,
>> but can't be backported cleanly.
>>
>> I will post a net patch separately.
>>
>> Lightly tested - Jason, could you pls confirm this
>> addresses the security issue you saw with ptr_ring?
>> Testing reports would be appreciated too.
>>
>> Michael S. Tsirkin (12):
>>ptr_ring: keep consumer_head valid at all times
>>ptr_ring: clean up documentation
>>ptr_ring: READ/WRITE_ONCE for __ptr_ring_empty
>>tap: fix use-after-free
>>ptr_ring: disallow lockless __ptr_ring_full
>>Revert "net: ptr_ring: otherwise safe empty checks can overrun array
>>  bounds"
>>skb_array: use __ptr_ring_empty
>>ptr_ring: prevent queue load/store tearing
>>tools/virtio: switch to __ptr_ring_empty
>>tools/virtio: more stubs to fix tools build
>>tools/virtio: copy READ/WRITE_ONCE
>>tools/virtio: fix smp_mb on x86
>>
>>   drivers/net/tap.c|  3 --
>>   include/linux/ptr_ring.h | 86 ++--
>>   include/linux/skb_array.h|  2 +-
>>   tools/virtio/linux/kernel.h  |  2 +-
>>   tools/virtio/linux/thread_info.h |  1 +
>>   tools/virtio/ringtest/main.h | 59 ++-
>>   tools/virtio/ringtest/ptr_ring.c |  2 +-
>>   7 files changed, 110 insertions(+), 45 deletions(-)
>>   create mode 100644 tools/virtio/linux/thread_info.h
>>
> 
> For the series:
> 
> Tested-by: Jason Wang 
> Acked-by: Jason Wang 

Series applied, thanks everyone.

Re: [PATCH net-next] ptr_ring: fix integer overflow

2018-01-29 Thread David Miller

From: Jason Wang 
Date: Thu, 25 Jan 2018 15:31:42 +0800

> We try to allocate one more entry for lockless peeking. The adding
> operation may overflow which causes zero to be passed to kmalloc().
> In this case, it returns ZERO_SIZE_PTR without any notice by ptr
> ring. Try to do producing or consuming on such ring will lead NULL
> dereference. Fix this detect and fail early.
> 
> Fixes: bcecb4bbf88a ("net: ptr_ring: otherwise safe empty checks can overrun 
> array bounds")
> Reported-by: syzbot+87678bcf753b44c39...@syzkaller.appspotmail.com
> Cc: John Fastabend 
> Signed-off-by: Jason Wang 

I'm dropping this because I am to understand that Michael Tsirkin's patch
series covers this issue.

Let me know if I still need to apply this.

Thanks.

Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-01-29 Thread Florian Westphal

Kirill A. Shutemov  wrote:
> On Mon, Jan 29, 2018 at 08:23:57AM +0100, Florian Westphal wrote:
> > > vmalloc() once became killable by commit 5d17a73a2ebeb8d1 ("vmalloc: back
> > > off when the current task is killed") but then became unkillable by commit
> > > b8c8a338f75e052d ("Revert "vmalloc: back off when the current task is
> > > killed""). Therefore, we can't handle this problem from MM side.
> > > Please consider adding some limit from networking side.
> > 
> > I don't know what "some limit" would be.  I would prefer if there was
> > a way to supress OOM Killer in first place so we can just -ENOMEM user.
> 
> Just supressing OOM kill is a bad idea. We still leave a way to allocate
> arbitrary large buffer in kernel.

Isn't that what we do everywhere in network stack?

I think we should try to allocate whatever amount of memory is needed
for the given xtables ruleset, given that is what admin requested us to do.

I also would not know what limit is sane -- I've seen setups with as much
as 100k iptables rules, and that was 5 years ago.

And even if we add a "Xk rules" limit, it might be too much for
low-memory systems, or not enough for whatever other use case there
might be.

Re: [net-next 0/8][pull request] 1GbE Intel Wired LAN Driver Updates 2018-01-24

2018-01-29 Thread David Miller

From: Alexander Duyck 
Date: Wed, 24 Jan 2018 18:29:42 -0800

> I'll double check our VF drivers and make sure none of them are
> exposing a netdevice with an all 0's MAC address, and see what we can
> do about relocating the locally administered address generation into
> the PF.

If such netdevs are never exposed with all 0's MACs, then I'm happy.

Thanks for the explanation.

Re: [PATCH] ceph: Check memory allocation result

2018-01-29 Thread Ilya Dryomov

On Fri, Jan 26, 2018 at 7:54 AM, Chengguang Xu  wrote:
> Should check result of kstrndup() in case of memory allocation failure.
>
> Signed-off-by: Chengguang Xu 
> ---
>  net/ceph/ceph_common.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c
> index 5c036d2..1e492ef 100644
> --- a/net/ceph/ceph_common.c
> +++ b/net/ceph/ceph_common.c
> @@ -421,6 +421,10 @@ struct ceph_options *
> opt->name = kstrndup(argstr[0].from,
>   argstr[0].to-argstr[0].from,
>   GFP_KERNEL);
> +   if (!opt->name) {
> +   err = -ENOMEM;
> +   goto out;
> +   }
> break;
> case Opt_secret:
> opt->key = kzalloc(sizeof(*opt->key), GFP_KERNEL);

Applied.

Thanks,

Ilya

Re: [PATCH net] ipv6: Fix SO_REUSEPORT UDP socket with implicit sk_ipv6only

2018-01-29 Thread David Miller

From: Martin KaFai Lau 
Date: Wed, 24 Jan 2018 23:15:27 -0800

> If a sk_v6_rcv_saddr is !IPV6_ADDR_ANY and !IPV6_ADDR_MAPPED, it
> implicitly implies it is an ipv6only socket.  However, in inet6_bind(),
> this addr_type checking and setting sk->sk_ipv6only to 1 are only done
> after sk->sk_prot->get_port(sk, snum) has been completed successfully.
> 
> This inconsistency between sk_v6_rcv_saddr and sk_ipv6only confuses
> the 'get_port()'.
> 
> In particular, when binding SO_REUSEPORT UDP sockets,
> udp_reuseport_add_sock(sk,...) is called.  udp_reuseport_add_sock()
> checks "ipv6_only_sock(sk2) == ipv6_only_sock(sk)" before adding sk to
> sk2->sk_reuseport_cb.  In this case, ipv6_only_sock(sk2) could be
> 1 while ipv6_only_sock(sk) is still 0 here.  The end result is,
> reuseport_alloc(sk) is called instead of adding sk to the existing
> sk2->sk_reuseport_cb.
> 
> It can be reproduced by binding two SO_REUSEPORT UDP sockets on an
> IPv6 address (!ANY and !MAPPED).  Only one of the socket will
> receive packet.
> 
> The fix is to set the implicit sk_ipv6only before calling get_port().
> The original sk_ipv6only has to be saved such that it can be restored
> in case get_port() failed.  The situation is similar to the
> inet_reset_saddr(sk) after get_port() has failed.
> 
> Thanks to Calvin Owens  who created an easy
> reproduction which leads to a fix.
> 
> Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
> Signed-off-by: Martin KaFai Lau 

Applied and queued up for -stable, thanks Martin.

Re: [PATCH v2 0/4] Check size of packets before sending

2018-01-29 Thread David Miller

From: Daniel Axtens 
Date: Mon, 29 Jan 2018 14:20:58 +1100

> OK, so how about:
> 
>  - first, a series that introduces skb_gso_validate_mac_len and uses it
>in bnx2x. This should be backportable without difficulty.
> 
>  - then, a series that wires skb_gso_validate_mac_len into the core -
>validate_xmit_skb for instance, and reverts the bnx2x fix. This would
>not need to be backported.
> 
> If that's an approach we can agree on, I am ready to send it when
> net-next opens again.

Please send the bnx2x specific fix now, thank you.

Re: [PATCH net-next 0/3 V1] rtnetlink: enable IFLA_IF_NETNSID for RTM_{DEL,SET}LINK

2018-01-29 Thread David Miller

From: Christian Brauner 
Date: Wed, 24 Jan 2018 15:26:31 +0100

> Based on the previous discussion this enables passing a IFLA_IF_NETNSID
> property along with RTM_SETLINK and RTM_DELLINK requests. The patch for
> RTM_NEWLINK will be sent out in a separate patch since there are more
> corner-cases to think about.
 ...
> Changelog 2018-01-24:
> * Preserve old behavior and report -ENODEV when either ifindex or ifname is
>   provided and IFLA_GROUP is set. Spotted by Wolfgang Bumiller.

Series applied, and this seems to be consistent with what Jiri envisioned
in commit 79e1ad148c84 ("rtnetlink: use netnsid to query interface")

Thank you.

[ANNOUNCE] iproute2 4.15 release

2018-01-29 Thread Stephen Hemminger

Release of iproute2 for Linux 4.15

Update to iproute2 utility to support new features in Linux 4.15.
This release ands more JSON output and fixes some bugs that JSON
code introduced. Also more updates to active developing subsystems
such as devlink and bpf.

The tarball can be dowloaded from:
  https://www.kernel.org/pub/linux/utils/net/iproute2/iproute2-4.15.0.tar.gz

The upstream repositories for master and net-next branch are now
split. Master branch is at:
  git://git.kernel.org/pub/scm/network/iproute2/iproute2.gti

and patches for next release are in (master branch):
  git://git.kernel.org/pub/scm/network/iproute2/iproute2-next.git

There are symlinks so that old paths still work.

Report problems (or enhancements) to the netdev@vger.kernel.org mailing list.

---
Alexander Alemayhu (4):
  man: add examples to ip.8
  man: fix man page warnings
  tc: bpf: add ppc64 and sparc64 to list of archs with eBPF support
  examples/bpf: update list of examples

Alexander Aring (5):
  tc: m_ife: allow ife type to zero
  tc: m_ife: print IEEE ethertype format
  tc: m_ife: report about kernels default type
  man: tc-ife: add default type note
  tc: m_ife: fix match tcindex parsing

Alexander Heinlein (1):
  ip/xfrm: Fix deleteall when having many policies installed

Alexander Zubkov (2):
  iproute: list/flush/save filter also by metric
  iproute: "list/flush/save default" selected all of the routes

Alexey Kodanev (1):
  fix typo in ip-xfrm man page, rmd610 -> rmd160

Amir Vadai (14):
  libnetlink: Introduce rta_getattr_be*()
  tc/cls_flower: Classify packet in ip tunnels
  tc/act_tunnel: Introduce ip tunnel action
  tc/pedit: Fix a typo in pedit usage message
  tc/pedit: Extend pedit to specify offset relative to mac/transport headers
  tc/pedit: Introduce 'add' operation
  tc/pedit: p_ip: introduce editing ttl header
  tc/pedit: Support fields bigger than 32 bits
  tc/pedit: p_eth: ETH header editor
  tc/pedit: p_tcp: introduce pedit tcp support
  pedit: Fix a typo in warning
  pedit: Do not allow using retain for too big fields
  pedit: Check for extended capability in protocol parser
  pedit: Introduce ipv6 support

Amritha Nambiar (4):
  tc/mqprio: Offload mode and shaper options in mqprio
  flower: Represent HW traffic classes as classid values
  man: tc-mqprio: add documentation for new offload options
  man: tc-flower: add explanation for hw_tc option

Andreas Henriksson (1):
  ss: fix help/man TCP-STATE description for listening

Antonio Quartulli (2):
  ss: fix crash when skipping disabled header field
  ss: fix NULL pointer access when parsing unix sockets with oldformat

Arkadi Sharshevsky (7):
  devlink: Change netlink attribute validation
  devlink: Add support for pipeline debug (dpipe)
  bridge: Distinguish between externally learned vs offloaded FDBs
  devlink: Make match/action parsing more flexible
  devlink: Add support for special format protocol headers
  devlink: Add support for protocol IPv4/IPv6/Ethernet special formats
  devlink: Ignore unknown attributes

Asbjørn Sloth Tønnesen (2):
  testsuite: refactor kernel config search
  testsuite: search for kernel config in /boot

Baruch Siach (3):
  tc: add missing limits.h header
  ip: include libc headers first
  lib: fix multiple strlcpy definition

Benjamin LaHaise (2):
  f_flower: don't set TCA_FLOWER_KEY_ETH_TYPE for "protocol all"
  tc: flower: support for matching MPLS labels

Boris Pismenny (1):
  ip xfrm: Add xfrm state crypto offload

Casey Callendrello (1):
  netns: make /var/run/netns bind-mount recursive

Chris Mi (1):
  tc: fix command "tc actions del" hang issue

Christian Ehrhardt (2):
  tests: read limited amount from /dev/urandom
  tests: make sure rand_dev suffix has 6 chars

Christoph Paasch (1):
  ip: add fastopen_no_cookie option to ip route

Craig Gallek (2):
  gre6: fix copy/paste bugs in GREv6 attribute manipulation
  iplink: Expose IFLA_*_FWMARK attributes for supported link types

Cyrill Gorcunov (2):
  libnetlink: Add test for error code returned from netlink reply
  ss: Add inet raw sockets information gathering via netlink diag interface

Daniel Borkmann (19):
  bpf: make tc's bpf loader generic and move into lib
  bpf: check for owner_prog_type and notify users when differ
  bpf: add initial support for attaching xdp progs
  {f,m}_bpf: dump tag over insns
  bpf: test for valid type in bpf_get_work_dir
  bpf: add support for generic xdp
  bpf: update printing of generic xdp mode
  bpf: dump error to the user when retrieving pinned prog fails
  bpf: indicate lderr when bpf_apply_relo_data fails
  bpf: remove obsolete samples
  bpf: support loading map in map from obj
  bpf: dump id/jited info for cls/act programs
  bpf: improve error

xdp_router_ipv4 mellanox problem

2018-01-29 Thread Paweł Staszewski


Hi

Want to do some tests with xdp_router on two 100G physical interfaces but:

Jan 29 17:00:40 HOST kernel: mlx5_core :af:00.0: MLX5E: StrdRq(0) 
RqSz(1024) StrdSz(1) RxCqeCmprss(0)

Jan 29 17:00:40 HOST kernel: mlx5_core :af:00.0 enp175s0f0: Link up
Jan 29 17:00:41 HOST kernel: mlx5_core :af:00.1: MLX5E: StrdRq(0) 
RqSz(1024) StrdSz(1) RxCqeCmprss(0)

Jan 29 17:00:41 HOST kernel: mlx5_core :af:00.1 enp175s0f1: Link up
Jan 29 17:00:41 HOST kernel: [ cut here ]
Jan 29 17:00:41 HOST kernel: Driver unsupported XDP return value 4, 
expect packet loss!
Jan 29 17:00:41 HOST kernel: WARNING: CPU: 43 PID: 0 at 
net/core/filter.c:3901 bpf_warn_invalid_xdp_action+0x34/0x40

Jan 29 17:00:41 HOST kernel: Modules linked in: x86_pkg_temp_thermal ipmi_si
Jan 29 17:00:41 HOST kernel: CPU: 43 PID: 0 Comm: swapper/43 Not tainted 
4.15.0-rc9+ #1

Jan 29 17:00:41 HOST kernel: RIP: 0010:bpf_warn_invalid_xdp_action+0x34/0x40
Jan 29 17:00:41 HOST kernel: RSP: 0018:88087f9c3dc8 EFLAGS: 00010296
Jan 29 17:00:41 HOST kernel: RAX: 003a RBX: 88081ea38000 
RCX: 0006
Jan 29 17:00:41 HOST kernel: RDX: 0007 RSI: 0092 
RDI: 88087f9d53d0
Jan 29 17:00:41 HOST kernel: RBP: 88087f9c3e58 R08: 0001 
R09: 0536
Jan 29 17:00:41 HOST kernel: R10: 0004 R11: 0536 
R12: 8808304d3000
Jan 29 17:00:41 HOST kernel: R13: 02c0 R14: 88081e53c000 
R15: c907d000
Jan 29 17:00:41 HOST kernel: FS:  () 
GS:88087f9c() knlGS:
Jan 29 17:00:41 HOST kernel: CS:  0010 DS:  ES:  CR0: 
80050033
Jan 29 17:00:41 HOST kernel: CR2: 02038648 CR3: 0220a002 
CR4: 007606e0
Jan 29 17:00:41 HOST kernel: DR0:  DR1:  
DR2: 
Jan 29 17:00:41 HOST kernel: DR3:  DR6: fffe0ff0 
DR7: 0400

Jan 29 17:00:41 HOST kernel: PKRU: 5554
Jan 29 17:00:41 HOST kernel: Call Trace:
Jan 29 17:00:41 HOST kernel:  
Jan 29 17:00:41 HOST kernel:  mlx5e_handle_rx_cqe+0x279/0x900
Jan 29 17:00:41 HOST kernel:  mlx5e_poll_rx_cq+0xb3/0x860
Jan 29 17:00:41 HOST kernel:  mlx5e_napi_poll+0x81/0x6f0
Jan 29 17:00:41 HOST kernel:  ? mlx5_cq_completion+0x4d/0xb0
Jan 29 17:00:41 HOST kernel:  net_rx_action+0x1cd/0x2f0
Jan 29 17:00:41 HOST kernel:  __do_softirq+0xe4/0x275
Jan 29 17:00:41 HOST kernel:  irq_exit+0x6b/0x70
Jan 29 17:00:41 HOST kernel:  do_IRQ+0x45/0xc0
Jan 29 17:00:41 HOST kernel:  common_interrupt+0x95/0x95
Jan 29 17:00:41 HOST kernel:  
Jan 29 17:00:41 HOST kernel: RIP: 0010:mwait_idle+0x59/0x160
Jan 29 17:00:41 HOST kernel: RSP: 0018:c90003497ef8 EFLAGS: 0246 
ORIG_RAX: ffdd
Jan 29 17:00:41 HOST kernel: RAX:  RBX: 002b 
RCX: 
Jan 29 17:00:41 HOST kernel: RDX:  RSI:  
RDI: 
Jan 29 17:00:41 HOST kernel: RBP: 002b R08: 1000 
R09: 
Jan 29 17:00:41 HOST kernel: R10:  R11: 000100130e40 
R12: 88086d165000
Jan 29 17:00:41 HOST kernel: R13: 88086d165000 R14:  
R15: 

Jan 29 17:00:41 HOST kernel:  do_idle+0x14e/0x160
Jan 29 17:00:41 HOST kernel:  cpu_startup_entry+0x14/0x20
Jan 29 17:00:41 HOST kernel:  secondary_startup_64+0xa5/0xb0
Jan 29 17:00:41 HOST kernel: Code: c3 83 ff 04 48 c7 c0 1a cf 10 82 89 
fa c6 05 9a df b4 00 01 48 c7 c6 22 cf 10 82 48 c7 c7 38 cf 10 82 48 0f 
47 f0 e8 ec 19 8b ff <0f> ff c3 66 0f 1f 84 00 00 00 00 00 81 fe ff ff 
00 00 55 48 89

Jan 29 17:00:41 HOST kernel: ---[ end trace 2b255fac8d0824de ]---


I can attach xdp_router_ipv4 to any vlan interface without crash

./xdp_router_ipv4 vlan4032

**loading bpf file*


Attached to 8
***ROUTE TABLE*


NEW Route entry
Destination Gateway Genmask Metric  Iface
192.168.32.0  0 24 0   vlan4032
***ARP TABLE***


Address HwAddress
7920a8c0    8da6fb902500
120a8c0 44fc9e0c5e4c



But after attaching to physical interface there is "above trace".



Thanks

Paweł

Re: [PATCH iproute2] police: don't skip parameters after actions

2018-01-29 Thread Stephen Hemminger

On Mon, 29 Jan 2018 12:13:11 +0100
Wolfgang Bumiller  wrote:

> The 'parse_action_control()' helper advances the argument
> pointers to past its parsed action already, so don't
> advance it further in 'act_parse_polic()'.
> 
> Fixes: e67aba559581 ("tc: actions: add helpers to parse and print control 
> actions")
> Signed-off-by: Wolfgang Bumiller 
> ---
> Basically parse_action_control() silently added a NEXT_ARG() while the
> cases before didn't have one. Not sure whether the goto is okay
> style-wise, let me know if you prefer some other solution.
> 
> Example for triggering this:
> Specifying a 'flowid X' after a `police ... drop` will skip the 'flowid'
> and error with "What is X"
> 
> $ tc filter add dev eth0 parent : basic police rate 13371337bps burst 
> 1337b mtu 64kb drop flowid :1
> What is ":1"?

Thank you for the patch. It is a real problem, and your patch addresses it.
I just don't like jumping around in the the argument parsing with goto's.
There was a similar problem recently, and the better fix was to fix the 
semantics
of the parsing function to not  do the extra implicit NEXT_ARG in the parsing 
logic.
There is less likely to be future problems if all parsing functions leave the
with the same argument location.

Please try that and resubmit.

Re: iproute2 4.14.1 tc class add come to kernel-panic

2018-01-29 Thread Stephen Hemminger

On Mon, 29 Jan 2018 16:18:07 +0100
"Roland Franke"  wrote:

> Hello,
> 
> > To: Roland Franke ; netdev@vger.kernel.org 
> > Subject: Re: BUG: iproute2 4.14.1 tc class add come to kernel-panic   
> >> 
> >> tc qdisc add dev eth0 root handle 20: htb default 4 r2q 1
> >> tc class add dev eth0 parent 20: classid 20:7 htb rate 1kbit
> >> tc qdisc add dev eth0 parent 20:7 sfq perturb 10
> >> tc class add dev eth0 parent 20:7 classid 20:1 htb rate 200kbit ceil 
> >> 1kbit prio 0
> >> 
> >> I become an Kernel-panic with the following output:
> >> kern.err kernel: BUG: scheduling while atomic: tc/1036/0x0200  
> >   
> 
> > Would you have a stack trace to share with us ?  
> 
> As i will be an absolute newby here, i will not know how to 
> get the stack trace out.
> When i will get some information how to get this, i can try to 
> give you this information.
> But by my last tests i made the first 3 commands on an console
> and had no error. Only by typing the last line i will get the error and
> here i get actally only the "kern.err kernel: BUG: ." message.
> 
> Roland

It generates this with lockdep (on 4.15)

[  151.355076] HTB: quantum of class 27 is big. Consider r2q change.
[  151.371495] BUG: sleeping function called from invalid context at 
kernel/locking/mutex.c:747
[  151.371815] in_atomic(): 1, irqs_disabled(): 0, pid: 1135, name: tc
[  151.371875] 2 locks held by tc/1135:
[  151.371878]  #0:  (rtnl_mutex){+.+.}, at: [] 
rtnetlink_rcv_msg+0x23b/0x5f0
[  151.371899]  #1:  (_tx_lock){+...}, at: [] 
htb_change_class+0x55f/0x880 [sch_htb]
[  151.371918] CPU: 2 PID: 1135 Comm: tc Not tainted 4.14.15 #2
[  151.371921] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1 04/01/2014
[  151.371924] Call Trace:
[  151.371934]  dump_stack+0x7c/0xbe
[  151.371944]  ___might_sleep+0x21e/0x250
[  151.371953]  __mutex_lock+0x59/0x9a0
[  151.371960]  ? lock_acquire+0xec/0x1e0
[  151.371973]  ? __raw_spin_lock_init+0x1c/0x50
[  151.371990]  ? _rcu_barrier+0x2f/0x170
[  151.371995]  ? __lockdep_init_map+0x5c/0x1d0
[  151.371998]  _rcu_barrier+0x2f/0x170
[  151.372006]  tcf_block_put+0x7f/0xa0
[  151.372013]  sfq_destroy+0x15/0x50 [sch_sfq]
[  151.372021]  qdisc_destroy+0x6c/0xe0
[  151.372028]  htb_change_class+0x704/0x880 [sch_htb]
[  151.372050]  tc_ctl_tclass+0x193/0x3c0
[  151.372075]  rtnetlink_rcv_msg+0x270/0x5f0
[  151.372088]  ? rtnetlink_put_metrics+0x1c0/0x1c0
[  151.372096]  netlink_rcv_skb+0xde/0x110
[  151.372109]  netlink_unicast+0x1e4/0x310
[  151.372118]  netlink_sendmsg+0x2dc/0x3c0
[  151.372136]  sock_sendmsg+0x30/0x40
[  151.372142]  ___sys_sendmsg+0x2b9/0x2d0
[  151.372171]  ? __handle_mm_fault+0x7f8/0x1120
[  151.372189]  ? __do_page_fault+0x2a5/0x530
[  151.372200]  ? __sys_sendmsg+0x51/0x90
[  151.372205]  __sys_sendmsg+0x51/0x90
[  151.372225]  entry_SYSCALL_64_fastpath+0x29/0xa0
[  151.372230] RIP: 0033:0x7f7049397490
[  151.372233] RSP: 002b:7ffd0c432f48 EFLAGS: 0246
[  151.372445] BUG: scheduling while atomic: tc/1135/0x0202
[  151.372524] 3 locks held by tc/1135:
[  151.372527]  #0:  (rtnl_mutex){+.+.}, at: [] 
rtnetlink_rcv_msg+0x23b/0x5f0
[  151.372544]  #1:  (_tx_lock){+...}, at: [] 
htb_change_class+0x55f/0x880 [sch_htb]
[  151.372560]  #2:  (rcu_sched_state.barrier_mutex){+.+.}, at: 
[] _rcu_barrier+0x2f/0x170
[  151.372575] Modules linked in: sch_sfq sch_htb ppdev input_leds serio_raw 
joydev parport_pc parport i2c_piix4 mac_hid ib_iser rdma_cm iw_cm ib_cm ib_core 
iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs 
zstd_decompress zstd_compress xxhash raid10 raid456 libcrc32c async_raid6_recov 
async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath 
linear qxl drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm 
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 
crypto_simd cryptd glue_helper psmouse floppy pata_acpi hid_generic usbhid hid
[  151.372748] CPU: 2 PID: 1135 Comm: tc Tainted: GW   4.14.15 #2
[  151.372751] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1 04/01/2014
[  151.372753] Call Trace:
[  151.372759]  dump_stack+0x7c/0xbe
[  151.372767]  __schedule_bug+0x62/0x90
[  151.372772]  __schedule+0x7d9/0xb80
[  151.372788]  schedule+0x39/0x90
[  151.372793]  schedule_timeout+0x24d/0x5c0
[  151.372813]  ? mark_held_locks+0x71/0xa0
[  151.372825]  ? wait_for_completion+0x12e/0x1a0
[  151.372829]  wait_for_completion+0x12e/0x1a0
[  151.372835]  ? wake_up_q+0x60/0x60
[  151.372846]  _rcu_barrier+0x11a/0x170
[  151.372853]  tcf_block_put+0x7f/0xa0
[  151.372860]  sfq_destroy+0x15/0x50 [sch_sfq]
[  151.372866]  qdisc_destroy+0x6c/0xe0
[  151.372872]  htb_change_class+0x704/0x880 [sch_htb]
[  151.372894]  tc_ctl_tclass+0x193/0x3c0
[  151.372918]  rtnetlink_rcv_msg+0x270/0x5f0
[  151.372932]  ? rtnetlink_put_metrics+0x1c0/0x1c0
[  151.372940]  netlink_rcv_skb+0xde/0x110
[  151.372952]

sctp netns "unregister_netdevice: waiting for lo to become free. Usage count = 1"

2018-01-29 Thread Tommi Rantala


Hi,

When running sctp_test from lksctp-tools in netns in 4.4 and 4.9 with 
suitable arguments, the local loopback device in the netns is not 
getting destroyed after deleting the netns.


For example:

ip netns add TEST
ip netns exec TEST ip link set lo up
ip link add dummy0 type dummy
ip link add dummy1 type dummy
ip link add dummy2 type dummy
ip link set dev dummy0 netns TEST
ip link set dev dummy1 netns TEST
ip link set dev dummy2 netns TEST
ip netns exec TEST ip addr add 192.168.1.1/24 dev dummy0
ip netns exec TEST ip link set dummy0 up
ip netns exec TEST ip addr add 192.168.1.2/24 dev dummy1
ip netns exec TEST ip link set dummy1 up
ip netns exec TEST ip addr add 192.168.1.3/24 dev dummy2
ip netns exec TEST ip link set dummy2 up
ip netns exec TEST sctp_test -H 192.168.1.2 -P 20002 -h 192.168.1.1 -p 
2 -s -B 192.168.1.3

ip netns del TEST

Results to:

[  354.179591] unregister_netdevice: waiting for lo to become free. 
Usage count = 1
[  364.419674] unregister_netdevice: waiting for lo to become free. 
Usage count = 1
[  374.663664] unregister_netdevice: waiting for lo to become free. 
Usage count = 1
[  384.903717] unregister_netdevice: waiting for lo to become free. 
Usage count = 1
[  395.143724] unregister_netdevice: waiting for lo to become free. 
Usage count = 1
[  405.383645] unregister_netdevice: waiting for lo to become free. 
Usage count = 1

...

Based on a quick test, 4.14 and 4.15 does not suffer from this, but its 
reproducible e.g. in 4.4.113 and 4.9.75


Any ideas?

Tommi

Re: [PATCH bpf-next 07/13] bpf, s390x: remove obsolete exception handling from div/mod

2018-01-29 Thread Daniel Borkmann

On 01/29/2018 03:33 PM, Michael Holzheu wrote:
> Am Fri, 26 Jan 2018 23:33:42 +0100
> schrieb Daniel Borkmann :
> 
>> Since we've changed div/mod exception handling for src_reg in
>> eBPF verifier itself, 
> 
> Maybe add the commit that introduced that to the patch description?

I couldn't add it here since it was all part of the same series and
thus didn't have a stable commit id for future reference yet. Maybe
I should have just put the commit subject in this case; will do next
time.

>> remove the leftovers from s390x JIT.
>>
>> Signed-off-by: Daniel Borkmann 
>> Cc: Michael Holzheu 
>> ---
>>  arch/s390/net/bpf_jit_comp.c | 10 --
>>  1 file changed, 10 deletions(-)
>>
>> diff --git a/arch/s390/net/bpf_jit_comp.c b/arch/s390/net/bpf_jit_comp.c
>> index e501887..78a19c9 100644
>> --- a/arch/s390/net/bpf_jit_comp.c
>> +++ b/arch/s390/net/bpf_jit_comp.c
>> @@ -610,11 +610,6 @@ static noinline int bpf_jit_insn(struct bpf_jit *jit, 
>> struct bpf_prog *fp, int i
>>  {
>>  int rc_reg = BPF_OP(insn->code) == BPF_DIV ? REG_W1 : REG_W0;
>>
>> -jit->seen |= SEEN_RET0;
>> -/* ltr %src,%src (if src == 0 goto fail) */
>> -EMIT2(0x1200, src_reg, src_reg);
>> -/* jz  */
>> -EMIT4_PCREL(0xa784, jit->ret0_ip - jit->prg);
>>  /* lhi %w0,0 */
>>  EMIT4_IMM(0xa708, REG_W0, 0);
>>  /* lr %w1,%dst */
>> @@ -630,11 +625,6 @@ static noinline int bpf_jit_insn(struct bpf_jit *jit, 
>> struct bpf_prog *fp, int i
>>  {
>>  int rc_reg = BPF_OP(insn->code) == BPF_DIV ? REG_W1 : REG_W0;
>>
>> -jit->seen |= SEEN_RET0;
>> -/* ltgr %src,%src (if src == 0 goto fail) */
>> -EMIT4(0xb902, src_reg, src_reg);
>> -/* jz  */
>> -EMIT4_PCREL(0xa784, jit->ret0_ip - jit->prg);
>>  /* lghi %w0,0 */
>>  EMIT4_IMM(0xa709, REG_W0, 0);
>>  /* lgr %w1,%dst */
> 
> If the check is done in the verifier now, this looks good to me.
> 
> Reviewed-by: Michael Holzheu 

Thanks for the review, Michael!

Re: [PATCH 2/3] Revert "e1000e: Separate signaling for link check/link up"

2018-01-29 Thread Alexander Duyck

On Sun, Jan 28, 2018 at 11:21 PM, Benjamin Poirier  wrote:
> On 2018/01/26 09:03, Alexander Duyck wrote:
>> On Fri, Jan 26, 2018 at 1:12 AM, Benjamin Poirier  wrote:
>> > This reverts commit 19110cfbb34d4af0cdfe14cd243f3b09dc95b013.
>> > This reverts commit 4110e02eb45ea447ec6f5459c9934de0a273fb91.
>> >
>> > ... because they cause an extra 2s delay for the link to come up when
>> > autoneg is off.
>> >
>> > After reverting, the race condition described in the log of commit
>> > 19110cfbb34d ("e1000e: Separate signaling for link check/link up") is
>> > reintroduced. It may still be triggered by LSC events but this should not
>> > result in link flap. It may no longer be triggered by RXO events because
>> > commit 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
>> > restored reading icr in the Other handler.
>>
>> With the RXO events removed the only cause for us to transition the
>> bit should be LSC. I'm not sure if the race condition in that state is
>> a valid concern or not as the LSC should only get triggered if the
>> link state toggled, even briefly.
>>
>> The bigger concern I would have would be the opposite of the original
>> race that was pointed out:
>> \ e1000_watchdog_task
>> \ e1000e_has_link
>> \ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
>> /* link is up */
>> mac->get_link_status = false;
>>
>> /* interrupt */
>> \ e1000_msix_other
>> hw->mac.get_link_status = true;
>>
>> link_active = !hw->mac.get_link_status
>> /* link_active is false, wrongly */

 \ e1000_watchdog_task
 \ e1000e_has_link
 \ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
 /* interrupt */
 \ e1000_msix_other
 hw->mac.get_link_status = true;

/* link is up */
 mac->get_link_status = false;

 link_active = !hw->mac.get_link_status
 /* link_active is true, wrongly */

So basically the idea would be that the interrupt fires after we check
for link, but before we have updated get_link_status. It should be a
pretty narrow window and I don't know if it will be much of an issue.

>> So the question I would have is what if we see the LSC for a link down
>> just after the check_for_copper_link call completes? It may not be
>
> Can you write out exactly what that race would be, in a format similar to the
> above?

I just did the copy/paste/edit above if that works. Hopefully that
makes it clearer.

>> anything seen in the real world since I don't know if we have any link
>> flapping issues on e1000e or not without this patch. It is something
>> to keep in mind for the future though.
>>
>>
>> > As discussed, the driver should be in "maintenance mode". In the interest
>> > of stability, revert to the original code as much as possible instead of a
>> > half-baked solution.
>>
>> If nothing else we may want to do a follow-up on this patch as we
>> probably shouldn't be returning the error values to trigger link up.
>> There are definitely issues to be found here. If nothing else we may
>> want to explore just returning 1 if auto-neg is disabled instead of
>> returning an error code.
>>
>> > Link: https://www.spinics.net/lists/netdev/msg479923.html
>> > Signed-off-by: Benjamin Poirier 
> [...]

Re: [PATCH 1/3] Partial revert "e1000e: Avoid receiver overrun interrupt bursts"

2018-01-29 Thread Alexander Duyck

On Sun, Jan 28, 2018 at 11:20 PM, Benjamin Poirier  wrote:
> On 2018/01/26 08:50, Alexander Duyck wrote:
>> On Fri, Jan 26, 2018 at 1:12 AM, Benjamin Poirier  wrote:
>> > This reverts commit 4aea7a5c5e940c1723add439f4088844cd26196d.
>> >
>> > We keep the fix for the first part of the problem (1) described in the log
>> > of that commit however we use the implementation of e1000_msix_other() from
>> > before commit 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt").
>> > We remove the fix for the second part of the problem (2).
>> >
>> > Bursts of "Other" interrupts may once again occur during rxo (receive
>> > overflow) traffic conditions. This is deemed acceptable in the interest of
>> > reverting driver code back to its previous state. The e1000e driver should
>> > be in "maintenance" mode and we want to avoid unforeseen fallout from
>> > changes that are not strictly necessary.
>> >
>> > Link: https://www.spinics.net/lists/netdev/msg480675.html
>> > Signed-off-by: Benjamin Poirier 
>>
>> Thanks for doing this.
>>
>> Only a few minor tweaks I would recommend. Otherwise it looks good.
>>
>> > ---
>> >  drivers/net/ethernet/intel/e1000e/defines.h |  1 -
>> >  drivers/net/ethernet/intel/e1000e/netdev.c  | 32 
>> > +++--
>> >  2 files changed, 12 insertions(+), 21 deletions(-)
>> >
>> > diff --git a/drivers/net/ethernet/intel/e1000e/defines.h 
>> > b/drivers/net/ethernet/intel/e1000e/defines.h
>> > index afb7ebe20b24..0641c0098738 100644
>> > --- a/drivers/net/ethernet/intel/e1000e/defines.h
>> > +++ b/drivers/net/ethernet/intel/e1000e/defines.h
>> > @@ -398,7 +398,6 @@
>> >  #define E1000_ICR_LSC   0x0004 /* Link Status Change */
>> >  #define E1000_ICR_RXSEQ 0x0008 /* Rx sequence error */
>> >  #define E1000_ICR_RXDMT00x0010 /* Rx desc min. threshold (0) 
>> > */
>> > -#define E1000_ICR_RXO   0x0040 /* Receiver Overrun */
>> >  #define E1000_ICR_RXT0  0x0080 /* Rx timer intr (ring 0) */
>> >  #define E1000_ICR_ECCER 0x0040 /* Uncorrectable ECC Error */
>> >  /* If this bit asserted, the driver should claim the interrupt */
>> > diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
>> > b/drivers/net/ethernet/intel/e1000e/netdev.c
>> > index 9f18d39bdc8f..398e940436f8 100644
>> > --- a/drivers/net/ethernet/intel/e1000e/netdev.c
>> > +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
>> > @@ -1914,30 +1914,23 @@ static irqreturn_t e1000_msix_other(int 
>> > __always_unused irq, void *data)
>> > struct net_device *netdev = data;
>> > struct e1000_adapter *adapter = netdev_priv(netdev);
>> > struct e1000_hw *hw = >hw;
>> > -   u32 icr;
>> > -   bool enable = true;
>> > -
>> > -   icr = er32(ICR);
>> > -   if (icr & E1000_ICR_RXO) {
>> > -   ew32(ICR, E1000_ICR_RXO);
>> > -   enable = false;
>> > -   /* napi poll will re-enable Other, make sure it runs */
>> > -   if (napi_schedule_prep(>napi)) {
>> > -   adapter->total_rx_bytes = 0;
>> > -   adapter->total_rx_packets = 0;
>> > -   __napi_schedule(>napi);
>> > -   }
>> > -   }
>> > -   if (icr & E1000_ICR_LSC) {
>> > -   ew32(ICR, E1000_ICR_LSC);
>> > +   u32 icr = er32(ICR);
>> > +
>> > +   if (icr & adapter->eiac_mask)
>> > +   ew32(ICS, (icr & adapter->eiac_mask));
>>
>> I'm not sure about this bit as it includes queue interrupts if I am
>> not mistaken. What we should be focusing on are the bits that are not
>> auto-cleared so this should probably be icr & ~adapter->eiac_mask.
>> Actually what you might do is something like:
>> icr &= ~adapter->eiac_mask;
>> if (icr)
>> ew32(ICS, icr);
>>
>> Also a comment explaining that we have to clear the bits that are not
>> automatically cleared by other interrupt causes might be useful.
>
> I've re-read your comment many times and I think you misunderstood what
> that code is doing.

You are right. I did.

> This:
>> > +   if (icr & adapter->eiac_mask)
>> > +   ew32(ICS, (icr & adapter->eiac_mask));
>
> re-raises the queue interrupts in case the txq or rxq bits were set in
> icr and the Other interrupt handler read and cleared icr before the
> queue interrupt was raised. It's not about "clear the bits that are not
> automatically cleared". It's a write to ICS, not ICR.

Actually this raises other possible issues. Now I am wondering if the
actual issue is the fact that we are reading the ICR at all in MSI-X
mode with EIAME enabled. For now I guess we need it since reading the
ICR could potentially be causing the events to be cleared.

There is actually HW Errata 12 that we need to take into account as
well, the document is available at:

Re: [PATCH v3 bpf] bpf: introduce BPF_JIT_ALWAYS_ON config

2018-01-29 Thread Daniel Borkmann

On 01/29/2018 12:40 AM, Daniel Borkmann wrote:
> On 01/28/2018 03:45 PM, Greg KH wrote:
>> On Wed, Jan 24, 2018 at 11:10:50AM +0100, Daniel Borkmann wrote:
>>> On 01/24/2018 11:07 AM, David Woodhouse wrote:
 On Tue, 2018-01-09 at 22:39 +0100, Daniel Borkmann wrote:
> On 01/09/2018 07:04 PM, Alexei Starovoitov wrote:
[...]
> Applied to bpf tree, thanks Alexei!

 For stable too?
>>>
>>> Yes, this will go into stable as well; batch of backports will come 
>>> Thurs/Fri.
>>
>> Any word on these?  Worse case, a simple list of git commit ids to
>> backport is all I need.
> 
> Sorry for the delay! There are various conflicts all over the place, so I had
> to backport manually. I just flushed out tested 4.14 batch, I'll see to get 
> 4.9
> out hopefully tonight as well, and the rest for 4.4 on Mon.

While 4.14 and 4.9 BPF backports are tested and out since yesterday, and I
saw Greg queued them up (thanks!), it looks like plain 4.4.113 doesn't even
boot on my machine. While I can shortly see the kernel log, my screen turns
black shortly thereafter and nothing reacts anymore. No such problems with
4.9 and 4.14 stables seen. (using x86_64, i7-6600U) Is this a known issue?

Thanks,
Daniel

RE: r8169 take too long to complete driver initialization

2018-01-29 Thread Hau

Hi Chris,

Could you test following patch?

 DECLARE_RTL_COND(rtl_ocp_tx_cond)
 {
void __iomem *ioaddr = tp->mmio_addr;
 
-   return RTL_R8(IBISR0) & 0x02;
+   return RTL_R8(IBISR0) & 0x20;
 }
 
 static void rtl8168ep_stop_cmac(struct rtl8169_private *tp)
 {
void __iomem *ioaddr = tp->mmio_addr;
 
RTL_W8(IBCR2, RTL_R8(IBCR2) & ~0x01);
-   rtl_msleep_loop_wait_low(tp, _ocp_tx_cond, 50, 2000);
+   rtl_msleep_loop_wait_high(tp, _ocp_tx_cond, 50, 2000);
RTL_W8(IBISR0, RTL_R8(IBISR0) | 0x20);
RTL_W8(IBCR0, RTL_R8(IBCR0) & ~0x01);
 }

Thanks.

--Please consider the environment before printing this e-mail.

> -Original Message-
> From: Chris Chiu [mailto:c...@endlessm.com]
> Sent: Monday, January 29, 2018 6:12 PM
> To: nic_swsd ; netdev@vger.kernel.org; Linux
> Kernel ; Linux Upstreaming Team
> 
> Subject: Re: r8169 take too long to complete driver initialization
> 
> On Fri, Jan 5, 2018 at 10:17 AM, Chris Chiu  wrote:
> > On Wed, Dec 20, 2017 at 4:41 PM, Chris Chiu  wrote:
> >> Hi,
> >> We've hit a suspend/resume issue on a Acer desktop caused by
> >> r8169 driver. The dmseg
> >> https://gist.github.com/mschiu77/b741849b5070281daaead8dfee312d1a
> >> shows it's still in msleep() within a mutex lock.
> >> After looking into the code, it's caused by the
> >> rtl8168ep_stop_cmac() which is waiting 100 seconds for
> >> rtl_ocp_tx_cond. The following dmesg states that the r8169 driver is
> >> loaded.
> >>
> >> [   20.270526] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
> >>
> >> But it takes > 100 seconds to get the following messages
> >>
> >> [  140.400223] r8169 :02:00.0 (unnamed net_device)
> >> (uninitialized): rtl_ocp_tx_cond == 1 (loop: 2000, delay: 50).
> >> [  140.413294] r8169 :02:00.0 eth0: RTL8168ep/8111ep at
> >> 0xb16c80db1000, f8:0f:41:ea:74:0d, XID 10200800 IRQ 46 [
> >> 140.413297] r8169 :02:00.0 eth0: jumbo features [frames: 9200
> >> bytes, tx checksumming: ko]
> >>
> >> So any trial to suspend the machine during this period would always
> >> get device/resource busy message then abort. Is this  rtl_ocp_tx_cond
> >> necessary? Because the ethernet is still working and I don't see any
> >> problem. I don't know it should be considered normal or not. Please
> >> let me know if any more information required. Thanks
> >>
> >> Chris
> >
> > gentle ping,
> >
> > cheers.
> 
> Hi,
> Just found a r8168 driver which seems to be authrized by realtek for cross
> comparison. I tried applying the patch to latest 4.15 kernel and the driver
> done it's initialization in faily short time. The patch file is here
> https://gist.github.com/mschiu77/fcf406e64a1a437f46cf2be643f1057d.
> 
> In mainline r8169.c, the IBISR0 register need to be polled in the
> rtl8168ep_stop_cmac().
> In the patch file, there's also the same IBISR0 polling code in
> Dash2DisableTx(), but it's been bypassed since the chipset maches
> HW_DASH_SUPPORT_TYPE_2.
> Per the rtl_chip_info[] in r8168_n.c, CFG_METHOD_23/27/28 are
> HW_DASH_SUPPORT_TYPE_2, and they happens to be the only 3 named
> RTL8168EP/8111EP in the rtl_chip_info[].
> 
> To find the same matches in r8169.c, RTL_GIGA_MAC_VER_49/50/51
> seems share the same config. Can anyone clarify if the rtl_ocp_tx_cond()
> really necessary for 8168EP/8111EP?
> Or we can just ignore the condition check for RTL_GIGA_MAC_VER_49/50/51?
> 
> Chris
> 
> --Please consider the environment before printing this e-mail.

Re: iproute2 4.14.1 tc class add come to kernel-panic

2018-01-29 Thread Roland Franke

Hello,

To: Roland Franke ; netdev@vger.kernel.org 
Subject: Re: BUG: iproute2 4.14.1 tc class add come to kernel-panic 

tc qdisc add dev eth0 root handle 20: htb default 4 r2q 1
tc class add dev eth0 parent 20: classid 20:7 htb rate 1kbit
tc qdisc add dev eth0 parent 20:7 sfq perturb 10
tc class add dev eth0 parent 20:7 classid 20:1 htb rate 200kbit ceil 
1kbit prio 0

I become an Kernel-panic with the following output:
kern.err kernel: BUG: scheduling while atomic: tc/1036/0x0200

Would you have a stack trace to share with us ?

As i will be an absolute newby here, i will not know how to 
get the stack trace out.
When i will get some information how to get this, i can try to 
give you this information.

But by my last tests i made the first 3 commands on an console
and had no error. Only by typing the last line i will get the error and
here i get actally only the "kern.err kernel: BUG: ." message.

Roland

Re: BUG: iproute2 4.14.1 tc class add come to kernel-panic

2018-01-29 Thread Eric Dumazet

On Mon, 2018-01-29 at 15:51 +0100, Roland Franke wrote:
> Starting with Kernel 4.14.1 (x86_64) I found the following problem:
> When I made the following commands:
> 
> tc qdisc add dev eth0 root handle 20: htb default 4 r2q 1
> tc class add dev eth0 parent 20: classid 20:7 htb rate 1kbit
> tc qdisc add dev eth0 parent 20:7 sfq perturb 10
> tc class add dev eth0 parent 20:7 classid 20:1 htb rate 200kbit ceil 
> 1kbit prio 0
> 
> I become an Kernel-panic with the following output:
> kern.err kernel: BUG: scheduling while atomic: tc/1036/0x0200
> 
> The identical message come, when i try the kernel till 4.14.15.
> 
> When i have tested the same with one of the kernels from the 4.13 series,
> i had no problem.Will this be solved in the kernel-series 4.14 or will this 
> only with the
> coming series 4.15 be made?
> 
> Kernel 4.15 is not be tested till now.
> 
> Best regards,
> Roland Franke 
> 

Hi Roland

Would you have a stack trace to share with us ?

Thanks !

BUG: iproute2 4.14.1 tc class add come to kernel-panic

2018-01-29 Thread Roland Franke


Starting with Kernel 4.14.1 (x86_64) I found the following problem:
When I made the following commands:

tc qdisc add dev eth0 root handle 20: htb default 4 r2q 1
tc class add dev eth0 parent 20: classid 20:7 htb rate 1kbit
tc qdisc add dev eth0 parent 20:7 sfq perturb 10
tc class add dev eth0 parent 20:7 classid 20:1 htb rate 200kbit ceil 
1kbit prio 0


I become an Kernel-panic with the following output:
kern.err kernel: BUG: scheduling while atomic: tc/1036/0x0200

The identical message come, when i try the kernel till 4.14.15.

When i have tested the same with one of the kernels from the 4.13 series,
i had no problem.Will this be solved in the kernel-series 4.14 or will this 
only with the

coming series 4.15 be made?

Kernel 4.15 is not be tested till now.

Best regards,
Roland Franke

Re: [PATCH bpf-next 07/13] bpf, s390x: remove obsolete exception handling from div/mod

2018-01-29 Thread Michael Holzheu

Am Fri, 26 Jan 2018 23:33:42 +0100
schrieb Daniel Borkmann :

> Since we've changed div/mod exception handling for src_reg in
> eBPF verifier itself, 

Maybe add the commit that introduced that to the patch description?

> remove the leftovers from s390x JIT.
> 
> Signed-off-by: Daniel Borkmann 
> Cc: Michael Holzheu 
> ---
>  arch/s390/net/bpf_jit_comp.c | 10 --
>  1 file changed, 10 deletions(-)
> 
> diff --git a/arch/s390/net/bpf_jit_comp.c b/arch/s390/net/bpf_jit_comp.c
> index e501887..78a19c9 100644
> --- a/arch/s390/net/bpf_jit_comp.c
> +++ b/arch/s390/net/bpf_jit_comp.c
> @@ -610,11 +610,6 @@ static noinline int bpf_jit_insn(struct bpf_jit *jit, 
> struct bpf_prog *fp, int i
>   {
>   int rc_reg = BPF_OP(insn->code) == BPF_DIV ? REG_W1 : REG_W0;
> 
> - jit->seen |= SEEN_RET0;
> - /* ltr %src,%src (if src == 0 goto fail) */
> - EMIT2(0x1200, src_reg, src_reg);
> - /* jz  */
> - EMIT4_PCREL(0xa784, jit->ret0_ip - jit->prg);
>   /* lhi %w0,0 */
>   EMIT4_IMM(0xa708, REG_W0, 0);
>   /* lr %w1,%dst */
> @@ -630,11 +625,6 @@ static noinline int bpf_jit_insn(struct bpf_jit *jit, 
> struct bpf_prog *fp, int i
>   {
>   int rc_reg = BPF_OP(insn->code) == BPF_DIV ? REG_W1 : REG_W0;
> 
> - jit->seen |= SEEN_RET0;
> - /* ltgr %src,%src (if src == 0 goto fail) */
> - EMIT4(0xb902, src_reg, src_reg);
> - /* jz  */
> - EMIT4_PCREL(0xa784, jit->ret0_ip - jit->prg);
>   /* lghi %w0,0 */
>   EMIT4_IMM(0xa709, REG_W0, 0);
>   /* lgr %w1,%dst */

If the check is done in the verifier now, this looks good to me.

Reviewed-by: Michael Holzheu

Re: [PATCH stable 4.9 1/8] x86: bpf_jit: small optimization in emit_bpf_tail_call()

2018-01-29 Thread Willy Tarreau

Hi Eric,

On Mon, Jan 29, 2018 at 06:04:30AM -0800, Eric Dumazet wrote:
> > If these 4 bytes matter, why not use
> > cmpq with an immediate value instead, which saves 2 extra bytes ? :
> >
> >   - the mov above is 11 bytes total :
> >
> >0:   48 8b 84 d6 78 56 34mov0x12345678(%rsi,%rdx,8),%rax
> >7:   12
> >8:   48 85 c0test   %rax,%rax
> >
> >   - the equivalent cmp is only 9 bytes :
> >
> >0:   48 83 bc d6 78 56 34cmpq   $0x0,0x12345678(%rsi,%rdx,8)
> >7:   12 00
> >
> > And as a bonus, it doesn't even clobber rax.
> >
> > Just my two cents,
> 
> 
> Hi Willy
> 
> Please look more closely at following instructions.
> 
> We need the value later, not only testing it being zero :)

Ah OK that makes total sense then ;-)

Thanks,
willy

Re: [PATCH stable 4.9 1/8] x86: bpf_jit: small optimization in emit_bpf_tail_call()

2018-01-29 Thread Eric Dumazet

On Sun, Jan 28, 2018 at 10:39 PM, Willy Tarreau  wrote:
> Hi,
>
> [ replaced stable@ and greg@ by netdev@ as my question below is not
>   relevant to stable ]
>
> On Mon, Jan 29, 2018 at 02:48:54AM +0100, Daniel Borkmann wrote:
>> From: Eric Dumazet 
>>
>> [ upstream commit 84ccac6e7854ebbfb56d2fc6d5bef9be49bb304c ]
>>
>> Saves 4 bytes replacing following instructions :
>>
>> lea rax, [rsi + rdx * 8 + offsetof(...)]
>> mov rax, qword ptr [rax]
>> cmp rax, 0
>>
>> by :
>>
>> mov rax, [rsi + rdx * 8 + offsetof(...)]
>> test rax, rax
>
> I've just noticed this on stable@. If these 4 bytes matter, why not use
> cmpq with an immediate value instead, which saves 2 extra bytes ? :
>
>   - the mov above is 11 bytes total :
>
>0:   48 8b 84 d6 78 56 34mov0x12345678(%rsi,%rdx,8),%rax
>7:   12
>8:   48 85 c0test   %rax,%rax
>
>   - the equivalent cmp is only 9 bytes :
>
>0:   48 83 bc d6 78 56 34cmpq   $0x0,0x12345678(%rsi,%rdx,8)
>7:   12 00
>
> And as a bonus, it doesn't even clobber rax.
>
> Just my two cents,


Hi Willy

Please look more closely at following instructions.

We need the value later, not only testing it being zero :)

Re: [4.15-rc9] fs_reclaim lockdep trace

2018-01-29 Thread Peter Zijlstra

On Mon, Jan 29, 2018 at 08:47:20PM +0900, Tetsuo Handa wrote:
> Peter Zijlstra wrote:
> > On Sun, Jan 28, 2018 at 02:55:28PM +0900, Tetsuo Handa wrote:
> > > This warning seems to be caused by commit d92a8cfcb37ecd13
> > > ("locking/lockdep: Rework FS_RECLAIM annotation") which moved the
> > > location of
> > > 
> > >   /* this guy won't enter reclaim */
> > >   if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
> > >   return false;
> > > 
> > > check added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context
> > > (__GFP_NOFS)").
> > 
> > I'm not entirly sure I get what you mean here. How did I move it? It was
> > part of lockdep_trace_alloc(), if __GFP_NOMEMALLOC was set, it would not
> > mark the lock as held.
> 
> d92a8cfcb37ecd13 replaced lockdep_set_current_reclaim_state() with
> fs_reclaim_acquire(), and removed current->lockdep_recursion handling.
> 
> --
> # git show d92a8cfcb37ecd13 | grep recursion
> -# define INIT_LOCKDEP  .lockdep_recursion = 0, 
> .lockdep_reclaim_gfp = 0,
> +# define INIT_LOCKDEP  .lockdep_recursion = 0,
> unsigned intlockdep_recursion;
> -   if (unlikely(current->lockdep_recursion))
> -   current->lockdep_recursion = 1;
> -   current->lockdep_recursion = 0;
> -* context checking code. This tests GFP_FS recursion (a lock taken
> --

That should not matter at all. The only case that would matter for is if
lockdep itself would ever call into lockdep again. Not something that
happens here.

> > The new code has it in fs_reclaim_acquire/release to the same effect, if
> > __GFP_NOMEMALLOC, we'll not acquire/release the lock.
> 
> Excuse me, but I can't catch.
> We currently acquire/release __fs_reclaim_map if __GFP_NOMEMALLOC.

Right, got the case inverted, same difference though. Before we'd do
mark_held_lock(), now we do acquire/release under the same conditions.

> > > Since __kmalloc_reserve() from __alloc_skb() adds
> > > __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, __need_fs_reclaim() is
> > > failing to return false despite PF_MEMALLOC context (and resulted in
> > > lockdep warning).
> > 
> > But that's correct right, __GFP_NOMEMALLOC should negate PF_MEMALLOC.
> > That's what the name says.
> 
> __GFP_NOMEMALLOC negates PF_MEMALLOC regarding what watermark that allocation
> request should use.

Right.

> But at the same time, PF_MEMALLOC negates __GFP_DIRECT_RECLAIM.

Ah indeed.

> Then, how can fs_reclaim contribute to deadlock?

Not sure it can. But if we're going to allow this, it needs to come with
a clear description on why. Not a few clues to a puzzle.

Now, even if its not strictly a deadlock, there is something to be said
for flagging GFP_FS allocs that lead to nested GFP_FS allocs, do we ever
want to allow that?

Re: dvb usb issues since kernel 4.9

2018-01-29 Thread Mauro Carvalho Chehab

Em Fri, 26 Jan 2018 17:37:39 -0200
Mauro Carvalho Chehab  escreveu:

> Em Fri, 26 Jan 2018 12:17:37 -0200
> Mauro Carvalho Chehab  escreveu:
> 
> > Hi Alan,
> > 
> > Em Mon, 8 Jan 2018 14:15:35 -0500 (EST)
> > Alan Stern  escreveu:
> >   
> > > On Mon, 8 Jan 2018, Linus Torvalds wrote:
> > > 
> > > > Can somebody tell which softirq it is that dvb/usb cares about?  
> > > 
> > > I don't know about the DVB part.  The USB part is a little difficult to
> > > analyze, mostly because the bug reports I've seen are mostly from
> > > people running non-vanilla kernels. 
> > 
> > I suspect that the main reason for people not using non-vanilla Kernels
> > is that, among other bugs, the dwc2 upstream driver has serious troubles
> > handling ISOCH traffic.
> > 
> > Using Kernel 4.15-rc7 from this git tree:
> > https://git.linuxtv.org/mchehab/experimental.git/log/?h=softirq_fixup
> > 
> > (e. g. with the softirq bug partially reverted with Linux patch, and
> >  the DWC2 deferred probe fixed)
> > 
> > With a PCTV 461e device, with uses em28xx driver + Montage frontend
> > (with is the same used on dvbsky hardware - except for em28xx).
> > 
> > This device doesn't support bulk for DVB, just ISOCH. The drivers work 
> > fine on x86.
> > 
> > Using a test signal at the bit rate of 56698,4 Kbits/s, that's what
> > happens, when capturing less than one second of data:
> > 
> > $ dvbv5-zap -c ~/dvb_channel.conf "tv brasil" -l universal -X 100 -m 
> > -t2dvbv5-zap -c ~/dvb_channel.conf "tv brasil" -l universal -X 100 -m -t2
> > Using LNBf UNIVERSAL
> > Universal, Europe
> > Freqs : 10800 to 11800 MHz, LO: 9750 MHz
> > Freqs : 11600 to 12700 MHz, LO: 10600 MHz
> > using demux 'dvb0.demux0'
> > reading channels from file '/home/mchehab/dvb_channel.conf'
> > tuning to 11468000 Hz
> >(0x00) Signal= -33.90dBm
> > Lock   (0x1f) Signal= -33.90dBm C/N= 30.28dB postBER= 2.33x10^-6
> > dvb_dev_set_bufsize: buffer set to 6160384
> >   dvb_set_pesfilter to 0x2000
> > 354.08s: Starting capture
> > 354.73s: only read 59220 bytes
> > 354.73s: Stopping capture
> > 
> > [  354.000827] dwc2 3f98.usb: DWC OTG HCD EP DISABLE: 
> > bEndpointAddress=0x84, ep->hcpriv=116f41b2
> > [  354.000859] dwc2 3f98.usb: DWC OTG HCD EP RESET: 
> > bEndpointAddress=0x84
> > [  354.010744] dwc2 3f98.usb: --Host Channel 5 Interrupt: Frame 
> > Overrun--
> > ... (hundreds of thousands of Frame Overrun messages)
> > [  354.660857] dwc2 3f98.usb: --Host Channel 5 Interrupt: Frame 
> > Overrun--
> > [  354.660935] dwc2 3f98.usb: DWC OTG HCD URB Dequeue
> > [  354.660959] dwc2 3f98.usb: Called usb_hcd_giveback_urb()
> > [  354.660966] dwc2 3f98.usb:   urb->status = 0
> > [  354.660992] dwc2 3f98.usb: DWC OTG HCD URB Dequeue
> > [  354.661001] dwc2 3f98.usb: Called usb_hcd_giveback_urb()
> > [  354.661008] dwc2 3f98.usb:   urb->status = 0
> > [  354.661054] dwc2 3f98.usb: DWC OTG HCD URB Dequeue
> > [  354.661065] dwc2 3f98.usb: Called usb_hcd_giveback_urb()
> > [  354.661072] dwc2 3f98.usb:   urb->status = 0
> > [  354.661107] dwc2 3f98.usb: DWC OTG HCD URB Dequeue
> > [  354.661120] dwc2 3f98.usb: Called usb_hcd_giveback_urb()
> > [  354.661127] dwc2 3f98.usb:   urb->status = 0
> > [  354.661146] dwc2 3f98.usb: DWC OTG HCD URB Dequeue
> > [  354.661158] dwc2 3f98.usb: Called usb_hcd_giveback_urb()
> > [  354.661165] dwc2 3f98.usb:   urb->status = 0  
> 
> Btw, 
> 
> Just in case, I also applied all recent pending dwc2 patches I found at
> linux-usb (even trivial unrelated ones) at:
> 
>   https://git.linuxtv.org/mchehab/experimental.git/log/?h=dwc2_patches
> 
> No differences. ISOCH is still broken.
> 
> If anyone wants to see the full logs, it is there:
>   https://pastebin.com/XJYyTwPv

Someone pointed me in priv that applying a change at DWC2 BRCM profile to
enable uframe_sched might help.

So, I wrote this patch:

https://git.linuxtv.org/mchehab/experimental.git/commit/?h=v4.15%2bmedia%2bdwc2=19abf0026b7bf1bd44aa9d2add9f958935760ded

And applied on the top of this branch:

https://git.linuxtv.org/mchehab/experimental.git/log/?h=v4.15%2bmedia%2bdwc2

It is based on Kernel 4.15 vanilla. I applied:
- all media -next patches that will be sent to Kernel 4.16-rc1;
- DWC2 patches submitted by Gregor at linux-usb ML;
- Linus softirq test patch:

https://git.linuxtv.org/mchehab/experimental.git/commit/?h=v4.15%2bmedia%2bdwc2=ccf833fd4a5b99c3d3cf2c09c065670f74a230a7
- A DT patch that enables VCIQ (needed by some GPU drivers):

https://git.linuxtv.org/mchehab/experimental.git/commit/?h=v4.15%2bmedia%2bdwc2=fd4e9ca6f41d35b6234c30fa29937141e0c09570
- a few debug patches like this one:

[PATCH] netfilter: fix pointer leaks to userspace

2018-01-29 Thread Dmitry Vyukov

Several netfilter matches and targets put kernel pointers into
info objects, but don't set usersize in descriptors.
This leads to kernel pointer leaks if a match/target is set
and then read back to userspace.

Properly set usersize for these matches/targets.

Found with manual code inspection.

Signed-off-by: Dmitry Vyukov 
Cc: Pablo Neira Ayuso 
Cc: Jozsef Kadlecsik 
Cc: Florian Westphal 
Cc: "David S. Miller" 
Cc: Kees Cook 
Cc: netfilter-de...@vger.kernel.org
Cc: coret...@netfilter.org
Cc: netdev@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
 net/netfilter/xt_IDLETIMER.c | 1 +
 net/netfilter/xt_LED.c   | 1 +
 net/netfilter/xt_limit.c | 3 +--
 net/netfilter/xt_nfacct.c| 1 +
 net/netfilter/xt_statistic.c | 1 +
 5 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/xt_IDLETIMER.c b/net/netfilter/xt_IDLETIMER.c
index ee3421ad108d..6c2482b709b1 100644
--- a/net/netfilter/xt_IDLETIMER.c
+++ b/net/netfilter/xt_IDLETIMER.c
@@ -252,6 +252,7 @@ static struct xt_target idletimer_tg __read_mostly = {
.family = NFPROTO_UNSPEC,
.target = idletimer_tg_target,
.targetsize = sizeof(struct idletimer_tg_info),
+   .usersize   = offsetof(struct idletimer_tg_info, timer),
.checkentry = idletimer_tg_checkentry,
.destroy= idletimer_tg_destroy,
.me = THIS_MODULE,
diff --git a/net/netfilter/xt_LED.c b/net/netfilter/xt_LED.c
index 0971634e5444..1dcad893df78 100644
--- a/net/netfilter/xt_LED.c
+++ b/net/netfilter/xt_LED.c
@@ -198,6 +198,7 @@ static struct xt_target led_tg_reg __read_mostly = {
.family = NFPROTO_UNSPEC,
.target = led_tg,
.targetsize = sizeof(struct xt_led_info),
+   .usersize   = offsetof(struct xt_led_info, internal_data),
.checkentry = led_tg_check,
.destroy= led_tg_destroy,
.me = THIS_MODULE,
diff --git a/net/netfilter/xt_limit.c b/net/netfilter/xt_limit.c
index d27b5f1ea619..61403b77361c 100644
--- a/net/netfilter/xt_limit.c
+++ b/net/netfilter/xt_limit.c
@@ -193,9 +193,8 @@ static struct xt_match limit_mt_reg __read_mostly = {
.compatsize   = sizeof(struct compat_xt_rateinfo),
.compat_from_user = limit_mt_compat_from_user,
.compat_to_user   = limit_mt_compat_to_user,
-#else
-   .usersize = offsetof(struct xt_rateinfo, prev),
 #endif
+   .usersize = offsetof(struct xt_rateinfo, prev),
.me   = THIS_MODULE,
 };
 
diff --git a/net/netfilter/xt_nfacct.c b/net/netfilter/xt_nfacct.c
index cc0518fe598e..6f92d25590a8 100644
--- a/net/netfilter/xt_nfacct.c
+++ b/net/netfilter/xt_nfacct.c
@@ -62,6 +62,7 @@ static struct xt_match nfacct_mt_reg __read_mostly = {
.match  = nfacct_mt,
.destroy= nfacct_mt_destroy,
.matchsize  = sizeof(struct xt_nfacct_match_info),
+   .usersize   = offsetof(struct xt_nfacct_match_info, nfacct),
.me = THIS_MODULE,
 };
 
diff --git a/net/netfilter/xt_statistic.c b/net/netfilter/xt_statistic.c
index 11de55e7a868..8710fdba2ae2 100644
--- a/net/netfilter/xt_statistic.c
+++ b/net/netfilter/xt_statistic.c
@@ -84,6 +84,7 @@ static struct xt_match xt_statistic_mt_reg __read_mostly = {
.checkentry = statistic_mt_check,
.destroy= statistic_mt_destroy,
.matchsize  = sizeof(struct xt_statistic_info),
+   .usersize   = offsetof(struct xt_statistic_info, master),
.me = THIS_MODULE,
 };
 
-- 
2.16.0.rc1.238.g530d649a79-goog

Re: [4.15-rc9] fs_reclaim lockdep trace

2018-01-29 Thread Tetsuo Handa

Peter Zijlstra wrote:
> On Sun, Jan 28, 2018 at 02:55:28PM +0900, Tetsuo Handa wrote:
> > This warning seems to be caused by commit d92a8cfcb37ecd13
> > ("locking/lockdep: Rework FS_RECLAIM annotation") which moved the
> > location of
> > 
> >   /* this guy won't enter reclaim */
> >   if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
> >   return false;
> > 
> > check added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context
> > (__GFP_NOFS)").
> 
> I'm not entirly sure I get what you mean here. How did I move it? It was
> part of lockdep_trace_alloc(), if __GFP_NOMEMALLOC was set, it would not
> mark the lock as held.

d92a8cfcb37ecd13 replaced lockdep_set_current_reclaim_state() with
fs_reclaim_acquire(), and removed current->lockdep_recursion handling.

--
# git show d92a8cfcb37ecd13 | grep recursion
-# define INIT_LOCKDEP  .lockdep_recursion = 0, 
.lockdep_reclaim_gfp = 0,
+# define INIT_LOCKDEP  .lockdep_recursion = 0,
unsigned intlockdep_recursion;
-   if (unlikely(current->lockdep_recursion))
-   current->lockdep_recursion = 1;
-   current->lockdep_recursion = 0;
-* context checking code. This tests GFP_FS recursion (a lock taken
--

> 
> The new code has it in fs_reclaim_acquire/release to the same effect, if
> __GFP_NOMEMALLOC, we'll not acquire/release the lock.

Excuse me, but I can't catch.
We currently acquire/release __fs_reclaim_map if __GFP_NOMEMALLOC.

--
+static bool __need_fs_reclaim(gfp_t gfp_mask)
+{
(...snipped...)
+   /* this guy won't enter reclaim */
+   if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
+   return false;
(...snipped...)
+}
--

> 
> 
> > Since __kmalloc_reserve() from __alloc_skb() adds
> > __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, __need_fs_reclaim() is
> > failing to return false despite PF_MEMALLOC context (and resulted in
> > lockdep warning).
> 
> But that's correct right, __GFP_NOMEMALLOC should negate PF_MEMALLOC.
> That's what the name says.

__GFP_NOMEMALLOC negates PF_MEMALLOC regarding what watermark that allocation
request should use.

--
static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask)
{
if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
return 0;
if (gfp_mask & __GFP_MEMALLOC)
return ALLOC_NO_WATERMARKS;
if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
return ALLOC_NO_WATERMARKS;
if (!in_interrupt()) {
if (current->flags & PF_MEMALLOC)
return ALLOC_NO_WATERMARKS;
else if (oom_reserves_allowed(current))
return ALLOC_OOM;
}

return 0;
}
--

But at the same time, PF_MEMALLOC negates __GFP_DIRECT_RECLAIM.

--
/* Attempt with potentially adjusted zonelist and alloc_flags */
page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
if (page)
goto got_pg;

/* Caller is not willing to reclaim, we can't balance anything */
if (!can_direct_reclaim)
goto nopage;

/* Avoid recursion of direct reclaim */
if (current->flags & PF_MEMALLOC)
goto nopage;

/* Try direct reclaim and then allocating */
page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
_some_progress);
if (page)
goto got_pg;

/* Try direct compaction and then allocating */
page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
compact_priority, _result);
if (page)
goto got_pg;

/* Do not loop if specifically requested */
if (gfp_mask & __GFP_NORETRY)
goto nopage;
--

Then, how can fs_reclaim contribute to deadlock?

> 
> > Since there was no PF_MEMALLOC safeguard as of cf40bd16fdad42c0, checking
> > __GFP_NOMEMALLOC might make sense. But since this safeguard was added by
> > commit 341ce06f69abfafa ("page allocator: calculate the alloc_flags for
> > allocation only once"), checking __GFP_NOMEMALLOC no longer makes sense.
> > Thus, let's remove __GFP_NOMEMALLOC check and allow __need_fs_reclaim() to
> > return false.
> 
> This does not in fact explain what's going on, it just points to
> 'random' patches.
> 
> Are you talking about this:
> 
> +   /* Avoid recursion of direct reclaim */
> +   if (p->flags & PF_MEMALLOC)
> +   goto nopage;
> 
> bit?

Yes.

> 
> > Reported-by: Dave Jones 
> > Signed-off-by: Tetsuo Handa 
> > Cc: Peter Zijlstra 
> > Cc: Nick Piggin 
> > ---
> >  mm/page_alloc.c | 2 +-
> >  1 file changed,

[PATCH iproute2] police: don't skip parameters after actions

2018-01-29 Thread Wolfgang Bumiller

The 'parse_action_control()' helper advances the argument
pointers to past its parsed action already, so don't
advance it further in 'act_parse_polic()'.

Fixes: e67aba559581 ("tc: actions: add helpers to parse and print control 
actions")
Signed-off-by: Wolfgang Bumiller 
---
Basically parse_action_control() silently added a NEXT_ARG() while the
cases before didn't have one. Not sure whether the goto is okay
style-wise, let me know if you prefer some other solution.

Example for triggering this:
Specifying a 'flowid X' after a `police ... drop` will skip the 'flowid'
and error with "What is X"

$ tc filter add dev eth0 parent : basic police rate 13371337bps burst 1337b 
mtu 64kb drop flowid :1
What is ":1"?
...

 tc/m_police.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tc/m_police.c b/tc/m_police.c
index ff1dcb7d..f0878b3a 100644
--- a/tc/m_police.c
+++ b/tc/m_police.c
@@ -154,6 +154,7 @@ int act_parse_police(struct action_util *a, int *argc_p, 
char ***argv_p,
   matches(*argv, "goto") == 0) {
if (parse_action_control(, , , 
false))
return -1;
+   goto keep_arg;
} else if (strcmp(*argv, "conform-exceed") == 0) {
NEXT_ARG();
if (parse_action_control_slash(, , ,
@@ -174,8 +175,9 @@ int act_parse_police(struct action_util *a, int *argc_p, 
char ***argv_p,
} else {
break;
}
-   ok++;
argc--; argv++;
+keep_arg:
+   ok++;
}
 
if (!ok)
-- 
2.11.0

Re: [PATCH nf-next,RFC v4] netfilter: nf_flow_table: add hardware offload support

2018-01-29 Thread Pablo Neira Ayuso

Hi Jakub,

On Thu, Jan 25, 2018 at 02:38:46PM -0800, Jakub Kicinski wrote:
> On Thu, 25 Jan 2018 12:28:58 +0100, Pablo Neira Ayuso wrote:
[...]
> Ah, I must be misunderstanding.  I meant when device is removed, not
> the flow_table_hw module.  Does the nf_flow_table_hw_module_exit() run
> when device is removed?  I was expecting that, for example something
> like nft_flow_offload_iterate_cleanup() would queue up all the flow
> remove calls and then call flush_work() (not cancel_work). 

Oh right indeed, I need some code there to release all hw resources on
module removal.

Will revamp and send v5.

Thanks for reviewing!

Re: [4.15-rc9] fs_reclaim lockdep trace

2018-01-29 Thread Peter Zijlstra

On Sun, Jan 28, 2018 at 02:55:28PM +0900, Tetsuo Handa wrote:
> This warning seems to be caused by commit d92a8cfcb37ecd13
> ("locking/lockdep: Rework FS_RECLAIM annotation") which moved the
> location of
> 
>   /* this guy won't enter reclaim */
>   if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
>   return false;
> 
> check added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context
> (__GFP_NOFS)").

I'm not entirly sure I get what you mean here. How did I move it? It was
part of lockdep_trace_alloc(), if __GFP_NOMEMALLOC was set, it would not
mark the lock as held.

The new code has it in fs_reclaim_acquire/release to the same effect, if
__GFP_NOMEMALLOC, we'll not acquire/release the lock.

> Since __kmalloc_reserve() from __alloc_skb() adds
> __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, __need_fs_reclaim() is
> failing to return false despite PF_MEMALLOC context (and resulted in
> lockdep warning).

But that's correct right, __GFP_NOMEMALLOC should negate PF_MEMALLOC.
That's what the name says.

> Since there was no PF_MEMALLOC safeguard as of cf40bd16fdad42c0, checking
> __GFP_NOMEMALLOC might make sense. But since this safeguard was added by
> commit 341ce06f69abfafa ("page allocator: calculate the alloc_flags for
> allocation only once"), checking __GFP_NOMEMALLOC no longer makes sense.
> Thus, let's remove __GFP_NOMEMALLOC check and allow __need_fs_reclaim() to
> return false.

This does not in fact explain what's going on, it just points to
'random' patches.

Are you talking about this:

+   /* Avoid recursion of direct reclaim */
+   if (p->flags & PF_MEMALLOC)
+   goto nopage;

bit?

> Reported-by: Dave Jones 
> Signed-off-by: Tetsuo Handa 
> Cc: Peter Zijlstra 
> Cc: Nick Piggin 
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 76c9688..7804b0e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3583,7 +3583,7 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)
>   return false;
>  
>   /* this guy won't enter reclaim */
> - if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
> + if (current->flags & PF_MEMALLOC)
>   return false;

I'm _really_ uncomfortable doing that. Esp. without a solid explanation
of how this raelly can't possibly lead to trouble. Which the above semi
incoherent rambling is not.

Your backtrace shows the btrfs shrinker doing an allocation, that's the
exact kind of thing we need to be extremely careful with.

Re: [PATCH 2/3] Revert "e1000e: Separate signaling for link check/link up"

2018-01-29 Thread Benjamin Poirier

On 2018/01/26 09:03, Alexander Duyck wrote:
> On Fri, Jan 26, 2018 at 1:12 AM, Benjamin Poirier  wrote:
> > This reverts commit 19110cfbb34d4af0cdfe14cd243f3b09dc95b013.
> > This reverts commit 4110e02eb45ea447ec6f5459c9934de0a273fb91.
> >
> > ... because they cause an extra 2s delay for the link to come up when
> > autoneg is off.
> >
> > After reverting, the race condition described in the log of commit
> > 19110cfbb34d ("e1000e: Separate signaling for link check/link up") is
> > reintroduced. It may still be triggered by LSC events but this should not
> > result in link flap. It may no longer be triggered by RXO events because
> > commit 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
> > restored reading icr in the Other handler.
> 
> With the RXO events removed the only cause for us to transition the
> bit should be LSC. I'm not sure if the race condition in that state is
> a valid concern or not as the LSC should only get triggered if the
> link state toggled, even briefly.
> 
> The bigger concern I would have would be the opposite of the original
> race that was pointed out:
> \ e1000_watchdog_task
> \ e1000e_has_link
> \ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
> /* link is up */
> mac->get_link_status = false;
> 
> /* interrupt */
> \ e1000_msix_other
> hw->mac.get_link_status = true;
> 
> link_active = !hw->mac.get_link_status
> /* link_active is false, wrongly */
> 
> So the question I would have is what if we see the LSC for a link down
> just after the check_for_copper_link call completes? It may not be

Can you write out exactly what that race would be, in a format similar to the
above?

> anything seen in the real world since I don't know if we have any link
> flapping issues on e1000e or not without this patch. It is something
> to keep in mind for the future though.
> 
> 
> > As discussed, the driver should be in "maintenance mode". In the interest
> > of stability, revert to the original code as much as possible instead of a
> > half-baked solution.
> 
> If nothing else we may want to do a follow-up on this patch as we
> probably shouldn't be returning the error values to trigger link up.
> There are definitely issues to be found here. If nothing else we may
> want to explore just returning 1 if auto-neg is disabled instead of
> returning an error code.
> 
> > Link: https://www.spinics.net/lists/netdev/msg479923.html
> > Signed-off-by: Benjamin Poirier 
[...]

Re: r8169 take too long to complete driver initialization

2018-01-29 Thread Chris Chiu

On Fri, Jan 5, 2018 at 10:17 AM, Chris Chiu  wrote:
> On Wed, Dec 20, 2017 at 4:41 PM, Chris Chiu  wrote:
>> Hi,
>> We've hit a suspend/resume issue on a Acer desktop caused by r8169
>> driver. The dmseg
>> https://gist.github.com/mschiu77/b741849b5070281daaead8dfee312d1a
>> shows it's still in msleep() within a mutex lock.
>> After looking into the code, it's caused by the
>> rtl8168ep_stop_cmac() which is waiting 100 seconds for
>> rtl_ocp_tx_cond. The following dmesg states that the r8169 driver is
>> loaded.
>>
>> [   20.270526] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
>>
>> But it takes > 100 seconds to get the following messages
>>
>> [  140.400223] r8169 :02:00.0 (unnamed net_device)
>> (uninitialized): rtl_ocp_tx_cond == 1 (loop: 2000, delay: 50).
>> [  140.413294] r8169 :02:00.0 eth0: RTL8168ep/8111ep at
>> 0xb16c80db1000, f8:0f:41:ea:74:0d, XID 10200800 IRQ 46
>> [  140.413297] r8169 :02:00.0 eth0: jumbo features [frames: 9200
>> bytes, tx checksumming: ko]
>>
>> So any trial to suspend the machine during this period would always
>> get device/resource busy message then abort. Is this  rtl_ocp_tx_cond
>> necessary? Because the ethernet is still working and I don't see any
>> problem. I don't know it should be considered normal or not. Please
>> let me know if any more information required. Thanks
>>
>> Chris
>
> gentle ping,
>
> cheers.

Hi,
Just found a r8168 driver which seems to be authrized by realtek for cross
comparison. I tried applying the patch to latest 4.15 kernel and the driver done
it's initialization in faily short time. The patch file is here
https://gist.github.com/mschiu77/fcf406e64a1a437f46cf2be643f1057d.

In mainline r8169.c, the IBISR0 register need to be polled in the
rtl8168ep_stop_cmac().
In the patch file, there's also the same IBISR0 polling code in
Dash2DisableTx(),
but it's been bypassed since the chipset maches HW_DASH_SUPPORT_TYPE_2.
Per the rtl_chip_info[] in r8168_n.c, CFG_METHOD_23/27/28 are
HW_DASH_SUPPORT_TYPE_2,
and they happens to be the only 3 named RTL8168EP/8111EP in the rtl_chip_info[].

To find the same matches in r8169.c, RTL_GIGA_MAC_VER_49/50/51
seems share the
same config. Can anyone clarify if the rtl_ocp_tx_cond() really
necessary for 8168EP/8111EP?
Or we can just ignore the condition check for RTL_GIGA_MAC_VER_49/50/51?

Chris

[PATCH] qlcnic: fix deadlock bug

2018-01-29 Thread Junxiao Bi

The following soft lockup was caught. This is a deadlock caused by
recusive locking.

Process kworker/u40:1:28016 was holding spin lock "mbx->queue_lock" in
qlcnic_83xx_mailbox_worker(), while a softirq came in and ask the same spin
lock in qlcnic_83xx_enqueue_mbx_cmd(). This lock should be hold by disable
bh..

[161846.962125] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! 
[kworker/u40:1:28016]
[161846.962367] Modules linked in: tun ocfs2 xen_netback xen_blkback 
xen_gntalloc xen_gntdev xen_evtchn xenfs xen_privcmd autofs4 ocfs2_dlmfs 
ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bnx2fc 
fcoe libfcoe libfc sunrpc 8021q mrp garp bridge stp llc bonding dm_round_robin 
dm_multipath iTCO_wdt iTCO_vendor_support pcspkr sb_edac edac_core i2c_i801 
shpchp lpc_ich mfd_core ioatdma ipmi_devintf ipmi_si ipmi_msghandler sg ext4 
jbd2 mbcache2 sr_mod cdrom sd_mod igb i2c_algo_bit i2c_core ahci libahci 
megaraid_sas ixgbe dca ptp pps_core vxlan udp_tunnel ip6_udp_tunnel qla2xxx 
scsi_transport_fc qlcnic crc32c_intel be2iscsi bnx2i cnic uio cxgb4i cxgb4 
cxgb3i libcxgbi ipv6 cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi 
scsi_transport_iscsi dm_mirror dm_region_hash dm_log dm_mod
[161846.962454]
[161846.962460] CPU: 1 PID: 28016 Comm: kworker/u40:1 Not tainted 
4.1.12-94.5.9.el6uek.x86_64 #2
[161846.962463] Hardware name: Oracle Corporation SUN SERVER X4-2L  
/ASSY,MB,X4-2L , BIOS 26050100 09/19/2017
[161846.962489] Workqueue: qlcnic_mailbox qlcnic_83xx_mailbox_worker [qlcnic]
[161846.962493] task: 8801f2e34600 ti: 88004ca5c000 task.ti: 
88004ca5c000
[161846.962496] RIP: e030:[]  [] 
xen_hypercall_sched_op+0xa/0x20
[161846.962506] RSP: e02b:880202e43388  EFLAGS: 0206
[161846.962509] RAX:  RBX: 8801f6996b70 RCX: 
810013aa
[161846.962511] RDX: 880202e433cc RSI: 880202e433b0 RDI: 
0003
[161846.962513] RBP: 880202e433d0 R08:  R09: 
8801fe893200
[161846.962516] R10: 8801fe400538 R11: 0206 R12: 
880202e4b000
[161846.962518] R13: 0050 R14: 0001 R15: 
020d
[161846.962528] FS:  () GS:880202e4() 
knlGS:880202e4
[161846.962531] CS:  e033 DS:  ES:  CR0: 80050033
[161846.962533] CR2: 02612640 CR3: 0001bb796000 CR4: 
00042660
[161846.962536] Stack:
[161846.962538]  880202e43608  813f0442 
880202e433b0
[161846.962543]   880202e433cc 0001 

[161846.962547]  0009813f03d6 880202e433e0 813f0460 
880202e43440
[161846.962552] Call Trace:
[161846.962555]  
[161846.962565]  [] ? xen_poll_irq_timeout+0x42/0x50
[161846.962570]  [] xen_poll_irq+0x10/0x20
[161846.962578]  [] xen_lock_spinning+0xe2/0x110
[161846.962583]  [] 
__raw_callee_save_xen_lock_spinning+0x11/0x20
[161846.962592]  [] ? _raw_spin_lock+0x57/0x80
[161846.962609]  [] qlcnic_83xx_enqueue_mbx_cmd+0x7c/0xe0 
[qlcnic]
[161846.962623]  [] qlcnic_83xx_issue_cmd+0x58/0x210 [qlcnic]
[161846.962636]  [] 
qlcnic_83xx_sre_macaddr_change+0x162/0x1d0 [qlcnic]
[161846.962649]  [] qlcnic_83xx_change_l2_filter+0x2b/0x30 
[qlcnic]
[161846.962657]  [] ? __skb_flow_dissect+0x18b/0x650
[161846.962670]  [] qlcnic_send_filter+0x205/0x250 [qlcnic]
[161846.962682]  [] qlcnic_xmit_frame+0x547/0x7b0 [qlcnic]
[161846.962691]  [] xmit_one+0x82/0x1a0
[161846.962696]  [] dev_hard_start_xmit+0x50/0xa0
[161846.962701]  [] sch_direct_xmit+0x112/0x220
[161846.962706]  [] __dev_queue_xmit+0x1df/0x5e0
[161846.962710]  [] dev_queue_xmit_sk+0x13/0x20
[161846.962721]  [] bond_dev_queue_xmit+0x35/0x80 [bonding]
[161846.962729]  [] __bond_start_xmit+0x1cb/0x210 [bonding]
[161846.962736]  [] bond_start_xmit+0x31/0x60 [bonding]
[161846.962740]  [] xmit_one+0x82/0x1a0
[161846.962745]  [] dev_hard_start_xmit+0x50/0xa0
[161846.962749]  [] __dev_queue_xmit+0x4ee/0x5e0
[161846.962754]  [] dev_queue_xmit_sk+0x13/0x20
[161846.962760]  [] vlan_dev_hard_start_xmit+0xb2/0x150 
[8021q]
[161846.962764]  [] xmit_one+0x82/0x1a0
[161846.962769]  [] dev_hard_start_xmit+0x50/0xa0
[161846.962773]  [] __dev_queue_xmit+0x4ee/0x5e0
[161846.962777]  [] dev_queue_xmit_sk+0x13/0x20
[161846.962789]  [] br_dev_queue_push_xmit+0x54/0xa0 [bridge]
[161846.962797]  [] br_forward_finish+0x2f/0x90 [bridge]
[161846.962807]  [] ? ttwu_do_wakeup+0x1d/0x100
[161846.962811]  [] ? __alloc_skb+0x8b/0x1f0
[161846.962818]  [] __br_forward+0x8d/0x120 [bridge]
[161846.962822]  [] ? __kmalloc_reserve+0x3b/0xa0
[161846.962829]  [] ? update_rq_runnable_avg+0xee/0x230
[161846.962836]  [] br_forward+0x96/0xb0 [bridge]
[161846.962845]  [] br_handle_frame_finish+0x1ae/0x420 
[bridge]
[161846.962853]  [] br_handle_frame+0x17f/0x260 [bridge]
[161846.962862]  [] ? br_handle_frame_finish+0x420/0x420 
[bridge]
[161846.962867]  [] __netif_receive_skb_core+0x1f7/0x870
[161846.962872]  []

Re: BUG: 4.14.11 unable to handle kernel NULL pointer dereference in xfrm_lookup

2018-01-29 Thread Tobias Hommel

On Wed, Jan 24, 2018 at 10:59:21AM +0100, Steffen Klassert wrote:
> On Fri, Jan 19, 2018 at 03:45:46PM +0100, Tobias Hommel wrote:
> > 
> > I tried to strip down the system configuration and was able to reproduce the
> > problem with a minimal configuration:
> > * ipsets are not used anymore
> > * no firewall markings are used any longer
> > * iptables are "completely empty", i.e. all policies set to ACCEPT and 
> > there is
> >   no rule in any table
> > * no additional routing policies (ip rule) except the default ones
> > * only main routing table is used
> > * using a "minimal" kernel config:
> >  * run `make defconfig`
> >  * add basic things (ESP, IGB driver, some crypto algorithms)
> >  * add options required to boot up the system (TPM crypt, some device mapper
> >options, overlayfs)
> > 
> > I attached the minimal config (minimal.config) and the defconfig for 
> > reference
> > (minimal.defconfig).
> > 
> > The setup is really simple now, the gateway is forwarding HTTP connections
> > between eth1(IPSec tunnels) and eth0 without any firewall, NAT, whatsoever.
> 
> Thanks a lot for your debugging effort!
> 
> > 
> > The only thing I can think of are the rather aggressive roadwarrior clients.
> > There are 750 roadwarriors that are controlled by a script which starts and
> > stops the IPSec connection.
> 
> I still can't reproduce it with my tests. This is probably some race
> triggered due to your aggressive roadwarrior setup which I don't have.
> 
> > I tried 4.15-rc8 and have the same problem here (see attached
> > kernel-4.15-rc8.log). SMP affinity for IRQs has changed in 4.15 and 
> > something's
> 
> There is one patch that could influence this which is not in v4.15-rc8:
> 
> commit 76a4201191814a0061cb5c861fafb9ecaa764846
> ("xfrm: Fix a race in the xdst pcpu cache.")
> 
> It is included in v4.15-rc9.
I already tested that one some weeks ago, when it appeared on the mailing list,
with 4.14. Without any luck.

> 
> If this does not fix your problem, I'm out of ideas. In this case
> I have to ask to do a bisection to find the offending commit.
> 
I'll do a bisect session then. It'll take some time though as the hardware is
currently occupied with other tests. I'll keep you up-to-date about the
results.

Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-01-29 Thread Kirill A. Shutemov

On Mon, Jan 29, 2018 at 08:23:57AM +0100, Florian Westphal wrote:
> > vmalloc() once became killable by commit 5d17a73a2ebeb8d1 ("vmalloc: back
> > off when the current task is killed") but then became unkillable by commit
> > b8c8a338f75e052d ("Revert "vmalloc: back off when the current task is
> > killed""). Therefore, we can't handle this problem from MM side.
> > Please consider adding some limit from networking side.
> 
> I don't know what "some limit" would be.  I would prefer if there was
> a way to supress OOM Killer in first place so we can just -ENOMEM user.

Just supressing OOM kill is a bad idea. We still leave a way to allocate
arbitrary large buffer in kernel.

-- 
 Kirill A. Shutemov

92 matches

Mail list logo