Re: [PATCH 1/4] net: thunderx: Fix IOMMU translation faults

2017-03-03 Thread Sunil Kovvuri
On Fri, Mar 3, 2017 at 11:26 PM, David Miller  wrote:
> From: sunil.kovv...@gmail.com
> Date: Fri,  3 Mar 2017 16:17:47 +0530
>
>> @@ -1643,6 +1650,9 @@ static int nicvf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>>   if (!pass1_silicon(nic->pdev))
>>   nic->hw_tso = true;
>>
>> + /* Check if we are attached to IOMMU */
>> + nic->iommu_domain = iommu_get_domain_for_dev(dev);
>
> This function is not universally available.

Even if CONFIG_IOMMU_API is not enabled, it will return NULL and will be okay.
http://lxr.free-electrons.com/source/include/linux/iommu.h#L400

>
> This looks very hackish to me anyways, how all of this stuff is supposed
> to work is that you simply use the DMA interfaces unconditionally and
> whatever is behind the operations takes care of everything.
>
> Doing it conditionally in the driver with all of this special IOMMU
> domain et al. knowledge makes no sense to me at all.
>
> I don't see other drivers doing stuff like this at all, so if you're
> going to handle this in a unique way like this you better write
> several paragraphs in your commit message explaining why this weird
> crap is necessary.

I already tried to explain in the commit message that HW anyway takes care
of data coherency, so calling DMA interfaces when there is no IOMMU will
only result in performance drop.

We are seeing a 0.75Mpps drop with IP forwarding rate due to that.
Hence I have restricted calling DMA interfaces to only when IOMMU is enabled.
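
The point about iommu_get_domain_for_dev() returning NULL when
CONFIG_IOMMU_API is disabled relies on the common kernel pattern of a
static inline stub. Below is a minimal user-space sketch of that pattern.
CONFIG_DEMO_IOMMU, struct demo_device, struct demo_domain, demo_get_domain()
and demo_needs_dma_map() are made-up names for illustration only; this is
not the contents of include/linux/iommu.h.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types, standing in for struct device / struct iommu_domain. */
struct demo_device { int id; };
struct demo_domain { int id; };

#ifdef CONFIG_DEMO_IOMMU
/* With the feature built in, a real lookup would be provided elsewhere. */
struct demo_domain *demo_get_domain(struct demo_device *dev);
#else
/* Feature compiled out: the stub always reports "no domain". */
static inline struct demo_domain *demo_get_domain(struct demo_device *dev)
{
	(void)dev;
	return NULL;
}
#endif

/* Driver-side check: take the mapping path only when a domain exists. */
static inline bool demo_needs_dma_map(struct demo_device *dev)
{
	return demo_get_domain(dev) != NULL;
}
```

This only shows why the call compiles and returns NULL in both
configurations. It says nothing about whether a driver should make such a
check in the first place, which is the objection raised in the quoted review.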


Re: [PATCH net 1/2] net: fix socket refcounting in skb_complete_wifi_ack()

2017-03-03 Thread Soheil Hassas Yeganeh
On Fri, Mar 3, 2017 at 7:01 PM, Eric Dumazet  wrote:
> TX skbs do not necessarily hold a reference on skb->sk->sk_refcnt
> By the time TX completion happens, sk_refcnt might be already 0.
>
> sock_hold()/sock_put() would then corrupt critical state, like
> sk_wmem_alloc.
>
> Fixes: bf7fa551e0ce ("mac80211: Resolve sk_refcnt/sk_wmem_alloc issue in wifi ack path")
> Signed-off-by: Eric Dumazet 
> Cc: Alexander Duyck 
> Cc: Johannes Berg 
> Cc: Soheil Hassas Yeganeh 
> Cc: Willem de Bruijn 

Acked-by: Soheil Hassas Yeganeh 

> ---
>  net/core/skbuff.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f3557958e9bf147631a90b51fef0630920acd97b..e2f37a560ec43ccf60a71f190423bd265eccf594 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3893,7 +3893,7 @@ void skb_complete_wifi_ack(struct sk_buff *skb, bool acked)
>  {
> struct sock *sk = skb->sk;
> struct sock_exterr_skb *serr;
> -   int err;
> +   int err = 1;
>
> skb->wifi_acked_valid = 1;
> skb->wifi_acked = acked;
> @@ -3903,14 +3903,15 @@ void skb_complete_wifi_ack(struct sk_buff *skb, bool acked)
> serr->ee.ee_errno = ENOMSG;
> serr->ee.ee_origin = SO_EE_ORIGIN_TXSTATUS;
>
> -   /* take a reference to prevent skb_orphan() from freeing the socket */
> -   sock_hold(sk);
> -
> -   err = sock_queue_err_skb(sk, skb);
> +   /* Take a reference to prevent skb_orphan() from freeing the socket,
> +* but only if the socket refcount is not zero.
> +*/
> +   if (likely(atomic_inc_not_zero(&sk->sk_refcnt))) {
> +   err = sock_queue_err_skb(sk, skb);
> +   sock_put(sk);
> +   }
> if (err)
> kfree_skb(skb);
> -
> -   sock_put(sk);
>  }
>  EXPORT_SYMBOL_GPL(skb_complete_wifi_ack);
>
> --
> 2.12.0.rc1.440.g5b76565f74-goog
>

Nice fix!


Re: [PATCH net 2/2] net: fix socket refcounting in skb_complete_tx_timestamp()

2017-03-03 Thread Soheil Hassas Yeganeh
On Fri, Mar 3, 2017 at 7:01 PM, Eric Dumazet  wrote:
>
> TX skbs do not necessarily hold a reference on skb->sk->sk_refcnt
> By the time TX completion happens, sk_refcnt might be already 0.
>
> sock_hold()/sock_put() would then corrupt critical state, like
> sk_wmem_alloc and lead to leaks or use after free.
>
> Fixes: 62bccb8cdb69 ("net-timestamp: Make the clone operation stand-alone from phy timestamping")
> Signed-off-by: Eric Dumazet 
> Cc: Alexander Duyck 
> Cc: Johannes Berg 
> Cc: Soheil Hassas Yeganeh 
> Cc: Willem de Bruijn 

Acked-by: Soheil Hassas Yeganeh 

> ---
>  net/core/skbuff.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index e2f37a560ec43ccf60a71f190423bd265eccf594..cd4ba8c6b6091651403cf74de8c60ccf69aa3e7b 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3828,13 +3828,14 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
> if (!skb_may_tx_timestamp(sk, false))
> return;
>
> -   /* take a reference to prevent skb_orphan() from freeing the socket */
> -   sock_hold(sk);
> -
> -   *skb_hwtstamps(skb) = *hwtstamps;
> -   __skb_complete_tx_timestamp(skb, sk, SCM_TSTAMP_SND);
> -
> -   sock_put(sk);
> +   /* Take a reference to prevent skb_orphan() from freeing the socket,
> +* but only if the socket refcount is not zero.
> +*/
> +   if (likely(atomic_inc_not_zero(&sk->sk_refcnt))) {
> +   *skb_hwtstamps(skb) = *hwtstamps;
> +   __skb_complete_tx_timestamp(skb, sk, SCM_TSTAMP_SND);
> +   sock_put(sk);
> +   }
>  }
>  EXPORT_SYMBOL_GPL(skb_complete_tx_timestamp);
>
> --
> 2.12.0.rc1.440.g5b76565f74-goog
>

Thanks for the fix!


[PATCH net 2/2] net: fix socket refcounting in skb_complete_tx_timestamp()

2017-03-03 Thread Eric Dumazet
TX skbs do not necessarily hold a reference on skb->sk->sk_refcnt
By the time TX completion happens, sk_refcnt might be already 0.

sock_hold()/sock_put() would then corrupt critical state, like
sk_wmem_alloc and lead to leaks or use after free.

Fixes: 62bccb8cdb69 ("net-timestamp: Make the clone operation stand-alone from phy timestamping")
Signed-off-by: Eric Dumazet 
Cc: Alexander Duyck 
Cc: Johannes Berg 
Cc: Soheil Hassas Yeganeh 
Cc: Willem de Bruijn 
---
 net/core/skbuff.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e2f37a560ec43ccf60a71f190423bd265eccf594..cd4ba8c6b6091651403cf74de8c60ccf69aa3e7b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3828,13 +3828,14 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
if (!skb_may_tx_timestamp(sk, false))
return;
 
-   /* take a reference to prevent skb_orphan() from freeing the socket */
-   sock_hold(sk);
-
-   *skb_hwtstamps(skb) = *hwtstamps;
-   __skb_complete_tx_timestamp(skb, sk, SCM_TSTAMP_SND);
-
-   sock_put(sk);
+   /* Take a reference to prevent skb_orphan() from freeing the socket,
+* but only if the socket refcount is not zero.
+*/
+   if (likely(atomic_inc_not_zero(&sk->sk_refcnt))) {
+   *skb_hwtstamps(skb) = *hwtstamps;
+   __skb_complete_tx_timestamp(skb, sk, SCM_TSTAMP_SND);
+   sock_put(sk);
+   }
 }
 EXPORT_SYMBOL_GPL(skb_complete_tx_timestamp);
 
-- 
2.12.0.rc1.440.g5b76565f74-goog



[PATCH net 0/2] net: fix possible sock_hold() misuses

2017-03-03 Thread Eric Dumazet
skb_complete_wifi_ack() and skb_complete_tx_timestamp() currently
call sock_hold() on sockets that might have transitioned their sk_refcnt
to zero already.

Eric Dumazet (2):
  net: fix socket refcounting in skb_complete_wifi_ack()
  net: fix socket refcounting in skb_complete_tx_timestamp()

 net/core/skbuff.c | 30 ++++++++++++++++--------------
 1 file changed, 16 insertions(+), 14 deletions(-)

-- 
2.12.0.rc1.440.g5b76565f74-goog
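
Both patches in this series replace an unconditional sock_hold() with an
"increment unless zero" step. As a rough user-space illustration of that
idea, here is a sketch using C11 atomics. struct demo_obj,
demo_get_unless_zero() and demo_put() are hypothetical names; the kernel's
atomic_inc_not_zero() is a separate implementation and this is not it.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted object such as a socket. */
struct demo_obj {
	atomic_int refcnt;
};

/* Take a reference only if the count has not already dropped to zero. */
static bool demo_get_unless_zero(struct demo_obj *obj)
{
	int old = atomic_load(&obj->refcnt);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&obj->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true when the caller released the last one. */
static bool demo_put(struct demo_obj *obj)
{
	return atomic_fetch_sub(&obj->refcnt, 1) == 1;
}
```

A plain increment on a count that is already zero would "revive" an object
whose teardown has started; the compare-and-swap loop refuses to do that,
so the caller can skip the work instead, as the patched functions do.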



[PATCH net 1/2] net: fix socket refcounting in skb_complete_wifi_ack()

2017-03-03 Thread Eric Dumazet
TX skbs do not necessarily hold a reference on skb->sk->sk_refcnt
By the time TX completion happens, sk_refcnt might be already 0.

sock_hold()/sock_put() would then corrupt critical state, like
sk_wmem_alloc.

Fixes: bf7fa551e0ce ("mac80211: Resolve sk_refcnt/sk_wmem_alloc issue in wifi ack path")
Signed-off-by: Eric Dumazet 
Cc: Alexander Duyck 
Cc: Johannes Berg 
Cc: Soheil Hassas Yeganeh 
Cc: Willem de Bruijn 
---
 net/core/skbuff.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f3557958e9bf147631a90b51fef0630920acd97b..e2f37a560ec43ccf60a71f190423bd265eccf594 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3893,7 +3893,7 @@ void skb_complete_wifi_ack(struct sk_buff *skb, bool acked)
 {
struct sock *sk = skb->sk;
struct sock_exterr_skb *serr;
-   int err;
+   int err = 1;
 
skb->wifi_acked_valid = 1;
skb->wifi_acked = acked;
@@ -3903,14 +3903,15 @@ void skb_complete_wifi_ack(struct sk_buff *skb, bool acked)
serr->ee.ee_errno = ENOMSG;
serr->ee.ee_origin = SO_EE_ORIGIN_TXSTATUS;
 
-   /* take a reference to prevent skb_orphan() from freeing the socket */
-   sock_hold(sk);
-
-   err = sock_queue_err_skb(sk, skb);
+   /* Take a reference to prevent skb_orphan() from freeing the socket,
+* but only if the socket refcount is not zero.
+*/
+   if (likely(atomic_inc_not_zero(&sk->sk_refcnt))) {
+   err = sock_queue_err_skb(sk, skb);
+   sock_put(sk);
+   }
if (err)
kfree_skb(skb);
-
-   sock_put(sk);
 }
 EXPORT_SYMBOL_GPL(skb_complete_wifi_ack);
 
-- 
2.12.0.rc1.440.g5b76565f74-goog



Re: [PATCH net] sctp: change to save MSG_MORE flag into assoc

2017-03-03 Thread Xin Long
On Sat, Mar 4, 2017 at 1:57 AM, Xin Long  wrote:
> On Sat, Mar 4, 2017 at 12:31 AM, David Laight  wrote:
>> From: Xin Long
>>> Sent: 03 March 2017 15:43
>> ...
>>> > It is much more important to get MSG_MORE working 'properly' for SCTP
>>> > than for TCP. For TCP an application can always use a long send.
>>
>>> "long send" ?, you mean bigger data, or keeping sending?
>>> I didn't get the difference between SCTP and TCP, they
>>> are similar when sending data.
>>
>> With tcp an application can always replace two send()/write()
>> calls with a single call to writev().
>> For sctp two send() calls must be made in order to generate two
>> data chunks.
>> So it is much easier for a tcp application to generate 'full'
>> ethernet packets.
> okay, it should not be an important reason, and sctp might also support
> it one day. :-)
>
>>
>>>
>>> >
>>> > ...
>>> >> @@ -1982,6 +1982,7 @@ static int sctp_sendmsg(struct sock *sk, struct msghdr *msg, size_t msg_len)
>>> >>* breaks.
>>> >>*/
>>> >>   err = sctp_primitive_SEND(net, asoc, datamsg);
>>> >> + asoc->force_delay = 0;
>>> >>   /* Did the lower layer accept the chunk? */
>>> >>   if (err) {
>>> >>   sctp_datamsg_free(datamsg);
>>> >
>>> > I don't think this is right - or needed.
>>> > You only get to the above if some test has decided to send data chunks.
>>> > So it just means that the NEXT time someone tries to send data all the
>>> > queued data gets sent.
>>
>>> the NEXT time someone tries to send data with "MSG_MORE clear",
>>> yes, but with "MSG_MORE set", it will still delay.
>>>
>>> > I'm guessing that the whole thing gets called in a loop (definitely needed
>>> > for very long data chunks, or after the window is opened).
>>
>>> yes, if users keep sending data chunks with MSG_MORE set, no
>>> data with "MSG_MORE clear" gap.
>>>
>>> > Now if an application sends a lot of (say) 100 byte chunks with MSG_MORE
>>> > set it would expect to see a lot of full ethernet frames be sent.
>>
>>> right.
>>
>>> > With the above a frame will be sent (containing all but 1 chunk) when the
>>> > amount of queued data becomes too large for an ethernet frame, and 
>>> > immediately
>>> > followed by a second ethernet frame with 1 chunk in it.
>>
>>> "followed by a second ethernet frame with 1 chunk in it.", I think this's
>>> what you're really worried about, right ?
>>> But sctp flush data queue NOT like what you think, it's not keep traversing
>>> the queue until the queue is empty.
>>> once a packet with chunks in one ethernet frame is sent, sctp_outq_flush
>>> will return. it will pack chunks and send the next packet again until some
>>> other 'event' triggers it, like retransmission or data received from peer.
>>> I don't think this is a problem.
>>
>> Erm that can't work.
>> I think there is code to convert a large user send into multiple data chunks.
>> So if the user does a 4k (say) send several large chunks get queued.
>> These would need to all be sent at once.
>>
>> Similarly when the transmit window is received.
>> So somewhere there ought to be a loop that will send more than one packet.
> As far as I can see, no loop like you said, mostly, the incoming
> chunk (like SACK) from peer will trigger the next flush out.
> I can try to trace the path in kernel for sure tomorrow.
okay, you are right, I missed that sctp_packet_transmit_chunk also calls
sctp_packet_transmit to send the current packet. :)

But if we keep sending data with "MSG_MORE", after one ethernet frame
is sent, "followed by a second ethernet frame with 1 chunk in it" will NOT
happen, as in this loop the asoc's msg_more flag is still set, and this flush
is called by sctp_sendmsg(the function msg_more should care more).

did I miss something ?
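
For reference, the TCP-side point quoted earlier (one writev() call
replacing two send()/write() calls) can be sketched as below. This is
only an illustration on a generic file descriptor; demo_send_two() is a
hypothetical helper, and nothing here is SCTP-specific or taken from the
patch.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hand a header and a body to the kernel with one system call. */
static ssize_t demo_send_two(int fd, const char *hdr, const char *body)
{
	struct iovec iov[2];

	iov[0].iov_base = (void *)hdr;
	iov[0].iov_len = strlen(hdr);
	iov[1].iov_base = (void *)body;
	iov[1].iov_len = strlen(body);

	/* Both buffers are gathered into a single write. */
	return writev(fd, iov, 2);
}
```

On a byte stream the receiver cannot tell the two buffers apart, which is
why this works for TCP. With SCTP each send produces its own data chunk,
so gathering in user space is not equivalent and a MSG_MORE style hint is
the alternative being discussed.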

>
>>
>>> > Now it might be that the flag needs clearing when retransmissions are 
>>> > queued.
>>> > OTOH they might get sent for other reasons.
>>
>>> Before we really overthought about MSG_MORE, no need to care about
>>> retransmissions, define MSG_MORE, in my opinion, it works more for
>>> *inflight is 0*, if it's not 0, we shouldn't stop other places flushing 
>>> them.
>>
>> Eh? and when nagle disabled.
>> If 'inflight' isn't 0 then most paths don't flush data.
> I knew, but MSG_MORE is different thing, it should only try to work for the
> current and following data.
>
>>
>>> We cannot let asoc's msg_more flag work as global, it will block elsewhere
>>> sending data chunks, not only sctp_sendmsg.
>>
>> If the connection was flow controlled off, and more 'credit' arrives and 
>> there
>> is less than an ethernet frame's worth of data pending, and the last send
>> said 'MSG_MORE' there is no point sending anything until the application
>> does a send with MSG_MORE clear.
> got you, I think you have different understanding about MSG_MORE
> while this patch just try to make it work like TCP's msg_more, but what
> you mentioned here is the same as TCP thing, 

Re: [Patch net] strparser: destroy workqueue on module exit

2017-03-03 Thread David Miller
From: Cong Wang 
Date: Fri,  3 Mar 2017 12:21:14 -0800

> Fixes: 43a0c6751a32 ("strparser: Stream parser for messages")
> Cc: Tom Herbert 
> Signed-off-by: Cong Wang 

Applied and queued up for -stable, thanks.


Re: [PATCH 0/4] Netfilter fixes for net

2017-03-03 Thread David Miller
From: Pablo Neira Ayuso 
Date: Fri,  3 Mar 2017 20:22:21 +0100

> The following patchset contains Netfilter fixes for your net tree,
> they are:
> 
> 1) Missing check for full sock in ip_route_me_harder(), from
>Florian Westphal.
> 
> 2) Incorrect sip helper structure initialization that breaks it when
>several ports are used, from Christophe Leroy.
> 
> 3) Fix incorrect assumption when looking up for matching with adjacent
>intervals in the nft_set_rbtree.
> 
> 4) Fix broken netlink event error reporting in nf_tables that results
>in misleading ESRCH errors propagated to userspace listeners.
> 
> You can pull these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

Pulled, thanks a lot Pablo.


[PATCH net] bpf: disable broken write protection on i386

2017-03-03 Thread Daniel Borkmann
Since d2852a224050 ("arch: add ARCH_HAS_SET_MEMORY config") and
9d876e79df6a ("bpf: fix unlocking of jited image when module ronx
not set") that uses the former, Fengguang reported random corruptions
on his i386 test machine [1]. On i386 there is no JIT available,
and since his kernel config doesn't have kernel modules enabled,
there was also no DEBUG_SET_MODULE_RONX enabled before which would
set interpreted bpf_prog image as read-only like we do in various
other cases for quite some time now, e.g. x86_64, arm64, etc. Thus,
the difference with above commits was that we now used set_memory_ro()
and set_memory_rw() on i386, which resulted in these issues. When
reproducing this with Fengguang's config and qemu image, I changed
lib/test_bpf.c to be run during boot instead of relying on trinity
to fiddle with cBPF.

The issue I saw with the BPF test suite when set_memory_ro() and
set_memory_rw() are used to write-protect the image on i386 is that, after
a number of tests, a corruption happens in bpf_prog_realloc().
Specifically, fp_old's content gets corrupted right *after* the
(unrelated) __vmalloc() call and contains only zeroes right after
the call instead of the original prog data. fp_old should have been
freed later on via __bpf_prog_free() *after* we copied all the data
over to the newly allocated fp. Result looks like:

  [...]
  [   13.107240] test_bpf: #249 JMP_JSET_X: if (0x3 & 0x2) return 1 jited:0 17 
PASS
  [   13.108182] test_bpf: #250 JMP_JSET_X: if (0x3 & 0x) return 1 
jited:0 17 PASS
  [   13.109206] test_bpf: #251 JMP_JA: Jump, gap, jump, ... jited:0 16 PASS
  [   13.110493] test_bpf: #252 BPF_MAXINSNS: Maximum possible literals jited:0 
12 PASS
  [   13.111885] test_bpf: #253 BPF_MAXINSNS: Single literal jited:0 8 PASS
  [   13.112804] test_bpf: #254 BPF_MAXINSNS: Run/add until end jited:0 6341 
PASS
  [   13.177195] test_bpf: #255 BPF_MAXINSNS: Too many instructions PASS
  [   13.177689] test_bpf: #256 BPF_MAXINSNS: Very long jump jited:0 9 PASS
  [   13.178611] test_bpf: #257 BPF_MAXINSNS: Ctx heavy transformations
  [   13.178713] BUG: unable to handle kernel NULL pointer dereference at 
0034
  [   13.179740] IP: bpf_prog_realloc+0x5b/0x90
  [   13.180017] *pde = 
  [   13.180017]
  [   13.180017] Oops: 0002 [#1] DEBUG_PAGEALLOC
  [   13.180017] CPU: 0 PID: 1 Comm: swapper Not tainted 
4.10.0-57268-gd627975-dirty #50
  [   13.180017] task: 401ec000 task.stack: 401f2000
  [   13.180017] EIP: bpf_prog_realloc+0x5b/0x90
  [   13.180017] EFLAGS: 00210246 CPU: 0
  [   13.180017] EAX:  EBX: 57ae1000 ECX:  EDX: 57ae1000
  [   13.180017] ESI: 0019 EDI: 57b07000 EBP: 401f3e74 ESP: 401f3e68
  [   13.180017]  DS: 007b ES: 007b FS:  GS:  SS: 0068
  [   13.180017] CR0: 80050033 CR2: 0034 CR3: 12cb1000 CR4: 0610
  [   13.180017] DR0:  DR1:  DR2:  DR3: 
  [   13.180017] DR6: fffe0ff0 DR7: 0400
  [   13.180017] Call Trace:
  [   13.180017]  bpf_prepare_filter+0x317/0x3a0
  [   13.180017]  bpf_prog_create+0x65/0xa0
  [   13.180017]  test_bpf_init+0x1ca/0x628
  [   13.180017]  ? test_hexdump_init+0xb5/0xb5
  [   13.180017]  do_one_initcall+0x7c/0x11c
  [...]

When using trinity from Fengguang's reproducer, the corruptions were
at inconsistent places, presumably from code dealing with allocations
and seeing similar effects as mentioned above.

Not using set_memory_ro() and set_memory_rw() lets the test suite
run just fine as expected, thus it looks like using set_memory_*()
on i386 seems broken and mentioned commits just uncovered it. Also,
for checking, I enabled DEBUG_RODATA_TEST for that kernel.

The latter shows that memory protection of the kernel does not seem to work
either on i386 (!). Test suite output:

  [...]
  [   12.692836] Write protecting the kernel text: 13416k
  [   12.693309] Write protecting the kernel read-only data: 5292k
  [   12.693802] rodata_test: test data was not read only
  [...]

The work-around of not enabling ARCH_HAS_SET_MEMORY for i386 is not optimal,
as it doesn't fix the issue in the presumably broken set_memory_*(), but
it at least spares people from having to deal with random corruptions
that are hard to track down until a real fix can be found.

  [1] https://lkml.org/lkml/2017/3/2/648
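
As a loose user-space analogy for the read-only/read-write life cycle
described above, the sketch below fills a page, makes it read-only, and
restores write access before releasing it. It uses mmap()/mprotect()
rather than the kernel's set_memory_ro()/set_memory_rw(), and
demo_alloc_ro()/demo_free_ro() are hypothetical names; it does not
reproduce the i386 problem.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one anonymous page, copy the image in, then make it read-only. */
static void *demo_alloc_ro(const void *src, size_t len)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	void *p;

	if (len > page)
		return NULL;
	p = mmap(NULL, page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;
	memcpy(p, src, len);
	if (mprotect(p, page, PROT_READ) != 0) {
		munmap(p, page);
		return NULL;
	}
	return p;
}

/* Restore write access before the page is released. */
static int demo_free_ro(void *p)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);

	if (mprotect(p, page, PROT_READ | PROT_WRITE) != 0)
		return -1;
	return munmap(p, page);
}
```

The ordering matters in the same way as in the commit message: the
contents must be complete before protection is applied, and protection is
lifted again before the memory goes back to the allocator.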

Reported-by: Fengguang Wu 
Signed-off-by: Daniel Borkmann 
Cc: Laura Abbott 
Cc: Kees Cook 
Cc: Alexei Starovoitov 
---
 [ Sending to -net as bpf related, but I don't mind to route it
   elsewhere, too. ]

 arch/x86/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc98d5a..626dc6a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -54,7 +54,7 @@ config X86
select ARCH_HAS_KCOVif X86_64
select ARCH_HAS_MMIO_FLUSH
select ARCH_HAS_PMEM_APIif X86_64
-   

[PATCH v3 net-next 6/6] MAINTAINERS: Add entry for APM X-Gene SoC Ethernet (v2) driver

2017-03-03 Thread Iyappan Subramanian
This patch adds a MAINTAINERS entry for the ethernet driver for
the on-chip ethernet interface which uses a linked list of DMA
descriptor architecture (v2) for APM X-Gene SoCs.

Signed-off-by: Iyappan Subramanian 
Signed-off-by: Keyur Chudgar 
---
 MAINTAINERS | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 846f97a..ccb9814 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -902,6 +902,12 @@ F: drivers/net/phy/mdio-xgene.c
 F: Documentation/devicetree/bindings/net/apm-xgene-enet.txt
 F: Documentation/devicetree/bindings/net/apm-xgene-mdio.txt
 
+APPLIED MICRO (APM) X-GENE SOC ETHERNET (V2) DRIVER
+M: Iyappan Subramanian 
+M: Keyur Chudgar 
+S: Supported
+F: drivers/net/ethernet/apm/xgene-v2/
+
 APPLIED MICRO (APM) X-GENE SOC PMU
 M: Tai Nguyen 
 S: Supported
-- 
1.9.1



[PATCH v3 net-next 4/6] drivers: net: xgene-v2: Add base driver

2017-03-03 Thread Iyappan Subramanian
This patch adds,

 - probe, remove, shutdown
 - open, close and stats
 - create and delete ring
 - request and delete irq

Signed-off-by: Iyappan Subramanian 
Signed-off-by: Keyur Chudgar 
---
 drivers/net/ethernet/apm/xgene-v2/main.c | 510 +++
 1 file changed, 510 insertions(+)
 create mode 100644 drivers/net/ethernet/apm/xgene-v2/main.c

diff --git a/drivers/net/ethernet/apm/xgene-v2/main.c 
b/drivers/net/ethernet/apm/xgene-v2/main.c
new file mode 100644
index 000..c96b4cc
--- /dev/null
+++ b/drivers/net/ethernet/apm/xgene-v2/main.c
@@ -0,0 +1,510 @@
+/*
+ * Applied Micro X-Gene SoC Ethernet v2 Driver
+ *
+ * Copyright (c) 2017, Applied Micro Circuits Corporation
+ * Author(s): Iyappan Subramanian 
+ *   Keyur Chudgar 
+ *
+ * This program is free software; you can redistribute  it and/or modify it
+ * under  the terms of  the GNU General  Public License as published by the
+ * Free Software Foundation;  either version 2 of the  License, or (at your
+ * option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "main.h"
+
+static const struct acpi_device_id xge_acpi_match[];
+
+static int xge_get_resources(struct xge_pdata *pdata)
+{
+   struct platform_device *pdev;
+   struct net_device *ndev;
+   struct device *dev;
+   struct resource *res;
+   int phy_mode, ret = 0;
+
+   pdev = pdata->pdev;
+   dev = &pdev->dev;
+   ndev = pdata->ndev;
+
+   res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+   if (!res) {
+   dev_err(dev, "Resource enet_csr not defined\n");
+   return -ENODEV;
+   }
+
+   pdata->resources.base_addr = devm_ioremap(dev, res->start,
+ resource_size(res));
+   if (!pdata->resources.base_addr) {
+   dev_err(dev, "Unable to retrieve ENET Port CSR region\n");
+   return -ENOMEM;
+   }
+
+   if (!device_get_mac_address(dev, ndev->dev_addr, ETH_ALEN))
+   eth_hw_addr_random(ndev);
+
+   memcpy(ndev->perm_addr, ndev->dev_addr, ndev->addr_len);
+
+   phy_mode = device_get_phy_mode(dev);
+   if (phy_mode < 0) {
+   dev_err(dev, "Unable to get phy-connection-type\n");
+   return phy_mode;
+   }
+   pdata->resources.phy_mode = phy_mode;
+
+   if (pdata->resources.phy_mode != PHY_INTERFACE_MODE_RGMII) {
+   dev_err(dev, "Incorrect phy-connection-type specified\n");
+   return -ENODEV;
+   }
+
+   ret = platform_get_irq(pdev, 0);
+   if (ret <= 0) {
+   dev_err(dev, "Unable to get ENET IRQ\n");
+   ret = ret ? : -ENXIO;
+   return ret;
+   }
+   pdata->resources.irq = ret;
+
+   return 0;
+}
+
+static int xge_refill_buffers(struct net_device *ndev, u32 nbuf)
+{
+   struct xge_pdata *pdata = netdev_priv(ndev);
+   struct xge_desc_ring *ring = pdata->rx_ring;
+   const u8 slots = XGENE_ENET_NUM_DESC - 1;
+   struct device *dev = &pdata->pdev->dev;
+   struct xge_raw_desc *raw_desc;
+   u64 addr_lo, addr_hi;
+   u8 tail = ring->tail;
+   struct sk_buff *skb;
+   dma_addr_t dma_addr;
+   u16 len;
+   int i;
+
+   for (i = 0; i < nbuf; i++) {
+   raw_desc = &ring->raw_desc[tail];
+
+   len = XGENE_ENET_STD_MTU;
+   skb = netdev_alloc_skb(ndev, len);
+   if (unlikely(!skb))
+   return -ENOMEM;
+
+   dma_addr = dma_map_single(dev, skb->data, len, DMA_FROM_DEVICE);
+   if (dma_mapping_error(dev, dma_addr)) {
+   netdev_err(ndev, "DMA mapping error\n");
+   dev_kfree_skb_any(skb);
+   return -EINVAL;
+   }
+
+   ring->pkt_info[tail].skb = skb;
+   ring->pkt_info[tail].dma_addr = dma_addr;
+
+   addr_hi = GET_BITS(NEXT_DESC_ADDRH, le64_to_cpu(raw_desc->m1));
+   addr_lo = GET_BITS(NEXT_DESC_ADDRL, le64_to_cpu(raw_desc->m1));
+   raw_desc->m1 = cpu_to_le64(SET_BITS(NEXT_DESC_ADDRL, addr_lo) |
+  SET_BITS(NEXT_DESC_ADDRH, addr_hi) |
+  SET_BITS(PKT_ADDRH,
+   dma_addr >> PKT_ADDRL_LEN));
+
+   dma_wmb();
+   raw_desc->m0 = cpu_to_le64(SET_BITS(PKT_ADDRL, dma_addr) |
+  

[PATCH v3 net-next 3/6] drivers: net: xgene-v2: Add ethernet hardware configuration

2017-03-03 Thread Iyappan Subramanian
This patch adds functions to configure ethernet hardware.

Signed-off-by: Iyappan Subramanian 
Signed-off-by: Keyur Chudgar 
---
 drivers/net/ethernet/apm/xgene-v2/enet.c | 71 
 drivers/net/ethernet/apm/xgene-v2/enet.h | 43 +++
 2 files changed, 114 insertions(+)
 create mode 100644 drivers/net/ethernet/apm/xgene-v2/enet.c
 create mode 100644 drivers/net/ethernet/apm/xgene-v2/enet.h

diff --git a/drivers/net/ethernet/apm/xgene-v2/enet.c 
b/drivers/net/ethernet/apm/xgene-v2/enet.c
new file mode 100644
index 000..b49edee
--- /dev/null
+++ b/drivers/net/ethernet/apm/xgene-v2/enet.c
@@ -0,0 +1,71 @@
+/*
+ * Applied Micro X-Gene SoC Ethernet v2 Driver
+ *
+ * Copyright (c) 2017, Applied Micro Circuits Corporation
+ * Author(s): Iyappan Subramanian 
+ *   Keyur Chudgar 
+ *
+ * This program is free software; you can redistribute  it and/or modify it
+ * under  the terms of  the GNU General  Public License as published by the
+ * Free Software Foundation;  either version 2 of the  License, or (at your
+ * option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "main.h"
+
+void xge_wr_csr(struct xge_pdata *pdata, u32 offset, u32 val)
+{
+   void __iomem *addr = pdata->resources.base_addr + offset;
+
+   iowrite32(val, addr);
+}
+
+u32 xge_rd_csr(struct xge_pdata *pdata, u32 offset)
+{
+   void __iomem *addr = pdata->resources.base_addr + offset;
+
+   return ioread32(addr);
+}
+
+int xge_port_reset(struct net_device *ndev)
+{
+   struct xge_pdata *pdata = netdev_priv(ndev);
+
+   xge_wr_csr(pdata, ENET_SRST, 0x3);
+   xge_wr_csr(pdata, ENET_SRST, 0x2);
+   xge_wr_csr(pdata, ENET_SRST, 0x0);
+
+   xge_wr_csr(pdata, ENET_SHIM, DEVM_ARAUX_COH | DEVM_AWAUX_COH);
+
+   return 0;
+}
+
+static void xge_traffic_resume(struct net_device *ndev)
+{
+   struct xge_pdata *pdata = netdev_priv(ndev);
+
+   xge_wr_csr(pdata, CFG_FORCE_LINK_STATUS_EN, 1);
+   xge_wr_csr(pdata, FORCE_LINK_STATUS, 1);
+
+   xge_wr_csr(pdata, CFG_LINK_AGGR_RESUME, 1);
+   xge_wr_csr(pdata, RX_DV_GATE_REG, 1);
+}
+
+int xge_port_init(struct net_device *ndev)
+{
+   struct xge_pdata *pdata = netdev_priv(ndev);
+
+   pdata->phy_speed = SPEED_1000;
+   xge_mac_init(pdata);
+   xge_traffic_resume(ndev);
+
+   return 0;
+}
diff --git a/drivers/net/ethernet/apm/xgene-v2/enet.h 
b/drivers/net/ethernet/apm/xgene-v2/enet.h
new file mode 100644
index 000..40371cf
--- /dev/null
+++ b/drivers/net/ethernet/apm/xgene-v2/enet.h
@@ -0,0 +1,43 @@
+/*
+ * Applied Micro X-Gene SoC Ethernet v2 Driver
+ *
+ * Copyright (c) 2017, Applied Micro Circuits Corporation
+ * Author(s): Iyappan Subramanian 
+ *   Keyur Chudgar 
+ *
+ * This program is free software; you can redistribute  it and/or modify it
+ * under  the terms of  the GNU General  Public License as published by the
+ * Free Software Foundation;  either version 2 of the  License, or (at your
+ * option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef __XGENE_ENET_V2_ENET_H__
+#define __XGENE_ENET_V2_ENET_H__
+
+#define ENET_CLKEN 0xc008
+#define ENET_SRST  0xc000
+#define ENET_SHIM  0xc010
+#define CFG_MEM_RAM_SHUTDOWN   0xd070
+#define BLOCK_MEM_RDY  0xd074
+
+#define DEVM_ARAUX_COH BIT(19)
+#define DEVM_AWAUX_COH BIT(3)
+
+#define CFG_FORCE_LINK_STATUS_EN   0x229c
+#define FORCE_LINK_STATUS  0x22a0
+#define CFG_LINK_AGGR_RESUME   0x27c8
+#define RX_DV_GATE_REG 0x2dfc
+
+void xge_wr_csr(struct xge_pdata *pdata, u32 offset, u32 val);
+u32 xge_rd_csr(struct xge_pdata *pdata, u32 offset);
+int xge_port_reset(struct net_device *ndev);
+
+#endif  /* __XGENE_ENET_V2_ENET_H__ */
-- 
1.9.1
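
To make the accessor and reset sequence in this patch easier to follow
outside the kernel, here is a small user-space model. A plain array
stands in for the ioremap()ed CSR window, and struct demo_regs,
demo_wr_csr(), demo_rd_csr(), demo_port_reset() and the DEMO_SRST offset
are all invented for the sketch; only the write order 0x3, 0x2, 0x0 is
taken from xge_port_reset().

```c
#include <stdint.h>

#define DEMO_CSR_WORDS 4

/* Hypothetical register file; offsets are bytes, as in the driver. */
struct demo_regs {
	uint32_t mem[DEMO_CSR_WORDS];
	uint32_t writes;	/* number of writes seen, for inspection */
};

/* No bounds checking: callers must keep offset inside the array. */
static void demo_wr_csr(struct demo_regs *r, uint32_t offset, uint32_t val)
{
	r->mem[offset / sizeof(uint32_t)] = val;
	r->writes++;
}

static uint32_t demo_rd_csr(const struct demo_regs *r, uint32_t offset)
{
	return r->mem[offset / sizeof(uint32_t)];
}

#define DEMO_SRST 0x0	/* made-up offset, not the real ENET_SRST */

/* Same write order as xge_port_reset(): 0x3, then 0x2, then 0x0. */
static void demo_port_reset(struct demo_regs *r)
{
	demo_wr_csr(r, DEMO_SRST, 0x3);
	demo_wr_csr(r, DEMO_SRST, 0x2);
	demo_wr_csr(r, DEMO_SRST, 0x0);
}
```

Keeping every access behind two small helpers, as the patch does, means
the base address arithmetic lives in one place and the rest of the driver
deals only in register offsets.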



[PATCH v3 net-next 1/6] drivers: net: xgene-v2: Add DMA descriptor

2017-03-03 Thread Iyappan Subramanian
This patch adds DMA descriptor setup and interrupt enable/disable
functions.
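
The descriptor setup in this patch links each descriptor to the next one
and wraps the last back to the first by masking the index. A minimal
sketch of that index arithmetic follows, assuming the ring size is a power
of two (the mask trick depends on it). DEMO_NUM_DESC and DEMO_DESC_SIZE are
placeholder values, not the driver's XGENE_ENET_NUM_DESC or
XGENE_ENET_DESC_SIZE, and the helper names are hypothetical.

```c
#include <stdint.h>

#define DEMO_NUM_DESC 8		/* assumed power of two */
#define DEMO_DESC_SIZE 16	/* assumed descriptor size in bytes */

/* Index of the descriptor that follows slot i, wrapping at the end. */
static unsigned int demo_next_index(unsigned int i)
{
	return (i + 1) & (DEMO_NUM_DESC - 1);
}

/* Bus address of the next descriptor, given the ring base address. */
static uint64_t demo_next_addr(uint64_t base, unsigned int i)
{
	return base + (uint64_t)demo_next_index(i) * DEMO_DESC_SIZE;
}
```

With these placeholder values, slot 7 points back to slot 0, so hardware
following the next-descriptor addresses walks the ring indefinitely.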

Signed-off-by: Iyappan Subramanian 
Signed-off-by: Keyur Chudgar 
---
 drivers/net/ethernet/apm/xgene-v2/main.h |  74 +++
 drivers/net/ethernet/apm/xgene-v2/ring.c |  81 +
 drivers/net/ethernet/apm/xgene-v2/ring.h | 119 +++
 3 files changed, 274 insertions(+)
 create mode 100644 drivers/net/ethernet/apm/xgene-v2/main.h
 create mode 100644 drivers/net/ethernet/apm/xgene-v2/ring.c
 create mode 100644 drivers/net/ethernet/apm/xgene-v2/ring.h

diff --git a/drivers/net/ethernet/apm/xgene-v2/main.h 
b/drivers/net/ethernet/apm/xgene-v2/main.h
new file mode 100644
index 000..a2f8712
--- /dev/null
+++ b/drivers/net/ethernet/apm/xgene-v2/main.h
@@ -0,0 +1,74 @@
+/*
+ * Applied Micro X-Gene SoC Ethernet v2 Driver
+ *
+ * Copyright (c) 2017, Applied Micro Circuits Corporation
+ * Author(s): Iyappan Subramanian 
+ *   Keyur Chudgar 
+ *
+ * This program is free software; you can redistribute  it and/or modify it
+ * under  the terms of  the GNU General  Public License as published by the
+ * Free Software Foundation;  either version 2 of the  License, or (at your
+ * option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef __XGENE_ENET_V2_MAIN_H__
+#define __XGENE_ENET_V2_MAIN_H__
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "mac.h"
+#include "enet.h"
+#include "ring.h"
+
+#define XGENE_ENET_V2_VERSION  "v1.0"
+#define XGENE_ENET_STD_MTU 1536
+#define XGENE_ENET_MIN_FRAME   60
+#define IRQ_ID_SIZE 16
+
+struct xge_resource {
+   void __iomem *base_addr;
+   int phy_mode;
+   u32 irq;
+};
+
+struct xge_stats {
+   u64 tx_packets;
+   u64 tx_bytes;
+   u64 rx_packets;
+   u64 rx_bytes;
+};
+
+/* ethernet private data */
+struct xge_pdata {
+   struct xge_resource resources;
+   struct xge_desc_ring *tx_ring;
+   struct xge_desc_ring *rx_ring;
+   struct platform_device *pdev;
+   char irq_name[IRQ_ID_SIZE];
+   struct net_device *ndev;
+   struct napi_struct napi;
+   struct xge_stats stats;
+   int phy_speed;
+   u8 nbufs;
+};
+
+#endif /* __XGENE_ENET_V2_MAIN_H__ */
diff --git a/drivers/net/ethernet/apm/xgene-v2/ring.c b/drivers/net/ethernet/apm/xgene-v2/ring.c
new file mode 100644
index 000..3881082
--- /dev/null
+++ b/drivers/net/ethernet/apm/xgene-v2/ring.c
@@ -0,0 +1,81 @@
+/*
+ * Applied Micro X-Gene SoC Ethernet v2 Driver
+ *
+ * Copyright (c) 2017, Applied Micro Circuits Corporation
+ * Author(s): Iyappan Subramanian 
+ *   Keyur Chudgar 
+ *
+ * This program is free software; you can redistribute  it and/or modify it
+ * under  the terms of  the GNU General  Public License as published by the
+ * Free Software Foundation;  either version 2 of the  License, or (at your
+ * option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see .
+ */
+
+#include "main.h"
+
+/* create circular linked list of descriptors */
+void xge_setup_desc(struct xge_desc_ring *ring)
+{
+   struct xge_raw_desc *raw_desc;
+   dma_addr_t dma_h, next_dma;
+   u16 offset;
+   int i;
+
+   for (i = 0; i < XGENE_ENET_NUM_DESC; i++) {
+   raw_desc = &ring->raw_desc[i];
+
+   offset = (i + 1) & (XGENE_ENET_NUM_DESC - 1);
+   next_dma = ring->dma_addr + (offset * XGENE_ENET_DESC_SIZE);
+
+   raw_desc->m0 = cpu_to_le64(SET_BITS(E, 1) |
+  SET_BITS(PKT_SIZE, SLOT_EMPTY));
+   dma_h = upper_32_bits(next_dma);
+   raw_desc->m1 = cpu_to_le64(SET_BITS(NEXT_DESC_ADDRL, next_dma) |
+  SET_BITS(NEXT_DESC_ADDRH, dma_h));
+   }
+}
+
+void xge_update_tx_desc_addr(struct xge_pdata *pdata)
+{
+   struct xge_desc_ring *ring = pdata->tx_ring;
+   dma_addr_t dma_addr = ring->dma_addr;
+
+   xge_wr_csr(pdata, DMATXDESCL, dma_addr);
+   xge_wr_csr(pdata, DMATXDESCH, 

[PATCH v3 net-next 2/6] drivers: net: xgene-v2: Add mac configuration

2017-03-03 Thread Iyappan Subramanian
This patch adds functions to configure and control the MAC.  It
also adds helper functions to get and set registers.

Signed-off-by: Iyappan Subramanian 
Signed-off-by: Keyur Chudgar 
---
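Note for readers of the archive: the sketch below is not part of the patch.
It is a minimal standalone illustration of the byte packing that
xge_mac_set_station_addr() performs further down. The names
pack_station_addr() and struct station_addr_regs are hypothetical; only the
shift pattern is taken from the patch. The explicit uint32_t casts are an
addition of this sketch, assuming we want the shifts to stay well defined
for bytes >= 0x80.

```c
#include <stdint.h>

/* Hypothetical holder for the two register values (not part of the driver). */
struct station_addr_regs {
	uint32_t addr0;	/* MAC bytes 0..3, byte 0 in the low bits */
	uint32_t addr1;	/* MAC bytes 4..5, placed in the upper 16 bits */
};

/* Pack a 6-byte MAC address the same way the patch builds addr0/addr1. */
static struct station_addr_regs pack_station_addr(const uint8_t mac[6])
{
	struct station_addr_regs regs;

	regs.addr0 = ((uint32_t)mac[3] << 24) | ((uint32_t)mac[2] << 16) |
		     ((uint32_t)mac[1] << 8) | (uint32_t)mac[0];
	regs.addr1 = ((uint32_t)mac[5] << 24) | ((uint32_t)mac[4] << 16);

	return regs;
}
```

With this layout the low 16 bits of the second register are always zero.
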
 drivers/net/ethernet/apm/xgene-v2/mac.c | 116 
 drivers/net/ethernet/apm/xgene-v2/mac.h | 110 ++
 2 files changed, 226 insertions(+)
 create mode 100644 drivers/net/ethernet/apm/xgene-v2/mac.c
 create mode 100644 drivers/net/ethernet/apm/xgene-v2/mac.h

diff --git a/drivers/net/ethernet/apm/xgene-v2/mac.c b/drivers/net/ethernet/apm/xgene-v2/mac.c
new file mode 100644
index 000..9c3d32d
--- /dev/null
+++ b/drivers/net/ethernet/apm/xgene-v2/mac.c
@@ -0,0 +1,116 @@
+/*
+ * Applied Micro X-Gene SoC Ethernet v2 Driver
+ *
+ * Copyright (c) 2017, Applied Micro Circuits Corporation
+ * Author(s): Iyappan Subramanian 
+ *   Keyur Chudgar 
+ *
+ * This program is free software; you can redistribute  it and/or modify it
+ * under  the terms of  the GNU General  Public License as published by the
+ * Free Software Foundation;  either version 2 of the  License, or (at your
+ * option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see .
+ */
+
+#include "main.h"
+
+void xge_mac_reset(struct xge_pdata *pdata)
+{
+   xge_wr_csr(pdata, MAC_CONFIG_1, SOFT_RESET);
+   xge_wr_csr(pdata, MAC_CONFIG_1, 0);
+}
+
+static void xge_mac_set_speed(struct xge_pdata *pdata)
+{
+   u32 icm0, icm2, ecm0, mc2;
+   u32 intf_ctrl, rgmii;
+
+   icm0 = xge_rd_csr(pdata, ICM_CONFIG0_REG_0);
+   icm2 = xge_rd_csr(pdata, ICM_CONFIG2_REG_0);
+   ecm0 = xge_rd_csr(pdata, ECM_CONFIG0_REG_0);
+   rgmii = xge_rd_csr(pdata, RGMII_REG_0);
+   mc2 = xge_rd_csr(pdata, MAC_CONFIG_2);
+   intf_ctrl = xge_rd_csr(pdata, INTERFACE_CONTROL);
+   icm2 |= CFG_WAITASYNCRD_EN;
+
+   switch (pdata->phy_speed) {
+   case SPEED_10:
+   SET_REG_BITS(&mc2, INTF_MODE, 1);
+   SET_REG_BITS(&intf_ctrl, HD_MODE, 0);
+   SET_REG_BITS(&icm0, CFG_MACMODE, 0);
+   SET_REG_BITS(&icm2, CFG_WAITASYNCRD, 500);
+   SET_REG_BIT(&rgmii, CFG_SPEED_125, 0);
+   break;
+   case SPEED_100:
+   SET_REG_BITS(&mc2, INTF_MODE, 1);
+   SET_REG_BITS(&intf_ctrl, HD_MODE, 1);
+   SET_REG_BITS(&icm0, CFG_MACMODE, 1);
+   SET_REG_BITS(&icm2, CFG_WAITASYNCRD, 80);
+   SET_REG_BIT(&rgmii, CFG_SPEED_125, 0);
+   break;
+   default:
+   SET_REG_BITS(&mc2, INTF_MODE, 2);
+   SET_REG_BITS(&intf_ctrl, HD_MODE, 2);
+   SET_REG_BITS(&icm0, CFG_MACMODE, 2);
+   SET_REG_BITS(&icm2, CFG_WAITASYNCRD, 16);
+   SET_REG_BIT(&rgmii, CFG_SPEED_125, 1);
+   break;
+   }
+
+   mc2 |= FULL_DUPLEX | CRC_EN | PAD_CRC;
+   SET_REG_BITS(&ecm0, CFG_WFIFOFULLTHR, 0x32);
+
+   xge_wr_csr(pdata, MAC_CONFIG_2, mc2);
+   xge_wr_csr(pdata, INTERFACE_CONTROL, intf_ctrl);
+   xge_wr_csr(pdata, RGMII_REG_0, rgmii);
+   xge_wr_csr(pdata, ICM_CONFIG0_REG_0, icm0);
+   xge_wr_csr(pdata, ICM_CONFIG2_REG_0, icm2);
+   xge_wr_csr(pdata, ECM_CONFIG0_REG_0, ecm0);
+}
+
+void xge_mac_set_station_addr(struct xge_pdata *pdata)
+{
+   u32 addr0, addr1;
+   u8 *dev_addr = pdata->ndev->dev_addr;
+
+   addr0 = (dev_addr[3] << 24) | (dev_addr[2] << 16) |
+   (dev_addr[1] << 8) | dev_addr[0];
+   addr1 = (dev_addr[5] << 24) | (dev_addr[4] << 16);
+
+   xge_wr_csr(pdata, STATION_ADDR0, addr0);
+   xge_wr_csr(pdata, STATION_ADDR1, addr1);
+}
+
+void xge_mac_init(struct xge_pdata *pdata)
+{
+   xge_mac_reset(pdata);
+   xge_mac_set_speed(pdata);
+   xge_mac_set_station_addr(pdata);
+}
+
+void xge_mac_enable(struct xge_pdata *pdata)
+{
+   u32 data;
+
+   data = xge_rd_csr(pdata, MAC_CONFIG_1);
+   data |= TX_EN | RX_EN;
+   xge_wr_csr(pdata, MAC_CONFIG_1, data);
+
+   data = xge_rd_csr(pdata, MAC_CONFIG_1);
+}
+
+void xge_mac_disable(struct xge_pdata *pdata)
+{
+   u32 data;
+
+   data = xge_rd_csr(pdata, MAC_CONFIG_1);
+   data &= ~(TX_EN | RX_EN);
+   xge_wr_csr(pdata, MAC_CONFIG_1, data);
+}
diff --git a/drivers/net/ethernet/apm/xgene-v2/mac.h b/drivers/net/ethernet/apm/xgene-v2/mac.h
new file mode 100644
index 000..0fce6ae
--- /dev/null
+++ b/drivers/net/ethernet/apm/xgene-v2/mac.h
@@ -0,0 +1,110 @@
+/*
+ * Applied Micro X-Gene SoC Ethernet v2 Driver
+ *
+ * Copyright (c) 2017, Applied Micro Circuits Corporation
+ * Author(s): Iyappan 

[PATCH v3 net-next 0/6] drivers: net: xgene-v2: Add RGMII based 1G driver

2017-03-03 Thread Iyappan Subramanian
This patch set adds support for the RGMII-based 1GbE hardware on APM X-Gene
SoCs, which uses a linked-list DMA descriptor architecture (v2).

Signed-off-by: Iyappan Subramanian 
---
v3: Address review comments from v2
- fix kbuild warnings (this 'if' clause does not guard)

v2: Address review comments from v1
- moved create_desc_ring and delete_desc_ring to open() and close()
  respectively
- changed to use dma_zalloc APIs
- fixed tx_timeout()
- removed tx completion polling upper bound
- added error checking on rx packets
- added netif_stop_queue() and netif_wake_queue()

v1:
- Initial version
---
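Note for readers of the archive: as a rough illustration of the linked-list
descriptor idea, the sketch below links a ring of descriptors so that the
last one points back to the first, using the same (i + 1) & (N - 1) wrap
that xge_setup_desc() uses in patch 1/6. NUM_DESC, DESC_SIZE,
struct demo_desc and demo_link_ring() are hypothetical names with assumed
values, not the driver's definitions. The mask trick only works when the
descriptor count is a power of two.

```c
#include <stdint.h>

#define NUM_DESC  8u	/* assumed ring size; must be a power of two */
#define DESC_SIZE 64u	/* assumed size of one descriptor in bytes */

/* Hypothetical stand-in for a hardware descriptor: only the next pointer. */
struct demo_desc {
	uint64_t next_dma;	/* bus address of the following descriptor */
};

/* Link every descriptor to its successor; the last one wraps to the first. */
static void demo_link_ring(struct demo_desc ring[NUM_DESC], uint64_t base_dma)
{
	unsigned int i;

	for (i = 0; i < NUM_DESC; i++) {
		unsigned int next = (i + 1) & (NUM_DESC - 1);

		ring[i].next_dma = base_dma + (uint64_t)next * DESC_SIZE;
	}
}
```

The hardware can then follow next_dma indefinitely without the driver ever
having to rewrite the links.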

Iyappan Subramanian (6):
  drivers: net: xgene-v2: Add DMA descriptor
  drivers: net: xgene-v2: Add mac configuration
  drivers: net: xgene-v2: Add ethernet hardware configuration
  drivers: net: xgene-v2: Add base driver
  drivers: net: xgene-v2: Add transmit and receive
  MAINTAINERS: Add entry for APM X-Gene SoC Ethernet (v2) driver

 MAINTAINERS|   6 +
 drivers/net/ethernet/apm/Kconfig   |   1 +
 drivers/net/ethernet/apm/Makefile  |   1 +
 drivers/net/ethernet/apm/xgene-v2/Kconfig  |  11 +
 drivers/net/ethernet/apm/xgene-v2/Makefile |   6 +
 drivers/net/ethernet/apm/xgene-v2/enet.c   |  71 +++
 drivers/net/ethernet/apm/xgene-v2/enet.h   |  43 ++
 drivers/net/ethernet/apm/xgene-v2/mac.c| 116 +
 drivers/net/ethernet/apm/xgene-v2/mac.h| 110 +
 drivers/net/ethernet/apm/xgene-v2/main.c   | 756 +
 drivers/net/ethernet/apm/xgene-v2/main.h   |  75 +++
 drivers/net/ethernet/apm/xgene-v2/ring.c   |  81 
 drivers/net/ethernet/apm/xgene-v2/ring.h   | 119 +
 13 files changed, 1396 insertions(+)
 create mode 100644 drivers/net/ethernet/apm/xgene-v2/Kconfig
 create mode 100644 drivers/net/ethernet/apm/xgene-v2/Makefile
 create mode 100644 drivers/net/ethernet/apm/xgene-v2/enet.c
 create mode 100644 drivers/net/ethernet/apm/xgene-v2/enet.h
 create mode 100644 drivers/net/ethernet/apm/xgene-v2/mac.c
 create mode 100644 drivers/net/ethernet/apm/xgene-v2/mac.h
 create mode 100644 drivers/net/ethernet/apm/xgene-v2/main.c
 create mode 100644 drivers/net/ethernet/apm/xgene-v2/main.h
 create mode 100644 drivers/net/ethernet/apm/xgene-v2/ring.c
 create mode 100644 drivers/net/ethernet/apm/xgene-v2/ring.h

-- 
1.9.1



[PATCH v3 net-next 5/6] drivers: net: xgene-v2: Add transmit and receive

2017-03-03 Thread Iyappan Subramanian
This patch adds,
- Transmit
- Transmit completion poll
- Receive poll
- NAPI handler

and enables the driver.

Signed-off-by: Iyappan Subramanian 
Signed-off-by: Keyur Chudgar 
---
 drivers/net/ethernet/apm/Kconfig   |   1 +
 drivers/net/ethernet/apm/Makefile  |   1 +
 drivers/net/ethernet/apm/xgene-v2/Kconfig  |  11 ++
 drivers/net/ethernet/apm/xgene-v2/Makefile |   6 +
 drivers/net/ethernet/apm/xgene-v2/main.c   | 248 -
 drivers/net/ethernet/apm/xgene-v2/main.h   |   1 +
 6 files changed, 267 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/apm/xgene-v2/Kconfig
 create mode 100644 drivers/net/ethernet/apm/xgene-v2/Makefile

diff --git a/drivers/net/ethernet/apm/Kconfig b/drivers/net/ethernet/apm/Kconfig
index ec63d70..59efe5b 100644
--- a/drivers/net/ethernet/apm/Kconfig
+++ b/drivers/net/ethernet/apm/Kconfig
@@ -1 +1,2 @@
 source "drivers/net/ethernet/apm/xgene/Kconfig"
+source "drivers/net/ethernet/apm/xgene-v2/Kconfig"
diff --git a/drivers/net/ethernet/apm/Makefile b/drivers/net/ethernet/apm/Makefile
index 65ce32a..946b2a4 100644
--- a/drivers/net/ethernet/apm/Makefile
+++ b/drivers/net/ethernet/apm/Makefile
@@ -3,3 +3,4 @@
 #
 
 obj-$(CONFIG_NET_XGENE) += xgene/
+obj-$(CONFIG_NET_XGENE_V2) += xgene-v2/
diff --git a/drivers/net/ethernet/apm/xgene-v2/Kconfig b/drivers/net/ethernet/apm/xgene-v2/Kconfig
new file mode 100644
index 000..1205861
--- /dev/null
+++ b/drivers/net/ethernet/apm/xgene-v2/Kconfig
@@ -0,0 +1,11 @@
+config NET_XGENE_V2
+   tristate "APM X-Gene SoC Ethernet-v2 Driver"
+   depends on HAS_DMA
+   depends on ARCH_XGENE || COMPILE_TEST
+   help
+ This is the Ethernet driver for the on-chip ethernet interface
+ which uses a linked list of DMA descriptor architecture (v2) for
+ APM X-Gene SoCs.
+
+ To compile this driver as a module, choose M here. This module will
+ be called xgene-enet-v2.
diff --git a/drivers/net/ethernet/apm/xgene-v2/Makefile b/drivers/net/ethernet/apm/xgene-v2/Makefile
new file mode 100644
index 000..735309c
--- /dev/null
+++ b/drivers/net/ethernet/apm/xgene-v2/Makefile
@@ -0,0 +1,6 @@
+#
+# Makefile for APM X-Gene Ethernet v2 driver
+#
+
+xgene-enet-v2-objs := main.o mac.o enet.o ring.o
+obj-$(CONFIG_NET_XGENE_V2) += xgene-enet-v2.o
diff --git a/drivers/net/ethernet/apm/xgene-v2/main.c b/drivers/net/ethernet/apm/xgene-v2/main.c
index c96b4cc..b16ef43 100644
--- a/drivers/net/ethernet/apm/xgene-v2/main.c
+++ b/drivers/net/ethernet/apm/xgene-v2/main.c
@@ -113,7 +113,7 @@ static int xge_refill_buffers(struct net_device *ndev, u32 nbuf)
raw_desc->m1 = cpu_to_le64(SET_BITS(NEXT_DESC_ADDRL, addr_lo) |
   SET_BITS(NEXT_DESC_ADDRH, addr_hi) |
   SET_BITS(PKT_ADDRH,
-   dma_addr >> PKT_ADDRL_LEN));
+   upper_32_bits(dma_addr)));
 
dma_wmb();
raw_desc->m0 = cpu_to_le64(SET_BITS(PKT_ADDRL, dma_addr) |
@@ -177,6 +177,194 @@ static void xge_free_irq(struct net_device *ndev)
devm_free_irq(dev, pdata->resources.irq, pdata);
 }
 
+static bool is_tx_slot_available(struct xge_raw_desc *raw_desc)
+{
+   if (GET_BITS(E, le64_to_cpu(raw_desc->m0)) &&
+   (GET_BITS(PKT_SIZE, le64_to_cpu(raw_desc->m0)) == SLOT_EMPTY))
+   return true;
+
+   return false;
+}
+
+static netdev_tx_t xge_start_xmit(struct sk_buff *skb, struct net_device *ndev)
+{
+   struct xge_pdata *pdata = netdev_priv(ndev);
+   struct device *dev = &pdata->pdev->dev;
+   static dma_addr_t dma_addr;
+   struct xge_desc_ring *tx_ring;
+   struct xge_raw_desc *raw_desc;
+   u64 addr_lo, addr_hi;
+   void *pkt_buf;
+   u8 tail;
+   u16 len;
+
+   tx_ring = pdata->tx_ring;
+   tail = tx_ring->tail;
+   len = skb_headlen(skb);
+   raw_desc = &tx_ring->raw_desc[tail];
+
+   if (!is_tx_slot_available(raw_desc)) {
+   netif_stop_queue(ndev);
+   return NETDEV_TX_BUSY;
+   }
+
+   /* Packet buffers should be 64B aligned */
+   pkt_buf = dma_zalloc_coherent(dev, XGENE_ENET_STD_MTU, &dma_addr,
+ GFP_ATOMIC);
+   if (unlikely(!pkt_buf)) {
+   dev_kfree_skb_any(skb);
+   return NETDEV_TX_OK;
+   }
+   memcpy(pkt_buf, skb->data, len);
+
+   addr_hi = GET_BITS(NEXT_DESC_ADDRH, le64_to_cpu(raw_desc->m1));
+   addr_lo = GET_BITS(NEXT_DESC_ADDRL, le64_to_cpu(raw_desc->m1));
+   raw_desc->m1 = cpu_to_le64(SET_BITS(NEXT_DESC_ADDRL, addr_lo) |
+  SET_BITS(NEXT_DESC_ADDRH, addr_hi) |
+  SET_BITS(PKT_ADDRH,
+   upper_32_bits(dma_addr)));
+
+

[PATCH net] rxrpc: Call state should be read with READ_ONCE() under some circumstances

2017-03-03 Thread David Howells
The call state may be changed at any time by the data-ready routine in
response to received packets, so if the call state is to be read and acted
upon several times in a function, READ_ONCE() must be used unless the call
state lock is held.

Signed-off-by: David Howells 
---
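Note for readers of the archive: the pattern described above can be sketched
in plain C as follows. This is a userspace illustration, not code from the
patch. DEMO_READ_ONCE is a simplified stand-in for the kernel macro (it
relies on the GCC/Clang __typeof__ extension), and the enum, struct and
function names are hypothetical. The point is that the state is sampled once
into a local variable, so every later test in the function sees the same
value even if another context changes call->state in the meantime.

```c
/* Simplified userspace stand-in for the kernel macro (GCC/Clang __typeof__). */
#define DEMO_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical state values, ordered like a simple state machine. */
enum demo_call_state {
	DEMO_CALL_CLIENT_SEND_REQUEST,
	DEMO_CALL_CLIENT_AWAIT_REPLY,
	DEMO_CALL_CLIENT_RECV_REPLY,
	DEMO_CALL_COMPLETE,
};

struct demo_call {
	enum demo_call_state state;	/* may be changed by another context */
};

/*
 * Returns 1 if the packet should be processed, 0 if it should be dropped.
 * The state is read exactly once; both tests below use that snapshot.
 */
static int demo_should_process(struct demo_call *call)
{
	enum demo_call_state state = DEMO_READ_ONCE(call->state);

	if (state >= DEMO_CALL_COMPLETE)
		return 0;
	if (state == DEMO_CALL_CLIENT_SEND_REQUEST ||
	    state == DEMO_CALL_CLIENT_AWAIT_REPLY)
		return 0;
	return 1;
}
```

Reading call->state directly in each test would allow the two comparisons to
observe different states.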

 net/rxrpc/input.c   |   12 +++-
 net/rxrpc/recvmsg.c |4 ++--
 net/rxrpc/sendmsg.c |   48 ++--
 3 files changed, 39 insertions(+), 25 deletions(-)

diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
index 9f4cfa25af7c..d74921c4969b 100644
--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -420,6 +420,7 @@ static void rxrpc_input_data(struct rxrpc_call *call, struct sk_buff *skb,
 u16 skew)
 {
struct rxrpc_skb_priv *sp = rxrpc_skb(skb);
+   enum rxrpc_call_state state;
unsigned int offset = sizeof(struct rxrpc_wire_header);
unsigned int ix;
rxrpc_serial_t serial = sp->hdr.serial, ack_serial = 0;
@@ -434,14 +435,15 @@ static void rxrpc_input_data(struct rxrpc_call *call, struct sk_buff *skb,
_proto("Rx DATA %%%u { #%u f=%02x }",
   sp->hdr.serial, seq, sp->hdr.flags);
 
-   if (call->state >= RXRPC_CALL_COMPLETE)
+   state = READ_ONCE(call->state);
+   if (state >= RXRPC_CALL_COMPLETE)
return;
 
/* Received data implicitly ACKs all of the request packets we sent
 * when we're acting as a client.
 */
-   if ((call->state == RXRPC_CALL_CLIENT_SEND_REQUEST ||
-call->state == RXRPC_CALL_CLIENT_AWAIT_REPLY) &&
+   if ((state == RXRPC_CALL_CLIENT_SEND_REQUEST ||
+state == RXRPC_CALL_CLIENT_AWAIT_REPLY) &&
!rxrpc_receiving_reply(call))
return;
 
@@ -799,7 +801,7 @@ static void rxrpc_input_ack(struct rxrpc_call *call, struct sk_buff *skb,
return rxrpc_proto_abort("AK0", call, 0);
 
/* Ignore ACKs unless we are or have just been transmitting. */
-   switch (call->state) {
+   switch (READ_ONCE(call->state)) {
case RXRPC_CALL_CLIENT_SEND_REQUEST:
case RXRPC_CALL_CLIENT_AWAIT_REPLY:
case RXRPC_CALL_SERVER_SEND_REPLY:
@@ -940,7 +942,7 @@ static void rxrpc_input_call_packet(struct rxrpc_call *call,
 static void rxrpc_input_implicit_end_call(struct rxrpc_connection *conn,
  struct rxrpc_call *call)
 {
-   switch (call->state) {
+   switch (READ_ONCE(call->state)) {
case RXRPC_CALL_SERVER_AWAIT_ACK:
rxrpc_call_completed(call);
break;
diff --git a/net/rxrpc/recvmsg.c b/net/rxrpc/recvmsg.c
index 22447dbcc380..46b1a93be03a 100644
--- a/net/rxrpc/recvmsg.c
+++ b/net/rxrpc/recvmsg.c
@@ -525,7 +525,7 @@ int rxrpc_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
msg->msg_namelen = len;
}
 
-   switch (call->state) {
+   switch (READ_ONCE(call->state)) {
case RXRPC_CALL_SERVER_ACCEPTING:
ret = rxrpc_recvmsg_new_call(rx, call, msg, flags);
break;
@@ -638,7 +638,7 @@ int rxrpc_kernel_recv_data(struct socket *sock, struct rxrpc_call *call,
 
mutex_lock(&call->user_mutex);
 
-   switch (call->state) {
+   switch (READ_ONCE(call->state)) {
case RXRPC_CALL_CLIENT_RECV_REPLY:
case RXRPC_CALL_SERVER_RECV_REQUEST:
case RXRPC_CALL_SERVER_ACK_REQUEST:
diff --git a/net/rxrpc/sendmsg.c b/net/rxrpc/sendmsg.c
index 27685d8cba1a..9c2e443de811 100644
--- a/net/rxrpc/sendmsg.c
+++ b/net/rxrpc/sendmsg.c
@@ -486,6 +486,7 @@ rxrpc_new_client_call_for_sendmsg(struct rxrpc_sock *rx, struct msghdr *msg,
 int rxrpc_do_sendmsg(struct rxrpc_sock *rx, struct msghdr *msg, size_t len)
__releases(&rx->sk.sk_lock.slock)
 {
+   enum rxrpc_call_state state;
enum rxrpc_command cmd;
struct rxrpc_call *call;
unsigned long user_call_ID = 0;
@@ -524,13 +525,17 @@ int rxrpc_do_sendmsg(struct rxrpc_sock *rx, struct msghdr *msg, size_t len)
return PTR_ERR(call);
/* ... and we have the call lock. */
} else {
-   ret = -EBUSY;
-   if (call->state == RXRPC_CALL_UNINITIALISED ||
-   call->state == RXRPC_CALL_CLIENT_AWAIT_CONN ||
-   call->state == RXRPC_CALL_SERVER_PREALLOC ||
-   call->state == RXRPC_CALL_SERVER_SECURING ||
-   call->state == RXRPC_CALL_SERVER_ACCEPTING)
+   switch (READ_ONCE(call->state)) {
+   case RXRPC_CALL_UNINITIALISED:
+   case RXRPC_CALL_CLIENT_AWAIT_CONN:
+   case RXRPC_CALL_SERVER_PREALLOC:
+   case RXRPC_CALL_SERVER_SECURING:
+   case RXRPC_CALL_SERVER_ACCEPTING:
+   ret = -EBUSY;
goto error_release_sock;
+   default:
+   break;
+

[PATCH] net: smsc: smc91c92_cs: use new api ethtool_{get|set}_link_ksettings

2017-03-03 Thread Philippe Reynes
The ethtool API {get|set}_settings is deprecated.
Move this driver to the new API {get|set}_link_ksettings.

As I don't have the hardware, I'd be very grateful if
someone could test this patch.

Signed-off-by: Philippe Reynes 
---
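Note for readers of the archive: the conversion relies on turning the legacy
32-bit "supported" mask into a link-mode bitmap. The sketch below is a
userspace approximation of what I understand
ethtool_convert_legacy_u32_to_link_mode() to do: clear the bitmap and store
the legacy mask in its low bits. All DEMO_ names, the bitmap width and the
helper functions are assumptions made for illustration; the bit positions
are copied by hand for the four modes this driver advertises.

```c
#include <limits.h>
#include <string.h>

#define DEMO_LINK_MODE_NBITS 64u	/* assumed bitmap width for this sketch */
#define DEMO_BITS_PER_LONG   (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_NLONGS \
	((DEMO_LINK_MODE_NBITS + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG)

/* Legacy mask bits used by this driver, copied here for illustration. */
#define DEMO_SUPPORTED_10baseT_Half (1u << 0)
#define DEMO_SUPPORTED_10baseT_Full (1u << 1)
#define DEMO_SUPPORTED_TP           (1u << 7)
#define DEMO_SUPPORTED_AUI          (1u << 8)

/* Clear the bitmap, then store the legacy 32-bit mask in its low bits. */
static void demo_legacy_u32_to_link_mode(unsigned long *dst, unsigned int legacy)
{
	memset(dst, 0, DEMO_NLONGS * sizeof(unsigned long));
	dst[0] = legacy;
}

/* Return 1 if bit nr is set in the bitmap, 0 otherwise. */
static int demo_test_bit(const unsigned long *map, unsigned int nr)
{
	return (int)((map[nr / DEMO_BITS_PER_LONG] >>
		      (nr % DEMO_BITS_PER_LONG)) & 1ul);
}
```

This is why smc_netdev_get_ecmd() can keep building the old u32 mask and
convert it in a single call at the end.
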
 drivers/net/ethernet/smsc/smc91c92_cs.c |   98 --
 1 files changed, 52 insertions(+), 46 deletions(-)

diff --git a/drivers/net/ethernet/smsc/smc91c92_cs.c b/drivers/net/ethernet/smsc/smc91c92_cs.c
index 97280da..976aa87 100644
--- a/drivers/net/ethernet/smsc/smc91c92_cs.c
+++ b/drivers/net/ethernet/smsc/smc91c92_cs.c
@@ -1843,56 +1843,60 @@ static int smc_link_ok(struct net_device *dev)
 }
 }
 
-static int smc_netdev_get_ecmd(struct net_device *dev, struct ethtool_cmd *ecmd)
+static int smc_netdev_get_ecmd(struct net_device *dev,
+  struct ethtool_link_ksettings *ecmd)
 {
-u16 tmp;
-unsigned int ioaddr = dev->base_addr;
+   u16 tmp;
+   unsigned int ioaddr = dev->base_addr;
+   u32 supported;
 
-ecmd->supported = (SUPPORTED_TP | SUPPORTED_AUI |
-   SUPPORTED_10baseT_Half | SUPPORTED_10baseT_Full);
-   
-SMC_SELECT_BANK(1);
-tmp = inw(ioaddr + CONFIG);
-ecmd->port = (tmp & CFG_AUI_SELECT) ? PORT_AUI : PORT_TP;
-ecmd->transceiver = XCVR_INTERNAL;
-ethtool_cmd_speed_set(ecmd, SPEED_10);
-ecmd->phy_address = ioaddr + MGMT;
+   supported = (SUPPORTED_TP | SUPPORTED_AUI |
+SUPPORTED_10baseT_Half | SUPPORTED_10baseT_Full);
 
-SMC_SELECT_BANK(0);
-tmp = inw(ioaddr + TCR);
-ecmd->duplex = (tmp & TCR_FDUPLX) ? DUPLEX_FULL : DUPLEX_HALF;
+   SMC_SELECT_BANK(1);
+   tmp = inw(ioaddr + CONFIG);
+   ecmd->base.port = (tmp & CFG_AUI_SELECT) ? PORT_AUI : PORT_TP;
+   ecmd->base.speed = SPEED_10;
+   ecmd->base.phy_address = ioaddr + MGMT;
 
-return 0;
+   SMC_SELECT_BANK(0);
+   tmp = inw(ioaddr + TCR);
+   ecmd->base.duplex = (tmp & TCR_FDUPLX) ? DUPLEX_FULL : DUPLEX_HALF;
+
+   ethtool_convert_legacy_u32_to_link_mode(ecmd->link_modes.supported,
+   supported);
+
+   return 0;
 }
 
-static int smc_netdev_set_ecmd(struct net_device *dev, struct ethtool_cmd *ecmd)
+static int smc_netdev_set_ecmd(struct net_device *dev,
+  const struct ethtool_link_ksettings *ecmd)
 {
-u16 tmp;
-unsigned int ioaddr = dev->base_addr;
+   u16 tmp;
+   unsigned int ioaddr = dev->base_addr;
 
-if (ethtool_cmd_speed(ecmd) != SPEED_10)
-   return -EINVAL;
-if (ecmd->duplex != DUPLEX_HALF && ecmd->duplex != DUPLEX_FULL)
-   return -EINVAL;
-if (ecmd->port != PORT_TP && ecmd->port != PORT_AUI)
-   return -EINVAL;
-if (ecmd->transceiver != XCVR_INTERNAL)
-   return -EINVAL;
+   if (ecmd->base.speed != SPEED_10)
+   return -EINVAL;
+   if (ecmd->base.duplex != DUPLEX_HALF &&
+   ecmd->base.duplex != DUPLEX_FULL)
+   return -EINVAL;
+   if (ecmd->base.port != PORT_TP && ecmd->base.port != PORT_AUI)
+   return -EINVAL;
 
-if (ecmd->port == PORT_AUI)
-   smc_set_xcvr(dev, 1);
-else
-   smc_set_xcvr(dev, 0);
+   if (ecmd->base.port == PORT_AUI)
+   smc_set_xcvr(dev, 1);
+   else
+   smc_set_xcvr(dev, 0);
 
-SMC_SELECT_BANK(0);
-tmp = inw(ioaddr + TCR);
-if (ecmd->duplex == DUPLEX_FULL)
-   tmp |= TCR_FDUPLX;
-else
-   tmp &= ~TCR_FDUPLX;
-outw(tmp, ioaddr + TCR);
-   
-return 0;
+   SMC_SELECT_BANK(0);
+   tmp = inw(ioaddr + TCR);
+   if (ecmd->base.duplex == DUPLEX_FULL)
+   tmp |= TCR_FDUPLX;
+   else
+   tmp &= ~TCR_FDUPLX;
+   outw(tmp, ioaddr + TCR);
+
+   return 0;
 }
 
 static int check_if_running(struct net_device *dev)
@@ -1908,7 +1912,8 @@ static void smc_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *info
strlcpy(info->version, DRV_VERSION, sizeof(info->version));
 }
 
-static int smc_get_settings(struct net_device *dev, struct ethtool_cmd *ecmd)
+static int smc_get_link_ksettings(struct net_device *dev,
+ struct ethtool_link_ksettings *ecmd)
 {
struct smc_private *smc = netdev_priv(dev);
unsigned int ioaddr = dev->base_addr;
@@ -1919,7 +1924,7 @@ static int smc_get_settings(struct net_device *dev, struct ethtool_cmd *ecmd)
spin_lock_irqsave(&smc->lock, flags);
SMC_SELECT_BANK(3);
if (smc->cfg & CFG_MII_SELECT)
-   ret = mii_ethtool_gset(&smc->mii_if, ecmd);
+   ret = mii_ethtool_get_link_ksettings(&smc->mii_if, ecmd);
else
ret = smc_netdev_get_ecmd(dev, ecmd);
SMC_SELECT_BANK(saved_bank);
@@ -1927,7 +1932,8 @@ static int smc_get_settings(struct net_device *dev, struct ethtool_cmd *ecmd)
return ret;
 }
 
-static int smc_set_settings(struct net_device 

[PATCH net] tcp: fix various issues for sockets morphing to listen state

2017-03-03 Thread Eric Dumazet
From: Eric Dumazet 

Dmitry Vyukov reported a divide by 0 triggered by syzkaller, exploiting
tcp_disconnect() path that was never really considered and/or used
before syzkaller ;)

I was not able to reproduce the bug, but it seems the issues here are
three possible actions that assumed they would never trigger on a
listener:

1) tcp_write_timer_handler
2) tcp_delack_timer_handler
3) MTU reduction

Only IPv6 MTU reduction was properly testing the TCP_CLOSE and TCP_LISTEN
states, in tcp_v6_mtu_reduced().


Signed-off-by: Eric Dumazet 
Reported-by: Dmitry Vyukov 
---
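Note for readers of the archive: the checks added by this patch all use the
same idiom, shifting 1 by the socket state and masking against a set of
TCPF_ flags, so several states are tested in one comparison. The sketch
below reproduces that idiom in standalone C. The DEMO_ constants are local
copies of a few state numbers made for illustration, and
demo_skip_handler() is a hypothetical name, not a kernel function.

```c
/* Local copies of a few TCP state numbers, for illustration only. */
enum {
	DEMO_TCP_ESTABLISHED = 1,
	DEMO_TCP_CLOSE       = 7,
	DEMO_TCP_LISTEN      = 10,
};

/* One flag bit per state, so a set of states becomes a bitmask. */
#define DEMO_TCPF_CLOSE  (1 << DEMO_TCP_CLOSE)
#define DEMO_TCPF_LISTEN (1 << DEMO_TCP_LISTEN)

/* Returns 1 when a handler should bail out early for this state. */
static int demo_skip_handler(int sk_state)
{
	return ((1 << sk_state) & (DEMO_TCPF_CLOSE | DEMO_TCPF_LISTEN)) != 0;
}
```

Adding another state to the early-exit set is then a matter of OR-ing in
one more flag.
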
 net/ipv4/tcp_ipv4.c  |7 +--
 net/ipv4/tcp_timer.c |6 --
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 9a89b8deafae1e9b2e8d1d9bc211c9c30b8dd8ec..8f3ec1365497a58972d31d419f88b34457b5ae39 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -279,10 +279,13 @@ EXPORT_SYMBOL(tcp_v4_connect);
  */
 void tcp_v4_mtu_reduced(struct sock *sk)
 {
-   struct dst_entry *dst;
struct inet_sock *inet = inet_sk(sk);
-   u32 mtu = tcp_sk(sk)->mtu_info;
+   struct dst_entry *dst;
+   u32 mtu;
 
+   if ((1 << sk->sk_state) & (TCPF_LISTEN | TCPF_CLOSE))
+   return;
+   mtu = tcp_sk(sk)->mtu_info;
dst = inet_csk_update_pmtu(sk, mtu);
if (!dst)
return;
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 40d893556e6701ace6a02903e53c45822d6fa56d..b2ab411c6d3728fa7dbdebde045532a7317f5166 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -249,7 +249,8 @@ void tcp_delack_timer_handler(struct sock *sk)
 
sk_mem_reclaim_partial(sk);
 
-   if (sk->sk_state == TCP_CLOSE || !(icsk->icsk_ack.pending & ICSK_ACK_TIMER))
+   if (((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN)) ||
+   !(icsk->icsk_ack.pending & ICSK_ACK_TIMER))
goto out;
 
if (time_after(icsk->icsk_ack.timeout, jiffies)) {
@@ -552,7 +553,8 @@ void tcp_write_timer_handler(struct sock *sk)
struct inet_connection_sock *icsk = inet_csk(sk);
int event;
 
-   if (sk->sk_state == TCP_CLOSE || !icsk->icsk_pending)
+   if (((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN)) ||
+   !icsk->icsk_pending)
goto out;
 
if (time_after(icsk->icsk_timeout, jiffies)) {




Re: [Bug 194749] New: kernel bonding does not work in a network nameservice in versions above 3.10.0-229.20.1

2017-03-03 Thread Jiri Pirko
Fri, Mar 03, 2017 at 04:19:13PM CET, nicolas.dich...@6wind.com wrote:
>Le 02/03/2017 à 21:39, Dan Geist a écrit :
>> - On Mar 2, 2017, at 3:11 PM, Cong Wang xiyou.wangc...@gmail.com wrote
>> 
>>> On Thu, Mar 2, 2017 at 10:32 AM, Stephen Hemminger
>>>  wrote:


 Begin forwarded message:

 Date: Wed, 01 Mar 2017 21:08:01 +
 From: bugzilla-dae...@bugzilla.kernel.org
 To: step...@networkplumber.org
 Subject: [Bug 194749] New: kernel bonding does not work in a network 
 nameservice
 in versions above 3.10.0-229.20.1


 https://bugzilla.kernel.org/show_bug.cgi?id=194749

 Bug ID: 194749
Summary: kernel bonding does not work in a network nameservice
 in versions above 3.10.0-229.20.1
Product: Networking
Version: 2.5
 Kernel Version: > 3.10.0-229.20.1
   Hardware: x86-64
 OS: Linux
   Tree: Mainline
 Status: NEW
   Severity: blocking
   Priority: P1
  Component: Other
   Assignee: step...@networkplumber.org
   Reporter: d...@polter.net
 Regression: No

 bond interface is being used in active/standby mode with two physical NICs
 inside a network nameservice to provide switchpath redundancy.

 netns is instantiated post-boot with the following:

 ip netns add vntp
 ip link set p4p1 netns vntp
 ip link set p4p2 netns vntp
 ip link set bond0 netns vntp
 ip netns exec vntp ip link set lo up
 ip netns exec vntp ip link set p4p1 up
 ip netns exec vntp ip link set p4p2 up
 ip netns exec vntp ip link set bond0 up
 ip netns exec vntp ifenslave bond0 p4p1 p4p2
>>>
>>> This is due to the following commit:
>>>
>>> commit f9399814927ad9bb995a6e109c2a5f9d8a848209
>>> Author: Weilong Chen 
>>> Date:   Wed Jan 22 17:16:30 2014 +0800
>>>
>>>bonding: Don't allow bond devices to change network namespaces.
>>>
>>>Like bridge, bonding as netdevice doesn't cross netns boundaries.
>>>
>>>Bonding ports and bonding itself live in same netns.
>>>
>>>Signed-off-by: Weilong Chen 
>>>Signed-off-by: David S. Miller 
>>>
>>>
>>> NETIF_F_NETNS_LOCAL was introduced for loopback device which
>>> is created for each netns, it is not clear why we need to add it to bond
>>> and bridge...
>> 
>> Thank you for tracking this down. Without digging through the code to figure 
>> it out, does this imply that the existence of a bond interface is not 
>> possible AT ALL within a netns or simply that it may not be "migrated" 
>> between the global scope and a netns?
>It means that the migration is not possible. I think the only reason to have
>this flag on bonding and bridge is the lack of test and fix. There is probably
>some work to be done to have this feature. But are there real use cases of
>x-netns bonding or x-netns bridge?

If that use case exists, I believe it is an abuse. Soft devices that are
by definition in upper-lower relationships with other devices should not
move to other namespaces; that prevents all kinds of issues. If you need a
soft device like a bridge or bond within a namespace, just create it there.



Re: net/ipv4: deadlock in ip_ra_control

2017-03-03 Thread Dmitry Vyukov
On Thu, Mar 2, 2017 at 10:40 AM, Dmitry Vyukov  wrote:
> On Wed, Mar 1, 2017 at 6:18 PM, Cong Wang  wrote:
>> On Wed, Mar 1, 2017 at 2:44 AM, Dmitry Vyukov  wrote:
>>> Hello,
>>>
>>> I've got the following deadlock report while running syzkaller fuzzer
>>> on linux-next/51788aebe7cae79cb334ad50641347465fc188fd:
>>>
>>> ==
>>> [ INFO: possible circular locking dependency detected ]
>>> 4.10.0-next-20170301+ #1 Not tainted
>>> ---
>>> syz-executor1/3394 is trying to acquire lock:
>>>  (sk_lock-AF_INET){+.+.+.}, at: [] lock_sock
>>> include/net/sock.h:1460 [inline]
>>>  (sk_lock-AF_INET){+.+.+.}, at: []
>>> do_ip_setsockopt.isra.12+0x21c/0x3540 net/ipv4/ip_sockglue.c:652
>>>
>>> but task is already holding lock:
>>>  (rtnl_mutex){+.+.+.}, at: [] rtnl_lock+0x17/0x20
>>> net/core/rtnetlink.c:70
>>>
>>> which lock already depends on the new lock.
>>>
>>>
>>> the existing dependency chain (in reverse order) is:
>>>
>>> -> #1 (rtnl_mutex){+.+.+.}:
>>>validate_chain kernel/locking/lockdep.c:2265 [inline]
>>>__lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3338
>>>lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
>>>__mutex_lock_common kernel/locking/mutex.c:754 [inline]
>>>__mutex_lock+0x172/0x1730 kernel/locking/mutex.c:891
>>>mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:906
>>>rtnl_lock+0x17/0x20 net/core/rtnetlink.c:70
>>>mrtsock_destruct+0x86/0x2c0 net/ipv4/ipmr.c:1281
>>>ip_ra_control+0x459/0x600 net/ipv4/ip_sockglue.c:372
>>>do_ip_setsockopt.isra.12+0x1064/0x3540 net/ipv4/ip_sockglue.c:1161
>>>ip_setsockopt+0x3a/0xb0 net/ipv4/ip_sockglue.c:1264
>>>raw_setsockopt+0xb7/0xd0 net/ipv4/raw.c:839
>>>sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2725
>>>SYSC_setsockopt net/socket.c:1786 [inline]
>>>SyS_setsockopt+0x25c/0x390 net/socket.c:1765
>>>entry_SYSCALL_64_fastpath+0x1f/0xc2
>>>
>>> -> #0 (sk_lock-AF_INET){+.+.+.}:
>>>check_prev_add kernel/locking/lockdep.c:1828 [inline]
>>>check_prevs_add+0xa8f/0x19f0 kernel/locking/lockdep.c:1938
>>>validate_chain kernel/locking/lockdep.c:2265 [inline]
>>>__lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3338
>>>lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
>>>lock_sock_nested+0xcb/0x120 net/core/sock.c:2530
>>>lock_sock include/net/sock.h:1460 [inline]
>>>do_ip_setsockopt.isra.12+0x21c/0x3540 net/ipv4/ip_sockglue.c:652
>>>ip_setsockopt+0x3a/0xb0 net/ipv4/ip_sockglue.c:1264
>>>tcp_setsockopt+0x82/0xd0 net/ipv4/tcp.c:2721
>>>sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2725
>>>SYSC_setsockopt net/socket.c:1786 [inline]
>>>SyS_setsockopt+0x25c/0x390 net/socket.c:1765
>>>entry_SYSCALL_64_fastpath+0x1f/0xc2
>>>
>>
>> Please try the attached patch (compile only).
>
>
> Pushed the patch to the bots.
> Thanks


This patch triggers:

[   57.748990] RTNL: assertion failed at net/ipv4/ipmr.c (1236)
[   57.749022] CPU: 1 PID: 5301 Comm: syz-executor2 Not tainted 4.10.0+ #15
[   57.749026] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[   57.749028] Call Trace:
[   57.749042]  dump_stack+0x2ee/0x3ef
[   57.749219]  mrtsock_destruct+0x27e/0x2f0
[   57.749241]  ip_ra_control+0x459/0x600
[   57.749287]  raw_close+0x19/0x30
[   57.749295]  inet_release+0xed/0x1c0
[   57.749303]  sock_release+0x8d/0x1e0
[   57.749316]  sock_close+0x16/0x20
[   57.749323]  __fput+0x332/0x7f0
[   57.749340]  fput+0x15/0x20
[   57.749347]  task_work_run+0x18a/0x260
[   57.749372]  do_exit+0x18ef/0x28b0
[   57.749641]  do_group_exit+0x149/0x420
[   57.749656]  get_signal+0x7e0/0x1820
[   57.749697]  do_signal+0xd2/0x2190
[   57.749746]  exit_to_usermode_loop+0x200/0x2a0
[   57.749758]  syscall_return_slowpath+0x4d3/0x570
[   57.749835]  entry_SYSCALL_64_fastpath+0xc0/0xc2
[   57.749840] RIP: 0033:0x44fb79
[   57.749843] RSP: 002b:7fbba84d9cf8 EFLAGS: 0246 ORIG_RAX:
00ca
[   57.749850] RAX: fe00 RBX: 00708218 RCX: 0044fb79
[   57.749854] RDX:  RSI:  RDI: 00708218
[   57.749857] RBP: 007081f8 R08:  R09: 
[   57.749860] R10:  R11: 0246 R12: 
[   57.749864] R13: 00a5fc57 R14: 7fbba84da9c0 R15: 000c
[   57.749964]
[   57.749966] ===
[   57.749967] [ INFO: suspicious RCU usage. ]
[   57.749971] 4.10.0+ #15 Not tainted
[   57.749972] ---
[   57.749975] net/ipv4/ipmr.c:1238 suspicious
rcu_dereference_protected() usage!
[   57.749977]
[   57.749977] other info that might help us debug 

Re: net/ipv4: division by 0 in tcp_select_window

2017-03-03 Thread Eric Dumazet
On Fri, 2017-03-03 at 10:25 -0800, Eric Dumazet wrote:
> On Fri, Mar 3, 2017 at 10:10 AM, Dmitry Vyukov  wrote:
> > Hello,
> >
> > The following program triggers division by 0 in tcp_select_window:
> >
> > https://gist.githubusercontent.com/dvyukov/ef28c0fd2ab57a655508ef7621b12e6c/raw/079011e2a9523a390b0621cbc1e5d9d5e637fd6d/gistfile1.txt
> 
> Yeah, tcp_disconnect() should never have existed in the first place.
> 
> We'll send a patch, unless you take care of this before us.

Could you try this first patch?

Probably others will also be needed.

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 40d893556e6701ace6a02903e53c45822d6fa56d..2187ebf1f270d19e6dd019b8f9df5eef8d018e03 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -552,7 +552,8 @@ void tcp_write_timer_handler(struct sock *sk)
struct inet_connection_sock *icsk = inet_csk(sk);
int event;
 
-   if (sk->sk_state == TCP_CLOSE || !icsk->icsk_pending)
+   if (((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN)) ||
+   !icsk->icsk_pending)
goto out;
 
if (time_after(icsk->icsk_timeout, jiffies)) {
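
For readers following along, here is a minimal userspace sketch of the state
test used in the hunk above. The state values mirror the kernel's TCP state
enum; the helper name write_timer_should_skip() is hypothetical and only
illustrates how one AND against a mask of per-state bits covers both CLOSE
and LISTEN. It is not the kernel code itself.

```c
#include <stdbool.h>

/* Values mirror the kernel's TCP state enum (include/net/tcp_states.h). */
enum {
	TCP_ESTABLISHED = 1,
	TCP_CLOSE       = 7,
	TCP_LISTEN      = 10,
};

/* One bit per state, so several states can be tested with a single AND. */
#define TCPF_CLOSE  (1 << TCP_CLOSE)
#define TCPF_LISTEN (1 << TCP_LISTEN)

/*
 * Hypothetical helper (not a kernel function): returns true when the
 * write timer handler should bail out early, i.e. the socket is in
 * CLOSE or LISTEN state, or no timer event is pending.
 */
bool write_timer_should_skip(int sk_state, int icsk_pending)
{
	if (((1 << sk_state) & (TCPF_CLOSE | TCPF_LISTEN)) || !icsk_pending)
		return true;
	return false;
}
```

Adding another state to the early-exit set would then only mean OR-ing one
more TCPF_* bit into the mask.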




[Patch net] strparser: destroy workqueue on module exit

2017-03-03 Thread Cong Wang
Fixes: 43a0c6751a32 ("strparser: Stream parser for messages")
Cc: Tom Herbert 
Signed-off-by: Cong Wang 
---
 net/strparser/strparser.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c
index 41adf36..b5c279b 100644
--- a/net/strparser/strparser.c
+++ b/net/strparser/strparser.c
@@ -504,6 +504,7 @@ static int __init strp_mod_init(void)
 
 static void __exit strp_mod_exit(void)
 {
+   destroy_workqueue(strp_wq);
 }
 module_init(strp_mod_init);
 module_exit(strp_mod_exit);
-- 
2.5.5



Re: [PATCH] selinux: check for address length in selinux_socket_bind()

2017-03-03 Thread Eric Dumazet
On Fri, Mar 3, 2017 at 9:23 AM, Alexander Potapenko  wrote:
> This happens because bind() unconditionally copies |size| bytes of
> |addr| to the kernel, leaving the rest uninitialized. Then
> security_socket_bind() reads the IP address bytes, including the
> uninitialized ones, to determine the port, or e.g. pass them further to
> sel_netnode_find(), which uses them to calculate a hash.
>
> Signed-off-by: Alexander Potapenko 
> ---
>  security/selinux/hooks.c | 9 +
>  1 file changed, 9 insertions(+)
>
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index 0a4b4b040e0a..eba54489b11b 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -4351,10 +4351,19 @@ static int selinux_socket_bind(struct socket *sock, struct sockaddr *address, in
> u32 sid, node_perm;
>
> if (family == PF_INET) {
> +   if (addrlen != sizeof(struct sockaddr_in)) {

Please take a look at inet_bind()

The correct test would be :

 if (addrlen < sizeof(struct sockaddr_in))
 err = -EINVAL;
...


> +   err = -EINVAL;
> +   goto out;
> +   }
> addr4 = (struct sockaddr_in *)address;
> snum = ntohs(addr4->sin_port);
> addrp = (char *)&addr4->sin_addr.s_addr;
> +
> } else {
> +   if (addrlen != sizeof(struct sockaddr_in6)) {

Look at inet6_bind()

if (addrlen < SIN6_LEN_RFC2133)

> +   err = -EINVAL;
> +   goto out;
> +   }
> addr6 = (struct sockaddr_in6 *)address;
> snum = ntohs(addr6->sin6_port);
> addrp = (char *)&addr6->sin6_addr.s6_addr;
> --
> 2.12.0.rc1.440.g5b76565f74-goog
>
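
The two length checks suggested in the review above can be sketched in
userspace roughly as follows. This is only an illustration under stated
assumptions: check_bind_addrlen() is a hypothetical helper, and
SKETCH_SIN6_LEN_RFC2133 is a local stand-in assumed to match the kernel's
SIN6_LEN_RFC2133 (struct sockaddr_in6 without sin6_scope_id, 24 bytes).

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Assumption: same value as the kernel's SIN6_LEN_RFC2133. */
#define SKETCH_SIN6_LEN_RFC2133 24

/*
 * Hypothetical helper: reject addresses that are too short to contain
 * the fields read afterwards (port and address). Longer addresses are
 * accepted, which is why the test is "<" rather than "!=".
 */
int check_bind_addrlen(int family, size_t addrlen)
{
	if (family == AF_INET) {
		if (addrlen < sizeof(struct sockaddr_in))
			return -EINVAL;
	} else if (family == AF_INET6) {
		if (addrlen < SKETCH_SIN6_LEN_RFC2133)
			return -EINVAL;
	}
	return 0;
}
```

With "!=" instead, a caller passing a larger buffer, which inet_bind() and
inet6_bind() accept, would be rejected by the hook.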


[PATCH net v2] xen-netback: fix race condition on XenBus disconnect

2017-03-03 Thread Igor Druzhinin
In some cases during XenBus disconnect event handling and subsequent
queue resource release there may be some TX handlers active on
other processors. Use RCU in order to synchronize with them.

Signed-off-by: Igor Druzhinin 
---
v2:
 * Add protection for xenvif_get_ethtool_stats
 * Additional comments and fixes
---
 drivers/net/xen-netback/interface.c | 29 ++---
 drivers/net/xen-netback/netback.c   |  2 +-
 drivers/net/xen-netback/xenbus.c| 20 ++--
 3 files changed, 33 insertions(+), 18 deletions(-)

diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index a2d32676..266b7cd 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -164,13 +164,17 @@ static int xenvif_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
struct xenvif *vif = netdev_priv(dev);
struct xenvif_queue *queue = NULL;
-   unsigned int num_queues = vif->num_queues;
+   unsigned int num_queues;
u16 index;
struct xenvif_rx_cb *cb;
 
BUG_ON(skb->dev != dev);
 
-   /* Drop the packet if queues are not set up */
+   /* Drop the packet if queues are not set up.
+* This handler should be called inside an RCU read section
+* so we don't need to enter it here explicitly.
+*/
+   num_queues = rcu_dereference(vif)->num_queues;
if (num_queues < 1)
goto drop;
 
@@ -221,18 +225,21 @@ static struct net_device_stats *xenvif_get_stats(struct net_device *dev)
 {
struct xenvif *vif = netdev_priv(dev);
struct xenvif_queue *queue = NULL;
+   unsigned int num_queues;
u64 rx_bytes = 0;
u64 rx_packets = 0;
u64 tx_bytes = 0;
u64 tx_packets = 0;
unsigned int index;
 
-   spin_lock(&vif->lock);
-   if (vif->queues == NULL)
+   rcu_read_lock();
+
+   num_queues = rcu_dereference(vif)->num_queues;
+   if (num_queues < 1)
goto out;
 
/* Aggregate tx and rx stats from each queue */
-   for (index = 0; index < vif->num_queues; ++index) {
+   for (index = 0; index < num_queues; ++index) {
queue = &vif->queues[index];
rx_bytes += queue->stats.rx_bytes;
rx_packets += queue->stats.rx_packets;
@@ -241,7 +248,7 @@ static struct net_device_stats *xenvif_get_stats(struct net_device *dev)
}
 
 out:
-   spin_unlock(&vif->lock);
+   rcu_read_unlock();
 
vif->dev->stats.rx_bytes = rx_bytes;
vif->dev->stats.rx_packets = rx_packets;
@@ -377,10 +384,16 @@ static void xenvif_get_ethtool_stats(struct net_device *dev,
 struct ethtool_stats *stats, u64 * data)
 {
struct xenvif *vif = netdev_priv(dev);
-   unsigned int num_queues = vif->num_queues;
+   unsigned int num_queues;
int i;
unsigned int queue_index;
 
+   rcu_read_lock();
+
+   num_queues = rcu_dereference(vif)->num_queues;
+   if (num_queues < 1)
+   goto out;
+
for (i = 0; i < ARRAY_SIZE(xenvif_stats); i++) {
unsigned long accum = 0;
for (queue_index = 0; queue_index < num_queues; ++queue_index) {
@@ -389,6 +402,8 @@ static void xenvif_get_ethtool_stats(struct net_device *dev,
}
data[i] = accum;
}
+out:
+   rcu_read_unlock();
 }
 
 static void xenvif_get_strings(struct net_device *dev, u32 stringset, u8 * data)
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index f9bcf4a..62fa74d 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -214,7 +214,7 @@ static void xenvif_fatal_tx_err(struct xenvif *vif)
netdev_err(vif->dev, "fatal error; disabling device\n");
vif->disabled = true;
/* Disable the vif from queue 0's kthread */
-   if (vif->queues)
+   if (vif->num_queues > 0)
xenvif_kick_thread(&vif->queues[0]);
 }
 
diff --git a/drivers/net/xen-netback/xenbus.c b/drivers/net/xen-netback/xenbus.c
index d2d7cd9..a56d3ea 100644
--- a/drivers/net/xen-netback/xenbus.c
+++ b/drivers/net/xen-netback/xenbus.c
@@ -495,26 +495,26 @@ static void backend_disconnect(struct backend_info *be)
struct xenvif *vif = be->vif;
 
if (vif) {
+   unsigned int num_queues = vif->num_queues;
unsigned int queue_index;
-   struct xenvif_queue *queues;
 
xen_unregister_watchers(vif);
 #ifdef CONFIG_DEBUG_FS
xenvif_debugfs_delif(vif);
 #endif /* CONFIG_DEBUG_FS */
xenvif_disconnect_data(vif);
-   for (queue_index = 0;
-queue_index < vif->num_queues;
-++queue_index)
-   xenvif_deinit_queue(&vif->queues[queue_index]);
 
-   spin_lock(&vif->lock);
-   

Re: net/kcm: use-after-free in kcm_wq

2017-03-03 Thread Cong Wang
On Fri, Mar 3, 2017 at 2:11 AM, Dmitry Vyukov  wrote:
> Also like this one:
>
> ==
> BUG: KASAN: use-after-free in atomic_long_read
> include/linux/compiler.h:254 [inline] at addr 8800538aba60
> BUG: KASAN: use-after-free in get_work_pool+0x2f2/0x340
> kernel/workqueue.c:709 at addr 8800538aba60
> Read of size 8 by task syz-executor6/7965
> CPU: 2 PID: 7965 Comm: syz-executor6 Not tainted 4.10.0+ #248
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:15 [inline]
>  dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
>  kasan_object_err+0x1c/0x70 mm/kasan/report.c:166
>  print_address_description mm/kasan/report.c:204 [inline]
>  kasan_report_error mm/kasan/report.c:288 [inline]
>  kasan_report.part.2+0x198/0x440 mm/kasan/report.c:310
>  kasan_report mm/kasan/report.c:331 [inline]
>  __asan_report_load8_noabort+0x29/0x30 mm/kasan/report.c:331
>  atomic_long_read include/linux/compiler.h:254 [inline]
>  get_work_pool+0x2f2/0x340 kernel/workqueue.c:709
>  __queue_work+0x2b3/0x1210 kernel/workqueue.c:1401
>  queue_work_on+0x2e9/0x330 kernel/workqueue.c:1486
>  queue_work include/linux/workqueue.h:487 [inline]
>  strp_check_rcv+0x25/0x30 net/strparser/strparser.c:494


It is not kcm_wq, it is strp_wq, and the work struct is strp->rx_work,
which lives in struct kcm_psock. The work is cancelled by strp_done(),
but it seems to get queued again after strp_done()...


>  kcm_attach net/kcm/kcmsock.c:1434 [inline]
>  kcm_attach_ioctl net/kcm/kcmsock.c:1455 [inline]
>  kcm_ioctl+0x8bb/0x1800 net/kcm/kcmsock.c:1690
>  sock_do_ioctl+0x65/0xb0 net/socket.c:895
>  sock_ioctl+0x2c2/0x440 net/socket.c:993
>  vfs_ioctl fs/ioctl.c:43 [inline]
>  do_vfs_ioctl+0x1bf/0x1790 fs/ioctl.c:683
>  SYSC_ioctl fs/ioctl.c:698 [inline]
>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:689
>  entry_SYSCALL_64_fastpath+0x1f/0xc2
> RIP: 0033:0x4458d9
> RSP: 002b:7f1dce9d1b58 EFLAGS: 0286 ORIG_RAX: 0010
> RAX: ffda RBX: 0024 RCX: 004458d9
> RDX: 20b68000 RSI: 89e0 RDI: 0024
> RBP: 006e0220 R08:  R09: 
> R10:  R11: 0286 R12: 007080a8
> R13:  R14: 7f1dce9d29c0 R15: 7f1dce9d2700
> Object at 8800538ab940, in cache kcm_psock_cache size: 616
> Allocated:
> PID = 7965
>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:502
>  set_track mm/kasan/kasan.c:514 [inline]
>  kasan_kmalloc+0xaa/0xd0 mm/kasan/kasan.c:605
>  kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:544
>  kmem_cache_alloc+0x102/0x680 mm/slab.c:3571
>  kmem_cache_zalloc include/linux/slab.h:653 [inline]
>  kcm_attach net/kcm/kcmsock.c:1384 [inline]
>  kcm_attach_ioctl net/kcm/kcmsock.c:1455 [inline]
>  kcm_ioctl+0x303/0x1800 net/kcm/kcmsock.c:1690
>  sock_do_ioctl+0x65/0xb0 net/socket.c:895
>  sock_ioctl+0x2c2/0x440 net/socket.c:993
>  vfs_ioctl fs/ioctl.c:43 [inline]
>  do_vfs_ioctl+0x1bf/0x1790 fs/ioctl.c:683
>  SYSC_ioctl fs/ioctl.c:698 [inline]
>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:689
>  entry_SYSCALL_64_fastpath+0x1f/0xc2
> Freed:
> PID = 7982
>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:502
>  set_track mm/kasan/kasan.c:514 [inline]
>  kasan_slab_free+0x6f/0xb0 mm/kasan/kasan.c:578
>  __cache_free mm/slab.c:3513 [inline]
>  kmem_cache_free+0x71/0x240 mm/slab.c:3773
>  kcm_unattach+0xee7/0x1520 net/kcm/kcmsock.c:1558
>  kcm_unattach_ioctl net/kcm/kcmsock.c:1603 [inline]
>  kcm_ioctl+0xfae/0x1800 net/kcm/kcmsock.c:1700
>  sock_do_ioctl+0x65/0xb0 net/socket.c:895
>  sock_ioctl+0x2c2/0x440 net/socket.c:993
>  vfs_ioctl fs/ioctl.c:43 [inline]
>  do_vfs_ioctl+0x1bf/0x1790 fs/ioctl.c:683
>  SYSC_ioctl fs/ioctl.c:698 [inline]
>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:689
>  entry_SYSCALL_64_fastpath+0x1f/0xc2
> Memory state around the buggy address:
>  8800538ab900: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
>  8800538ab980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>8800538aba00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>^
>  8800538aba80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>  8800538abb00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ==


Re: [PATCH 25/26] isdn: eicon: mark divascapi incompatible with kasan

2017-03-03 Thread Arnd Bergmann
On Fri, Mar 3, 2017 at 3:20 PM, Andrey Ryabinin  wrote:
>
>
> On 03/02/2017 07:38 PM, Arnd Bergmann wrote:
>> When CONFIG_KASAN is enabled, we have several functions that use rather
>> large kernel stacks, e.g.
>>
>> drivers/isdn/hardware/eicon/message.c: In function 'group_optimization':
>> drivers/isdn/hardware/eicon/message.c:14841:1: warning: the frame size of 
>> 864 bytes is larger than 500 bytes [-Wframe-larger-than=]
>> drivers/isdn/hardware/eicon/message.c: In function 'add_b1':
>> drivers/isdn/hardware/eicon/message.c:7925:1: warning: the frame size of 
>> 1008 bytes is larger than 500 bytes [-Wframe-larger-than=]
>> drivers/isdn/hardware/eicon/message.c: In function 'add_b23':
>> drivers/isdn/hardware/eicon/message.c:8551:1: warning: the frame size of 928 
>> bytes is larger than 500 bytes [-Wframe-larger-than=]
>> drivers/isdn/hardware/eicon/message.c: In function 'sig_ind':
>> drivers/isdn/hardware/eicon/message.c:6113:1: warning: the frame size of 
>> 2112 bytes is larger than 500 bytes [-Wframe-larger-than=]
>>
>> To be on the safe side, and to enable a lower frame size warning limit, let's
>> just mark this driver as broken when KASAN is in use. I have tried to reduce
>> the stack size as I did with dozens of other drivers, but failed to come up
>> with a good solution for this one.
>>
>
> This is a kinda radical solution.
> Wouldn't it be better to just increase -Wframe-larger-than for this driver
> through the Makefile?

I thought about that too, and decided on disabling the driver entirely,
since I suspected that not only is the per-function stack frame overly
large here but so is the depth of the call chain, which would then lead
us to hide an actual stack overflow.

Note that this driver is almost certainly broken: it hasn't seen any
updates other than style and compile-warning fixes in 10 years and
doesn't support any of the hardware introduced since 2002 (the company
still makes PCIe ISDN adapters, but the driver only supports legacy PCI
versions and older buses).

Arnd


[PATCH 0/4] Netfilter fixes for net

2017-03-03 Thread Pablo Neira Ayuso
Hi David,

The following patchset contains Netfilter fixes for your net tree,
they are:

1) Missing check for full sock in ip_route_me_harder(), from
   Florian Westphal.

2) Incorrect sip helper structure initialization that breaks it when
   several ports are used, from Christophe Leroy.

3) Fix incorrect assumption when looking up matches with adjacent
   intervals in nft_set_rbtree.

4) Fix broken netlink event error reporting in nf_tables that results
   in misleading ESRCH errors propagated to userspace listeners.

You can pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

Thanks!



The following changes since commit 2f44f75257d57f0d5668dba3a6ada0f4872132c9:

  Merge branch 'qed-fixes' (2017-02-27 09:22:10 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git HEAD

for you to fetch changes up to 25e94a997b324b5f167f56d56d7106d38b78c9de:

  netfilter: nf_tables: don't call nfnetlink_set_err() if nfnetlink_send() fails (2017-03-03 13:48:34 +0100)


Christophe Leroy (1):
  netfilter: nf_conntrack_sip: fix wrong memory initialisation

Florian Westphal (1):
  netfilter: use skb_to_full_sk in ip_route_me_harder

Pablo Neira Ayuso (2):
  netfilter: nft_set_rbtree: incorrect assumption on lower interval lookups
  netfilter: nf_tables: don't call nfnetlink_set_err() if nfnetlink_send() fails

 include/net/netfilter/nf_tables.h |   6 +-
 net/ipv4/netfilter.c  |   7 +-
 net/netfilter/nf_conntrack_sip.c  |   2 -
 net/netfilter/nf_tables_api.c | 133 --
 net/netfilter/nft_set_rbtree.c|   9 ++-
 5 files changed, 66 insertions(+), 91 deletions(-)


[PATCH 1/4] netfilter: use skb_to_full_sk in ip_route_me_harder

2017-03-03 Thread Pablo Neira Ayuso
From: Florian Westphal 

inet_sk(skb->sk) is illegal in case skb is attached to request socket.

Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener")
Reported-by: Daniel J Blueman 
Signed-off-by: Florian Westphal 
Tested-by: Daniel J Blueman 
Signed-off-by: Pablo Neira Ayuso 
---
 net/ipv4/netfilter.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/netfilter.c b/net/ipv4/netfilter.c
index b3cc1335adbc..c0cc6aa8cfaa 100644
--- a/net/ipv4/netfilter.c
+++ b/net/ipv4/netfilter.c
@@ -23,7 +23,8 @@ int ip_route_me_harder(struct net *net, struct sk_buff *skb, unsigned int addr_t
struct rtable *rt;
struct flowi4 fl4 = {};
__be32 saddr = iph->saddr;
-   __u8 flags = skb->sk ? inet_sk_flowi_flags(skb->sk) : 0;
+   const struct sock *sk = skb_to_full_sk(skb);
+   __u8 flags = sk ? inet_sk_flowi_flags(sk) : 0;
struct net_device *dev = skb_dst(skb)->dev;
unsigned int hh_len;
 
@@ -40,7 +41,7 @@ int ip_route_me_harder(struct net *net, struct sk_buff *skb, unsigned int addr_t
fl4.daddr = iph->daddr;
fl4.saddr = saddr;
fl4.flowi4_tos = RT_TOS(iph->tos);
-   fl4.flowi4_oif = skb->sk ? skb->sk->sk_bound_dev_if : 0;
+   fl4.flowi4_oif = sk ? sk->sk_bound_dev_if : 0;
if (!fl4.flowi4_oif)
fl4.flowi4_oif = l3mdev_master_ifindex(dev);
fl4.flowi4_mark = skb->mark;
@@ -61,7 +62,7 @@ int ip_route_me_harder(struct net *net, struct sk_buff *skb, unsigned int addr_t
xfrm_decode_session(skb, flowi4_to_flowi(&fl4), AF_INET) == 0) {
struct dst_entry *dst = skb_dst(skb);
skb_dst_set(skb, NULL);
-   dst = xfrm_lookup(net, dst, flowi4_to_flowi(&fl4), skb->sk, 0);
+   dst = xfrm_lookup(net, dst, flowi4_to_flowi(&fl4), sk, 0);
if (IS_ERR(dst))
return PTR_ERR(dst);
skb_dst_set(skb, dst);
-- 
2.1.4



[PATCH 4/4] netfilter: nf_tables: don't call nfnetlink_set_err() if nfnetlink_send() fails

2017-03-03 Thread Pablo Neira Ayuso
The underlying nlmsg_multicast() already sets sk->sk_err for us to
notify socket overruns, so we should not do anything with this return
value. So we just call nfnetlink_set_err() if:

1) We fail to allocate the netlink message.

or

2) We don't have enough space in the netlink message to place attributes,
   which means that we likely need to allocate a larger message.

Before this patch, the internal ESRCH netlink error code was propagated
to userspace, which is quite misleading. Netlink semantics mandate that
listeners just hit ENOBUFS if the socket buffer overruns.

Reported-by: Alexander Alemayhu 
Tested-by: Alexander Alemayhu 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_tables.h |   6 +-
 net/netfilter/nf_tables_api.c | 133 --
 2 files changed, 58 insertions(+), 81 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index ac84686aaafb..2aa8a9d80fbe 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -988,9 +988,9 @@ struct nft_object *nf_tables_obj_lookup(const struct nft_table *table,
const struct nlattr *nla, u32 objtype,
u8 genmask);
 
-int nft_obj_notify(struct net *net, struct nft_table *table,
-  struct nft_object *obj, u32 portid, u32 seq,
-  int event, int family, int report, gfp_t gfp);
+void nft_obj_notify(struct net *net, struct nft_table *table,
+   struct nft_object *obj, u32 portid, u32 seq,
+   int event, int family, int report, gfp_t gfp);
 
 /**
  * struct nft_object_type - stateful object type
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index ff7304ae58ac..5e0ccfd5bb37 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -461,16 +461,15 @@ static int nf_tables_fill_table_info(struct sk_buff *skb, struct net *net,
return -1;
 }
 
-static int nf_tables_table_notify(const struct nft_ctx *ctx, int event)
+static void nf_tables_table_notify(const struct nft_ctx *ctx, int event)
 {
struct sk_buff *skb;
int err;
 
if (!ctx->report &&
!nfnetlink_has_listeners(ctx->net, NFNLGRP_NFTABLES))
-   return 0;
+   return;
 
-   err = -ENOBUFS;
skb = nlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
if (skb == NULL)
goto err;
@@ -482,14 +481,11 @@ static int nf_tables_table_notify(const struct nft_ctx *ctx, int event)
goto err;
}
 
-   err = nfnetlink_send(skb, ctx->net, ctx->portid, NFNLGRP_NFTABLES,
-ctx->report, GFP_KERNEL);
+   nfnetlink_send(skb, ctx->net, ctx->portid, NFNLGRP_NFTABLES,
+  ctx->report, GFP_KERNEL);
+   return;
 err:
-   if (err < 0) {
-   nfnetlink_set_err(ctx->net, ctx->portid, NFNLGRP_NFTABLES,
- err);
-   }
-   return err;
+   nfnetlink_set_err(ctx->net, ctx->portid, NFNLGRP_NFTABLES, -ENOBUFS);
 }
 
 static int nf_tables_dump_tables(struct sk_buff *skb,
@@ -1050,16 +1046,15 @@ static int nf_tables_fill_chain_info(struct sk_buff *skb, struct net *net,
return -1;
 }
 
-static int nf_tables_chain_notify(const struct nft_ctx *ctx, int event)
+static void nf_tables_chain_notify(const struct nft_ctx *ctx, int event)
 {
struct sk_buff *skb;
int err;
 
if (!ctx->report &&
!nfnetlink_has_listeners(ctx->net, NFNLGRP_NFTABLES))
-   return 0;
+   return;
 
-   err = -ENOBUFS;
skb = nlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
if (skb == NULL)
goto err;
@@ -1072,14 +1067,11 @@ static int nf_tables_chain_notify(const struct nft_ctx *ctx, int event)
goto err;
}
 
-   err = nfnetlink_send(skb, ctx->net, ctx->portid, NFNLGRP_NFTABLES,
-ctx->report, GFP_KERNEL);
+   nfnetlink_send(skb, ctx->net, ctx->portid, NFNLGRP_NFTABLES,
+  ctx->report, GFP_KERNEL);
+   return;
 err:
-   if (err < 0) {
-   nfnetlink_set_err(ctx->net, ctx->portid, NFNLGRP_NFTABLES,
- err);
-   }
-   return err;
+   nfnetlink_set_err(ctx->net, ctx->portid, NFNLGRP_NFTABLES, -ENOBUFS);
 }
 
 static int nf_tables_dump_chains(struct sk_buff *skb,
@@ -1934,18 +1926,16 @@ static int nf_tables_fill_rule_info(struct sk_buff *skb, struct net *net,
return -1;
 }
 
-static int nf_tables_rule_notify(const struct nft_ctx *ctx,
-const struct nft_rule *rule,
-int event)
+static void nf_tables_rule_notify(const struct nft_ctx *ctx,
+ const struct 

[PATCH 3/4] netfilter: nft_set_rbtree: incorrect assumption on lower interval lookups

2017-03-03 Thread Pablo Neira Ayuso
In case of adjacent ranges, we may indeed see either the high part of
the range in first place or the low part of it. Remove this incorrect
assumption, let's make sure we annotate the low part of the interval in
case we have adjacent intervals so we hit a match in lookups.

Reported-by: Simon Hanisch 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nft_set_rbtree.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
index 71e8fb886a73..78dfbf9588b3 100644
--- a/net/netfilter/nft_set_rbtree.c
+++ b/net/netfilter/nft_set_rbtree.c
@@ -60,11 +60,10 @@ static bool nft_rbtree_lookup(const struct net *net, const struct nft_set *set,
d = memcmp(this, key, set->klen);
if (d < 0) {
parent = parent->rb_left;
-   /* In case of adjacent ranges, we always see the high
-* part of the range in first place, before the low one.
-* So don't update interval if the keys are equal.
-*/
-   if (interval && nft_rbtree_equal(set, this, interval))
+   if (interval &&
+   nft_rbtree_equal(set, this, interval) &&
+   nft_rbtree_interval_end(this) &&
+   !nft_rbtree_interval_end(interval))
continue;
interval = rbe;
} else if (d > 0)
-- 
2.1.4



[PATCH 2/4] netfilter: nf_conntrack_sip: fix wrong memory initialisation

2017-03-03 Thread Pablo Neira Ayuso
From: Christophe Leroy 

In commit 82de0be6862cd ("netfilter: Add helper array
register/unregister functions"),
struct nf_conntrack_helper sip[MAX_PORTS][4] was changed to
sip[MAX_PORTS * 4], so the memory init should have been changed to
memset(&sip[4 * i], 0, 4 * sizeof(sip[i]));

But as the sip[] table is allocated in the BSS, it is already set to 0

Fixes: 82de0be6862cd ("netfilter: Add helper array register/unregister functions")
Signed-off-by: Christophe Leroy 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_conntrack_sip.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/netfilter/nf_conntrack_sip.c b/net/netfilter/nf_conntrack_sip.c
index 24174c520239..0d17894798b5 100644
--- a/net/netfilter/nf_conntrack_sip.c
+++ b/net/netfilter/nf_conntrack_sip.c
@@ -1628,8 +1628,6 @@ static int __init nf_conntrack_sip_init(void)
ports[ports_c++] = SIP_PORT;
 
for (i = 0; i < ports_c; i++) {
-   memset(&sip[i], 0, sizeof(sip[i]));
-
nf_ct_helper_init(&sip[4 * i], AF_INET, IPPROTO_UDP, "sip",
  SIP_PORT, ports[i], i, sip_exp_policy,
  SIP_EXPECT_MAX,
-- 
2.1.4
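
As a rough illustration of the indexing described in the commit message
above, here is a small userspace sketch. The struct, the constants and the
helper names are hypothetical stand-ins, not the conntrack types; the sketch
only shows that with a flat [MAX_PORTS * 4] table the slots of port i start
at index 4 * i and span four elements, and that an object with static
storage duration starts out zeroed without any memset().

```c
#include <stddef.h>
#include <string.h>

#define MAX_PORTS        8	/* assumption for the sketch */
#define HELPERS_PER_PORT 4	/* four helpers are registered per port */

/* Hypothetical stand-in for struct nf_conntrack_helper. */
struct helper {
	int port;
	int family;
};

/* Static storage duration: zero-initialized before any code runs. */
static struct helper sip[MAX_PORTS * HELPERS_PER_PORT];

/* First of the four consecutive slots that belong to port index i. */
struct helper *port_helpers(size_t i)
{
	return &sip[HELPERS_PER_PORT * i];
}

/* What a correct per-port clear would look like for the flat layout. */
void clear_port_helpers(size_t i)
{
	memset(&sip[HELPERS_PER_PORT * i], 0,
	       HELPERS_PER_PORT * sizeof(sip[0]));
}
```

By contrast, clearing &sip[i] with sizeof(sip[i]) touches a single slot,
and for i >= 1 that slot was already set up for an earlier port (slot 1
belongs to port 0), which is presumably how the helper broke when several
ports were used.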



Re: net: heap out-of-bounds in fib6_clean_node/rt6_fill_node/fib6_age/fib6_prune_clone

2017-03-03 Thread David Ahern
On 3/3/17 6:39 AM, Dmitry Vyukov wrote:
> I am getting heap out-of-bounds reports in
> fib6_clean_node/rt6_fill_node/fib6_age/fib6_prune_clone while running
> syzkaller fuzzer on 86292b33d4b79ee03e2f43ea0381ef85f077c760. They all
> follow the same pattern: an object of size 216 is allocated from
> ip_dst_cache slab, and then accessed at offset 272/276 within
> fib6_walk. Looks like type confusion. Unfortunately this is not
> reproducible.

I'll take a look this weekend or Monday at the latest.


Re: net: heap out-of-bounds in fib6_clean_node/rt6_fill_node/fib6_age/fib6_prune_clone

2017-03-03 Thread Dmitry Vyukov
On Fri, Mar 3, 2017 at 8:12 PM, David Ahern  wrote:
> On 3/3/17 6:39 AM, Dmitry Vyukov wrote:
>> I am getting heap out-of-bounds reports in
>> fib6_clean_node/rt6_fill_node/fib6_age/fib6_prune_clone while running
>> syzkaller fuzzer on 86292b33d4b79ee03e2f43ea0381ef85f077c760. They all
>> follow the same pattern: an object of size 216 is allocated from
>> ip_dst_cache slab, and then accessed at offset 272/276 within
>> fib6_walk. Looks like type confusion. Unfortunately this is not
>> reproducible.
>
> I'll take a look this weekend or Monday at the latest.


This is not from fib6_walk, but looks like the same problem:

==
BUG: KASAN: slab-out-of-bounds in find_rr_leaf net/ipv6/route.c:722
[inline] at addr 88004afe6f68
BUG: KASAN: slab-out-of-bounds in rt6_select net/ipv6/route.c:758
[inline] at addr 88004afe6f68
BUG: KASAN: slab-out-of-bounds in ip6_pol_route+0x19ff/0x1f30
net/ipv6/route.c:1091 at addr 88004afe6f68
Read of size 4 by task syz-executor0/24839
CPU: 1 PID: 24839 Comm: syz-executor0 Not tainted 4.10.0+ #248
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:15 [inline]
 dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
 kasan_object_err+0x1c/0x70 mm/kasan/report.c:166
 print_address_description mm/kasan/report.c:204 [inline]
 kasan_report_error mm/kasan/report.c:288 [inline]
 kasan_report.part.2+0x198/0x440 mm/kasan/report.c:310
 kasan_report mm/kasan/report.c:330 [inline]
 __asan_report_load4_noabort+0x29/0x30 mm/kasan/report.c:330
 find_rr_leaf net/ipv6/route.c:722 [inline]
 rt6_select net/ipv6/route.c:758 [inline]
 ip6_pol_route+0x19ff/0x1f30 net/ipv6/route.c:1091
 ip6_pol_route_output+0x4c/0x60 net/ipv6/route.c:1212
 fib6_rule_lookup+0x52/0x150 net/ipv6/ip6_fib.c:291
 ip6_route_output_flags+0x1f1/0x2b0 net/ipv6/route.c:1240
 ip6_route_output include/net/ip6_route.h:79 [inline]
 ip6_dst_lookup_tail+0x4fb/0x990 net/ipv6/ip6_output.c:954
 ip6_dst_lookup+0x4b/0x60 net/ipv6/ip6_output.c:1056
 icmpv6_route_lookup+0x107/0x750 net/ipv6/icmp.c:347
 icmp6_send+0x145e/0x24d0 net/ipv6/icmp.c:536
 icmpv6_send+0x12e/0x260 net/ipv6/ip6_icmp.c:42
 ip6_fragment+0x57f/0x38a0 net/ipv6/ip6_output.c:865
 ip6_finish_output+0x319/0x950 net/ipv6/ip6_output.c:147
 NF_HOOK_COND include/linux/netfilter.h:246 [inline]
 ip6_output+0x1cb/0x8c0 net/ipv6/ip6_output.c:163
 dst_output include/net/dst.h:486 [inline]
 ip6_local_out+0x95/0x170 net/ipv6/output_core.c:172
 ip6_send_skb+0xa1/0x340 net/ipv6/ip6_output.c:1734
 ip6_push_pending_frames+0xb3/0xe0 net/ipv6/ip6_output.c:1754
 rawv6_push_pending_frames net/ipv6/raw.c:613 [inline]
 rawv6_sendmsg+0x2e10/0x3fd0 net/ipv6/raw.c:930
 inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:761
 sock_sendmsg_nosec net/socket.c:633 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:643
 SYSC_sendto+0x660/0x810 net/socket.c:1685
 SyS_sendto+0x40/0x50 net/socket.c:1653
 entry_SYSCALL_64_fastpath+0x1f/0xc2
RIP: 0033:0x4458d9
RSP: 002b:7f227bcfab58 EFLAGS: 0282 ORIG_RAX: 002c
RAX: ffda RBX: 0006 RCX: 004458d9
RDX: 1001 RSI: 20725000 RDI: 0006
RBP: 006e1bb0 R08: 201ccff8 R09: 0018
R10: 00404004 R11: 0282 R12: 00708000
R13: 20001ff7 R14: 0003 R15: 00060040
Object at 88004afe6e00, in cache ip_dst_cache size: 216
Allocated:
PID = 1307
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
 save_stack+0x43/0xd0 mm/kasan/kasan.c:502
 set_track mm/kasan/kasan.c:514 [inline]
 kasan_kmalloc+0xaa/0xd0 mm/kasan/kasan.c:605
 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:544
 kmem_cache_alloc+0x102/0x680 mm/slab.c:3571
 dst_alloc+0x11b/0x1a0 net/core/dst.c:209
 rt_dst_alloc+0xf0/0x580 net/ipv4/route.c:1482
 ip_route_input_slow+0xdf2/0x2160 net/ipv4/route.c:1935
 ip_route_input_noref+0x137/0x10e0 net/ipv4/route.c:2056
 ip_rcv_finish+0x301/0x1b40 net/ipv4/ip_input.c:344
 NF_HOOK include/linux/netfilter.h:257 [inline]
 ip_rcv+0xd75/0x19a0 net/ipv4/ip_input.c:487
 __netif_receive_skb_core+0x1ac8/0x33f0 net/core/dev.c:4179
 __netif_receive_skb+0x2a/0x170 net/core/dev.c:4217
 netif_receive_skb_internal+0xf0/0x400 net/core/dev.c:4245
 napi_skb_finish net/core/dev.c:4602 [inline]
 napi_gro_receive+0x4d4/0x670 net/core/dev.c:4636
 e1000_receive_skb drivers/net/ethernet/intel/e1000/e1000_main.c:4033 [inline]
 e1000_clean_rx_irq+0x5e0/0x1490
drivers/net/ethernet/intel/e1000/e1000_main.c:4489
 e1000_clean+0xb94/0x2920 drivers/net/ethernet/intel/e1000/e1000_main.c:3834
 napi_poll net/core/dev.c:5171 [inline]
 net_rx_action+0xeb4/0x1580 net/core/dev.c:5236
 __do_softirq+0x31f/0xbe7 kernel/softirq.c:284
Freed:
PID = 22752
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
 save_stack+0x43/0xd0 mm/kasan/kasan.c:502
 set_track mm/kasan/kasan.c:514 [inline]
 kasan_slab_free+0x6f/0xb0 

Re: net/ipv4: deadlock in ip_ra_control

2017-03-03 Thread Dmitry Vyukov
On Fri, Mar 3, 2017 at 7:43 PM, Dmitry Vyukov  wrote:
> On Thu, Mar 2, 2017 at 10:40 AM, Dmitry Vyukov  wrote:
>> On Wed, Mar 1, 2017 at 6:18 PM, Cong Wang  wrote:
>>> On Wed, Mar 1, 2017 at 2:44 AM, Dmitry Vyukov  wrote:
>>>> Hello,
>>>>
>>>> I've got the following deadlock report while running syzkaller fuzzer
>>>> on linux-next/51788aebe7cae79cb334ad50641347465fc188fd:
>>>>
>>>> ==
>>>> [ INFO: possible circular locking dependency detected ]
>>>> 4.10.0-next-20170301+ #1 Not tainted
>>>> ---
>>>> syz-executor1/3394 is trying to acquire lock:
>>>>  (sk_lock-AF_INET){+.+.+.}, at: [] lock_sock
>>>> include/net/sock.h:1460 [inline]
>>>>  (sk_lock-AF_INET){+.+.+.}, at: []
>>>> do_ip_setsockopt.isra.12+0x21c/0x3540 net/ipv4/ip_sockglue.c:652
>>>>
>>>> but task is already holding lock:
>>>>  (rtnl_mutex){+.+.+.}, at: [] rtnl_lock+0x17/0x20
>>>> net/core/rtnetlink.c:70
>>>>
>>>> which lock already depends on the new lock.
>>>>
>>>>
>>>> the existing dependency chain (in reverse order) is:
>>>>
>>>> -> #1 (rtnl_mutex){+.+.+.}:
>>>> validate_chain kernel/locking/lockdep.c:2265 [inline]
>>>> __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3338
>>>> lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
>>>> __mutex_lock_common kernel/locking/mutex.c:754 [inline]
>>>> __mutex_lock+0x172/0x1730 kernel/locking/mutex.c:891
>>>> mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:906
>>>> rtnl_lock+0x17/0x20 net/core/rtnetlink.c:70
>>>> mrtsock_destruct+0x86/0x2c0 net/ipv4/ipmr.c:1281
>>>> ip_ra_control+0x459/0x600 net/ipv4/ip_sockglue.c:372
>>>> do_ip_setsockopt.isra.12+0x1064/0x3540 net/ipv4/ip_sockglue.c:1161
>>>> ip_setsockopt+0x3a/0xb0 net/ipv4/ip_sockglue.c:1264
>>>> raw_setsockopt+0xb7/0xd0 net/ipv4/raw.c:839
>>>> sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2725
>>>> SYSC_setsockopt net/socket.c:1786 [inline]
>>>> SyS_setsockopt+0x25c/0x390 net/socket.c:1765
>>>> entry_SYSCALL_64_fastpath+0x1f/0xc2
>>>>
>>>> -> #0 (sk_lock-AF_INET){+.+.+.}:
>>>> check_prev_add kernel/locking/lockdep.c:1828 [inline]
>>>> check_prevs_add+0xa8f/0x19f0 kernel/locking/lockdep.c:1938
>>>> validate_chain kernel/locking/lockdep.c:2265 [inline]
>>>> __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3338
>>>> lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
>>>> lock_sock_nested+0xcb/0x120 net/core/sock.c:2530
>>>> lock_sock include/net/sock.h:1460 [inline]
>>>> do_ip_setsockopt.isra.12+0x21c/0x3540 net/ipv4/ip_sockglue.c:652
>>>> ip_setsockopt+0x3a/0xb0 net/ipv4/ip_sockglue.c:1264
>>>> tcp_setsockopt+0x82/0xd0 net/ipv4/tcp.c:2721
>>>> sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2725
>>>> SYSC_setsockopt net/socket.c:1786 [inline]
>>>> SyS_setsockopt+0x25c/0x390 net/socket.c:1765
>>>> entry_SYSCALL_64_fastpath+0x1f/0xc2
>>>>
>>>
>>> Please try the attached patch (compile only).
>>
>>
>> Pushed the patch to the bots.
>> Thanks
>
>
> This patch triggers:
>
> [   57.748990] RTNL: assertion failed at net/ipv4/ipmr.c (1236)
> [   57.749022] CPU: 1 PID: 5301 Comm: syz-executor2 Not tainted 4.10.0+ #15
> [   57.749026] Hardware name: Google Google Compute Engine/Google
> Compute Engine, BIOS Google 01/01/2011
> [   57.749028] Call Trace:
> [   57.749042]  dump_stack+0x2ee/0x3ef
> [   57.749219]  mrtsock_destruct+0x27e/0x2f0
> [   57.749241]  ip_ra_control+0x459/0x600
> [   57.749287]  raw_close+0x19/0x30
> [   57.749295]  inet_release+0xed/0x1c0
> [   57.749303]  sock_release+0x8d/0x1e0
> [   57.749316]  sock_close+0x16/0x20
> [   57.749323]  __fput+0x332/0x7f0
> [   57.749340]  fput+0x15/0x20
> [   57.749347]  task_work_run+0x18a/0x260
> [   57.749372]  do_exit+0x18ef/0x28b0
> [   57.749641]  do_group_exit+0x149/0x420
> [   57.749656]  get_signal+0x7e0/0x1820
> [   57.749697]  do_signal+0xd2/0x2190
> [   57.749746]  exit_to_usermode_loop+0x200/0x2a0
> [   57.749758]  syscall_return_slowpath+0x4d3/0x570
> [   57.749835]  entry_SYSCALL_64_fastpath+0xc0/0xc2
> [   57.749840] RIP: 0033:0x44fb79
> [   57.749843] RSP: 002b:7fbba84d9cf8 EFLAGS: 0246 ORIG_RAX:
> 00ca
> [   57.749850] RAX: fe00 RBX: 00708218 RCX: 
> 0044fb79
> [   57.749854] RDX:  RSI:  RDI: 
> 00708218
> [   57.749857] RBP: 007081f8 R08:  R09: 
> 
> [   57.749860] R10:  R11: 0246 R12: 
> 
> [   57.749864] R13: 00a5fc57 R14: 7fbba84da9c0 R15: 
> 000c
> [   57.749964]
> [   57.749966] ===
> [   57.749967] [ INFO: suspicious RCU usage. ]
> [   

Re: net/ipv4: division by 0 in tcp_select_window

2017-03-03 Thread Eric Dumazet
On Fri, Mar 3, 2017 at 10:24 AM, Dmitry Vyukov  wrote:
> On Fri, Mar 3, 2017 at 7:10 PM, Dmitry Vyukov  wrote:
>> Hello,
>>

> Wonder if this has been causing other crashes like this one?
>
> [ cut here ]
> kernel BUG at net/ipv4/tcp_output.c:2748!
> Call Trace:
>  
>  tcp_retransmit_skb+0x2e/0x230 net/ipv4/tcp_output.c:2822
>  tcp_retransmit_timer+0x104c/0x2d50 net/ipv4/tcp_timer.c:491
>  tcp_write_timer_handler+0x334/0x9d0 net/ipv4/tcp_timer.c:574
>  tcp_write_timer+0x164/0x180 net/ipv4/tcp_timer.c:592
>  call_timer_fn+0x241/0x820 kernel/time/timer.c:1266
>  expire_timers kernel/time/timer.c:1305 [inline]
>  __run_timers+0x960/0xcf0 kernel/time/timer.c:1599
>  run_timer_softirq+0x21/0x80 kernel/time/timer.c:1612
>  __do_softirq+0x31f/0xbe7 kernel/softirq.c:284
>  invoke_softirq kernel/softirq.c:364 [inline]
>  irq_exit+0x1cc/0x200 kernel/softirq.c:405
>  exiting_irq arch/x86/include/asm/apic.h:658 [inline]
>  smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:962
>  apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:487
>
> if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
>   if (before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
> BUG();

This path uses a socket lock. Probably different problem.


net/ipv4: division by 0 in tcp_select_window

2017-03-03 Thread Dmitry Vyukov
Hello,

The following program triggers division by 0 in tcp_select_window:

https://gist.githubusercontent.com/dvyukov/ef28c0fd2ab57a655508ef7621b12e6c/raw/079011e2a9523a390b0621cbc1e5d9d5e637fd6d/gistfile1.txt

divide error:  [#1] SMP KASAN
Modules linked in:
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.10.0+ #270
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: 88006c236340 task.stack: 88006c248000
RIP: 0010:__tcp_select_window+0x6db/0x920 net/ipv4/tcp_output.c:2585
RSP: 0018:88006cf86b40 EFLAGS: 00010206
RAX: 00c4 RBX: 88006cf86cd8 RCX: dc00
RDX:  RSI:  RDI: 8800686228bd
RBP: 88006cf86d00 R08:  R09: 
R10: 0004 R11: ed000d9f0e18 R12: 00c4
R13: a700 R14: 880068622040 R15: 
FS:  () GS:88006cf8() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 20e5cff8 CR3: 05021000 CR4: 001406e0
Call Trace:
 
 tcp_select_window net/ipv4/tcp_output.c:270 [inline]
 tcp_transmit_skb+0xc35/0x3460 net/ipv4/tcp_output.c:1014
 tcp_xmit_probe_skb+0x36d/0x440 net/ipv4/tcp_output.c:3528
 tcp_write_wakeup+0x23b/0x6d0 net/ipv4/tcp_output.c:3577
 tcp_send_probe0+0xbf/0x5d0 net/ipv4/tcp_output.c:3593
 tcp_probe_timer net/ipv4/tcp_timer.c:362 [inline]
 tcp_write_timer_handler+0x849/0x9d0 net/ipv4/tcp_timer.c:578
 tcp_write_timer+0x164/0x180 net/ipv4/tcp_timer.c:592
 call_timer_fn+0x241/0x820 kernel/time/timer.c:1266
 expire_timers kernel/time/timer.c:1305 [inline]
 __run_timers+0x960/0xcf0 kernel/time/timer.c:1599
 run_timer_softirq+0x21/0x80 kernel/time/timer.c:1612
 __do_softirq+0x31f/0xbe7 kernel/softirq.c:284
 invoke_softirq kernel/softirq.c:364 [inline]
 irq_exit+0x1cc/0x200 kernel/softirq.c:405
 exiting_irq arch/x86/include/asm/apic.h:658 [inline]
 smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:962
 apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:487
RIP: 0010:native_safe_halt+0x6/0x10 arch/x86/include/asm/irqflags.h:53
RSP: 0018:88006c24fc10 EFLAGS: 0282 ORIG_RAX: ff10
RAX: dc00 RBX: 11000d849f85 RCX: 
RDX: 10a18ebc RSI: 0001 RDI: 850c75e0
RBP: 88006c24fc10 R08: 88007fff70dc R09: 
R10:  R11:  R12: 11000d849fa9
R13: 88006c24fcc8 R14: 856972b8 R15: 88006c24fe68
 
 arch_safe_halt arch/x86/include/asm/paravirt.h:98 [inline]
 default_idle+0xbf/0x440 arch/x86/kernel/process.c:271
 arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:262
 default_idle_call+0x36/0x90 kernel/sched/idle.c:96
 cpuidle_idle_call kernel/sched/idle.c:154 [inline]
 do_idle+0x373/0x520 kernel/sched/idle.c:243
 cpu_startup_entry+0x18/0x20 kernel/sched/idle.c:345
 start_secondary+0x36c/0x460 arch/x86/kernel/smpboot.c:272
 start_cpu+0x14/0x14 arch/x86/kernel/head_64.S:306
Code: 5d c3 e8 99 0d e5 fd 45 85 ff 44 89 bd 74 fe ff ff 0f 8f 14 fc
ff ff 45 31 e4 eb 93 e8 7f 0d e5 fd 8b b5 74 fe ff ff 44 89 e0 99 
fe 41 89 c4 44 0f af e6 e9 71 ff ff ff e8 62 0d e5 fd 4c 89
RIP: __tcp_select_window+0x6db/0x920 net/ipv4/tcp_output.c:2585 RSP:
88006cf86b40
---[ end trace 5efcbe8231e36800 ]---

On commit c82be9d2244aacea9851c86f4fb74694c99cd874.

The guy that resets mss seems to be
inet_csk_listen_start->inet_csk_delack_init. After that the timer
fires and divides by icsk->icsk_ack.rcv_mss==0.
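
For illustration only: the report above ends in a division by icsk->icsk_ack.rcv_mss after that field was reset to 0. The arithmetic involved, rounding a window down to a multiple of the MSS, can be sketched in userspace as below. The helper name round_window_to_mss and its parameters are hypothetical, and the zero guard is an assumption about one defensive option, not a proposed kernel fix.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not kernel code): round free_space down to a
 * multiple of mss. A zero mss would make the division trap, so it is
 * rejected up front and a zero window is returned instead.
 */
static uint32_t round_window_to_mss(uint32_t free_space, uint32_t mss)
{
	if (mss == 0)
		return 0;
	return (free_space / mss) * mss;
}
```

Calling round_window_to_mss(3000, 1460) yields 2920, while a zero mss yields 0 rather than a divide error.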


Re: net/ipv4: division by 0 in tcp_select_window

2017-03-03 Thread Eric Dumazet
On Fri, Mar 3, 2017 at 10:10 AM, Dmitry Vyukov  wrote:
> Hello,
>
> The following program triggers division by 0 in tcp_select_window:
>
> https://gist.githubusercontent.com/dvyukov/ef28c0fd2ab57a655508ef7621b12e6c/raw/079011e2a9523a390b0621cbc1e5d9d5e637fd6d/gistfile1.txt

Yeah, tcp_disconnect() should never have existed in the first place.

We'll send a patch, unless you take care of this before us.

Thanks.


Re: net/ipv4: division by 0 in tcp_select_window

2017-03-03 Thread Dmitry Vyukov
On Fri, Mar 3, 2017 at 7:10 PM, Dmitry Vyukov  wrote:
> Hello,
>
> The following program triggers division by 0 in tcp_select_window:
>
> https://gist.githubusercontent.com/dvyukov/ef28c0fd2ab57a655508ef7621b12e6c/raw/079011e2a9523a390b0621cbc1e5d9d5e637fd6d/gistfile1.txt
>
> divide error:  [#1] SMP KASAN
> Modules linked in:
> CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.10.0+ #270
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> task: 88006c236340 task.stack: 88006c248000
> RIP: 0010:__tcp_select_window+0x6db/0x920 net/ipv4/tcp_output.c:2585
> RSP: 0018:88006cf86b40 EFLAGS: 00010206
> RAX: 00c4 RBX: 88006cf86cd8 RCX: dc00
> RDX:  RSI:  RDI: 8800686228bd
> RBP: 88006cf86d00 R08:  R09: 
> R10: 0004 R11: ed000d9f0e18 R12: 00c4
> R13: a700 R14: 880068622040 R15: 
> FS:  () GS:88006cf8() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 20e5cff8 CR3: 05021000 CR4: 001406e0
> Call Trace:
>  
>  tcp_select_window net/ipv4/tcp_output.c:270 [inline]
>  tcp_transmit_skb+0xc35/0x3460 net/ipv4/tcp_output.c:1014
>  tcp_xmit_probe_skb+0x36d/0x440 net/ipv4/tcp_output.c:3528
>  tcp_write_wakeup+0x23b/0x6d0 net/ipv4/tcp_output.c:3577
>  tcp_send_probe0+0xbf/0x5d0 net/ipv4/tcp_output.c:3593
>  tcp_probe_timer net/ipv4/tcp_timer.c:362 [inline]
>  tcp_write_timer_handler+0x849/0x9d0 net/ipv4/tcp_timer.c:578
>  tcp_write_timer+0x164/0x180 net/ipv4/tcp_timer.c:592
>  call_timer_fn+0x241/0x820 kernel/time/timer.c:1266
>  expire_timers kernel/time/timer.c:1305 [inline]
>  __run_timers+0x960/0xcf0 kernel/time/timer.c:1599
>  run_timer_softirq+0x21/0x80 kernel/time/timer.c:1612
>  __do_softirq+0x31f/0xbe7 kernel/softirq.c:284
>  invoke_softirq kernel/softirq.c:364 [inline]
>  irq_exit+0x1cc/0x200 kernel/softirq.c:405
>  exiting_irq arch/x86/include/asm/apic.h:658 [inline]
>  smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:962
>  apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:487
> RIP: 0010:native_safe_halt+0x6/0x10 arch/x86/include/asm/irqflags.h:53
> RSP: 0018:88006c24fc10 EFLAGS: 0282 ORIG_RAX: ff10
> RAX: dc00 RBX: 11000d849f85 RCX: 
> RDX: 10a18ebc RSI: 0001 RDI: 850c75e0
> RBP: 88006c24fc10 R08: 88007fff70dc R09: 
> R10:  R11:  R12: 11000d849fa9
> R13: 88006c24fcc8 R14: 856972b8 R15: 88006c24fe68
>  
>  arch_safe_halt arch/x86/include/asm/paravirt.h:98 [inline]
>  default_idle+0xbf/0x440 arch/x86/kernel/process.c:271
>  arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:262
>  default_idle_call+0x36/0x90 kernel/sched/idle.c:96
>  cpuidle_idle_call kernel/sched/idle.c:154 [inline]
>  do_idle+0x373/0x520 kernel/sched/idle.c:243
>  cpu_startup_entry+0x18/0x20 kernel/sched/idle.c:345
>  start_secondary+0x36c/0x460 arch/x86/kernel/smpboot.c:272
>  start_cpu+0x14/0x14 arch/x86/kernel/head_64.S:306
> Code: 5d c3 e8 99 0d e5 fd 45 85 ff 44 89 bd 74 fe ff ff 0f 8f 14 fc
> ff ff 45 31 e4 eb 93 e8 7f 0d e5 fd 8b b5 74 fe ff ff 44 89 e0 99 
> fe 41 89 c4 44 0f af e6 e9 71 ff ff ff e8 62 0d e5 fd 4c 89
> RIP: __tcp_select_window+0x6db/0x920 net/ipv4/tcp_output.c:2585 RSP:
> 88006cf86b40
> ---[ end trace 5efcbe8231e36800 ]---
>
> On commit c82be9d2244aacea9851c86f4fb74694c99cd874.
>
> The guy that resets mss seems to be
> inet_csk_listen_start->inet_csk_delack_init. After that the timer
> fires and divides by icsk->icsk_ack.rcv_mss==0.


Wonder if this has been causing other crashes like this one?

[ cut here ]
kernel BUG at net/ipv4/tcp_output.c:2748!
Call Trace:
 
 tcp_retransmit_skb+0x2e/0x230 net/ipv4/tcp_output.c:2822
 tcp_retransmit_timer+0x104c/0x2d50 net/ipv4/tcp_timer.c:491
 tcp_write_timer_handler+0x334/0x9d0 net/ipv4/tcp_timer.c:574
 tcp_write_timer+0x164/0x180 net/ipv4/tcp_timer.c:592
 call_timer_fn+0x241/0x820 kernel/time/timer.c:1266
 expire_timers kernel/time/timer.c:1305 [inline]
 __run_timers+0x960/0xcf0 kernel/time/timer.c:1599
 run_timer_softirq+0x21/0x80 kernel/time/timer.c:1612
 __do_softirq+0x31f/0xbe7 kernel/softirq.c:284
 invoke_softirq kernel/softirq.c:364 [inline]
 irq_exit+0x1cc/0x200 kernel/softirq.c:405
 exiting_irq arch/x86/include/asm/apic.h:658 [inline]
 smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:962
 apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:487

if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
  if (before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
BUG();
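
As a reading aid for the check quoted above: before() compares 32-bit TCP sequence numbers in a wraparound-safe way by looking at the sign of their difference. A minimal userspace sketch follows; the names seq_before and skb_entirely_before_una are hypothetical stand-ins, and only the comparison itself mirrors the kernel helper.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "seq1 comes before seq2" for 32-bit sequence numbers. */
static bool seq_before(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) < 0;
}

/*
 * Hypothetical helper: true when both the start and the end of a segment
 * lie before snd_una, which is the condition that reaches BUG() in the
 * quoted snippet.
 */
static bool skb_entirely_before_una(uint32_t seq, uint32_t end_seq,
				    uint32_t snd_una)
{
	return seq_before(seq, snd_una) && seq_before(end_seq, snd_una);
}
```

The signed difference keeps the comparison correct across the 2^32 wrap: 0xfffffff0 is treated as before 0x00000010.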


Re: [PATCH net] rxrpc: Fix potential NULL-pointer exception

2017-03-03 Thread David Miller
From: David Howells 
Date: Thu, 02 Mar 2017 23:26:13 +

> Fix a potential NULL-pointer exception in rxrpc_do_sendmsg().  The call
> state check that I added should have gone into the else-body of the
> if-statement where we actually have a call to check.
> 
> Found by CoverityScan CID#1414316 ("Dereference after null check").
> 
> Fixes: 540b1c48c37a ("rxrpc: Fix deadlock between call creation and 
> sendmsg/recvmsg")
> Reported-by: Colin Ian King 
> Signed-off-by: David Howells 

Applied.


Re: [Bug 194749] New: kernel bonding does not work in a network nameservice in versions above 3.10.0-229.20.1

2017-03-03 Thread Cong Wang
On Fri, Mar 3, 2017 at 8:03 AM, Jiri Pirko  wrote:
> If that use case exists I believe it is an abuse. Soft devices that are
> by definition in upper-lower relationships with other devices should not
> move to other namespaces. Prevents all kinds of issues. If you need a
> soft device like bridge or bond within a namespace, just create it there.
>

I can't agree. Dan's use case is perfectly valid: lower devices are moved
into a netns before being enslaved to the bonding device.
NETIF_F_NETNS_LOCAL was introduced for loopback, which is
created during netns creation. Forcing users to create a bond device in
each netns is not friendly.

What issues are you talking about there? Can't we just fix them?


Re: net/dccp: use-after-free in dccp_feat_activate_values

2017-03-03 Thread Dmitry Vyukov
On Fri, Mar 3, 2017 at 3:48 PM, Eric Dumazet  wrote:
> On Fri, 2017-03-03 at 06:32 -0800, Eric Dumazet wrote:
>> On Fri, 2017-03-03 at 15:11 +0100, Dmitry Vyukov wrote:
>> > On Mon, Feb 13, 2017 at 11:29 PM, Cong Wang  
>> > wrote:
>> > > On Mon, Feb 13, 2017 at 11:19 AM, Andrey Konovalov
>> > >  wrote:
>> > >> Hi,
>> > >>
>> > >> I've got the following error report while fuzzing the kernel with 
>> > >> syzkaller.
>> > >>
>> > >> On commit 926af6273fc683cd98cd0ce7bf0d04a02eed6742.
>> > >>
>> > >> A reproducer and .config are attached.
>> > >> Note, that it takes quite some time to trigger the bug (up to 10 
>> > >> minutes).
>> > >>
>> > >> BUG: KASAN: use-after-free in dccp_feat_activate_values+0x967/0xab0
>> > >> net/dccp/feat.c:1541 at addr 88003713be68
>> > >> Read of size 8 by task syz-executor2/8457
>> > >> CPU: 2 PID: 8457 Comm: syz-executor2 Not tainted 4.10.0-rc7+ #127
>> > >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 
>> > >> 01/01/2011
>> > >> Call Trace:
>> > >>  
>> > >>  __dump_stack lib/dump_stack.c:15 [inline]
>> > >>  dump_stack+0x292/0x398 lib/dump_stack.c:51
>> > >>  kasan_object_err+0x1c/0x70 mm/kasan/report.c:162
>> > >>  print_address_description mm/kasan/report.c:200 [inline]
>> > >>  kasan_report_error mm/kasan/report.c:289 [inline]
>> > >>  kasan_report.part.1+0x20e/0x4e0 mm/kasan/report.c:311
>> > >>  kasan_report mm/kasan/report.c:332 [inline]
>> > >>  __asan_report_load8_noabort+0x29/0x30 mm/kasan/report.c:332
>> > >>  dccp_feat_activate_values+0x967/0xab0 net/dccp/feat.c:1541
>> > >>  dccp_create_openreq_child+0x464/0x610 net/dccp/minisocks.c:121
>> > >>  dccp_v6_request_recv_sock+0x1f6/0x1960 net/dccp/ipv6.c:457
>> > >>  dccp_check_req+0x335/0x5a0 net/dccp/minisocks.c:186
>> > >>  dccp_v6_rcv+0x69e/0x1d00 net/dccp/ipv6.c:711
>> > >>  ip6_input_finish+0x46d/0x17a0 net/ipv6/ip6_input.c:279
>> > >>  NF_HOOK include/linux/netfilter.h:257 [inline]
>> > >>  ip6_input+0xdb/0x590 net/ipv6/ip6_input.c:322
>> > >>  dst_input include/net/dst.h:507 [inline]
>> > >>  ip6_rcv_finish+0x289/0x890 net/ipv6/ip6_input.c:69
>> > >>  NF_HOOK include/linux/netfilter.h:257 [inline]
>> > >>  ipv6_rcv+0x12ec/0x23d0 net/ipv6/ip6_input.c:203
>> > >>  __netif_receive_skb_core+0x1ae5/0x3400 net/core/dev.c:4190
>> > >>  __netif_receive_skb+0x2a/0x170 net/core/dev.c:4228
>> > >>  process_backlog+0xe5/0x6c0 net/core/dev.c:4839
>> > >>  napi_poll net/core/dev.c:5202 [inline]
>> > >>  net_rx_action+0xe70/0x1900 net/core/dev.c:5267
>> > >>  __do_softirq+0x2fb/0xb7d kernel/softirq.c:284
>> > >>  do_softirq_own_stack+0x1c/0x30 arch/x86/entry/entry_64.S:902
>> > >
>> > >
>> > > Seems there is a race condition between iterating dccp_feat_entry
>> > > and freeing it, bh_lock_sock() seems not held in this path.
>> >
>> >
>> >
>> > Cong, where exactly do we need to add bh_lock_sock()?
>> >
>> > I am still seeing this on 4977ab6e92e267afe9d8f78438c3db330ca8434c
>>
>>
>> I would try :
>
> Or something that would compile. I will take a deeper look after my
> commute.


Something that compiles is definitely better :)
Reapplied.


> diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> index 409d0cfd34474812c3bf74f26cd423a3d65ee441..56f883b301ccd610fc24efeac4fb47d3c2f95ecf 100644
> --- a/net/dccp/ipv4.c
> +++ b/net/dccp/ipv4.c
> @@ -482,7 +482,11 @@ static int dccp_v4_send_response(const struct sock *sk, 
> struct request_sock *req
> if (dst == NULL)
> goto out;
>
> +   /* DCCP is not ready yet for lockless SYN processing */
> +   bh_lock_sock((struct sock *)sk);
> skb = dccp_make_response(sk, dst, req);
> +   bh_unlock_sock((struct sock *)sk);
> +
> if (skb != NULL) {
> const struct inet_request_sock *ireq = inet_rsk(req);
> struct dccp_hdr *dh = dccp_hdr(skb);
> diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> index 233b57367758c64c09ed40f7359cb8fcb1918d93..673f45f85b7c755c8165c6274ffb6b1fe5660683 100644
> --- a/net/dccp/ipv6.c
> +++ b/net/dccp/ipv6.c
> @@ -214,7 +214,11 @@ static int dccp_v6_send_response(const struct sock *sk, 
> struct request_sock *req
> goto done;
> }
>
> +   /* DCCP is not ready yet for lockless SYN processing */
> +   bh_lock_sock((struct sock *)sk);
> skb = dccp_make_response(sk, dst, req);
> +   bh_unlock_sock((struct sock *)sk);
> +
> if (skb != NULL) {
> struct dccp_hdr *dh = dccp_hdr(skb);
> struct ipv6_txoptions *opt;
>
>


Re: [iproute PATCH v3 1/1] color: use "light" colors for dark background

2017-03-03 Thread Stephen Hemminger
On Wed,  1 Mar 2017 21:52:33 +0100
Petr Vorel  wrote:

> COLORFGBG environment variable is used to detect dark background.
> 
> Idea and a bit of code is borrowed from Vim, thanks.
> 
> Signed-off-by: Petr Vorel 

Applied and I split one long line.


Re: [PATCH net 0/2] sfc: couple of fixes

2017-03-03 Thread David Miller
From: Edward Cree 
Date: Fri, 3 Mar 2017 15:20:44 +

> First patch addresses a construct that causes sparse to error out.
> With that fixed, sparse makes some warnings on ef10.c, second patch
>  fixes one of them.

Series applied, thanks.


Re: pull-request: can 2017-03-03

2017-03-03 Thread David Miller
From: Marc Kleine-Budde 
Date: Fri,  3 Mar 2017 14:55:31 +0100

> this is a pull request for the upcoming v4.11 release.
> 
> There are two patches by Ethan Zonca for the gs_usb driver, the first one 
> fixes
> the memory used for USB transfers, the second one the coding style.
> 
> The last two patches are by me, one fixing a memory leak in the usb_8dev 
> driver
> the other a typo in the flexcan driver.

Pulled, thanks.


Re: [PATCH net] sctp: change to save MSG_MORE flag into assoc

2017-03-03 Thread Xin Long
On Sat, Mar 4, 2017 at 12:31 AM, David Laight  wrote:
> From: Xin Long
>> Sent: 03 March 2017 15:43
> ...
>> > It is much more important to get MSG_MORE working 'properly' for SCTP
>> > than for TCP. For TCP an application can always use a long send.
>
>> "long send" ?, you mean bigger data, or keeping sending?
>> I didn't get the difference between SCTP and TCP, they
>> are similar when sending data.
>
> With tcp an application can always replace two send()/write()
> calls with a single call to writev().
> For sctp two send() calls must be made in order to generate two
> data chunks.
> So it is much easier for a tcp application to generate 'full'
> ethernet packets.
okay, it should not be an important reason, and sctp might also support
it one day. :-)

>
>>
>> >
>> > ...
>> >> @@ -1982,6 +1982,7 @@ static int sctp_sendmsg(struct sock *sk, struct 
>> >> msghdr *msg, size_t msg_len)
>> >>* breaks.
>> >>*/
>> >>   err = sctp_primitive_SEND(net, asoc, datamsg);
>> >> + asoc->force_delay = 0;
>> >>   /* Did the lower layer accept the chunk? */
>> >>   if (err) {
>> >>   sctp_datamsg_free(datamsg);
>> >
>> > I don't think this is right - or needed.
>> > You only get to the above if some test has decided to send data chunks.
>> > So it just means that the NEXT time someone tries to send data all the
>> > queued data gets sent.
>
>> the NEXT time someone tries to send data with "MSG_MORE clear",
>> yes, but with "MSG_MORE set", it will still delay.
>>
>> > I'm guessing that the whole thing gets called in a loop (definitely needed
>> > for very long data chunks, or after the window is opened).
>
>> yes, if users keep sending data chunks with MSG_MORE set, no
>> data with "MSG_MORE clear" gap.
>>
>> > Now if an application sends a lot of (say) 100 byte chunks with MSG_MORE
>> > set it would expect to see a lot of full ethernet frames be sent.
>
>> right.
>
>> > With the above a frame will be sent (containing all but 1 chunk) when the
>> > amount of queued data becomes too large for an ethernet frame, and 
>> > immediately
>> > followed by a second ethernet frame with 1 chunk in it.
>
>> "followed by a second ethernet frame with 1 chunk in it.", I think this's
>> what you're really worried about, right ?
>> But sctp does NOT flush the data queue the way you think; it does not keep
>> traversing the queue until the queue is empty.
>> Once a packet with chunks in one ethernet frame is sent, sctp_outq_flush
>> will return. It will pack chunks and send the next packet again only when some
>> other 'event' triggers it, like retransmission or data received from peer.
>> I don't think this is a problem.
>
> Erm that can't work.
> I think there is code to convert a large user send into multiple data chunks.
> So if the user does a 4k (say) send several large chunks get queued.
> These would need to all be sent at once.
>
> Similarly when the transmit window is received.
> So somewhere there ought to be a loop that will send more than one packet.
As far as I can see, there is no loop like you describe; mostly, an incoming
chunk (like SACK) from the peer triggers the next flush.
I can try to trace the path in the kernel tomorrow to be sure.

>
>> > Now it might be that the flag needs clearing when retransmissions are 
>> > queued.
>> > OTOH they might get sent for other reasons.
>
>> Before we really overthought about MSG_MORE, no need to care about
>> retransmissions, define MSG_MORE, in my opinion, it works more for
>> *inflight is 0*, if it's not 0, we shouldn't stop other places flushing them.
>
> Eh? and when nagle disabled.
> If 'inflight' isn't 0 then most paths don't flush data.
I know, but MSG_MORE is a different thing; it should only apply to the
current and following data.

>
>> We cannot let asoc's more_more flag work as global, it will block elsewhere
>> sending data chunks, not only sctp_sendmsg.
>
> If the connection was flow controlled off, and more 'credit' arrives and there
> is less than an ethernet frame's worth of data pending, and the last send
> said 'MSG_MORE' there is no point sending anything until the application
> does a send with MSG_MORE clear.
Got you. I think you have a different understanding of MSG_MORE.
This patch just tries to make it work like TCP's msg_more, but what
you describe here applies to TCP as well, so it seems you also want
to improve TCP's MSG_MORE :-)

>
> I'm not sure what causes a retransmission to send data, I suspect that 
> 'inflight'
> can easily be non-zero at that time.
A retransmission sends data because both tx and rtx send data through
sctp_outq_flush, which tries to send the rtx queue first,
then the tx queue.

Yes, once a packet is sent out and has not yet been SACKed, "inflight" will not
be zero, so when retransmitting, "inflight" must be non-zero.

> Likely something causes a packet be generated - which then collects the data 
> chunks.
>
> David
>
>


Re: [PATCH 1/4] net: thunderx: Fix IOMMU translation faults

2017-03-03 Thread David Miller
From: sunil.kovv...@gmail.com
Date: Fri,  3 Mar 2017 16:17:47 +0530

> @@ -1643,6 +1650,9 @@ static int nicvf_probe(struct pci_dev *pdev, const 
> struct pci_device_id *ent)
>   if (!pass1_silicon(nic->pdev))
>   nic->hw_tso = true;
>  
> + /* Check if we are attached to IOMMU */
> + nic->iommu_domain = iommu_get_domain_for_dev(dev);

This function is not universally available.

This looks very hackish to me anyways, how all of this stuff is supposed
to work is that you simply use the DMA interfaces unconditionally and
whatever is behind the operations takes care of everything.

Doing it conditionally in the driver with all of this special IOMMU
domain et al. knowledge makes no sense to me at all.

I don't see other drivers doing stuff like this at all, so if you're
going to handle this in a unique way like this you better write
several paragraphs in your commit message explaining why this weird
crap is necessary.

There is no way I can apply this series as it is currently written.

Thanks.


Re: [PATCH 1/1] rds: remove unnecessary returned value check

2017-03-03 Thread David Miller
From: Zhu Yanjun 
Date: Fri,  3 Mar 2017 00:44:26 -0500

> The function rds_trans_register always returns 0. As such, it is not
> necessary to check the returned value.
> 
> Cc: Joe Jin 
> Cc: Junxiao Bi 
> Signed-off-by: Zhu Yanjun 

Applied.


[PATCH] selinux: check for address length in selinux_socket_bind()

2017-03-03 Thread Alexander Potapenko
KMSAN (KernelMemorySanitizer, a new error detection tool) reports use of
uninitialized memory in packet_bind_spkt():

==
BUG: KMSAN: use of unitialized memory
inter: 0
CPU: 3 PID: 1074 Comm: packet2 Tainted: GB   4.8.0-rc6+ #1916
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  8800882ffb08 825759c8 8800882ffa48
 818bf551 85bab870 0092 85bab550
  0092 bb0009bb 0002
Call Trace:
 [< inline >] __dump_stack lib/dump_stack.c:15
 [] dump_stack+0x238/0x290 lib/dump_stack.c:51
 [] kmsan_report+0x276/0x2e0 mm/kmsan/kmsan.c:1008
 [] __msan_warning+0x5b/0xb0 mm/kmsan/kmsan_instr.c:424
 [] selinux_socket_bind+0xf41/0x1080 
security/selinux/hooks.c:4288
 [] security_socket_bind+0x1ec/0x240 security/security.c:1240
 [] SYSC_bind+0x358/0x5f0 net/socket.c:1366
 [] SyS_bind+0x82/0xa0 net/socket.c:1356
 [] do_syscall_64+0x58/0x70 arch/x86/entry/common.c:292
 [] entry_SYSCALL64_slow_path+0x25/0x25 
arch/x86/entry/entry_64.o:?
chained origin: ba6009bb
 [] save_stack_trace+0x27/0x50 arch/x86/kernel/stacktrace.c:67
 [< inline >] kmsan_save_stack_with_flags mm/kmsan/kmsan.c:322
 [< inline >] kmsan_save_stack mm/kmsan/kmsan.c:337
 [] kmsan_internal_chain_origin+0x118/0x1e0 
mm/kmsan/kmsan.c:530
 [] __msan_set_alloca_origin4+0xc3/0x130 
mm/kmsan/kmsan_instr.c:380
 [] SYSC_bind+0x129/0x5f0 net/socket.c:1356
 [] SyS_bind+0x82/0xa0 net/socket.c:1356
 [] do_syscall_64+0x58/0x70 arch/x86/entry/common.c:292
 [] return_from_SYSCALL_64+0x0/0x6a 
arch/x86/entry/entry_64.o:?
origin description: address@SYSC_bind (origin=b8c00900)
==

(the line numbers are relative to 4.8-rc6, but the bug persists upstream)

, when I run the following program as root:

===
  #include <stdlib.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <netinet/in.h>

  int main(int argc, char *argv[]) {
    struct sockaddr addr;
    int size = 0;
    if (argc > 1) {
      size = atoi(argv[1]);
    }
    memset(&addr, 0, sizeof(addr));
    int fd = socket(PF_INET6, SOCK_DGRAM, IPPROTO_IP);
    bind(fd, &addr, size);
    return 0;
  }
===

(for different values of |size| other error reports are printed).

This happens because bind() unconditionally copies |size| bytes of
|addr| to the kernel, leaving the rest uninitialized. Then
security_socket_bind() reads the IP address bytes, including the
uninitialized ones, to determine the port, or e.g. pass them further to
sel_netnode_find(), which uses them to calculate a hash.

Signed-off-by: Alexander Potapenko 
---
 security/selinux/hooks.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 0a4b4b040e0a..eba54489b11b 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -4351,10 +4351,19 @@ static int selinux_socket_bind(struct socket *sock, 
struct sockaddr *address, in
u32 sid, node_perm;
 
if (family == PF_INET) {
+   if (addrlen != sizeof(struct sockaddr_in)) {
+   err = -EINVAL;
+   goto out;
+   }
addr4 = (struct sockaddr_in *)address;
snum = ntohs(addr4->sin_port);
addrp = (char *)&addr4->sin_addr.s_addr;
+
} else {
+   if (addrlen != sizeof(struct sockaddr_in6)) {
+   err = -EINVAL;
+   goto out;
+   }
addr6 = (struct sockaddr_in6 *)address;
snum = ntohs(addr6->sin6_port);
addrp = (char *)&addr6->sin6_addr.s6_addr;
-- 
2.12.0.rc1.440.g5b76565f74-goog
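
The idea behind the patch above, checking the caller-supplied length before reading any field of the address, can be sketched in userspace as follows. This is a minimal analogue under stated assumptions: the function name bind_port and its signature are hypothetical, and only the length comparison mirrors the patch.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: extract the port from a socket address, but only
 * after confirming that addrlen matches the structure implied by the
 * socket family. Returns 0 on success or -EINVAL on a length mismatch,
 * so bytes the caller never supplied are not read.
 */
static int bind_port(int family, const struct sockaddr *address,
		     socklen_t addrlen, unsigned short *port)
{
	if (family == PF_INET) {
		if (addrlen != sizeof(struct sockaddr_in))
			return -EINVAL;
		*port = ntohs(((const struct sockaddr_in *)address)->sin_port);
	} else {
		if (addrlen != sizeof(struct sockaddr_in6))
			return -EINVAL;
		*port = ntohs(((const struct sockaddr_in6 *)address)->sin6_port);
	}
	return 0;
}
```

With the reproducer's PF_INET6 socket and size 0, such a check returns -EINVAL before the port or address bytes are touched.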



Re: [PATCH net 0/2] xen-netback: update memory leak fix to avoid BUG

2017-03-03 Thread David Miller
From: Paul Durrant 
Date: Thu, 2 Mar 2017 12:54:24 +

> Commit 9a6cdf52b85e "xen-netback: fix memory leaks on XenBus disconnect"
> added missing code to fix a memory leak by calling vfree() in the
> appropriate place.
> Unfortunately subsequent commit f16f1df65f1c "xen-netback: protect
> resource cleaning on XenBus disconnect" then wrapped this call to vfree()
> in a spin lock, leading to a BUG due to incorrect context.
> 
> Patch #1 makes the existing code more readable
> Patch #2 fixes the problem

Series applied, thanks.


Re: [PATCH net 0/2] nfp: RX and XDP buffer fixes

2017-03-03 Thread David Miller
From: Jakub Kicinski 
Date: Thu,  2 Mar 2017 15:26:19 -0800

> Two trivial fixes for code introduced with XDP support.  First
> one corrects the buffer size we populate a register with.  The
> register is designed to be used for scatter transfers which 
> the driver (and most FWs) don't support so it's not critical.
> The other one for DMA direction is mostly cosmetic, DMA API
> doesn't seem to care today about the precise direction in sync
> calls.

Series applied.


Re: [PATCH net v5 0/2] net: ethernet: bgmac: bug fixes

2017-03-03 Thread David Miller
From: Jon Mason 
Date: Thu,  2 Mar 2017 17:59:55 -0500

> Bug fixes for bgmac driver

Series applied.



Re: [PATCH] selinux: check for address length in selinux_socket_bind()

2017-03-03 Thread Alexander Potapenko
On Fri, Mar 3, 2017 at 6:23 PM, Alexander Potapenko  wrote:
> KMSAN (KernelMemorySanitizer, a new error detection tool) reports use of
> uninitialized memory in packet_bind_spkt():
Should be "in selinux_socket_bind()", will fix in the next patch version.
> ==
> BUG: KMSAN: use of unitialized memory
> inter: 0
> CPU: 3 PID: 1074 Comm: packet2 Tainted: GB   4.8.0-rc6+ #1916
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>   8800882ffb08 825759c8 8800882ffa48
>  818bf551 85bab870 0092 85bab550
>   0092 bb0009bb 0002
> Call Trace:
>  [< inline >] __dump_stack lib/dump_stack.c:15
>  [] dump_stack+0x238/0x290 lib/dump_stack.c:51
>  [] kmsan_report+0x276/0x2e0 mm/kmsan/kmsan.c:1008
>  [] __msan_warning+0x5b/0xb0 mm/kmsan/kmsan_instr.c:424
>  [] selinux_socket_bind+0xf41/0x1080 
> security/selinux/hooks.c:4288
>  [] security_socket_bind+0x1ec/0x240 
> security/security.c:1240
>  [] SYSC_bind+0x358/0x5f0 net/socket.c:1366
>  [] SyS_bind+0x82/0xa0 net/socket.c:1356
>  [] do_syscall_64+0x58/0x70 arch/x86/entry/common.c:292
>  [] entry_SYSCALL64_slow_path+0x25/0x25 
> arch/x86/entry/entry_64.o:?
> chained origin: ba6009bb
>  [] save_stack_trace+0x27/0x50 
> arch/x86/kernel/stacktrace.c:67
>  [< inline >] kmsan_save_stack_with_flags mm/kmsan/kmsan.c:322
>  [< inline >] kmsan_save_stack mm/kmsan/kmsan.c:337
>  [] kmsan_internal_chain_origin+0x118/0x1e0 
> mm/kmsan/kmsan.c:530
>  [] __msan_set_alloca_origin4+0xc3/0x130 
> mm/kmsan/kmsan_instr.c:380
>  [] SYSC_bind+0x129/0x5f0 net/socket.c:1356
>  [] SyS_bind+0x82/0xa0 net/socket.c:1356
>  [] do_syscall_64+0x58/0x70 arch/x86/entry/common.c:292
>  [] return_from_SYSCALL_64+0x0/0x6a 
> arch/x86/entry/entry_64.o:?
> origin description: address@SYSC_bind (origin=b8c00900)
> ==
>
> (the line numbers are relative to 4.8-rc6, but the bug persists upstream)
>
> , when I run the following program as root:
>
> ===
>   #include <stdlib.h>
>   #include <string.h>
>   #include <sys/socket.h>
>   #include <netinet/in.h>
>
>   int main(int argc, char *argv[]) {
>     struct sockaddr addr;
>     int size = 0;
>     if (argc > 1) {
>       size = atoi(argv[1]);
>     }
>     memset(&addr, 0, sizeof(addr));
>     int fd = socket(PF_INET6, SOCK_DGRAM, IPPROTO_IP);
>     bind(fd, &addr, size);
>     return 0;
>   }
> ===
>
> (for different values of |size| other error reports are printed).
>
> This happens because bind() unconditionally copies |size| bytes of
> |addr| to the kernel, leaving the rest uninitialized. Then
> security_socket_bind() reads the IP address bytes, including the
> uninitialized ones, to determine the port, or e.g. pass them further to
> sel_netnode_find(), which uses them to calculate a hash.
>
> Signed-off-by: Alexander Potapenko 
> ---
>  security/selinux/hooks.c | 9 +
>  1 file changed, 9 insertions(+)
>
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index 0a4b4b040e0a..eba54489b11b 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -4351,10 +4351,19 @@ static int selinux_socket_bind(struct socket *sock, struct sockaddr *address, in
> u32 sid, node_perm;
>
> if (family == PF_INET) {
> +   if (addrlen != sizeof(struct sockaddr_in)) {
> +   err = -EINVAL;
> +   goto out;
> +   }
> addr4 = (struct sockaddr_in *)address;
> snum = ntohs(addr4->sin_port);
> addrp = (char *)&addr4->sin_addr.s_addr;
> +
> } else {
> +   if (addrlen != sizeof(struct sockaddr_in6)) {
> +   err = -EINVAL;
> +   goto out;
> +   }
> addr6 = (struct sockaddr_in6 *)address;
> snum = ntohs(addr6->sin6_port);
> addrp = (char *)&addr6->sin6_addr.s6_addr;
> --
> 2.12.0.rc1.440.g5b76565f74-goog
>
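
For illustration, the length check the patch adds can be sketched in
userspace as below. This is only a sketch under assumptions:
check_bind_addrlen() is a hypothetical helper, not a kernel function, and
it mirrors just the two addrlen comparisons from the diff, nothing else in
selinux_socket_bind().

```c
#include <errno.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/*
 * Hypothetical userspace helper (not kernel code): reject any address
 * length that does not exactly match the structure for the family, as
 * the patch does before sin_port or sin6_port is read.
 */
static int check_bind_addrlen(int family, size_t addrlen)
{
	if (family == AF_INET)
		return addrlen == sizeof(struct sockaddr_in) ? 0 : -EINVAL;
	/* The patched branch treats every other family as IPv6. */
	return addrlen == sizeof(struct sockaddr_in6) ? 0 : -EINVAL;
}
```

With a check of this shape, the test program's size of 0 would be refused
before any uninitialized byte of the address is read.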



-- 
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Matthew Scott Sucherman, Paul Terence Manicle
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg


Re: [net 0/2][pull request] Intel Wired LAN Driver Updates 2017-03-02

2017-03-03 Thread David Miller
From: Jeff Kirsher 
Date: Thu,  2 Mar 2017 18:24:46 -0800

> This series contains fixes to ixgbe only.

Pulled, thanks Jeff.


Re: [PATCH 02/26] rewrite READ_ONCE/WRITE_ONCE

2017-03-03 Thread Peter Zijlstra
On Fri, Mar 03, 2017 at 03:49:38PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 03, 2017 at 09:26:50AM +0100, Christian Borntraeger wrote:
> > Right. The main purpose is to read/write _ONCE_. You can assume a somewhat
> > atomic access for sizes <= word size. And there are certainly places that
> > rely on that. But the *ONCE thing is mostly used for things where we used
> > barrier() 10 years ago.
> 
> A lot of code relies on READ/WRITE_ONCE() to generate single
> instructions for naturally aligned machine word sized loads/stores
> (something GCC used to guarantee, but does no longer IIRC).
> 
> So much so that I would say it's a bug if READ/WRITE_ONCE() doesn't
> generate a single instruction under those conditions.
> 
> However, every time I've tried to introduce stricter
> semantics/primitives to verify things Linus hated it.

See here for the last attempt:

  https://marc.info/?l=linux-virtualization&m=148007765918101&w=2


Re: pull-request: can-next 2017-03-03

2017-03-03 Thread David Miller

Net-next is closed, please resubmit this when the tree is actually
opened back up after the merge window.

Thank you.


RE: [PATCH 01/26] compiler: introduce noinline_for_kasan annotation

2017-03-03 Thread David Laight
From: Andrey Ryabinin
> Sent: 03 March 2017 13:50
...
> noinline_iff_kasan might be a better name.  noinline_for_kasan gives the
> impression that we always noinline the function for the sake of kasan,
> while noinline_iff_kasan clearly indicates that the function is noinline
> only if kasan is used.

noinline_if_stackbloat

David



Re: [PATCH 02/26] rewrite READ_ONCE/WRITE_ONCE

2017-03-03 Thread Peter Zijlstra
On Fri, Mar 03, 2017 at 09:26:50AM +0100, Christian Borntraeger wrote:
> Right. The main purpose is to read/write _ONCE_. You can assume a somewhat
> atomic access for sizes <= word size. And there are certainly places that
> rely on that. But the *ONCE thing is mostly used for things where we used
> barrier() 10 years ago.

A lot of code relies on READ/WRITE_ONCE() to generate single
instructions for naturally aligned machine word sized loads/stores
(something GCC used to guarantee, but does no longer IIRC).

So much so that I would say it's a bug if READ/WRITE_ONCE() doesn't
generate a single instruction under those conditions.

However, every time I've tried to introduce stricter
semantics/primitives to verify things Linus hated it.
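
To make the expectation concrete, here is a minimal userspace sketch of the
volatile-cast form for a word sized object. The macro and function names
are made up for the example and this is not the kernel implementation; the
volatile access only tells the compiler not to elide, repeat or merge the
access, and whether it becomes a single instruction is exactly the part
that is assumed rather than guaranteed.

```c
/*
 * Illustrative macros only; the kernel's READ_ONCE/WRITE_ONCE are more
 * involved. __typeof__ is a GCC/Clang extension.
 */
#define READ_ONCE_SKETCH(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE_SKETCH(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Naturally aligned, machine word sized object. */
static unsigned long shared_word;

/* One volatile load, one volatile store, then one more volatile load. */
static unsigned long bump_shared_word(void)
{
	unsigned long old = READ_ONCE_SKETCH(shared_word);

	WRITE_ONCE_SKETCH(shared_word, old + 1);
	return READ_ONCE_SKETCH(shared_word);
}
```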


Re: [Bug 194749] New: kernel bonding does not work in a network nameservice in versions above 3.10.0-229.20.1

2017-03-03 Thread Dan Geist


- On Mar 3, 2017, at 11:03 AM, Jiri Pirko j...@resnulli.us wrote:

> Fri, Mar 03, 2017 at 04:19:13PM CET, nicolas.dich...@6wind.com wrote:
>>Le 02/03/2017 à 21:39, Dan Geist a écrit :
>>> - On Mar 2, 2017, at 3:11 PM, Cong Wang xiyou.wangc...@gmail.com wrote
>>> 
>>>> On Thu, Mar 2, 2017 at 10:32 AM, Stephen Hemminger
>>>>  wrote:
>>>>>
>>>>>
>>>>> Begin forwarded message:
>>>>>
>>>>> Date: Wed, 01 Mar 2017 21:08:01 +
>>>>> From: bugzilla-dae...@bugzilla.kernel.org
>>>>> To: step...@networkplumber.org
>>>>> Subject: [Bug 194749] New: kernel bonding does not work in a network
>>>>> nameservice in versions above 3.10.0-229.20.1
>>>>
>>>>>
>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=194749
>>>>>
>>>>> Bug ID: 194749
>>>>>        Summary: kernel bonding does not work in a network nameservice
>>>>>                 in versions above 3.10.0-229.20.1
>>>>>        Product: Networking
>>>>>        Version: 2.5
>>>>> Kernel Version: > 3.10.0-229.20.1
>>>>>       Hardware: x86-64
>>>>>             OS: Linux
>>>>>           Tree: Mainline
>>>>>         Status: NEW
>>>>>       Severity: blocking
>>>>>       Priority: P1
>>>>>      Component: Other
>>>>>       Assignee: step...@networkplumber.org
>>>>>       Reporter: d...@polter.net
>>>>>     Regression: No
>>>>>
>>>>> bond interface is being used in active/standby mode with two physical NICs
>>>>> inside a network nameservice to provide switchpath redundancy.
>>>>>
>>>>> netns is instantiated post-boot with the following:
>>>>>
>>>>> ip netns add vntp
>>>>> ip link set p4p1 netns vntp
>>>>> ip link set p4p2 netns vntp
>>>>> ip link set bond0 netns vntp
>>>>> ip netns exec vntp ip link set lo up
>>>>> ip netns exec vntp ip link set p4p1 up
>>>>> ip netns exec vntp ip link set p4p2 up
>>>>> ip netns exec vntp ifenslave bond0 p4p1 p4p2
>>>>
>>>> This is due to the following commit:
>>>>
>>>> commit f9399814927ad9bb995a6e109c2a5f9d8a848209
>>>> Author: Weilong Chen 
>>>> Date:   Wed Jan 22 17:16:30 2014 +0800
>>>>
>>>>     bonding: Don't allow bond devices to change network namespaces.
>>>>
>>>>     Like bridge, bonding as netdevice doesn't cross netns boundaries.
>>>>
>>>>     Bonding ports and bonding itself live in same netns.
>>>>
>>>>     Signed-off-by: Weilong Chen 
>>>>     Signed-off-by: David S. Miller 
>>>>
>>>>
>>>> NETIF_F_NETNS_LOCAL was introduced for loopback device which
>>>> is created for each netns, it is not clear why we need to add it to bond
>>>> and bridge...
>>> 
>>> Thank you for tracking this down. Without digging through the code to
>>> figure it out, does this imply that the existence of a bond interface is
>>> not possible AT ALL within a netns or simply that it may not be "migrated"
>>> between the global scope and a netns?
>>It means that the migration is not possible. I think the only reason to have
>>this flag on bonding and bridge is the lack of test and fix. There is probably
>>some work to be done to have this feature. But are there real use cases of
>>x-netns bonding or x-netns bridge?
> 
> If that use case exists I believe it is an abuse. Soft devices that are
> by definition in upper-lower relationships with other devices should not
> move to other namespaces. Prevents all kinds of issues. If you need a
> soft device like bridge or bond within a namespace, just create it there.

I think the implementation is good as it stands and I don't have a use case to
the contrary. I simply misunderstood the implications of creating the bond
interface in the global space and had been relying on the unnecessarily
permissive behavior of the older kernels. Once I stopped doing that and
moved the instance creation to within the netns, my desired behavior and
functionality were restored. I also commented on the bug report on vger as such.

Thanks for the clarification and consideration.
Dan 

-- 
Dan Geist dan(@)polter.net



VIA velocity hang: how to find cause?

2017-03-03 Thread Udo van den Heuvel
Hello,

I noticed at least twice that a VIA velocity interface stops functioning
for no apparent reason but is revived after an `ifdown eth0; ifup eth0`.
No errors are shown in dmesg or /var/log/messages w.r.t. the interface.
How to find the root cause?

This is with kernel 4.4.48 and .50.

Kind regards,
Udo


Re: [PATCH 1/1] rds: remove unnecessary returned value check

2017-03-03 Thread Santosh Shilimkar

On 3/2/2017 9:44 PM, Zhu Yanjun wrote:

The function rds_trans_register always returns 0. As such, it is not
necessary to check the returned value.

Cc: Joe Jin 
Cc: Junxiao Bi 
Signed-off-by: Zhu Yanjun 
---

Acked-by: Santosh Shilimkar 


RE: [PATCH net] sctp: change to save MSG_MORE flag into assoc

2017-03-03 Thread David Laight
From: Xin Long
> Sent: 03 March 2017 15:43
...
> > It is much more important to get MSG_MORE working 'properly' for SCTP
> > than for TCP. For TCP an application can always use a long send.

> "long send" ?, you mean bigger data, or keeping sending?
> I didn't get the difference between SCTP and TCP, they
> are similar when sending data.

With tcp an application can always replace two send()/write()
calls with a single call to writev().
For sctp two send() calls must be made in order to generate two
data chunks.
So it is much easier for a tcp application to generate 'full'
ethernet packets. 
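
As a rough illustration of the TCP side of this (a sketch, not code from
the thread), two buffers can be handed to the stack in one writev() call.
write_both() is a name invented for the example; the fd would be a
connected TCP socket in practice, although any writable descriptor works
for trying it out.

```c
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Write two separate buffers with one system call, so the stack sees
 * both pieces at once instead of two send()/write() calls.
 */
static ssize_t write_both(int fd, const void *a, size_t alen,
			  const void *b, size_t blen)
{
	struct iovec iov[2];

	iov[0].iov_base = (void *)a;	/* iov_base is not const-qualified */
	iov[0].iov_len = alen;
	iov[1].iov_base = (void *)b;
	iov[1].iov_len = blen;

	return writev(fd, iov, 2);
}
```

There is no equivalent on an sctp socket for producing two data chunks,
which is why MSG_MORE matters more there.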

> 
> >
> > ...
> >> @@ -1982,6 +1982,7 @@ static int sctp_sendmsg(struct sock *sk, struct msghdr *msg, size_t msg_len)
> >>* breaks.
> >>*/
> >>   err = sctp_primitive_SEND(net, asoc, datamsg);
> >> + asoc->force_delay = 0;
> >>   /* Did the lower layer accept the chunk? */
> >>   if (err) {
> >>   sctp_datamsg_free(datamsg);
> >
> > I don't think this is right - or needed.
> > You only get to the above if some test has decided to send data chunks.
> > So it just means that the NEXT time someone tries to send data all the
> > queued data gets sent.

> the NEXT time someone tries to send data with "MSG_MORE clear",
> yes, but with "MSG_MORE set", it will still delay.
> 
> > I'm guessing that the whole thing gets called in a loop (definitely needed
> > for very long data chunks, or after the window is opened).

> yes, if users keep sending data chunks with MSG_MORE set, no
> data with "MSG_MORE clear" gap.
> 
> > Now if an application sends a lot of (say) 100 byte chunks with MSG_MORE
> > set it would expect to see a lot of full ethernet frames be sent.

> right.

> > With the above a frame will be sent (containing all but 1 chunk) when the
> > amount of queued data becomes too large for an ethernet frame, and
> > immediately followed by a second ethernet frame with 1 chunk in it.

> "followed by a second ethernet frame with 1 chunk in it.", I think this's
> what you're really worried about, right ?
> But sctp flush data queue NOT like what you think, it's not keep traversing
> the queue until the queue is empty.
> once a packet with chunks in one ethernet frame is sent, sctp_outq_flush
> will return. it will pack chunks and send the next packet again until some
> other 'event' triggers it, like retransmission or data received from peer.
> I don't think this is a problem.

Erm that can't work.
I think there is code to convert a large user send into multiple data chunks.
So if the user does a 4k (say) send several large chunks get queued.
These would need to all be sent at once.

Similarly when the transmit window is received.
So somewhere there ought to be a loop that will send more than one packet.

> > Now it might be that the flag needs clearing when retransmissions are 
> > queued.
> > OTOH they might get sent for other reasons.

> Before we really overthought about MSG_MORE, no need to care about
> retransmissions, define MSG_MORE, in my opinion, it works more for
> *inflight is 0*, if it's not 0, we shouldn't stop other places flushing them.

Eh? And when Nagle is disabled.
If 'inflight' isn't 0 then most paths don't flush data.

> We cannot let asoc's more_more flag work as global, it will block elsewhere
> sending data chunks, not only sctp_sendmsg.

If the connection was flow controlled off, and more 'credit' arrives and there
is less than an ethernet frame's worth of data pending, and the last send
said 'MSG_MORE' there is no point sending anything until the application
does a send with MSG_MORE clear.
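
Written out as a decision function, the rule would look roughly like the
sketch below. All names are made up, and it deliberately ignores Nagle,
'inflight' and the congestion window, so it is not what the sctp code does
today, only the behaviour being argued for.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hold data back only while the last send asked for MSG_MORE and less
 * than one full frame's worth of data is queued.
 */
static bool should_transmit(size_t queued_bytes, size_t frame_payload,
			    bool last_send_had_more)
{
	if (queued_bytes == 0)
		return false;		/* nothing to send */
	if (queued_bytes >= frame_payload)
		return true;		/* a full frame is ready */
	return !last_send_had_more;	/* partial frame: wait if MSG_MORE */
}
```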

I'm not sure what causes a retransmission to send data, I suspect that
'inflight' can easily be non-zero at that time.
Likely something causes a packet to be generated - which then collects the
data chunks.
David




[PATCH 0/4] net: thunderx: Miscellaneous fixes

2017-03-03 Thread sunil . kovvuri
From: Sunil Goutham 

This patch set fixes multiple issues such as IOMMU
translation faults when the kernel is booted with IOMMU enabled
on host, incorrect MAC ID reading from ACPI tables and IPv6
UDP packet drop due to failure of checksum validation.

Sunil Goutham (3):
  net: thunderx: Fix IOMMU translation faults
  net: thunderx: Fix LMAC mode debug prints for QSGMII mode
  net: thunderx: Fix invalid mac addresses for node1 interfaces

Thanneeru Srinivasulu (1):
  net: thunderx: Allow IPv6 frames with zero UDP checksum

 drivers/net/ethernet/cavium/thunder/nic.h  |   1 +
 drivers/net/ethernet/cavium/thunder/nicvf_main.c   |  12 +-
 drivers/net/ethernet/cavium/thunder/nicvf_queues.c | 193 +
 drivers/net/ethernet/cavium/thunder/nicvf_queues.h |   2 +
 drivers/net/ethernet/cavium/thunder/thunder_bgx.c  |  64 +--
 drivers/net/ethernet/cavium/thunder/thunder_bgx.h  |   1 -
 6 files changed, 223 insertions(+), 50 deletions(-)

-- 
2.7.4



Re: [PATCH 01/26] compiler: introduce noinline_for_kasan annotation

2017-03-03 Thread Andrey Ryabinin


On 03/02/2017 07:38 PM, Arnd Bergmann wrote:
> When CONFIG_KASAN is set, we can run into some code that uses incredible
> amounts of kernel stack:
> 
> drivers/staging/dgnc/dgnc_neo.c:1056:1: error: the frame size of 2 bytes 
> is larger than 2048 bytes [-Werror=frame-larger-than=]
> drivers/media/i2c/cx25840/cx25840-core.c:4960:1: error: the frame size of 
> 94000 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]
> drivers/media/dvb-frontends/stv090x.c:3430:1: error: the frame size of 5312 
> bytes is larger than 3072 bytes [-Werror=frame-larger-than=]
> 
> This happens when a sanitizer uses stack memory each time an inline function
> gets called. This introduces a new annotation for those functions to make
> them either 'inline' or 'noinline' depending on the CONFIG_KASAN symbol.
> 
> Signed-off-by: Arnd Bergmann 
> ---
>  include/linux/compiler.h | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index f8110051188f..56b90897a459 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -416,6 +416,17 @@ static __always_inline void __write_once_size(volatile 
> void *p, void *res, int s
>   */
>  #define noinline_for_stack noinline
>  
> +/*
> + * CONFIG_KASAN can lead to extreme stack usage with certain patterns when
> + * one function gets inlined many times and each instance requires a stack
> + * check.
> + */
> +#ifdef CONFIG_KASAN
> +#define noinline_for_kasan noinline __maybe_unused


noinline_iff_kasan might be a better name.  noinline_for_kasan gives the
impression that we always noinline the function for the sake of kasan, while
noinline_iff_kasan clearly indicates that the function is noinline only if
kasan is used.

> +#else
> +#define noinline_for_kasan inline
> +#endif
> +
>  #ifndef __always_inline
>  #define __always_inline inline
>  #endif
> 
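
For reference, a userspace sketch of the same annotation under the proposed
name might look like this. It is an illustration under assumptions: the
attributes are spelled out because the kernel's noinline and __maybe_unused
macros are not available outside the kernel, CONFIG_KASAN would have to be
passed with -D, and put_u8_sketch() is a made-up helper.

```c
/*
 * noinline only if KASAN is enabled, plain inline otherwise.
 * CONFIG_KASAN normally comes from Kconfig; here it is just a -D flag.
 */
#ifdef CONFIG_KASAN
#define noinline_iff_kasan __attribute__((noinline)) __attribute__((unused))
#else
#define noinline_iff_kasan inline
#endif

/* Hypothetical small helper that would normally be inlined everywhere. */
static noinline_iff_kasan int put_u8_sketch(unsigned char *buf,
					    unsigned char v)
{
	buf[0] = v;
	return 1;
}
```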


Re: [PATCH 02/26] rewrite READ_ONCE/WRITE_ONCE

2017-03-03 Thread Christian Borntraeger
On 03/02/2017 10:45 PM, Arnd Bergmann wrote:
> On Thu, Mar 2, 2017 at 8:00 PM, Christian Borntraeger
>  wrote:
>> On 03/02/2017 06:55 PM, Arnd Bergmann wrote:
>>> On Thu, Mar 2, 2017 at 5:51 PM, Christian Borntraeger
>>>  wrote:
>>>> On 03/02/2017 05:38 PM, Arnd Bergmann wrote:
>>>>>
>>>>> This attempts a rewrite of the two macros, using a simpler implementation
>>>>> for the most common case of having a naturally aligned 1, 2, 4, or (on
>>>>> 64-bit architectures) 8  byte object that can be accessed with a single
>>>>> instruction.  For these, we go back to a volatile pointer dereference
>>>>> that we had with the ACCESS_ONCE macro.
>>>>
>>>> We had changed that back then because gcc 4.6 and 4.7 had a bug that could
>>>> remove the volatile statement on aggregate types like the following one
>>>>
>>>> union ipte_control {
>>>>         unsigned long val;
>>>>         struct {
>>>>                 unsigned long k  : 1;
>>>>                 unsigned long kh : 31;
>>>>                 unsigned long kg : 32;
>>>>         };
>>>> };
>>>>
>>>> See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145
>>>>
>>>> If I see that right, your __ALIGNED_WORD(x)
>>>> macro would say that for above structure  sizeof(x) == sizeof(long) is
>>>> true, so it would fall back to the old volatile cast and might
>>>> reintroduce the old compiler bug?
>>
>> Oh dear, I should double check my sentences in emails before sending...anyway
>> the full story is referenced in
>>
>> commit 60815cf2e05057db5b78e398d9734c493560b11e
>> Merge tag 'for-linus' of 
>> git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux
>> which has a pointer to
>> http://marc.info/?i=54611D86.4040306%40de.ibm.com
>> which contains the full story.
> 
> Ok, got it. So I guess the behavior of forcing aligned accesses on aligned
> data is accidental, and allowing non-power-of-two arguments is also not
> the main purpose.


Right. The main purpose is to read/write _ONCE_. You can assume a somewhat
atomic access for sizes <= word size. And there are certainly places that
rely on that. But the *ONCE thing is mostly used for things where we used
barrier() 10 years ago.


> Maybe we could just bail out on new compilers if we get
> either of those? That might catch code that accidentally does something
> that is inherently non-atomic or that causes a trap when the intention was
> to have a simple atomic access.

I think Linus stated that it's ok to assume that the compiler is smart enough
to use a single instruction to access aligned and properly sized scalar types
for *ONCE.

Back then when I changed ACCESS_ONCE there were many places that did use it
for non-atomic, > word size accesses. For example on some architectures a pmd_t
is a typedef to an array, for which there is no way to read that atomically.
So the focus must be on the "ONCE" part.

If some code uses a properly aligned, word sized object we can also assume 
atomic access. If the access is not properly sized/aligned we do not get
atomicity, but we do get the "ONCE".
But adding a check for alignment/size would break the compilation of some
code.
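
Just to spell out the distinction, the "properly sized and aligned" case
could be described by a predicate like the one below. This is a sketch
with a made-up name, and it assumes sizeof(long) is the machine word size,
which holds on Linux. It is a runtime illustration only, not a proposal
for a build-time check, for the reason given above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * True when an object of `size` bytes at `p` has a power-of-two size no
 * larger than a machine word and is naturally aligned, i.e. the case
 * where atomic access can be assumed. Anything else still gets the
 * "ONCE" part but no atomicity.
 */
static bool once_access_may_be_atomic(const void *p, size_t size)
{
	if (size == 0 || size > sizeof(long))
		return false;
	if (size & (size - 1))
		return false;			/* not a power of two */
	return ((uintptr_t)p & (size - 1)) == 0;	/* naturally aligned */
}
```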





Re: [PATCH 00/26] bring back stack frame warning with KASAN

2017-03-03 Thread Alexander Potapenko
On Thu, Mar 2, 2017 at 5:38 PM, Arnd Bergmann  wrote:
> It took a long while to get this done, but I'm finally ready
> to send the first half of the KASAN stack size patches that
> I did in response to the kernelci.org warnings.
>
> As before, it's worth mentioning that things are generally worse
> with gcc-7.0.1 because of the addition of -fsanitize-address-use-after-scope
> that are not present on kernelci, so my randconfig testing found
> a lot more than kernelci did.
>
> The main areas are:
>
> - READ_ONCE/WRITE_ONCE cause problems in lots of code
> - typecheck() causes huge problems in a few places
> - I'm introducing "noinline_for_kasan" and use it in a lot
>   of places that suffer from inline functions with local variables
>   - netlink, as used in various parts of the kernel
>   - a number of drivers/media drivers
>   - a handful of wireless network drivers
> - kmemcheck conflicts with -fsanitize-address-use-after-scope
>
> This series lets us add back a stack frame warning for 3072 bytes
> with -fsanitize-address-use-after-scope, or 2048 bytes without it.
>
> I have a follow-up series that further reduces the stack frame
> warning limit to 1280 bytes for all 64-bit architectures, and
> 1536 bytes with basic KASAN support (no -fsanitize-address-use-after-scope).
> For now, I'm only posting the first half, in order to keep
> it (barely) reviewable.
Can you please elaborate on why you need this? Are you trying to
squeeze KASAN into some embedded device?
Noinlines sprayed over the codebase are hard to maintain, and certain
compiler changes may cause bloated stack frames in other places.
Maybe it should be enough to just increase the stack frame limit in
KASAN builds, as Dmitry suggested previously?
> Both series are tested with many hundred randconfig builds on both
> x86 and arm64, which are the only architectures supporting KASAN.
>
> Arnd
>
>  [PATCH 01/26] compiler: introduce noinline_for_kasan annotation
>  [PATCH 02/26] rewrite READ_ONCE/WRITE_ONCE
>  [PATCH 03/26] typecheck.h: avoid local variables in typecheck() macro
>  [PATCH 04/26] tty: kbd: reduce stack size with KASAN
>  [PATCH 05/26] netlink: mark nla_put_{u8,u16,u32} noinline_for_kasan
>  [PATCH 06/26] rocker: mark rocker_tlv_put_* functions as
>  [PATCH 07/26] brcmsmac: reduce stack size with KASAN
>  [PATCH 08/26] brcmsmac: make some local variables 'static const' to
>  [PATCH 09/26] brcmsmac: split up wlc_phy_workarounds_nphy
>  [PATCH 10/26] brcmsmac: reindent split functions
>  [PATCH 11/26] rtlwifi: reduce stack usage for KASAN
>  [PATCH 12/26] wl3501_cs: reduce stack size for KASAN
>  [PATCH 13/26] rtl8180: reduce stack size for KASAN
>  [PATCH 14/26] [media] dvb-frontends: reduce stack size in i2c access
>  [PATCH 15/26] [media] tuners: i2c: reduce stack usage for
>  [PATCH 16/26] [media] i2c: adv7604: mark register access as
>  [PATCH 17/26] [media] i2c: ks0127: reduce stack frame size for KASAN
>  [PATCH 18/26] [media] i2c: cx25840: avoid stack overflow with KASAN
>  [PATCH 19/26] [media] r820t: mark register functions as
>  [PATCH 20/26] [media] em28xx: split up em28xx_dvb_init to reduce
>  [PATCH 21/26] drm/bridge: ps8622: reduce stack size for KASAN
>  [PATCH 22/26] drm/i915/gvt: don't overflow the kernel stack with
>  [PATCH 23/26] mtd: cfi: reduce stack size with KASAN
>  [PATCH 24/26] ocfs2: reduce stack size with KASAN
>  [PATCH 25/26] isdn: eicon: mark divascapi incompatible with kasan
>  [PATCH 26/26] kasan: rework Kconfig settings
>
>  arch/x86/include/asm/switch_to.h |2 +-
>  drivers/gpu/drm/bridge/parade-ps8622.c   |2 +-
>  drivers/gpu/drm/i915/gvt/mmio.h  |   17 +-
>  drivers/isdn/hardware/eicon/Kconfig  |1 +
>  drivers/media/dvb-frontends/ascot2e.c|3 +-
>  drivers/media/dvb-frontends/cxd2841er.c  |4 +-
>  drivers/media/dvb-frontends/drx39xyj/drxj.c  |   14 +-
>  drivers/media/dvb-frontends/helene.c |4 +-
>  drivers/media/dvb-frontends/horus3a.c|2 +-
>  drivers/media/dvb-frontends/itd1000.c|2 +-
>  drivers/media/dvb-frontends/mt312.c  |2 +-
>  drivers/media/dvb-frontends/si2165.c |   14 +-
>  drivers/media/dvb-frontends/stb0899_drv.c|2 +-
>  drivers/media/dvb-frontends/stb6100.c|2 +-
>  drivers/media/dvb-frontends/stv0367.c|2 +-
>  drivers/media/dvb-frontends/stv090x.c|2 +-
>  drivers/media/dvb-frontends/stv6110.c|2 +-
>  drivers/media/dvb-frontends/stv6110x.c   |2 +-
>  drivers/media/dvb-frontends/tda8083.c|2 +-
>  

Re: [Bug 194749] New: kernel bonding does not work in a network nameservice in versions above 3.10.0-229.20.1

2017-03-03 Thread Nicolas Dichtel
Le 03/03/2017 à 17:03, Jiri Pirko a écrit :
> Fri, Mar 03, 2017 at 04:19:13PM CET, nicolas.dich...@6wind.com wrote:
>> Le 02/03/2017 à 21:39, Dan Geist a écrit :
[snip]
>>>> NETIF_F_NETNS_LOCAL was introduced for loopback device which
>>>> is created for each netns, it is not clear why we need to add it to bond
>>>> and bridge...
>>>
>>> Thank you for tracking this down. Without digging through the code to 
>>> figure it out, does this imply that the existence of a bond interface is 
>>> not possible AT ALL within a netns or simply that it may not be "migrated" 
>>> between the global scope and a netns?
>> It means that the migration is not possible. I think the only reason to have
>> this flag on bonding and bridge is the lack of test and fix. There is 
>> probably
>> some work to be done to have this feature. But are there real use cases of
>> x-netns bonding or x-netns bridge?
> 
> If that use case exists I believe it is an abuse. Soft devices that are
> by definition in upper-lower relationships with other devices should not
> move to other namespaces. Prevents all kinds of issues. If you need a
> soft device like bridge or bond within a namespace, just create it there.
> 
Note that vlan supports x-netns. And I think that the corresponding use cases
are valid ;-)
But I agree that for bonding and bridge it seems wrong.


Re: Dell Inspiron 5558/0VNM2T and suspend/resume problem with r8169

2017-03-03 Thread Diego Viola
On Fri, Mar 3, 2017 at 12:40 PM, Diego Viola  wrote:
> On Fri, Mar 3, 2017 at 12:37 PM, Diego Viola  wrote:
>> On Wed, Mar 1, 2017 at 12:47 PM, Diego Viola  wrote:
>>> On Wed, Mar 1, 2017 at 12:44 PM, Diego Viola  wrote:
>>>> My machine (a Dell Inspiron 5558 laptop) fails to resume from suspend
>>>> unless I rmmod r8169 first.
>>>>
>>>> Another workaround is to do this before suspend:
>>>>
>>>> echo 0 > /sys/power/pm_async
>>>>
>>>> I've been reproducing the freeze like this:
>>>>
>>>> $ i3lock && systemctl suspend
>>>>
>>>> I would have to repeat this at least 5 times for the freeze to occur,
>>>> but it seems to be easily reproducible.
>>>>
>>>> If I don't invoke i3lock, I cannot get the freeze to happen, but it
>>>> seems to happen with other lockers also.
>>>>
>>>> I have tried Alt+SysRq+r and tried to switch to another TTY but the
>>>> machine is always unresponsive, which indicates that it's a kernel
>>>> panic.
>>>>
>>>> I have had a similar issue to this about a year ago with the jme
>>>> driver and this was the fix:
>>>>
>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/net/ethernet/jme.c?id=ee50c130c82175eaa0820c96b6d3763928af2241
>>>>
>>>> I haven't tried getting a kernel trace yet, but all seems to indicate
>>>> the problem is caused by r8169, at least til now.
>>>>
>>>> Any ideas, please?
>>>>
>>>> Thanks,
>>>> Diego
>>>
>>> Sorry, I forgot to mention, I'm on Arch Linux (x86_64), kernel 
>>> 4.9.11-1-ARCH.
>>>
>>> Diego
>>
>> This is still a problem with Linux 4.10.1.
>>
>> Diego
>
> I got this trace in the journal while suspending, not sure if it's
> related to this problem:
>
> Mar 03 12:05:05 myhost kernel: PM: Preparing system for sleep (mem)
> Mar 03 12:05:05 myhost kernel: Freezing user space processes ...
> Mar 03 12:05:05 myhost kernel: usb 2-6: new full-speed USB device
> number 34 using xhci_hcd
> Mar 03 12:05:05 myhost kernel: sr 1:0:0:0: [sr0] tag#28
> UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
> Mar 03 12:05:05 myhost kernel: sr 1:0:0:0: [sr0] tag#28 Sense Key :
> 0x6 [current]
> Mar 03 12:05:05 myhost kernel: sr 1:0:0:0: [sr0] tag#28 ASC=0x28 ASCQ=0x0
> Mar 03 12:05:05 myhost kernel: sr 1:0:0:0: [sr0] tag#28 CDB:
> opcode=0x28 28 00 00 00 00 04 00 00 02 00
> Mar 03 12:05:05 myhost kernel: blk_update_request: I/O error, dev sr0, sector 
> 16
> Mar 03 12:05:05 myhost kernel: (elapsed 1.452 seconds) done.
> Mar 03 12:05:05 myhost kernel: Freezing remaining freezable tasks ...
> Mar 03 12:05:05 myhost kernel: [ cut here ]
> Mar 03 12:05:05 myhost kernel: WARNING: CPU: 3 PID: 2134 at
> drivers/base/firmware_class.c:1200 _request_firmware+0x2da/0xa70
> Mar 03 12:05:05 myhost kernel: Modules linked in: fuse nls_iso8859_1
> nls_cp437 vfat fat snd_hda_codec_hdmi rtsx_usb_ms memstick
> rtsx_usb_sdmmc dell_led snd_hda_codec_realtek snd_hda_codec_generic
> dell_laptop intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp
> kvm_int
> Mar 03 12:05:05 myhost kernel:  bluetooth cfg80211 aesni_intel
> aes_x86_64 crypto_simd glue_helper cryptd evdev intel_cstate
> input_leds intel_rapl_perf mac_hid i915 pcspkr snd snd_soc_sst_acpi
> thermal snd_soc_sst_match wmi battery ac97_bus soundcore fjes
> drm_kms_helper vide
> Mar 03 12:05:05 myhost kernel: CPU: 3 PID: 2134 Comm: kworker/3:5 Not
> tainted 4.10.1-ARCH #1
> Mar 03 12:05:05 myhost kernel: Hardware name: Dell Inc. Inspiron
> 5558/0VNM2T, BIOS A14 11/22/2016
> Mar 03 12:05:05 myhost kernel: Workqueue: usb_hub_wq hub_event [usbcore]
> Mar 03 12:05:05 myhost kernel: Call Trace:
> Mar 03 12:05:05 myhost kernel:  dump_stack+0x63/0x83
> Mar 03 12:05:05 myhost kernel:  __warn+0xcb/0xf0
> Mar 03 12:05:05 myhost kernel:  warn_slowpath_null+0x1d/0x20
> Mar 03 12:05:05 myhost kernel:  _request_firmware+0x2da/0xa70
> Mar 03 12:05:05 myhost kernel:  request_firmware+0x37/0x50
> Mar 03 12:05:05 myhost kernel:  ath3k_load_patch+0xc3/0x1e0 [ath3k]
> Mar 03 12:05:05 myhost kernel:  ath3k_probe+0x76/0x4ac [ath3k]
> Mar 03 12:05:05 myhost kernel:  ? __pm_runtime_set_status+0x1c0/0x2a0
> Mar 03 12:05:05 myhost kernel:  usb_probe_interface+0x159/0x2d0 [usbcore]
> Mar 03 12:05:05 myhost kernel:  driver_probe_device+0x2bb/0x460
> Mar 03 12:05:05 myhost kernel:  __device_attach_driver+0x8c/0x100
> Mar 03 12:05:05 myhost kernel:  ? __driver_attach+0xf0/0xf0
> Mar 03 12:05:05 myhost kernel:  bus_for_each_drv+0x67/0xb0
> Mar 03 12:05:05 myhost kernel:  __device_attach+0xdd/0x160
> Mar 03 12:05:05 myhost kernel:  device_initial_probe+0x13/0x20
> Mar 03 12:05:05 myhost kernel:  bus_probe_device+0x92/0xa0
> Mar 03 12:05:05 myhost kernel:  device_add+0x393/0x670
> Mar 03 12:05:05 myhost kernel:  usb_set_configuration+0x5f9/0x910 [usbcore]
> Mar 03 12:05:05 myhost kernel:  generic_probe+0x2e/0x80 [usbcore]
> Mar 03 12:05:05 myhost kernel:  usb_probe_device+0x2e/0x70 [usbcore]
> Mar 03 12:05:05 myhost kernel:  

Re: net/dccp: use-after-free in dccp_feat_activate_values

2017-03-03 Thread Eric Dumazet
On Fri, 2017-03-03 at 07:22 -0800, Eric Dumazet wrote:
> On Fri, Mar 3, 2017 at 7:12 AM, Dmitry Vyukov  wrote:
> > The first bot that picked this up started spewing:
> >
> > BUG: spinlock recursion on CPU#1, syz-executor2/9452
> 
> Yes. The bug is not about locking the listener, but protecting fields
> of struct dccp_request_sock
> 
> I will provide a patch, once I reach the office and after the breakfast ;)

OK here is what I suggest to fix the races.

diff --git a/include/linux/dccp.h b/include/linux/dccp.h
index 61d042bbbf607253033d9948b291cab2322814ba..68449293c4b6233c1a1d4133b1819376a9310225 100644
--- a/include/linux/dccp.h
+++ b/include/linux/dccp.h
@@ -163,6 +163,7 @@ struct dccp_request_sock {
__u64dreq_isr;
__u64dreq_gsr;
__be32   dreq_service;
+   spinlock_t   dreq_lock;
struct list_head dreq_featneg;
__u32dreq_timestamp_echo;
__u32dreq_timestamp_time;
diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c
index e267e6f4c9a5566b369a03a600a408e5bd41cbad..abd07a443219853b022bef41cb072e90ff8f07f0 100644
--- a/net/dccp/minisocks.c
+++ b/net/dccp/minisocks.c
@@ -142,6 +142,13 @@ struct sock *dccp_check_req(struct sock *sk, struct sk_buff *skb,
struct dccp_request_sock *dreq = dccp_rsk(req);
bool own_req;
 
+   /* TCP/DCCP listeners became lockless.
+* DCCP stores complex state in its request_sock, so we need
+* a protection for them, now this code runs without being protected
+* by the parent (listener) lock.
+*/
+   spin_lock_bh(&dreq->dreq_lock);
+
/* Check for retransmitted REQUEST */
if (dccp_hdr(skb)->dccph_type == DCCP_PKT_REQUEST) {
 
@@ -156,7 +163,7 @@ struct sock *dccp_check_req(struct sock *sk, struct sk_buff *skb,
inet_rtx_syn_ack(sk, req);
}
/* Network Duplicate, discard packet */
-   return NULL;
+   goto out;
}
 
DCCP_SKB_CB(skb)->dccpd_reset_code = DCCP_RESET_CODE_PACKET_ERROR;
@@ -182,20 +189,20 @@ struct sock *dccp_check_req(struct sock *sk, struct sk_buff *skb,
 
child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL,
 req, &own_req);
-   if (!child)
-   goto listen_overflow;
-
-   return inet_csk_complete_hashdance(sk, child, req, own_req);
+   if (child) {
+   child = inet_csk_complete_hashdance(sk, child, req, own_req);
+   goto out;
+   }
 
-listen_overflow:
-   dccp_pr_debug("listen_overflow!\n");
DCCP_SKB_CB(skb)->dccpd_reset_code = DCCP_RESET_CODE_TOO_BUSY;
 drop:
if (dccp_hdr(skb)->dccph_type != DCCP_PKT_RESET)
req->rsk_ops->send_reset(sk, skb);
 
inet_csk_reqsk_queue_drop(sk, req);
-   return NULL;
+out:
+   spin_unlock_bh(&dreq->dreq_lock);
+   return child;
 }
 
 EXPORT_SYMBOL_GPL(dccp_check_req);
@@ -246,6 +253,7 @@ int dccp_reqsk_init(struct request_sock *req,
 {
struct dccp_request_sock *dreq = dccp_rsk(req);
 
+   spin_lock_init(&dreq->dreq_lock);
inet_rsk(req)->ir_rmt_port = dccp_hdr(skb)->dccph_sport;
inet_rsk(req)->ir_num  = ntohs(dccp_hdr(skb)->dccph_dport);
inet_rsk(req)->acked   = 0;




[PATCH net 4/7] bnx2x: fix detection of VLAN filtering feature for VF

2017-03-03 Thread Michal Schmidt
VFs are currently missing the VLAN filtering feature, because we were
checking the PF's acquire response before actually performing the acquire.

Fix it by setting the feature flag later when we have the PF response.

Signed-off-by: Michal Schmidt 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index d57290b9ea..ac76fc251d 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -13292,17 +13292,15 @@ static int bnx2x_init_dev(struct bnx2x *bp, struct 
pci_dev *pdev,
dev->vlan_features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
NETIF_F_TSO | NETIF_F_TSO_ECN | NETIF_F_TSO6 | NETIF_F_HIGHDMA;
 
-   /* VF with OLD Hypervisor or old PF do not support filtering */
if (IS_PF(bp)) {
if (chip_is_e1x)
bp->accept_any_vlan = true;
else
dev->hw_features |= NETIF_F_HW_VLAN_CTAG_FILTER;
-#ifdef CONFIG_BNX2X_SRIOV
-   } else if (bp->acquire_resp.pfdev_info.pf_cap & PFVF_CAP_VLAN_FILTER) {
-   dev->hw_features |= NETIF_F_HW_VLAN_CTAG_FILTER;
-#endif
}
+   /* For VF we'll know whether to enable VLAN filtering after
+* getting a response to CHANNEL_TLV_ACQUIRE from PF.
+*/
 
dev->features |= dev->hw_features | NETIF_F_HW_VLAN_CTAG_RX;
dev->features |= NETIF_F_HIGHDMA;
@@ -14009,6 +14007,14 @@ static int bnx2x_init_one(struct pci_dev *pdev,
rc = bnx2x_vfpf_acquire(bp, tx_count, rx_count);
if (rc)
goto init_one_freemem;
+
+#ifdef CONFIG_BNX2X_SRIOV
+   /* VF with OLD Hypervisor or old PF do not support filtering */
+   if (bp->acquire_resp.pfdev_info.pf_cap & PFVF_CAP_VLAN_FILTER) {
+   dev->hw_features |= NETIF_F_HW_VLAN_CTAG_FILTER;
+   dev->features |= NETIF_F_HW_VLAN_CTAG_FILTER;
+   }
+#endif
}
 
/* Enable SRIOV if capability found in configuration space */
-- 
2.9.3



[PATCH net 2/7] bnx2x: lower verbosity of VF stats debug messages

2017-03-03 Thread Michal Schmidt
When BNX2X_MSG_IOV is enabled, the driver produces too many VF statistics
messages. Lower the verbosity of the VF stats messages similarly as in
commit 76ca70fabbdaa3 ("bnx2x: [Debug] change verbosity of some prints").

Signed-off-by: Michal Schmidt 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
index 6fad22adbb..9f0f851774 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
@@ -1899,7 +1899,8 @@ void bnx2x_iov_adjust_stats_req(struct bnx2x *bp)
continue;
}
 
-   DP(BNX2X_MSG_IOV, "add addresses for vf %d\n", vf->abs_vfid);
+   DP_AND((BNX2X_MSG_IOV | BNX2X_MSG_STATS),
+  "add addresses for vf %d\n", vf->abs_vfid);
for_each_vfq(vf, j) {
struct bnx2x_vf_queue *rxq = vfq_get(vf, j);
 
@@ -1920,11 +1921,12 @@ void bnx2x_iov_adjust_stats_req(struct bnx2x *bp)
cpu_to_le32(U64_HI(q_stats_addr));
cur_query_entry->address.lo =
cpu_to_le32(U64_LO(q_stats_addr));
-   DP(BNX2X_MSG_IOV,
-  "added address %x %x for vf %d queue %d client %d\n",
-  cur_query_entry->address.hi,
-  cur_query_entry->address.lo, cur_query_entry->funcID,
-  j, cur_query_entry->index);
+   DP_AND((BNX2X_MSG_IOV | BNX2X_MSG_STATS),
+  "added address %x %x for vf %d queue %d client %d\n",
+  cur_query_entry->address.hi,
+  cur_query_entry->address.lo,
+  cur_query_entry->funcID,
+  j, cur_query_entry->index);
cur_query_entry++;
cur_data_offset += sizeof(struct per_queue_stats);
stats_count++;
-- 
2.9.3



[PATCH net 7/7] bnx2x: add missing configuration of VF VLAN filters

2017-03-03 Thread Michal Schmidt
Configuring VLANs from the VF side had no effect, because the PF ignored
filters of type VFPF_VLAN_FILTER in the VF-PF message.

Add the missing filter type to configure.

Signed-off-by: Michal Schmidt 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
index c2d327d9df..76a4668c50 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
@@ -1777,6 +1777,23 @@ static int bnx2x_vf_mbx_qfilters(struct bnx2x *bp, 
struct bnx2x_virtf *vf)
goto op_err;
}
 
+   /* build vlan list */
+   fl = NULL;
+
+   rc = bnx2x_vf_mbx_macvlan_list(bp, vf, msg, &fl,
+  VFPF_VLAN_FILTER);
+   if (rc)
+   goto op_err;
+
+   if (fl) {
+   /* set vlan list */
+   rc = bnx2x_vf_mac_vlan_config_list(bp, vf, fl,
+  msg->vf_qid,
+  false);
+   if (rc)
+   goto op_err;
+   }
+
}
 
if (msg->flags & VFPF_SET_Q_FILTERS_RX_MASK_CHANGED) {
-- 
2.9.3



[PATCH net 1/7] bnx2x: prevent crash when accessing PTP with interface down

2017-03-03 Thread Michal Schmidt
It is possible to crash the kernel by accessing a PTP device while its
associated bnx2x interface is down. Before the interface is brought up,
the timecounter is not initialized, so accessing it results in NULL
dereference.

Fix it by checking if the interface is up.

Use -ENETDOWN as the error code when the interface is down.
 -EFAULT in bnx2x_ptp_adjfreq() did not seem right.

Tested using phc_ctl get/set/adj/freq commands.

Signed-off-by: Michal Schmidt 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 20 +++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index d8d06fdfc4..d57290b9ea 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -13738,7 +13738,7 @@ static int bnx2x_ptp_adjfreq(struct ptp_clock_info 
*ptp, s32 ppb)
if (!netif_running(bp->dev)) {
DP(BNX2X_MSG_PTP,
   "PTP adjfreq called while the interface is down\n");
-   return -EFAULT;
+   return -ENETDOWN;
}
 
if (ppb < 0) {
@@ -13797,6 +13797,12 @@ static int bnx2x_ptp_adjtime(struct ptp_clock_info 
*ptp, s64 delta)
 {
struct bnx2x *bp = container_of(ptp, struct bnx2x, ptp_clock_info);
 
+   if (!netif_running(bp->dev)) {
+   DP(BNX2X_MSG_PTP,
+  "PTP adjtime called while the interface is down\n");
+   return -ENETDOWN;
+   }
+
DP(BNX2X_MSG_PTP, "PTP adjtime called, delta = %llx\n", delta);
 
timecounter_adjtime(&bp->timecounter, delta);
@@ -13809,6 +13815,12 @@ static int bnx2x_ptp_gettime(struct ptp_clock_info 
*ptp, struct timespec64 *ts)
struct bnx2x *bp = container_of(ptp, struct bnx2x, ptp_clock_info);
u64 ns;
 
+   if (!netif_running(bp->dev)) {
+   DP(BNX2X_MSG_PTP,
+  "PTP gettime called while the interface is down\n");
+   return -ENETDOWN;
+   }
+
ns = timecounter_read(&bp->timecounter);
 
DP(BNX2X_MSG_PTP, "PTP gettime called, ns = %llu\n", ns);
@@ -13824,6 +13836,12 @@ static int bnx2x_ptp_settime(struct ptp_clock_info 
*ptp,
struct bnx2x *bp = container_of(ptp, struct bnx2x, ptp_clock_info);
u64 ns;
 
+   if (!netif_running(bp->dev)) {
+   DP(BNX2X_MSG_PTP,
+  "PTP settime called while the interface is down\n");
+   return -ENETDOWN;
+   }
+
ns = timespec64_to_ns(ts);
 
DP(BNX2X_MSG_PTP, "PTP settime called, ns = %llu\n", ns);
-- 
2.9.3
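
The guard added to each PTP callback above can be illustrated outside the
driver. This is only a minimal sketch under stated assumptions: struct nic,
struct clock_state and nic_ptp_gettime() are hypothetical names, and the
uninitialized timecounter is modelled here as a NULL pointer. None of this
is the bnx2x code itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for clock state that only exists while the NIC is up. */
struct clock_state {
	uint64_t ns;
};

/* Hypothetical device model; these are not the bnx2x structures. */
struct nic {
	bool running;            /* models what netif_running() reports */
	struct clock_state *clk; /* assumed NULL until the interface is brought up */
};

/* Read the clock, refusing early instead of touching state that is not set up. */
static int nic_ptp_gettime(const struct nic *nic, uint64_t *ns)
{
	assert(nic != NULL && ns != NULL);

	if (!nic->running)
		return -ENETDOWN; /* same error code the patch settles on */

	*ns = nic->clk->ns;
	return 0;
}
```

With the check in place, a caller that queries a downed interface gets
-ENETDOWN back and the clock state is never dereferenced.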



[PATCH net 5/7] bnx2x: do not rollback VF MAC/VLAN filters we did not configure

2017-03-03 Thread Michal Schmidt
On failure to configure a VF MAC/VLAN filter we should not attempt to
rollback filters that we failed to configure with -EEXIST.

Signed-off-by: Michal Schmidt 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c | 8 +++-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h | 1 +
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
index 9f0f851774..2068bb8f54 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
@@ -434,7 +434,9 @@ static int bnx2x_vf_mac_vlan_config(struct bnx2x *bp,
 
/* Add/Remove the filter */
rc = bnx2x_config_vlan_mac(bp, );
-   if (rc && rc != -EEXIST) {
+   if (rc == -EEXIST)
+   return 0;
+   if (rc) {
BNX2X_ERR("Failed to %s %s\n",
  filter->add ? "add" : "delete",
  (filter->type == BNX2X_VF_FILTER_VLAN_MAC) ?
@@ -444,6 +446,8 @@ static int bnx2x_vf_mac_vlan_config(struct bnx2x *bp,
return rc;
}
 
+   filter->applied = true;
+
return 0;
 }
 
@@ -471,6 +475,8 @@ int bnx2x_vf_mac_vlan_config_list(struct bnx2x *bp, struct 
bnx2x_virtf *vf,
BNX2X_ERR("Managed only %d/%d filters - rolling back\n",
  i, filters->count + 1);
while (--i >= 0) {
+   if (!filters->filters[i].applied)
+   continue;
filters->filters[i].add = !filters->filters[i].add;
bnx2x_vf_mac_vlan_config(bp, vf, qid,
 &filters->filters[i],
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h
index 7a6d406f4c..888d0b6632 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h
@@ -114,6 +114,7 @@ struct bnx2x_vf_mac_vlan_filter {
(BNX2X_VF_FILTER_MAC | BNX2X_VF_FILTER_VLAN) /*shortcut*/
 
bool add;
+   bool applied;
u8 *mac;
u16 vid;
 };
-- 
2.9.3
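
The rollback rule in this patch (undo only what was really applied) can be
sketched in isolation. This is a toy model under stated assumptions: struct
table, struct filter, config_one() and config_list() are hypothetical names,
and "filter already exists" is modelled as a silent success that leaves
applied unset, mirroring the -EEXIST handling above. It is not the driver
code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TABLE_SIZE 8 /* hypothetical table size */

/* Hypothetical filter table: present[id] is true when filter id is installed. */
struct table {
	bool present[TABLE_SIZE];
};

struct filter {
	int id;
	bool add;     /* true = add, false = delete */
	bool applied; /* set only when this request changed the table */
};

/* Apply one filter; an already-present filter is success but not "applied". */
static int config_one(struct table *t, struct filter *f)
{
	if (f->id < 0 || f->id >= TABLE_SIZE)
		return -EINVAL;

	if (f->add) {
		if (t->present[f->id])
			return 0; /* existed before us: nothing to roll back later */
		t->present[f->id] = true;
	} else {
		if (!t->present[f->id])
			return -ENOENT;
		t->present[f->id] = false;
	}

	f->applied = true;
	return 0;
}

/* Apply a list; on failure, revert only the entries that were applied. */
static int config_list(struct table *t, struct filter *f, int count)
{
	int i, rc = 0;

	assert(t != NULL && f != NULL);

	for (i = 0; i < count; i++) {
		rc = config_one(t, &f[i]);
		if (rc)
			break;
	}

	if (i != count) {
		while (--i >= 0) {
			if (!f[i].applied)
				continue;
			f[i].add = !f[i].add;
			config_one(t, &f[i]);
		}
	}

	return rc;
}
```

Without the applied check, a failed batch would also delete filters that were
present before the batch started, which is the situation the patch avoids.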



[PATCH net 3/7] bnx2x: fix possible overrun of VFPF multicast addresses array

2017-03-03 Thread Michal Schmidt
It is too late to check for the limit of the number of VF multicast
addresses after they have already been copied to the req->multicast[]
array, possibly overflowing it.

Do the check before copying.

Also fix the error path to not skip unlocking vf2pf_mutex.

Signed-off-by: Michal Schmidt 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c | 23 +++
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
index bfae300cf2..c2d327d9df 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
@@ -868,7 +868,7 @@ int bnx2x_vfpf_set_mcast(struct net_device *dev)
struct bnx2x *bp = netdev_priv(dev);
struct vfpf_set_q_filters_tlv *req = &bp->vf2pf_mbox->req.set_q_filters;
struct pfvf_general_resp_tlv *resp = &bp->vf2pf_mbox->resp.general_resp;
-   int rc, i = 0;
+   int rc = 0, i = 0;
struct netdev_hw_addr *ha;
 
if (bp->state != BNX2X_STATE_OPEN) {
@@ -883,6 +883,15 @@ int bnx2x_vfpf_set_mcast(struct net_device *dev)
/* Get Rx mode requested */
DP(NETIF_MSG_IFUP, "dev->flags = %x\n", dev->flags);
 
+   /* We support PFVF_MAX_MULTICAST_PER_VF mcast addresses tops */
+   if (netdev_mc_count(dev) > PFVF_MAX_MULTICAST_PER_VF) {
+   DP(NETIF_MSG_IFUP,
+  "VF supports not more than %d multicast MAC addresses\n",
+  PFVF_MAX_MULTICAST_PER_VF);
+   rc = -EINVAL;
+   goto out;
+   }
+
netdev_for_each_mc_addr(ha, dev) {
DP(NETIF_MSG_IFUP, "Adding mcast MAC: %pM\n",
   bnx2x_mc_addr(ha));
@@ -890,16 +899,6 @@ int bnx2x_vfpf_set_mcast(struct net_device *dev)
i++;
}
 
-   /* We support four PFVF_MAX_MULTICAST_PER_VF mcast
- * addresses tops
- */
-   if (i >= PFVF_MAX_MULTICAST_PER_VF) {
-   DP(NETIF_MSG_IFUP,
-  "VF supports not more than %d multicast MAC addresses\n",
-  PFVF_MAX_MULTICAST_PER_VF);
-   return -EINVAL;
-   }
-
req->n_multicast = i;
req->flags |= VFPF_SET_Q_FILTERS_MULTICAST_CHANGED;
req->vf_qid = 0;
@@ -924,7 +923,7 @@ int bnx2x_vfpf_set_mcast(struct net_device *dev)
 out:
bnx2x_vfpf_finalize(bp, &req->first_tlv);
 
-   return 0;
+   return rc;
 }
 
 /* request pf to add a vlan for the vf */
-- 
2.9.3
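
The two points of this patch (check the limit before copying, and leave
through a single exit that always unlocks) can be sketched as below. This is
an illustration only: struct mcast_req, set_mcast(), MAX_MCAST and the boolean
standing in for vf2pf_mutex are all hypothetical, not the driver's types.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_MCAST 4 /* hypothetical limit, stands in for PFVF_MAX_MULTICAST_PER_VF */
#define ADDR_LEN  6

struct mcast_req {
	bool locked; /* toy stand-in for a mutex */
	unsigned char addrs[MAX_MCAST][ADDR_LEN];
	int n_addrs;
};

/* Copy count addresses (count * ADDR_LEN bytes at src) into the request. */
static int set_mcast(struct mcast_req *req, const unsigned char *src, int count)
{
	int rc = 0;

	assert(req != NULL && !req->locked);
	req->locked = true;

	/* Validate first, so the fixed-size array can never be overrun. */
	if (count < 0 || count > MAX_MCAST) {
		rc = -EINVAL;
		goto out;
	}

	if (count > 0)
		memcpy(req->addrs, src, (size_t)count * ADDR_LEN);
	req->n_addrs = count;

out:
	/* Every path, including the error path, releases the lock. */
	req->locked = false;
	return rc;
}
```

Routing the error through the out label instead of returning directly is what
keeps the lock from being left held.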



[PATCH net 0/7] bnx2x: PTP crash, VF VLAN fixes

2017-03-03 Thread Michal Schmidt
Hello,
here are fixes for a crash with PTP, a crash in setting of VF multicast
addresses, and non-working VLAN filters configuration from the VF side.

Michal Schmidt (7):
  bnx2x: prevent crash when accessing PTP with interface down
  bnx2x: lower verbosity of VF stats debug messages
  bnx2x: fix possible overrun of VFPF multicast addresses array
  bnx2x: fix detection of VLAN filtering feature for VF
  bnx2x: do not rollback VF MAC/VLAN filters we did not configure
  bnx2x: fix incorrect filter count in an error message
  bnx2x: add missing configuration of VF VLAN filters

 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c  | 36 
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c | 15 ++---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h |  1 +
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c  | 40 ---
 4 files changed, 70 insertions(+), 22 deletions(-)

-- 
2.9.3



[PATCH net 6/7] bnx2x: fix incorrect filter count in an error message

2017-03-03 Thread Michal Schmidt
filters->count is the number of filters we were supposed to configure.
There is no reason to increase it by +1 when printing the count in an error
message.

Signed-off-by: Michal Schmidt 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
index 2068bb8f54..bdfd53b46b 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
@@ -473,7 +473,7 @@ int bnx2x_vf_mac_vlan_config_list(struct bnx2x *bp, struct 
bnx2x_virtf *vf,
/* Rollback if needed */
if (i != filters->count) {
BNX2X_ERR("Managed only %d/%d filters - rolling back\n",
- i, filters->count + 1);
+ i, filters->count);
while (--i >= 0) {
if (!filters->filters[i].applied)
continue;
-- 
2.9.3



RE: [net 2/2] ixgbe: Limit use of 2K buffers on architectures with 256B or larger cache lines

2017-03-03 Thread Duyck, Alexander H
> -Original Message-
> From: David Laight [mailto:david.lai...@aculab.com]
> Sent: Friday, March 3, 2017 4:25 AM
> To: Kirsher, Jeffrey T ; da...@davemloft.net
> Cc: Duyck, Alexander H ;
> netdev@vger.kernel.org; nhor...@redhat.com; sassm...@redhat.com;
> jogre...@redhat.com
> Subject: RE: [net 2/2] ixgbe: Limit use of 2K buffers on architectures with 
> 256B
> or larger cache lines
> 
> From: Jeff Kirsher
> > Sent: 03 March 2017 02:25
> > From: Alexander Duyck 
> >
> > On architectures that have a cache line size larger than 64 Bytes we
> > start running into issues where the amount of headroom for the frame
> > starts shrinking.
> >
> > The size of skb_shared_info on a system with a 64B L1 cache line size
> > is 320.  This increases to 384 with a 128B cache line, and 512 with a
> > 256B cache line.
> 
> Perhaps some of the CACHE_LINE_ALIGNED markers don't actually need to
> force alignment with large line sizes?
> 
> I realise some things have hard requirements for cache alignment (eg non-
> coherent dma), but others are just there to limit the number of cache lines 
> read
> and/or dirtied.
> 
>   David

For our purposes I think this works well enough.  Basically we wanted to 
guarantee we have enough headroom for XDP.  In the case of the Mellanox drivers 
they are guaranteeing 256 if I recall correctly.

I have some follow-up patches for net-next that will make it so that we can 
just do a build-time test that will determine the padding size and allow us to 
always guarantee at least NET_SKB_PAD + NET_IP_ALIGN.

- Alex
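
The sizes quoted above (320, 384 and 512 bytes) are what you get when one
fixed size is rounded up to the cache line size. A minimal sketch of that
arithmetic follows; align_up() is a hypothetical helper and the 320-byte
starting size is an assumption chosen only to reproduce the numbers in this
thread, not a value taken from the ixgbe patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round size up to a power-of-two alignment. */
static size_t align_up(size_t size, size_t align)
{
	/* The mask trick below is only valid for power-of-two alignments. */
	assert(align != 0 && (align & (align - 1)) == 0);

	return (size + align - 1) & ~(align - 1);
}
```

For example, align_up(320, 64) stays at 320, align_up(320, 128) gives 384 and
align_up(320, 256) gives 512, so every increase comes straight out of the
headroom left in a fixed-size receive buffer.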




Re: [PATCH net] sctp: change to save MSG_MORE flag into assoc

2017-03-03 Thread Xin Long
On Fri, Mar 3, 2017 at 8:49 PM, David Laight  wrote:
> From: Xin Long
>> Sent: 03 March 2017 06:24
>> David Laight noticed the support for MSG_MORE with datamsg->force_delay
>> didn't really work as we expected, as the first msg with MSG_MORE set
>> would always block the following chunks' dequeuing.
>>
>> This Patch is to rewrite it by saving the MSG_MORE flag into assoc as
>> Divid Laight suggested.
>^ typo
ah, sorry. :P
>
>> asoc->force_delay is used to save MSG_MORE flag before a msg is sent.
>> Once this msg is queued, asoc->force_delay is set back to 0, so that
>> it will not affect other places flushing out queue.
>
> That doesn't seem right nor make sense.
>
>> asoc->force_delay works as a 'local param' here as the msg sending is
>> under protection of sock lock.  It would make sctp's MSG_MORE work as
>> tcp's.
>
> It is much more important to get MSG_MORE working 'properly' for SCTP
> than for TCP. For TCP an application can always use a long send.
"long send" ?, you mean bigger data, or keeping sending?
I didn't get the difference between SCTP and TCP, they
are similar when sending data.

>
> ...
>> @@ -1982,6 +1982,7 @@ static int sctp_sendmsg(struct sock *sk, struct msghdr 
>> *msg, size_t msg_len)
>>* breaks.
>>*/
>>   err = sctp_primitive_SEND(net, asoc, datamsg);
>> + asoc->force_delay = 0;
>>   /* Did the lower layer accept the chunk? */
>>   if (err) {
>>   sctp_datamsg_free(datamsg);
>
> I don't think this is right - or needed.
> You only get to the above if some test has decided to send data chunks.
> So it just means that the NEXT time someone tries to send data all the
> queued data gets sent.
the NEXT time someone tries to send data with "MSG_MORE clear",
yes, but with "MSG_MORE set", it will still delay.

> I'm guessing that the whole thing gets called in a loop (definitely needed
> for very long data chunks, or after the window is opened).
yes, if users keep sending data chunks with MSG_MORE set, no
data with "MSG_MORE clear" gap.

> Now if an application sends a lot of (say) 100 byte chunks with MSG_MORE
> set it would expect to see a lot of full ethernet frames be sent.
right.
> With the above a frame will be sent (containing all but 1 chunk) when the
> amount of queued data becomes too large for an ethernet frame, and immediately
> followed by a second ethernet frame with 1 chunk in it.
"followed by a second ethernet frame with 1 chunk in it.", I think this's
what you're really worried about, right ?
But sctp does NOT flush the data queue the way you think; it doesn't keep
traversing the queue until the queue is empty.
Once a packet with chunks in one ethernet frame is sent, sctp_outq_flush
will return. It will pack chunks and send the next packet only when some
other 'event' triggers it, like retransmission or data received from peer.
I don't think this is a problem.

>
> Now it might be that the flag needs clearing when retransmissions are queued.
> OTOH they might get sent for other reasons.
Before we really overthought about MSG_MORE, no need to care about
retransmissions, define MSG_MORE, in my opinion, it works more for
*inflight is 0*, if it's not 0, we shouldn't stop other places flushing them.

We cannot let asoc's more_more flag work as global, it will block elsewhere
sending data chunks, not only sctp_sendmsg.

Thanks

>
> David
>
>


Re: net/dccp: use-after-free in dccp_feat_activate_values

2017-03-03 Thread Eric Dumazet
On Fri, 2017-03-03 at 15:11 +0100, Dmitry Vyukov wrote:
> On Mon, Feb 13, 2017 at 11:29 PM, Cong Wang  wrote:
> > On Mon, Feb 13, 2017 at 11:19 AM, Andrey Konovalov
> >  wrote:
> >> Hi,
> >>
> >> I've got the following error report while fuzzing the kernel with 
> >> syzkaller.
> >>
> >> On commit 926af6273fc683cd98cd0ce7bf0d04a02eed6742.
> >>
> >> A reproducer and .config are attached.
> >> Note, that it takes quite some time to trigger the bug (up to 10 minutes).
> >>
> >> BUG: KASAN: use-after-free in dccp_feat_activate_values+0x967/0xab0
> >> net/dccp/feat.c:1541 at addr 88003713be68
> >> Read of size 8 by task syz-executor2/8457
> >> CPU: 2 PID: 8457 Comm: syz-executor2 Not tainted 4.10.0-rc7+ #127
> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 
> >> 01/01/2011
> >> Call Trace:
> >>  
> >>  __dump_stack lib/dump_stack.c:15 [inline]
> >>  dump_stack+0x292/0x398 lib/dump_stack.c:51
> >>  kasan_object_err+0x1c/0x70 mm/kasan/report.c:162
> >>  print_address_description mm/kasan/report.c:200 [inline]
> >>  kasan_report_error mm/kasan/report.c:289 [inline]
> >>  kasan_report.part.1+0x20e/0x4e0 mm/kasan/report.c:311
> >>  kasan_report mm/kasan/report.c:332 [inline]
> >>  __asan_report_load8_noabort+0x29/0x30 mm/kasan/report.c:332
> >>  dccp_feat_activate_values+0x967/0xab0 net/dccp/feat.c:1541
> >>  dccp_create_openreq_child+0x464/0x610 net/dccp/minisocks.c:121
> >>  dccp_v6_request_recv_sock+0x1f6/0x1960 net/dccp/ipv6.c:457
> >>  dccp_check_req+0x335/0x5a0 net/dccp/minisocks.c:186
> >>  dccp_v6_rcv+0x69e/0x1d00 net/dccp/ipv6.c:711
> >>  ip6_input_finish+0x46d/0x17a0 net/ipv6/ip6_input.c:279
> >>  NF_HOOK include/linux/netfilter.h:257 [inline]
> >>  ip6_input+0xdb/0x590 net/ipv6/ip6_input.c:322
> >>  dst_input include/net/dst.h:507 [inline]
> >>  ip6_rcv_finish+0x289/0x890 net/ipv6/ip6_input.c:69
> >>  NF_HOOK include/linux/netfilter.h:257 [inline]
> >>  ipv6_rcv+0x12ec/0x23d0 net/ipv6/ip6_input.c:203
> >>  __netif_receive_skb_core+0x1ae5/0x3400 net/core/dev.c:4190
> >>  __netif_receive_skb+0x2a/0x170 net/core/dev.c:4228
> >>  process_backlog+0xe5/0x6c0 net/core/dev.c:4839
> >>  napi_poll net/core/dev.c:5202 [inline]
> >>  net_rx_action+0xe70/0x1900 net/core/dev.c:5267
> >>  __do_softirq+0x2fb/0xb7d kernel/softirq.c:284
> >>  do_softirq_own_stack+0x1c/0x30 arch/x86/entry/entry_64.S:902
> >
> >
> > Seems there is a race condition between iterating dccp_feat_entry
> > and freeing it, bh_lock_sock() seems not held in this path.
> 
> 
> 
> Cong, where exactly do we need to add bh_lock_sock()?
> 
> I am still seeing this on 4977ab6e92e267afe9d8f78438c3db330ca8434c


I would try :

diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 
409d0cfd34474812c3bf74f26cd423a3d65ee441..5a8b5ac5edaaf35428ab04cc810d98310bd169ed
 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -482,7 +482,9 @@ static int dccp_v4_send_response(const struct sock *sk, 
struct request_sock *req
if (dst == NULL)
goto out;
 
+   bh_lock_sock(sk);
skb = dccp_make_response(sk, dst, req);
+   bh_unlock_sock(sk);
if (skb != NULL) {
const struct inet_request_sock *ireq = inet_rsk(req);
struct dccp_hdr *dh = dccp_hdr(skb);
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 
233b57367758c64c09ed40f7359cb8fcb1918d93..e89cc88d14c22d411a91afab093e209fcbb816d8
 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -214,7 +214,9 @@ static int dccp_v6_send_response(const struct sock *sk, 
struct request_sock *req
goto done;
}
 
+   bh_lock_sock(sk);
skb = dccp_make_response(sk, dst, req);
+   bh_unlock_sock(sk);
if (skb != NULL) {
struct dccp_hdr *dh = dccp_hdr(skb);
struct ipv6_txoptions *opt;




Re: Dell Inspiron 5558/0VNM2T and suspend/resume problem with r8169

2017-03-03 Thread Diego Viola
On Fri, Mar 3, 2017 at 12:37 PM, Diego Viola  wrote:
> On Wed, Mar 1, 2017 at 12:47 PM, Diego Viola  wrote:
>> On Wed, Mar 1, 2017 at 12:44 PM, Diego Viola  wrote:
>>> My machine (a Dell Inspiron 5558 laptop) fails to resume from suspend
>>> unless I rmmod r8169 first.
>>>
>>> Another workaround is to do this before suspend:
>>>
>>> echo 0 > /sys/power/pm_async
>>>
>>> I've been reproducing the freeze like this:
>>>
>>> $ i3lock && systemctl suspend
>>>
>>> I would have to repeat this at least 5 times for the freeze to occur,
>>> but it seems to be easily reproducible.
>>>
>>> If I don't invoke i3lock, I cannot get the freeze to happen, but it
>>> seems to happen with other lockers also.
>>>
>>> I have tried Alt+SysRq+r and tried to switch to another TTY but the
>>> machine is always unresponsive, which indicates that it's a kernel
>>> panic.
>>>
>>> I have had a similar issue to this about a year ago with the jme
>>> driver and this was the fix:
>>>
>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/net/ethernet/jme.c?id=ee50c130c82175eaa0820c96b6d3763928af2241
>>>
>>> I haven't tried getting a kernel trace yet, but all seems to indicate
>>> the problem is caused by r8169, at least til now.
>>>
>>> Any ideas, please?
>>>
>>> Thanks,
>>> Diego
>>
>> Sorry, I forgot to mention, I'm on Arch Linux (x86_64), kernel 4.9.11-1-ARCH.
>>
>> Diego
>
> This is still a problem with Linux 4.10.1.
>
> Diego

I got this trace in the journal while suspending, not sure if it's
related to this problem:

Mar 03 12:05:05 myhost kernel: PM: Preparing system for sleep (mem)
Mar 03 12:05:05 myhost kernel: Freezing user space processes ...
Mar 03 12:05:05 myhost kernel: usb 2-6: new full-speed USB device
number 34 using xhci_hcd
Mar 03 12:05:05 myhost kernel: sr 1:0:0:0: [sr0] tag#28
UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Mar 03 12:05:05 myhost kernel: sr 1:0:0:0: [sr0] tag#28 Sense Key :
0x6 [current]
Mar 03 12:05:05 myhost kernel: sr 1:0:0:0: [sr0] tag#28 ASC=0x28 ASCQ=0x0
Mar 03 12:05:05 myhost kernel: sr 1:0:0:0: [sr0] tag#28 CDB:
opcode=0x28 28 00 00 00 00 04 00 00 02 00
Mar 03 12:05:05 myhost kernel: blk_update_request: I/O error, dev sr0, sector 16
Mar 03 12:05:05 myhost kernel: (elapsed 1.452 seconds) done.
Mar 03 12:05:05 myhost kernel: Freezing remaining freezable tasks ...
Mar 03 12:05:05 myhost kernel: ------------[ cut here ]------------
Mar 03 12:05:05 myhost kernel: WARNING: CPU: 3 PID: 2134 at
drivers/base/firmware_class.c:1200 _request_firmware+0x2da/0xa70
Mar 03 12:05:05 myhost kernel: Modules linked in: fuse nls_iso8859_1
nls_cp437 vfat fat snd_hda_codec_hdmi rtsx_usb_ms memstick
rtsx_usb_sdmmc dell_led snd_hda_codec_realtek snd_hda_codec_generic
dell_laptop intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp
kvm_int
Mar 03 12:05:05 myhost kernel:  bluetooth cfg80211 aesni_intel
aes_x86_64 crypto_simd glue_helper cryptd evdev intel_cstate
input_leds intel_rapl_perf mac_hid i915 pcspkr snd snd_soc_sst_acpi
thermal snd_soc_sst_match wmi battery ac97_bus soundcore fjes
drm_kms_helper vide
Mar 03 12:05:05 myhost kernel: CPU: 3 PID: 2134 Comm: kworker/3:5 Not
tainted 4.10.1-ARCH #1
Mar 03 12:05:05 myhost kernel: Hardware name: Dell Inc. Inspiron
5558/0VNM2T, BIOS A14 11/22/2016
Mar 03 12:05:05 myhost kernel: Workqueue: usb_hub_wq hub_event [usbcore]
Mar 03 12:05:05 myhost kernel: Call Trace:
Mar 03 12:05:05 myhost kernel:  dump_stack+0x63/0x83
Mar 03 12:05:05 myhost kernel:  __warn+0xcb/0xf0
Mar 03 12:05:05 myhost kernel:  warn_slowpath_null+0x1d/0x20
Mar 03 12:05:05 myhost kernel:  _request_firmware+0x2da/0xa70
Mar 03 12:05:05 myhost kernel:  request_firmware+0x37/0x50
Mar 03 12:05:05 myhost kernel:  ath3k_load_patch+0xc3/0x1e0 [ath3k]
Mar 03 12:05:05 myhost kernel:  ath3k_probe+0x76/0x4ac [ath3k]
Mar 03 12:05:05 myhost kernel:  ? __pm_runtime_set_status+0x1c0/0x2a0
Mar 03 12:05:05 myhost kernel:  usb_probe_interface+0x159/0x2d0 [usbcore]
Mar 03 12:05:05 myhost kernel:  driver_probe_device+0x2bb/0x460
Mar 03 12:05:05 myhost kernel:  __device_attach_driver+0x8c/0x100
Mar 03 12:05:05 myhost kernel:  ? __driver_attach+0xf0/0xf0
Mar 03 12:05:05 myhost kernel:  bus_for_each_drv+0x67/0xb0
Mar 03 12:05:05 myhost kernel:  __device_attach+0xdd/0x160
Mar 03 12:05:05 myhost kernel:  device_initial_probe+0x13/0x20
Mar 03 12:05:05 myhost kernel:  bus_probe_device+0x92/0xa0
Mar 03 12:05:05 myhost kernel:  device_add+0x393/0x670
Mar 03 12:05:05 myhost kernel:  usb_set_configuration+0x5f9/0x910 [usbcore]
Mar 03 12:05:05 myhost kernel:  generic_probe+0x2e/0x80 [usbcore]
Mar 03 12:05:05 myhost kernel:  usb_probe_device+0x2e/0x70 [usbcore]
Mar 03 12:05:05 myhost kernel:  driver_probe_device+0x2bb/0x460
Mar 03 12:05:05 myhost kernel:  __device_attach_driver+0x8c/0x100
Mar 03 12:05:05 myhost kernel:  ? __driver_attach+0xf0/0xf0
Mar 03 12:05:05 myhost kernel:  bus_for_each_drv+0x67/0xb0
Mar 03 12:05:05 myhost 

Re: usb/net/hso: WARNING: ungligned urb->setup_dma

2017-03-03 Thread Stefan Wahren
Hi Baruch,

On 01.03.2017 at 11:54, Baruch Siach wrote:
> Hi Stefan,
>
> On Tue, Feb 28, 2017 at 07:32:09PM +0100, Stefan Wahren wrote: 
>>> Baruch Siach wrote on 28 February 2017 at 19:07:
>>> On Tue, Feb 28, 2017 at 05:21:18PM +0100, Stefan Wahren wrote:
>>>> On 28.02.2017 at 13:01, Baruch Siach wrote:
>>>>> On Tue, Feb 28, 2017 at 10:28:10AM +0200, Baruch Siach wrote:
>>>>>> I'm hitting this warning consistently on my Raspberry Pi 3 running kernel
>>>>>> v4.10.1 with some unrelated device tree changes, and a debug print (below).
>>>>>> The device identifies as "GlobeTrotter HSDPA Modem", VID: 0af0, PID: 6971.
>>>>>> The warning triggers consistently on first write access to /dev/ttyHS0 that
>>>>>> ModemManager attempts. The first line in the log is my debug print.
>>>>> I tested the same hardware successfully on an i.MX6 CuBox-i (ARM32) using the
>>>>> same kernel version (4.10.1), and on an x86_64 PC (4.9). So this seems to be
>>>>> platform specific. I don't have any other ARM64 machine at the moment, though.
>>>> those platforms usually don't use the dwc2 USB host controller. So it
>>>> should be tested with another dwc2 platform.
>>> The code that initializes setup_dma is not under drivers/usb/dwc2/. Though 
>>> the 
>>> problem looks like memory corruption, so its cause might be anywhere.
>> only a suspicion, but could you please try this patch [1]?
>>
>> [1] - https://patchwork.kernel.org/patch/9166771/
> It doesn't change anything.
>
> My guess is that source of the issue is memory corruption that just happens 
> to 
> corrupt also the setup_dma field of struct urb. In other words, it has 
> nothing 
> to do with DMA, IMO.

Maybe you could use

CONFIG_DMA_API_DEBUG

or

CONFIG_SLUB_DEBUG

in order to find the source?

>
> Thanks,
> baruch
>




Re: Dell Inspiron 5558/0VNM2T and suspend/resume problem with r8169

2017-03-03 Thread Diego Viola
On Wed, Mar 1, 2017 at 12:47 PM, Diego Viola  wrote:
> On Wed, Mar 1, 2017 at 12:44 PM, Diego Viola  wrote:
>> My machine (a Dell Inspiron 5558 laptop) fails to resume from suspend
>> unless I rmmod r8169 first.
>>
>> Another workaround is to do this before suspend:
>>
>> echo 0 > /sys/power/pm_async
>>
>> I've been reproducing the freeze like this:
>>
>> $ i3lock && systemctl suspend
>>
>> I would have to repeat this at least 5 times for the freeze to occur,
>> but it seems to be easily reproducible.
>>
>> If I don't invoke i3lock, I cannot get the freeze to happen, but it
>> seems to happen with other lockers also.
>>
>> I have tried Alt+SysRq+r and tried to switch to another TTY but the
>> machine is always unresponsive, which indicates that it's a kernel
>> panic.
>>
>> I have had a similar issue to this about a year ago with the jme
>> driver and this was the fix:
>>
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/net/ethernet/jme.c?id=ee50c130c82175eaa0820c96b6d3763928af2241
>>
>> I haven't tried getting a kernel trace yet, but all seems to indicate
>> the problem is caused by r8169, at least til now.
>>
>> Any ideas, please?
>>
>> Thanks,
>> Diego
>
> Sorry, I forgot to mention, I'm on Arch Linux (x86_64), kernel 4.9.11-1-ARCH.
>
> Diego

This is still a problem with Linux 4.10.1.

Diego


Re: [PATCH 25/26] isdn: eicon: mark divascapi incompatible with kasan

2017-03-03 Thread Arnd Bergmann
On Fri, Mar 3, 2017 at 4:22 PM, Andrey Ryabinin  wrote:
> On 03/03/2017 05:54 PM, Arnd Bergmann wrote:
>> On Fri, Mar 3, 2017 at 3:20 PM, Andrey Ryabinin  
>> wrote:
>>> On 03/02/2017 07:38 PM, Arnd Bergmann wrote:
>>>
>>> This is kinda radical solution.
>>> Wouldn't be better to just increase -Wframe-larger-than for this driver 
>>> through Makefile?
>>
>> I thought about it too, and decided for disabling the driver entirely
>> since I suspected that
>> not only the per-function stack frame is overly large here but also
>> depth of the call chain,
>> which would then lead us to hiding an actual stack overflow.
>>
>
> No one complained so far ;)
> Disabling the driver like you did will throw it out from allmodconfig so it 
> will receive less compile-testing.

Good point, I'll add a driver specific flag then and leave it there.

>> Note that this driver is almost certainly broken, it hasn't seen any
>> updates other than
>> style and compile-warning fixes in 10 years and doesn't support any of
>> the hardware
>> introduced since 2002 (the company still makes PCIe ISDN adapters, but
>> the driver
>> only supports legacy PCI versions and older buses).
>
> Which means that it's unlikely that someone will run this driver with KASAN
> and trigger stack overflow (if it's really possible).

True.

   Arnd


[PATCH net-next RFC 2/4] virtio-net: transmit napi

2017-03-03 Thread Willem de Bruijn
From: Willem de Bruijn 

Convert virtio-net to a standard napi tx completion path. This enables
better TCP pacing using TCP small queues and increases single stream
throughput.

The virtio-net driver currently cleans tx descriptors on transmission
of new packets in ndo_start_xmit. Latency depends on new traffic, so
is unbounded. To avoid deadlock when a socket reaches its snd limit,
packets are orphaned on transmission. This breaks socket backpressure,
including TSQ.

Napi increases the number of interrupts generated compared to the
current model, which keeps interrupts disabled as long as the ring
has enough free descriptors. Keep tx napi optional for now. Follow-on
patches will reduce the interrupt cost.

Signed-off-by: Willem de Bruijn 
---
 drivers/net/virtio_net.c | 73 
 1 file changed, 61 insertions(+), 12 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 8c21e9a4adc7..9a9031640179 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -33,6 +33,8 @@
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
 
+static int napi_tx_weight = NAPI_POLL_WEIGHT;
+
 static bool csum = true, gso = true;
 module_param(csum, bool, 0444);
 module_param(gso, bool, 0444);
@@ -86,6 +88,8 @@ struct send_queue {
 
/* Name of the send queue: output.$index */
char name[40];
+
+   struct napi_struct napi;
 };
 
 /* Internal representation of a receive virtqueue */
@@ -262,12 +266,16 @@ static void virtqueue_napi_complete(struct napi_struct 
*napi,
 static void skb_xmit_done(struct virtqueue *vq)
 {
struct virtnet_info *vi = vq->vdev->priv;
+   struct napi_struct *napi = &vi->sq[vq2txq(vq)].napi;
 
/* Suppress further interrupts. */
virtqueue_disable_cb(vq);
 
-   /* We were probably waiting for more output buffers. */
-   netif_wake_subqueue(vi->dev, vq2txq(vq));
+   if (napi->weight)
+   virtqueue_napi_schedule(napi, vq);
+   else
+   /* We were probably waiting for more output buffers. */
+   netif_wake_subqueue(vi->dev, vq2txq(vq));
 }
 
 static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
@@ -961,6 +969,9 @@ static void skb_recv_done(struct virtqueue *rvq)
 
 static void virtnet_napi_enable(struct virtqueue *vq, struct napi_struct *napi)
 {
+   if (!napi->weight)
+   return;
+
napi_enable(napi);
 
/* If all buffers were filled by other side before we napi_enabled, we
@@ -1046,12 +1057,13 @@ static int virtnet_open(struct net_device *dev)
if (!try_fill_recv(vi, &vi->rq[i], GFP_KERNEL))
schedule_delayed_work(&vi->refill, 0);
virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
+   virtnet_napi_enable(vi->sq[i].vq, &vi->sq[i].napi);
}
 
return 0;
 }
 
-static void free_old_xmit_skbs(struct send_queue *sq)
+static unsigned int free_old_xmit_skbs(struct send_queue *sq, int budget)
 {
struct sk_buff *skb;
unsigned int len;
@@ -1060,7 +1072,8 @@ static void free_old_xmit_skbs(struct send_queue *sq)
unsigned int packets = 0;
unsigned int bytes = 0;
 
-   while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
+   while (packets < budget &&
+  (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
pr_debug("Sent skb %p\n", skb);
 
bytes += skb->len;
@@ -1073,12 +1086,35 @@ static void free_old_xmit_skbs(struct send_queue *sq)
 * happens when called speculatively from start_xmit.
 */
if (!packets)
-   return;
+   return 0;
 
u64_stats_update_begin(&stats->tx_syncp);
stats->tx_bytes += bytes;
stats->tx_packets += packets;
u64_stats_update_end(&stats->tx_syncp);
+
+   return packets;
+}
+
+static int virtnet_poll_tx(struct napi_struct *napi, int budget)
+{
+   struct send_queue *sq = container_of(napi, struct send_queue, napi);
+   struct virtnet_info *vi = sq->vq->vdev->priv;
+   struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
+   bool complete = false;
+
+   __netif_tx_lock(txq, smp_processor_id());
+   if (free_old_xmit_skbs(sq, budget) < budget)
+   complete = true;
+   __netif_tx_unlock(txq);
+
+   if (complete)
+   virtqueue_napi_complete(napi, sq->vq, 0);
+
+   if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
+   netif_wake_subqueue(vi->dev, vq2txq(sq->vq));
+
+   return complete ? 0 : budget;
 }
 
 static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
@@ -1130,9 +1166,11 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, 
struct net_device *dev)
int err;
struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
bool kick = !skb->xmit_more;
+   bool 
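
The budget handling in virtnet_poll_tx() can be modelled outside the
kernel. What follows is only a minimal user-space sketch, assuming a
hypothetical struct ring that counts completed descriptors in place of
the virtqueue; reclaim() and poll_tx() are illustrative names, not
functions from the driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a tx virtqueue: just a count of
 * descriptors the device has finished with. */
struct ring {
	unsigned int completed;	/* descriptors waiting to be reclaimed */
	bool irq_enabled;	/* models the virtqueue callback state */
};

/* Reclaim at most 'budget' completed descriptors, like the bounded
 * loop in free_old_xmit_skbs(). Returns how many were reclaimed. */
static unsigned int reclaim(struct ring *r, unsigned int budget)
{
	unsigned int packets = 0;

	while (packets < budget && r->completed > 0) {
		r->completed--;
		packets++;
	}
	return packets;
}

/* Poll handler in the style of the patch: reclaiming fewer than
 * 'budget' descriptors means the ring was drained, so polling stops
 * and the interrupt is re-armed. Otherwise the full budget is
 * reported so that the caller polls again. */
static int poll_tx(struct ring *r, int budget)
{
	bool complete;

	assert(budget > 0);
	complete = reclaim(r, (unsigned int)budget) < (unsigned int)budget;
	if (complete)
		r->irq_enabled = true;
	return complete ? 0 : budget;
}
```

With 100 completed descriptors and a budget of 64, the first poll_tx()
call returns 64 and leaves 36 pending; the second returns 0 and re-arms
the interrupt.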

Re: [Bug 194749] New: kernel bonding does not work in a network nameservice in versions above 3.10.0-229.20.1

2017-03-03 Thread Nicolas Dichtel
Le 02/03/2017 à 21:39, Dan Geist a écrit :
> - On Mar 2, 2017, at 3:11 PM, Cong Wang xiyou.wangc...@gmail.com wrote
> 
>> On Thu, Mar 2, 2017 at 10:32 AM, Stephen Hemminger
>>  wrote:
>>>
>>>
>>> Begin forwarded message:
>>>
>>> Date: Wed, 01 Mar 2017 21:08:01 +
>>> From: bugzilla-dae...@bugzilla.kernel.org
>>> To: step...@networkplumber.org
>>> Subject: [Bug 194749] New: kernel bonding does not work in a network 
>>> nameservice
>>> in versions above 3.10.0-229.20.1
>>>
>>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=194749
>>>
>>> Bug ID: 194749
>>>Summary: kernel bonding does not work in a network nameservice
>>> in versions above 3.10.0-229.20.1
>>>Product: Networking
>>>Version: 2.5
>>> Kernel Version: > 3.10.0-229.20.1
>>>   Hardware: x86-64
>>> OS: Linux
>>>   Tree: Mainline
>>> Status: NEW
>>>   Severity: blocking
>>>   Priority: P1
>>>  Component: Other
>>>   Assignee: step...@networkplumber.org
>>>   Reporter: d...@polter.net
>>> Regression: No
>>>
>>> bond interface is being used in active/standby mode with two physical NICs
>>> inside a network nameservice to provide switchpath redundancy.
>>>
>>> netns is instantiated post-boot with the following:
>>>
>>> ip netns add vntp
>>> ip link set p4p1 netns vntp
>>> ip link set p4p2 netns vntp
>>> ip link set bond0 netns vntp
>>> ip netns exec vntp ip link set lo up
>>> ip netns exec vntp ip link set p4p1 up
>>> ip netns exec vntp ip link set p4p2 up
>>> ip netns exec vntp ip link set bond0 up
>>> ip netns exec vntp ifenslave bond0 p4p1 p4p2
>>
>> This is due to the following commit:
>>
>> commit f9399814927ad9bb995a6e109c2a5f9d8a848209
>> Author: Weilong Chen 
>> Date:   Wed Jan 22 17:16:30 2014 +0800
>>
>>bonding: Don't allow bond devices to change network namespaces.
>>
>>Like bridge, bonding as netdevice doesn't cross netns boundaries.
>>
>>Bonding ports and bonding itself live in same netns.
>>
>>Signed-off-by: Weilong Chen 
>>Signed-off-by: David S. Miller 
>>
>>
>> NETIF_F_NETNS_LOCAL was introduced for loopback device which
>> is created for each netns, it is not clear why we need to add it to bond
>> and bridge...
> 
> Thank you for tracking this down. Without digging through the code to figure 
> it out, does this imply that the existence of a bond interface is not 
> possible AT ALL within a netns or simply that it may not be "migrated" 
> between the global scope and a netns?
It means that the migration is not possible. I think the only reason to have
this flag on bonding and bridge is the lack of test and fix. There is probably
some work to be done to have this feature. But are there real use cases of
x-netns bonding or x-netns bridge?


Re: net/dccp: use-after-free in dccp_feat_activate_values

2017-03-03 Thread Eric Dumazet
On Fri, Mar 3, 2017 at 7:12 AM, Dmitry Vyukov  wrote:
> The first bot that picked this up started spewing:
>
> BUG: spinlock recursion on CPU#1, syz-executor2/9452

Yes. The bug is not about locking the listener, but protecting fields
of struct dccp_request_sock

I will provide a patch, once I reach the office and after the breakfast ;)


[PATCH net 1/2] sfc: avoid max() in array size

2017-03-03 Thread Edward Cree
It confuses sparse, which thinks the size isn't constant.  Let's achieve
 the same thing with a BUILD_BUG_ON, since we know which one should be
 bigger and don't expect them ever to change.

Signed-off-by: Edward Cree 
---
 drivers/net/ethernet/sfc/ef10.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c
index 92e1c6d..4d88e85 100644
--- a/drivers/net/ethernet/sfc/ef10.c
+++ b/drivers/net/ethernet/sfc/ef10.c
@@ -828,9 +828,7 @@ static int efx_ef10_alloc_piobufs(struct efx_nic *efx, 
unsigned int n)
 static int efx_ef10_link_piobufs(struct efx_nic *efx)
 {
struct efx_ef10_nic_data *nic_data = efx->nic_data;
-   _MCDI_DECLARE_BUF(inbuf,
- max(MC_CMD_LINK_PIOBUF_IN_LEN,
- MC_CMD_UNLINK_PIOBUF_IN_LEN));
+   MCDI_DECLARE_BUF(inbuf, MC_CMD_LINK_PIOBUF_IN_LEN);
struct efx_channel *channel;
struct efx_tx_queue *tx_queue;
unsigned int offset, index;
@@ -839,8 +837,6 @@ static int efx_ef10_link_piobufs(struct efx_nic *efx)
BUILD_BUG_ON(MC_CMD_LINK_PIOBUF_OUT_LEN != 0);
BUILD_BUG_ON(MC_CMD_UNLINK_PIOBUF_OUT_LEN != 0);
 
-   memset(inbuf, 0, sizeof(inbuf));
-
/* Link a buffer to each VI in the write-combining mapping */
for (index = 0; index < nic_data->n_piobufs; ++index) {
MCDI_SET_DWORD(inbuf, LINK_PIOBUF_IN_PIOBUF_HANDLE,
@@ -920,6 +916,10 @@ static int efx_ef10_link_piobufs(struct efx_nic *efx)
return 0;
 
 fail:
+   /* inbuf was defined for MC_CMD_LINK_PIOBUF.  We can use the same
+* buffer for MC_CMD_UNLINK_PIOBUF because it's shorter.
+*/
+   BUILD_BUG_ON(MC_CMD_LINK_PIOBUF_IN_LEN < MC_CMD_UNLINK_PIOBUF_IN_LEN);
while (index--) {
MCDI_SET_DWORD(inbuf, UNLINK_PIOBUF_IN_TXQ_INSTANCE,
   nic_data->pio_write_vi_base + index);
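
The same idea can be tried in plain C11. This is a sketch under
assumptions: LINK_IN_LEN and UNLINK_IN_LEN are made-up lengths rather
than the real MC_CMD_* values, _Static_assert stands in for
BUILD_BUG_ON, and reuse_buffer() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical request lengths; the real MC_CMD_* values are not
 * reproduced here. */
#define LINK_IN_LEN   8
#define UNLINK_IN_LEN 4

/* C11 stand-in for BUILD_BUG_ON(): compilation fails if the shorter
 * request would not fit in a buffer sized for the longer one. */
_Static_assert(LINK_IN_LEN >= UNLINK_IN_LEN,
	       "buffer sized for LINK must also hold UNLINK");

/* Fill a buffer sized for the LINK request, then reuse its first
 * bytes for the UNLINK request. Copies the buffer to 'out' and
 * returns the number of bytes the UNLINK request occupies. */
static size_t reuse_buffer(unsigned char *out, size_t out_len)
{
	unsigned char inbuf[LINK_IN_LEN];	/* plain constant size */

	memset(inbuf, 0xAA, LINK_IN_LEN);	/* pretend LINK request */
	memset(inbuf, 0x55, UNLINK_IN_LEN);	/* pretend UNLINK request */

	assert(out_len >= LINK_IN_LEN);
	memcpy(out, inbuf, LINK_IN_LEN);
	return UNLINK_IN_LEN;
}
```

The array size is now a single constant, and the ordering assumption
between the two lengths is checked at compile time instead of being
hidden inside a max() expression.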



[PATCH net 2/2] sfc: fix IPID endianness in TSOv2

2017-03-03 Thread Edward Cree
The value we read from the header is in network byte order, whereas
 EFX_POPULATE_QWORD_* takes values in host byte order (which it then
 converts to little-endian, as MCDI is little-endian).

Fixes: e9117e5099ea ("sfc: Firmware-Assisted TSO version 2")
Signed-off-by: Edward Cree 
---
 drivers/net/ethernet/sfc/ef10.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c
index 4d88e85..c60c2d4 100644
--- a/drivers/net/ethernet/sfc/ef10.c
+++ b/drivers/net/ethernet/sfc/ef10.c
@@ -2183,7 +2183,7 @@ static int efx_ef10_tx_tso_desc(struct efx_tx_queue 
*tx_queue,
/* Modify IPv4 header if needed. */
ip->tot_len = 0;
ip->check = 0;
-   ipv4_id = ip->id;
+   ipv4_id = ntohs(ip->id);
} else {
/* Modify IPv6 header if needed. */
struct ipv6hdr *ipv6 = ipv6_hdr(skb);
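
The byte-order point can be checked in user space. A minimal sketch,
assuming a hypothetical populate_ipid() helper that, like the
EFX_POPULATE_QWORD_* macros described above, expects its argument in
host byte order; neither function below is part of the sfc driver.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical helper: place a 16-bit value, given in host byte
 * order, in the low bits of a descriptor word. */
static uint64_t populate_ipid(uint16_t host_order_id)
{
	return (uint64_t)host_order_id;
}

/* 'wire_id' is the IP ID exactly as stored in the packet header,
 * i.e. in network (big-endian) byte order. */
static uint64_t ipid_for_descriptor(uint16_t wire_id)
{
	/* Convert to host order first; on a little-endian host,
	 * skipping this step would hand over a byte-swapped ID. */
	return populate_ipid(ntohs(wire_id));
}
```

Passing htons(0x1234) to ipid_for_descriptor() yields 0x1234 on both
little-endian and big-endian hosts, which is the property the one-line
fix restores.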


Re: [PATCH 25/26] isdn: eicon: mark divascapi incompatible with kasan

2017-03-03 Thread Andrey Ryabinin


On 03/03/2017 05:54 PM, Arnd Bergmann wrote:
> On Fri, Mar 3, 2017 at 3:20 PM, Andrey Ryabinin  
> wrote:
>>
>>
>> On 03/02/2017 07:38 PM, Arnd Bergmann wrote:
>>> When CONFIG_KASAN is enabled, we have several functions that use rather
>>> large kernel stacks, e.g.
>>>
>>> drivers/isdn/hardware/eicon/message.c: In function 'group_optimization':
>>> drivers/isdn/hardware/eicon/message.c:14841:1: warning: the frame size of 
>>> 864 bytes is larger than 500 bytes [-Wframe-larger-than=]
>>> drivers/isdn/hardware/eicon/message.c: In function 'add_b1':
>>> drivers/isdn/hardware/eicon/message.c:7925:1: warning: the frame size of 
>>> 1008 bytes is larger than 500 bytes [-Wframe-larger-than=]
>>> drivers/isdn/hardware/eicon/message.c: In function 'add_b23':
>>> drivers/isdn/hardware/eicon/message.c:8551:1: warning: the frame size of 
>>> 928 bytes is larger than 500 bytes [-Wframe-larger-than=]
>>> drivers/isdn/hardware/eicon/message.c: In function 'sig_ind':
>>> drivers/isdn/hardware/eicon/message.c:6113:1: warning: the frame size of 
>>> 2112 bytes is larger than 500 bytes [-Wframe-larger-than=]
>>>
>>> To be on the safe side, and to enable a lower frame size warning limit, 
>>> let's
>>> just mark this driver as broken when KASAN is in use. I have tried to reduce
>>> the stack size as I did with dozens of other drivers, but failed to come up
>>> with a good solution for this one.
>>>
>>
>> This is kinda radical solution.
>> Wouldn't be better to just increase -Wframe-larger-than for this driver 
>> through Makefile?
> 
> I thought about it too, and decided for disabling the driver entirely
> since I suspected that
> not only the per-function stack frame is overly large here but also
> depth of the call chain,
> which would then lead us to hiding an actual stack overflow.
> 

No one complained so far ;)
Disabling the driver like you did will throw it out from allmodconfig so it 
will receive less compile-testing.


> Note that this driver is almost certainly broken, it hasn't seen any
> updates other than
> style and compile-warning fixes in 10 years and doesn't support any of
> the hardware
> introduced since 2002 (the company still makes PCIe ISDN adapters, but
> the driver
> only supports legacy PCI versions and older buses).

Which means that it's unlikely that someone will run this driver with KASAN and 
trigger stack overflow (if it's really possible).


[PATCH net 0/2] sfc: couple of fixes

2017-03-03 Thread Edward Cree
First patch addresses a construct that causes sparse to error out.
With that fixed, sparse makes some warnings on ef10.c, second patch
 fixes one of them.

Edward Cree (2):
  sfc: avoid max() in array size
  sfc: fix IPID endianness in TSOv2

 drivers/net/ethernet/sfc/ef10.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)



Re: net/dccp: use-after-free in dccp_feat_activate_values

2017-03-03 Thread Eric Dumazet
On Fri, 2017-03-03 at 16:06 +0100, Dmitry Vyukov wrote:

> Something that compiles is definitely better :)
> Reapplied.

Just to be clear : This is not the proper patch. This only reduces the
race.

bh_lock_sock() does not prevent a user process from owning the socket.

We need another protection, probably RCU based, or another spinlock
protecting the fields needed at SYNACK generation.
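
Why bh_lock_sock() alone is not enough can be seen in a simplified,
single-threaded model of the two-level socket lock. This is only a
sketch: the model_* names are invented for illustration, loosely
following the split between the spinlock and the "owned" flag, and it
does not reproduce the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of a socket lock with two levels. */
struct model_sock {
	bool slock_held;	/* the spinlock taken by bh_lock_sock() */
	bool owned;		/* set while a process holds lock_sock() */
};

/* Process context: take the spinlock briefly, mark the socket as
 * owned, then drop the spinlock again. */
static void model_lock_sock(struct model_sock *sk)
{
	assert(!sk->slock_held);
	sk->slock_held = true;
	sk->owned = true;
	sk->slock_held = false;
}

static void model_release_sock(struct model_sock *sk)
{
	sk->owned = false;
}

/* Softirq context: only the spinlock is taken. Returns true if the
 * spinlock was acquired, which says nothing about 'owned'. */
static bool model_bh_lock_sock(struct model_sock *sk)
{
	if (sk->slock_held)
		return false;
	sk->slock_held = true;
	return true;
}

static void model_bh_unlock_sock(struct model_sock *sk)
{
	sk->slock_held = false;
}

/* True when softirq code holding the spinlock may still be racing
 * with a process that owns the socket. */
static bool model_races_with_user(const struct model_sock *sk)
{
	return sk->slock_held && sk->owned;
}
```

In this model, after model_lock_sock() a later model_bh_lock_sock()
still succeeds, and model_races_with_user() reports true: the spinlock
is held, yet a process owns the socket at the same time.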





Re: net/dccp: use-after-free in dccp_feat_activate_values

2017-03-03 Thread Eric Dumazet
On Fri, 2017-03-03 at 06:32 -0800, Eric Dumazet wrote:
> On Fri, 2017-03-03 at 15:11 +0100, Dmitry Vyukov wrote:
> > On Mon, Feb 13, 2017 at 11:29 PM, Cong Wang  
> > wrote:
> > > On Mon, Feb 13, 2017 at 11:19 AM, Andrey Konovalov
> > >  wrote:
> > >> Hi,
> > >>
> > >> I've got the following error report while fuzzing the kernel with 
> > >> syzkaller.
> > >>
> > >> On commit 926af6273fc683cd98cd0ce7bf0d04a02eed6742.
> > >>
> > >> A reproducer and .config are attached.
> > >> Note, that it takes quite some time to trigger the bug (up to 10 
> > >> minutes).
> > >>
> > >> BUG: KASAN: use-after-free in dccp_feat_activate_values+0x967/0xab0
> > >> net/dccp/feat.c:1541 at addr 88003713be68
> > >> Read of size 8 by task syz-executor2/8457
> > >> CPU: 2 PID: 8457 Comm: syz-executor2 Not tainted 4.10.0-rc7+ #127
> > >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 
> > >> 01/01/2011
> > >> Call Trace:
> > >>  
> > >>  __dump_stack lib/dump_stack.c:15 [inline]
> > >>  dump_stack+0x292/0x398 lib/dump_stack.c:51
> > >>  kasan_object_err+0x1c/0x70 mm/kasan/report.c:162
> > >>  print_address_description mm/kasan/report.c:200 [inline]
> > >>  kasan_report_error mm/kasan/report.c:289 [inline]
> > >>  kasan_report.part.1+0x20e/0x4e0 mm/kasan/report.c:311
> > >>  kasan_report mm/kasan/report.c:332 [inline]
> > >>  __asan_report_load8_noabort+0x29/0x30 mm/kasan/report.c:332
> > >>  dccp_feat_activate_values+0x967/0xab0 net/dccp/feat.c:1541
> > >>  dccp_create_openreq_child+0x464/0x610 net/dccp/minisocks.c:121
> > >>  dccp_v6_request_recv_sock+0x1f6/0x1960 net/dccp/ipv6.c:457
> > >>  dccp_check_req+0x335/0x5a0 net/dccp/minisocks.c:186
> > >>  dccp_v6_rcv+0x69e/0x1d00 net/dccp/ipv6.c:711
> > >>  ip6_input_finish+0x46d/0x17a0 net/ipv6/ip6_input.c:279
> > >>  NF_HOOK include/linux/netfilter.h:257 [inline]
> > >>  ip6_input+0xdb/0x590 net/ipv6/ip6_input.c:322
> > >>  dst_input include/net/dst.h:507 [inline]
> > >>  ip6_rcv_finish+0x289/0x890 net/ipv6/ip6_input.c:69
> > >>  NF_HOOK include/linux/netfilter.h:257 [inline]
> > >>  ipv6_rcv+0x12ec/0x23d0 net/ipv6/ip6_input.c:203
> > >>  __netif_receive_skb_core+0x1ae5/0x3400 net/core/dev.c:4190
> > >>  __netif_receive_skb+0x2a/0x170 net/core/dev.c:4228
> > >>  process_backlog+0xe5/0x6c0 net/core/dev.c:4839
> > >>  napi_poll net/core/dev.c:5202 [inline]
> > >>  net_rx_action+0xe70/0x1900 net/core/dev.c:5267
> > >>  __do_softirq+0x2fb/0xb7d kernel/softirq.c:284
> > >>  do_softirq_own_stack+0x1c/0x30 arch/x86/entry/entry_64.S:902
> > >
> > >
> > > Seems there is a race condition between iterating dccp_feat_entry
> > > and freeing it, bh_lock_sock() seems not held in this path.
> > 
> > 
> > 
> > Cong, where exactly do we need to add bh_lock_sock()?
> > 
> > I am still seeing this on 4977ab6e92e267afe9d8f78438c3db330ca8434c
> 
> 
> I would try :

Or something that would compile. I will take a deeper look after my
commute.

diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 
409d0cfd34474812c3bf74f26cd423a3d65ee441..56f883b301ccd610fc24efeac4fb47d3c2f95ecf
 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -482,7 +482,11 @@ static int dccp_v4_send_response(const struct sock *sk, 
struct request_sock *req
if (dst == NULL)
goto out;
 
+   /* DCCP is not ready yet for lockless SYN processing */
+   bh_lock_sock((struct sock *)sk);
skb = dccp_make_response(sk, dst, req);
+   bh_unlock_sock((struct sock *)sk);
+
if (skb != NULL) {
const struct inet_request_sock *ireq = inet_rsk(req);
struct dccp_hdr *dh = dccp_hdr(skb);
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 
233b57367758c64c09ed40f7359cb8fcb1918d93..673f45f85b7c755c8165c6274ffb6b1fe5660683
 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -214,7 +214,11 @@ static int dccp_v6_send_response(const struct sock *sk, 
struct request_sock *req
goto done;
}
 
+   /* DCCP is not ready yet for lockless SYN processing */
+   bh_lock_sock((struct sock *)sk);
skb = dccp_make_response(sk, dst, req);
+   bh_unlock_sock((struct sock *)sk);
+
if (skb != NULL) {
struct dccp_hdr *dh = dccp_hdr(skb);
struct ipv6_txoptions *opt;




Re: net/dccp: use-after-free in dccp_feat_activate_values

2017-03-03 Thread Dmitry Vyukov
On Fri, Mar 3, 2017 at 3:32 PM, Eric Dumazet  wrote:
>> > On Mon, Feb 13, 2017 at 11:19 AM, Andrey Konovalov
>> >  wrote:
>> >> Hi,
>> >>
>> >> I've got the following error report while fuzzing the kernel with 
>> >> syzkaller.
>> >>
>> >> On commit 926af6273fc683cd98cd0ce7bf0d04a02eed6742.
>> >>
>> >> A reproducer and .config are attached.
>> >> Note, that it takes quite some time to trigger the bug (up to 10 minutes).
>> >>
>> >> BUG: KASAN: use-after-free in dccp_feat_activate_values+0x967/0xab0
>> >> net/dccp/feat.c:1541 at addr 88003713be68
>> >> Read of size 8 by task syz-executor2/8457
>> >> CPU: 2 PID: 8457 Comm: syz-executor2 Not tainted 4.10.0-rc7+ #127
>> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 
>> >> 01/01/2011
>> >> Call Trace:
>> >>  
>> >>  __dump_stack lib/dump_stack.c:15 [inline]
>> >>  dump_stack+0x292/0x398 lib/dump_stack.c:51
>> >>  kasan_object_err+0x1c/0x70 mm/kasan/report.c:162
>> >>  print_address_description mm/kasan/report.c:200 [inline]
>> >>  kasan_report_error mm/kasan/report.c:289 [inline]
>> >>  kasan_report.part.1+0x20e/0x4e0 mm/kasan/report.c:311
>> >>  kasan_report mm/kasan/report.c:332 [inline]
>> >>  __asan_report_load8_noabort+0x29/0x30 mm/kasan/report.c:332
>> >>  dccp_feat_activate_values+0x967/0xab0 net/dccp/feat.c:1541
>> >>  dccp_create_openreq_child+0x464/0x610 net/dccp/minisocks.c:121
>> >>  dccp_v6_request_recv_sock+0x1f6/0x1960 net/dccp/ipv6.c:457
>> >>  dccp_check_req+0x335/0x5a0 net/dccp/minisocks.c:186
>> >>  dccp_v6_rcv+0x69e/0x1d00 net/dccp/ipv6.c:711
>> >>  ip6_input_finish+0x46d/0x17a0 net/ipv6/ip6_input.c:279
>> >>  NF_HOOK include/linux/netfilter.h:257 [inline]
>> >>  ip6_input+0xdb/0x590 net/ipv6/ip6_input.c:322
>> >>  dst_input include/net/dst.h:507 [inline]
>> >>  ip6_rcv_finish+0x289/0x890 net/ipv6/ip6_input.c:69
>> >>  NF_HOOK include/linux/netfilter.h:257 [inline]
>> >>  ipv6_rcv+0x12ec/0x23d0 net/ipv6/ip6_input.c:203
>> >>  __netif_receive_skb_core+0x1ae5/0x3400 net/core/dev.c:4190
>> >>  __netif_receive_skb+0x2a/0x170 net/core/dev.c:4228
>> >>  process_backlog+0xe5/0x6c0 net/core/dev.c:4839
>> >>  napi_poll net/core/dev.c:5202 [inline]
>> >>  net_rx_action+0xe70/0x1900 net/core/dev.c:5267
>> >>  __do_softirq+0x2fb/0xb7d kernel/softirq.c:284
>> >>  do_softirq_own_stack+0x1c/0x30 arch/x86/entry/entry_64.S:902
>> >
>> >
>> > Seems there is a race condition between iterating dccp_feat_entry
>> > and freeing it, bh_lock_sock() seems not held in this path.
>>
>>
>>
>> Cong, where exactly do we need to add bh_lock_sock()?
>>
>> I am still seeing this on 4977ab6e92e267afe9d8f78438c3db330ca8434c
>
>
> I would try :
>
> diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> index 
> 409d0cfd34474812c3bf74f26cd423a3d65ee441..5a8b5ac5edaaf35428ab04cc810d98310bd169ed
>  100644
> --- a/net/dccp/ipv4.c
> +++ b/net/dccp/ipv4.c
> @@ -482,7 +482,9 @@ static int dccp_v4_send_response(const struct sock *sk, 
> struct request_sock *req
> if (dst == NULL)
> goto out;
>
> +   bh_lock_sock(sk);
> skb = dccp_make_response(sk, dst, req);
> +   bh_unlock_sock(sk);
> if (skb != NULL) {
> const struct inet_request_sock *ireq = inet_rsk(req);
> struct dccp_hdr *dh = dccp_hdr(skb);
> diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> index 
> 233b57367758c64c09ed40f7359cb8fcb1918d93..e89cc88d14c22d411a91afab093e209fcbb816d8
>  100644
> --- a/net/dccp/ipv6.c
> +++ b/net/dccp/ipv6.c
> @@ -214,7 +214,9 @@ static int dccp_v6_send_response(const struct sock *sk, 
> struct request_sock *req
> goto done;
> }
>
> +   bh_lock_sock(sk);
> skb = dccp_make_response(sk, dst, req);
> +   bh_unlock_sock(sk);
> if (skb != NULL) {
> struct dccp_hdr *dh = dccp_hdr(skb);
> struct ipv6_txoptions *opt;


Applied on bots. Thanks!


Re: net/dccp: use-after-free in dccp_feat_activate_values

2017-03-03 Thread Dmitry Vyukov
On Fri, Mar 3, 2017 at 4:06 PM, Dmitry Vyukov  wrote:
> On Fri, Mar 3, 2017 at 3:48 PM, Eric Dumazet  wrote:
>> On Fri, 2017-03-03 at 06:32 -0800, Eric Dumazet wrote:
>>> On Fri, 2017-03-03 at 15:11 +0100, Dmitry Vyukov wrote:
>>> > On Mon, Feb 13, 2017 at 11:29 PM, Cong Wang  
>>> > wrote:
>>> > > On Mon, Feb 13, 2017 at 11:19 AM, Andrey Konovalov
>>> > >  wrote:
>>> > >> Hi,
>>> > >>
>>> > >> I've got the following error report while fuzzing the kernel with 
>>> > >> syzkaller.
>>> > >>
>>> > >> On commit 926af6273fc683cd98cd0ce7bf0d04a02eed6742.
>>> > >>
>>> > >> A reproducer and .config are attached.
>>> > >> Note, that it takes quite some time to trigger the bug (up to 10 
>>> > >> minutes).
>>> > >>
>>> > >> BUG: KASAN: use-after-free in dccp_feat_activate_values+0x967/0xab0
>>> > >> net/dccp/feat.c:1541 at addr 88003713be68
>>> > >> Read of size 8 by task syz-executor2/8457
>>> > >> CPU: 2 PID: 8457 Comm: syz-executor2 Not tainted 4.10.0-rc7+ #127
>>> > >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 
>>> > >> 01/01/2011
>>> > >> Call Trace:
>>> > >>  
>>> > >>  __dump_stack lib/dump_stack.c:15 [inline]
>>> > >>  dump_stack+0x292/0x398 lib/dump_stack.c:51
>>> > >>  kasan_object_err+0x1c/0x70 mm/kasan/report.c:162
>>> > >>  print_address_description mm/kasan/report.c:200 [inline]
>>> > >>  kasan_report_error mm/kasan/report.c:289 [inline]
>>> > >>  kasan_report.part.1+0x20e/0x4e0 mm/kasan/report.c:311
>>> > >>  kasan_report mm/kasan/report.c:332 [inline]
>>> > >>  __asan_report_load8_noabort+0x29/0x30 mm/kasan/report.c:332
>>> > >>  dccp_feat_activate_values+0x967/0xab0 net/dccp/feat.c:1541
>>> > >>  dccp_create_openreq_child+0x464/0x610 net/dccp/minisocks.c:121
>>> > >>  dccp_v6_request_recv_sock+0x1f6/0x1960 net/dccp/ipv6.c:457
>>> > >>  dccp_check_req+0x335/0x5a0 net/dccp/minisocks.c:186
>>> > >>  dccp_v6_rcv+0x69e/0x1d00 net/dccp/ipv6.c:711
>>> > >>  ip6_input_finish+0x46d/0x17a0 net/ipv6/ip6_input.c:279
>>> > >>  NF_HOOK include/linux/netfilter.h:257 [inline]
>>> > >>  ip6_input+0xdb/0x590 net/ipv6/ip6_input.c:322
>>> > >>  dst_input include/net/dst.h:507 [inline]
>>> > >>  ip6_rcv_finish+0x289/0x890 net/ipv6/ip6_input.c:69
>>> > >>  NF_HOOK include/linux/netfilter.h:257 [inline]
>>> > >>  ipv6_rcv+0x12ec/0x23d0 net/ipv6/ip6_input.c:203
>>> > >>  __netif_receive_skb_core+0x1ae5/0x3400 net/core/dev.c:4190
>>> > >>  __netif_receive_skb+0x2a/0x170 net/core/dev.c:4228
>>> > >>  process_backlog+0xe5/0x6c0 net/core/dev.c:4839
>>> > >>  napi_poll net/core/dev.c:5202 [inline]
>>> > >>  net_rx_action+0xe70/0x1900 net/core/dev.c:5267
>>> > >>  __do_softirq+0x2fb/0xb7d kernel/softirq.c:284
>>> > >>  do_softirq_own_stack+0x1c/0x30 arch/x86/entry/entry_64.S:902
>>> > >
>>> > >
>>> > > Seems there is a race condition between iterating dccp_feat_entry
>>> > > and freeing it, bh_lock_sock() seems not held in this path.
>>> >
>>> >
>>> >
>>> > Cong, where exactly do we need to add bh_lock_sock()?
>>> >
>>> > I am still seeing this on 4977ab6e92e267afe9d8f78438c3db330ca8434c
>>>
>>>
>>> I would try :
>>
>> Or something that would compile. I will take a deeper look after my
>> commute.
>
>
> Something that compiles is definitely better :)
> Reapplied.
>
>
>> diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
>> index 
>> 409d0cfd34474812c3bf74f26cd423a3d65ee441..56f883b301ccd610fc24efeac4fb47d3c2f95ecf
>>  100644
>> --- a/net/dccp/ipv4.c
>> +++ b/net/dccp/ipv4.c
>> @@ -482,7 +482,11 @@ static int dccp_v4_send_response(const struct sock *sk, 
>> struct request_sock *req
>> if (dst == NULL)
>> goto out;
>>
>> +   /* DCCP is not ready yet for lockless SYN processing */
>> +   bh_lock_sock((struct sock *)sk);
>> skb = dccp_make_response(sk, dst, req);
>> +   bh_unlock_sock((struct sock *)sk);
>> +
>> if (skb != NULL) {
>> const struct inet_request_sock *ireq = inet_rsk(req);
>> struct dccp_hdr *dh = dccp_hdr(skb);
>> diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
>> index 
>> 233b57367758c64c09ed40f7359cb8fcb1918d93..673f45f85b7c755c8165c6274ffb6b1fe5660683
>>  100644
>> --- a/net/dccp/ipv6.c
>> +++ b/net/dccp/ipv6.c
>> @@ -214,7 +214,11 @@ static int dccp_v6_send_response(const struct sock *sk, 
>> struct request_sock *req
>> goto done;
>> }
>>
>> +   /* DCCP is not ready yet for lockless SYN processing */
>> +   bh_lock_sock((struct sock *)sk);
>> skb = dccp_make_response(sk, dst, req);
>> +   bh_unlock_sock((struct sock *)sk);
>> +
>> if (skb != NULL) {
>> struct dccp_hdr *dh = dccp_hdr(skb);
>> struct ipv6_txoptions *opt;
>>
>>


The first bot that picked this up started spewing:

BUG: spinlock recursion on CPU#1, syz-executor2/9452
 lock: 0x8801cd09abc8, .magic: dead4ead, .owner:
syz-executor2/9452, .owner_cpu: 1
CPU: 1 PID: 9452 

Re: [PATCH 01/26] compiler: introduce noinline_for_kasan annotation

2017-03-03 Thread Arnd Bergmann
On Fri, Mar 3, 2017 at 3:33 PM, Alexander Potapenko  wrote:
> On Fri, Mar 3, 2017 at 3:30 PM, Arnd Bergmann  wrote:
>> On Fri, Mar 3, 2017 at 2:55 PM, Alexander Potapenko  
>> wrote:
>>
>> Would KMSAN also force local variables to be non-overlapping the way that
>> asan-stack=1 and -fsanitize-address-use-after-scope do? As I understood it,
>> KMSAN would add extra code for maintaining the uninit bits, but in an example
>> like this
> The thing is that KMSAN (and other tools that insert heavyweight
> instrumentation) may cause heavy register spilling which will also
> blow up the stack frames.

In that case, I would expect a mostly distinct set of functions to have large
stack frames with KMSAN, compared to the ones that need
noinline_for_kasan. In most cases I patched, the called inline function is
actually trivial, but invoked many times from the same caller.

 Arnd
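
For readers following along, the pattern described above (a trivial inline
helper invoked many times from one caller) can be sketched roughly as below.
This is only an illustration under stated assumptions: the macro body is a
hypothetical stand-in for the noinline_for_kasan annotation named in the
subject (its real definition is not shown in this thread), SKETCH_KASAN is a
made-up switch, and read_le32()/sum_fields() are invented names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the annotation from the patch subject;
 * the real kernel definition is not quoted in this thread. */
#ifdef SKETCH_KASAN
#define noinline_for_kasan __attribute__((noinline, unused))
#else
#define noinline_for_kasan inline
#endif

/* Trivial helper: its only stack cost is one small address-taken local. */
static noinline_for_kasan uint32_t read_le32(const uint8_t *buf, unsigned int idx)
{
	uint8_t tmp[4];	/* candidate for a separate, redzoned slot per inlined copy */

	assert(idx < 4);
	memcpy(tmp, buf + 4 * idx, sizeof(tmp));
	return tmp[0] | (tmp[1] << 8) | (tmp[2] << 16) | ((uint32_t)tmp[3] << 24);
}

/* One caller, many call sites: if every call is inlined and the locals may
 * not overlap, the caller's frame grows with the number of call sites. */
static uint32_t sum_fields(const uint8_t buf[16])
{
	return read_le32(buf, 0) + read_le32(buf, 1) +
	       read_le32(buf, 2) + read_le32(buf, 3);
}
```

Building with -DSKETCH_KASAN keeps tmp[] in the helper's own frame, so the
caller's frame no longer depends on how many times the helper is called.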


net: heap out-of-bounds in fib6_clean_node/rt6_fill_node/fib6_age/fib6_prune_clone

2017-03-03 Thread Dmitry Vyukov
Hello,

I am getting heap out-of-bounds reports in
fib6_clean_node/rt6_fill_node/fib6_age/fib6_prune_clone while running
the syzkaller fuzzer on 86292b33d4b79ee03e2f43ea0381ef85f077c760. They
all follow the same pattern: an object of size 216 is allocated from
the ip_dst_cache slab and then accessed at offset 272/276 within
fib6_walk. This looks like type confusion. Unfortunately, it is not
reproducible.
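
As a rough cross-check of the offsets, the arithmetic can be redone from the
two addresses printed in the report below ("at addr" and "Object at"). This is
a minimal sketch, not kernel code: struct oob_info and oob_offset() are
hypothetical names, and the addresses are used exactly as printed.

```c
#include <assert.h>
#include <stdint.h>

struct oob_info {
	uint64_t offset;	/* access address relative to the object start */
	uint64_t past_end;	/* how far past the object's end the access starts */
};

/* Compute where an access lands relative to a slab object of 'size' bytes. */
static struct oob_info oob_offset(uint64_t object, uint64_t size, uint64_t access)
{
	struct oob_info info;

	assert(access >= object);
	info.offset = access - object;
	info.past_end = info.offset >= size ? info.offset - size : 0;
	return info;
}
```

With object 0x88004b864400, size 216 and access 0x88004b864514, the offset is
0x114 = 276, which is 60 bytes past the end of the object.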

==
BUG: KASAN: slab-out-of-bounds in rt6_dump_route+0x293/0x2f0
net/ipv6/route.c:3547 at addr 88004b864514
Read of size 4 by task syz-executor7/25042
CPU: 0 PID: 25042 Comm: syz-executor7 Not tainted 4.10.0+ #234
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:15 [inline]
 dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
 kasan_object_err+0x1c/0x70 mm/kasan/report.c:166
 print_address_description mm/kasan/report.c:204 [inline]
 kasan_report_error mm/kasan/report.c:288 [inline]
 kasan_report.part.2+0x198/0x440 mm/kasan/report.c:310
 kasan_report mm/kasan/report.c:330 [inline]
 __asan_report_load4_noabort+0x29/0x30 mm/kasan/report.c:330
 rt6_dump_route+0x293/0x2f0 net/ipv6/route.c:3547
 fib6_dump_node+0x101/0x1a0 net/ipv6/ip6_fib.c:315
 fib6_walk_continue+0x4b3/0x620 net/ipv6/ip6_fib.c:1576
 fib6_walk+0x1cf/0x300 net/ipv6/ip6_fib.c:1621
 fib6_dump_table net/ipv6/ip6_fib.c:374 [inline]
 inet6_dump_fib+0x832/0xea0 net/ipv6/ip6_fib.c:447
 rtnl_dump_all+0x8a/0x2a0 net/core/rtnetlink.c:2776
 netlink_dump+0x54d/0xd40 net/netlink/af_netlink.c:2127
 __netlink_dump_start+0x4e5/0x760 net/netlink/af_netlink.c:2217
 netlink_dump_start include/linux/netlink.h:165 [inline]
 rtnetlink_rcv_msg+0x4a3/0x860 net/core/rtnetlink.c:4094
 netlink_rcv_skb+0x2ab/0x390 net/netlink/af_netlink.c:2298
 rtnetlink_rcv+0x2a/0x40 net/core/rtnetlink.c:4110
 netlink_unicast_kernel net/netlink/af_netlink.c:1231 [inline]
 netlink_unicast+0x514/0x730 net/netlink/af_netlink.c:1257
 netlink_sendmsg+0xa9f/0xe50 net/netlink/af_netlink.c:1803
 sock_sendmsg_nosec net/socket.c:633 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:643
 sock_write_iter+0x326/0x600 net/socket.c:846
 new_sync_write fs/read_write.c:499 [inline]
 __vfs_write+0x483/0x740 fs/read_write.c:512
 vfs_write+0x187/0x530 fs/read_write.c:560
 SYSC_write fs/read_write.c:607 [inline]
 SyS_write+0xfb/0x230 fs/read_write.c:599
 entry_SYSCALL_64_fastpath+0x1f/0xc2
RIP: 0033:0x4458d9
RSP: 002b:7fe10102bb58 EFLAGS: 0292 ORIG_RAX: 0001
RAX: ffda RBX: 0006 RCX: 004458d9
RDX: 001f RSI: 20691000 RDI: 0006
RBP: 006e2fc0 R08:  R09: 
R10:  R11: 0292 R12: 00708000
R13: 209e1ff7 R14: 0001 R15: fffd
Object at 88004b864400, in cache ip_dst_cache size: 216
Allocated:
PID = 21976
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
 save_stack+0x43/0xd0 mm/kasan/kasan.c:502
 set_track mm/kasan/kasan.c:514 [inline]
 kasan_kmalloc+0xaa/0xd0 mm/kasan/kasan.c:605
 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:544
 kmem_cache_alloc+0x102/0x680 mm/slab.c:3571
 dst_alloc+0x11b/0x1a0 net/core/dst.c:209
 rt_dst_alloc+0xf0/0x580 net/ipv4/route.c:1482
 __mkroute_output net/ipv4/route.c:2163 [inline]
 __ip_route_output_key_hash+0xce3/0x2c70 net/ipv4/route.c:2373
 __ip_route_output_key include/net/route.h:122 [inline]
 ip_route_output_flow+0x29/0xa0 net/ipv4/route.c:2459
 ip_route_output_key include/net/route.h:132 [inline]
 sctp_v4_get_dst+0x5d2/0x1570 net/sctp/protocol.c:454
 sctp_transport_route+0xa8/0x420 net/sctp/transport.c:292
 sctp_assoc_add_peer+0x5a5/0x1470 net/sctp/associola.c:653
 sctp_sendmsg+0x1800/0x3970 net/sctp/socket.c:1870
 inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:761
 sock_sendmsg_nosec net/socket.c:633 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:643
 ___sys_sendmsg+0x4a3/0x9f0 net/socket.c:1985
 __sys_sendmmsg+0x25c/0x750 net/socket.c:2075
 SYSC_sendmmsg net/socket.c:2106 [inline]
 SyS_sendmmsg+0x35/0x60 net/socket.c:2101
 entry_SYSCALL_64_fastpath+0x1f/0xc2
Freed:
PID = 15058
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
 save_stack+0x43/0xd0 mm/kasan/kasan.c:502
 set_track mm/kasan/kasan.c:514 [inline]
 kasan_slab_free+0x6f/0xb0 mm/kasan/kasan.c:578
 __cache_free mm/slab.c:3513 [inline]
 kmem_cache_free+0x71/0x240 mm/slab.c:3773
 dst_destroy+0x1fd/0x330 net/core/dst.c:269
 dst_free include/net/dst.h:428 [inline]
 rt_fibinfo_free_cpus net/ipv4/fib_semantics.c:198 [inline]
 free_fib_info_rcu+0x399/0x590 net/ipv4/fib_semantics.c:213
 __rcu_reclaim kernel/rcu/rcu.h:118 [inline]
 rcu_do_batch.isra.67+0xa31/0xe50 kernel/rcu/tree.c:2877
 invoke_rcu_callbacks kernel/rcu/tree.c:3140 [inline]
 __rcu_process_callbacks kernel/rcu/tree.c:3107 [inline]
 rcu_process_callbacks+0x45b/0xc50 kernel/rcu/tree.c:3124
 __do_softirq+0x31f/0xbe7 kernel/softirq.c:284
Memory state 

Re: [PATCH 26/26] kasan: rework Kconfig settings

2017-03-03 Thread Andrey Ryabinin


On 03/02/2017 07:38 PM, Arnd Bergmann wrote:

> 
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 97d62c2da6c2..27c838c40a36 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -216,10 +216,9 @@ config ENABLE_MUST_CHECK
>  config FRAME_WARN
>   int "Warn for stack frames larger than (needs gcc 4.4)"
>   range 0 8192
> - default 0 if KASAN
> - default 2048 if GCC_PLUGIN_LATENT_ENTROPY
> + default 3072 if KASAN_EXTRA
>   default 1024 if !64BIT
> - default 2048 if 64BIT
> + default 1280 if 64BIT

This looks unrelated. Also, it means that we now get 1280 with KASAN=y &&
KASAN_EXTRA=n. Judging from the changelog, I assume this hunk slipped in here
from the follow-up series.
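
To make the effect of the hunk concrete: Kconfig uses the first "default"
whose condition holds, so the old and new lists can be modelled roughly as
below. This is only a sketch in plain C; struct cfg and both function names
are invented for illustration, and the assertion that KASAN_EXTRA implies
KASAN is an assumption about the new option, not something quoted here.

```c
#include <assert.h>
#include <stdbool.h>

struct cfg {
	bool kasan, kasan_extra, latent_entropy, is_64bit;
};

/* FRAME_WARN defaults before the hunk: first matching line wins. */
static int frame_warn_old(const struct cfg *c)
{
	if (c->kasan)
		return 0;
	if (c->latent_entropy)
		return 2048;
	if (!c->is_64bit)
		return 1024;
	return 2048;
}

/* FRAME_WARN defaults after the hunk. */
static int frame_warn_new(const struct cfg *c)
{
	/* Assumption: KASAN_EXTRA can only be set when KASAN is set. */
	assert(!c->kasan_extra || c->kasan);

	if (c->kasan_extra)
		return 3072;
	if (!c->is_64bit)
		return 1024;
	return 1280;
}
```

For a 64-bit build with KASAN=y and KASAN_EXTRA=n, the old list yields 0
(warning disabled) and the new list yields 1280.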

>   help
> Tell gcc to warn at build time for stack frames larger than this.
> Setting this too low will cause a lot of warnings.
> @@ -499,7 +498,7 @@ config DEBUG_OBJECTS_ENABLE_DEFAULT
>  
>  config DEBUG_SLAB
>   bool "Debug slab memory allocations"
> - depends on DEBUG_KERNEL && SLAB && !KMEMCHECK
> + depends on DEBUG_KERNEL && SLAB && !KMEMCHECK && !KASAN
>   help
> Say Y here to have the kernel do limited verification on memory
> allocation as well as poisoning memory on free to catch use of freed
> @@ -511,7 +510,7 @@ config DEBUG_SLAB_LEAK
>  
>  config SLUB_DEBUG_ON
>   bool "SLUB debugging on by default"
> - depends on SLUB && SLUB_DEBUG && !KMEMCHECK
> + depends on SLUB && SLUB_DEBUG && !KMEMCHECK && !KASAN

Why? SLUB_DEBUG_ON works with KASAN.

>   default n
>   help
> Boot with debugging on by default. SLUB boots by default with


