Re: [PATCH v6 19/36] nds32: VDSO support

2018-02-05 Thread Vincent Chen
2018-01-18 18:28 GMT+08:00 Arnd Bergmann :
> On Mon, Jan 15, 2018 at 6:53 AM, Greentime Hu  wrote:
>> From: Greentime Hu 
>>
>> This patch adds VDSO support. The VDSO code is currently used for
>> sys_rt_sigreturn() and optimised gettimeofday() (using the SoC timer 
>> counter).
>>
>> Signed-off-by: Vincent Chen 
>> Signed-off-by: Greentime Hu 
>
> Acked-by: Arnd Bergmann 

Dear Arnd Bergmann:

We find a small bug here which make LTP 20170929 clock_getres01
fail. The bug is in __vdso_clock_getres() function. When argument res
is NULL, -EFAULT error code is returned now. But, the returned
value is 0 for SyS_clock_getres under the same conditions.
Therefore, testcase thinks it is a bug.

I will modify the code as below and add it in the next version patch
if you think it is OK.

@@ -209,7 +209,7 @@ static notrace int clock_getres_fallback( ...
 {
 if (res == NULL)
-return -EFAULT;
+return 0;


Thanks

Vincnet


Re: [PATCH iproute2-next 00/10] RDMA resource tracking

2018-02-05 Thread Leon Romanovsky
On Mon, Feb 05, 2018 at 05:25:53PM -0800, Stephen Hemminger wrote:
> On Wed, 31 Jan 2018 10:11:46 +0200
> Leon Romanovsky  wrote:
>
> > Changelog:
> >  v2 -> v3:
> >* Rebased to commit: 1e24e773f144 ("Merge branch 'iproute2-master' into 
> > iproute2-next")
> >* Refreshed include/uapi/rdma/rdma_netlink.h file
> >* Fixed bug, where cxgb4 was printed twice.
> >  v1 -> v2;
> >* Added checks for all occurrences of strdup failures and added patch
> >  with fix of already merged code.
> >* Sync with latest kernel code.
> >* Rewrote table representation to be similar to "ip route" output.
> >* Implemented string filters.
> >* Removed curr/max representation from the summary output.
> >  v0 -> v1:
> >* Fixed subject title in patch #1: rdam -> rdma.
> >* Added newline between variable declaration and the code.
> >* Add check to failure in strdup() call in rd_check_is_string_filtered().
> >* Rewrote res_qp_parse_cb() to avoid long lines and extra indentation.
> >
> > 
> >
> > David, Stephen,
> >
> > The kernel code is accepted to the RDMA and will be sent to Linus in
> > this merge window, and this is refreshed version of user space part.
> >
> > Because, I found bug in handling cxgb4 devices and we cleaned header
> > file a little bit more, it was more wise to resend the series instead
> > of applying v2.
> >
> > https://patchwork.ozlabs.org/project/netdev/list/?series=RDMA+resource+tracking&submitter=68852&state=8&q=&archive=&delegate=
> >
> > Thanks
> >
> > Cc: RDMA mailing list 
> > Cc: Steve Wise 
> >
> > [1] https://www.spinics.net/lists/linux-rdma/msg59535.html
> >
> > Leon Romanovsky (10):
> >   rdma: Add option to provide "-" sign for the port number
> >   rdma: Make visible the number of arguments
> >   rdma: Add filtering infrastructure
> >   rdma: Set pointer to device name position
> >   rdma: Allow external usage of compare string routine
> >   rdma: Update kernel header file
> >   rdma: Add resource tracking summary
> >   rdma: Add QP resource tracking information
> >   rdma: Document resource tracking
> >   rdma: Check return value of strdup call
> >
> >  include/uapi/rdma/rdma_netlink.h |  67 +-
> >  man/man8/rdma-resource.8 |  86 +++
> >  rdma/Makefile|   2 +-
> >  rdma/link.c  |   2 +-
> >  rdma/rdma.c  |   4 +-
> >  rdma/rdma.h  |  28 ++-
> >  rdma/res.c   | 486 
> > +++
> >  rdma/utils.c | 321 +-
> >  8 files changed, 971 insertions(+), 25 deletions(-)
> >  create mode 100644 man/man8/rdma-resource.8
> >  create mode 100644 rdma/res.c
> >
> > --
> > 2.16.1
> >
>
> Applied these to the master branch.
> Dropped the kernel header update, picked that up through the other update 
> process.

Thanks a lot


signature.asc
Description: PGP signature


Re: [PATCH v6 20/36] nds32: Signal handling support

2018-02-05 Thread Vincent Chen
Thanks, I got it.

After referring to arm64 and risc-v, we try to refine our code, such as
removing unneeded checking and refining syscall restart flow. We
hope these modifications can enhance the reliability and readability.
However, the following 2 files which you had acked are included in
this modification.
1. arch/nds32/include/asm/nds32.h
(patch: Assembly macros and definitions)
 The definition of  macro tbl and why are removed.
   - Now, we use pt_reg->syscallno instead of 'why' to determine
 whether entering kernel is via syscall or not. Therefore, macro
 'why' is unneeded.

--- a/arch/nds32/include/asm/nds32.h
+++ b/arch/nds32/include/asm/nds32.h
@@ -66,10 +66,6 @@ static inline unsigned long CACHE_LINE_SIZE

 #endif /* __ASSEMBLY__ */

-/* tbl and why is used in ex-scall.S and ex-exit.S */
-#define tbl $r8
-#define why $r8
-
 #define IVB_BASE   PHYS_OFFSET


2. arch/nds32/kernel/ex-scall.S(patch: System calls handling)
a. Define macro tbl
- The marco tbl is used only in this file. So, I move its definition
  from arch/nds32/include/asm/nds32.h to here.
b. Remove 'set why = 0'  when issuing syscall number is invalid
c. Adjust input arguments of syscall_trace_enter

--- a/arch/nds32/kernel/ex-scall.S
+++ b/arch/nds32/kernel/ex-scall.S
...
+#define tbl $r8

 /*
  * $r7 will be writen as syscall nr
- * by retrieving from $ITYPE 'SWID' bitfiled
  */
.macro  get_scno
lwi $r7, [$sp + R15_OFFSET]
@@ -54,7 +49,6 @@ ENTRY(eh_syscall)
get_scno
gie_enable

-ENTRY(eh_syscall_phase_2)
lwi $p0, [tsk+#TSK_TI_FLAGS]

andi$p1, $p0, #_TIF_WORK_SYSCALL_ENTRY
@@ -71,7 +65,6 @@ jmp_systbl:
jr  $p1 ! no return

 _SCNO_EXCEED:
-   moviwhy, 0
ori $r0, $r7, #0
ori$r1, $sp, #0
b   bad_syscall
@@ -81,8 +74,7 @@ _SCNO_EXCEED:
  * context switches, and waiting for our parent to respond.
  */
 __sys_trace:
-   move$r1, $sp
-   move$r0, $r7! trace entry [IP = 0]
+   move$r0, $sp
bal syscall_trace_enter
move$r7, $r0
la  $lp, __sys_trace_return ! return address


If you think these modifications in acked files are not permitted,
we will recover it.

We verify all modifications by LTP 2017 related cases and glibc
2.26 testsuite. We plan to add it in the next version patch and
hope you can give us some comments as before.

Thanks

Vincent



2018-01-24 19:13 GMT+08:00 Arnd Bergmann :
> On Wed, Jan 24, 2018 at 1:56 AM, Vincent Chen  wrote:
>> 2018-01-18 18:30 GMT+08:00 Arnd Bergmann :
>>> On Mon, Jan 15, 2018 at 6:53 AM, Greentime Hu  wrote:
 From: Greentime Hu 

 This patch adds support for signal handling.

 Signed-off-by: Vincent Chen 
 Signed-off-by: Greentime Hu 
>>>
>>> I never feel qualified enough to properly review signal handling code, so
>>> no Ack from me for this code even though I don't see anything wrong with it.
>>> Hopefully someone else can give an Ack after looking more closely.
>>>
>>
>> Dear Arnd:
>>
>> We'd be glad to improve signal handling code to meet your requirement.
>> Could you
>> tell us which part we need to refine or which implementation is good
>> for us to refer?
>
> No, as I said, the problem is on my side, I just don't understand enough of 
> it.
> I would assume that the arm64 and risc-v implementations are the most
> thoroughly reviewed, but haven't looked at those in enough detail either.
> If your code does something that risc-v doesn't do, try to understand whether
> there should be a difference or not.
>
>   Arnd


Re: [PATCH net v3] sctp: fix dst refcnt leak in sctp_v4_get_dst

2018-02-05 Thread Xin Long
On Tue, Feb 6, 2018 at 7:20 AM, David Ahern  wrote:
> On 2/5/18 12:48 PM, Tommi Rantala wrote:
>> Fix dst reference count leak in sctp_v4_get_dst() introduced in commit
>> 410f03831 ("sctp: add routing output fallback"):
>>
>> When walking the address_list, successive ip_route_output_key() calls
>> may return the same rt->dst with the reference incremented on each call.
>>
>> The code would not decrement the dst refcount when the dst pointer was
>> identical from the previous iteration, causing the dst refcnt leak.
>>
> ...
>>   ...
>>
>> Fixes: 410f03831 ("sctp: add routing output fallback")
>> Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary 
>> addresses")
>
> any syzbot references for this bug?
In Dmitry Vyukov mail, there was no syzbot reference provided.
Not sure if there's another way to tell syzbot.

> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3] RDS: IB: Fix null pointer issue

2018-02-05 Thread Guanglei Li
Scenario:
1. Port down and do fail over
2. Ap do rds_bind syscall

PID: 47039  TASK: 89887e2fe640  CPU: 47  COMMAND: "kworker/u:6"
 #0 [898e35f159f0] machine_kexec at 8103abf9
 #1 [898e35f15a60] crash_kexec at 810b96e3
 #2 [898e35f15b30] oops_end at 8150f518
 #3 [898e35f15b60] no_context at 8104854c
 #4 [898e35f15ba0] __bad_area_nosemaphore at 81048675
 #5 [898e35f15bf0] bad_area_nosemaphore at 810487d3
 #6 [898e35f15c00] do_page_fault at 815120b8
 #7 [898e35f15d10] page_fault at 8150ea95
[exception RIP: unknown or invalid address]
RIP:   RSP: 898e35f15dc8  RFLAGS: 00010282
RAX: fffe  RBX: 889b77f6fc00  RCX:81c99d88
RDX:   RSI: 896019ee08e8  RDI:889b77f6fc00
RBP: 898e35f15df0   R8: 896019ee08c8  R9:
R10: 0400  R11:   R12:896019ee08c0
R13: 889b77f6fe68  R14: 81c99d80  R15: a022a1e0
ORIG_RAX:   CS: 0010 SS: 0018
 #8 [898e35f15dc8] cma_ndev_work_handler at a022a228 [rdma_cm]
 #9 [898e35f15df8] process_one_work at 8108a7c6
 #10 [898e35f15e58] worker_thread at 8108bda0
 #11 [898e35f15ee8] kthread at 81090fe6

PID: 45659  TASK: 880d313d2500  CPU: 31  COMMAND: "oracle_45659_ap"
 #0 [881024ccfc98] __schedule at 8150bac4
 #1 [881024ccfd40] schedule at 8150c2cf
 #2 [881024ccfd50] __mutex_lock_slowpath at 8150cee7
 #3 [881024ccfdc0] mutex_lock at 8150cdeb
 #4 [881024ccfde0] rdma_destroy_id at a022a027 [rdma_cm]
 #5 [881024ccfe10] rds_ib_laddr_check at a0357857 [rds_rdma]
 #6 [881024ccfe50] rds_trans_get_preferred at a0324c2a [rds]
 #7 [881024ccfe80] rds_bind at a031d690 [rds]
 #8 [881024ccfeb0] sys_bind at 8142a670

PID: 45659  PID: 47039
rds_ib_laddr_check
  /* create id_priv with a null event_handler */
  rdma_create_id
  rdma_bind_addr
cma_acquire_dev
  /* add id_priv to cma_dev->id_list */
  cma_attach_to_dev
cma_ndev_work_handler
  /* event_hanlder is null */
  id_priv->id.event_handler

Signed-off-by: Guanglei Li 
Signed-off-by: Honglei Wang 
Reviewed-by: Junxiao Bi 
Reviewed-by: Yanjun Zhu 
Reviewed-by: Leon Romanovsky 
Acked-by: Santosh Shilimkar 
Acked-by: Doug Ledford 
---
 net/rds/ib.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index b2a5067..ff0c980 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -345,7 +345,8 @@ static int rds_ib_laddr_check(struct net *net, __be32 addr)
/* Create a CMA ID and try to bind it. This catches both
 * IB and iWARP capable NICs.
 */
-   cm_id = rdma_create_id(&init_net, NULL, NULL, RDMA_PS_TCP, IB_QPT_RC);
+   cm_id = rdma_create_id(&init_net, rds_rdma_cm_event_handler,
+  NULL, RDMA_PS_TCP, IB_QPT_RC);
if (IS_ERR(cm_id))
return PTR_ERR(cm_id);
 
-- 
2.7.4



Re: [PATCH, net] ibmvnic: fix empty firmware version and errors cleanup

2018-02-05 Thread David Miller
From: Desnes Augusto Nunes do Rosario 
Date: Mon,  5 Feb 2018 14:33:55 -0200

> This patch makes sure that the firmware version is never NULL. Moreover,
> it also performs some cleanup on the error messages.
> 
> Fixes: a107311d7fdf ("ibmvnic: fix firmware version when no firmware level
> has been provided by the VIOS server")
> Signed-off-by: Desnes A. Nunes do Rosario 

Applied, thanks.


Re: [PATCH] sctp: fix dst refcnt leak in sctp_v6_get_dst()

2018-02-05 Thread David Miller
From: Alexey Kodanev 
Date: Mon,  5 Feb 2018 15:10:35 +0300

> When going through the bind address list in sctp_v6_get_dst() and
> the previously found address is better ('matchlen > bmatchlen'),
> the code continues to the next iteration without releasing currently
> held destination.
> 
> Fix it by releasing 'bdst' before continue to the next iteration, and
> instead of introducing one more '!IS_ERR(bdst)' check for dst_release(),
> move the already existed one right after ip6_dst_lookup_flow(), i.e. we
> shouldn't proceed further if we get an error for the route lookup.
> 
> Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary 
> addresses for ipv6")
> Signed-off-by: Alexey Kodanev 

Applied and queued up for -stable, thank you.


Re: [PATCH net v3] sctp: fix dst refcnt leak in sctp_v4_get_dst

2018-02-05 Thread David Miller
From: Tommi Rantala 
Date: Mon,  5 Feb 2018 21:48:14 +0200

> Fix dst reference count leak in sctp_v4_get_dst() introduced in commit
> 410f03831 ("sctp: add routing output fallback"):
> 
> When walking the address_list, successive ip_route_output_key() calls
> may return the same rt->dst with the reference incremented on each call.
> 
> The code would not decrement the dst refcount when the dst pointer was
> identical from the previous iteration, causing the dst refcnt leak.
 ...
> Fixes: 410f03831 ("sctp: add routing output fallback")
> Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary 
> addresses")
> Signed-off-by: Tommi Rantala 

Applied and queued up for -stable, thanks.


Re: Please apply these tiny, 4-month-old patches.

2018-02-05 Thread Michael Witten
On Tue, 6 Feb 2018 02:42:17 +0100, Andrew Lunn wrote:

>>> Please learn how the community works, and how to interact with
>>> developers and maintainers in that community appropriately.
>>
>> I already tried that.
>>
>> If you're unwilling to be an effective maintainer, then please hand
>> off the responsibiilty to someone else.
>
> Could i suggest you read:
>
> https://www.kernel.org/doc/Documentation/networking/netdev-FAQ.txt
>
> And in particular, the bit about netdev being closed.
>
> I suggest you wait a week for netdev to open, and then submit the
> patches again. Actual patches, which cleanly apply to net-next.

Thank you kindly for the suggestion.

However, I'm fully versed on the scripture.

I'm glad to report that the patches apply cleanly to `net-next'; the
actual patches are, of course, still available in my previous emails.
Also, as already described, they can be easily fetched from GitHub.

Sincerely,
Michael Witten


Re: Please apply these tiny, 4-month-old patches.

2018-02-05 Thread Andrew Lunn
> > Please learn how the community works, and how to interact with
> > developers and maintainers in that community appropriately.
> 
> I already tried that.
> 
> If you're unwilling to be an effective maintainer, then please hand
> off the responsibiilty to someone else.

Could i suggest you read:

https://www.kernel.org/doc/Documentation/networking/netdev-FAQ.txt

And in particular, the bit about netdev being closed.

I suggest you wait a week for netdev to open, and then submit the
patches again. Actual patches, which cleanly apply to net-next.

  Andrew


Re: Please apply these tiny, 4-month-old patches.

2018-02-05 Thread Michael Witten
On Mon, 05 Feb 2018 20:12:11 -0500 (EST), David Miller wrote:

>> If this is considered "new" code (it isn't) and if this email is
>> received outside of an appropriate merge window, then save this
>> email for later consideration---this isn't a real-time conversation;
>> this is email, so it doesn't matter when you receive it.
>
> Sorry, things don't work that way.
>
> You must submit your changes at the appropriate time.
>
> Please learn how the community works, and how to interact with
> developers and maintainers in that community appropriately.

I already tried that.

If you're unwilling to be an effective maintainer, then please hand
off the responsibiilty to someone else.

Sincerely,
Michael Witten


Re: [PATCH iproute2-next 00/10] RDMA resource tracking

2018-02-05 Thread Stephen Hemminger
On Wed, 31 Jan 2018 10:11:46 +0200
Leon Romanovsky  wrote:

> Changelog:
>  v2 -> v3:
>* Rebased to commit: 1e24e773f144 ("Merge branch 'iproute2-master' into 
> iproute2-next")
>* Refreshed include/uapi/rdma/rdma_netlink.h file
>* Fixed bug, where cxgb4 was printed twice.
>  v1 -> v2;
>* Added checks for all occurrences of strdup failures and added patch
>  with fix of already merged code.
>* Sync with latest kernel code.
>* Rewrote table representation to be similar to "ip route" output.
>* Implemented string filters.
>* Removed curr/max representation from the summary output.
>  v0 -> v1:
>* Fixed subject title in patch #1: rdam -> rdma.
>* Added newline between variable declaration and the code.
>* Add check to failure in strdup() call in rd_check_is_string_filtered().
>* Rewrote res_qp_parse_cb() to avoid long lines and extra indentation.
> 
> 
> 
> David, Stephen,
> 
> The kernel code is accepted to the RDMA and will be sent to Linus in
> this merge window, and this is refreshed version of user space part.
> 
> Because, I found bug in handling cxgb4 devices and we cleaned header
> file a little bit more, it was more wise to resend the series instead
> of applying v2.
> 
> https://patchwork.ozlabs.org/project/netdev/list/?series=RDMA+resource+tracking&submitter=68852&state=8&q=&archive=&delegate=
> 
> Thanks
> 
> Cc: RDMA mailing list 
> Cc: Steve Wise 
> 
> [1] https://www.spinics.net/lists/linux-rdma/msg59535.html
> 
> Leon Romanovsky (10):
>   rdma: Add option to provide "-" sign for the port number
>   rdma: Make visible the number of arguments
>   rdma: Add filtering infrastructure
>   rdma: Set pointer to device name position
>   rdma: Allow external usage of compare string routine
>   rdma: Update kernel header file
>   rdma: Add resource tracking summary
>   rdma: Add QP resource tracking information
>   rdma: Document resource tracking
>   rdma: Check return value of strdup call
> 
>  include/uapi/rdma/rdma_netlink.h |  67 +-
>  man/man8/rdma-resource.8 |  86 +++
>  rdma/Makefile|   2 +-
>  rdma/link.c  |   2 +-
>  rdma/rdma.c  |   4 +-
>  rdma/rdma.h  |  28 ++-
>  rdma/res.c   | 486 
> +++
>  rdma/utils.c | 321 +-
>  8 files changed, 971 insertions(+), 25 deletions(-)
>  create mode 100644 man/man8/rdma-resource.8
>  create mode 100644 rdma/res.c
> 
> --
> 2.16.1
> 

Applied these to the master branch.
Dropped the kernel header update, picked that up through the other update 
process.


Re: [PATCH net v3] sctp: fix dst refcnt leak in sctp_v4_get_dst

2018-02-05 Thread Neil Horman
On Mon, Feb 05, 2018 at 09:48:14PM +0200, Tommi Rantala wrote:
> Fix dst reference count leak in sctp_v4_get_dst() introduced in commit
> 410f03831 ("sctp: add routing output fallback"):
> 
> When walking the address_list, successive ip_route_output_key() calls
> may return the same rt->dst with the reference incremented on each call.
> 
> The code would not decrement the dst refcount when the dst pointer was
> identical from the previous iteration, causing the dst refcnt leak.
> 
> Testcase:
>   ip netns add TEST
>   ip netns exec TEST ip link set lo up
>   ip link add dummy0 type dummy
>   ip link add dummy1 type dummy
>   ip link add dummy2 type dummy
>   ip link set dev dummy0 netns TEST
>   ip link set dev dummy1 netns TEST
>   ip link set dev dummy2 netns TEST
>   ip netns exec TEST ip addr add 192.168.1.1/24 dev dummy0
>   ip netns exec TEST ip link set dummy0 up
>   ip netns exec TEST ip addr add 192.168.1.2/24 dev dummy1
>   ip netns exec TEST ip link set dummy1 up
>   ip netns exec TEST ip addr add 192.168.1.3/24 dev dummy2
>   ip netns exec TEST ip link set dummy2 up
>   ip netns exec TEST sctp_test -H 192.168.1.2 -P 20002 -h 192.168.1.1 -p 
> 2 -s -B 192.168.1.3
>   ip netns del TEST
> 
> In 4.4 and 4.9 kernels this results to:
>   [  354.179591] unregister_netdevice: waiting for lo to become free. Usage 
> count = 1
>   [  364.419674] unregister_netdevice: waiting for lo to become free. Usage 
> count = 1
>   [  374.663664] unregister_netdevice: waiting for lo to become free. Usage 
> count = 1
>   [  384.903717] unregister_netdevice: waiting for lo to become free. Usage 
> count = 1
>   [  395.143724] unregister_netdevice: waiting for lo to become free. Usage 
> count = 1
>   [  405.383645] unregister_netdevice: waiting for lo to become free. Usage 
> count = 1
>   ...
> 
> Fixes: 410f03831 ("sctp: add routing output fallback")
> Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary 
> addresses")
> Signed-off-by: Tommi Rantala 
> ---
> 
> v2: unconditional dst_release() before breaking out of the loop:
>  - if dst == NULL, then dst_release() is no-op.
>  - if dst != &rt->dst, then drop reference to the first dst, we do not
>need it after all.
>  - if dst == &rt->dst, then the refcnt was incremented twice for this
>dst, so drop the second ref.
> 
>  net/sctp/protocol.c | 10 --
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
> index ba28d270eb24..aba2381d3cf9 100644
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -509,22 +509,20 @@ static void sctp_v4_get_dst(struct sctp_transport *t, 
> union sctp_addr *saddr,
>   if (IS_ERR(rt))
>   continue;
>  
> - if (!dst)
> - dst = &rt->dst;
> -
>   /* Ensure the src address belongs to the output
>* interface.
>*/
>   odev = __ip_dev_find(sock_net(sk), laddr->a.v4.sin_addr.s_addr,
>false);
>   if (!odev || odev->ifindex != fl4->flowi4_oif) {
> - if (&rt->dst != dst)
> + if (!dst)
> + dst = &rt->dst;
> + else
>   dst_release(&rt->dst);
>   continue;
>   }
>  
> - if (dst != &rt->dst)
> - dst_release(dst);
> + dst_release(dst);
>   dst = &rt->dst;
>   break;
>   }
> -- 
> 2.15.1
> 
> 
Acked-by: Neil Horman 



Re: Please apply these tiny, 4-month-old patches.

2018-02-05 Thread David Miller
From: Michael Witten 
Date: Tue, 06 Feb 2018 00:54:35 -

> If this is considered "new" code (it isn't) and if this email is
> received outside of an appropriate merge window, then save this
> email for later consideration---this isn't a real-time conversation;
> this is email, so it doesn't matter when you receive it.

Sorry, things don't work that way.

You must submit your changes at the appropriate time.

Please learn how the community works, and how to interact with
developers and maintainers in that community appropriately.

Thank you.


Please apply these tiny, 4-month-old patches.

2018-02-05 Thread Michael Witten
Strictly speaking, these patches streamline the code at both 
compile-time and run-time, and seem to face no technical
objection of note.

The strongest objection was a dubious *potential* refactoring of
similar code, a refactoring which is clearly vaporware, and which
I myself would have tried to complete if only this initial foray
had been applied.

If this is considered "new" code (it isn't) and if this email is
received outside of an appropriate merge window, then save this
email for later consideration---this isn't a real-time conversation;
this is email, so it doesn't matter when you receive it.

Sincerely,
Michael Witten

On Sun, 01 Oct 2017 22:19:02 -, Michael Witten wrote:

> The following patch series is an ad hoc "cleanup" that I made
> while perusing the code (I'm not well versed in this code, so I
> would not be surprised if there were objections to the changes):
>
>   [1] net: __sock_cmsg_send(): Remove unused parameter `msg'
>   [2] net: inet_recvmsg(): Remove unnecessary bitwise operation
>   [3] net: skb_queue_purge(): lock/unlock the queue only once
>
> Each patch will be sent as an individual reply to this email;
> the total diff is appended below for your convenience.
>
> You may also fetch these patches from GitHub:
>
>   git checkout --detach 5969d1bb3082b41eba8fd2c826559abe38ccb6df
>   git pull https://github.com/mfwitten/linux.git net/tcp-ip/01-cleanup/02
>
> Overall:
>
>   include/net/sock.h |  2 +-
>   net/core/skbuff.c  | 26 ++
>   net/core/sock.c|  4 ++--
>   net/ipv4/af_inet.c |  2 +-
>   net/ipv4/ip_sockglue.c |  2 +-
>   net/ipv6/datagram.c|  2 +-
>   6 files changed, 24 insertions(+), 14 deletions(-)
>
> Sincerly,
> Michael Witten
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 03a362568357..83373d7148a9 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1562,7 +1562,7 @@ struct sockcm_cookie {
>   u16 tsflags;
>  };
>  
> -int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr 
> *cmsg,
> +int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
>struct sockcm_cookie *sockc);
>  int sock_cmsg_send(struct sock *sk, struct msghdr *msg,
>  struct sockcm_cookie *sockc);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 9b7b6bbb2a23..425e03fe1c56 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2091,7 +2091,7 @@ struct sk_buff *sock_alloc_send_skb(struct sock *sk, 
> unsigned long size,
>  }
>  EXPORT_SYMBOL(sock_alloc_send_skb);
>  
> -int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr 
> *cmsg,
> +int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
>struct sockcm_cookie *sockc)
>  {
>   u32 tsflags;
> @@ -2137,7 +2137,7 @@ int sock_cmsg_send(struct sock *sk, struct msghdr *msg,
>   return -EINVAL;
>   if (cmsg->cmsg_level != SOL_SOCKET)
>   continue;
> - ret = __sock_cmsg_send(sk, msg, cmsg, sockc);
> + ret = __sock_cmsg_send(sk, cmsg, sockc);
>   if (ret)
>   return ret;
>   }
> diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
> index e558e4f9597b..c79b7822b0b9 100644
> --- a/net/ipv4/ip_sockglue.c
> +++ b/net/ipv4/ip_sockglue.c
> @@ -263,7 +263,7 @@ int ip_cmsg_send(struct sock *sk, struct msghdr *msg, 
> struct ipcm_cookie *ipc,
>   }
>  #endif
>   if (cmsg->cmsg_level == SOL_SOCKET) {
> - err = __sock_cmsg_send(sk, msg, cmsg, &ipc->sockc);
> + err = __sock_cmsg_send(sk, cmsg, &ipc->sockc);
>   if (err)
>   return err;
>   continue;
> diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
> index a1f918713006..1d1926a4cbe2 100644
> --- a/net/ipv6/datagram.c
> +++ b/net/ipv6/datagram.c
> @@ -756,7 +756,7 @@ int ip6_datagram_send_ctl(struct net *net, struct sock 
> *sk,
>   }
>  
>   if (cmsg->cmsg_level == SOL_SOCKET) {
> - err = __sock_cmsg_send(sk, msg, cmsg, sockc);
> + err = __sock_cmsg_send(sk, cmsg, sockc);
>   if (err)
>   return err;
>   continue;
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index e31108e5ef79..2dbed042a412 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -791,7 +791,7 @@ int inet_recvmsg(struct socket *sock, struct msghdr *msg, 
> size_t size,
>   sock_rps_record_flow(sk);
>  
>   err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT,
> -flags & ~MSG_DONTWAIT, &addr_len);
> +flags, &addr_len);
>   if (err >= 0)
>   msg->msg_namelen = addr_len;
>   return err;
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 68065d7

Re: [PATCH] ss: introduce switch to print exact value of data rates

2018-02-05 Thread Stephen Hemminger
On Fri, 2 Feb 2018 16:32:47 -0700
David Ahern  wrote:

> On 2/1/18 7:19 AM, Tomasz Torcz wrote:
> >   Introduce -X/--exact switch to disable human-friendly printing
> >  of datarates. With the switch, data is not presented as MBps/Kbps.
> > 
> >   Signed-off-by: Tomasz Torcz 
> > ---
> >  misc/ss.c | 12 ++--
> >  1 file changed, 10 insertions(+), 2 deletions(-)
> > 
> > diff --git a/misc/ss.c b/misc/ss.c
> > index 29a25070..5ca5112a 100644
> > --- a/misc/ss.c
> > +++ b/misc/ss.c
> > @@ -95,6 +95,7 @@ int resolve_services = 1;
> >  int preferred_family = AF_UNSPEC;
> >  int show_options;
> >  int show_details;
> > +int show_exact;  
> 
> show_exact suggests the other versions are not exact. show_raw? or
> 
> int human_readable = 1;
> 
> and -X disables it?

agree with david, think of another flag name (raw? or no-prefix?)


[PATCH bpf] tools/bpf: fix batch-mode test failure of test_xdp_redirect.sh

2018-02-05 Thread Yonghong Song
The tests at tools/testing/selftests/bpf can run in patch mode, e.g.,
make -C tools/testing/selftests/bpf run_tests

With the batch mode, I experimented intermittent test failure of
test_xdp_redirect.sh.

selftests: test_xdp_redirect [PASS]
selftests: test_xdp_redirect.sh [PASS]
RTNETLINK answers: File exists
selftests: test_xdp_meta [FAILED]
selftests: test_xdp_meta.sh [FAIL]


The following illustrates what caused the failure:
 (1). test_xdp_redirect creates veth pairs (veth1,veth11) and
  (veth2,veth22), and assign veth11 and veth22 to namespace
  ns1 and ns2 respectively.
 (2). at the end of test_xdp_redirect test, ns1 and ns2 are
  deleted. During this process, the deletion of actual
  namespace resources, including deletion of veth1{1} and veth2{2},
  is put into a workqueue to be processed asynchronously.
 (3). test_xdp_meta tries to create veth pair (veth1, veth2).
  The previous veth deletions in step (2) have not finished yet,
  and veth1 or veth2 may be still valid in the kernel, thus
  causing the failure.

The fix is to explicitly delete the veth pair before test_xdp_redirect
exits. Only one end of veth needs deletion as the kernel will delete
the other end automatically. Also test_xdp_meta is also fixed in
similar manner to avoid future potential issues.

Fixes: 996139e801fd ("selftests: bpf: add a test for XDP redirect")
Fixes: 22c8852624fc ("bpf: improve selftests and add tests for meta pointer")
Signed-off-by: Yonghong Song 
---
 tools/testing/selftests/bpf/test_xdp_meta.sh | 1 +
 tools/testing/selftests/bpf/test_xdp_redirect.sh | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_xdp_meta.sh 
b/tools/testing/selftests/bpf/test_xdp_meta.sh
index 307aa85..637fcf4 100755
--- a/tools/testing/selftests/bpf/test_xdp_meta.sh
+++ b/tools/testing/selftests/bpf/test_xdp_meta.sh
@@ -9,6 +9,7 @@ cleanup()
fi
 
set +e
+   ip link del veth1 2> /dev/null
ip netns del ns1 2> /dev/null
ip netns del ns2 2> /dev/null
 }
diff --git a/tools/testing/selftests/bpf/test_xdp_redirect.sh 
b/tools/testing/selftests/bpf/test_xdp_redirect.sh
index 344a365..c4b17e0 100755
--- a/tools/testing/selftests/bpf/test_xdp_redirect.sh
+++ b/tools/testing/selftests/bpf/test_xdp_redirect.sh
@@ -19,6 +19,8 @@ cleanup()
fi
 
set +e
+   ip link del veth1 2> /dev/null
+   ip link del veth2 2> /dev/null
ip netns del ns1 2> /dev/null
ip netns del ns2 2> /dev/null
 }
-- 
2.9.5



[PATCH net] nfp: fix kdoc warnings on nested structures

2018-02-05 Thread Jakub Kicinski
Commit 84ce5b987783 ("scripts: kernel-doc: improve nested logic to
handle multiple identifiers") improved the handling of nested structure
definitions in scripts/kernel-doc, and changed the expected format of
documentation.  This causes new warnings to appear on W=1 builds.

Only comment changes.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
---
 drivers/net/ethernet/netronome/nfp/bpf/main.h  | 24 ++--
 .../ethernet/netronome/nfp/flower/tunnel_conf.c| 10 ++---
 drivers/net/ethernet/netronome/nfp/nfp_net.h   |  6 ++-
 .../net/ethernet/netronome/nfp/nfpcore/nfp_nsp.h   | 43 +++---
 .../ethernet/netronome/nfp/nfpcore/nfp_resource.c  | 21 ++-
 5 files changed, 55 insertions(+), 49 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h 
b/drivers/net/ethernet/netronome/nfp/bpf/main.h
index 424fe8338105..054df3dc0698 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
@@ -112,22 +112,22 @@ enum pkt_vec {
  * @map_elems_in_use:  number of elements allocated to offloaded maps
  *
  * @adjust_head:   adjust head capability
- * @flags: extra flags for adjust head
- * @off_min:   minimal packet offset within buffer required
- * @off_max:   maximum packet offset within buffer required
- * @guaranteed_sub:amount of negative adjustment guaranteed possible
- * @guaranteed_add:amount of positive adjustment guaranteed possible
+ * @adjust_head.flags: extra flags for adjust head
+ * @adjust_head.off_min:   minimal packet offset within buffer required
+ * @adjust_head.off_max:   maximum packet offset within buffer required
+ * @adjust_head.guaranteed_sub:negative adjustment guaranteed possible
+ * @adjust_head.guaranteed_add:positive adjustment guaranteed possible
  *
  * @maps:  map capability
- * @types: supported map types
- * @max_maps:  max number of maps supported
- * @max_elems: max number of entries in each map
- * @max_key_sz:max size of map key
- * @max_val_sz:max size of map value
- * @max_elem_sz:   max size of map entry (key + value)
+ * @maps.types:supported map types
+ * @maps.max_maps: max number of maps supported
+ * @maps.max_elems:max number of entries in each map
+ * @maps.max_key_sz:   max size of map key
+ * @maps.max_val_sz:   max size of map value
+ * @maps.max_elem_sz:  max size of map entry (key + value)
  *
  * @helpers:   helper addressess for various calls
- * @map_lookup:map lookup helper address
+ * @helpers.map_lookup:map lookup helper address
  */
 struct nfp_app_bpf {
struct nfp_app *app;
diff --git a/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c 
b/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
index b03f22f29612..ec524d97869d 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
@@ -50,9 +50,9 @@
  * @seq:   sequence number of the message
  * @count: number of tunnels report in message
  * @flags: options part of the request
- * @ipv4:  dest IPv4 address of active route
- * @egress_port:   port the encapsulated packet egressed
- * @extra: reserved for future use
+ * @tun_info.ipv4: dest IPv4 address of active route
+ * @tun_info.egress_port:  port the encapsulated packet egressed
+ * @tun_info.extra:reserved for future use
  * @tun_info:  tunnels that have sent traffic in reported period
  */
 struct nfp_tun_active_tuns {
@@ -132,8 +132,8 @@ struct nfp_ipv4_addr_entry {
  * struct nfp_tun_mac_addr - configure MAC address of tunnel EP on NFP
  * @reserved:  reserved for future use
  * @count: number of MAC addresses in the message
- * @index: index of MAC address in the lookup table
- * @addr:  interface MAC address
+ * @addresses.index:   index of MAC address in the lookup table
+ * @addresses.addr:interface MAC address
  * @addresses: series of MACs to offload
  */
 struct nfp_tun_mac_addr {
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h 
b/drivers/net/ethernet/netronome/nfp/nfp_net.h
index d88eda9707e6..787df47ec430 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h
@@ -193,7 +193,8 @@ struct nfp_net_tx_desc {
 
 /**
  * struct nfp_net_tx_buf - software TX buffer descriptor
- * @skb:   sk_buff associated with this buffer
+ * @skb:   normal ring, sk_buff associated with this buffer
+ * @frag:  XDP ring, page frag associated with this buffer
  * @dma_addr:  DMA mapping address of the buffer
  * @fidx:  Fragment index (-1 for the head and [0..nr_frags-1] for frags)
  * @pkt_cnt:   Number of packets to be 

Re: [Patch net v2] xt_RATEEST: acquire xt_rateest_mutex for hash insert

2018-02-05 Thread Eric Dumazet
On Mon, 2018-02-05 at 14:41 -0800, Cong Wang wrote:
> rateest_hash is supposed to be protected by xt_rateest_mutex,
> and, as suggested by Eric, lookup and insert should be atomic,
> so we should acquire the xt_rateest_mutex once for both.
> 
> So introduce a non-locking helper for internal use and keep the
> locking one for external.
> 
> Reported-by: 
> Fixes: 5859034d7eb8 ("[NETFILTER]: x_tables: add RATEEST target")
> Cc: Pablo Neira Ayuso 
> Cc: Eric Dumazet 
> Signed-off-by: Cong Wang 
> ---

Reviewed-by: Eric Dumazet 

Thanks !



Re: [PATCH net 1/1 v2] rtnetlink: require unique netns identifier

2018-02-05 Thread Christian Brauner
On Tue, Feb 06, 2018 at 12:47:46AM +0300, Kirill Tkhai wrote:
> On 05.02.2018 18:55, Christian Brauner wrote:
> > Since we've added support for IFLA_IF_NETNSID for RTM_{DEL,GET,SET,NEW}LINK
> > it is possible for userspace to send us requests with three different
> > properties to identify a target network namespace. This affects at least
> > RTM_{NEW,SET}LINK. Each of them could potentially refer to a different
> > network namespace which is confusing. For legacy reasons the kernel will
> > pick the IFLA_NET_NS_PID property first and then look for the
> > IFLA_NET_NS_FD property but there is no reason to extend this type of
> > behavior to network namespace ids. The regression potential is quite
> > minimal since the rtnetlink requests in question either won't allow
> > IFLA_IF_NETNSID requests before 4.16 is out (RTM_{NEW,SET}LINK) or don't
> > support IFLA_NET_NS_{PID,FD} (RTM_{DEL,GET}LINK) in the first place.
> >> Signed-off-by: Christian Brauner 
> > ---
> > ChangeLog v1->v2:
> > * return errno when the specified network namespace id is invalid
> > * fill in struct netlink_ext_ack if the network namespace id is invalid
> > * rename rtnl_ensure_unique_netns_attr() to rtnl_ensure_unique_netns() to
> >   indicate that a request without any network namespace identifying 
> > attributes
> >   is also considered valid.
> > 
> > ChangeLog v0->v1:
> > * report a descriptive error to userspace via struct netlink_ext_ack
> > * do not fail when multiple properties specifiy the same network namespace
> > ---
> >  net/core/rtnetlink.c | 69 
> > 
> >  1 file changed, 69 insertions(+)
> > 
> > diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> > index 56af8e41abfc..c096c4ff9a00 100644
> > --- a/net/core/rtnetlink.c
> > +++ b/net/core/rtnetlink.c
> > @@ -1951,6 +1951,59 @@ static struct net *rtnl_link_get_net_capable(const 
> > struct sk_buff *skb,
> > return net;
> >  }
> >  
> > +/* Verify that rtnetlink requests supporting network namespace ids
> > + * do not pass additional properties referring to different network
> > + * namespaces.
> > + */
> > +static int rtnl_ensure_unique_netns(const struct sock *sk, struct nlattr 
> > *tb[],
> > +   struct netlink_ext_ack *extack)
> > +{
> > +   int ret = -EINVAL;
> > +   struct net *net = NULL, *unique_net = NULL;
> > +
> > +   /* Requests without network namespace ids have been able to specify
> > +* multiple properties referring to different network namespaces so
> > +* don't regress them.
> > +*/
> > +   if (!tb[IFLA_IF_NETNSID])
> > +   return 0;
> > +
> > +   /* Caller operates on the current network namespace. */
> > +   if (!tb[IFLA_NET_NS_PID] && !tb[IFLA_NET_NS_FD])
> > +   return 0;
> > +
> > +   unique_net = get_net_ns_by_id(sock_net(sk), 
> > nla_get_s32(tb[IFLA_IF_NETNSID]));
> > +   if (!unique_net) {
> > +   NL_SET_ERR_MSG(extack, "invalid network namespace id");
> > +   return ret;
> > +   }
> > +
> > +   if (tb[IFLA_NET_NS_PID]) {
> > +   net = get_net_ns_by_pid(nla_get_u32(tb[IFLA_NET_NS_PID]));
> > +   if (net != unique_net)
> > +   goto on_error;
> > +   }
> > +
> > +   if (tb[IFLA_NET_NS_FD]) {
> > +   net = get_net_ns_by_fd(nla_get_u32(tb[IFLA_NET_NS_FD]));
> > +   if (net != unique_net)
> > +   goto on_error;
> > +   }
> > +
> > +   ret = 0;
> > +
> > +on_error:
> > +   put_net(unique_net);
> > +
> > +   if (net && !IS_ERR(net))
> > +   put_net(net);
> 
> 1)When we have tb[IFLA_NET_NS_PID and tb[IFLA_NET_NS_FD] both set and pointing
> to the same net, this function increments net::count in get_net_ns_by_pid() 
> and
> in get_net_ns_by_fd(), i.e. twice. But only single put_net(net) will be 
> called.
> So, after this function net::count will be incremented by 1, and it never will
> die.

Thanks for spotting this, Kirill.

> 
> 2)The whole approach does not seem good for me. The first reason is it's racy.
> Even if rtnl_ensure_unique_netns() returns 0, this does not guarantees that
> tb[IFLA_IF_NETNSID] and tb[IFLA_NET_NS_PID] will be point the same net later,
> as the pid may die or do setns(). Racy check is worse than no check at all.
> 
> The second reason is after this patch get_net_ns_by_id/get_net_ns_by_pid()/
> get_net_ns_by_fd() will be called twice: the first time is in your check
> and the second time is where they are actually used. This is not good for
> performance.

If this is really a performance problem we can simply fix this by
performing the check when the target network namespace is retrieved in
each request. The intention for doing it in one function at the
beginning of each request was to make it generic and easily
understandable.

> 
> What is the problem people pass several different tb[xxx] in one call? We
> may just describe the order of tb[xxx] in man page and their priorities,
> and ignore the rest after the first not z

Re: [PATCH net v3] sctp: fix dst refcnt leak in sctp_v4_get_dst

2018-02-05 Thread David Ahern
On 2/5/18 12:48 PM, Tommi Rantala wrote:
> Fix dst reference count leak in sctp_v4_get_dst() introduced in commit
> 410f03831 ("sctp: add routing output fallback"):
> 
> When walking the address_list, successive ip_route_output_key() calls
> may return the same rt->dst with the reference incremented on each call.
> 
> The code would not decrement the dst refcount when the dst pointer was
> identical from the previous iteration, causing the dst refcnt leak.
> 
...
>   ...
> 
> Fixes: 410f03831 ("sctp: add routing output fallback")
> Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary 
> addresses")

any syzbot references for this bug?


Re: [Patch net v2] xt_RATEEST: acquire xt_rateest_mutex for hash insert

2018-02-05 Thread Florian Westphal
Cong Wang  wrote:
> rateest_hash is supposed to be protected by xt_rateest_mutex,
> and, as suggested by Eric, lookup and insert should be atomic,
> so we should acquire the xt_rateest_mutex once for both.
> 
> So introduce a non-locking helper for internal use and keep the
> locking one for external.

Looks good, thanks Cong.

Reviewed-by: Florian Westphal 


Re: [NetDev-info] Distributed Switch Architecture for 88E6390

2018-02-05 Thread Andrew Lunn
> Hi Andrew,
> 
> > For port 0x6 reg 0x4, confirmed.  It looks like the latest code is
> > correctly setting bits 13:12.  I'm porting from an older hash.

Hi Dave

6390 support is quite new. So you really should be use the latest
code. Otherwise you are going to have issues like this. Plus you are
going to want the fixes for the issues you just pointed out yourself.

> > Thanks for the suggestion regarding the MTU size.  I will look
> > into it in detail and follow up.  The driver for the ethernet
> > device I'm using is compatible with the nvidia,eqos device (which
> > is basically saying it is the built-in MAC & PHY on the Tegra 186
> > ).

I'm not aware of anybody using a tegra with DSA. So there could be
frame size issues.

  Andrew


Re: [PATCH net v3] sctp: fix dst refcnt leak in sctp_v4_get_dst

2018-02-05 Thread Marcelo Ricardo Leitner
On Mon, Feb 05, 2018 at 09:48:14PM +0200, Tommi Rantala wrote:
> Fix dst reference count leak in sctp_v4_get_dst() introduced in commit
> 410f03831 ("sctp: add routing output fallback"):
> 
> When walking the address_list, successive ip_route_output_key() calls
> may return the same rt->dst with the reference incremented on each call.
> 
> The code would not decrement the dst refcount when the dst pointer was
> identical from the previous iteration, causing the dst refcnt leak.
> 
> Testcase:
>   ip netns add TEST
>   ip netns exec TEST ip link set lo up
>   ip link add dummy0 type dummy
>   ip link add dummy1 type dummy
>   ip link add dummy2 type dummy
>   ip link set dev dummy0 netns TEST
>   ip link set dev dummy1 netns TEST
>   ip link set dev dummy2 netns TEST
>   ip netns exec TEST ip addr add 192.168.1.1/24 dev dummy0
>   ip netns exec TEST ip link set dummy0 up
>   ip netns exec TEST ip addr add 192.168.1.2/24 dev dummy1
>   ip netns exec TEST ip link set dummy1 up
>   ip netns exec TEST ip addr add 192.168.1.3/24 dev dummy2
>   ip netns exec TEST ip link set dummy2 up
>   ip netns exec TEST sctp_test -H 192.168.1.2 -P 20002 -h 192.168.1.1 -p 
> 2 -s -B 192.168.1.3
>   ip netns del TEST
> 
> In 4.4 and 4.9 kernels this results to:
>   [  354.179591] unregister_netdevice: waiting for lo to become free. Usage 
> count = 1
>   [  364.419674] unregister_netdevice: waiting for lo to become free. Usage 
> count = 1
>   [  374.663664] unregister_netdevice: waiting for lo to become free. Usage 
> count = 1
>   [  384.903717] unregister_netdevice: waiting for lo to become free. Usage 
> count = 1
>   [  395.143724] unregister_netdevice: waiting for lo to become free. Usage 
> count = 1
>   [  405.383645] unregister_netdevice: waiting for lo to become free. Usage 
> count = 1
>   ...
> 
> Fixes: 410f03831 ("sctp: add routing output fallback")
> Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary 
> addresses")
> Signed-off-by: Tommi Rantala 

Acked-by: Marcelo Ricardo Leitner 

Thanks

> ---
> 
> v2: unconditional dst_release() before breaking out of the loop:
>  - if dst == NULL, then dst_release() is no-op.
>  - if dst != &rt->dst, then drop reference to the first dst, we do not
>need it after all.
>  - if dst == &rt->dst, then the refcnt was incremented twice for this
>dst, so drop the second ref.
> 
>  net/sctp/protocol.c | 10 --
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
> index ba28d270eb24..aba2381d3cf9 100644
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -509,22 +509,20 @@ static void sctp_v4_get_dst(struct sctp_transport *t, 
> union sctp_addr *saddr,
>   if (IS_ERR(rt))
>   continue;
>  
> - if (!dst)
> - dst = &rt->dst;
> -
>   /* Ensure the src address belongs to the output
>* interface.
>*/
>   odev = __ip_dev_find(sock_net(sk), laddr->a.v4.sin_addr.s_addr,
>false);
>   if (!odev || odev->ifindex != fl4->flowi4_oif) {
> - if (&rt->dst != dst)
> + if (!dst)
> + dst = &rt->dst;
> + else
>   dst_release(&rt->dst);
>   continue;
>   }
>  
> - if (dst != &rt->dst)
> - dst_release(dst);
> + dst_release(dst);
>   dst = &rt->dst;
>   break;
>   }
> -- 
> 2.15.1
> 


[Patch net v2] xt_RATEEST: acquire xt_rateest_mutex for hash insert

2018-02-05 Thread Cong Wang
rateest_hash is supposed to be protected by xt_rateest_mutex,
and, as suggested by Eric, lookup and insert should be atomic,
so we should acquire the xt_rateest_mutex once for both.

So introduce a non-locking helper for internal use and keep the
locking one for external.

Reported-by: 
Fixes: 5859034d7eb8 ("[NETFILTER]: x_tables: add RATEEST target")
Cc: Pablo Neira Ayuso 
Cc: Eric Dumazet 
Signed-off-by: Cong Wang 
---
 net/netfilter/xt_RATEEST.c | 22 +-
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/xt_RATEEST.c b/net/netfilter/xt_RATEEST.c
index 498b54fd04d7..141c295191f6 100644
--- a/net/netfilter/xt_RATEEST.c
+++ b/net/netfilter/xt_RATEEST.c
@@ -39,23 +39,31 @@ static void xt_rateest_hash_insert(struct xt_rateest *est)
hlist_add_head(&est->list, &rateest_hash[h]);
 }
 
-struct xt_rateest *xt_rateest_lookup(const char *name)
+static struct xt_rateest *__xt_rateest_lookup(const char *name)
 {
struct xt_rateest *est;
unsigned int h;
 
h = xt_rateest_hash(name);
-   mutex_lock(&xt_rateest_mutex);
hlist_for_each_entry(est, &rateest_hash[h], list) {
if (strcmp(est->name, name) == 0) {
est->refcnt++;
-   mutex_unlock(&xt_rateest_mutex);
return est;
}
}
-   mutex_unlock(&xt_rateest_mutex);
+
return NULL;
 }
+
+struct xt_rateest *xt_rateest_lookup(const char *name)
+{
+   struct xt_rateest *est;
+
+   mutex_lock(&xt_rateest_mutex);
+   est = __xt_rateest_lookup(name);
+   mutex_unlock(&xt_rateest_mutex);
+   return est;
+}
 EXPORT_SYMBOL_GPL(xt_rateest_lookup);
 
 void xt_rateest_put(struct xt_rateest *est)
@@ -100,8 +108,10 @@ static int xt_rateest_tg_checkentry(const struct 
xt_tgchk_param *par)
 
net_get_random_once(&jhash_rnd, sizeof(jhash_rnd));
 
-   est = xt_rateest_lookup(info->name);
+   mutex_lock(&xt_rateest_mutex);
+   est = __xt_rateest_lookup(info->name);
if (est) {
+   mutex_unlock(&xt_rateest_mutex);
/*
 * If estimator parameters are specified, they must match the
 * existing estimator.
@@ -139,11 +149,13 @@ static int xt_rateest_tg_checkentry(const struct 
xt_tgchk_param *par)
 
info->est = est;
xt_rateest_hash_insert(est);
+   mutex_unlock(&xt_rateest_mutex);
return 0;
 
 err2:
kfree(est);
 err1:
+   mutex_unlock(&xt_rateest_mutex);
return ret;
 }
 
-- 
2.13.0



Re: [PATCH 4/4] net: amd-xgbe: fix comparison to bitshift when dealing with a mask

2018-02-05 Thread Tom Lendacky
On 2/5/2018 2:10 PM, Wolfram Sang wrote:
> Due to a typo, the mask was destroyed by a comparison instead of a bit
> shift.
> 
> Signed-off-by: Wolfram Sang 

Excellent find.

Acked-by: Tom Lendacky 

David, this should also be applied to the 4.14 and 4.15 stable releases.

Thanks,
Tom

> ---
> Only build tested. To be applied individually per subsystem.
> 
>  drivers/net/ethernet/amd/xgbe/xgbe-drv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c 
> b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
> index 7a3ebfd236f5eb..100adee778dfd6 100644
> --- a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
> +++ b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
> @@ -595,7 +595,7 @@ static void xgbe_isr_task(unsigned long data)
>  
>   reissue_mask = 1 << 0;
>   if (!pdata->per_channel_irq)
> - reissue_mask |= 0x < 4;
> + reissue_mask |= 0x << 4;
>  
>   XP_IOWRITE(pdata, XP_INT_REISSUE_EN, reissue_mask);
>   }
> 


Re: Potential issue with f5e64032a799 "net: phy: fix resume handling"

2018-02-05 Thread Heiner Kallweit
Am 04.02.2018 um 03:48 schrieb Florian Fainelli:
> 
> 
> On 02/03/2018 03:58 PM, Heiner Kallweit wrote:
>> Am 03.02.2018 um 21:17 schrieb Andrew Lunn:
>>> On Sat, Feb 03, 2018 at 05:41:54PM +0100, Heiner Kallweit wrote:
 This commit forces callers of phy_resume() and phy_suspend() to hold
 mutex phydev->lock. This was done for calls to phy_resume() and
 phy_suspend() in phylib, however there are more callers in network
 drivers. I'd assume that these other calls issue a warning now
 because of the lock not being held.
 So is there something I miss or would this have to be fixed?
>>>
>>> Hi Heiner
>>>
>>> This is a good point.
>>>
>>> Yes, it looks like some fixes are needed. But what exactly?
>>>
>>> The phy state machine will suspend and resume the phy is you call
>>> phy_stop() and phy_start() in the MAC suspend and resume functions.
>>>
>> AFAICS phy_stop() doesn't suspend the PHY. It just sets the state
>> to PHY_HALTED and (at least if we're not in polling mode) doesn't
>> call phy_suspend(). Maybe a call to phy_trigger_machine() is
>> needed like in phy_start() ? Then the state machine would call
>> phy_suspend(), provided the link is still up.
> 
> Right, phy_stop() merely just moves the state machine to PHY_HALTED and
> this is actually a great source of problems which I tried to address here:
> 
> https://www.mail-archive.com/netdev@vger.kernel.org/msg196061.html
> 
> because phy_stop() is not a synchronous call, so when it returns the
> state machine might still be running (it can take up to a 1 HZ depending
> on when you called phy_stop()) and so if you took that as a
> synchronization point to e.g: turn of your Ethernet MAC/MDIO bus clocks,
> you will likely see problems. phy_stop_machine() would provide that
> synchronization point, but is not currently exported, despite being a
> global symbol. This patch series above is all well and good, except that
> Geert reported issues with suspend/resume interactions which I have not
> been able to track down.
> 
> We should most definitively try to consolidate the different PHY
> suspend/resume within the Ethernet MAC suspend/resume implementation and
> document exactly what the appropriate behavior must be under the
> following circumstances:
> 
> - when to call phy_stop() + phy_stop_machine()
> - when to call phy_suspend() (if the network interface does do not WoL)
> - when to call phy_resume() (if needed, actually, it usually is not)
> - when to call phy_start()
> 

I think phy_start() / phy_start_machine() / phy_start_interrupts()
belong together and we may call the latter two functions from phy_start().
Same for stop.

This would mean:
- Remove call to phy_start_interrupts() from phy_connect_direct()
- Call phy_start_machine() and phy_start_interrupts() from phy_start()
- mdio_bus_phy_suspend() calls phy_stop()
Same for stop, plus: phy_error() calls phy_stop().

In this setup a second call to phy_stop() wouldn't hurt because state
is PHY_HALTED already and phy_stop() is a no-op.

A functional change would be that interrupts are disabled during system
suspend (except WoL because we don't suspend the PHY is this case). 

These are first thoughts and therefore it's fine if you totally disagree ..
I didn't test this yet, it's only a "Gedankenexperiment" so far.

When talking about suspend/resume I think we talked about system suspend /
resume. However I think we need to consider also runtime pm.
If a link is down the network driver may decide to runtime-suspend the PHY
(power it down). In case of runtime pm I'd say we need to keep irq and
workqueue active to be able to react if a cable is plugged in and the PHY
wakes up automatically and establishes a link.

Heiner


> I don't unfortunately have the time to code this myself at the moment,
> but I will happily review patches if you have the opportunity to do so.
> 
>>
>> However, if the link is down already (due to whatever calls
>> around phy_stop() in the driver) then phy_suspend() wouldn't be
>> called.
> 
> Correct, there is an implicit assumption that when the link is down,
> there is an opportunity for the Ethernet MAC driver to put things in low
> power, and the PHY itself, should be in a lower power mode where only
> link/energy detection might be utilizing power. At least this is the theory.
> 
>>
>> Heiner
>>
>>> A few examples:
>>>
>>> tc35815_suspend(), ravb_suspend() via ravb_close(), sh_eth_suspend()
>>> via sh_eth_close(), fec_suspend(), mpc52xx_fec_of_suspend() via
>>> mpc52xx_fec_close(), ucc_geth_suspend(), etc...
>>>
>>> So i suspect those drivers which call phy_suspend()/phy_resume()
>>> should really be modified to call phy_stop()/phy_start().
>>>
>>> hns_nic_config_phy_loopback() is just funky, and probably needs the
>>> help of the hns guys to fix.
>>>
>>> dsa_slave_suspend() already does a phy_stop(), so the phy_suspend()
>>> can be removed.
>>>
>>> The comments in lpc_eth_open() suggest the phy_resume() is needed, so
>>> locks should be added. socfpga_d

Re: [PATCH net 1/1 v2] rtnetlink: require unique netns identifier

2018-02-05 Thread Kirill Tkhai
On 05.02.2018 18:55, Christian Brauner wrote:
> Since we've added support for IFLA_IF_NETNSID for RTM_{DEL,GET,SET,NEW}LINK
> it is possible for userspace to send us requests with three different
> properties to identify a target network namespace. This affects at least
> RTM_{NEW,SET}LINK. Each of them could potentially refer to a different
> network namespace which is confusing. For legacy reasons the kernel will
> pick the IFLA_NET_NS_PID property first and then look for the
> IFLA_NET_NS_FD property but there is no reason to extend this type of
> behavior to network namespace ids. The regression potential is quite
> minimal since the rtnetlink requests in question either won't allow
> IFLA_IF_NETNSID requests before 4.16 is out (RTM_{NEW,SET}LINK) or don't
> support IFLA_NET_NS_{PID,FD} (RTM_{DEL,GET}LINK) in the first place.
>> Signed-off-by: Christian Brauner 
> ---
> ChangeLog v1->v2:
> * return errno when the specified network namespace id is invalid
> * fill in struct netlink_ext_ack if the network namespace id is invalid
> * rename rtnl_ensure_unique_netns_attr() to rtnl_ensure_unique_netns() to
>   indicate that a request without any network namespace identifying attributes
>   is also considered valid.
> 
> ChangeLog v0->v1:
> * report a descriptive error to userspace via struct netlink_ext_ack
> * do not fail when multiple properties specifiy the same network namespace
> ---
>  net/core/rtnetlink.c | 69 
> 
>  1 file changed, 69 insertions(+)
> 
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index 56af8e41abfc..c096c4ff9a00 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -1951,6 +1951,59 @@ static struct net *rtnl_link_get_net_capable(const 
> struct sk_buff *skb,
>   return net;
>  }
>  
> +/* Verify that rtnetlink requests supporting network namespace ids
> + * do not pass additional properties referring to different network
> + * namespaces.
> + */
> +static int rtnl_ensure_unique_netns(const struct sock *sk, struct nlattr 
> *tb[],
> + struct netlink_ext_ack *extack)
> +{
> + int ret = -EINVAL;
> + struct net *net = NULL, *unique_net = NULL;
> +
> + /* Requests without network namespace ids have been able to specify
> +  * multiple properties referring to different network namespaces so
> +  * don't regress them.
> +  */
> + if (!tb[IFLA_IF_NETNSID])
> + return 0;
> +
> + /* Caller operates on the current network namespace. */
> + if (!tb[IFLA_NET_NS_PID] && !tb[IFLA_NET_NS_FD])
> + return 0;
> +
> + unique_net = get_net_ns_by_id(sock_net(sk), 
> nla_get_s32(tb[IFLA_IF_NETNSID]));
> + if (!unique_net) {
> + NL_SET_ERR_MSG(extack, "invalid network namespace id");
> + return ret;
> + }
> +
> + if (tb[IFLA_NET_NS_PID]) {
> + net = get_net_ns_by_pid(nla_get_u32(tb[IFLA_NET_NS_PID]));
> + if (net != unique_net)
> + goto on_error;
> + }
> +
> + if (tb[IFLA_NET_NS_FD]) {
> + net = get_net_ns_by_fd(nla_get_u32(tb[IFLA_NET_NS_FD]));
> + if (net != unique_net)
> + goto on_error;
> + }
> +
> + ret = 0;
> +
> +on_error:
> + put_net(unique_net);
> +
> + if (net && !IS_ERR(net))
> + put_net(net);

1)When we have tb[IFLA_NET_NS_PID and tb[IFLA_NET_NS_FD] both set and pointing
to the same net, this function increments net::count in get_net_ns_by_pid() and
in get_net_ns_by_fd(), i.e. twice. But only single put_net(net) will be called.
So, after this function net::count will be incremented by 1, and it never will
die.

2)The whole approach does not seem good for me. The first reason is it's racy.
Even if rtnl_ensure_unique_netns() returns 0, this does not guarantees that
tb[IFLA_IF_NETNSID] and tb[IFLA_NET_NS_PID] will be point the same net later,
as the pid may die or do setns(). Racy check is worse than no check at all.

The second reason is after this patch get_net_ns_by_id/get_net_ns_by_pid()/
get_net_ns_by_fd() will be called twice: the first time is in your check
and the second time is where they are actually used. This is not good for
performance.

What is the problem people pass several different tb[xxx] in one call? We
may just describe the order of tb[xxx] in man page and their priorities,
and ignore the rest after the first not zero tb[xxx] is found, and do that
in the place, where net from tb[xxx] in actually used. This is the thing
we already do.

Comparing to classic Linux interface such as syscalls, it's usual behavior
for them to ignore one argument, when another is set. Nobody confuses.

Kirill


Re: [NetDev-info] Distributed Switch Architecture for 88E6390

2018-02-05 Thread Andrew Lunn
On Mon, Feb 05, 2018 at 03:52:20PM -0500, David Thompson wrote:
> Hello all!  I really appreciate any feedback with regards to my inquiry.

Hi David

No top posting please.

Please make your signature confirm the netiquette. That general means
4 lines max, no legal gibberish, etc.

 Andrew


Re: [PATCH net v4] cls_u32: fix use after free in u32_destroy_key()

2018-02-05 Thread Cong Wang
On Mon, Feb 5, 2018 at 1:23 PM, Paolo Abeni  wrote:
> The problem is that the htnode is freed before the linked knodes and the
> latter will try to access the first at u32_destroy_key() time.
> This change addresses the issue using the htnode refcnt to guarantee
> the correct free order. While at it also add a RCU annotation,
> to keep sparse happy.
>
> v1 -> v2: use rtnl_derefence() instead of RCU read locks
> v2 -> v3:
>   - don't check refcnt in u32_destroy_hnode()
>   - cleaned-up u32_destroy() implementation
>   - cleaned-up code comment
> v3 -> v4:
>   - dropped unneeded comment
>
> Reported-by: Li Shuang 
> Fixes: c0d378ef1266 ("net_sched: use tcf_queue_work() in u32 filter")
> Signed-off-by: Paolo Abeni 

Acked-by: Cong Wang 

Thanks.


Re: [NetDev-info] Distributed Switch Architecture for 88E6390

2018-02-05 Thread Andrew Lunn
On Mon, Feb 05, 2018 at 02:59:48PM -0500, S.Y. Park wrote:
> Dear Mr. Thompson,
> 
> I'm forwarding to you to the technical discussion mailing list called
> "netdev@vger.kernel.org".
> 
> i...@netdevconf.org is for discussions regarding The NetDev Society's
> NetDev Conference  attendance & participation questions & concerns,
> not technical discussion.
> 
> Good luck w/ your 88E6390 driver functionality testing contribution,
> and I'm sure someone from the netdev mailing list can address your
> question.
> 
> Sincerely,
> 
> Soyoung Park

Hi Soyoung

Thanks for forwarding this.

> > I am looking for contacts to anyone that is currently working on the Marvell
> > 88E6390 port within the Distributed Switch Architecture.  I am working
> > through a back port to 4.4.38 in a non-trivial integration and have found a
> > number of issues I believe are legitimate issues that I would like to
> > feedback.  I am also seeing a potential issue with outgoing TCP traffic and
> > was looking to see if this is a known issue or has been investigated prior.
> > I would be very happy to assist in testing the 88E6390 driver functionality.

Hi David

Are the issues you report here with your backported 4.4.38, or a
net-next?

> > For my configuration the CPU port is port 6, there is only one 88E6390 and
> > finally port 0 and the SERDES are not used.  I am also using the GPIO
> > bitbang driver for MDIO to the 88E6390 (I've confirmed that it works
> > independently).

Bit banging should not be a problem. I've used it with other Marvell
switches.

> > First, I've noticed a few interesting discrepancies in setup and the
> > datasheet for the 88E6390.  They are as follows:
> >
> > CPU Port port control configuration (port 0x6 reg 0x4) is configured 0x3107.
> > This indicates
> >
> > 15:14 SA Filter Disabled
> > 9:8 Frame Mode DSA
> > 13:12 Egress Mode - is then in an undefined state.  According the datasheet
> > it must be set to 00.

So mv88e6xxx_setup_port_mode() calls
mv88e6xxx_set_port_mode_dsa(), which calls
mv88e6xxx_set_port_mode(chip, port, MV88E6XXX_FRAME_MODE_DSA,
MV88E6XXX_EGRESS_MODE_UNMODIFIED,
MV88E6XXX_PORT_ETH_TYPE_DEFAULT);
which calls
mv88e6xxx_port_set_egress_mode(chip, port, egress);
where egress is MV88E6XXX_EGRESS_MODE_UNMODIFIED.

int mv88e6xxx_port_set_egress_mode(struct mv88e6xxx_chip *chip, int port,
   enum mv88e6xxx_egress_mode mode)
{
int err;
u16 reg;

err = mv88e6xxx_port_read(chip, port, MV88E6XXX_PORT_CTL0, ®);
if (err)
return err;

reg &= ~MV88E6XXX_PORT_CTL0_EGRESS_MODE_MASK;

switch (mode) {
case MV88E6XXX_EGRESS_MODE_UNMODIFIED:
reg |= MV88E6XXX_PORT_CTL0_EGRESS_MODE_UNMODIFIED;
break;

#define MV88E6XXX_PORT_CTL0_EGRESS_MODE_UNMODIFIED  0x

So that should me 13:12 are set to 00.

On my board, i have a value of 0x53f.

> > Global1 Global Control 2 (addr 0x1b reg 0x1c) is configured 0xf000.  Bits
> > 15:14 only supports 0x0, 0x1, and 0x2 while bits 13:12 are reserved.

>From my board

Global1@0
 0: c801
 1:0
 2:0
 3:0
 4: 40a8
 5: 1000
 6:0
 7:0
 8:0
 9:0
10:  509
11: 3000
12: 7ff7
13: 
14: 
15: 
16:0
17:0
18:0
19:0
20:0
21:0
22:0
23:0
24:0
25:0
26:  3ff
27:  3f9
28: f0c0

1c = 28. Yes, we have the upper bits set.

This is from mv88e6xxx_g1_setup(), setting
MV88E6XXX_G1_CTL2_MULTIPLE_CASCADE which is not valid for this family.

29: 1000
30:0
31:0


> > Global1 registers 0x10 -> 0x18 are written 0x, 0x, 0x, 0x,
> > 0x, 0x, 0x, and 0x.  According the 88E6390 datasheet
> > 0x10->0x18 are reserved.

Agreed. mv88e6xxx_g1_setup() needs a few fixes.

> > I now face an interesting problem, I am currently unable to setup or
> > maintain a TCP session.  Doing a wireshark capture on my host pc to the
> > target device (including the 88E6390) with the appropriate port enabled I am
> > seeing packet re-transmission issues from the target device.  For example,
> > if I try to initiate from my host pc I will send a SYN packet and receive
> > multiple identical  SYN,ACK packets returned or if I initiate from the
> > target device I see multiple identical SYN packets sent.  I was wondering if
> > you had ever seen this anything like this issue and if so had any
> > suggestions with regards to what might be the cause?

I would suggest running tcpdump on the target as well. See if packets
it is trying to send are not making it out. This is made a bit harder
in the tcpdump does not understand the DSA header. But you can at
least see the raw frames.

What Ethernet device are you using to the SoC side of the CPU port?
Adding the DSA header makes the Ethernet frame bigger. It is then
bigger than the standard Ethernet MTU. I've seen some Ethernet
drivers/HW discard such frames. Check you can actually

[PATCH net 2/3] net: erspan: fix erspan config overwrite

2018-02-05 Thread William Tu
When an erspan tunnel device receives an erpsan packet with different
tunnel metadata (ex: version, index, hwid, direction), existing code
overwrites the tunnel device's erspan configuration with the received
packet's metadata.  The patch fixes it.

Fixes: 1a66a836da63 ("gre: add collect_md mode to ERSPAN tunnel")
Fixes: f551c91de262 ("net: erspan: introduce erspan v2 for ip_gre")
Fixes: ef7baf5e083c ("ip6_gre: add ip6 erspan collect_md mode")
Fixes: 94d7d8f29287 ("ip6_gre: add erspan v2 support")
Signed-off-by: William Tu 
---
 net/ipv4/ip_gre.c  | 9 -
 net/ipv6/ip6_gre.c | 9 -
 2 files changed, 18 deletions(-)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 9b50eddd1882..45d97e9b2759 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -322,15 +322,6 @@ static int erspan_rcv(struct sk_buff *skb, struct 
tnl_ptk_info *tpi,
info = &tun_dst->u.tun_info;
info->key.tun_flags |= TUNNEL_ERSPAN_OPT;
info->options_len = sizeof(*md);
-   } else {
-   tunnel->erspan_ver = ver;
-   if (ver == 1) {
-   tunnel->index = ntohl(pkt_md->u.index);
-   } else {
-   tunnel->dir = pkt_md->u.md2.dir;
-   tunnel->hwid = get_hwid(&pkt_md->u.md2);
-   }
-
}
 
skb_reset_mac_header(skb);
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 50913dbd0612..3c353125546d 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -562,15 +562,6 @@ static int ip6erspan_rcv(struct sk_buff *skb, int 
gre_hdr_len,
ip6_tnl_rcv(tunnel, skb, tpi, tun_dst, log_ecn_error);
 
} else {
-   tunnel->parms.erspan_ver = ver;
-
-   if (ver == 1) {
-   tunnel->parms.index = ntohl(pkt_md->u.index);
-   } else {
-   tunnel->parms.dir = pkt_md->u.md2.dir;
-   tunnel->parms.hwid = get_hwid(&pkt_md->u.md2);
-   }
-
ip6_tnl_rcv(tunnel, skb, tpi, NULL, log_ecn_error);
}
 
-- 
2.7.4



[PATCH net 3/3] sample/bpf: fix erspan metadata

2018-02-05 Thread William Tu
The commit c69de58ba84f ("net: erspan: use bitfield instead of
mask and offset") changes the erspan header to use bitfield, and
commit d350a823020e ("net: erspan: create erspan metadata uapi header")
creates a uapi header file.  The above two commit breaks the current
erspan test.  This patch fixes it by adapting the above two changes.

Fixes: ac80c2a165af ("samples/bpf: add erspan v2 sample code")
Fixes: ef88f89c830f ("samples/bpf: extend test_tunnel_bpf.sh with ERSPAN")
Signed-off-by: William Tu 
---
 samples/bpf/tcbpf2_kern.c  | 41 -
 samples/bpf/test_tunnel_bpf.sh |  4 ++--
 2 files changed, 18 insertions(+), 27 deletions(-)

diff --git a/samples/bpf/tcbpf2_kern.c b/samples/bpf/tcbpf2_kern.c
index f6bbf8f50da3..efdc16d195ff 100644
--- a/samples/bpf/tcbpf2_kern.c
+++ b/samples/bpf/tcbpf2_kern.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "bpf_helpers.h"
 #include "bpf_endian.h"
@@ -35,24 +36,10 @@ struct geneve_opt {
u8  opt_data[8]; /* hard-coded to 8 byte */
 };
 
-struct erspan_md2 {
-   __be32 timestamp;
-   __be16 sgt;
-   __be16 flags;
-};
-
 struct vxlan_metadata {
u32 gbp;
 };
 
-struct erspan_metadata {
-   union {
-   __be32 index;
-   struct erspan_md2 md2;
-   } u;
-   int version;
-};
-
 SEC("gre_set_tunnel")
 int _gre_set_tunnel(struct __sk_buff *skb)
 {
@@ -156,13 +143,15 @@ int _erspan_set_tunnel(struct __sk_buff *skb)
__builtin_memset(&md, 0, sizeof(md));
 #ifdef ERSPAN_V1
md.version = 1;
-   md.u.index = htonl(123);
+   md.u.index = bpf_htonl(123);
 #else
u8 direction = 1;
-   u16 hwid = 7;
+   u8 hwid = 7;
 
md.version = 2;
-   md.u.md2.flags = htons((direction << 3) | (hwid << 4));
+   md.u.md2.dir = direction;
+   md.u.md2.hwid = hwid & 0xf;
+   md.u.md2.hwid_upper = (hwid >> 4) & 0x3;
 #endif
 
ret = bpf_skb_set_tunnel_opt(skb, &md, sizeof(md));
@@ -207,9 +196,9 @@ int _erspan_get_tunnel(struct __sk_buff *skb)
char fmt2[] = "\tdirection %d hwid %x timestamp %u\n";
 
bpf_trace_printk(fmt2, sizeof(fmt2),
-   (ntohs(md.u.md2.flags) >> 3) & 0x1,
-   (ntohs(md.u.md2.flags) >> 4) & 0x3f,
-   bpf_ntohl(md.u.md2.timestamp));
+md.u.md2.dir,
+(md.u.md2.hwid_upper << 4) + md.u.md2.hwid,
+bpf_ntohl(md.u.md2.timestamp));
 #endif
 
return TC_ACT_OK;
@@ -242,10 +231,12 @@ int _ip4ip6erspan_set_tunnel(struct __sk_buff *skb)
md.version = 1;
 #else
u8 direction = 0;
-   u16 hwid = 17;
+   u8 hwid = 17;
 
md.version = 2;
-   md.u.md2.flags = htons((direction << 3) | (hwid << 4));
+   md.u.md2.dir = direction;
+   md.u.md2.hwid = hwid & 0xf;
+   md.u.md2.hwid_upper = (hwid >> 4) & 0x3;
 #endif
 
ret = bpf_skb_set_tunnel_opt(skb, &md, sizeof(md));
@@ -290,9 +281,9 @@ int _ip4ip6erspan_get_tunnel(struct __sk_buff *skb)
char fmt2[] = "\tdirection %d hwid %x timestamp %u\n";
 
bpf_trace_printk(fmt2, sizeof(fmt2),
-   (ntohs(md.u.md2.flags) >> 3) & 0x1,
-   (ntohs(md.u.md2.flags) >> 4) & 0x3f,
-   bpf_ntohl(md.u.md2.timestamp));
+md.u.md2.dir,
+(md.u.md2.hwid_upper << 4) + md.u.md2.hwid,
+bpf_ntohl(md.u.md2.timestamp));
 #endif
 
return TC_ACT_OK;
diff --git a/samples/bpf/test_tunnel_bpf.sh b/samples/bpf/test_tunnel_bpf.sh
index ae7f7c38309b..43ce049996ee 100755
--- a/samples/bpf/test_tunnel_bpf.sh
+++ b/samples/bpf/test_tunnel_bpf.sh
@@ -68,7 +68,7 @@ function add_erspan_tunnel {
ip netns exec at_ns0 \
ip link add dev $DEV_NS type $TYPE seq key 2 \
local 172.16.1.100 remote 172.16.1.200 \
-   erspan_ver 2 erspan_dir 1 erspan_hwid 3
+   erspan_ver 2 erspan_dir egress erspan_hwid 3
fi
ip netns exec at_ns0 ip link set dev $DEV_NS up
ip netns exec at_ns0 ip addr add dev $DEV_NS 10.1.1.100/24
@@ -97,7 +97,7 @@ function add_ip6erspan_tunnel {
ip netns exec at_ns0 \
ip link add dev $DEV_NS type $TYPE seq key 2 \
local ::11 remote ::22 \
-   erspan_ver 2 erspan_dir 1 erspan_hwid 7
+   erspan_ver 2 erspan_dir egress erspan_hwid 7
fi
ip netns exec at_ns0 ip addr add dev $DEV_NS 10.1.1.100/24
ip netns exec at_ns0 ip link set dev $DEV_NS up
-- 
2.7.4



[PATCH net 1/3] net: erspan: fix metadata extraction

2018-02-05 Thread William Tu
Commit d350a823020e ("net: erspan: create erspan metadata uapi header")
moves the erspan 'version' in front of the 'struct erspan_md2' for
later extensibility reason.  This breaks the existing erspan metadata
extraction code because the erspan_md2 then has a 4-byte offset
to between the erspan_metadata and erspan_base_hdr.  This patch
fixes it.

Fixes: 1a66a836da63 ("gre: add collect_md mode to ERSPAN tunnel")
Fixes: ef7baf5e083c ("ip6_gre: add ip6 erspan collect_md mode")
Fixes: 1d7e2ed22f8d ("net: erspan: refactor existing erspan code")
Signed-off-by: William Tu 
---
 include/net/erspan.h | 26 +-
 net/ipv4/ip_gre.c|  5 -
 net/ipv6/ip6_gre.c   |  6 --
 3 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/include/net/erspan.h b/include/net/erspan.h
index 5daa4866412b..d044aa60cc76 100644
--- a/include/net/erspan.h
+++ b/include/net/erspan.h
@@ -159,13 +159,13 @@ static inline void erspan_build_header(struct sk_buff 
*skb,
struct ethhdr *eth = (struct ethhdr *)skb->data;
enum erspan_encap_type enc_type;
struct erspan_base_hdr *ershdr;
-   struct erspan_metadata *ersmd;
struct qtag_prefix {
__be16 eth_type;
__be16 tci;
} *qp;
u16 vlan_tci = 0;
u8 tos;
+   __be32 *idx;
 
tos = is_ipv4 ? ip_hdr(skb)->tos :
(ipv6_hdr(skb)->priority << 4) +
@@ -195,8 +195,8 @@ static inline void erspan_build_header(struct sk_buff *skb,
set_session_id(ershdr, id);
 
/* Build metadata */
-   ersmd = (struct erspan_metadata *)(ershdr + 1);
-   ersmd->u.index = htonl(index & INDEX_MASK);
+   idx = (__be32 *)(ershdr + 1);
+   *idx = htonl(index & INDEX_MASK);
 }
 
 /* ERSPAN GRA: timestamp granularity
@@ -225,7 +225,7 @@ static inline void erspan_build_header_v2(struct sk_buff 
*skb,
 {
struct ethhdr *eth = (struct ethhdr *)skb->data;
struct erspan_base_hdr *ershdr;
-   struct erspan_metadata *md;
+   struct erspan_md2 *md2;
struct qtag_prefix {
__be16 eth_type;
__be16 tci;
@@ -261,15 +261,15 @@ static inline void erspan_build_header_v2(struct sk_buff 
*skb,
set_session_id(ershdr, id);
 
/* Build metadata */
-   md = (struct erspan_metadata *)(ershdr + 1);
-   md->u.md2.timestamp = erspan_get_timestamp();
-   md->u.md2.sgt = htons(sgt);
-   md->u.md2.p = 1;
-   md->u.md2.ft = 0;
-   md->u.md2.dir = direction;
-   md->u.md2.gra = gra;
-   md->u.md2.o = 0;
-   set_hwid(&md->u.md2, hwid);
+   md2 = (struct erspan_md2 *)(ershdr + 1);
+   md2->timestamp = erspan_get_timestamp();
+   md2->sgt = htons(sgt);
+   md2->p = 1;
+   md2->ft = 0;
+   md2->dir = direction;
+   md2->gra = gra;
+   md2->o = 0;
+   set_hwid(md2, hwid);
 }
 
 #endif
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 6ec670fbbbdd..9b50eddd1882 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -261,6 +261,7 @@ static int erspan_rcv(struct sk_buff *skb, struct 
tnl_ptk_info *tpi,
struct ip_tunnel_net *itn;
struct ip_tunnel *tunnel;
const struct iphdr *iph;
+   struct erspan_md2 *md2;
int ver;
int len;
 
@@ -313,8 +314,10 @@ static int erspan_rcv(struct sk_buff *skb, struct 
tnl_ptk_info *tpi,
return PACKET_REJECT;
 
md = ip_tunnel_info_opts(&tun_dst->u.tun_info);
-   memcpy(md, pkt_md, sizeof(*md));
md->version = ver;
+   md2 = &md->u.md2;
+   memcpy(md2, pkt_md, ver == 1 ? ERSPAN_V1_MDSIZE :
+  ERSPAN_V2_MDSIZE);
 
info = &tun_dst->u.tun_info;
info->key.tun_flags |= TUNNEL_ERSPAN_OPT;
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 05f070e123e4..50913dbd0612 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -505,6 +505,7 @@ static int ip6erspan_rcv(struct sk_buff *skb, int 
gre_hdr_len,
struct erspan_base_hdr *ershdr;
struct erspan_metadata *pkt_md;
const struct ipv6hdr *ipv6h;
+   struct erspan_md2 *md2;
struct ip6_tnl *tunnel;
u8 ver;
 
@@ -551,9 +552,10 @@ static int ip6erspan_rcv(struct sk_buff *skb, int 
gre_hdr_len,
 
info = &tun_dst->u.tun_info;
md = ip_tunnel_info_opts(info);
-
-   memcpy(md, pkt_md, sizeof(*md));
md->version = ver;
+   md2 = &md->u.md2;
+   memcpy(md2, pkt_md, ver == 1 ? ERSPAN_V1_MDSIZE :
+  ERSPAN_V2_MDSIZE);
info->key.tun_flags |= TUNNEL_ERSPAN_OPT;
info->options_len = sizeof(*md);
 
-- 
2.7.4



[PATCH net 0/3] net: erspan fixes

2018-02-05 Thread William Tu
The first patch fixes erspan metadata extraction issue from packet
header due to commit d350a823020e ("net: erspan: create erspan metadata
uapi header").  The commit moves the erspan 'version' in
'struct erspan_metadata' in front of 'struct erspan_md2' for later
extensibility, but breaks the existing metadata extraction code due
to extra 4-byte size 'version'.  The second patch fixes the case where
tunnel device receives an erspan packet with different tunnel metadata
(ex: version, index, hwid, direction), existing code overwrites the
tunnel device's erspan configuration.  The third patch fixes the bpf
tests due to the above patches.

William Tu (3):
  net: erspan: fix metadata extraction
  net: erspan: fix erspan config overwrite
  sample/bpf: fix erspan metadata

 include/net/erspan.h   | 26 +-
 net/ipv4/ip_gre.c  | 14 --
 net/ipv6/ip6_gre.c | 15 ---
 samples/bpf/tcbpf2_kern.c  | 41 -
 samples/bpf/test_tunnel_bpf.sh |  4 ++--
 5 files changed, 39 insertions(+), 61 deletions(-)

-- 
2.7.4



[PATCH net v4] cls_u32: fix use after free in u32_destroy_key()

2018-02-05 Thread Paolo Abeni
Li Shuang reported an Oops with cls_u32 due to an use-after-free
in u32_destroy_key(). The use-after-free can be triggered with:

dev=lo
tc qdisc add dev $dev root handle 1: htb default 10
tc filter add dev $dev parent 1: prio 5 handle 1: protocol ip u32 divisor 256
tc filter add dev $dev protocol ip parent 1: prio 5 u32 ht 800:: match ip dst\
 10.0.0.0/8 hashkey mask 0xff00 at 16 link 1:
tc qdisc del dev $dev root

Which causes the following kasan splat:

 ==
 BUG: KASAN: use-after-free in u32_destroy_key.constprop.21+0x117/0x140 
[cls_u32]
 Read of size 4 at addr 881b83dae618 by task kworker/u48:5/571

 CPU: 17 PID: 571 Comm: kworker/u48:5 Not tainted 4.15.0+ #87
 Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016
 Workqueue: tc_filter_workqueue u32_delete_key_freepf_work [cls_u32]
 Call Trace:
  dump_stack+0xd6/0x182
  ? dma_virt_map_sg+0x22e/0x22e
  print_address_description+0x73/0x290
  kasan_report+0x277/0x360
  ? u32_destroy_key.constprop.21+0x117/0x140 [cls_u32]
  u32_destroy_key.constprop.21+0x117/0x140 [cls_u32]
  u32_delete_key_freepf_work+0x1c/0x30 [cls_u32]
  process_one_work+0xae0/0x1c80
  ? sched_clock+0x5/0x10
  ? pwq_dec_nr_in_flight+0x3c0/0x3c0
  ? _raw_spin_unlock_irq+0x29/0x40
  ? trace_hardirqs_on_caller+0x381/0x570
  ? _raw_spin_unlock_irq+0x29/0x40
  ? finish_task_switch+0x1e5/0x760
  ? finish_task_switch+0x208/0x760
  ? preempt_notifier_dec+0x20/0x20
  ? __schedule+0x839/0x1ee0
  ? check_noncircular+0x20/0x20
  ? firmware_map_remove+0x73/0x73
  ? find_held_lock+0x39/0x1c0
  ? worker_thread+0x434/0x1820
  ? lock_contended+0xee0/0xee0
  ? lock_release+0x1100/0x1100
  ? init_rescuer.part.16+0x150/0x150
  ? retint_kernel+0x10/0x10
  worker_thread+0x216/0x1820
  ? process_one_work+0x1c80/0x1c80
  ? lock_acquire+0x1a5/0x540
  ? lock_downgrade+0x6b0/0x6b0
  ? sched_clock+0x5/0x10
  ? lock_release+0x1100/0x1100
  ? compat_start_thread+0x80/0x80
  ? do_raw_spin_trylock+0x190/0x190
  ? _raw_spin_unlock_irq+0x29/0x40
  ? trace_hardirqs_on_caller+0x381/0x570
  ? _raw_spin_unlock_irq+0x29/0x40
  ? finish_task_switch+0x1e5/0x760
  ? finish_task_switch+0x208/0x760
  ? preempt_notifier_dec+0x20/0x20
  ? __schedule+0x839/0x1ee0
  ? kmem_cache_alloc_trace+0x143/0x320
  ? firmware_map_remove+0x73/0x73
  ? sched_clock+0x5/0x10
  ? sched_clock_cpu+0x18/0x170
  ? find_held_lock+0x39/0x1c0
  ? schedule+0xf3/0x3b0
  ? lock_downgrade+0x6b0/0x6b0
  ? __schedule+0x1ee0/0x1ee0
  ? do_wait_intr_irq+0x340/0x340
  ? do_raw_spin_trylock+0x190/0x190
  ? _raw_spin_unlock_irqrestore+0x32/0x60
  ? process_one_work+0x1c80/0x1c80
  ? process_one_work+0x1c80/0x1c80
  kthread+0x312/0x3d0
  ? kthread_create_worker_on_cpu+0xc0/0xc0
  ret_from_fork+0x3a/0x50

 Allocated by task 1688:
  kasan_kmalloc+0xa0/0xd0
  __kmalloc+0x162/0x380
  u32_change+0x1220/0x3c9e [cls_u32]
  tc_ctl_tfilter+0x1ba6/0x2f80
  rtnetlink_rcv_msg+0x4f0/0x9d0
  netlink_rcv_skb+0x124/0x320
  netlink_unicast+0x430/0x600
  netlink_sendmsg+0x8fa/0xd60
  sock_sendmsg+0xb1/0xe0
  ___sys_sendmsg+0x678/0x980
  __sys_sendmsg+0xc4/0x210
  do_syscall_64+0x232/0x7f0
  return_from_SYSCALL_64+0x0/0x75

 Freed by task 112:
  kasan_slab_free+0x71/0xc0
  kfree+0x114/0x320
  rcu_process_callbacks+0xc3f/0x1600
  __do_softirq+0x2bf/0xc06

 The buggy address belongs to the object at 881b83dae600
  which belongs to the cache kmalloc-4096 of size 4096
 The buggy address is located 24 bytes inside of
  4096-byte region [881b83dae600, 881b83daf600)
 The buggy address belongs to the page:
 page:ea006e0f6a00 count:1 mapcount:0 mapping:  (null) index:0x0 
compound_mapcount: 0
 flags: 0x17c0008100(slab|head)
 raw: 0017c0008100   000100070007
 raw: dead0100 dead0200 880187c0e600 
 page dumped because: kasan: bad access detected

 Memory state around the buggy address:
  881b83dae500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  881b83dae580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 >881b83dae600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ^
  881b83dae680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  881b83dae700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ==

The problem is that the htnode is freed before the linked knodes and the
latter will try to access the first at u32_destroy_key() time.
This change addresses the issue using the htnode refcnt to guarantee
the correct free order. While at it also add a RCU annotation,
to keep sparse happy.

v1 -> v2: use rtnl_derefence() instead of RCU read locks
v2 -> v3:
  - don't check refcnt in u32_destroy_hnode()
  - cleaned-up u32_destroy() implementation
  - cleaned-up code comment
v3 -> v4:
  - dropped unneeded comment

Reported-by: Li Shuang 
Fixes: c0d378ef1266 ("net_sched: use tcf_queue_work() in u32 fi

Re: [NetDev-info] Distributed Switch Architecture for 88E6390

2018-02-05 Thread Florian Fainelli


On 02/05/2018 11:59 AM, S.Y. Park wrote:
> Dear Mr. Thompson,
> 
> I'm forwarding to you to the technical discussion mailing list called
> "netdev@vger.kernel.org".
> 
> i...@netdevconf.org is for discussions regarding The NetDev Society's
> NetDev Conference  attendance & participation questions & concerns,
> not technical discussion.
> 
> Good luck w/ your 88E6390 driver functionality testing contribution,
> and I'm sure someone from the netdev mailing list can address your
> question.
> 
> Sincerely,
> 
> Soyoung Park
> Executive Director
> The NetDev Society
> 
> On Mon, Feb 5, 2018 at 1:09 PM, David Thompson  
> wrote:
>> Hello,
>>
>> I am looking for contacts to anyone that is currently working on the Marvell
>> 88E6390 port within the Distributed Switch Architecture.  I am working
>> through a back port to 4.4.38 in a non-trivial integration and have found a
>> number of issues I believe are legitimate issues that I would like to
>> feedback.  I am also seeing a potential issue with outgoing TCP traffic and
>> was looking to see if this is a known issue or has been investigated prior.
>> I would be very happy to assist in testing the 88E6390 driver functionality.
>>
>> Details follow:
>>
>> For my configuration the CPU port is port 6, there is only one 88E6390 and
>> finally port 0 and the SERDES are not used.  I am also using the GPIO
>> bitbang driver for MDIO to the 88E6390 (I've confirmed that it works
>> independently).
>>
>> First, I've noticed a few interesting discrepancies in setup and the
>> datasheet for the 88E6390.  They are as follows:
>>
>> CPU Port port control configuration (port 0x6 reg 0x4) is configured 0x3107.
>> This indicates
>>
>> 15:14 SA Filter Disabled
>> 9:8 Frame Mode DSA
>> 13:12 Egress Mode - is then in an undefined state.  According the datasheet
>> it must be set to 00.
>>
>> Global1 Global Control 2 (addr 0x1b reg 0x1c) is configured 0xf000.  Bits
>> 15:14 only supports 0x0, 0x1, and 0x2 while bits 13:12 are reserved.
>> Global1 registers 0x10 -> 0x18 are written 0x, 0x, 0x, 0x,
>> 0x, 0x, 0x, and 0x.  According the 88E6390 datasheet
>> 0x10->0x18 are reserved.
>>
>>
>> When I got the back-port compiling and flashed on my target device I was
>> able to bring up individual ports (the phy would come up and auto-negotiate
>> successfully) but could not send or receive ICMP echo/reply.  Making the
>> changes associated with 1) above to be in accordance with the Marvell
>> datasheet got me to a point where I could send and receive pings.
>>
>> I now face an interesting problem, I am currently unable to setup or
>> maintain a TCP session.  Doing a wireshark capture on my host pc to the
>> target device (including the 88E6390) with the appropriate port enabled I am
>> seeing packet re-transmission issues from the target device.  For example,
>> if I try to initiate from my host pc I will send a SYN packet and receive
>> multiple identical  SYN,ACK packets returned or if I initiate from the
>> target device I see multiple identical SYN packets sent.  I was wondering if
>> you had ever seen this anything like this issue and if so had any
>> suggestions with regards to what might be the cause?
>>
>> Much appreciated,
>>
>> Dave Thompson
>>
>> David Thompson BSc, MSc
>> Chief Systems Architect
>> Miovision
>> dthomp...@miovision.com
>> 519-513-2407 x 225
>> 877-646-8476 (toll-free)
>> 519-590-5458 (mobile)
>> Web | Blog  |  LinkedIn  |  Twitter  |  Facebook
>>
>> We’re on to bigger and better things (and spaces)! Please note that
>> Miovision’s head office has moved to 137 Glasgow Street, Suite 110,
>> Kitchener ON, N2G 4X8.
>>
>> Miovision | 137 Glasgow Street, Suite 110, Kitchener ON | N2G 4X8
>>
>> This e-mail may contain information that is privileged or confidential. If
>> you are not the intended recipient, please delete the e-mail and any
>> attachments and notify us immediately. Please advise if you require
>> reasonable accommodation or assistance.
>>
>> ___
>> info mailing list
>> i...@lists.netdevconf.org
>> http://lists.netdevconf.org/cgi-bin/mailman/listinfo/info
>>

-- 
Florian


[PATCH 4/4] net: amd-xgbe: fix comparison to bitshift when dealing with a mask

2018-02-05 Thread Wolfram Sang
Due to a typo, the mask was destroyed by a comparison instead of a bit
shift.

Signed-off-by: Wolfram Sang 
---
Only build tested. To be applied individually per subsystem.

 drivers/net/ethernet/amd/xgbe/xgbe-drv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c 
b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
index 7a3ebfd236f5eb..100adee778dfd6 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
@@ -595,7 +595,7 @@ static void xgbe_isr_task(unsigned long data)
 
reissue_mask = 1 << 0;
if (!pdata->per_channel_irq)
-   reissue_mask |= 0x < 4;
+   reissue_mask |= 0x << 4;
 
XP_IOWRITE(pdata, XP_INT_REISSUE_EN, reissue_mask);
}
-- 
2.11.0



[PATCH 0/4] tree-wide: fix comparison to bitshift when dealing with a mask

2018-02-05 Thread Wolfram Sang
In one Renesas driver, I found a typo which turned an intended bit shift ('<<')
into a comparison ('<'). Because this is a subtle issue, I looked tree wide for
similar patterns. This small patch series is the outcome.

Buildbot and checkpatch are happy. Only compile-tested. To be applied
individually per sub-system, I think. I'd think only the net: amd: patch needs
to be conisdered for stable, but I leave this to people who actually know this
driver.

CCing Dan. Maybe he has an idea how to add a test to smatch? In my setup, only
cppcheck reported a 'coding style' issue with a low prio.

Wolfram Sang (4):
  v4l: vsp1: fix mask creation for MULT_ALPHA_RATIO
  drm/exynos: fix comparison to bitshift when dealing with a mask
  v4l: dvb-frontends: stb0899: fix comparison to bitshift when dealing
with a mask
  net: amd-xgbe: fix comparison to bitshift when dealing with a mask

 drivers/gpu/drm/exynos/regs-fimc.h| 2 +-
 drivers/media/dvb-frontends/stb0899_reg.h | 8 
 drivers/media/platform/vsp1/vsp1_regs.h   | 2 +-
 drivers/net/ethernet/amd/xgbe/xgbe-drv.c  | 2 +-
 4 files changed, 7 insertions(+), 7 deletions(-)

-- 
2.11.0



Re: [NetDev-info] Distributed Switch Architecture for 88E6390

2018-02-05 Thread S.Y. Park
Dear Mr. Thompson,

I'm forwarding to you to the technical discussion mailing list called
"netdev@vger.kernel.org".

i...@netdevconf.org is for discussions regarding The NetDev Society's
NetDev Conference  attendance & participation questions & concerns,
not technical discussion.

Good luck w/ your 88E6390 driver functionality testing contribution,
and I'm sure someone from the netdev mailing list can address your
question.

Sincerely,

Soyoung Park
Executive Director
The NetDev Society

On Mon, Feb 5, 2018 at 1:09 PM, David Thompson  wrote:
> Hello,
>
> I am looking for contacts to anyone that is currently working on the Marvell
> 88E6390 port within the Distributed Switch Architecture.  I am working
> through a back port to 4.4.38 in a non-trivial integration and have found a
> number of issues I believe are legitimate issues that I would like to
> feedback.  I am also seeing a potential issue with outgoing TCP traffic and
> was looking to see if this is a known issue or has been investigated prior.
> I would be very happy to assist in testing the 88E6390 driver functionality.
>
> Details follow:
>
> For my configuration the CPU port is port 6, there is only one 88E6390 and
> finally port 0 and the SERDES are not used.  I am also using the GPIO
> bitbang driver for MDIO to the 88E6390 (I've confirmed that it works
> independently).
>
> First, I've noticed a few interesting discrepancies in setup and the
> datasheet for the 88E6390.  They are as follows:
>
> CPU Port port control configuration (port 0x6 reg 0x4) is configured 0x3107.
> This indicates
>
> 15:14 SA Filter Disabled
> 9:8 Frame Mode DSA
> 13:12 Egress Mode - is then in an undefined state.  According the datasheet
> it must be set to 00.
>
> Global1 Global Control 2 (addr 0x1b reg 0x1c) is configured 0xf000.  Bits
> 15:14 only supports 0x0, 0x1, and 0x2 while bits 13:12 are reserved.
> Global1 registers 0x10 -> 0x18 are written 0x, 0x, 0x, 0x,
> 0x, 0x, 0x, and 0x.  According the 88E6390 datasheet
> 0x10->0x18 are reserved.
>
>
> When I got the back-port compiling and flashed on my target device I was
> able to bring up individual ports (the phy would come up and auto-negotiate
> successfully) but could not send or receive ICMP echo/reply.  Making the
> changes associated with 1) above to be in accordance with the Marvell
> datasheet got me to a point where I could send and receive pings.
>
> I now face an interesting problem, I am currently unable to setup or
> maintain a TCP session.  Doing a wireshark capture on my host pc to the
> target device (including the 88E6390) with the appropriate port enabled I am
> seeing packet re-transmission issues from the target device.  For example,
> if I try to initiate from my host pc I will send a SYN packet and receive
> multiple identical  SYN,ACK packets returned or if I initiate from the
> target device I see multiple identical SYN packets sent.  I was wondering if
> you had ever seen this anything like this issue and if so had any
> suggestions with regards to what might be the cause?
>
> Much appreciated,
>
> Dave Thompson
>
> David Thompson BSc, MSc
> Chief Systems Architect
> Miovision
> dthomp...@miovision.com
> 519-513-2407 x 225
> 877-646-8476 (toll-free)
> 519-590-5458 (mobile)
> Web | Blog  |  LinkedIn  |  Twitter  |  Facebook
>
> We’re on to bigger and better things (and spaces)! Please note that
> Miovision’s head office has moved to 137 Glasgow Street, Suite 110,
> Kitchener ON, N2G 4X8.
>
> Miovision | 137 Glasgow Street, Suite 110, Kitchener ON | N2G 4X8
>
> This e-mail may contain information that is privileged or confidential. If
> you are not the intended recipient, please delete the e-mail and any
> attachments and notify us immediately. Please advise if you require
> reasonable accommodation or assistance.
>
> ___
> info mailing list
> i...@lists.netdevconf.org
> http://lists.netdevconf.org/cgi-bin/mailman/listinfo/info
>


[PATCH iproute2-next 9/9] ipaddress: Make print_linkinfo_brief() static

2018-02-05 Thread Serhey Popovych
It shares lot of code with print_linkinfo(): drop duplicated part,
change parameters list, make it static and call from print_linkinfo()
after common path.

While there move SPRINT_BUF() to the function scope from blocks to
avoid duplication and use "%s" to print "\n" to help compiler optimize
exit for both print_linkinfo_brief() and normal paths.

Signed-off-by: Serhey Popovych 
---
 ip/ip_common.h |2 --
 ip/ipaddress.c |   76 
 ip/iplink.c|5 +---
 3 files changed, 11 insertions(+), 72 deletions(-)

diff --git a/ip/ip_common.h b/ip/ip_common.h
index 1397d99..e4e628b 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -29,8 +29,6 @@ struct link_filter {
 int get_operstate(const char *name);
 int print_linkinfo(const struct sockaddr_nl *who,
   struct nlmsghdr *n, void *arg);
-int print_linkinfo_brief(const struct sockaddr_nl *who,
-struct nlmsghdr *n, void *arg);
 int print_addrinfo(const struct sockaddr_nl *who,
   struct nlmsghdr *n, void *arg);
 int print_addrlabel(const struct sockaddr_nl *who,
diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 65c2559..417b899 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -917,63 +917,12 @@ static void print_link_stats(FILE *fp, struct nlmsghdr *n)
fprintf(fp, "%s", _SL_);
 }
 
-int print_linkinfo_brief(const struct sockaddr_nl *who,
-struct nlmsghdr *n, void *arg)
+static int print_linkinfo_brief(FILE *fp, const char *name,
+   const struct ifinfomsg *ifi,
+   struct rtattr *tb[])
 {
-   FILE *fp = (FILE *)arg;
-   struct ifinfomsg *ifi = NLMSG_DATA(n);
-   struct rtattr *tb[IFLA_MAX+1];
-   int len = n->nlmsg_len;
-   const char *name;
unsigned int m_flag = 0;
 
-   if (n->nlmsg_type != RTM_NEWLINK && n->nlmsg_type != RTM_DELLINK)
-   return -1;
-
-   len -= NLMSG_LENGTH(sizeof(*ifi));
-   if (len < 0)
-   return -1;
-
-   if (filter.ifindex && ifi->ifi_index != filter.ifindex)
-   return -1;
-   if (filter.up && !(ifi->ifi_flags&IFF_UP))
-   return -1;
-
-   parse_rtattr(tb, IFLA_MAX, IFLA_RTA(ifi), len);
-
-   name = get_ifname_rta(ifi->ifi_index, tb[IFLA_IFNAME]);
-   if (!name)
-   return -1;
-
-   if (filter.label &&
-   (!filter.family || filter.family == AF_PACKET) &&
-   fnmatch(filter.label, name, 0))
-   return -1;
-
-   if (tb[IFLA_GROUP]) {
-   int group = rta_getattr_u32(tb[IFLA_GROUP]);
-
-   if (filter.group != -1 && group != filter.group)
-   return -1;
-   }
-
-   if (tb[IFLA_MASTER]) {
-   int master = rta_getattr_u32(tb[IFLA_MASTER]);
-
-   if (filter.master > 0 && master != filter.master)
-   return -1;
-   } else if (filter.master > 0)
-   return -1;
-
-   if (filter.kind && match_link_kind(tb, filter.kind, 0))
-   return -1;
-
-   if (filter.slave_kind && match_link_kind(tb, filter.slave_kind, 1))
-   return -1;
-
-   if (n->nlmsg_type == RTM_DELLINK)
-   print_bool(PRINT_ANY, "deleted", "Deleted ", true);
-
m_flag = print_name_and_link("%-16s ", COLOR_NONE, name, tb);
 
if (tb[IFLA_OPERSTATE])
@@ -1033,6 +982,7 @@ int print_linkinfo(const struct sockaddr_nl *who,
int len = n->nlmsg_len;
const char *name;
unsigned int m_flag = 0;
+   SPRINT_BUF(b1);
 
if (n->nlmsg_type != RTM_NEWLINK && n->nlmsg_type != RTM_DELLINK)
return 0;
@@ -1081,6 +1031,9 @@ int print_linkinfo(const struct sockaddr_nl *who,
if (n->nlmsg_type == RTM_DELLINK)
print_bool(PRINT_ANY, "deleted", "Deleted ", true);
 
+   if (brief)
+   return print_linkinfo_brief(fp, name, ifi, tb);
+
print_int(PRINT_ANY, "ifindex", "%d: ", ifi->ifi_index);
 
m_flag = print_name_and_link("%s: ", COLOR_IFNAME, name, tb);
@@ -1113,7 +1066,6 @@ int print_linkinfo(const struct sockaddr_nl *who,
print_linkmode(fp, tb[IFLA_LINKMODE]);
 
if (tb[IFLA_GROUP]) {
-   SPRINT_BUF(b1);
int group = rta_getattr_u32(tb[IFLA_GROUP]);
 
print_string(PRINT_ANY,
@@ -1129,8 +1081,6 @@ int print_linkinfo(const struct sockaddr_nl *who,
print_link_event(fp, rta_getattr_u32(tb[IFLA_EVENT]));
 
if (!filter.family || filter.family == AF_PACKET || show_details) {
-   SPRINT_BUF(b1);
-
print_string(PRINT_FP, NULL, "%s", _SL_);
print_string(PRINT_ANY,
 "link_type",
@@ -1230,7 +1180,6 @@ int print_linkinfo(const struct sockaddr_nl *who,
 rta_getattr_str(tb[IFLA_PHYS_PORT_NAME]));
 
   

[PATCH iproute2-next 8/9] utils: Introduce and use print_name_and_link() to print name@link

2018-02-05 Thread Serhey Popovych
There is at least three places implementing same things: two in
ipaddress.c print_linkinfo() & print_linkinfo_brief() and one in
bridge/link.c.

They are diverge from each other very little: bridge/link.c does not
support JSON output at the moment and print_linkinfo_brief() does not
handle IFLA_LINK_NETNS case.

Introduce and use print_name_and_link() routine to handle name@link
output in all possible variations; respect IFLA_LINK_NETNS attribute to
handle case when link is in different namespace; use ll_idx_n2a() for
interface name instead of "" to share logic with other code (e.g.
ll_name_to_index() and ll_index_to_name()) supporting such template.

Signed-off-by: Serhey Popovych 
---
 bridge/link.c   |   13 +++--
 include/utils.h |4 
 ip/ipaddress.c  |   44 ++--
 lib/utils.c |   49 +
 4 files changed, 58 insertions(+), 52 deletions(-)

diff --git a/bridge/link.c b/bridge/link.c
index a11cbb1..90c9734 100644
--- a/bridge/link.c
+++ b/bridge/link.c
@@ -125,20 +125,13 @@ int print_linkinfo(const struct sockaddr_nl *who,
if (n->nlmsg_type == RTM_DELLINK)
fprintf(fp, "Deleted ");
 
-   fprintf(fp, "%d: %s ", ifi->ifi_index,
-   tb[IFLA_IFNAME] ? rta_getattr_str(tb[IFLA_IFNAME]) : "");
+   fprintf(fp, "%d: ", ifi->ifi_index);
+
+   print_name_and_link("%s: ", COLOR_NONE, name, tb);
 
if (tb[IFLA_OPERSTATE])
print_operstate(fp, rta_getattr_u8(tb[IFLA_OPERSTATE]));
 
-   if (tb[IFLA_LINK]) {
-   int iflink = rta_getattr_u32(tb[IFLA_LINK]);
-
-   fprintf(fp, "@%s: ",
-   iflink ? ll_index_to_name(iflink) : "NONE");
-   } else
-   fprintf(fp, ": ");
-
print_link_flags(fp, ifi->ifi_flags);
 
if (tb[IFLA_MTU])
diff --git a/include/utils.h b/include/utils.h
index ced5b7b..fe8f6d4 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -12,6 +12,7 @@
 #include "libnetlink.h"
 #include "ll_map.h"
 #include "rtm_map.h"
+#include "json_print.h"
 
 extern int preferred_family;
 extern int human_readable;
@@ -240,6 +241,9 @@ void print_escape_buf(const __u8 *buf, size_t len, const 
char *escape);
 int print_timestamp(FILE *fp);
 void print_nlmsg_timestamp(FILE *fp, const struct nlmsghdr *n);
 
+unsigned int print_name_and_link(const char *fmt, enum color_attr color,
+const char *name, struct rtattr *tb[]);
+
 #define BIT(nr) (1UL << (nr))
 
 #define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 458d47c..65c2559 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -925,7 +925,6 @@ int print_linkinfo_brief(const struct sockaddr_nl *who,
struct rtattr *tb[IFLA_MAX+1];
int len = n->nlmsg_len;
const char *name;
-   char buf[32] = { 0, };
unsigned int m_flag = 0;
 
if (n->nlmsg_type != RTM_NEWLINK && n->nlmsg_type != RTM_DELLINK)
@@ -975,25 +974,7 @@ int print_linkinfo_brief(const struct sockaddr_nl *who,
if (n->nlmsg_type == RTM_DELLINK)
print_bool(PRINT_ANY, "deleted", "Deleted ", true);
 
-   if (tb[IFLA_LINK]) {
-   int iflink = rta_getattr_u32(tb[IFLA_LINK]);
-
-   if (iflink == 0) {
-   snprintf(buf, sizeof(buf), "%s@NONE", name);
-   print_null(PRINT_JSON, "link", NULL, NULL);
-   } else {
-   const char *link = ll_index_to_name(iflink);
-
-   print_string(PRINT_JSON, "link", NULL, link);
-   snprintf(buf, sizeof(buf), "%s@%s", name, link);
-   m_flag = ll_index_to_flags(iflink);
-   m_flag = !(m_flag & IFF_UP);
-   }
-   } else
-   snprintf(buf, sizeof(buf), "%s", name);
-
-   print_string(PRINT_FP, NULL, "%-16s ", buf);
-   print_string(PRINT_JSON, "ifname", NULL, name);
+   m_flag = print_name_and_link("%-16s ", COLOR_NONE, name, tb);
 
if (tb[IFLA_OPERSTATE])
print_operstate(fp, rta_getattr_u8(tb[IFLA_OPERSTATE]));
@@ -1101,29 +1082,8 @@ int print_linkinfo(const struct sockaddr_nl *who,
print_bool(PRINT_ANY, "deleted", "Deleted ", true);
 
print_int(PRINT_ANY, "ifindex", "%d: ", ifi->ifi_index);
-   print_color_string(PRINT_ANY, COLOR_IFNAME, "ifname", "%s", name);
-
-   if (tb[IFLA_LINK]) {
-   int iflink = rta_getattr_u32(tb[IFLA_LINK]);
 
-   if (iflink == 0)
-   print_null(PRINT_ANY, "link", "@%s: ", "NONE");
-   else {
-   if (tb[IFLA_LINK_NETNSID])
-   print_int(PRINT_ANY,
- "link_index", "@if%d: ", iflink);
-   else {
-   print_string(PRI

[PATCH iproute2-next 7/9] utils: Introduce and use get_ifname_rta()

2018-02-05 Thread Serhey Popovych
Be consistent in handling of IFLA_IFNAME attribute in all places: if
there is no attribute report bug to stderr and use ll_idx_n2a() as
last measure to get name in "if%u" format instead of "".

Use check_ifname() to validate network device name: this catches both
unexpected return from kernel and ll_idx_n2a().

Signed-off-by: Serhey Popovych 
---
 bridge/link.c   |8 
 include/utils.h |1 +
 ip/ipaddress.c  |   20 
 lib/utils.c |   19 +++
 4 files changed, 32 insertions(+), 16 deletions(-)

diff --git a/bridge/link.c b/bridge/link.c
index 870ebe0..a11cbb1 100644
--- a/bridge/link.c
+++ b/bridge/link.c
@@ -99,9 +99,10 @@ int print_linkinfo(const struct sockaddr_nl *who,
   struct nlmsghdr *n, void *arg)
 {
FILE *fp = arg;
-   int len = n->nlmsg_len;
struct ifinfomsg *ifi = NLMSG_DATA(n);
struct rtattr *tb[IFLA_MAX+1];
+   int len = n->nlmsg_len;
+   const char *name;
 
len -= NLMSG_LENGTH(sizeof(*ifi));
if (len < 0) {
@@ -117,10 +118,9 @@ int print_linkinfo(const struct sockaddr_nl *who,
 
parse_rtattr_flags(tb, IFLA_MAX, IFLA_RTA(ifi), len, NLA_F_NESTED);
 
-   if (tb[IFLA_IFNAME] == NULL) {
-   fprintf(stderr, "BUG: nil ifname\n");
+   name = get_ifname_rta(ifi->ifi_index, tb[IFLA_IFNAME]);
+   if (!name)
return -1;
-   }
 
if (n->nlmsg_type == RTM_DELLINK)
fprintf(fp, "Deleted ");
diff --git a/include/utils.h b/include/utils.h
index f81928a..ced5b7b 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -173,6 +173,7 @@ void duparg(const char *, const char *) 
__attribute__((noreturn));
 void duparg2(const char *, const char *) __attribute__((noreturn));
 int check_ifname(const char *);
 int get_ifname(char *, const char *);
+const char *get_ifname_rta(int ifindex, const struct rtattr *rta);
 int matches(const char *arg, const char *pattern);
 int inet_addr_match(const inet_prefix *a, const inet_prefix *b, int bits);
 int inet_addr_match_rta(const inet_prefix *m, const struct rtattr *rta);
diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 89e3cc4..458d47c 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -941,12 +941,10 @@ int print_linkinfo_brief(const struct sockaddr_nl *who,
return -1;
 
parse_rtattr(tb, IFLA_MAX, IFLA_RTA(ifi), len);
-   if (tb[IFLA_IFNAME] == NULL) {
-   fprintf(stderr, "BUG: device with ifindex %d has nil ifname\n", 
ifi->ifi_index);
-   name = ll_idx_n2a(ifi->ifi_index);
-   } else {
-   name = rta_getattr_str(tb[IFLA_IFNAME]);
-   }
+
+   name = get_ifname_rta(ifi->ifi_index, tb[IFLA_IFNAME]);
+   if (!name)
+   return -1;
 
if (filter.label &&
(!filter.family || filter.family == AF_PACKET) &&
@@ -1068,12 +1066,10 @@ int print_linkinfo(const struct sockaddr_nl *who,
return -1;
 
parse_rtattr(tb, IFLA_MAX, IFLA_RTA(ifi), len);
-   if (tb[IFLA_IFNAME] == NULL) {
-   fprintf(stderr, "BUG: device with ifindex %d has nil ifname\n", 
ifi->ifi_index);
-   name = ll_idx_n2a(ifi->ifi_index);
-   } else {
-   name = rta_getattr_str(tb[IFLA_IFNAME]);
-   }
+
+   name = get_ifname_rta(ifi->ifi_index, tb[IFLA_IFNAME]);
+   if (!name)
+   return -1;
 
if (filter.label &&
(!filter.family || filter.family == AF_PACKET) &&
diff --git a/lib/utils.c b/lib/utils.c
index 8e15625..22fe637 100644
--- a/lib/utils.c
+++ b/lib/utils.c
@@ -871,6 +871,25 @@ int get_ifname(char *buf, const char *name)
return ret;
 }
 
+const char *get_ifname_rta(int ifindex, const struct rtattr *rta)
+{
+   const char *name;
+
+   if (rta) {
+   name = rta_getattr_str(rta);
+   } else {
+   fprintf(stderr,
+   "BUG: device with ifindex %d has nil ifname\n",
+   ifindex);
+   name = ll_idx_n2a(ifindex);
+   }
+
+   if (check_ifname(name))
+   return NULL;
+
+   return name;
+}
+
 int matches(const char *cmd, const char *pattern)
 {
int len = strlen(cmd);
-- 
1.7.10.4



[PATCH iproute2-next 4/9] ipaddress: Improve print_linkinfo()

2018-02-05 Thread Serhey Popovych
There are few places to improve:

  1) return -1 when entry is filtered instead of zero, which means
 accept entry: ipaddress_list_flush_or_save() the only user of this

  2) use ll_idx_n2a() as last resort to translate name to index for
 "should never happen" cases when cache shouldn't be considered

  3) replace open coded access to IFLA_IFNAME attribute data by
 RTA_DATA() with rta_getattr_str()

  4) simplify ifname printing since name is never NULL, thanks to (2).

Signed-off-by: Serhey Popovych 
---
 ip/ipaddress.c |   30 +-
 1 file changed, 13 insertions(+), 17 deletions(-)

diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 366a304..02197e2 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -943,14 +943,14 @@ int print_linkinfo_brief(const struct sockaddr_nl *who,
parse_rtattr(tb, IFLA_MAX, IFLA_RTA(ifi), len);
if (tb[IFLA_IFNAME] == NULL) {
fprintf(stderr, "BUG: device with ifindex %d has nil ifname\n", 
ifi->ifi_index);
-   name = "";
+   name = ll_idx_n2a(ifi->ifi_index);
} else {
name = rta_getattr_str(tb[IFLA_IFNAME]);
}
 
if (filter.label &&
(!filter.family || filter.family == AF_PACKET) &&
-   fnmatch(filter.label, RTA_DATA(tb[IFLA_IFNAME]), 0))
+   fnmatch(filter.label, name, 0))
return -1;
 
if (tb[IFLA_GROUP]) {
@@ -1052,6 +1052,7 @@ int print_linkinfo(const struct sockaddr_nl *who,
struct ifinfomsg *ifi = NLMSG_DATA(n);
struct rtattr *tb[IFLA_MAX+1];
int len = n->nlmsg_len;
+   const char *name;
unsigned int m_flag = 0;
 
if (n->nlmsg_type != RTM_NEWLINK && n->nlmsg_type != RTM_DELLINK)
@@ -1062,18 +1063,22 @@ int print_linkinfo(const struct sockaddr_nl *who,
return -1;
 
if (filter.ifindex && ifi->ifi_index != filter.ifindex)
-   return 0;
+   return -1;
if (filter.up && !(ifi->ifi_flags&IFF_UP))
-   return 0;
+   return -1;
 
parse_rtattr(tb, IFLA_MAX, IFLA_RTA(ifi), len);
-   if (tb[IFLA_IFNAME] == NULL)
+   if (tb[IFLA_IFNAME] == NULL) {
fprintf(stderr, "BUG: device with ifindex %d has nil ifname\n", 
ifi->ifi_index);
+   name = ll_idx_n2a(ifi->ifi_index);
+   } else {
+   name = rta_getattr_str(tb[IFLA_IFNAME]);
+   }
 
if (filter.label &&
(!filter.family || filter.family == AF_PACKET) &&
-   fnmatch(filter.label, RTA_DATA(tb[IFLA_IFNAME]), 0))
-   return 0;
+   fnmatch(filter.label, name, 0))
+   return -1;
 
if (tb[IFLA_GROUP]) {
int group = rta_getattr_u32(tb[IFLA_GROUP]);
@@ -1100,16 +1105,7 @@ int print_linkinfo(const struct sockaddr_nl *who,
print_bool(PRINT_ANY, "deleted", "Deleted ", true);
 
print_int(PRINT_ANY, "ifindex", "%d: ", ifi->ifi_index);
-   if (tb[IFLA_IFNAME]) {
-   print_color_string(PRINT_ANY,
-  COLOR_IFNAME,
-  "ifname", "%s",
-  rta_getattr_str(tb[IFLA_IFNAME]));
-   } else {
-   print_null(PRINT_JSON, "ifname", NULL, NULL);
-   print_color_null(PRINT_FP, COLOR_IFNAME,
-"ifname", "%s", "");
-   }
+   print_color_string(PRINT_ANY, COLOR_IFNAME, "ifname", "%s", name);
 
if (tb[IFLA_LINK]) {
int iflink = rta_getattr_u32(tb[IFLA_LINK]);
-- 
1.7.10.4



[PATCH iproute2-next 1/9] ipaddress: Abstract IFA_LABEL matching code

2018-02-05 Thread Serhey Popovych
There at least two places in ip/ipaddress.c where we match IFA_LABEL
against filter.label if that is given.

Get rid of "common" if () statement for inet_addr_match_rta() and
ifa_label_match_rta(): it is not common because first will check for
filter.pfx.family != AF_UNSPEC inside and second for filter.label being
non NULL.

This allows us to further simplify down code and prepare for
ll_idx_n2a() replacement with ll_index_to_name() without 80 columns
checkpatch notice.

Signed-off-by: Serhey Popovych 
---
 ip/ipaddress.c |   57 +++-
 1 file changed, 27 insertions(+), 30 deletions(-)

diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 4707c2b..38ef5d7 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -1470,6 +1470,22 @@ static int get_filter(const char *arg)
return 0;
 }
 
+static int ifa_label_match_rta(int ifindex, const struct rtattr *rta)
+{
+   const char *label;
+   SPRINT_BUF(b1);
+
+   if (!filter.label)
+   return 0;
+
+   if (rta)
+   label = RTA_DATA(rta);
+   else
+   label = ll_idx_n2a(ifindex, b1);
+
+   return fnmatch(filter.label, label, 0);
+}
+
 int print_addrinfo(const struct sockaddr_nl *who, struct nlmsghdr *n,
   void *arg)
 {
@@ -1508,21 +1524,13 @@ int print_addrinfo(const struct sockaddr_nl *who, 
struct nlmsghdr *n,
return 0;
if ((filter.flags ^ ifa_flags) & filter.flagmask)
return 0;
-   if (filter.label) {
-   SPRINT_BUF(b1);
-   const char *label;
-
-   if (rta_tb[IFA_LABEL])
-   label = RTA_DATA(rta_tb[IFA_LABEL]);
-   else
-   label = ll_idx_n2a(ifa->ifa_index, b1);
-   if (fnmatch(filter.label, label, 0) != 0)
-   return 0;
-   }
 
if (filter.family && filter.family != ifa->ifa_family)
return 0;
 
+   if (ifa_label_match_rta(ifa->ifa_index, rta_tb[IFA_LABEL]))
+   return 0;
+
if (inet_addr_match_rta(&filter.pfx, rta_tb[IFA_LOCAL]))
return 0;
 
@@ -1878,25 +1886,14 @@ static void ipaddr_filter(struct nlmsg_chain *linfo, 
struct nlmsg_chain *ainfo)
 
if ((filter.flags ^ ifa_flags) & filter.flagmask)
continue;
-   if (filter.pfx.family || filter.label) {
-   struct rtattr *rta =
-   tb[IFA_LOCAL] ? : tb[IFA_ADDRESS];
-
-   if (inet_addr_match_rta(&filter.pfx, rta))
-   continue;
-
-   if (filter.label) {
-   SPRINT_BUF(b1);
-   const char *label;
-
-   if (tb[IFA_LABEL])
-   label = RTA_DATA(tb[IFA_LABEL]);
-   else
-   label = 
ll_idx_n2a(ifa->ifa_index, b1);
-   if (fnmatch(filter.label, label, 0) != 
0)
-   continue;
-   }
-   }
+
+   if (ifa_label_match_rta(ifa->ifa_index, tb[IFA_LABEL]))
+   continue;
+
+   if (!tb[IFA_LOCAL])
+   tb[IFA_LOCAL] = tb[IFA_ADDRESS];
+   if (inet_addr_match_rta(&filter.pfx, tb[IFA_LOCAL]))
+   continue;
 
ok = 1;
break;
-- 
1.7.10.4



[PATCH iproute2-next 5/9] ipaddress: Simplify print_linkinfo_brief() and it's usage

2018-02-05 Thread Serhey Popovych
Simplify calling code in ipaddr_list_flush_or_save() by introducing
intermediate variable of @struct nlmsghdr, drop duplicated code:
print_linkinfo_brief() never returns values other than <= 0 so we can
move print_selected_addrinfo() outside of each block.

Signed-off-by: Serhey Popovych 
---
 ip/ipaddress.c |   31 ++-
 1 file changed, 14 insertions(+), 17 deletions(-)

diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 02197e2..89e3cc4 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -918,7 +918,7 @@ static void print_link_stats(FILE *fp, struct nlmsghdr *n)
 }
 
 int print_linkinfo_brief(const struct sockaddr_nl *who,
-   struct nlmsghdr *n, void *arg)
+struct nlmsghdr *n, void *arg)
 {
FILE *fp = (FILE *)arg;
struct ifinfomsg *ifi = NLMSG_DATA(n);
@@ -2177,24 +2177,21 @@ static int ipaddr_list_flush_or_save(int argc, char 
**argv, int action)
ipaddr_filter(&linfo, ainfo);
 
for (l = linfo.head; l; l = l->next) {
-   int res = 0;
-   struct ifinfomsg *ifi = NLMSG_DATA(&l->h);
+   struct nlmsghdr *n = &l->h;
+   struct ifinfomsg *ifi = NLMSG_DATA(n);
+   int res;
 
open_json_object(NULL);
-   if (brief) {
-   if (print_linkinfo_brief(NULL, &l->h, stdout) == 0)
-   if (filter.family != AF_PACKET)
-   print_selected_addrinfo(ifi,
-   ainfo->head,
-   stdout);
-   } else if (no_link ||
-  (res = print_linkinfo(NULL, &l->h, stdout)) >= 0) {
-   if (filter.family != AF_PACKET)
-   print_selected_addrinfo(ifi,
-   ainfo->head, stdout);
-   if (res > 0 && !do_link && show_stats)
-   print_link_stats(stdout, &l->h);
-   }
+   if (brief)
+   res = print_linkinfo_brief(NULL, n, stdout);
+   else if (no_link)
+   res = 0;
+   else
+   res = print_linkinfo(NULL, n, stdout);
+   if (res >= 0 && filter.family != AF_PACKET)
+   print_selected_addrinfo(ifi, ainfo->head, stdout);
+   if (res > 0 && !do_link && show_stats)
+   print_link_stats(stdout, n);
close_json_object();
}
fflush(stdout);
-- 
1.7.10.4



[PATCH iproute2-next 3/9] utils: Reimplement ll_idx_n2a() and introduce ll_idx_a2n()

2018-02-05 Thread Serhey Popovych
Now all users of ll_idx_n2a() replaced with ll_index_to_name() we can
move it's functionality to ll_index_to_name() and implement index to
name conversion using snprintf() and "if%u".

Use %u specifier in "if%..." template consistently: network device
indexes are always greather than zero.

Also introduce ll_idx_n2a() conterpart: ll_idx_a2n() that is used
to translate name of the "if%u" form to index using sscanf().

Signed-off-by: Serhey Popovych 
---
 include/ll_map.h |4 +++-
 lib/ll_map.c |   31 +--
 2 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/include/ll_map.h b/include/ll_map.h
index c8474e6..8546ff9 100644
--- a/include/ll_map.h
+++ b/include/ll_map.h
@@ -8,9 +8,11 @@ int ll_remember_index(const struct sockaddr_nl *who,
 void ll_init_map(struct rtnl_handle *rth);
 unsigned ll_name_to_index(const char *name);
 const char *ll_index_to_name(unsigned idx);
-const char *ll_idx_n2a(unsigned idx, char *buf);
 int ll_index_to_type(unsigned idx);
 int ll_index_to_flags(unsigned idx);
 unsigned namehash(const char *str);
 
+const char *ll_idx_n2a(unsigned int idx);
+unsigned int ll_idx_a2n(const char *name);
+
 #endif /* __LL_MAP_H__ */
diff --git a/lib/ll_map.c b/lib/ll_map.c
index f65614f..0afe689 100644
--- a/lib/ll_map.c
+++ b/lib/ll_map.c
@@ -136,8 +136,26 @@ int ll_remember_index(const struct sockaddr_nl *who,
return 0;
 }
 
-const char *ll_idx_n2a(unsigned idx, char *buf)
+const char *ll_idx_n2a(unsigned int idx)
 {
+   static char buf[IFNAMSIZ];
+
+   snprintf(buf, sizeof(buf), "if%u", idx);
+   return buf;
+}
+
+unsigned int ll_idx_a2n(const char *name)
+{
+   unsigned int idx;
+
+   if (sscanf(name, "if%u", &idx) != 1)
+   return 0;
+   return idx;
+}
+
+const char *ll_index_to_name(unsigned int idx)
+{
+   static char buf[IFNAMSIZ];
const struct ll_cache *im;
 
if (idx == 0)
@@ -148,18 +166,11 @@ const char *ll_idx_n2a(unsigned idx, char *buf)
return im->name;
 
if (if_indextoname(idx, buf) == NULL)
-   snprintf(buf, IFNAMSIZ, "if%d", idx);
+   snprintf(buf, IFNAMSIZ, "if%u", idx);
 
return buf;
 }
 
-const char *ll_index_to_name(unsigned idx)
-{
-   static char nbuf[IFNAMSIZ];
-
-   return ll_idx_n2a(idx, nbuf);
-}
-
 int ll_index_to_type(unsigned idx)
 {
const struct ll_cache *im;
@@ -196,7 +207,7 @@ unsigned ll_name_to_index(const char *name)
 
idx = if_nametoindex(name);
if (idx == 0)
-   sscanf(name, "if%u", &idx);
+   idx = ll_idx_a2n(name);
return idx;
 }
 
-- 
1.7.10.4



[PATCH iproute2-next 2/9] ipaddress: ll_map: Replace ll_idx_n2a() with ll_index_to_name()

2018-02-05 Thread Serhey Popovych
There is no reentrancy as well as deferred result usage for all cases
where ll_idx_n2a() being used: it is safe to use ll_index_to_name() that
internally calls ll_idx_n2a() with static buffer to hold result.

Signed-off-by: Serhey Popovych 
---
 ip/ipaddress.c |   14 +-
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 38ef5d7..366a304 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -978,14 +978,13 @@ int print_linkinfo_brief(const struct sockaddr_nl *who,
print_bool(PRINT_ANY, "deleted", "Deleted ", true);
 
if (tb[IFLA_LINK]) {
-   SPRINT_BUF(b1);
int iflink = rta_getattr_u32(tb[IFLA_LINK]);
 
if (iflink == 0) {
snprintf(buf, sizeof(buf), "%s@NONE", name);
print_null(PRINT_JSON, "link", NULL, NULL);
} else {
-   const char *link = ll_idx_n2a(iflink, b1);
+   const char *link = ll_index_to_name(iflink);
 
print_string(PRINT_JSON, "link", NULL, link);
snprintf(buf, sizeof(buf), "%s@%s", name, link);
@@ -1122,12 +1121,10 @@ int print_linkinfo(const struct sockaddr_nl *who,
print_int(PRINT_ANY,
  "link_index", "@if%d: ", iflink);
else {
-   SPRINT_BUF(b1);
-
print_string(PRINT_ANY,
 "link",
 "@%s: ",
-ll_idx_n2a(iflink, b1));
+ll_index_to_name(iflink));
m_flag = ll_index_to_flags(iflink);
m_flag = !(m_flag & IFF_UP);
}
@@ -1149,12 +1146,12 @@ int print_linkinfo(const struct sockaddr_nl *who,
 "qdisc %s ",
 rta_getattr_str(tb[IFLA_QDISC]));
if (tb[IFLA_MASTER]) {
-   SPRINT_BUF(b1);
+   int master = rta_getattr_u32(tb[IFLA_MASTER]);
 
print_string(PRINT_ANY,
 "master",
 "master %s ",
-ll_idx_n2a(rta_getattr_u32(tb[IFLA_MASTER]), b1));
+ll_index_to_name(master));
}
 
if (tb[IFLA_OPERSTATE])
@@ -1473,7 +1470,6 @@ static int get_filter(const char *arg)
 static int ifa_label_match_rta(int ifindex, const struct rtattr *rta)
 {
const char *label;
-   SPRINT_BUF(b1);
 
if (!filter.label)
return 0;
@@ -1481,7 +1477,7 @@ static int ifa_label_match_rta(int ifindex, const struct 
rtattr *rta)
if (rta)
label = RTA_DATA(rta);
else
-   label = ll_idx_n2a(ifindex, b1);
+   label = ll_index_to_name(ifindex);
 
return fnmatch(filter.label, label, 0);
 }
-- 
1.7.10.4



[PATCH iproute2-next 6/9] lib: Correct object file dependencies

2018-02-05 Thread Serhey Popovych
Neither internal libnetlink nor libgenl depends on ll_map.o: prepare for
upcoming changes that brings much more cleaner dependency between
utils.o and ll_map.o.

Signed-off-by: Serhey Popovych 
---
 lib/Makefile |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/Makefile b/lib/Makefile
index 7b34ed5..bab8cbf 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -3,11 +3,11 @@ include ../config.mk
 
 CFLAGS += -fPIC
 
-UTILOBJ = utils.o rt_names.o ll_types.o ll_proto.o ll_addr.o \
+UTILOBJ = utils.o rt_names.o ll_map.o ll_types.o ll_proto.o ll_addr.o \
inet_proto.o namespace.o json_writer.o json_print.o \
names.o color.o bpf.o exec.o fs.o
 
-NLOBJ=libgenl.o ll_map.o libnetlink.o
+NLOBJ=libgenl.o libnetlink.o
 
 all: libnetlink.a libutil.a
 
-- 
1.7.10.4



[PATCH iproute2-next 0/9] ipaddress: Make print_linkinfo_brief() static

2018-02-05 Thread Serhey Popovych
With this series I propose to make print_linkinfo_brief() static in
favor of print_linkinfo() as single point for linkinfo printing.

Changes presented with this series tested using following script:

\#!/bin/bash

iproute2_dir="$1"
iface='eth0.2'

pushd "$iproute2_dir" &>/dev/null

for i in new old; do
DIR="/tmp/$i"
mkdir -p "$DIR"

ln -snf ip.$i ip/ip

# normal
ip/ip link show  >"$DIR/ip-link-show"
ip/ip -4 addr show   >"$DIR/ip-4-addr-show"
ip/ip -6 addr show   >"$DIR/ip-6-addr-show"
ip/ip addr show dev "$iface" >"$DIR/ip-addr-show-$iface"

# brief
ip/ip -br link show  >"$DIR/ip-br-link-show"
ip/ip -br -4 addr show   >"$DIR/ip-br-4-addr-show"
ip/ip -br -6 addr show   >"$DIR/ip-br-6-addr-show"
ip/ip -br addr show dev "$iface" >"$DIR/ip-br-addr-show-$iface"
done
rm -f ip/ip

diff -urN /tmp/{old,new} |sed -n -Ee'/^(-{3}|\+{3})[[:space:]]+/!p'
rc=$?

popd &>/dev/null
exit $rc

Expected results : 
Actual results   : 

Although test coverage is far from ideal in my opinion it covers most
important aspects of the changes presented by the series.

All this work is done in prepare of iplink_get() enhancements to support
attribute parse that finally will be used to simplify ip/tunnel
RTM_GETLINK code.

As always reviews, comments, suggestions and criticism is welcome.

Thanks,
Serhii

Serhey Popovych (9):
  ipaddress: Abstract IFA_LABEL matching code
  ipaddress: ll_map: Replace ll_idx_n2a() with ll_index_to_name()
  utils: Reimplement ll_idx_n2a() and introduce ll_idx_a2n()
  ipaddress: Improve print_linkinfo()
  ipaddress: Simplify print_linkinfo_brief() and it's usage
  lib: Correct object file dependencies
  utils: Introduce and use get_ifname_rta()
  utils: Introduce and use print_name_and_link() to print name@link
  ipaddress: Make print_linkinfo_brief() static

 bridge/link.c|   21 ++---
 include/ll_map.h |4 +-
 include/utils.h  |5 ++
 ip/ip_common.h   |2 -
 ip/ipaddress.c   |  224 ++
 ip/iplink.c  |5 +-
 lib/Makefile |4 +-
 lib/ll_map.c |   31 +---
 lib/utils.c  |   68 +
 9 files changed, 162 insertions(+), 202 deletions(-)

-- 
1.7.10.4



[PATCH net v3] sctp: fix dst refcnt leak in sctp_v4_get_dst

2018-02-05 Thread Tommi Rantala
Fix dst reference count leak in sctp_v4_get_dst() introduced in commit
410f03831 ("sctp: add routing output fallback"):

When walking the address_list, successive ip_route_output_key() calls
may return the same rt->dst with the reference incremented on each call.

The code would not decrement the dst refcount when the dst pointer was
identical from the previous iteration, causing the dst refcnt leak.

Testcase:
  ip netns add TEST
  ip netns exec TEST ip link set lo up
  ip link add dummy0 type dummy
  ip link add dummy1 type dummy
  ip link add dummy2 type dummy
  ip link set dev dummy0 netns TEST
  ip link set dev dummy1 netns TEST
  ip link set dev dummy2 netns TEST
  ip netns exec TEST ip addr add 192.168.1.1/24 dev dummy0
  ip netns exec TEST ip link set dummy0 up
  ip netns exec TEST ip addr add 192.168.1.2/24 dev dummy1
  ip netns exec TEST ip link set dummy1 up
  ip netns exec TEST ip addr add 192.168.1.3/24 dev dummy2
  ip netns exec TEST ip link set dummy2 up
  ip netns exec TEST sctp_test -H 192.168.1.2 -P 20002 -h 192.168.1.1 -p 2 
-s -B 192.168.1.3
  ip netns del TEST

In 4.4 and 4.9 kernels this results to:
  [  354.179591] unregister_netdevice: waiting for lo to become free. Usage 
count = 1
  [  364.419674] unregister_netdevice: waiting for lo to become free. Usage 
count = 1
  [  374.663664] unregister_netdevice: waiting for lo to become free. Usage 
count = 1
  [  384.903717] unregister_netdevice: waiting for lo to become free. Usage 
count = 1
  [  395.143724] unregister_netdevice: waiting for lo to become free. Usage 
count = 1
  [  405.383645] unregister_netdevice: waiting for lo to become free. Usage 
count = 1
  ...

Fixes: 410f03831 ("sctp: add routing output fallback")
Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary 
addresses")
Signed-off-by: Tommi Rantala 
---

v2: unconditional dst_release() before breaking out of the loop:
 - if dst == NULL, then dst_release() is no-op.
 - if dst != &rt->dst, then drop reference to the first dst, we do not
   need it after all.
 - if dst == &rt->dst, then the refcnt was incremented twice for this
   dst, so drop the second ref.

 net/sctp/protocol.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index ba28d270eb24..aba2381d3cf9 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -509,22 +509,20 @@ static void sctp_v4_get_dst(struct sctp_transport *t, 
union sctp_addr *saddr,
if (IS_ERR(rt))
continue;
 
-   if (!dst)
-   dst = &rt->dst;
-
/* Ensure the src address belongs to the output
 * interface.
 */
odev = __ip_dev_find(sock_net(sk), laddr->a.v4.sin_addr.s_addr,
 false);
if (!odev || odev->ifindex != fl4->flowi4_oif) {
-   if (&rt->dst != dst)
+   if (!dst)
+   dst = &rt->dst;
+   else
dst_release(&rt->dst);
continue;
}
 
-   if (dst != &rt->dst)
-   dst_release(dst);
+   dst_release(dst);
dst = &rt->dst;
break;
}
-- 
2.15.1



Re: [PATCH v2 1/1] tcp: Honor the eor bit in tcp_mtu_probe

2018-02-05 Thread Eric Dumazet
On Mon, 2018-02-05 at 20:18 +0200, Ilya Lesokhin wrote:
> Avoid SKB coalescing if eor bit is set in one of the relevant
> SKBs.
> 
> Fixes: c134ecb87817 ("tcp: Make use of MSG_EOR in tcp_sendmsg")
> Signed-off-by: Ilya Lesokhin 
> ---
>  net/ipv4/tcp_output.c | 12 
>  1 file changed, 12 insertions(+)

Reviewed-by: Eric Dumazet 

Thanks Ilya



Re: [PATCH net v3] cls_u32: fix use after free in u32_destroy_key()

2018-02-05 Thread Cong Wang
On Mon, Feb 5, 2018 at 1:20 AM, Paolo Abeni  wrote:
> @@ -625,6 +627,8 @@ static int u32_destroy_hnode(struct tcf_proto *tp, struct 
> tc_u_hnode *ht,
> idr_destroy(&ht->handle_idr);
> idr_remove_ext(&tp_c->handle_idr, ht->handle);
> RCU_INIT_POINTER(*hn, ht->next);
> +
> +   /* the caller ensures ht->refcnt is 0 at this point */

This comment is not needed, because there is already a WARN_ON():

 608 static int u32_destroy_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht,
 609  struct netlink_ext_ack *extack)
 610 {
 611 struct tc_u_common *tp_c = tp->data;
 612 struct tc_u_hnode __rcu **hn;
 613 struct tc_u_hnode *phn;
 614
 615 WARN_ON(ht->refcnt);


Other than this, looks good.

Thanks.


[PATCH v2 1/1] tcp: Honor the eor bit in tcp_mtu_probe

2018-02-05 Thread Ilya Lesokhin
Avoid SKB coalescing if eor bit is set in one of the relevant
SKBs.

Fixes: c134ecb87817 ("tcp: Make use of MSG_EOR in tcp_sendmsg")
Signed-off-by: Ilya Lesokhin 
---
 net/ipv4/tcp_output.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e9f985e42405..70c5bb4958c3 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2099,6 +2099,17 @@ static int tcp_mtu_probe(struct sock *sk)
return 0;
}
 
+   len = probe_size;
+   tcp_for_write_queue_from_safe(skb, next, sk) {
+   if (len <= skb->len)
+   break;
+
+   if (unlikely(TCP_SKB_CB(skb)->eor))
+   return -1;
+
+   len -= skb->len;
+   }
+
/* We're allowed to probe.  Build it now. */
nskb = sk_stream_alloc_skb(sk, probe_size, GFP_ATOMIC, false);
if (!nskb)
@@ -2134,6 +2145,7 @@ static int tcp_mtu_probe(struct sock *sk)
/* We've eaten all the data from this skb.
 * Throw it away. */
TCP_SKB_CB(nskb)->tcp_flags |= 
TCP_SKB_CB(skb)->tcp_flags;
+   TCP_SKB_CB(nskb)->eor = TCP_SKB_CB(skb)->eor;
tcp_unlink_write_queue(skb, sk);
sk_wmem_free_skb(sk, skb);
} else {
-- 
2.15.0.317.g14c63a9



[bpf PATCH v4 3/3] bpf: sockmap, fix leaking maps with attached but not detached progs

2018-02-05 Thread John Fastabend
When a program is attached to a map we increment the program refcnt
to ensure that the program is not removed while it is potentially
being referenced from sockmap side. However, if this same program
also references the map (this is a reasonably common pattern in
my programs) then the verifier will also increment the maps refcnt
from the verifier. This is to ensure the map doesn't get garbage
collected while the program has a reference to it.

So we are left in a state where the map holds the refcnt on the
program stopping it from being removed and releasing the map refcnt.
And vice versa the program holds a refcnt on the map stopping it
from releasing the refcnt on the prog.

All this is fine as long as users detach the program while the
map fd is still around. But, if the user omits this detach command
we are left with a dangling map we can no longer release.

To resolve this when the map fd is released decrement the program
references and remove any reference from the map to the program.
This fixes the issue with possibly dangling map and creates a
user side API constraint. That is, the map fd must be held open
for programs to be attached to a map.

Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
Signed-off-by: John Fastabend 
---
 kernel/bpf/sockmap.c |   19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index bd4a6d9..48c3341 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -620,11 +620,6 @@ static void sock_map_free(struct bpf_map *map)
}
rcu_read_unlock();
 
-   if (stab->bpf_verdict)
-   bpf_prog_put(stab->bpf_verdict);
-   if (stab->bpf_parse)
-   bpf_prog_put(stab->bpf_parse);
-
sock_map_remove_complete(stab);
 }
 
@@ -900,6 +895,19 @@ static int sock_map_update_elem(struct bpf_map *map,
return err;
 }
 
+static void sock_map_release(struct bpf_map *map, struct file *map_file)
+{
+   struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
+   struct bpf_prog *orig;
+
+   orig = xchg(&stab->bpf_parse, NULL);
+   if (orig)
+   bpf_prog_put(orig);
+   orig = xchg(&stab->bpf_verdict, NULL);
+   if (orig)
+   bpf_prog_put(orig);
+}
+
 const struct bpf_map_ops sock_map_ops = {
.map_alloc = sock_map_alloc,
.map_free = sock_map_free,
@@ -907,6 +915,7 @@ static int sock_map_update_elem(struct bpf_map *map,
.map_get_next_key = sock_map_get_next_key,
.map_update_elem = sock_map_update_elem,
.map_delete_elem = sock_map_delete_elem,
+   .map_release = sock_map_release,
 };
 
 BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, bpf_sock,



[bpf PATCH v4 2/3] bpf: sockmap, add sock close() hook to remove socks

2018-02-05 Thread John Fastabend
The selftests test_maps program was leaving dangling BPF sockmap
programs around because not all psock elements were removed from
the map. The elements in turn hold a reference on the BPF program
they are attached to causing BPF programs to stay open even after
test_maps has completed.

The original intent was that sk_state_change() would be called
when TCP socks went through TCP_CLOSE state. However, because
socks may be in SOCK_DEAD state or the sock may be a listening
socket the event is not always triggered.

To resolve this use the ULP infrastructure and register our own
proto close() handler. This fixes the above case.

Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
Reported-by: Prashant Bhole 
Signed-off-by: John Fastabend 
---
 include/net/tcp.h|2 +
 kernel/bpf/sockmap.c |  168 ++
 2 files changed, 103 insertions(+), 67 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index a58292d..e3fc667 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1985,6 +1985,7 @@ static inline void tcp_listendrop(const struct sock *sk)
 
 enum {
TCP_ULP_TLS,
+   TCP_ULP_BPF,
 };
 
 struct tcp_ulp_ops {
@@ -2003,6 +2004,7 @@ struct tcp_ulp_ops {
 int tcp_register_ulp(struct tcp_ulp_ops *type);
 void tcp_unregister_ulp(struct tcp_ulp_ops *type);
 int tcp_set_ulp(struct sock *sk, const char *name);
+int tcp_set_ulp_id(struct sock *sk, const int ulp);
 void tcp_get_available_ulp(char *buf, size_t len);
 void tcp_cleanup_ulp(struct sock *sk);
 
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 0314d17..bd4a6d9 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -86,9 +86,10 @@ struct smap_psock {
struct work_struct tx_work;
struct work_struct gc_work;
 
+   struct proto *sk_proto;
+   void (*save_close)(struct sock *sk, long timeout);
void (*save_data_ready)(struct sock *sk);
void (*save_write_space)(struct sock *sk);
-   void (*save_state_change)(struct sock *sk);
 };
 
 static inline struct smap_psock *smap_psock_sk(const struct sock *sk)
@@ -96,12 +97,102 @@ static inline struct smap_psock *smap_psock_sk(const 
struct sock *sk)
return rcu_dereference_sk_user_data(sk);
 }
 
+static struct proto tcp_bpf_proto;
+static int bpf_tcp_init(struct sock *sk)
+{
+   struct smap_psock *psock;
+
+   rcu_read_lock();
+   psock = smap_psock_sk(sk);
+   if (unlikely(!psock)) {
+   rcu_read_unlock();
+   return -EINVAL;
+   }
+
+   if (unlikely(psock->sk_proto)) {
+   rcu_read_unlock();
+   return -EBUSY;
+   }
+
+   psock->save_close = sk->sk_prot->close;
+   psock->sk_proto = sk->sk_prot;
+   sk->sk_prot = &tcp_bpf_proto;
+   rcu_read_unlock();
+   return 0;
+}
+
+static void bpf_tcp_release(struct sock *sk)
+{
+   struct smap_psock *psock;
+
+   rcu_read_lock();
+   psock = smap_psock_sk(sk);
+
+   if (likely(psock)) {
+   sk->sk_prot = psock->sk_proto;
+   psock->sk_proto = NULL;
+   }
+   rcu_read_unlock();
+}
+
+static void smap_release_sock(struct smap_psock *psock, struct sock *sock);
+
+static void bpf_tcp_close(struct sock *sk, long timeout)
+{
+   void (*close_fun)(struct sock *sk, long timeout);
+   struct smap_psock_map_entry *e, *tmp;
+   struct smap_psock *psock;
+   struct sock *osk;
+
+   rcu_read_lock();
+   psock = smap_psock_sk(sk);
+   if (unlikely(!psock)) {
+   rcu_read_unlock();
+   return sk->sk_prot->close(sk, timeout);
+   }
+
+   /* The psock may be destroyed anytime after exiting the RCU critial
+* section so by the time we use close_fun the psock may no longer
+* be valid. However, bpf_tcp_close is called with the sock lock
+* held so the close hook and sk are still valid.
+*/
+   close_fun = psock->save_close;
+
+   write_lock_bh(&sk->sk_callback_lock);
+   list_for_each_entry_safe(e, tmp, &psock->maps, list) {
+   osk = cmpxchg(e->entry, sk, NULL);
+   if (osk == sk) {
+   list_del(&e->list);
+   smap_release_sock(psock, sk);
+   }
+   }
+   write_unlock_bh(&sk->sk_callback_lock);
+   rcu_read_unlock();
+   close_fun(sk, timeout);
+}
+
 enum __sk_action {
__SK_DROP = 0,
__SK_PASS,
__SK_REDIRECT,
 };
 
+static struct tcp_ulp_ops bpf_tcp_ulp_ops __read_mostly = {
+   .name   = "bpf_tcp",
+   .uid= TCP_ULP_BPF,
+   .user_visible   = false,
+   .owner  = NULL,
+   .init   = bpf_tcp_init,
+   .release= bpf_tcp_release,
+};
+
+static int bpf_tcp_ulp_register(void)
+{
+   tcp_bpf_proto = tcp_prot;
+   tcp_bpf_proto.close = bpf_tcp_close;
+   return tcp_register_ulp(&bpf_tcp_ulp_ops);
+}
+
 static 

[bpf PATCH v4 1/3] net: add a UID to use for ULP socket assignment

2018-02-05 Thread John Fastabend
Create a UID field and enum that can be used to assign ULPs to
sockets. This saves a set of string comparisons if the ULP id
is known.

For sockmap, which is added in the next patches, a ULP is used to
hook into TCP sockets close state. In this case the ULP being added
is done at map insert time and the ULP is known and done on the kernel
side. In this case the named lookup is not needed. Because we don't
want to expose psock internals to user space socket options a user
visible flag is also added. For TLS this is set for BPF it will be
cleared.

Alos remove pr_notice, user gets an error code back and should check
that rather than rely on logs.

Signed-off-by: John Fastabend 
---
 include/net/tcp.h  |6 +
 net/ipv4/tcp_ulp.c |   59 
 net/tls/tls_main.c |2 ++
 3 files changed, 62 insertions(+), 5 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 5827866..a58292d 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1983,6 +1983,10 @@ static inline void tcp_listendrop(const struct sock *sk)
 #define TCP_ULP_MAX128
 #define TCP_ULP_BUF_MAX(TCP_ULP_NAME_MAX*TCP_ULP_MAX)
 
+enum {
+   TCP_ULP_TLS,
+};
+
 struct tcp_ulp_ops {
struct list_headlist;
 
@@ -1991,7 +1995,9 @@ struct tcp_ulp_ops {
/* cleanup ulp */
void (*release)(struct sock *sk);
 
+   int uid;
charname[TCP_ULP_NAME_MAX];
+   booluser_visible;
struct module   *owner;
 };
 int tcp_register_ulp(struct tcp_ulp_ops *type);
diff --git a/net/ipv4/tcp_ulp.c b/net/ipv4/tcp_ulp.c
index 6bb9e14..622caa4 100644
--- a/net/ipv4/tcp_ulp.c
+++ b/net/ipv4/tcp_ulp.c
@@ -29,6 +29,18 @@ static struct tcp_ulp_ops *tcp_ulp_find(const char *name)
return NULL;
 }
 
+static struct tcp_ulp_ops *tcp_ulp_find_id(const int ulp)
+{
+   struct tcp_ulp_ops *e;
+
+   list_for_each_entry_rcu(e, &tcp_ulp_list, list) {
+   if (e->uid == ulp)
+   return e;
+   }
+
+   return NULL;
+}
+
 static const struct tcp_ulp_ops *__tcp_ulp_find_autoload(const char *name)
 {
const struct tcp_ulp_ops *ulp = NULL;
@@ -51,6 +63,18 @@ static const struct tcp_ulp_ops 
*__tcp_ulp_find_autoload(const char *name)
return ulp;
 }
 
+static const struct tcp_ulp_ops *__tcp_ulp_lookup(const int uid)
+{
+   const struct tcp_ulp_ops *ulp;
+
+   rcu_read_lock();
+   ulp = tcp_ulp_find_id(uid);
+   if (!ulp || !try_module_get(ulp->owner))
+   ulp = NULL;
+   rcu_read_unlock();
+   return ulp;
+}
+
 /* Attach new upper layer protocol to the list
  * of available protocols.
  */
@@ -59,13 +83,10 @@ int tcp_register_ulp(struct tcp_ulp_ops *ulp)
int ret = 0;
 
spin_lock(&tcp_ulp_list_lock);
-   if (tcp_ulp_find(ulp->name)) {
-   pr_notice("%s already registered or non-unique name\n",
- ulp->name);
+   if (tcp_ulp_find(ulp->name))
ret = -EEXIST;
-   } else {
+   else
list_add_tail_rcu(&ulp->list, &tcp_ulp_list);
-   }
spin_unlock(&tcp_ulp_list_lock);
 
return ret;
@@ -124,6 +145,34 @@ int tcp_set_ulp(struct sock *sk, const char *name)
if (!ulp_ops)
return -ENOENT;
 
+   if (!ulp_ops->user_visible) {
+   module_put(ulp_ops->owner);
+   return -ENOENT;
+   }
+
+   err = ulp_ops->init(sk);
+   if (err) {
+   module_put(ulp_ops->owner);
+   return err;
+   }
+
+   icsk->icsk_ulp_ops = ulp_ops;
+   return 0;
+}
+
+int tcp_set_ulp_id(struct sock *sk, int ulp)
+{
+   struct inet_connection_sock *icsk = inet_csk(sk);
+   const struct tcp_ulp_ops *ulp_ops;
+   int err;
+
+   if (icsk->icsk_ulp_ops)
+   return -EEXIST;
+
+   ulp_ops = __tcp_ulp_lookup(ulp);
+   if (!ulp_ops)
+   return -ENOENT;
+
err = ulp_ops->init(sk);
if (err) {
module_put(ulp_ops->owner);
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 736719c..b0d5fce 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -484,6 +484,8 @@ static int tls_init(struct sock *sk)
 
 static struct tcp_ulp_ops tcp_tls_ulp_ops __read_mostly = {
.name   = "tls",
+   .uid= TCP_ULP_TLS,
+   .user_visible   = true,
.owner  = THIS_MODULE,
.init   = tls_init,
 };



[bpf PATCH v4 0/3] bpf: sockmap fixes

2018-02-05 Thread John Fastabend
A set of fixes for sockmap to resolve programs referencing sockmaps
and closing without deleting all entries in the map and/or not detaching
BPF programs attached to the map. Both leaving entries in the map and
not detaching programs may result in the map failing to be removed by
BPF infrastructure due to reference counts never reaching zero.

For this we pull in the ULP infrastructure to hook into the close()
hook of the sock layer. This seemed natural because we have additional
sockmap features (to add support for TX hooks) that will also use the
ULP infrastructure. This allows us to cleanup entries in the map when
socks are closed() and avoid trying to get the sk_state_change() hook
to fire in all cases.

The second issue resolved here occurs when users don't detach
programs. The gist is a refcnt issue resolved by implementing the
release callback. See patch for details.

For testing I ran both sample/sockmap and selftests bpf/test_maps.c.
Dave Watson ran TLS test suite on v1 version of the patches without
the put_module error path change.

v4 fix missing rcu_unlock()
v3 wrap psock reference in RCU
v2 changes rebased onto bpf-next with small update adding module_put

---

John Fastabend (3):
  net: add a UID to use for ULP socket assignment
  bpf: sockmap, add sock close() hook to remove socks
  bpf: sockmap, fix leaking maps with attached but not detached progs


 include/net/tcp.h|8 ++
 kernel/bpf/sockmap.c |  187 +++---
 net/ipv4/tcp_ulp.c   |   59 ++--
 net/tls/tls_main.c   |2 +
 4 files changed, 179 insertions(+), 77 deletions(-)

--
Signature


[PATCH net] net: phy: Handle not having GPIO enabled in the kernel

2018-02-05 Thread Andrew Lunn
If CONFIG_GPIOLIB is disabled, fwnode_get_named_gpiod() becomes a stub
function, which return -ENOSYS. Handle this in the same way as
-ENOENT, i.e. assume there is no GPIO used to reset the PHYs.

Reported-by: Christian Zigotzky 
Tested-by: Christian Zigotzky 
Signed-off-by: Andrew Lunn 
Reviewed-by: Florian Fainelli 
Fixes: bafbdd527d56 ("phylib: Add device reset GPIO support")
---
 drivers/net/phy/mdio_bus.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
index 88272b3ac2e2..24b5511222c8 100644
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c
@@ -56,7 +56,8 @@ static int mdiobus_register_gpiod(struct mdio_device *mdiodev)
gpiod = fwnode_get_named_gpiod(&mdiodev->dev.of_node->fwnode,
   "reset-gpios", 0, GPIOD_OUT_LOW,
   "PHY reset");
-   if (PTR_ERR(gpiod) == -ENOENT)
+   if (PTR_ERR(gpiod) == -ENOENT ||
+   PTR_ERR(gpiod) == -ENOSYS)
gpiod = NULL;
else if (IS_ERR(gpiod))
return PTR_ERR(gpiod);
-- 
2.15.1



[PATCH 2/2] netfilter: nf_tables: fix flowtable resource leak

2018-02-05 Thread Felix Fietkau
When cleaning up flowtable entries, associated dst and ct entries need
to be released as well

Signed-off-by: Felix Fietkau 
---
 net/netfilter/nf_flow_table.c | 19 +--
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/net/netfilter/nf_flow_table.c b/net/netfilter/nf_flow_table.c
index 20f86091ab98..9925bd3f93c4 100644
--- a/net/netfilter/nf_flow_table.c
+++ b/net/netfilter/nf_flow_table.c
@@ -124,7 +124,7 @@ void flow_offload_free(struct flow_offload *flow)
dst_release(flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.dst_cache);
dst_release(flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.dst_cache);
e = container_of(flow, struct flow_offload_entry, flow);
-   kfree(e);
+   kfree_rcu(e, rcu_head);
 }
 EXPORT_SYMBOL_GPL(flow_offload_free);
 
@@ -161,7 +161,9 @@ void flow_offload_del(struct nf_flowtable *flow_table,
   *flow_table->type->params);
 
e = container_of(flow, struct flow_offload_entry, flow);
-   kfree_rcu(e, rcu_head);
+   nf_ct_delete(e->ct, 0, 0);
+   nf_ct_put(e->ct);
+   flow_offload_free(flow);
 }
 EXPORT_SYMBOL_GPL(flow_offload_del);
 
@@ -174,15 +176,6 @@ flow_offload_lookup(struct nf_flowtable *flow_table,
 }
 EXPORT_SYMBOL_GPL(flow_offload_lookup);
 
-static void nf_flow_release_ct(const struct flow_offload *flow)
-{
-   struct flow_offload_entry *e;
-
-   e = container_of(flow, struct flow_offload_entry, flow);
-   nf_ct_delete(e->ct, 0, 0);
-   nf_ct_put(e->ct);
-}
-
 int nf_flow_table_iterate(struct nf_flowtable *flow_table,
  void (*iter)(struct flow_offload *flow, void *data),
  void *data)
@@ -276,10 +269,8 @@ void nf_flow_offload_work_gc(struct work_struct *work)
flow = container_of(tuplehash, struct flow_offload, 
tuplehash[0]);
 
if (nf_flow_has_expired(flow) ||
-   nf_flow_is_dying(flow)) {
+   nf_flow_is_dying(flow))
flow_offload_del(flow_table, flow);
-   nf_flow_release_ct(flow);
-   }
}
 out:
rhashtable_walk_stop(&hti);
-- 
2.14.2



[PATCH 1/2] netfilter: nf_tables: fix flowtable free

2018-02-05 Thread Felix Fietkau
Every flow_offload entry is added into the table twice. Because of this,
rhashtable_free_and_destroy can't be used, since it would call kfree for
each flow_offload object twice.

Signed-off-by: Felix Fietkau 
---
 include/net/netfilter/nf_flow_table.h   |  2 ++
 net/ipv4/netfilter/nf_flow_table_ipv4.c |  1 +
 net/ipv6/netfilter/nf_flow_table_ipv6.c |  1 +
 net/netfilter/nf_flow_table.c   | 15 +++
 net/netfilter/nf_flow_table_inet.c  |  1 +
 net/netfilter/nf_tables_api.c   |  8 +---
 6 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/include/net/netfilter/nf_flow_table.h 
b/include/net/netfilter/nf_flow_table.h
index b22b22082733..67c61bda6a46 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -14,6 +14,7 @@ struct nf_flowtable_type {
struct list_headlist;
int family;
void(*gc)(struct work_struct *work);
+   void(*free)(struct nf_flowtable *table);
const struct rhashtable_params  *params;
nf_hookfn   *hook;
struct module   *owner;
@@ -95,6 +96,7 @@ struct flow_offload_tuple_rhash *flow_offload_lookup(struct 
nf_flowtable *flow_t
 int nf_flow_table_iterate(struct nf_flowtable *flow_table,
  void (*iter)(struct flow_offload *flow, void *data),
  void *data);
+void nf_flow_table_free(struct nf_flowtable *flow_table);
 void nf_flow_offload_work_gc(struct work_struct *work);
 extern const struct rhashtable_params nf_flow_offload_rhash_params;
 
diff --git a/net/ipv4/netfilter/nf_flow_table_ipv4.c 
b/net/ipv4/netfilter/nf_flow_table_ipv4.c
index b2d01eb25f2c..25d2975da156 100644
--- a/net/ipv4/netfilter/nf_flow_table_ipv4.c
+++ b/net/ipv4/netfilter/nf_flow_table_ipv4.c
@@ -260,6 +260,7 @@ static struct nf_flowtable_type flowtable_ipv4 = {
.family = NFPROTO_IPV4,
.params = &nf_flow_offload_rhash_params,
.gc = nf_flow_offload_work_gc,
+   .free   = nf_flow_table_free,
.hook   = nf_flow_offload_ip_hook,
.owner  = THIS_MODULE,
 };
diff --git a/net/ipv6/netfilter/nf_flow_table_ipv6.c 
b/net/ipv6/netfilter/nf_flow_table_ipv6.c
index fff21602875a..d346705d6ee6 100644
--- a/net/ipv6/netfilter/nf_flow_table_ipv6.c
+++ b/net/ipv6/netfilter/nf_flow_table_ipv6.c
@@ -253,6 +253,7 @@ static struct nf_flowtable_type flowtable_ipv6 = {
.family = NFPROTO_IPV6,
.params = &nf_flow_offload_rhash_params,
.gc = nf_flow_offload_work_gc,
+   .free   = nf_flow_table_free,
.hook   = nf_flow_offload_ipv6_hook,
.owner  = THIS_MODULE,
 };
diff --git a/net/netfilter/nf_flow_table.c b/net/netfilter/nf_flow_table.c
index 2f5099cb85b8..20f86091ab98 100644
--- a/net/netfilter/nf_flow_table.c
+++ b/net/netfilter/nf_flow_table.c
@@ -221,6 +221,21 @@ int nf_flow_table_iterate(struct nf_flowtable *flow_table,
 }
 EXPORT_SYMBOL_GPL(nf_flow_table_iterate);
 
+static void
+__nf_flow_offload_free(struct flow_offload *flow, void *data)
+{
+   struct nf_flowtable *flow_table = data;
+
+   flow_offload_del(flow_table, flow);
+}
+
+void nf_flow_table_free(struct nf_flowtable *flow_table)
+{
+   nf_flow_table_iterate(flow_table, __nf_flow_offload_free, flow_table);
+   rhashtable_destroy(&flow_table->rhashtable);
+}
+EXPORT_SYMBOL_GPL(nf_flow_table_free);
+
 static inline bool nf_flow_has_expired(const struct flow_offload *flow)
 {
return (__s32)(flow->timeout - (u32)jiffies) <= 0;
diff --git a/net/netfilter/nf_flow_table_inet.c 
b/net/netfilter/nf_flow_table_inet.c
index 281209aeba8f..375a1881d93d 100644
--- a/net/netfilter/nf_flow_table_inet.c
+++ b/net/netfilter/nf_flow_table_inet.c
@@ -24,6 +24,7 @@ static struct nf_flowtable_type flowtable_inet = {
.family = NFPROTO_INET,
.params = &nf_flow_offload_rhash_params,
.gc = nf_flow_offload_work_gc,
+   .free   = nf_flow_table_free,
.hook   = nf_flow_offload_inet_hook,
.owner  = THIS_MODULE,
 };
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 0791813a1e7d..a6c4747c7d5d 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -5399,17 +5399,11 @@ static void nf_tables_flowtable_notify(struct nft_ctx 
*ctx,
nfnetlink_set_err(ctx->net, ctx->portid, NFNLGRP_NFTABLES, -ENOBUFS);
 }
 
-static void nft_flowtable_destroy(void *ptr, void *arg)
-{
-   kfree(ptr);
-}
-
 static void nf_tables_flowtable_destroy(struct nft_flowtable *flowtable)
 {
cancel_delayed_work_sync(&flowtable->data.gc_work);
kfree(flowtable->name);
-   rhashtable_free_and_destroy(&flowtable->data.rhashtable,
-  

Re: [PATCH v3] RDS: IB: Fix null pointer issue

2018-02-05 Thread Doug Ledford
On Mon, 2018-02-05 at 10:27 +0800, Guanglei Li wrote:
> Scenario:
> 1. Port down and do fail over
> 2. Ap do rds_bind syscall
> 
> PID: 47039  TASK: 89887e2fe640  CPU: 47  COMMAND: "kworker/u:6"
>  #0 [898e35f159f0] machine_kexec at 8103abf9
>  #1 [898e35f15a60] crash_kexec at 810b96e3
>  #2 [898e35f15b30] oops_end at 8150f518
>  #3 [898e35f15b60] no_context at 8104854c
>  #4 [898e35f15ba0] __bad_area_nosemaphore at 81048675
>  #5 [898e35f15bf0] bad_area_nosemaphore at 810487d3
>  #6 [898e35f15c00] do_page_fault at 815120b8
>  #7 [898e35f15d10] page_fault at 8150ea95
> [exception RIP: unknown or invalid address]
> RIP:   RSP: 898e35f15dc8  RFLAGS: 00010282
> RAX: fffe  RBX: 889b77f6fc00  RCX:81c99d88
> RDX:   RSI: 896019ee08e8  RDI:889b77f6fc00
> RBP: 898e35f15df0   R8: 896019ee08c8  R9:
> R10: 0400  R11:   R12:896019ee08c0
> R13: 889b77f6fe68  R14: 81c99d80  R15: a022a1e0
> ORIG_RAX:   CS: 0010 SS: 0018
>  #8 [898e35f15dc8] cma_ndev_work_handler at a022a228 [rdma_cm]
>  #9 [898e35f15df8] process_one_work at 8108a7c6
>  #10 [898e35f15e58] worker_thread at 8108bda0
>  #11 [898e35f15ee8] kthread at 81090fe6
> 
> PID: 45659  TASK: 880d313d2500  CPU: 31  COMMAND: "oracle_45659_ap"
>  #0 [881024ccfc98] __schedule at 8150bac4
>  #1 [881024ccfd40] schedule at 8150c2cf
>  #2 [881024ccfd50] __mutex_lock_slowpath at 8150cee7
>  #3 [881024ccfdc0] mutex_lock at 8150cdeb
>  #4 [881024ccfde0] rdma_destroy_id at a022a027 [rdma_cm]
>  #5 [881024ccfe10] rds_ib_laddr_check at a0357857 [rds_rdma]
>  #6 [881024ccfe50] rds_trans_get_preferred at a0324c2a [rds]
>  #7 [881024ccfe80] rds_bind at a031d690 [rds]
>  #8 [881024ccfeb0] sys_bind at 8142a670
> 
> PID: 45659  PID: 47039
> rds_ib_laddr_check
>   /* create id_priv with a null event_handler */
>   rdma_create_id
>   rdma_bind_addr
> cma_acquire_dev
>   /* add id_priv to cma_dev->id_list */
>   cma_attach_to_dev
> cma_ndev_work_handler
>   /* event_hanlder is null */
>   id_priv->id.event_handler
> 
> Signed-off-by: Guanglei Li 
> Signed-off-by: Honglei Wang 
> Reviewed-by: Junxiao Bi 
> Reviewed-by: Yanjun Zhu 
> Reviewed-by: Leon Romanovsky 
> Acked-by: Santosh Shilimkar 

RDS patches normally go through Dave's tree, so for the RDMA aspects of
this patch:

Acked-by: Doug Ledford 

-- 
Doug Ledford 
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

signature.asc
Description: This is a digitally signed message part


Re: [PATCH v2] socket: Provide put_cmsg_whitelist() for constant size copies

2018-02-05 Thread Kees Cook
On Tue, Feb 6, 2018 at 2:03 AM, David Miller  wrote:
> From: Kees Cook 
> Date: Fri, 2 Feb 2018 02:27:49 -0800
>
>> @@ -343,6 +343,14 @@ struct ucred {
>>
>>  extern int move_addr_to_kernel(void __user *uaddr, int ulen, struct 
>> sockaddr_storage *kaddr);
>>  extern int put_cmsg(struct msghdr*, int level, int type, int len, void 
>> *data);
>> +/*
>> + * Provide a bounce buffer for copying cmsg data to userspace when the
>> + * target memory isn't already whitelisted for hardened usercopy.
>> + */
>> +#define put_cmsg_whitelist(_msg, _level, _type, _ptr) ({ \
>> + typeof(*(_ptr)) _val = *(_ptr); \
>> + put_cmsg(_msg, _level, _type, sizeof(_val), &_val); \
>> + })
>
> I understand what you are trying to achieve, but it's at a real cost
> here.  Some of these objects are structures, for example the struct
> sock_extended_err is 16 bytes.

It didn't look like put_cmsg() was on a fast path, so it seemed like a
bounce buffer was the best solution here (and it's not without
precedent).

> And now we're going to copy it twice, once into the on-stack copy,
> and then once again into the CMSG blob.
>
> Please find a way to make hardened user copy happy without adding
> new overhead.

Another idea would be breaking put_cmsg() up into a macro with helper
functions, rearrange the arguments to avoid the math, and leaving the
copy_to_user() inline to see the const-ness, but that seemed way
uglier to me.

I'll think about it some more, but I think having put_cmsg_whitelist()
called only in a few places is reasonable here.

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH V2 net-next] rds: tcp: use rds_destroy_pending() to synchronize netns/module teardown and rds connection/workq management

2018-02-05 Thread Santosh Shilimkar

On 2/3/2018 4:26 AM, Sowmini Varadhan wrote:

An rds_connection can get added during netns deletion between lines 528
and 529 of

   506 static void rds_tcp_kill_sock(struct net *net)
   :
   /* code to pull out all the rds_connections that should be destroyed */
   :
   528 spin_unlock_irq(&rds_tcp_conn_lock);
   529 list_for_each_entry_safe(tc, _tc, &tmp_list, t_tcp_node)
   530 rds_conn_destroy(tc->t_cpath->cp_conn);

Such an rds_connection would miss out the rds_conn_destroy()
loop (that cancels all pending work) and (if it was scheduled
after netns deletion) could trigger the use-after-free.

A similar race-window exists for the module unload path
in rds_tcp_exit -> rds_tcp_destroy_conns

Concurrency with netns deletion (rds_tcp_kill_sock()) must be handled
by checking check_net() before enqueuing new work or adding new
connections.

Concurrency with module-unload is handled by maintaining a module
specific flag that is set at the start of the module exit function,
and must be checked before enqueuing new work or adding new connections.

This commit refactors existing RDS_DESTROY_PENDING checks added by
commit 3db6e0d172c9 ("rds: use RCU to synchronize work-enqueue with
connection teardown") and consolidates all the concurrency checks
listed above into the function rds_destroy_pending().

Signed-off-by: Sowmini Varadhan 
---
v2: use check_net() for the netns delete case, as recommended on the list.
 refactor RDS_DESTROY_PENDING checks and consolidate into
 rds_destroy_pending()


Thanks for the update. It looks inline as per off-list chat.

Acked-by: Santosh Shilimkar 


Re: [PATCH 1/1] tcp: Honor the eor bit in tcp_mtu_probe

2018-02-05 Thread Eric Dumazet
On Mon, 2018-02-05 at 07:52 -0800, Eric Dumazet wrote:
> On Mon, 2018-02-05 at 17:11 +0200, Ilya Lesokhin wrote:
> > Avoid SKB coalescing if eor bit is set in one of the relevant
> > SKBs.
> > 
> > Fixes: c134ecb87817 ("tcp: Make use of MSG_EOR in tcp_sendmsg")
> > Signed-off-by: Ilya Lesokhin 
> > ---
> 
> Reviewed-by: Eric Dumazet 
> 
> Thanks.


I am taking this approval back.

You missed an eor propagation if it is in the last skb that is copied
to the new skb.

Something like this added to your patch :

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 
e9f985e42405a38fc95980da5debb7ac8b51fbb5..87c2ff458f7528ee3cd3e5e1154375a906c1bc67
 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2134,6 +2134,7 @@ static int tcp_mtu_probe(struct sock *sk)
    /* We've eaten all the data from this skb.
     * Throw it away. */
    TCP_SKB_CB(nskb)->tcp_flags |= 
TCP_SKB_CB(skb)->tcp_flags;
+   TCP_SKB_CB(nskb)->eor = TCP_SKB_CB(skb)->eor;
    tcp_unlink_write_queue(skb, sk);
    sk_wmem_free_skb(sk, skb);
    } else {


[PATCH, net] ibmvnic: fix empty firmware version and errors cleanup

2018-02-05 Thread Desnes Augusto Nunes do Rosario
This patch makes sure that the firmware version is never NULL. Moreover,
it also performs some cleanup on the error messages.

Fixes: a107311d7fdf ("ibmvnic: fix firmware version when no firmware level
has been provided by the VIOS server")
Signed-off-by: Desnes A. Nunes do Rosario 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 14 --
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 5caaa9033841..afaf29b201dc 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -3286,7 +3286,7 @@ static void handle_vpd_rsp(union ibmvnic_crq *crq,
   struct ibmvnic_adapter *adapter)
 {
struct device *dev = &adapter->vdev->dev;
-   unsigned char *substr = NULL, *ptr = NULL;
+   unsigned char *substr = NULL;
u8 fw_level_len = 0;
 
memset(adapter->fw_version, 0, 32);
@@ -3306,10 +3306,6 @@ static void handle_vpd_rsp(union ibmvnic_crq *crq,
substr = strnstr(adapter->vpd->buff, "RM", adapter->vpd->len);
if (!substr) {
dev_info(dev, "Warning - No FW level has been provided in the 
VPD buffer by the VIOS Server\n");
-   ptr = strncpy((char *)adapter->fw_version, "N/A",
- 3 * sizeof(char));
-   if (!ptr)
-   dev_err(dev, "Failed to inform that firmware version is 
unavailable to the adapter\n");
goto complete;
}
 
@@ -3324,16 +3320,14 @@ static void handle_vpd_rsp(union ibmvnic_crq *crq,
/* copy firmware version string from vpd into adapter */
if ((substr + 3 + fw_level_len) <
(adapter->vpd->buff + adapter->vpd->len)) {
-   ptr = strncpy((char *)adapter->fw_version,
- substr + 3, fw_level_len);
-
-   if (!ptr)
-   dev_err(dev, "Failed to isolate FW level string\n");
+   strncpy((char *)adapter->fw_version, substr + 3, fw_level_len);
} else {
dev_info(dev, "FW substr extrapolated VPD buff\n");
}
 
 complete:
+   if (adapter->fw_version[0] == '\0')
+   strncpy((char *)adapter->fw_version, "N/A", 3 * sizeof(char));
complete(&adapter->fw_done);
 }
 
-- 
2.14.3



Re: [PATCH net 1/1 v2] rtnetlink: require unique netns identifier

2018-02-05 Thread David Ahern
On 2/5/18 8:55 AM, Christian Brauner wrote:
> Since we've added support for IFLA_IF_NETNSID for RTM_{DEL,GET,SET,NEW}LINK
> it is possible for userspace to send us requests with three different
> properties to identify a target network namespace. This affects at least
> RTM_{NEW,SET}LINK. Each of them could potentially refer to a different
> network namespace which is confusing. For legacy reasons the kernel will
> pick the IFLA_NET_NS_PID property first and then look for the
> IFLA_NET_NS_FD property but there is no reason to extend this type of
> behavior to network namespace ids. The regression potential is quite
> minimal since the rtnetlink requests in question either won't allow
> IFLA_IF_NETNSID requests before 4.16 is out (RTM_{NEW,SET}LINK) or don't
> support IFLA_NET_NS_{PID,FD} (RTM_{DEL,GET}LINK) in the first place.
> 
> Signed-off-by: Christian Brauner 
> ---
> ChangeLog v1->v2:
> * return errno when the specified network namespace id is invalid
> * fill in struct netlink_ext_ack if the network namespace id is invalid
> * rename rtnl_ensure_unique_netns_attr() to rtnl_ensure_unique_netns() to
>   indicate that a request without any network namespace identifying attributes
>   is also considered valid.
> 
> ChangeLog v0->v1:
> * report a descriptive error to userspace via struct netlink_ext_ack
> * do not fail when multiple properties specifiy the same network namespace
> ---
>  net/core/rtnetlink.c | 69 
> 
>  1 file changed, 69 insertions(+)

LGTM.

Acked-by: David Ahern 


Re: [BUG iproute2] ip tuntap show

2018-02-05 Thread David Ahern
On 2/5/18 8:34 AM, Serhey Popovych wrote:
> Sorry David, but I do no see problem with this function right now:
> 
>   1. It allocates string using malloc() in asprintf()
>   2. free() it *after* fopen() to prevent memory leak if fopen() fails
>   3. or if fopen() succeeded and we fscanf() with %ms that will allcate
>  buffer we return.
> 
> I agree this is GNU extensions, but it does not look as user after free
> to me. Should I get rid of these extensions? They are not last as I can
> find.
> 

ok, I missed the 'm' in the %ms of the fscanf. Thanks for checking.


Re: regression in igb driver from commit 182785335447

2018-02-05 Thread Michal Kubecek
On Fri, Feb 02, 2018 at 12:54:27PM -0600, Aaron Sierra wrote:
> > From: "Michal Kubecek" 
> > Sent: Thursday, February 1, 2018 6:47:32 AM
> > [   13.710535] igb: Intel(R) Gigabit Ethernet Network Driver - version 
> > 5.4.0-k
> > [   13.710538] igb: Copyright (c) 2007-2014 Intel Corporation.
> > [   13.710584] igb :08:00.0: PCI->APIC IRQ transform: INT A -> IRQ 56
> > [   13.712126] igb: probe of :08:00.0 failed with error -2
> > [   13.712152] igb :08:00.1: PCI->APIC IRQ transform: INT B -> IRQ 70
> > [   13.904537] igb :08:00.1: Intel(R) Gigabit Ethernet Network 
> > Connection
> > [   13.904545] igb :08:00.1: eth0: (PCIe:2.5Gb/s:Width x4) 
> > 00:30:48:7b:5d:37
> > [   13.904547] igb :08:00.1: eth0: PBA No: Unknown
> > [   13.904556] igb :08:00.1: Using MSI-X interrupts. 4 rx queue(s), 4 tx
> > queue(s)
> > [   13.927029] igb :08:00.1 eth1: renamed from eth0
> 
> Can you share whether (and which versions of) Intel Boot Agent was enabled
> for these ports.

It certainly was, first port is used for PXE boot.

This is what I caught on serial console on boot:


Initializing Intel(R) Boot Agent GE v0.0.13
*** DEVELOPMENT BUILD - NOT FOR PRODUCTION USE!!! ***
*** DEVELOPMENT BUILD - NOT FOR PRODUCTION USE!!! ***
*** DEVELOPMENT BUILD - NOT FOR PRODUCTION USE!!! ***
PXE 2.1 Build 086 (WfM 2.0)  
   
  
Initializing Intel(R) Boot Agent GE v0.0.13
*** DEVELOPMENT BUILD - NOT FOR PRODUCTION USE!!! ***
*** DEVELOPMENT BUILD - NOT FOR PRODUCTION USE!!! ***
*** DEVELOPMENT BUILD - NOT FOR PRODUCTION USE!!! ***
PXE 2.1 Build 086 (WfM 2.0)
Press Ctrl+S to enter the Setup Menu..   


Doesn't look wery trustworthy. :-( I'll check if I can find another
machine with the same card but more reasonable Intel Boot Agent.

>  Does the behavior change if Intel Boot Agent is in the
> opposite state? Was the failing device used for network booting?

This may be a bit tricky, the machine is in a test lab and I have only
remote access to it. I'll check if I can disable the Intel Boot Agent
using the serial console.

Michal Kubecek


[PATCH net 1/1 v2] rtnetlink: require unique netns identifier

2018-02-05 Thread Christian Brauner
Since we've added support for IFLA_IF_NETNSID for RTM_{DEL,GET,SET,NEW}LINK
it is possible for userspace to send us requests with three different
properties to identify a target network namespace. This affects at least
RTM_{NEW,SET}LINK. Each of them could potentially refer to a different
network namespace which is confusing. For legacy reasons the kernel will
pick the IFLA_NET_NS_PID property first and then look for the
IFLA_NET_NS_FD property but there is no reason to extend this type of
behavior to network namespace ids. The regression potential is quite
minimal since the rtnetlink requests in question either won't allow
IFLA_IF_NETNSID requests before 4.16 is out (RTM_{NEW,SET}LINK) or don't
support IFLA_NET_NS_{PID,FD} (RTM_{DEL,GET}LINK) in the first place.

Signed-off-by: Christian Brauner 
---
ChangeLog v1->v2:
* return errno when the specified network namespace id is invalid
* fill in struct netlink_ext_ack if the network namespace id is invalid
* rename rtnl_ensure_unique_netns_attr() to rtnl_ensure_unique_netns() to
  indicate that a request without any network namespace identifying attributes
  is also considered valid.

ChangeLog v0->v1:
* report a descriptive error to userspace via struct netlink_ext_ack
* do not fail when multiple properties specifiy the same network namespace
---
 net/core/rtnetlink.c | 69 
 1 file changed, 69 insertions(+)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 56af8e41abfc..c096c4ff9a00 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1951,6 +1951,59 @@ static struct net *rtnl_link_get_net_capable(const 
struct sk_buff *skb,
return net;
 }
 
+/* Verify that rtnetlink requests supporting network namespace ids
+ * do not pass additional properties referring to different network
+ * namespaces.
+ */
+static int rtnl_ensure_unique_netns(const struct sock *sk, struct nlattr *tb[],
+   struct netlink_ext_ack *extack)
+{
+   int ret = -EINVAL;
+   struct net *net = NULL, *unique_net = NULL;
+
+   /* Requests without network namespace ids have been able to specify
+* multiple properties referring to different network namespaces so
+* don't regress them.
+*/
+   if (!tb[IFLA_IF_NETNSID])
+   return 0;
+
+   /* Caller operates on the current network namespace. */
+   if (!tb[IFLA_NET_NS_PID] && !tb[IFLA_NET_NS_FD])
+   return 0;
+
+   unique_net = get_net_ns_by_id(sock_net(sk), 
nla_get_s32(tb[IFLA_IF_NETNSID]));
+   if (!unique_net) {
+   NL_SET_ERR_MSG(extack, "invalid network namespace id");
+   return ret;
+   }
+
+   if (tb[IFLA_NET_NS_PID]) {
+   net = get_net_ns_by_pid(nla_get_u32(tb[IFLA_NET_NS_PID]));
+   if (net != unique_net)
+   goto on_error;
+   }
+
+   if (tb[IFLA_NET_NS_FD]) {
+   net = get_net_ns_by_fd(nla_get_u32(tb[IFLA_NET_NS_FD]));
+   if (net != unique_net)
+   goto on_error;
+   }
+
+   ret = 0;
+
+on_error:
+   put_net(unique_net);
+
+   if (net && !IS_ERR(net))
+   put_net(net);
+
+   if (ret != 0)
+   NL_SET_ERR_MSG(extack, "multiple network namespaces specified");
+
+   return ret;
+}
+
 static int validate_linkmsg(struct net_device *dev, struct nlattr *tb[])
 {
if (dev) {
@@ -2553,6 +2606,10 @@ static int rtnl_setlink(struct sk_buff *skb, struct 
nlmsghdr *nlh,
if (err < 0)
goto errout;
 
+   err = rtnl_ensure_unique_netns(NETLINK_CB(skb).sk, tb, extack);
+   if (err < 0)
+   goto errout;
+
if (tb[IFLA_IFNAME])
nla_strlcpy(ifname, tb[IFLA_IFNAME], IFNAMSIZ);
else
@@ -2649,6 +2706,10 @@ static int rtnl_dellink(struct sk_buff *skb, struct 
nlmsghdr *nlh,
if (err < 0)
return err;
 
+   err = rtnl_ensure_unique_netns(NETLINK_CB(skb).sk, tb, extack);
+   if (err < 0)
+   return err;
+
if (tb[IFLA_IFNAME])
nla_strlcpy(ifname, tb[IFLA_IFNAME], IFNAMSIZ);
 
@@ -2802,6 +2863,10 @@ static int rtnl_newlink(struct sk_buff *skb, struct 
nlmsghdr *nlh,
if (err < 0)
return err;
 
+   err = rtnl_ensure_unique_netns(NETLINK_CB(skb).sk, tb, extack);
+   if (err < 0)
+   return err;
+
if (tb[IFLA_IFNAME])
nla_strlcpy(ifname, tb[IFLA_IFNAME], IFNAMSIZ);
else
@@ -3045,6 +3110,10 @@ static int rtnl_getlink(struct sk_buff *skb, struct 
nlmsghdr *nlh,
if (err < 0)
return err;
 
+   err = rtnl_ensure_unique_netns(NETLINK_CB(skb).sk, tb, extack);
+   if (err < 0)
+   return err;
+
if (tb[IFLA_IF_NETNSID]) {
netnsid = nla_get_s32(tb[IFLA_IF_NETNSID]);
tgt_net = get_target_net(NETLINK_CB(skb).s

[PATCH net 0/1 v2] rtnetlink: require unique netns identifier

2018-02-05 Thread Christian Brauner
Hey,

Since we've added support for IFLA_IF_NETNSID for RTM_{DEL,GET,SET,NEW}LINK
it is possible for userspace to send us requests with three different
properties to identify a target network namespace. This affects at least
RTM_{NEW,SET}LINK. Each of them could potentially refer to a different
network namespace which is confusing. For legacy reasons the kernel will
pick the IFLA_NET_NS_PID property first and then look for the
IFLA_NET_NS_FD property but there is no reason to extend this type of
behavior to network namespace ids. The regression potential is quite
minimal since the rtnetlink requests in question either won't allow
IFLA_IF_NETNSID requests before 4.16 is out (RTM_{NEW,SET}LINK) or don't
support IFLA_NET_NS_{PID,FD} (RTM_{DEL,GET}LINK) in the first place.

We obviously cannot prevent users from passing both IFLA_NET_NS_PID and
IFLA_NET_NS_FD since we have supported this somehow for a long time. So
the check I'm proposing is to only fail when both IFLA_IF_NETNSID, and
IFLA_NET_NS_PID or IFLA_NET_NS_FD are passed and they refer to different
network namespaces.

Thanks!
Christian

ChangeLog v1->v2:
* return errno when the specified network namespace id is invalid
* fill in struct netlink_ext_ack if the network namespace id is invalid
* rename rtnl_ensure_unique_netns_attr() to rtnl_ensure_unique_netns() to
  indicate that a request without any network namespace identifying attributes
  is also considered valid.

ChangeLog v0->v1:
* report a descriptive error to userspace via struct netlink_ext_ack
* do not fail when multiple properties specifiy the same network namespace

Christian Brauner (1):
  rtnetlink: require unique netns identifier

 net/core/rtnetlink.c | 69 
 1 file changed, 69 insertions(+)

-- 
2.14.1



Re: [PATCH] sctp: fix dst refcnt leak in sctp_v6_get_dst()

2018-02-05 Thread Marcelo Ricardo Leitner
On Mon, Feb 05, 2018 at 03:10:35PM +0300, Alexey Kodanev wrote:
> When going through the bind address list in sctp_v6_get_dst() and
> the previously found address is better ('matchlen > bmatchlen'),
> the code continues to the next iteration without releasing currently
> held destination.
> 
> Fix it by releasing 'bdst' before continue to the next iteration, and
> instead of introducing one more '!IS_ERR(bdst)' check for dst_release(),
> move the already existed one right after ip6_dst_lookup_flow(), i.e. we
> shouldn't proceed further if we get an error for the route lookup.
> 
> Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary 
> addresses for ipv6")
> Signed-off-by: Alexey Kodanev 

Acked-by: Marcelo Ricardo Leitner 

> ---
>  net/sctp/ipv6.c |   10 +++---
>  1 files changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
> index 5d4c15b..e35d4f7 100644
> --- a/net/sctp/ipv6.c
> +++ b/net/sctp/ipv6.c
> @@ -326,8 +326,10 @@ static void sctp_v6_get_dst(struct sctp_transport *t, 
> union sctp_addr *saddr,
>   final_p = fl6_update_dst(fl6, rcu_dereference(np->opt), &final);
>   bdst = ip6_dst_lookup_flow(sk, fl6, final_p);
>  
> - if (!IS_ERR(bdst) &&
> - ipv6_chk_addr(dev_net(bdst->dev),
> + if (IS_ERR(bdst))
> + continue;
> +
> + if (ipv6_chk_addr(dev_net(bdst->dev),
> &laddr->a.v6.sin6_addr, bdst->dev, 1)) {
>   if (!IS_ERR_OR_NULL(dst))
>   dst_release(dst);
> @@ -336,8 +338,10 @@ static void sctp_v6_get_dst(struct sctp_transport *t, 
> union sctp_addr *saddr,
>   }
>  
>   bmatchlen = sctp_v6_addr_match_len(daddr, &laddr->a);
> - if (matchlen > bmatchlen)
> + if (matchlen > bmatchlen) {
> + dst_release(bdst);
>   continue;
> + }
>  
>   if (!IS_ERR_OR_NULL(dst))
>   dst_release(dst);
> -- 
> 1.7.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


Re: [PATCH 1/1] tcp: Honor the eor bit in tcp_mtu_probe

2018-02-05 Thread Eric Dumazet
On Mon, 2018-02-05 at 17:11 +0200, Ilya Lesokhin wrote:
> Avoid SKB coalescing if eor bit is set in one of the relevant
> SKBs.
> 
> Fixes: c134ecb87817 ("tcp: Make use of MSG_EOR in tcp_sendmsg")
> Signed-off-by: Ilya Lesokhin 
> ---

Reviewed-by: Eric Dumazet 

Thanks.



Re: [PATCH net v2] sctp: fix dst refcnt leak in sctp_v4_get_dst

2018-02-05 Thread Marcelo Ricardo Leitner
On Mon, Feb 05, 2018 at 03:33:11PM +0200, Tommi Rantala wrote:
> Fix dst reference count leak in sctp_v4_get_dst() introduced in commit
> 410f03831 ("sctp: add routing output fallback"):
> 

Patch LGTM. Will wait the respin for Acking.
Thanks.


Re: [PATCH bpf] bpf: prevent out-of-bounds speculation

2018-02-05 Thread Will Deacon
Hi all,

On Wed, Jan 10, 2018 at 07:47:33PM +, Will Deacon wrote:
> On Tue, Jan 09, 2018 at 10:21:29AM +, Will Deacon wrote:
> > On Mon, Jan 08, 2018 at 10:49:01AM -0800, Linus Torvalds wrote:
> > > In this particular case, we should be very much aware of future CPU's
> > > being more _constrained_, because CPU vendors had better start taking
> > > this thing into account.
> > > 
> > > So the masking approach is FUNDAMENTALLY SAFER than the "let's try to
> > > limit control speculation".
> > > 
> > > If somebody can point to a CPU that actually speculates across an
> > > address masking operation, I will be very surprised. And unless you
> > > can point to that, then stop trying to dismiss the masking approach.
> > 
> > Whilst I agree with your comments about future CPUs, this stuff is further
> > out of academia than you might think. We're definitely erring on the
> > belt-and-braces side of things at the moment, so let me go check what's
> > *actually* been built and I suspect we'll be able to make the masking work.
> > 
> > Stay tuned...
> 
> I can happily confirm that there aren't any (ARM architecture) CPUs where
> the masking approach is not sufficient, so there's no need to worry about
> value speculation breaking this.

Unfortunately, thanks to some internal miscommunication, my previous assertion
that no implementations of the Armv8 architecture utilise data value prediction
has turned out to be incorrect. I received confirmation last week that this has
been deployed in production silicon and has shipped widely in various products,
so the horse really has bolted and this isn't confined to academia as was
suggested previously.

We're still investigating whether this affects the mask-based mitigation used
by eBPF, but we'll definitely be adding a CSDB-based sequence to our nospec
helpers to ensure that array_index_nospec is robust for arm64: the CSDB
instruction follows the generation of the mask (see patches at [1]).

In the meantime, I wanted to correct my previous claim in case anybody else
is using that as a basis to push ahead with the bare masking approach
elsewhere for arm64.

Sorry for the confusion,

Will

[1] 
http://lists.infradead.org/pipermail/linux-arm-kernel/2018-February/557825.html


Re: [BUG iproute2] ip tuntap show

2018-02-05 Thread Serhey Popovych
David Ahern wrote:
> On 1/31/18 10:21 AM, Serhey Popovych wrote:
>> Eric Dumazet wrote:
>>> ip tuntap enumerates devices using /sys/class/net which is unusual.
>>>
>>> Should we replace this enumeration using /proc/net/dev like "ip tunnel" ?
>>>
>>> After "unshare -n" maybe mounting /sys should not be required for
>>> proper iproute2 behavior.
>>>
>>> At least ip command should adopt a common enumeration method.
>>>
>>> What do you think ?
>>
>> It seems main reason for using /sys/class/net is to get additional
>> information for netdev like "owner", "group" and "tun_flags".
>>
>> On the other hand at least iptunnel and ip6tunnel uses nearly identical
>> code to parse /proc/net/dev.
>>
>> Having single routine that reads /proc/net/dev and calls implementation
>> specific callback function with given network device name is good idea.
>>
>> I can try to prepare v1 for this, if no one objects this.
>>
> 
> pid_name function needs help too. comm is allocated via sprintf, freed,
> and used again and then returned to caller.
> 

Sorry David, but I do no see problem with this function right now:

  1. It allocates string using malloc() in asprintf()
  2. free() it *after* fopen() to prevent memory leak if fopen() fails
  3. or if fopen() succeeded and we fscanf() with %ms that will allcate
 buffer we return.

I agree this is GNU extensions, but it does not look as user after free
to me. Should I get rid of these extensions? They are not last as I can
find.

Correct if I miss something.



signature.asc
Description: OpenPGP digital signature


Re: PA Semi PWRficient Gigabit Ethernet doesn't work anymore since the first networking updates for the kernel 4.16

2018-02-05 Thread Christian Zigotzky

Yes, you can.

Christian

On 05 February 2018 at 3:29PM, Andrew Lunn wrote:

On Mon, Feb 05, 2018 at 10:38:34AM +0100, Christian Zigotzky wrote:

Hello Andrew,

Many thanks for your patch. I compiled the latest git kernel today and the
PA Semi PWRficient Gigabit Ethernet works with your patch.

Great.

Can i add a tested-by:

Thanks
Andrew





Re: [PATCH] net: phy-micrel: remove redundant initialization of pointer of_node

2018-02-05 Thread Andrew Lunn
On Mon, Feb 05, 2018 at 12:47:59PM +, Colin King wrote:
> From: Colin Ian King 
> 
> Pointer of_node is initialized with a value that is never read, of_node
> is later updated with a new value instead, hence the initialization is
> redundant and can be removed.  Also remove unused pointer dev and
> remove an empty line.
> 
> Cleans up clang warnings:
> drivers/net/phy/micrel.c:393:28: warning: Value stored to 'of_node'
> during its initialization is never read
> drivers/net/phy/micrel.c:532:28: warning: Value stored to 'of_node'
> during its initialization is never read
> 
> Signed-off-by: Colin Ian King 

Hi Colin

You probably need to repost once netdev re-opens.

Reviewed-by: Andrew Lunn 

Andrew


[PATCH 1/1] tcp: Honor the eor bit in tcp_mtu_probe

2018-02-05 Thread Ilya Lesokhin
Avoid SKB coalescing if eor bit is set in one of the relevant
SKBs.

Fixes: c134ecb87817 ("tcp: Make use of MSG_EOR in tcp_sendmsg")
Signed-off-by: Ilya Lesokhin 
---
 net/ipv4/tcp_output.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e9f985e42405..c8cc679f1779 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2099,6 +2099,17 @@ static int tcp_mtu_probe(struct sock *sk)
return 0;
}
 
+   len = probe_size;
+   tcp_for_write_queue_from_safe(skb, next, sk) {
+   if (len <= skb->len)
+   break;
+
+   if (unlikely(TCP_SKB_CB(skb)->eor))
+   return -1;
+
+   len -= skb->len;
+   }
+
/* We're allowed to probe.  Build it now. */
nskb = sk_stream_alloc_skb(sk, probe_size, GFP_ATOMIC, false);
if (!nskb)
-- 
2.15.0.317.g14c63a9



Re: [PATCH net] dwc-xlgmac: remove Jie Deng as co-maintainer

2018-02-05 Thread David Miller
From: Jie Deng 
Date: Mon, 5 Feb 2018 11:31:27 +0800

> Jose Abreu is working on this driver and I will leave Synopsys soon.
> Thus it does not seem appropriate for me to be a co-maintainer anymore.
> 
> Signed-off-by: Jie Deng 

Applied.


Re: [PATCH] doc: Change the min default value of tcp_wmem/tcp_rmem.

2018-02-05 Thread David Miller
From: Tonghao Zhang 
Date: Sun,  4 Feb 2018 18:07:10 -0800

> The SK_MEM_QUANTUM was changed from PAGE_SIZE to 4096. And the
> tcp_wmem/tcp_rmem min default values are 4096.
> 
> Fixes: bd68a2a854ad ("net: set SK_MEM_QUANTUM to 4096")
> Cc: Eric Dumazet 
> Signed-off-by: Tonghao Zhang 

Applied, thanks.


Re: [RFC PATCH 00/24] Introducing AF_XDP support

2018-02-05 Thread Björn Töpel
2018-01-31 14:53 GMT+01:00 Björn Töpel :
> From: Björn Töpel 
>
> This RFC introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and zero-copy
> semantics. Throughput improvements can be up to 20x compared to V2 and
> V3 for the micro benchmarks included. Would be great to get your
> feedback on it. Note that this is the follow up RFC to AF_PACKET V4
> from November last year. The feedback from that RFC submission and the
> presentation at NetdevConf in Seoul was to create a new address family
> instead of building on top of AF_PACKET. AF_XDP is this new address
> family.
>
> The main difference between AF_XDP and AF_PACKET V2/V3 on a descriptor
> level is that TX and RX descriptors are separated from packet
> buffers. An RX or TX descriptor points to a data buffer in a packet
> buffer area. RX and TX can share the same packet buffer so that a
> packet does not have to be copied between RX and TX. Moreover, if a
> packet needs to be kept for a while due to a possible retransmit, then
> the descriptor that points to that packet buffer can be changed to
> point to another buffer and reused right away. This again avoids
> copying data.
>
> The RX and TX descriptor rings are registered with the setsockopts
> XDP_RX_RING and XDP_TX_RING, similar to AF_PACKET. The packet buffer
> area is allocated by user space and registered with the kernel using
> the new XDP_MEM_REG setsockopt. All these three areas are shared
> between user space and kernel space. The socket is then bound with a
> bind() call to a device and a specific queue id on that device, and it
> is not until bind is completed that traffic starts to flow.
>
> An XDP program can be loaded to direct part of the traffic on that
> device and queue id to user space through a new redirect action in an
> XDP program called bpf_xdpsk_redirect that redirects a packet up to
> the socket in user space. All the other XDP actions work just as
> before. Note that the current RFC requires the user to load an XDP
> program to get any traffic to user space (for example all traffic to
> user space with the one-liner program "return
> bpf_xdpsk_redirect();"). We plan on introducing a patch that removes
> this requirement and sends all traffic from a queue to user space if
> an AF_XDP socket is bound to it.
>

We realized, a bit late maybe, that 24 patches is a bit mouthful, so
let me try to make it more palatable.

Patch 1 to 7 introduce AF_XDP socket support with copy semantics
(require no driver changes). Patch 8 adds XDP_REDIRECT support to i40e
and patch 9 is the test application.

The rest of the patches are enabling zero-copy support, and they're
messier. So, if you don't really care about zero-copy, just have a
look at 1 to 7.

We'd really appreciate your thoughts on the user space APIs (including
the bpf APIs).

For the next review, we'll keep the set smaller, and introduce many of
the i40e patches as pre-patches.


Björn


> AF_XDP can operate in three different modes: XDP_SKB, XDP_DRV, and
> XDP_DRV_ZC (shorthand for XDP_DRV with a zero-copy allocator as there
> is no specific mode called XDP_DRV_ZC). If the driver does not have
> support for XDP, or XDP_SKB is explicitly chosen when loading the XDP
> program, XDP_SKB mode is employed that uses SKBs together with the
> generic XDP support and copies out the data to user space. A fallback
> mode that works for any network device. On the other hand, if the
> driver has support for XDP (all three NDOs: ndo_bpf, ndo_xdp_xmit and
> ndo_xdp_flush), these NDOs, without any modifications, will be used by
> the AF_XDP code to provide better performance, but there is still a
> copy of the data into user space. The last mode, XDP_DRV_ZC, is XDP
> driver support with the zero-copy user space allocator that provides
> even better performance. In this mode, the networking HW (or SW driver
> if it is a virtual driver like veth) DMAs/puts packets straight into
> the packet buffer that is shared between user space and kernel
> space. The RX and TX descriptor queues of the networking HW are NOT
> shared to user space. Only the kernel can read and write these and it
> is the kernel driver's responsibility to translate these HW specific
> descriptors to the HW agnostic ones in the virtual descriptor rings
> that user space sees. This way, a malicious user space program cannot
> mess with the networking HW. This mode though requires some extensions
> to XDP.
>
> To get the XDP_DRV_ZC mode to work for RX, we chose to introduce a
> buffer pool concept so that the same XDP driver code can be used for
> buffers allocated using the page allocator (XDP_DRV), the user-space
> zero-copy allocator (XDP_DRV_ZC), or some internal driver specific
> allocator/cache/recycling mechanism. The ndo_bpf call has also been
> extended with two commands for registering and unregistering an XSK
> socket and is in the RX case mainly used to communicate some
> information about the user-space buffer pool t

Re: [PATCH v2] socket: Provide put_cmsg_whitelist() for constant size copies

2018-02-05 Thread David Miller
From: Kees Cook 
Date: Fri, 2 Feb 2018 02:27:49 -0800

> @@ -343,6 +343,14 @@ struct ucred {
>  
>  extern int move_addr_to_kernel(void __user *uaddr, int ulen, struct 
> sockaddr_storage *kaddr);
>  extern int put_cmsg(struct msghdr*, int level, int type, int len, void 
> *data);
> +/*
> + * Provide a bounce buffer for copying cmsg data to userspace when the
> + * target memory isn't already whitelisted for hardened usercopy.
> + */
> +#define put_cmsg_whitelist(_msg, _level, _type, _ptr) ({ \
> + typeof(*(_ptr)) _val = *(_ptr); \
> + put_cmsg(_msg, _level, _type, sizeof(_val), &_val); \
> + })

I understand what you are trying to achieve, but it's at a real cost
here.  Some of these objects are structures, for example the struct
sock_extended_err is 16 bytes.

And now we're going to copy it twice, once into the on-stack copy,
and then once again into the CMSG blob.

Please find a way to make hardened user copy happy without adding
new overhead.

Thank you.


Re: [PATCH net v2] sctp: fix dst refcnt leak in sctp_v4_get_dst

2018-02-05 Thread David Miller
From: Tommi Rantala 
Date: Mon,  5 Feb 2018 15:33:11 +0200

> Fixes: 410f03831 ("sctp: add routing output fallback")
> Signed-off-by: Tommi Rantala 

I would add another Fixes tag mentioning commit:

0ca50d12fe46cdf3c0dc9ec5ca98607a52afdc62
("sctp: fix src address selection if using secondary addresses")

because it is another origin for this dst leak.

Otherwise your patch looks great, please respin it with a second
Fixes: tag added for the above commit and I'll apply this.

Thanks!


Re: PA Semi PWRficient Gigabit Ethernet doesn't work anymore since the first networking updates for the kernel 4.16

2018-02-05 Thread Andrew Lunn
On Mon, Feb 05, 2018 at 10:38:34AM +0100, Christian Zigotzky wrote:
> Hello Andrew,
> 
> Many thanks for your patch. I compiled the latest git kernel today and the
> PA Semi PWRficient Gigabit Ethernet works with your patch.

Great.

Can i add a tested-by:

Thanks
Andrew


Re: possible deadlock in rtnl_lock (2)

2018-02-05 Thread Florian Westphal
syzbot  wrote:

#syz-fix: netfilter: on sockopt() acquire sock lock only in the required scope


Re: RFC(V3): Audit Kernel Container IDs

2018-02-05 Thread Simo Sorce
On Fri, 2018-02-02 at 18:24 -0500, Paul Moore wrote:
> On Fri, Feb 2, 2018 at 5:19 PM, Simo Sorce  wrote:
> > On Fri, 2018-02-02 at 16:24 -0500, Paul Moore wrote:
> > > On Wed, Jan 10, 2018 at 2:00 AM, Richard Guy Briggs  
> > > wrote:
> > > > On 2018-01-09 11:18, Simo Sorce wrote:
> > > > > On Tue, 2018-01-09 at 07:16 -0500, Richard Guy Briggs wrote:
> 
> ...
> 
> > > > Paul, can you justify this somewhat larger inconvenience for some
> > > > relatively minor convenience on our part?
> > > 
> > > Done in direct response to Simo.
> > 
> > Sorry but your response sounds more like waving away then addressing
> > them, the excuse being: we can't please everyone, so we are going to
> > please no one.
> 
> I obviously disagree with the take on my comments but you're free to
> your opinion.

The I misunderstood your comments, I am not interested in putting words
in your mouth.

> I believe saying we are pleasing no one isn't really fair now is it?

Well, of course you are going to please the audit subsystem, I
understand that. I think there is a problem of expectations. Some
people, me included, hoped to have a way to identify a container with
the help of the kernel.

> Is there any type of audit container ID now?  How would you go about
> associating audit events with containers now?

We do not have a good way, there are some dirty tricks like inferring
the container identity via cgroup names, but that is ... eww.
This is why, given audit has the same need of user space, there was
some hope we could agree on an identifier that could be used by both.
It would make correlating audit logs and other cluster-wide events
simpler. That is all.

>  (spoiler alert: it ain't
> pretty, and there are gaps I don't believe you can cover)  This
> proposal provides a mechanism to do this in a way that isn't tied to
> any one particular concept of a container and is manageable inside the
> kernel.

I like the proposal for the most part, we are just discussing on the
nature of the identifier, which is a minor detail in the end.

> If you have a need to track audit events for containers, I find it
> extremely hard to believe that you are not at least partially pleased
> by the solutions presented here.  It may not be everything on your
> wishlist, but when did you ever get *everything* on your wishlist?

It is true, and I am sorry if I came out demanding or abrasive. It was
not my intention. Of course a u64 that has to be mapped is still better
than nothing. It does cause a lot more work in user space, but it is
not impossible to deal with.

> > > But to be clear Richard, we've talked about this a few times, it's not
> > > a "minor convenience" on our part, it's a pretty big convenience once
> > > we starting having to route audit events and make decisions based on
> > > the audit container ID information.  Audit performance is less than
> > > awesome now, I'm working hard to not make it worse.
> > 
> > Sounds like a security vs performance trade off to me.
> 
> Welcome to software development.  It's generally a pretty terrible
> hobby and/or occupation, but we make up for it with long hours and
> endless frustration.

Tell me more about that, not! ;-)

> > > > u64 vs u128 is easy for us to
> > > > accomodate in terms of scalar comparisons.  It doubles the information
> > > > in every container id field we print in audit records.
> > > 
> > > ... and slows down audit container ID checks.
> > 
> > Are you saying a cmp on a u128 is slower than a comparison on a u64 and
> > this is something that will be noticeable ?
> 
> Do you have a 128 bit system?

no, but all 64bit systems have an instruction that allow you to do
atomic 128 compare and swap (IIRC ?).

>   I don't.  I've got a bunch of 64 bit
> systems, and a couple of 32 bit systems too.  People that use audit
> have a tendency to really hammer on it, to the point that we get
> performance complaints on a not infrequent basis.  I don't know the
> exact number of times we are going to need to check the audit
> container ID, but it's reasonable to think that we'll expose it as a
> filter-able field which adds a few checks, we'll use it for record
> routing so that's a few more, and if we're running multiple audit
> daemons we will probably want to include LSM checks which could result
> in a few more audit container ID checks.  If it was one comparison I
> wouldn't be too worried about it, but the point I'm trying to make is
> that we don't know what the implementation is going to look like yet
> and I suspect this ID is going to be leveraged in several places in
> the audit subsystem and I would much rather start small to save
> headaches later.
> 
> We can always expand the ID to a larger integer at a later date, but
> we can't make it smaller.

Well looking through the history of in kernel identifiers I know it is
hard also to increase size, because userspace will end up depending on
a specific size ... and this is the only reason I am really debating
this. If it were really e

Re: [RFC PATCH 05/24] bpf: added bpf_xdpsk_redirect

2018-02-05 Thread Jesper Dangaard Brouer
On Wed, 31 Jan 2018 14:53:37 +0100 Björn Töpel  wrote:

> The bpf_xdpsk_redirect call redirects the XDP context to the XDP
> socket bound to the receiving queue (if any).

As I explained in-person at FOSDEM, my suggestion is to use the
bpf-map infrastructure for AF_XDP redirect, but people on this
upstream mailing also need a chance to validate my idea ;-)

The important thing to keep in-mind is how we can still maintain a
SPSC (Single producer Single Consumer) relationship between an
RX-queue and a userspace consumer-process.

This AF_XDP "FOSDEM" patchset, store the "xsk" (xdp_sock) pointer
directly in the net_device (_rx[].netdev_rx_queue.xs) structure.  This
limit each RX-queue to service a single xdp_sock.  It sounds good from
a SPSC pov, but not very flexible.  With a "xdp_sock_map" we can get
the flexibility to select among multiple xdp_sock'ets (via XDP
pre-filter selecting a different map), and still maintain a SPSC
relationship.  The RX-queue will just have several SPSC relationships
with the individual xdp_sock's.

This is true for the AF_XDP-copy mode, and require less driver change.
For the AF_XDP-zero-copy (ZC) mode drivers need significant changes
anyhow, and in ZC case we will have to disallow this multiple
xdp_sock's, which is simply achieved by checking if the xdp_sock
pointer returned from the map lookup match the one that userspace
requested driver to register it's memory for RX-rings from.

The "xdp_sock_map" is an array, where the index correspond to the
queue_index.  The bpf_redirect_map() ignore the specified index and
instead use xdp_rxq_info->queue_index in the lookup.

Notice that a bpf-map have no pinned relationship with the device or
XDP prog loaded.  Thus, userspace need to bind() this map to the
device before traffic can flow, like the proposed bind() on the
xdp_sock.  This is to establish the SPSC binding.  My proposal is that
userspace insert the xdp_sock file-descriptor(s) in the map at the
queue-index, and the map (which is also just a file-descriptor) is
bound maybe via bind() to a specific device (via the ifindex).  Kernel
side will walk the map and do required actions xdp_sock's in find in
map.

TX-side is harder, as now multiple xdp_sock's can have the same
queue-pair ID with the same net_device. But Magnus propose that this
can be solved with hardware. As newer NICs have many TX-queue, and the
queue-pair ID is just an external visible number, while the kernel
internal structure can point to a dedicated TX-queue per xdp_sock.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH 1/3] net: stmmac: dwmac-sun8i: drop V3s compatible and add V3 one

2018-02-05 Thread Maxime Ripard
Hi,

On Sat, Feb 03, 2018 at 03:23:28PM +0800, Icenowy Zheng wrote:
> 于 2018年2月3日 GMT+08:00 上午6:13:01, Maxime Ripard  写到:
> >On Sat, Feb 03, 2018 at 02:04:54AM +0800, Icenowy Zheng wrote:
> >> The V3s is just a differently packaged version of the V3 chip, which
> >has
> >> a MAC with the same capability with H3. The V3s just doesn't wire out
> >> the external MII/RMII/RGMII bus. (V3 wired out it).
> >> 
> >> Drop the compatible string of V3s in the dwmac-sun8i driver, and add
> >a
> >> V3 compatible string, which has all capabilities.
> >> 
> >> Signed-off-by: Icenowy Zheng 
> >
> >This breaks the DT ABI, so NAK.
> 
> I have asked this at IRC.

One more reason why no one should ask questions like this on IRC.

> The V3s compatible string is never used in any mainline
> kernel, even not in any RC version.

$ git grep allwinner,sun8i-v3s-emac v4.15 | wc -l 
5

It is there already, and the fact that we have or don't have a DT in
tree that use it doesn't matter. One could very well have written a DT
for it and never submitted it.

Maxime

-- 
Maxime Ripard, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
http://bootlin.com


signature.asc
Description: PGP signature


[PATCH] qlcnic: remove redundant initializations to pointers vh and nf

2018-02-05 Thread Colin King
From: Colin Ian King 

Pointer vh is initialized however the value is never read as it
is re-assigned inside an if-statement.  Move the declaration to
inside the if-statement and remove the extraneous initialization.

Similarly, pointer nf is being initialized and the value is never
read and is re-assigned later, so this is redundant and can also
be removed.

Cleans up clang warnings:
drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c:306:22: warning:
Value stored to 'vh' during its initialization is never read
drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c:627:26: warning:
Value stored to 'nf' during its initialization is never read

Signed-off-by: Colin Ian King 
---
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c 
b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c
index 84dd83031a1b..db0267a8eec4 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c
@@ -303,7 +303,6 @@ static void qlcnic_send_filter(struct qlcnic_adapter 
*adapter,
   struct cmd_desc_type0 *first_desc,
   struct sk_buff *skb)
 {
-   struct vlan_ethhdr *vh = (struct vlan_ethhdr *)(skb->data);
struct ethhdr *phdr = (struct ethhdr *)(skb->data);
u16 protocol = ntohs(skb->protocol);
struct qlcnic_filter *fil, *tmp_fil;
@@ -318,6 +317,8 @@ static void qlcnic_send_filter(struct qlcnic_adapter 
*adapter,
 
if (adapter->flags & QLCNIC_VLAN_FILTERING) {
if (protocol == ETH_P_8021Q) {
+   struct vlan_ethhdr *vh;
+
vh = (struct vlan_ethhdr *)skb->data;
vlan_id = ntohs(vh->h_vlan_TCI);
} else if (skb_vlan_tag_present(skb)) {
@@ -624,7 +625,7 @@ static int qlcnic_map_tx_skb(struct pci_dev *pdev, struct 
sk_buff *skb,
 static void qlcnic_unmap_buffers(struct pci_dev *pdev, struct sk_buff *skb,
 struct qlcnic_cmd_buffer *pbuf)
 {
-   struct qlcnic_skb_frag *nf = &pbuf->frag_array[0];
+   struct qlcnic_skb_frag *nf;
int i, nr_frags = skb_shinfo(skb)->nr_frags;
 
for (i = 0; i < nr_frags; i++) {
-- 
2.15.1



[PATCH net v2] sctp: fix dst refcnt leak in sctp_v4_get_dst

2018-02-05 Thread Tommi Rantala
Fix dst reference count leak in sctp_v4_get_dst() introduced in commit
410f03831 ("sctp: add routing output fallback"):

When walking the address_list, successive ip_route_output_key() calls
may return the same rt->dst with the reference incremented on each call.

The code would not decrement the dst refcount when the dst pointer was
identical from the previous iteration, causing the dst refcnt leak.

Testcase:
  ip netns add TEST
  ip netns exec TEST ip link set lo up
  ip link add dummy0 type dummy
  ip link add dummy1 type dummy
  ip link add dummy2 type dummy
  ip link set dev dummy0 netns TEST
  ip link set dev dummy1 netns TEST
  ip link set dev dummy2 netns TEST
  ip netns exec TEST ip addr add 192.168.1.1/24 dev dummy0
  ip netns exec TEST ip link set dummy0 up
  ip netns exec TEST ip addr add 192.168.1.2/24 dev dummy1
  ip netns exec TEST ip link set dummy1 up
  ip netns exec TEST ip addr add 192.168.1.3/24 dev dummy2
  ip netns exec TEST ip link set dummy2 up
  ip netns exec TEST sctp_test -H 192.168.1.2 -P 20002 -h 192.168.1.1 -p 2 
-s -B 192.168.1.3
  ip netns del TEST

In 4.4 and 4.9 kernels this results to:
  [  354.179591] unregister_netdevice: waiting for lo to become free. Usage 
count = 1
  [  364.419674] unregister_netdevice: waiting for lo to become free. Usage 
count = 1
  [  374.663664] unregister_netdevice: waiting for lo to become free. Usage 
count = 1
  [  384.903717] unregister_netdevice: waiting for lo to become free. Usage 
count = 1
  [  395.143724] unregister_netdevice: waiting for lo to become free. Usage 
count = 1
  [  405.383645] unregister_netdevice: waiting for lo to become free. Usage 
count = 1
  ...

Fixes: 410f03831 ("sctp: add routing output fallback")
Signed-off-by: Tommi Rantala 
---

v2: unconditional dst_release() before breaking out of the loop:
 - if dst == NULL, then dst_release() is no-op.
 - if dst != &rt->dst, then drop reference to the first dst, we do not
   need it after all.
 - if dst == &rt->dst, then the refcnt was incremented twice for this
   dst, so drop the second ref.

 net/sctp/protocol.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index ba28d270eb24..aba2381d3cf9 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -509,22 +509,20 @@ static void sctp_v4_get_dst(struct sctp_transport *t, 
union sctp_addr *saddr,
if (IS_ERR(rt))
continue;
 
-   if (!dst)
-   dst = &rt->dst;
-
/* Ensure the src address belongs to the output
 * interface.
 */
odev = __ip_dev_find(sock_net(sk), laddr->a.v4.sin_addr.s_addr,
 false);
if (!odev || odev->ifindex != fl4->flowi4_oif) {
-   if (&rt->dst != dst)
+   if (!dst)
+   dst = &rt->dst;
+   else
dst_release(&rt->dst);
continue;
}
 
-   if (dst != &rt->dst)
-   dst_release(dst);
+   dst_release(dst);
dst = &rt->dst;
break;
}
-- 
2.15.1



Re: [PATCH iproute2-next 08/10] rdma: Add QP resource tracking information

2018-02-05 Thread Leon Romanovsky
On Thu, Feb 01, 2018 at 02:05:08PM -0600, Steve Wise wrote:
> Hey Leon,

<...>

>
> > +static int res_qp_parse_cb(const struct nlmsghdr *nlh, void *data)
> > +{

<...>

> > +
> > +   mnl_attr_for_each_nested(nla_entry, nla_table) {
> > +   struct nlattr *nla_line[RDMA_NLDEV_ATTR_MAX] = {};
> > +   uint32_t lqpn, rqpn = 0, rq_psn = 0, sq_psn;
> > +   uint8_t type, state, path_mig_state = 0;
> > +   uint32_t port = 0, pid = 0;
> > +   char *comm = NULL;

<...>

> > +
> > +   if (rd_check_is_filtered(rd, "pid", pid))
> > +   continue;
>
> Is comm leaked here when ATTR_RES_PID is present?
>
>
> > +
> > +   if (nla_line[RDMA_NLDEV_ATTR_RES_KERN_NAME])
> > +   /* discard const from mnl_attr_get_str */
> > +   comm = (char
> > *)mnl_attr_get_str(nla_line[RDMA_NLDEV_ATTR_RES_KERN_NAME]);
>
> And also here if the kernel ever passes up both PID and KERN_NAME (which it
> isn't supposed to).

Yes, you are right, and the bad thing that I prepared everything to call
free() unconditionally by setting comm to be NULL.

Thanks

>
>
> Steve.
>


signature.asc
Description: PGP signature


[PATCH] net: phy-micrel: remove redundant initialization of pointer of_node

2018-02-05 Thread Colin King
From: Colin Ian King 

Pointer of_node is initialized with a value that is never read, of_node
is later updated with a new value instead, hence the initialization is
redundant and can be removed.  Also remove unused pointer dev and
remove an empty line.

Cleans up clang warnings:
drivers/net/phy/micrel.c:393:28: warning: Value stored to 'of_node'
during its initialization is never read
drivers/net/phy/micrel.c:532:28: warning: Value stored to 'of_node'
during its initialization is never read

Signed-off-by: Colin Ian King 
---
 drivers/net/phy/micrel.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/net/phy/micrel.c b/drivers/net/phy/micrel.c
index 0f45310300f6..a86ecccde924 100644
--- a/drivers/net/phy/micrel.c
+++ b/drivers/net/phy/micrel.c
@@ -389,8 +389,7 @@ static int ksz9021_load_values_from_of(struct phy_device 
*phydev,
 
 static int ksz9021_config_init(struct phy_device *phydev)
 {
-   const struct device *dev = &phydev->mdio.dev;
-   const struct device_node *of_node = dev->of_node;
+   const struct device_node *of_node;
const struct device *dev_walker;
 
/* The Micrel driver has a deprecated option to place phy OF
@@ -401,7 +400,6 @@ static int ksz9021_config_init(struct phy_device *phydev)
do {
of_node = dev_walker->of_node;
dev_walker = dev_walker->parent;
-
} while (!of_node && dev_walker);
 
if (of_node) {
@@ -528,8 +526,7 @@ static int ksz9031_enable_edpd(struct phy_device *phydev)
 
 static int ksz9031_config_init(struct phy_device *phydev)
 {
-   const struct device *dev = &phydev->mdio.dev;
-   const struct device_node *of_node = dev->of_node;
+   const struct device_node *of_node;
static const char *clk_skews[2] = {"rxc-skew-ps", "txc-skew-ps"};
static const char *rx_data_skews[4] = {
"rxd0-skew-ps", "rxd1-skew-ps",
-- 
2.15.1



Re: [PATCH] sctp: fix dst refcnt leak in sctp_v6_get_dst()

2018-02-05 Thread Neil Horman
On Mon, Feb 05, 2018 at 03:10:35PM +0300, Alexey Kodanev wrote:
> When going through the bind address list in sctp_v6_get_dst() and
> the previously found address is better ('matchlen > bmatchlen'),
> the code continues to the next iteration without releasing currently
> held destination.
> 
> Fix it by releasing 'bdst' before continue to the next iteration, and
> instead of introducing one more '!IS_ERR(bdst)' check for dst_release(),
> move the already existed one right after ip6_dst_lookup_flow(), i.e. we
> shouldn't proceed further if we get an error for the route lookup.
> 
> Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary 
> addresses for ipv6")
> Signed-off-by: Alexey Kodanev 
> ---
>  net/sctp/ipv6.c |   10 +++---
>  1 files changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
> index 5d4c15b..e35d4f7 100644
> --- a/net/sctp/ipv6.c
> +++ b/net/sctp/ipv6.c
> @@ -326,8 +326,10 @@ static void sctp_v6_get_dst(struct sctp_transport *t, 
> union sctp_addr *saddr,
>   final_p = fl6_update_dst(fl6, rcu_dereference(np->opt), &final);
>   bdst = ip6_dst_lookup_flow(sk, fl6, final_p);
>  
> - if (!IS_ERR(bdst) &&
> - ipv6_chk_addr(dev_net(bdst->dev),
> + if (IS_ERR(bdst))
> + continue;
> +
> + if (ipv6_chk_addr(dev_net(bdst->dev),
> &laddr->a.v6.sin6_addr, bdst->dev, 1)) {
>   if (!IS_ERR_OR_NULL(dst))
>   dst_release(dst);
> @@ -336,8 +338,10 @@ static void sctp_v6_get_dst(struct sctp_transport *t, 
> union sctp_addr *saddr,
>   }
>  
>   bmatchlen = sctp_v6_addr_match_len(daddr, &laddr->a);
> - if (matchlen > bmatchlen)
> + if (matchlen > bmatchlen) {
> + dst_release(bdst);
>   continue;
> + }
>  
>   if (!IS_ERR_OR_NULL(dst))
>   dst_release(dst);
> -- 
> 1.7.1
> 
> 
Acked-by: Neil Horman 


[PATCH] sctp: fix dst refcnt leak in sctp_v6_get_dst()

2018-02-05 Thread Alexey Kodanev
When going through the bind address list in sctp_v6_get_dst() and
the previously found address is better ('matchlen > bmatchlen'),
the code continues to the next iteration without releasing currently
held destination.

Fix it by releasing 'bdst' before continue to the next iteration, and
instead of introducing one more '!IS_ERR(bdst)' check for dst_release(),
move the already existed one right after ip6_dst_lookup_flow(), i.e. we
shouldn't proceed further if we get an error for the route lookup.

Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary 
addresses for ipv6")
Signed-off-by: Alexey Kodanev 
---
 net/sctp/ipv6.c |   10 +++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 5d4c15b..e35d4f7 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -326,8 +326,10 @@ static void sctp_v6_get_dst(struct sctp_transport *t, 
union sctp_addr *saddr,
final_p = fl6_update_dst(fl6, rcu_dereference(np->opt), &final);
bdst = ip6_dst_lookup_flow(sk, fl6, final_p);
 
-   if (!IS_ERR(bdst) &&
-   ipv6_chk_addr(dev_net(bdst->dev),
+   if (IS_ERR(bdst))
+   continue;
+
+   if (ipv6_chk_addr(dev_net(bdst->dev),
  &laddr->a.v6.sin6_addr, bdst->dev, 1)) {
if (!IS_ERR_OR_NULL(dst))
dst_release(dst);
@@ -336,8 +338,10 @@ static void sctp_v6_get_dst(struct sctp_transport *t, 
union sctp_addr *saddr,
}
 
bmatchlen = sctp_v6_addr_match_len(daddr, &laddr->a);
-   if (matchlen > bmatchlen)
+   if (matchlen > bmatchlen) {
+   dst_release(bdst);
continue;
+   }
 
if (!IS_ERR_OR_NULL(dst))
dst_release(dst);
-- 
1.7.1



PA Semi PWRficient Gigabit Ethernet doesn't work anymore since the first networking updates for the kernel 4.16

2018-02-05 Thread Christian Zigotzky

Hello Andrew,

Many thanks for your patch. I compiled the latest git kernel today and 
the PA Semi PWRficient Gigabit Ethernet works with your patch.


Have a nice week!

Thanks,
Christian


On 04 February 2018 at 6:16PM, Andrew Lunn wrote:
> On Sun, Feb 04, 2018 at 05:47:03PM +0100, Christian Zigotzky wrote:
>> Hello,
>>
>> The PA Semi PWRficient Gigabit Ethernet doesn't work anymore since 
the first

>> networking updates [1] for the kernel 4.16.
>>
>> Error messages:
>>
>> [    0.634241] libphy: pasemi gpio mdio bus: probed
>> [    0.634749] pasemi gpio mdio bus: Cannot register as MDIO bus, 
err -38

>
> -38 is ENOSYS.
>
>> --- a/drivers/net/phy/mdio_bus.c    2018-02-03 17:34:46.973045321 +0100
>> +++ b/drivers/net/phy/mdio_bus.c    2018-02-04 11:03:14.909093360 +0100
>> @@ -47,41 +47,11 @@
>>
>>  #include "mdio-boardinfo.h"
>>
>> -static int mdiobus_register_gpiod(struct mdio_device *mdiodev)
>> -{
>> -    struct gpio_desc *gpiod = NULL;
>> -
>> -    /* Deassert the optional reset signal */
>> -    if (mdiodev->dev.of_node)
>> -        gpiod = fwnode_get_named_gpiod(&mdiodev->dev.of_node->fwnode,
>> -                       "reset-gpios", 0, GPIOD_OUT_LOW,
>> -                       "PHY reset");
>
> So i think you don't have GPIOLIB enabled. Hence you are hitting
>
> 
http://elixir.free-electrons.com/linux/latest/source/include/linux/gpio/consumer.h#L470

>
> static inline
> struct gpio_desc *fwnode_get_named_gpiod(struct fwnode_handle *fwnode,
>                      const char *propname, int index,
>                      enum gpiod_flags dflags,
>                      const char *label)
> {
>     return ERR_PTR(-ENOSYS);
> }
>
> So rather than just deleting all this code, breaking other platforms
> that need this gpio, lets try a real fix. Please try this. If it
> works, i will formally submit it.
>
>    Andrew
>
> >From a4210ba306948497d7360927c1e532eb903c58b2 Mon Sep 17 00:00:00 2001
> From: Andrew Lunn 
> Date: Sun, 4 Feb 2018 11:09:20 -0600
> Subject: [PATCH] net: phy: Handle not having GPIO enabled in the kernel
>
> If CONFIG_GPIOLIB is disabled, fwnode_get_named_gpiod() becomes a stub
> function, which return -ENOSYS. Handle this in the same way as
> -ENOENT, i.e. assume there is no GPIO used to reset the PHYs.
>
> Reported-by: Christian Zigotzky 
> Signed-off-by: Andrew Lunn 
> ---
>  drivers/net/phy/mdio_bus.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
> index 88272b3ac2e2..24b5511222c8 100644
> --- a/drivers/net/phy/mdio_bus.c
> +++ b/drivers/net/phy/mdio_bus.c
> @@ -56,7 +56,8 @@ static int mdiobus_register_gpiod(struct 
mdio_device *mdiodev)

>          gpiod = fwnode_get_named_gpiod(&mdiodev->dev.of_node->fwnode,
>                         "reset-gpios", 0, GPIOD_OUT_LOW,
>                         "PHY reset");
> -    if (PTR_ERR(gpiod) == -ENOENT)
> +    if (PTR_ERR(gpiod) == -ENOENT ||
> +        PTR_ERR(gpiod) == -ENOSYS)
>          gpiod = NULL;
>      else if (IS_ERR(gpiod))
>          return PTR_ERR(gpiod);





[PATCH net v3] cls_u32: fix use after free in u32_destroy_key()

2018-02-05 Thread Paolo Abeni
Li Shuang reported an Oops with cls_u32 due to an use-after-free
in u32_destroy_key(). The use-after-free can be triggered with:

dev=lo
tc qdisc add dev $dev root handle 1: htb default 10
tc filter add dev $dev parent 1: prio 5 handle 1: protocol ip u32 divisor 256
tc filter add dev $dev protocol ip parent 1: prio 5 u32 ht 800:: match ip dst\
 10.0.0.0/8 hashkey mask 0xff00 at 16 link 1:
tc qdisc del dev $dev root

Which causes the following kasan splat:

 ==
 BUG: KASAN: use-after-free in u32_destroy_key.constprop.21+0x117/0x140 
[cls_u32]
 Read of size 4 at addr 881b83dae618 by task kworker/u48:5/571

 CPU: 17 PID: 571 Comm: kworker/u48:5 Not tainted 4.15.0+ #87
 Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016
 Workqueue: tc_filter_workqueue u32_delete_key_freepf_work [cls_u32]
 Call Trace:
  dump_stack+0xd6/0x182
  ? dma_virt_map_sg+0x22e/0x22e
  print_address_description+0x73/0x290
  kasan_report+0x277/0x360
  ? u32_destroy_key.constprop.21+0x117/0x140 [cls_u32]
  u32_destroy_key.constprop.21+0x117/0x140 [cls_u32]
  u32_delete_key_freepf_work+0x1c/0x30 [cls_u32]
  process_one_work+0xae0/0x1c80
  ? sched_clock+0x5/0x10
  ? pwq_dec_nr_in_flight+0x3c0/0x3c0
  ? _raw_spin_unlock_irq+0x29/0x40
  ? trace_hardirqs_on_caller+0x381/0x570
  ? _raw_spin_unlock_irq+0x29/0x40
  ? finish_task_switch+0x1e5/0x760
  ? finish_task_switch+0x208/0x760
  ? preempt_notifier_dec+0x20/0x20
  ? __schedule+0x839/0x1ee0
  ? check_noncircular+0x20/0x20
  ? firmware_map_remove+0x73/0x73
  ? find_held_lock+0x39/0x1c0
  ? worker_thread+0x434/0x1820
  ? lock_contended+0xee0/0xee0
  ? lock_release+0x1100/0x1100
  ? init_rescuer.part.16+0x150/0x150
  ? retint_kernel+0x10/0x10
  worker_thread+0x216/0x1820
  ? process_one_work+0x1c80/0x1c80
  ? lock_acquire+0x1a5/0x540
  ? lock_downgrade+0x6b0/0x6b0
  ? sched_clock+0x5/0x10
  ? lock_release+0x1100/0x1100
  ? compat_start_thread+0x80/0x80
  ? do_raw_spin_trylock+0x190/0x190
  ? _raw_spin_unlock_irq+0x29/0x40
  ? trace_hardirqs_on_caller+0x381/0x570
  ? _raw_spin_unlock_irq+0x29/0x40
  ? finish_task_switch+0x1e5/0x760
  ? finish_task_switch+0x208/0x760
  ? preempt_notifier_dec+0x20/0x20
  ? __schedule+0x839/0x1ee0
  ? kmem_cache_alloc_trace+0x143/0x320
  ? firmware_map_remove+0x73/0x73
  ? sched_clock+0x5/0x10
  ? sched_clock_cpu+0x18/0x170
  ? find_held_lock+0x39/0x1c0
  ? schedule+0xf3/0x3b0
  ? lock_downgrade+0x6b0/0x6b0
  ? __schedule+0x1ee0/0x1ee0
  ? do_wait_intr_irq+0x340/0x340
  ? do_raw_spin_trylock+0x190/0x190
  ? _raw_spin_unlock_irqrestore+0x32/0x60
  ? process_one_work+0x1c80/0x1c80
  ? process_one_work+0x1c80/0x1c80
  kthread+0x312/0x3d0
  ? kthread_create_worker_on_cpu+0xc0/0xc0
  ret_from_fork+0x3a/0x50

 Allocated by task 1688:
  kasan_kmalloc+0xa0/0xd0
  __kmalloc+0x162/0x380
  u32_change+0x1220/0x3c9e [cls_u32]
  tc_ctl_tfilter+0x1ba6/0x2f80
  rtnetlink_rcv_msg+0x4f0/0x9d0
  netlink_rcv_skb+0x124/0x320
  netlink_unicast+0x430/0x600
  netlink_sendmsg+0x8fa/0xd60
  sock_sendmsg+0xb1/0xe0
  ___sys_sendmsg+0x678/0x980
  __sys_sendmsg+0xc4/0x210
  do_syscall_64+0x232/0x7f0
  return_from_SYSCALL_64+0x0/0x75

 Freed by task 112:
  kasan_slab_free+0x71/0xc0
  kfree+0x114/0x320
  rcu_process_callbacks+0xc3f/0x1600
  __do_softirq+0x2bf/0xc06

 The buggy address belongs to the object at 881b83dae600
  which belongs to the cache kmalloc-4096 of size 4096
 The buggy address is located 24 bytes inside of
  4096-byte region [881b83dae600, 881b83daf600)
 The buggy address belongs to the page:
 page:ea006e0f6a00 count:1 mapcount:0 mapping:  (null) index:0x0 
compound_mapcount: 0
 flags: 0x17c0008100(slab|head)
 raw: 0017c0008100   000100070007
 raw: dead0100 dead0200 880187c0e600 
 page dumped because: kasan: bad access detected

 Memory state around the buggy address:
  881b83dae500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  881b83dae580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 >881b83dae600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ^
  881b83dae680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  881b83dae700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ==

The problem is that the htnode is freed before the linked knodes and the
latter will try to access the first at u32_destroy_key() time.
This change addresses the issue using the htnode refcnt to guarantee
the correct free order. While at it also add a RCU annotation,
to keep sparse happy.

v1 -> v2: use rtnl_derefence() instead of RCU read locks
v2 -> v3:
  - don't check refcnt in u32_destroy_hnode()
  - cleaned-up u32_destroy() implementation
  - cleaned-up code comment

Reported-by: Li Shuang 
Fixes: c0d378ef1266 ("net_sched: use tcf_queue_work() in u32 filter")
Signed-off-by: Paolo Abeni 
---

Re: [PATCH net] sctp: fix dst reference leak in sctp_v4_get_dst

2018-02-05 Thread Tommi Rantala

On 05.02.2018 03:01, Marcelo Ricardo Leitner wrote:

Hi,

On Sun, Feb 04, 2018 at 11:02:39AM +0200, Tommi Rantala wrote:

Fix dst reference leak in sctp_v4_get_dst() introduced in commit
410f03831 ("sctp: add routing output fallback"):

When walking the address_list, successive ip_route_output_key() calls
may return the same rt->dst with the reference incremented on each call.

The code would not decrement the dst refcount when the dst pointer was
identical from the previous iteration, causing the dst leak.


Ugh, I think this patch still does not get the refcounting right, please 
don't apply.


For example:
- on first iteration of the loop:
  "odev" check fails, and we set dst with refcnt=1.
- on second iteration:
  the same dst refcnt is incremented to 2 via ip_route_output_key().
  "odev" check succeeds.
  dst == &rt->dst, so we do not touch the refcnt.
  break out of the loop

=> oops, dst has refcnt=2


Testcase:
   ip netns add TEST
   ip netns exec TEST ip link set lo up
   ip link add dummy0 type dummy
   ip link add dummy1 type dummy
   ip link add dummy2 type dummy
   ip link set dev dummy0 netns TEST
   ip link set dev dummy1 netns TEST
   ip link set dev dummy2 netns TEST
   ip netns exec TEST ip addr add 192.168.1.1/24 dev dummy0
   ip netns exec TEST ip link set dummy0 up
   ip netns exec TEST ip addr add 192.168.1.2/24 dev dummy1
   ip netns exec TEST ip link set dummy1 up
   ip netns exec TEST ip addr add 192.168.1.3/24 dev dummy2
   ip netns exec TEST ip link set dummy2 up
   ip netns exec TEST sctp_test -H 192.168.1.2 -P 20002 -h 192.168.1.1 -p 2 
-s -B 192.168.1.3
   ip netns del TEST


Thanks for the detailed steps. I'll adapt and include it on
github/sctp/sctp-tests


ok, good!


In 4.4 and 4.9 kernels this results to:


I wonder why you couldn't reproduce this on newer kernels. Maybe
something masked it?


I didn't check what exactly changed here.

commit 955ec4cb3b54c7c389a9f830be7d3ae2056b9212 was mentioned elsewhere.

Also the commits by Wei Wang changed the dst handling, see e.g.
5b7c9a8ff828287af5aebe93e707271bf1a82cc3



   [  354.179591] unregister_netdevice: waiting for lo to become free. Usage 
count = 1
   [  364.419674] unregister_netdevice: waiting for lo to become free. Usage 
count = 1
   [  374.663664] unregister_netdevice: waiting for lo to become free. Usage 
count = 1
   [  384.903717] unregister_netdevice: waiting for lo to become free. Usage 
count = 1
   [  395.143724] unregister_netdevice: waiting for lo to become free. Usage 
count = 1
   [  405.383645] unregister_netdevice: waiting for lo to become free. Usage 
count = 1
   ...

Fixes: 410f03831 ("sctp: add routing output fallback")
Acked-by: Neil Horman 
Cc: Marcelo Ricardo Leitner 
Cc: Alexey Kodanev 
Signed-off-by: Tommi Rantala 
---
  net/sctp/protocol.c | 9 -
  1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 8b4ff315695e..aadb9e7f60e2 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -508,21 +508,20 @@ static void sctp_v4_get_dst(struct sctp_transport *t, 
union sctp_addr *saddr,
if (IS_ERR(rt))
continue;
  
-		if (!dst)

-   dst = &rt->dst;
-
/* Ensure the src address belongs to the output
 * interface.
 */
odev = __ip_dev_find(sock_net(sk), laddr->a.v4.sin_addr.s_addr,
 false);
if (!odev || odev->ifindex != fl4->flowi4_oif) {
-   if (&rt->dst != dst)
+   if (!dst)
+   dst = &rt->dst;
+   else
dst_release(&rt->dst);
continue;
}
  
-		if (dst != &rt->dst)

+   if (dst && dst != &rt->dst)
dst_release(dst);
dst = &rt->dst;
break;
--
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html