SSE instructions for fast packet copy?

2017-05-04 Thread Tom Herbert
Hi,

I am thinking about the possibility of using SSE in the kernel to speed up
the kernel memcpy, particularly for copies to userspace memory, and maybe
even using the string instructions (e.g. if we supported regex in something
like eBPF). AFAIK we don't use SSE in the kernel because the xmm register
state would need to be saved across context switches. However, if we start
busy-polling network queues on a CPU in the kernel, there might not be any
context switches to worry about. In this model we'd want to enable SSE per
CPU.

Has this ever been tried before? Is this at all feasible? :-) Is it
possible to enable SSE in the kernel for just one CPU? (I found that CPUID
reports SSE as supported, but I don't see how to enable it other than
compiling with -msse.)
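As a rough illustration of the kind of copy loop being discussed, here is a user-space sketch using SSE2 intrinsics from <emmintrin.h> (x86-64 assumed). It is not kernel code: on x86 the kernel only touches xmm registers between kernel_fpu_begin() and kernel_fpu_end(), which exist precisely because of the register-state problem mentioned above. The helper name sse_copy() is hypothetical.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2: __m128i, _mm_loadu_si128(), _mm_storeu_si128() */
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy n bytes 16 at a time through an xmm register,
 * then finish the remaining (< 16) bytes with plain memcpy(). */
static void sse_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);

    while (n >= 16) {
        /* Unaligned load/store, so the buffers need no special alignment. */
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        s += 16;
        d += 16;
        n -= 16;
    }
    memcpy(d, s, n); /* tail: fewer than 16 bytes left */
}
```

Any in-kernel gain from such a loop would have to outweigh the cost of saving and restoring the FPU/SIMD state around the copy.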

Thanks,
Tom


Re: Bug in skb_gro_receive - possible bad page state problems?

2017-05-04 Thread Eric Dumazet
On Fri, 2017-05-05 at 08:57 +0530, Anand H. Krishnan wrote:
> Hello,
> 
> Is skb_gro_receive doing the right thing for cloned packets?
> 
> When we are merging fragments, we do not seem to be taking a reference
> to the underlying page. To me, it looks like it should work fine for
> non-cloned packets. However, for cloned packets, when the gro-ed packet
> is eventually freed (because the original skb was not cloned and hence
> reference was 1), the merged skb's frags also get freed (put_page-ed)
> without taking into account the other references that were held for the
> fragments (dataref).
> 
> We saw crashes because of this behavior. Our setup had a third party kernel
> forwarding module which uses GRO (napi_gro_receive). Doing iperf3 with small
> packets and doing tcpdump on the receiving tap interface results in the
> problem. With DEBUG_VM enabled, put page crashes. Without DEBUG_VM, bad page
> state results.

Yep, GRO must not be used with cloned skb.

This is why gro_cells_receive() has this check:

	if (!gcells->cells || skb_cloned(skb) || netif_elide_gro(dev))
		return netif_rx(skb);

(But not the main napi_gro_receive(), which drivers are supposed to call
before any tap.)
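A user-space model of the rule above may help: feed an skb to GRO only when no other skb shares its data, and otherwise take the plain netif_rx()-style path. struct fake_skb, its fields and choose_rx_path() are hypothetical stand-ins, not kernel API; in the kernel the test is skb_cloned(), and the gro_available flag here folds together the !gcells->cells and netif_elide_gro() conditions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct sk_buff: only the fields the check needs. */
struct fake_skb {
    bool cloned;  /* set when skb_clone() shared this skb's data */
    int dataref;  /* number of skbs referencing the shared data area */
};

enum rx_path { RX_GRO, RX_NETIF_RX };

/* Mirrors the intent of skb_cloned(): the data is shared with another skb. */
static bool fake_skb_cloned(const struct fake_skb *skb)
{
    return skb->cloned && skb->dataref != 1;
}

/* Pick the receive path the way gro_cells_receive() does: cloned skbs bypass
 * GRO, since merging their frags must not steal pages another skb uses. */
static enum rx_path choose_rx_path(const struct fake_skb *skb, bool gro_available)
{
    if (!gro_available || fake_skb_cloned(skb))
        return RX_NETIF_RX;
    return RX_GRO;
}
```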




Bug in skb_gro_receive - possible bad page state problems?

2017-05-04 Thread Anand H. Krishnan
Hello,

Is skb_gro_receive doing the right thing for cloned packets?

When we are merging fragments, we do not seem to be taking a reference
to the underlying page. To me, it looks like it should work fine for non-cloned
packets. However, for cloned packets, when the gro-ed packet is eventually
freed (because the original skb was not cloned and hence reference was 1),
the merged skb's frags also get freed (put_page-ed) without taking into account
the other references that were held for the fragments (dataref).

We saw crashes because of this behavior. Our setup had a third-party kernel
forwarding module which uses GRO (napi_gro_receive). Running iperf3 with small
packets while running tcpdump on the receiving tap interface triggers the problem.
With DEBUG_VM enabled, put_page() crashes. Without DEBUG_VM, bad page
state results.
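The failure mode described above can be mimicked with a toy reference count in user space. This is a simplified model, not the actual skb_gro_receive() logic: struct toy_page, struct toy_skb and the helpers are hypothetical, and the page is assumed to start with a single reference that the original skb and its clone share through the common data area (dataref), as in the report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page: only the reference count matters here. */
struct toy_page {
    int refcount;
};

/* Toy stand-in for an skb holding a single page fragment. */
struct toy_skb {
    struct toy_page *frag;
};

static void toy_get_page(struct toy_page *p)
{
    p->refcount++;
}

/* Drop one reference; returns true if this was the last one (page freed). */
static bool toy_put_page(struct toy_page *p)
{
    return --p->refcount == 0;
}

/* "Merge" src's fragment into the GRO head. With take_ref == false this
 * mimics the reported behaviour: the head now points at the page but holds
 * no reference of its own. */
static void toy_merge_frag(struct toy_skb *head, const struct toy_skb *src,
                           bool take_ref)
{
    head->frag = src->frag;
    if (take_ref)
        toy_get_page(head->frag);
}
```

In the buggy case, freeing the GRO head drops the page to zero while the original skb still points at it; its later put_page() then roughly corresponds to the DEBUG_VM crash or "bad page state" reported above.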

Your thoughts (please CC me, since I am not part of this list).

Thanks,
Anand


Re: FEC on i.MX 7 transmit queue timeout

2017-05-04 Thread Andy Duan
On 2017年05月05日 10:09, Stefan Agner wrote:
> On 2017-05-04 19:03, Andy Duan wrote:
>> On 2017年05月05日 05:36, Stefan Agner wrote:
>>> On 2017-05-03 20:08, Andy Duan wrote:
>>>> From: Stefan Agner  Sent: Thursday, May 04, 2017 9:22 AM
>>>>> To: Andy Duan 
>>>>> Cc: fugang.d...@freescale.com; feste...@gmail.com;
>>>>> netdev@vger.kernel.org; netdev-ow...@vger.kernel.org
>>>>> Subject: Re: FEC on i.MX 7 transmit queue timeout
>>>>>
>>>>> Hi Andy,
>>>>>
>>>>> On 2017-04-20 19:48, Andy Duan wrote:
>>>>>> On 2017年04月20日 07:15, Stefan Agner wrote:
>>>>>>> I tested again with imx6sx-fec compatible string. I could reproduce
>>>>>>> it on a Colibri with i.MX 7Dual. But not always: It really depends
>>>>>>> whether queue 2 is counting up or not. Just after boot, I check
>>>>>>> /proc/interrupts twice, if queue 2 is counting it will happen!
>>>>>>>
>>>>>>> But if only queue 0 is mostly in use, then it seems to work just fine.
>>>>>> If your case is only running best effort like tcp/udp, you can re-set
>>>>>> the "fsl,num-tx-queues" and "fsl,num-rx-queues" to 1 in board dts file.
>>>>>> Other two queues are for AVB audio/video queues, they have high
>>>>>> priority than queue 0. If running iperf tcp test on the three queues,
>>>>>> then the tcp segment may be out-of-order that cause net watchdog timeout.
>>>>>>> I also tried i.MX 7Dual SabreSD here, and the same thing. I had to
>>>>>>> reboot 3 times, then queue 2 was counting:
>>>>>>> 57:      8 GIC-0 150 Level 30be.ethernet
>>>>>>> 58:  20137 GIC-0 151 Level 30be.ethernet
>>>>>>> 59:   9269 GIC-0 152 Level 30be.ethernet
>>>>>>>
>>>>>>> It took me about 40 minutes on Sabre until it happened, and I had to
>>>>>>> force it using iperf, but then I got the ring dumps:
>>>>>> My board had ran more than 47 hours with nfs rootfs in 4.11.0-rc6, but
>>>>>> not running iperf.
>>>>>> I am testing with iperf.
>>>>> Any update on this issue?
>>>>>
>>>>> When using iperf (server) on the board with Linux 4.11 the issue appears
>>>>> within a few iperf iterations on a Sabre (TO 1.2, Board Rev C, if that
>>>>> matters)...
>>>>>
>>>> I don’t know whether you received my last mail. (maybe failed due to I
>>>> received some rejection mails)
>>> I think I did not... The last email I received was Fri, 21 Apr 2017
>>> 02:48:23 UTC.
>>>
>>>
>>>> If your case is only running best effort like tcp/udp, you can re-set
>>>> the "fsl,num-tx-queues" and "fsl,num-rx-queues" to 1 in board dts
>>>> file.
>>> I did test that, and it seems to work fine with those properties set to
>>> 1.
>> So it can fix your problem after long time test?
> Yes, seems to work fine after more than 2 hours.
>
>>>> Other two queues are for AVB audio/video queues, they have high
>>>> priority than queue 0. If running iperf tcp test on the three queues,
>>>> then the tcp segment may be out-of-order that cause net watchdog
>>>> timeout.
>>> Okay. A single event would be understandable, but it seems to enter some
>>> kind of loop after that (continuously printing "fec 30be.ethernet
>>> eth0: TX ring dump ...").
>>>
>>> In a quick test I commented out the fec_dump call, with that it seems to
>>> print only once and continues working afterwards (although, speed starts
>>> to decrease, so something is not good at that point).
>> The test base on above change ? One queue still bring watchdog timeout ?
> No, sorry for the confusion: This was without the fix above. So use
> multiple queues, and disable fec_dump... I was just wondering, because
> disabling the multiple queues seems to me somewhat a workaround for
> now... :-)
>
No, it is not a workaround. As I said, queue1 and queue2 are for the AVB
paths and have higher transmit priority; that is what causes the trouble in
your case. I will submit a patch so that best-effort traffic goes to queue0
and AVB streams go to queue1 and queue2.
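A sketch of the queue selection described above, written as plain C: untagged (ordinary TCP/UDP) traffic always goes to queue 0, and only frames tagged as AVB streams use the higher-priority queues 1 and 2. This is not the fsl or upstream patch; the enum, struct and select_tx_queue() are hypothetical, and mapping PCP 3 and 2 (commonly used defaults for AVB SR class A and B) to queues 2 and 1 is an assumption. In a driver, logic like this would typically sit behind an ndo_select_queue callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the three FEC TX queues on i.MX 7. */
enum fec_txq {
    TXQ_BEST_EFFORT = 0, /* ordinary TCP/UDP traffic */
    TXQ_AVB_1 = 1,       /* AVB stream queue */
    TXQ_AVB_2 = 2,       /* AVB stream queue */
};

/* Hypothetical per-frame summary extracted from the skb. */
struct tx_frame_info {
    bool vlan_tagged; /* frame carries an 802.1Q tag */
    uint8_t pcp;      /* 802.1Q priority code point, 0..7 */
};

/* Keep untagged and low-priority frames on queue 0, so a single TCP flow is
 * never spread across queues of different priority (which can reorder its
 * segments and starve queue 0). */
static enum fec_txq select_tx_queue(const struct tx_frame_info *f)
{
    assert(f->pcp <= 7);
    if (!f->vlan_tagged)
        return TXQ_BEST_EFFORT;
    switch (f->pcp) {
    case 3:
        return TXQ_AVB_2;
    case 2:
        return TXQ_AVB_1;
    default:
        return TXQ_BEST_EFFORT;
    }
}
```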

>
>>>> In fsl kernel tree, there have one patch that only select the queue0
>>>> for best effort like tcp/udp. Pls test again in your board, if no
>>>> problem I will upstream the patch.
>>> That sounds like a reasonable fix.
>>>
>>> IP, no matter whether TCP/UDP, is the most common use case, so IMHO this
>>> should "just work" by default.
>>>
>>> --
>>> Stefan

RE: [net-next] net: remove duplicate add_device_randomness() call

2017-05-04 Thread 张胜举
> -Original Message-
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Thursday, May 04, 2017 11:13 PM
> To: zhangshen...@cmss.chinamobile.com
> Cc: netdev@vger.kernel.org; eduma...@google.com
> Subject: Re: [net-next] net: remove duplicate add_device_randomness() call
> 
> From: Zhang Shengju 
> Date: Thu,  4 May 2017 11:40:42 +0800
> 
> > Since register_netdevice() already call add_device_randomness() and
> > dev_set_mac_address() will call it after mac address change.
> > It's not necessary to call at device UP.
> >
> > Signed-off-by: Zhang Shengju 
> 
> The net-next tree is closed, please resubmit this when the net-next tree
> opens back up.

Okay, thanks David.

BRs,
ZSJ





[PATCH v2 net] tcp: randomize timestamps on syncookies

2017-05-04 Thread Eric Dumazet
From: Eric Dumazet 

The whole point of randomization was to hide server uptime, but an attacker
can simply start a SYN flood and TCP generates 'old style' timestamps,
directly revealing the server jiffies value.

Also, the TSval sent by the server to a particular remote address varies
depending on whether syncookies are sent or not, potentially triggering
PAWS drops for innocent clients.

Let's implement proper randomization, including for SYN cookies.

Also, we do not need to export sysctl_tcp_timestamps, since it is not
used from a module.

In v2, I added Florian's feedback and contribution, adding tsoff to
tcp_get_cookie_sock()
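A user-space sketch of the idea: the timestamp offset comes from a keyed hash of the address pair only (compare secure_tcp_ts_off(saddr, daddr) in the diff below), so every SYN-ACK to a given peer, cookie or not, uses the same TSval base, and raw jiffies are not exposed as long as the key stays secret. The kernel uses siphash() keyed with a random ts_secret; the FNV-1a mix here is a stand-in chosen only to keep the example self-contained and is NOT cryptographically secure. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in keyed hash: FNV-1a over (secret, saddr, daddr). NOT secure; the
 * kernel uses siphash() with a random key. This only shows the data flow. */
static uint32_t toy_keyed_hash(uint64_t secret, uint32_t saddr, uint32_t daddr)
{
    uint8_t buf[16];
    uint32_t h = 2166136261u; /* FNV-1a offset basis */

    for (int i = 0; i < 8; i++)
        buf[i] = (uint8_t)(secret >> (8 * i));
    for (int i = 0; i < 4; i++) {
        buf[8 + i] = (uint8_t)(saddr >> (8 * i));
        buf[12 + i] = (uint8_t)(daddr >> (8 * i));
    }
    for (int i = 0; i < 16; i++) {
        h ^= buf[i];
        h *= 16777619u; /* FNV prime, wraps modulo 2^32 */
    }
    return h;
}

/* Offset depends only on the address pair (no ports), so it is identical
 * whether the SYN-ACK came from the normal path or from a syncookie. */
static uint32_t toy_ts_off(uint64_t secret, uint32_t saddr, uint32_t daddr)
{
    return toy_keyed_hash(secret, saddr, daddr);
}

/* TSval put on the wire: local clock plus per-peer offset (mod 2^32). */
static uint32_t toy_tsval(uint32_t now, uint32_t ts_off)
{
    return now + ts_off;
}
```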

Fixes: 95a22caee396c ("tcp: randomize tcp timestamp offsets for each connection")
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 
Cc: Yuchung Cheng 
---
 include/net/secure_seq.h |   10 ++
 include/net/tcp.h        |    5 +++--
 net/core/secure_seq.c    |   31 +++
 net/ipv4/syncookies.c    |   12 ++--
 net/ipv4/tcp_input.c     |    8 +++-
 net/ipv4/tcp_ipv4.c      |   31 +++
 net/ipv6/syncookies.c    |   10 +-
 net/ipv6/tcp_ipv6.c      |   32 +++-
 8 files changed, 88 insertions(+), 51 deletions(-)

diff --git a/include/net/secure_seq.h b/include/net/secure_seq.h
index fe236b3429f0d8caeb1adc367b5b4a20591c848b..b94006f6fbdde0d78fe33b9c2d86159e291c30cf 100644
--- a/include/net/secure_seq.h
+++ b/include/net/secure_seq.h
@@ -6,10 +6,12 @@
 u32 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport);
 u32 secure_ipv6_port_ephemeral(const __be32 *saddr, const __be32 *daddr,
   __be16 dport);
-u32 secure_tcp_seq_and_tsoff(__be32 saddr, __be32 daddr,
-__be16 sport, __be16 dport, u32 *tsoff);
-u32 secure_tcpv6_seq_and_tsoff(const __be32 *saddr, const __be32 *daddr,
-  __be16 sport, __be16 dport, u32 *tsoff);
+u32 secure_tcp_seq(__be32 saddr, __be32 daddr,
+  __be16 sport, __be16 dport);
+u32 secure_tcp_ts_off(__be32 saddr, __be32 daddr);
+u32 secure_tcpv6_seq(const __be32 *saddr, const __be32 *daddr,
+__be16 sport, __be16 dport);
+u32 secure_tcpv6_ts_off(const __be32 *saddr, const __be32 *daddr);
 u64 secure_dccp_sequence_number(__be32 saddr, __be32 daddr,
__be16 sport, __be16 dport);
 u64 secure_dccpv6_sequence_number(__be32 *saddr, __be32 *daddr,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 270e5cc43c99e7030e95af218095cf9f283950bc..8c0e5a901d6424fbd01233cd3adfdce52076f7a9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -470,7 +470,7 @@ void inet_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb);
 /* From syncookies.c */
 struct sock *tcp_get_cookie_sock(struct sock *sk, struct sk_buff *skb,
 struct request_sock *req,
-struct dst_entry *dst);
+struct dst_entry *dst, u32 tsoff);
 int __cookie_v4_check(const struct iphdr *iph, const struct tcphdr *th,
  u32 cookie);
 struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb);
@@ -1822,7 +1822,8 @@ struct tcp_request_sock_ops {
 #endif
struct dst_entry *(*route_req)(const struct sock *sk, struct flowi *fl,
   const struct request_sock *req);
-   __u32 (*init_seq_tsoff)(const struct sk_buff *skb, u32 *tsoff);
+   u32 (*init_seq)(const struct sk_buff *skb);
+   u32 (*init_ts_off)(const struct sk_buff *skb);
int (*send_synack)(const struct sock *sk, struct dst_entry *dst,
   struct flowi *fl, struct request_sock *req,
   struct tcp_fastopen_cookie *foc,
diff --git a/net/core/secure_seq.c b/net/core/secure_seq.c
index 6bd2f8fb0476baabf507557fc0d06b6787511c70..ae35cce3a40d70387bee815798933aa43a0e6d84 100644
--- a/net/core/secure_seq.c
+++ b/net/core/secure_seq.c
@@ -24,9 +24,13 @@ static siphash_key_t ts_secret __read_mostly;
 
 static __always_inline void net_secret_init(void)
 {
-   net_get_random_once(_secret, sizeof(ts_secret));
net_get_random_once(_secret, sizeof(net_secret));
 }
+
+static __always_inline void ts_secret_init(void)
+{
+   net_get_random_once(_secret, sizeof(ts_secret));
+}
 #endif
 
 #ifdef CONFIG_INET
@@ -47,7 +51,7 @@ static u32 seq_scale(u32 seq)
 #endif
 
 #if IS_ENABLED(CONFIG_IPV6)
-static u32 secure_tcpv6_ts_off(const __be32 *saddr, const __be32 *daddr)
+u32 secure_tcpv6_ts_off(const __be32 *saddr, const __be32 *daddr)
 {
const struct {
struct in6_addr saddr;
@@ -60,12 +64,14 @@ static u32 secure_tcpv6_ts_off(const __be32 *saddr, const __be32 *daddr)
if (sysctl_tcp_timestamps != 1)
return 0;
 
+   ts_secret_init();
return siphash(, 

Re: FEC on i.MX 7 transmit queue timeout

2017-05-04 Thread Stefan Agner
On 2017-05-04 19:03, Andy Duan wrote:
> On 2017年05月05日 05:36, Stefan Agner wrote:
>> On 2017-05-03 20:08, Andy Duan wrote:
>>> From: Stefan Agner  Sent: Thursday, May 04, 2017 9:22 AM
>>>> To: Andy Duan 
>>>> Cc: fugang.d...@freescale.com; feste...@gmail.com;
>>>> netdev@vger.kernel.org; netdev-ow...@vger.kernel.org
>>>> Subject: Re: FEC on i.MX 7 transmit queue timeout
>>>>
>>>> Hi Andy,
>>>>
>>>> On 2017-04-20 19:48, Andy Duan wrote:
>>>>> On 2017年04月20日 07:15, Stefan Agner wrote:
>>>>>> I tested again with imx6sx-fec compatible string. I could reproduce
>>>>>> it on a Colibri with i.MX 7Dual. But not always: It really depends
>>>>>> whether queue 2 is counting up or not. Just after boot, I check
>>>>>> /proc/interrupts twice, if queue 2 is counting it will happen!
>>>>>>
>>>>>> But if only queue 0 is mostly in use, then it seems to work just fine.
>>>>> If your case is only running best effort like tcp/udp, you can re-set
>>>>> the "fsl,num-tx-queues" and "fsl,num-rx-queues" to 1 in board dts file.
>>>>> Other two queues are for AVB audio/video queues, they have high
>>>>> priority than queue 0. If running iperf tcp test on the three queues,
>>>>> then the tcp segment may be out-of-order that cause net watchdog timeout.
>>>>>> I also tried i.MX 7Dual SabreSD here, and the same thing. I had to
>>>>>> reboot 3 times, then queue 2 was counting:
>>>>>> 57:      8 GIC-0 150 Level 30be.ethernet
>>>>>> 58:  20137 GIC-0 151 Level 30be.ethernet
>>>>>> 59:   9269 GIC-0 152 Level 30be.ethernet
>>>>>>
>>>>>> It took me about 40 minutes on Sabre until it happened, and I had to
>>>>>> force it using iperf, but then I got the ring dumps:
>>>>> My board had ran more than 47 hours with nfs rootfs in 4.11.0-rc6, but
>>>>> not running iperf.
>>>>> I am testing with iperf.
>>>> Any update on this issue?
>>>>
>>>> When using iperf (server) on the board with Linux 4.11 the issue appears
>>>> within a few iperf iterations on a Sabre (TO 1.2, Board Rev C, if that
>>>> matters)...
>>>>
>>> I don’t know whether you received my last mail. (maybe failed due to I
>>> received some rejection mails)
>> I think I did not... The last email I received was Fri, 21 Apr 2017
>> 02:48:23 UTC.
>>
>>
>>> If your case is only running best effort like tcp/udp, you can re-set
>>> the "fsl,num-tx-queues" and "fsl,num-rx-queues" to 1 in board dts
>>> file.
>> I did test that, and it seems to work fine with those properties set to
>> 1.
> So it can fix your problem after long time test?

Yes, seems to work fine after more than 2 hours.

>>> Other two queues are for AVB audio/video queues, they have high
>>> priority than queue 0. If running iperf tcp test on the three queues,
>>> then the tcp segment may be out-of-order that cause net watchdog
>>> timeout.
>> Okay. A single event would be understandable, but it seems to enter some
>> kind of loop after that (continuously printing "fec 30be.ethernet
>> eth0: TX ring dump ...").
>>
>> In a quick test I commented out the fec_dump call, with that it seems to
>> print only once and continues working afterwards (although, speed starts
>> to decrease, so something is not good at that point).
> The test base on above change ? One queue still bring watchdog timeout ?

No, sorry for the confusion: This was without the fix above. So use
multiple queues, and disable fec_dump... I was just wondering, because
disabling the multiple queues seems to me somewhat a workaround for
now... :-)

--
Stefan

>>> In fsl kernel tree, there have one patch that only select the queue0
>>> for best effort like tcp/udp. Pls test again in your board, if no
>>> problem I will upstream the patch.
>> That sounds like a reasonable fix.
>>
>> IP, no matter whether TCP/UDP, is the most common use case, so IMHO this
>> should "just work" by default.
>>
>> --
>> Stefan


Re: FEC on i.MX 7 transmit queue timeout

2017-05-04 Thread Andy Duan
On 2017年05月05日 05:36, Stefan Agner wrote:
> On 2017-05-03 20:08, Andy Duan wrote:
>> From: Stefan Agner  Sent: Thursday, May 04, 2017 9:22 AM
>>> To: Andy Duan 
>>> Cc: fugang.d...@freescale.com; feste...@gmail.com;
>>> netdev@vger.kernel.org; netdev-ow...@vger.kernel.org
>>> Subject: Re: FEC on i.MX 7 transmit queue timeout
>>>
>>> Hi Andy,
>>>
>>> On 2017-04-20 19:48, Andy Duan wrote:
>>>> On 2017年04月20日 07:15, Stefan Agner wrote:
>>>>> I tested again with imx6sx-fec compatible string. I could reproduce
>>>>> it on a Colibri with i.MX 7Dual. But not always: It really depends
>>>>> whether queue 2 is counting up or not. Just after boot, I check
>>>>> /proc/interrupts twice, if queue 2 is counting it will happen!
>>>>>
>>>>> But if only queue 0 is mostly in use, then it seems to work just fine.
>>>> If your case is only running best effort like tcp/udp, you can re-set
>>>> the "fsl,num-tx-queues" and "fsl,num-rx-queues" to 1 in board dts file.
>>>> Other two queues are for AVB audio/video queues, they have high
>>>> priority than queue 0. If running iperf tcp test on the three queues,
>>>> then the tcp segment may be out-of-order that cause net watchdog timeout.
>>>>> I also tried i.MX 7Dual SabreSD here, and the same thing. I had to
>>>>> reboot 3 times, then queue 2 was counting:
>>>>> 57:      8 GIC-0 150 Level 30be.ethernet
>>>>> 58:  20137 GIC-0 151 Level 30be.ethernet
>>>>> 59:   9269 GIC-0 152 Level 30be.ethernet
>>>>>
>>>>> It took me about 40 minutes on Sabre until it happened, and I had to
>>>>> force it using iperf, but then I got the ring dumps:
>>>> My board had ran more than 47 hours with nfs rootfs in 4.11.0-rc6, but
>>>> not running iperf.
>>>> I am testing with iperf.
>>> Any update on this issue?
>>>
>>> When using iperf (server) on the board with Linux 4.11 the issue appears
>>> within a few iperf iterations on a Sabre (TO 1.2, Board Rev C, if that 
>>> matters)...
>>>
>> I don’t know whether you received my last mail. (maybe failed due to I
>> received some rejection mails)
> I think I did not... The last email I received was Fri, 21 Apr 2017
> 02:48:23 UTC.
>
>   
>> If your case is only running best effort like tcp/udp, you can re-set
>> the "fsl,num-tx-queues" and "fsl,num-rx-queues" to 1 in board dts
>> file.
> I did test that, and it seems to work fine with those properties set to
> 1.
So does it fix your problem in a long-running test?
>> Other two queues are for AVB audio/video queues, they have high
>> priority than queue 0. If running iperf tcp test on the three queues,
>> then the tcp segment may be out-of-order that cause net watchdog
>> timeout.
> Okay. A single event would be understandable, but it seems to enter some
> kind of loop after that (continuously printing "fec 30be.ethernet
> eth0: TX ring dump ...").
>
> In a quick test I commented out the fec_dump call, with that it seems to
> print only once and continues working afterwards (although, speed starts
> to decrease, so something is not good at that point).
Was this test based on the above change? Does a single queue still trigger the watchdog timeout?
>> In fsl kernel tree, there have one patch that only select the queue0
>> for best effort like tcp/udp. Pls test again in your board, if no
>> problem I will upstream the patch.
> That sounds like a reasonable fix.
>
> IP, no matter whether TCP/UDP, is the most common use case, so IMHO this
> should "just work" by default.
>
> --
> Stefan

Re: [PATCH net] tcp: randomize timestamps on syncookies

2017-05-04 Thread Eric Dumazet
On Fri, 2017-05-05 at 02:32 +0200, Florian Westphal wrote:
> Florian Westphal  wrote:
> [..]
> > This breaks syncookies w. timestamps; cookie_timestamp_decode() lacks a tsoff
> > for readjustment.
> > 
> > We also need to pass the (recomputed) tsoff to tcp_get_cookie_sock().
> 
> This small delta makes things work for me:
> 

Hi Florian, thanks for looking at this.

One comment:

> diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
> index 895ff650db43..eb96825d6340 100644
> --- a/net/ipv6/syncookies.c
> +++ b/net/ipv6/syncookies.c
> @@ -18,6 +18,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -143,6 +144,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
>   int mss;
>   struct dst_entry *dst;
>   __u8 rcv_wscale;
> + u32 tsoff;
>  
>   if (!sock_net(sk)->ipv4.sysctl_tcp_syncookies || !th->ack || th->rst)
>   goto out;
> @@ -162,6 +164,12 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
>   memset(_opt, 0, sizeof(tcp_opt));
>   tcp_parse_options(skb, _opt, 0, NULL);
>  
> + tsoff = 0;
> + if (tcp_opt.saw_tstamp && tcp_opt.rcv_tsecr) {
> + tsoff = secure_tcpv6_ts_off(_hdr(skb)->daddr, _hdr(skb)->saddr);

I will use the ipv6_hdr(skb)->daddr.s6_addr32 and
ipv6_hdr(skb)->saddr.s6_addr32 if you agree ;)

> + tcp_opt.rcv_tsecr -= tsoff;
> + }
> +

Thanks !




Re: [PATCH net] tcp: randomize timestamps on syncookies

2017-05-04 Thread Florian Westphal
Florian Westphal  wrote:
[..]
> This breaks syncookies w. timestamps; cookie_timestamp_decode() lacks a tsoff
> for readjustment.
> 
> We also need to pass the (recomputed) tsoff to tcp_get_cookie_sock().

This small delta makes things work for me:

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 1be353fc5cb1..8c0e5a901d64 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -470,7 +470,7 @@ void inet_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb);
 /* From syncookies.c */
 struct sock *tcp_get_cookie_sock(struct sock *sk, struct sk_buff *skb,
 struct request_sock *req,
-struct dst_entry *dst);
+struct dst_entry *dst, u32 tsoff);
 int __cookie_v4_check(const struct iphdr *iph, const struct tcphdr *th,
  u32 cookie);
 struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb);
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 496b97e17aaf..39bfdc94bf44 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -203,7 +204,7 @@ EXPORT_SYMBOL_GPL(__cookie_v4_check);
 
 struct sock *tcp_get_cookie_sock(struct sock *sk, struct sk_buff *skb,
 struct request_sock *req,
-struct dst_entry *dst)
+struct dst_entry *dst, u32 tsoff)
 {
struct inet_connection_sock *icsk = inet_csk(sk);
struct sock *child;
@@ -213,6 +214,7 @@ struct sock *tcp_get_cookie_sock(struct sock *sk, struct sk_buff *skb,
 NULL, _req);
if (child) {
atomic_set(>rsk_refcnt, 1);
+   tcp_sk(child)->tsoffset = tsoff;
sock_rps_save_rxhash(child, skb);
inet_csk_reqsk_queue_add(sk, req, child);
} else {
@@ -292,6 +294,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
struct rtable *rt;
__u8 rcv_wscale;
struct flowi4 fl4;
+   u32 tsoff;
 
if (!sock_net(sk)->ipv4.sysctl_tcp_syncookies || !th->ack || th->rst)
goto out;
@@ -311,6 +314,12 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
memset(_opt, 0, sizeof(tcp_opt));
tcp_parse_options(skb, _opt, 0, NULL);
 
+   tsoff = 0;
+   if (tcp_opt.saw_tstamp && tcp_opt.rcv_tsecr) {
+   tsoff = secure_tcp_ts_off(ip_hdr(skb)->daddr, ip_hdr(skb)->saddr);
+   tcp_opt.rcv_tsecr -= tsoff;
+   }
+
if (!cookie_timestamp_decode(_opt))
goto out;
 
@@ -381,7 +390,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
ireq->rcv_wscale  = rcv_wscale;
ireq->ecn_ok = cookie_ecn_ok(_opt, sock_net(sk), >dst);
 
-   ret = tcp_get_cookie_sock(sk, skb, req, >dst);
+   ret = tcp_get_cookie_sock(sk, skb, req, >dst, tsoff);
/* ip_queue_xmit() depends on our flow being setup
 * Normal sockets get it right from inet_csk_route_child_sock()
 */
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 895ff650db43..eb96825d6340 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -143,6 +144,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
int mss;
struct dst_entry *dst;
__u8 rcv_wscale;
+   u32 tsoff;
 
if (!sock_net(sk)->ipv4.sysctl_tcp_syncookies || !th->ack || th->rst)
goto out;
@@ -162,6 +164,12 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
memset(_opt, 0, sizeof(tcp_opt));
tcp_parse_options(skb, _opt, 0, NULL);
 
+   tsoff = 0;
+   if (tcp_opt.saw_tstamp && tcp_opt.rcv_tsecr) {
+   tsoff = secure_tcpv6_ts_off(_hdr(skb)->daddr, _hdr(skb)->saddr);
+   tcp_opt.rcv_tsecr -= tsoff;
+   }
+
if (!cookie_timestamp_decode(_opt))
goto out;
 
@@ -242,7 +250,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
ireq->rcv_wscale = rcv_wscale;
ireq->ecn_ok = cookie_ecn_ok(_opt, sock_net(sk), dst);
 
-   ret = tcp_get_cookie_sock(sk, skb, req, dst);
+   ret = tcp_get_cookie_sock(sk, skb, req, dst, tsoff);
 out:
return ret;
 out_free:


Re: [PATCH iproute2] vxlan: Add support for modifying vxlan device attributes

2017-05-04 Thread Girish Moodalbail

On 5/4/17 5:07 PM, Stephen Hemminger wrote:
> On Thu,  4 May 2017 14:46:34 -0700
> Girish Moodalbail  wrote:
>
>> Ability to change vxlan device attributes was added to kernel through
>> commit 8bcdc4f3a20b ("vxlan: add changelink support"), however one
>> cannot do the same through ip(8) command.  Changing the allowed vxlan
>> device attributes using 'ip link set dev  type vxlan
>> ' currently fails with 'operation not supported'
>> error.  This failure is due to the incorrect rtnetlink message
>> construction for the 'ip link set' operation.
>>
>> The vxlan_parse_opt() callback function is called for parsing options
>> for both 'ip link add' and 'ip link set'. For the 'add' case, we pass
>> down default values for those attributes that were not provided as CLI
>> options. However, for the 'set' case we should be only passing down the
>> explicitly provided attributes and not any other (default) attributes.
>>
>> Signed-off-by: Girish Moodalbail 
>> ---
>
> All these foo_set variables are ugly. This looks almost like machine
> generated code. It doesn't read well.

I thought about it; however, I wasn't sure whether refactoring that whole
routine would be well received, so I decided to follow the model that
already existed in iplink_vxlan.c. I will resubmit a patch that cleans up
that whole routine.


thanks,
~Girish



Re: [PATCH net] tcp: randomize timestamps on syncookies

2017-05-04 Thread Florian Westphal
Eric Dumazet  wrote:
> From: Eric Dumazet 
> 
> Whole point of randomization was to hide server uptime, but an attacker
> can simply start a syn flood and TCP generates 'old style' timestamps,
> directly revealing server jiffies value.
> 
> Also, TSval sent by the server to a particular remote address vary depending
> on syncookies being sent or not, potentially triggering PAWS drops for
> innocent clients.
> 
> Lets implement proper randomization, including for SYNcookies.
> 
> Also we do not need to export sysctl_tcp_timestamps, it is not used from
> a module.

I like the direction, but this is incomplete.

>   if (want_cookie) {
>   isn = cookie_init_sequence(af_ops, sk, skb, >mss);
> - tcp_rsk(req)->ts_off = 0;

This breaks syncookies w. timestamps; cookie_timestamp_decode() lacks a tsoff
for readjustment.

We also need to pass the (recomputed) tsoff to tcp_get_cookie_sock().

Other than this, this patch looks good to me, thanks!
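Florian's objection can be seen with a small round trip: the server sends TSval = raw + tsoff, the client echoes it back as TSecr, and the cookie decoder expects the raw value, so the recomputed tsoff has to be subtracted first. The helper names below are hypothetical user-space stand-ins, and `raw` stands for whatever the cookie code encoded into the timestamp; unsigned 32-bit arithmetic keeps the subtraction correct across wraparound.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* SYN-ACK side: the cookie-encoded timestamp is shifted by the per-peer
 * offset before it goes on the wire as TSval. */
static uint32_t cookie_tsval_on_wire(uint32_t raw, uint32_t tsoff)
{
    return raw + tsoff; /* wraps modulo 2^32 */
}

/* Final-ACK side: mirroring the delta Florian posted, the offset is removed
 * only when a non-zero TSecr was seen, before the value is decoded. */
static uint32_t tsecr_for_cookie_decode(bool saw_tstamp, uint32_t rcv_tsecr,
                                        uint32_t tsoff)
{
    if (saw_tstamp && rcv_tsecr)
        rcv_tsecr -= tsoff; /* also correct across wraparound */
    return rcv_tsecr;
}
```

Without the adjustment (tsoff treated as 0 on the ACK path), the decoder sees raw + tsoff and the cookie's timestamp options no longer match.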


Re: [PATCH iproute2] vxlan: Add support for modifying vxlan device attributes

2017-05-04 Thread Stephen Hemminger
On Thu,  4 May 2017 14:46:34 -0700
Girish Moodalbail  wrote:

> Ability to change vxlan device attributes was added to kernel through
> commit 8bcdc4f3a20b ("vxlan: add changelink support"), however one
> cannot do the same through ip(8) command.  Changing the allowed vxlan
> device attributes using 'ip link set dev  type vxlan
> ' currently fails with 'operation not supported'
> error.  This failure is due to the incorrect rtnetlink message
> construction for the 'ip link set' operation.
> 
> The vxlan_parse_opt() callback function is called for parsing options
> for both 'ip link add' and 'ip link set'. For the 'add' case, we pass
> down default values for those attributes that were not provided as CLI
> options. However, for the 'set' case we should be only passing down the
> explicitly provided attributes and not any other (default) attributes.
> 
> Signed-off-by: Girish Moodalbail 
> ---

All these foo_set variables are ugly. This looks almost like machine
generated code. It doesn't read well.
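One way to avoid a separate foo_set variable per option, sketched in plain C: record which options were given on the command line in a bitmask, then fill in defaults for the rest on 'ip link add' but emit only the explicitly given ones on 'ip link set'. The option names, defaults, structs and vx_build() are hypothetical; real iproute2 code would append netlink attributes to the request instead of filling an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of vxlan options, one bit each in the "given" mask. */
enum vx_opt { VX_OPT_TTL, VX_OPT_TOS, VX_OPT_LEARNING, VX_OPT_MAX };

struct vx_opts {
    uint32_t given; /* bit i set => option i appeared on the CLI */
    uint32_t value[VX_OPT_MAX];
};

/* One attribute destined for the netlink request. */
struct vx_attr {
    enum vx_opt opt;
    uint32_t value;
};

/* Illustrative defaults, used only for "ip link add". */
static const uint32_t vx_defaults[VX_OPT_MAX] = {
    [VX_OPT_TTL] = 0,
    [VX_OPT_TOS] = 0,
    [VX_OPT_LEARNING] = 1,
};

static void vx_set(struct vx_opts *o, enum vx_opt opt, uint32_t v)
{
    o->value[opt] = v;
    o->given |= 1u << opt;
}

/* "add": every option, with defaults filled in. "set": only options the user
 * gave, so attributes absent from the request are left unchanged.
 * Returns the number of attributes written to out[]. */
static int vx_build(const struct vx_opts *o, bool is_add,
                    struct vx_attr out[VX_OPT_MAX])
{
    int n = 0;

    for (int i = 0; i < VX_OPT_MAX; i++) {
        bool given = (o->given >> i) & 1u;

        if (!given && !is_add)
            continue;
        out[n].opt = (enum vx_opt)i;
        out[n].value = given ? o->value[i] : vx_defaults[i];
        n++;
    }
    return n;
}
```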


Re: [RFC iproute2 0/8] RDMA tool

2017-05-04 Thread Stephen Hemminger
On Thu, 04 May 2017 16:45:58 -0400
Doug Ledford  wrote:

> On Thu, 2017-05-04 at 15:26 -0400, Dennis Dalessandro wrote:
> > On 05/04/2017 02:45 PM, Leon Romanovsky wrote:  
> > > 
> > > On Thu, May 04, 2017 at 06:30:27PM +, Bart Van Assche wrote:  
> > > > 
> > > > On Thu, 2017-05-04 at 21:25 +0300, Leon Romanovsky wrote:  
> > > > > 
> > > > > On Thu, May 04, 2017 at 06:10:54PM +, Bart Van Assche wrote:
> > > > > > 
> > > > > > On Thu, 2017-05-04 at 21:02 +0300, Leon Romanovsky wrote:  
> > > > > > > 
> > > > > > > Following our discussion both in mailing list [1] and at the LPC 2016 [2],
> > > > > > > we would like to propose this RDMA tool to be part of iproute2 package
> > > > > > > and finally improve this situation.
> > > > > > 
> > > > > > Hello Leon,
> > > > > > 
> > > > > > Although I really appreciate your work: can you clarify why you would like to
> > > > > > add *RDMA* functionality to an *IP routing* tool? I haven't found any motivation
> > > > > > for adding RDMA functionality to iproute2 in [1].
> > > > > 
> > > > > We are planning to reuse the same infrastructure provided by iproute2,
> > > > > like netlink parsing, access to distributions, same CLI and same standards.
> > > > > 
> > > > > Right now, RDMA is already tightened to netdev: iWARP, RoCE, IPoIB, HFI-VNIC.
> > > > > Many drivers (mlx, qed, i40, cxgb) are sharing code between net and RDMA.
> > > > > 
> > > > > I do expect that iproute2 will be installed on every machine with any
> > > > > type of connection, including IB and OPA.
> > > > > 
> > > > > So I think that it is enough to be part of that suite and don't invent
> > > > > our own for one specific tool.
> > > > 
> > > > Hello Leon,
> > > > 
> > > > Sorry but to me that sounds like a weak argument for including RDMA functionality
> > > > in iproute2. There is already a library for communication over netlink sockets,
> > > > namely libnl. Is there functionality that is in iproute2 but not in libnl and
> > > > that is needed for the new tool? If so, have you considered to create a new
> > > > library for that functionality?
> > > 
> > > It is not hard to create new tool, the hardest part is to ensure that it is
> > > part of the distributions. Did you count how many months we are trying to
> > > add rdma-core to debian?
> > 
> > I do agree that it is a strange pairing and am not really a fan. However
> > at the end of the day it's just a name for a repo/package. If the
> > iproute folks are fine to include rdma in their repo/package, great we
> > can leverage their code for CLI and other common stuff.
> 
> If you look into the iproute2 package, it becomes clear that the name
> iproute2 is historical and not really accurate any more.  It contains
> things like the bridge control software, tc for controlling send
> queues, and many things network related but not routing related.  The
> rdma tool is a perfectly fine fit in the sense that it is an additional
> network management tool IMO.
> 
> For reference, here's the list of stuff already in iproute on my Fedora
> 24 box:
> 
> /usr/sbin/arpd
> /usr/sbin/bridge
> /usr/sbin/cbq
> /usr/sbin/ctstat
> /usr/sbin/genl
> /usr/sbin/ifcfg
> /usr/sbin/ifstat
> /usr/sbin/ip
> /usr/sbin/lnstat
> /usr/sbin/nstat
> /usr/sbin/routef
> /usr/sbin/routel
> /usr/sbin/rtacct
> /usr/sbin/rtmon
> /usr/sbin/rtpr
> /usr/sbin/rtstat
> /usr/sbin/ss
> /usr/sbin/tc
> /usr/sbin/tipc
> 
> And in fact, if you check, tipc is almost similar to RDMA ;-)  So, I
> suggest people not get hung up on the name iproute2, the fit is fine
> when you look deeper into the nature of the package.
> 

Iproute2 is a collection like busybox. It already has bridging, devlink and tipc.


Re: [PATCH net-next] selftests/bpf: get rid of -D__x86_64__

2017-05-04 Thread Alexei Starovoitov

On 5/4/17 6:37 AM, David Miller wrote:

From: Alexei Starovoitov 
Date: Wed, 3 May 2017 20:30:22 -0700


I would buy that debian folks indeed care about multi-arch, but
what above does is making #include  to be a nop
for any cross-compiler on sparc that included it.


No, if you installed cross compiler for arch X it would add
another stanza doing that "ifdef __ARCH__, include blah, endif"
dance.


You're right that we cannot assume much about /usr/include craziness.
In that sense adding a __native_arch__ macro is also wrong, since
it assumes sane /usr/include without inline asm or other things
that clang for bpf arch can consume.


You can assume that it's for the ARCH we are trying to run tests
for, which needs to be in the family of the kernel arch.


In that sense the only way to be independent from arch dependent
things in /usr/include is to put all arch specific headers
into our own dir in tools/selftests/ (or maybe tools/bpf/include)
and point clang to that. I think the list of .h in there will be
limited. Only things like linux/types.h and gnu/stubs.h,
so it will be manageable.
Thoughts?


No, this way lies madness.

If you want to get the kernel headers, set up the proper environment
instead of constantly trying to fight it.


We don't want to get kernel headers.
We made this mistake with samples/bpf/, since tracing actually needs
the headers to be able to call bpf_probe_read() with correct offsets.
This stuff just doesn't work on arm, and it's not clear whether it works
on other archs beyond x86.
For arm we had this -D__ASM_SYSREG_H hack, but it stopped working
and Andy tried to address it in [1], but it didn't go far.
So today samples/bpf/ are completely broken on arm and it's making
people believe that xdp and networking programs also need kernel
headers and also cannot work on arm, which is not the case at all.
Hence I don't want testing/selftests/bpf/ to have anything to do
with kernel headers and arch specific headers too.
All xdp programs are 90% arch independent. The only difference is
big vs little endian and clang solves this for us automatically,
since the selftests Makefile uses 'clang -target bpf', which
picks native endianness.
All headers included by tools/testing/selftests/bpf/test_*.c programs
shouldn't use anything arch specific.
They #include  to get 'struct tcphdr' and things like
IPPROTO_IPIP and AF_INET6.
These headers should work seamlessly on all archs, but since such
headers typically do #include  which is arch dependent
we get into the issue we're discussing.
We don't need native . Moreover it's incorrect to
use native types.h, since we're compiling to bpf bytecode which
is 64-bit and needs to see types.h with sizeof(void*)==8.
When we're compiling xdp programs on x86 we're compiling them
into bpf bytecode with little endian flavor. Nothing x86 specific
about it. The same bytecode will run on arm64.
That is the case for all networking programs.
Hence I think the cleanest solution is to have the bpf arch's types.h
either installed with llvm/gcc or picked from the selftests dir.
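To illustrate the endianness point above outside of BPF, here is a minimal userspace C sketch (not taken from the selftests): reading the EtherType out of a raw frame with ntohs() yields the same value on little- and big-endian hosts, so the same networking source is correct for either byte order. The helper name `frame_ethertype` is hypothetical.

```c
#include <arpa/inet.h>  /* ntohs() */
#include <assert.h>
#include <stddef.h>     /* size_t */
#include <stdint.h>
#include <string.h>     /* memcpy() */

/* Hypothetical helper: return the EtherType of an untagged Ethernet frame
 * in host byte order, or 0 if the frame is too short. */
static uint16_t frame_ethertype(const uint8_t *frame, size_t len)
{
	uint16_t be_type;

	if (len < 14)   /* 6-byte dst MAC + 6-byte src MAC + 2-byte EtherType */
		return 0;
	/* The field is big-endian on the wire; memcpy avoids an unaligned
	 * load and ntohs() converts to host order on either kind of CPU. */
	memcpy(&be_type, frame + 12, sizeof(be_type));
	return ntohs(be_type);
}
```

A frame whose bytes 12..13 are `08 00` gives 0x0800 (IPv4) whether the code runs on x86 or on a big-endian machine.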

Tracing side is quite different.
For example: samples/bpf/offwaketime_kern.c does:
struct task_struct *p = (void *) PT_REGS_PARM1(ctx);
u32 pid;
bpf_probe_read(&pid, sizeof(pid), &p->pid);

The '&p->pid' offset is arch and kernel specific, hence
not only do we need kernel headers for the given architecture,
but also autoconf.h of that specific kernel with the correct kernel version.
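The offset problem can be shown with a small userspace C sketch. The two structs below are hypothetical stand-ins for a task struct built with and without some config option; they are not real kernel layouts. The same source expression `p->pid` resolves to a different byte offset depending on which layout the compiler saw, which is why a tracing program must be built against the probed kernel's exact headers and configuration.

```c
#include <assert.h>
#include <stddef.h>   /* offsetof(), size_t */

/* Hypothetical layout when an optional field is configured in. */
struct task_with_opt {
	long state;
	void *stack;
	unsigned long opt_field;   /* present only with the hypothetical option */
	int pid;
};

/* The same hypothetical struct built without that option. */
struct task_without_opt {
	long state;
	void *stack;
	int pid;
};

/* The offset a tracing program would bake into its bytecode for p->pid. */
static size_t pid_offset_with(void)    { return offsetof(struct task_with_opt, pid); }
static size_t pid_offset_without(void) { return offsetof(struct task_without_opt, pid); }
```

On common ABIs (no extra padding between these members) the two offsets differ by exactly `sizeof(unsigned long)`.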

[1] https://www.spinics.net/lists/arm-kernel/msg567602.html



Re: [PATCH net-next 9/9] ipvlan: introduce individual MAC addresses

2017-05-04 Thread महेश बंडेवार
On Thu, May 4, 2017 at 9:43 AM, Jiri Benc  wrote:
> On Thu, 4 May 2017 09:37:00 +, Chiappero, Marco wrote:
>> This looks conceptually wrong. Yes, ipvlan works at L3 (which is an
>> implementation detail anyway), but slaves are Ethernet interfaces and
>> should behave as much as possible as such regardless, with an
>> individual MAC address assigned.
>
> Isn't the proper fix then converting ipvlan interfaces to be L3 only
> interfaces? I.e., ARPHRD_NONE? There's not much ipvlan can do with
> arbitrary Ethernet frames anyway. Of course, a flag to switch to the
> new behavior would be needed in order to preserve backwards
> compatibility.
>
There is mode = L3/L3s for that.

> This patchset looks very wrong. For proper support of multiple MAC
> addresses, we have macvlan and it's pointless to add that to ipvlan.
> And doing some kind of weird MAC NAT in ipvlan just to satisfy broken
> tools that can't cope with multiple interfaces with the same MAC address
> is wrong, too. Those tools are already broken anyway; there's nothing
> preventing anyone from setting the same MAC address on multiple interfaces.
> I suppose those tools don't work with bonding and bridge, either?
>
+1

>> So, either we fix this by forcing slaves to stay in sync with master,
>
> Yes, that's the correct behavior. Well, at least as correct as one can
> get with the ipvlan broken design of pretending that an interface is L2
> when in fact, it is not.
>
Conceptually, view it as a single link (one L2) that muxes/demuxes at L3
for a multi-namespace world with different routing needs, without needing
additional packet processing.

>  Jiri


Re: [PATCH net-next 9/9] ipvlan: introduce individual MAC addresses

2017-05-04 Thread महेश बंडेवार
On Thu, May 4, 2017 at 2:37 AM, Chiappero, Marco wrote:
>> -Original Message-
>> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
>> On Behalf Of Dan Williams
>> Sent: Tuesday, May 2, 2017 5:09 PM
>> To: Chiappero, Marco ; netdev@vger.kernel.org
>> Cc: David S . Miller ; Kirsher, Jeffrey T
>> ; Duyck, Alexander H
>> ; Grandhi, Sainath
>> ; Mahesh Bandewar 
>> Subject: Re: [PATCH net-next 9/9] ipvlan: introduce individual MAC addresses
>>
>> On Tue, 2017-05-02 at 15:08 +, Chiappero, Marco wrote:
>> > > -Original Message-
>> > > From: Dan Williams [mailto:d...@redhat.com] On Thu, 2017-04-27 at
>> > > 11:20 -0500, Dan Williams wrote:
>> > > > On Thu, 2017-04-27 at 15:51 +0100, Marco Chiappero wrote:
>> > > > > Currently all the slave devices belonging to the same port
>> > > > > inherit their MAC address from its master device. This patch
>> > > > > removes this limitation and allows every slave device to obtain
>> > > > > a unique MAC address, by default randomly generated at creation
>> > > > > time.
>> > > > >
>> > > > > Moreover it is now possible to correctly modify the MAC address
>> > > > > at any time, fixing an existing bug as MAC address changes on
>> > > > > the master were not reflected on the slaves. It also avoids
>> > > > > multiple interfaces sharing the same IPv6 link-local address.
>> > > >
>> > > > How is this different than macvlan now?
>> >
>> > The same way it was before. The purpose of the patch is to make it
>> > possible to change the MAC address on slaves, not to change the
>> > external behavior of ipvlan: ipvlan will still behave as ipvlan,
>> > macvlan will still behave as macvlan.
>>
>> Ok, it was completely unclear from the commit message that the "internal" MAC
>> addresses of the ipvlan interfaces were not reflected "on the wire", but that
>> this was essentially (as you say below) MAC NAT.
>
> Sorry about that, I'll fix it in V2.
>
>> I think everyone agrees that being able to change the MAC is useful, and that
>> not being able to was a bug.
>>
>> What I'm still not clear on is, if IPv6 is already solved, why is it useful to
>> assign each ipvlan interface a unique MAC address? Is it only to make interface
>> lookups via MAC easier?
>
> The main motivation is that some higher-level management software expects
> interfaces on the same L2 broadcast domain to have different MAC addresses,
> either out of the box or via an address change, or both. Instead, with ipvlan:
> - there is no real/formal segmentation between slaves
Segmentation between slaves is at L3. If you want segmentation at L2,
then use macvlan.

> - slaves share the same L2 address
Yes, that's by design!

> This looks conceptually wrong.
Why is it hard to view a device that does mux/demux at L3 while keeping
the same L2? Namespaces within a physical box have different L3 addresses
and different routing needs, while all of this is enclosed inside one
host exposing one external L2. If some management software doesn't like
it, then either IPvlan is not the best fit for that solution or that
software needs an update.
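The "one L2, demux at L3" model described above can be pictured with a toy C sketch: every slave shares the master's MAC, and an inbound packet is handed to a slave purely by its destination IPv4 address. The struct and function names are hypothetical, and the linear table only stands in for whatever lookup structure the driver actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slave record: all slaves share the master's MAC, so only
 * the L3 address distinguishes them. */
struct toy_slave {
	const char *name;
	uint32_t ipv4;          /* host byte order, for simplicity */
};

/* Return the slave that owns dst_ip, or NULL if no slave claims it. */
static const struct toy_slave *demux_l3(const struct toy_slave *slaves,
					 size_t n, uint32_t dst_ip)
{
	for (size_t i = 0; i < n; i++)
		if (slaves[i].ipv4 == dst_ip)
			return &slaves[i];
	return NULL;
}
```

No L2 information is consulted: two namespaces behind the same MAC are told apart only by their IP addresses.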

> Yes, ipvlan works at L3 (which is an implementation detail anyway), but slaves
> are Ethernet interfaces and should behave as much as possible as such
> regardless, with an individual MAC address assigned.
>
> Additionally there is another related bug as it's currently possible to work 
> around this limitation, although breaking the whole thing, by:
> 1) changing the MAC address on master from X to Y
> 2) creating a slave, receiving address Y
> 3) restoring the original MAC address X on master
>
> So, either we fix this by forcing slaves to stay in sync with master,
I have already mentioned that this is the only acceptable fix out of
this patchset, but not by way of rewriting headers. The simplest fix is
to use master->dev_addr in dev_hard_header().
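A minimal userspace sketch of the suggested direction: when building the Ethernet header, take the source address from the master device at transmit time instead of from the slave. The structs and `toy_hard_header()` are hypothetical and only mimic the role of dev_hard_header(); because the master's address is read on every call, the change-MAC-on-master sequence described earlier in the thread can no longer leave a slave transmitting with a stale address.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_ETH_ALEN 6

struct toy_dev {
	uint8_t dev_addr[TOY_ETH_ALEN];
	struct toy_dev *master;     /* NULL for the physical (master) device */
};

/* Write dst MAC, src MAC and a big-endian EtherType into a 14-byte header.
 * The source MAC is taken from the master, never from the slave's own
 * dev_addr, so slaves always follow the master's current address. */
static void toy_hard_header(uint8_t hdr[14], const struct toy_dev *dev,
			    const uint8_t daddr[TOY_ETH_ALEN], uint16_t type)
{
	const struct toy_dev *src = dev->master ? dev->master : dev;

	memcpy(hdr, daddr, TOY_ETH_ALEN);
	memcpy(hdr + TOY_ETH_ALEN, src->dev_addr, TOY_ETH_ALEN);
	hdr[12] = (uint8_t)(type >> 8);
	hdr[13] = (uint8_t)(type & 0xff);
}
```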

> or correctly support independent MAC addresses, which would be IMO preferable 
> for the above reasons.
>
> Best Regards,
> Marco


[PATCH net] tcp: randomize timestamps on syncookies

2017-05-04 Thread Eric Dumazet
From: Eric Dumazet 

The whole point of randomization was to hide server uptime, but an attacker
can simply start a SYN flood; TCP then generates 'old style' timestamps,
directly revealing the server's jiffies value.

Also, the TSval sent by the server to a particular remote address varies
depending on whether syncookies are being sent or not, potentially
triggering PAWS drops for innocent clients.

Let's implement proper randomization, including for SYN cookies.

Also, we do not need to export sysctl_tcp_timestamps; it is not used from
a module.
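To make the intent concrete, a userspace C sketch of the idea (not the kernel code): the timestamp offset is a keyed hash of the address pair only, so every connection and every SYN cookie toward a given peer gets the same offset, while the raw clock value never appears on the wire. The patch uses siphash with a random ts_secret; the FNV-1a style mix below is a non-cryptographic stand-in chosen only to keep the sketch self-contained, and all `toy_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a secret generated once at boot. */
static const uint64_t toy_ts_secret = 0x9e3779b97f4a7c15ULL;

/* FNV-1a over the secret and both addresses. NOT a secure PRF: it only
 * shows that the result depends on (secret, saddr, daddr) and nothing else. */
static uint32_t toy_ts_off(uint32_t saddr, uint32_t daddr)
{
	uint64_t h = 0xcbf29ce484222325ULL;
	const uint64_t words[3] = { toy_ts_secret, saddr, daddr };

	for (int w = 0; w < 3; w++)
		for (int b = 0; b < 8; b++) {
			h ^= (words[w] >> (8 * b)) & 0xff;
			h *= 0x100000001b3ULL;
		}
	return (uint32_t)h;
}

/* TSval sent to a peer: local clock plus the per-address-pair offset.
 * Ports are not inputs, so a SYN cookie reply and a normal SYN-ACK to the
 * same peer carry consistent timestamps. */
static uint32_t toy_tsval(uint32_t now, uint32_t saddr, uint32_t daddr)
{
	return now + toy_ts_off(saddr, daddr);
}
```

Differences between TSvals sent to one peer still track elapsed time exactly (PAWS stays happy), but the absolute value reveals nothing about uptime without the secret.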

Fixes: 95a22caee396c ("tcp: randomize tcp timestamp offsets for each connection")
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 
Cc: Yuchung Cheng 
---
 include/net/secure_seq.h |   10 ++
 include/net/tcp.h        |    3 ++-
 net/core/secure_seq.c    |   31 +++
 net/ipv4/tcp_input.c     |    8 +++-
 net/ipv4/tcp_ipv4.c      |   31 +++
 net/ipv6/tcp_ipv6.c      |   32 +++-
 6 files changed, 68 insertions(+), 47 deletions(-)

diff --git a/include/net/secure_seq.h b/include/net/secure_seq.h
index fe236b3429f0d8caeb1adc367b5b4a20591c848b..b94006f6fbdde0d78fe33b9c2d86159e291c30cf 100644
--- a/include/net/secure_seq.h
+++ b/include/net/secure_seq.h
@@ -6,10 +6,12 @@
 u32 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport);
 u32 secure_ipv6_port_ephemeral(const __be32 *saddr, const __be32 *daddr,
   __be16 dport);
-u32 secure_tcp_seq_and_tsoff(__be32 saddr, __be32 daddr,
-__be16 sport, __be16 dport, u32 *tsoff);
-u32 secure_tcpv6_seq_and_tsoff(const __be32 *saddr, const __be32 *daddr,
-  __be16 sport, __be16 dport, u32 *tsoff);
+u32 secure_tcp_seq(__be32 saddr, __be32 daddr,
+  __be16 sport, __be16 dport);
+u32 secure_tcp_ts_off(__be32 saddr, __be32 daddr);
+u32 secure_tcpv6_seq(const __be32 *saddr, const __be32 *daddr,
+__be16 sport, __be16 dport);
+u32 secure_tcpv6_ts_off(const __be32 *saddr, const __be32 *daddr);
 u64 secure_dccp_sequence_number(__be32 saddr, __be32 daddr,
__be16 sport, __be16 dport);
 u64 secure_dccpv6_sequence_number(__be32 *saddr, __be32 *daddr,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 270e5cc43c99e7030e95af218095cf9f283950bc..1be353fc5cb1c313e8fec350d8a0a9ccc8f770ee 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1822,7 +1822,8 @@ struct tcp_request_sock_ops {
 #endif
struct dst_entry *(*route_req)(const struct sock *sk, struct flowi *fl,
   const struct request_sock *req);
-   __u32 (*init_seq_tsoff)(const struct sk_buff *skb, u32 *tsoff);
+   u32 (*init_seq)(const struct sk_buff *skb);
+   u32 (*init_ts_off)(const struct sk_buff *skb);
int (*send_synack)(const struct sock *sk, struct dst_entry *dst,
   struct flowi *fl, struct request_sock *req,
   struct tcp_fastopen_cookie *foc,
diff --git a/net/core/secure_seq.c b/net/core/secure_seq.c
index 6bd2f8fb0476baabf507557fc0d06b6787511c70..ae35cce3a40d70387bee815798933aa43a0e6d84 100644
--- a/net/core/secure_seq.c
+++ b/net/core/secure_seq.c
@@ -24,9 +24,13 @@ static siphash_key_t ts_secret __read_mostly;
 
 static __always_inline void net_secret_init(void)
 {
-   net_get_random_once(&ts_secret, sizeof(ts_secret));
net_get_random_once(&net_secret, sizeof(net_secret));
 }
+
+static __always_inline void ts_secret_init(void)
+{
+   net_get_random_once(&ts_secret, sizeof(ts_secret));
+}
 #endif
 
 #ifdef CONFIG_INET
@@ -47,7 +51,7 @@ static u32 seq_scale(u32 seq)
 #endif
 
 #if IS_ENABLED(CONFIG_IPV6)
-static u32 secure_tcpv6_ts_off(const __be32 *saddr, const __be32 *daddr)
+u32 secure_tcpv6_ts_off(const __be32 *saddr, const __be32 *daddr)
 {
const struct {
struct in6_addr saddr;
@@ -60,12 +64,14 @@ static u32 secure_tcpv6_ts_off(const __be32 *saddr, const __be32 *daddr)
if (sysctl_tcp_timestamps != 1)
return 0;
 
+   ts_secret_init();
return siphash(&combined, offsetofend(typeof(combined), daddr),
   &ts_secret);
 }
+EXPORT_SYMBOL(secure_tcpv6_ts_off);
 
-u32 secure_tcpv6_seq_and_tsoff(const __be32 *saddr, const __be32 *daddr,
-  __be16 sport, __be16 dport, u32 *tsoff)
+u32 secure_tcpv6_seq(const __be32 *saddr, const __be32 *daddr,
+__be16 sport, __be16 dport)
 {
const struct {
struct in6_addr saddr;
@@ -78,14 +84,14 @@ u32 secure_tcpv6_seq_and_tsoff(const __be32 *saddr, const __be32 *daddr,
.sport = sport,
.dport = dport
};
-   u64 hash;
+   u32 hash;
+
net_secret_init();
hash = siphash(&combined, offsetofend(typeof(combined), dport),

[PATCH iproute2] vxlan: Add support for modifying vxlan device attributes

2017-05-04 Thread Girish Moodalbail
Ability to change vxlan device attributes was added to the kernel through
commit 8bcdc4f3a20b ("vxlan: add changelink support"), however one
cannot do the same through the ip(8) command.  Changing the allowed vxlan
device attributes using 'ip link set dev  type vxlan
' currently fails with 'operation not supported'
error.  This failure is due to the incorrect rtnetlink message
construction for the 'ip link set' operation.

The vxlan_parse_opt() callback function is called for parsing options
for both 'ip link add' and 'ip link set'. For the 'add' case, we pass
down default values for those attributes that were not provided as CLI
options. However, for the 'set' case we should be only passing down the
explicitly provided attributes and not any other (default) attributes.
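The rule in the last paragraph can be sketched in plain C. The names below (`struct toy_opt`, `should_emit`) are hypothetical and much simpler than iplink_vxlan.c; the point is only the decision itself. On create, every attribute is sent, defaults included; on 'set', an attribute goes out only if the user typed it, which is what the new `*_set` booleans track. In the patch the 'set' case is detected as `n->nlmsg_type == RTM_NEWLINK && !(n->nlmsg_flags & NLM_F_CREATE)`; here a plain `is_create` flag plays that role.

```c
#include <assert.h>
#include <stdbool.h>

/* One option as parsed from the command line. */
struct toy_opt {
	unsigned int value;   /* parsed value, or the default if not given */
	bool set;             /* true only if the user supplied it */
};

/* For 'ip link add' (is_create) defaults are sent as well. For
 * 'ip link set' only explicitly provided attributes are sent; otherwise
 * untouched defaults would go out too, and (per the commit message) the
 * request fails with 'operation not supported'. */
static bool should_emit(const struct toy_opt *opt, bool is_create)
{
	return is_create || opt->set;
}
```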

Signed-off-by: Girish Moodalbail 
---
 ip/iplink_vxlan.c | 73 ---
 1 file changed, 59 insertions(+), 14 deletions(-)

diff --git a/ip/iplink_vxlan.c b/ip/iplink_vxlan.c
index b4ebb13..c8959aa 100644
--- a/ip/iplink_vxlan.c
+++ b/ip/iplink_vxlan.c
@@ -72,16 +72,25 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
struct in6_addr daddr6 = IN6ADDR_ANY_INIT;
unsigned int link = 0;
__u8 tos = 0;
+   bool tos_set = false;
__u8 ttl = 0;
+   bool ttl_set = false;
__u32 label = 0;
+   bool label_set = false;
__u8 learning = 1;
+   bool learning_set = false;
__u8 proxy = 0;
+   bool proxy_set = false;
__u8 rsc = 0;
+   bool rsc_set = false;
__u8 l2miss = 0;
+   bool l2miss_set = false;
__u8 l3miss = 0;
+   bool l3miss_set = false;
__u8 noage = 0;
__u32 age = 0;
__u32 maxaddr = 0;
+   bool maxaddr_set = false;
__u16 dstport = 0;
__u8 udpcsum = 0;
bool udpcsum_set = false;
@@ -90,12 +99,17 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
__u8 udp6zerocsumrx = 0;
bool udp6zerocsumrx_set = false;
__u8 remcsumtx = 0;
+   bool remcsumtx_set = false;
__u8 remcsumrx = 0;
+   bool remcsumrx_set = false;
__u8 metadata = 0;
+   bool metadata_set = false;
__u8 gbp = 0;
__u8 gpe = 0;
int dst_port_set = 0;
struct ifla_vxlan_port_range range = { 0, 0 };
+   bool set_op = (n->nlmsg_type == RTM_NEWLINK &&
+  !(n->nlmsg_flags & NLM_F_CREATE));
 
while (argc > 0) {
if (!matches(*argv, "id") ||
@@ -152,6 +166,7 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
invarg("TTL must be <= 255", *argv);
ttl = uval;
}
+   ttl_set = true;
} else if (!matches(*argv, "tos") ||
   !matches(*argv, "dsfield")) {
__u32 uval;
@@ -163,6 +178,7 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
tos = uval;
} else
tos = 1;
+   tos_set = true;
} else if (!matches(*argv, "label") ||
   !matches(*argv, "flowlabel")) {
__u32 uval;
@@ -172,6 +188,7 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
(uval & ~LABEL_MAX_MASK))
invarg("invalid flowlabel", *argv);
label = htonl(uval);
+   label_set = true;
} else if (!matches(*argv, "ageing")) {
NEXT_ARG();
if (strcmp(*argv, "none") == 0)
@@ -184,6 +201,7 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
maxaddr = 0;
else if (get_u32(&maxaddr, *argv, 0))
invarg("max addresses", *argv);
+   maxaddr_set = true;
} else if (!matches(*argv, "port") ||
   !matches(*argv, "srcport")) {
NEXT_ARG();
@@ -199,24 +217,34 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
dst_port_set = 1;
} else if (!matches(*argv, "nolearning")) {
learning = 0;
+   learning_set = true;
} else if (!matches(*argv, "learning")) {
learning = 1;
+   learning_set = true;
} else if (!matches(*argv, "noproxy")) {
proxy = 0;
+   proxy_set = true;
} else if (!matches(*argv, "proxy")) {
proxy = 1;
+   proxy_set = 

[Patch net] ipv4: restore rt->fi for reference counting

2017-05-04 Thread Cong Wang
IPv4 dst could use fi->fib_metrics to store metrics, but fib_info
itself is refcnt'ed, so without taking a refcnt, fi and
fi->fib_metrics could be freed while the dst metrics still point to
it. This triggers a use-after-free, as reported by Andrey twice.

This patch reverts commit 2860583fe840 ("ipv4: Kill rt->fi") to
restore this reference counting. It is a quick fix for -net and
-stable; for -net-next, as Eric suggested, we can consider doing
reference counting for the metrics themselves instead of relying on fib_info.
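A userspace C sketch of the ownership rule the patch restores. The `toy_` names are hypothetical stand-ins for fib_info_hold()/fib_info_put(), rt_init_metrics() and the new code in ipv4_dst_destroy(), and a plain int replaces the kernel's atomic refcount. A route that points at non-default metrics inside a fib_info takes its own reference, so those metrics stay alive until the route itself is destroyed, even after the FIB drops its reference.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for dst_default_metrics: shared and never freed. */
static const unsigned int toy_default_metrics[4];

struct toy_fib_info {
	int refcnt;
	const unsigned int *metrics;  /* own allocation or toy_default_metrics */
};

struct toy_rtable {
	struct toy_fib_info *fi;      /* non-NULL only while we hold a reference */
	const unsigned int *metrics;  /* where the dst reads its metrics from */
};

/* Like fib_info_put(): free fi and its private metrics on the last put. */
static void toy_fib_info_put(struct toy_fib_info *fi)
{
	if (--fi->refcnt == 0) {
		if (fi->metrics != toy_default_metrics)
			free((void *)fi->metrics);
		free(fi);
	}
}

/* Like rt_init_metrics(): hold fi only if its metrics are not the shared
 * defaults, then point the route at them. rt->fi starts out NULL, as the
 * patch ensures in rt_dst_alloc(). */
static void toy_rt_init_metrics(struct toy_rtable *rt, struct toy_fib_info *fi)
{
	rt->fi = NULL;
	if (fi->metrics != toy_default_metrics) {
		fi->refcnt++;
		rt->fi = fi;
	}
	rt->metrics = fi->metrics;
}

/* Like the added block in ipv4_dst_destroy(): drop our reference, if any. */
static void toy_rt_destroy(struct toy_rtable *rt)
{
	if (rt->fi) {
		toy_fib_info_put(rt->fi);
		rt->fi = NULL;
	}
}
```

Without the extra hold, the first put would free the metrics while `rt.metrics` still pointed at them, which is the use-after-free the patch fixes.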

IPv6 is very different, it copies or steals the metrics from mx6_config
in fib6_commit_metrics() so probably doesn't need a refcnt.

Decnet has already done the refcnt'ing, see dn_fib_semantic_match().

Fixes: 2860583fe840 ("ipv4: Kill rt->fi")
Reported-by: Andrey Konovalov 
Tested-by: Andrey Konovalov 
Signed-off-by: Cong Wang 
---
 include/net/route.h |  1 +
 net/ipv4/route.c    | 18 +-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/include/net/route.h b/include/net/route.h
index 2cc0e14..4335eb7 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -69,6 +69,7 @@ struct rtable {
 
struct list_head rt_uncached;
struct uncached_list *rt_uncached_list;
+   struct fib_info *fi; /* for refcnt to shared metrics */
 };
 
 static inline bool rt_is_input_route(const struct rtable *rt)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 655d9ee..f647310 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1387,6 +1387,11 @@ static void ipv4_dst_destroy(struct dst_entry *dst)
 {
struct rtable *rt = (struct rtable *) dst;
 
+   if (rt->fi) {
+   fib_info_put(rt->fi);
+   rt->fi = NULL;
+   }
+
if (!list_empty(&rt->rt_uncached)) {
struct uncached_list *ul = rt->rt_uncached_list;
 
@@ -1424,6 +1429,16 @@ static bool rt_cache_valid(const struct rtable *rt)
!rt_is_expired(rt);
 }
 
+static void rt_init_metrics(struct rtable *rt, struct fib_info *fi)
+{
+   if (fi->fib_metrics != (u32 *)dst_default_metrics) {
+   fib_info_hold(fi);
+   rt->fi = fi;
+   }
+
+   dst_init_metrics(&rt->dst, fi->fib_metrics, true);
+}
+
 static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
   const struct fib_result *res,
   struct fib_nh_exception *fnhe,
@@ -1438,7 +1453,7 @@ static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
rt->rt_gateway = nh->nh_gw;
rt->rt_uses_gateway = 1;
}
-   dst_init_metrics(&rt->dst, fi->fib_metrics, true);
+   rt_init_metrics(rt, fi);
 #ifdef CONFIG_IP_ROUTE_CLASSID
rt->dst.tclassid = nh->nh_tclassid;
 #endif
@@ -1490,6 +1505,7 @@ struct rtable *rt_dst_alloc(struct net_device *dev,
rt->rt_gateway = 0;
rt->rt_uses_gateway = 0;
rt->rt_table_id = 0;
+   rt->fi = NULL;
INIT_LIST_HEAD(&rt->rt_uncached);
 
rt->dst.output = ip_output;
-- 
2.5.5



Re: FEC on i.MX 7 transmit queue timeout

2017-05-04 Thread Stefan Agner
On 2017-05-03 20:08, Andy Duan wrote:
> From: Stefan Agner  Sent: Thursday, May 04, 2017 9:22 AM
>>To: Andy Duan 
>>Cc: fugang.d...@freescale.com; feste...@gmail.com;
>>netdev@vger.kernel.org; netdev-ow...@vger.kernel.org
>>Subject: Re: FEC on i.MX 7 transmit queue timeout
>>
>>Hi Andy,
>>
>>On 2017-04-20 19:48, Andy Duan wrote:
>>> On 2017年04月20日 07:15, Stefan Agner wrote:
 I tested again with imx6sx-fec compatible string. I could reproduce
 it on a Colibri with i.MX 7Dual. But not always: It really depends
 whether queue 2 is counting up or not. Just after boot, I check
 /proc/interrupts twice, if queue 2 is counting it will happen!

 But if only queue 0 is mostly in use, then it seems to work just fine.
>>> If your case is only running best-effort traffic like TCP/UDP, you can set
>>> "fsl,num-tx-queues" and "fsl,num-rx-queues" to 1 in the board dts file.
>>> The other two queues are for AVB audio/video traffic; they have higher
>>> priority than queue 0. If running an iperf TCP test on all three queues,
>>> TCP segments may arrive out of order, which causes the net watchdog
>>> timeout.

 I also tried i.MX 7Dual SabreSD here, and the same thing. I had to
 reboot 3 times, then queue 2 was counting:
   57:  8 GIC-0 150 Level 30be.ethernet
   58:  20137 GIC-0 151 Level 30be.ethernet
   59:   9269 GIC-0 152 Level 30be.ethernet

 It took me about 40 minutes on Sabre until it happened, and I had to
 force it using iperf, but then I got the ring dumps:
>>> My board had ran more than 47 hours with nfs rootfs in 4.11.0-rc6, but
>>> not running iperf.
>>> I am testing with iperf.
>>
>>Any update on this issue?
>>
>>When using iperf (server) on the board with Linux 4.11 the issue appears
>>within a few iperf iterations on a Sabre (TO 1.2, Board Rev C, if that 
>>matters)...
>>
> I don't know whether you received my last mail (maybe it failed, since I
> received some rejection mails).

I think I did not... The last email I received was Fri, 21 Apr 2017
02:48:23 UTC.

 
> If your case is only running best-effort traffic like TCP/UDP, you can set
> "fsl,num-tx-queues" and "fsl,num-rx-queues" to 1 in the board dts
> file.

I did test that, and it seems to work fine with those properties set to
1.

> The other two queues are for AVB audio/video traffic; they have higher
> priority than queue 0. If running an iperf TCP test on all three queues,
> TCP segments may arrive out of order, which causes the net watchdog
> timeout.

Okay. A single event would be understandable, but it seems to enter some
kind of loop after that (continuously printing "fec 30be.ethernet
eth0: TX ring dump ...").

In a quick test I commented out the fec_dump call; with that it seems to
print only once and continues working afterwards (although speed starts
to decrease, so something is not right at that point).

> In the fsl kernel tree, there is a patch that selects only queue 0
> for best-effort traffic like TCP/UDP. Please test again on your board;
> if there is no problem, I will upstream the patch.

That sounds like a reasonable fix.

IP, no matter whether TCP/UDP, is the most common use case, so IMHO this
should "just work" by default.
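The policy discussed here (best-effort traffic on queue 0 only, the other two queues reserved for AVB) could look roughly like the C sketch below. It is not the fsl patch: the priority-to-queue table and the function name are hypothetical, chosen only to show that untagged TCP/UDP never lands on the AVB queues.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical map from 802.1Q priority (PCP, 0..7) to TX queue:
 * low priorities stay on the best-effort queue 0, higher ones use the
 * two AVB queues. */
static const uint8_t toy_pcp_to_queue[8] = { 0, 0, 0, 0, 1, 1, 2, 2 };

/* Pick a TX queue for a frame. Untagged traffic (plain TCP/UDP) always
 * uses queue 0, so best-effort flows never share the AVB queues. */
static unsigned int toy_select_queue(bool vlan_tagged, uint16_t vlan_tci)
{
	if (!vlan_tagged)
		return 0;
	/* PCP is the top 3 bits of the 16-bit VLAN TCI. */
	return toy_pcp_to_queue[(vlan_tci >> 13) & 0x7];
}
```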

--
Stefan


Re: [PATCH 1/2] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING

2017-05-04 Thread Casey Leedom
| From: Alexander Duyck 
| Sent: Wednesday, May 3, 2017 9:02 AM
| ...
| It sounds like we are more or less in agreement. My only concern is
| really what we default this to. On x86 I would say we could probably
| default this to disabled for existing platforms since my understanding
| is that relaxed ordering doesn't provide much benefit on what is out
| there right now when performing DMA through the root complex. As far
| as peer-to-peer I would say we should probably look at enabling the
| ability to have Relaxed Ordering enabled for some channels but not
| others. In those cases the hardware needs to be smart enough to allow
| for you to indicate you want it disabled by default for most of your
| DMA channels, and then enabled for the select channels that are
| handling the peer-to-peer traffic.

  Yes, I think that we are mostly in agreement.  I had just wanted to make
sure that whatever scheme was developed would allow for simultaneously
supporting non-Relaxed Ordering for some PCIe End Points and Relaxed
Ordering for others within the same system.  I.e. not simply
enabling/disabling/etc.  based solely on System Platform Architecture.

  By the way, I've started our QA folks off looking at what things look like
in Linux Virtual Machines under different Hypervisors to see what
information they may provide to the VM in the way of what Root Complex Port
is being used, etc.  So far they've got Windows Hyper-V done, and there
is no PCIe Fabric exposed in any way: just the attached device.  I'll
have to see what pci_find_pcie_root_port() returns in that environment.
Maybe NULL?

  With your reservations (which I also share), I think that it probably
makes sense to have a per-architecture definition of the "Can I Use Relaxed
Ordering With TLPs Directed At This End Point" predicate, with the default
being "No" for any architecture which doesn't implement the predicate.  And
if the specified (struct pci_dev *) End Node is NULL, it ought to return
False for that as well.  I can't see any reason to pass in the Source End
Node but I may be missing something.
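A rough C sketch of the predicate proposed above, assuming a hypothetical per-architecture hook: if the architecture does not provide one, or the end node is NULL, the answer is "no". `struct toy_pci_dev`, the hook pointer, the example hook and its vendor/device IDs are all hypothetical placeholders, not existing kernel APIs or real devices.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_pci_dev {
	unsigned short vendor;
	unsigned short device;
};

/* Hypothetical per-architecture hook; NULL means the architecture has
 * not implemented the check. */
typedef bool (*toy_ro_hook_t)(const struct toy_pci_dev *end_node);
static toy_ro_hook_t toy_arch_ro_hook;

/* "Can I use Relaxed Ordering with TLPs directed at this end point?"
 * Defaults to false when there is no hook or no end node (for example,
 * a VM where no PCIe root port is visible). */
static bool toy_can_use_relaxed_ordering(const struct toy_pci_dev *end_node)
{
	if (!end_node || !toy_arch_ro_hook)
		return false;
	return toy_arch_ro_hook(end_node);
}

/* Example hook an architecture might register: deny Relaxed Ordering for
 * one placeholder vendor/device pair, allow it otherwise. */
static bool toy_example_hook(const struct toy_pci_dev *end_node)
{
	return !(end_node->vendor == 0x1234 && end_node->device == 0x5678);
}
```

Only the end node is passed in, matching the observation that the source end node does not seem necessary.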

  At this point, this is pretty far outside my level of expertise.  I'm
happy to give it a go, but I'd be even happier if someone with a lot more
experience in the PCIe Infrastructure were to want to carry the ball
forward.  I'm not super familiar with the Linux Kernel "Rules Of
Engagement", so let me know what my next step should be.  Thanks.

Casey


Re: [RFC iproute2 0/8] RDMA tool

2017-05-04 Thread Doug Ledford
On Thu, 2017-05-04 at 15:26 -0400, Dennis Dalessandro wrote:
> On 05/04/2017 02:45 PM, Leon Romanovsky wrote:
> > 
> > On Thu, May 04, 2017 at 06:30:27PM +, Bart Van Assche wrote:
> > > 
> > > On Thu, 2017-05-04 at 21:25 +0300, Leon Romanovsky wrote:
> > > > 
> > > > On Thu, May 04, 2017 at 06:10:54PM +, Bart Van Assche
> > > > wrote:
> > > > > 
> > > > > On Thu, 2017-05-04 at 21:02 +0300, Leon Romanovsky wrote:
> > > > > > 
> > > > > > Following our discussion both on the mailing list [1] and at
> > > > > > LPC 2016 [2], we would like to propose this RDMA tool as part
> > > > > > of the iproute2 package and finally improve this situation.
> > > > > 
> > > > > Hello Leon,
> > > > > 
> > > > > Although I really appreciate your work, can you clarify why
> > > > > you would like to add *RDMA* functionality to an *IP routing*
> > > > > tool? I haven't found any motivation for adding RDMA
> > > > > functionality to iproute2 in [1].
> > > > 
> > > > We are planning to reuse the same infrastructure provided by
> > > > iproute2, like netlink parsing, access to distributions, the same
> > > > CLI and the same standards.
> > > > 
> > > > Right now, RDMA is already tied to netdev: iWARP, RoCE,
> > > > IPoIB, HFI-VNIC.
> > > > Many drivers (mlx, qed, i40, cxgb) share code between net and
> > > > RDMA.
> > > > 
> > > > I do expect that iproute2 will be installed on every machine
> > > > with any type of connection, including IB and OPA.
> > > > 
> > > > So I think it is enough to be part of that suite rather than
> > > > invent our own for one specific tool.
> > > 
> > > Hello Leon,
> > > 
> > > Sorry, but to me that sounds like a weak argument for including
> > > RDMA functionality in iproute2. There is already a library for
> > > communication over netlink sockets, namely libnl. Is there
> > > functionality that is in iproute2 but not in libnl and that is
> > > needed for the new tool? If so, have you considered creating a
> > > new library for that functionality?
> > 
> > It is not hard to create a new tool; the hardest part is to ensure
> > that it is part of the distributions. Did you count how many months
> > we have been trying to add rdma-core to Debian?
> 
> I do agree that it is a strange pairing and am not really a fan.
> However, at the end of the day it's just a name for a repo/package. If
> the iproute folks are fine with including rdma in their repo/package,
> great, we can leverage their code for CLI and other common stuff.

If you look into the iproute2 package, it becomes clear that the name
iproute2 is historical and not really accurate any more.  It contains
things like the bridge control software, tc for controlling send
queues, and many things network related but not routing related.  The
rdma tool is a perfectly fine fit in the sense that it is an additional
network management tool IMO.

For reference, here's the list of stuff already in iproute on my Fedora
24 box:

/usr/sbin/arpd
/usr/sbin/bridge
/usr/sbin/cbq
/usr/sbin/ctstat
/usr/sbin/genl
/usr/sbin/ifcfg
/usr/sbin/ifstat
/usr/sbin/ip
/usr/sbin/lnstat
/usr/sbin/nstat
/usr/sbin/routef
/usr/sbin/routel
/usr/sbin/rtacct
/usr/sbin/rtmon
/usr/sbin/rtpr
/usr/sbin/rtstat
/usr/sbin/ss
/usr/sbin/tc
/usr/sbin/tipc

And in fact, if you check, tipc is almost similar to RDMA ;-)  So, I
suggest people not get hung up on the name iproute2, the fit is fine
when you look deeper into the nature of the package.

-- 
Doug Ledford 
    GPG KeyID: B826A3330E572FDD
   
Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD



Re: [RFC] iproute: Add support for extended ack to rtnl_talk

2017-05-04 Thread Phil Sutter
Hi,

On Thu, May 04, 2017 at 09:43:56AM -0700, Stephen Hemminger wrote:
> On Thu, 04 May 2017 10:41:03 -0400 (EDT)
> David Miller  wrote:
> 
> > From: David Ahern 
> > Date: Thu, 4 May 2017 08:27:35 -0600
> > 
> > > On 5/4/17 3:36 AM, Daniel Borkmann wrote:  
> > >> What is the clear benefit/rationale of outsourcing this to
> > >> libmnl? I was always under the impression we should strive for as few
> > >> dependencies as possible?
> > > 
> > > +1  
> > 
> > Agreed, all else being equal iproute2 should be as self contained
> > as possible since it is such a fundamental tool.
> 
> Sorry, the old netlink code is more difficult to understand than libmnl.
> Having dependency on a library is not a problem. There already is
> an alternative implementation of ip commands in busybox for those
> people trying to work in small environments.

I second that. If you can't afford the extra ~24KB of libmnl on your
system, you certainly can't afford the 20 times bigger ip binary,
either.

Regarding conversion to libmnl, which I investigated and started working
on once: My gut feeling back then was that it's not quite worth the
effort since iproute2 requires an intermediate layer of functions anyway.
Another detail which I didn't like that much was libmnl's idiom of
creating netlink messages from just a plain buffer and using
mnl_nlmsg_put_header() et al. to populate it with data. I'm probably a
bit biased since I did the conversion to c99-style initializers for the
various struct req data types, but I didn't like the added run-time
overhead to achieve just the same.

So in summary, given that very little change happens to iproute2's
internal libnetlink, I don't see much urge to make it use libmnl as a
backend. In my opinion it just adds another potential source of errors.

Eventually this should be a maintainer level decision, though. :)

Cheers, Phil
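To make the comparison above concrete, here is a minimal userspace sketch, assuming the Linux UAPI headers <linux/netlink.h> and <linux/rtnetlink.h>, that builds the same RTM_GETLINK dump request two ways. The first uses a C99-initialized request struct in the style of iproute2's struct req types; the second fills a plain buffer at run time in the spirit of the libmnl idiom (mnl_nlmsg_put_header() and friends). No libmnl functions are called; struct link_req, make_req_c99() and build_req_buffer() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>        /* AF_UNSPEC */
#include <linux/netlink.h>     /* struct nlmsghdr, NLMSG_* macros */
#include <linux/rtnetlink.h>   /* struct ifinfomsg, RTM_GETLINK */

/* Style 1: layout fixed at compile time (iproute2-like struct req). */
struct link_req {
	struct nlmsghdr nlh;
	struct ifinfomsg ifm;
};

static struct link_req make_req_c99(unsigned int seq)
{
	struct link_req req = {
		.nlh.nlmsg_len   = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
		.nlh.nlmsg_type  = RTM_GETLINK,
		.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP,
		.nlh.nlmsg_seq   = seq,
		.ifm.ifi_family  = AF_UNSPEC,
	};
	return req;
}

/* Style 2: start from a plain buffer and populate the header and payload
 * at run time, growing nlmsg_len as each part is added (written by hand
 * here, libmnl-like in spirit only). Returns the message length, or 0 if
 * the buffer is too small. */
static size_t build_req_buffer(void *buf, size_t len, unsigned int seq)
{
	struct nlmsghdr *nlh = buf;
	struct ifinfomsg *ifm;

	if (len < NLMSG_LENGTH(sizeof(*ifm)))
		return 0;
	memset(buf, 0, NLMSG_LENGTH(sizeof(*ifm)));

	nlh->nlmsg_len   = NLMSG_HDRLEN;
	nlh->nlmsg_type  = RTM_GETLINK;
	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
	nlh->nlmsg_seq   = seq;

	ifm = NLMSG_DATA(nlh);        /* payload follows the aligned header */
	ifm->ifi_family = AF_UNSPEC;
	nlh->nlmsg_len += NLMSG_ALIGN(sizeof(*ifm));

	return nlh->nlmsg_len;
}
```

Both produce the same 32-byte message; the second computes lengths and offsets at run time, which is the overhead Phil refers to.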


Re: [PATCH] aquantia: Fix "ethtool -S" crash when adapter down.

2017-05-04 Thread David Arcari
On 05/04/2017 04:10 PM, Pavel Belous wrote:
> From: Pavel Belous 
> 
> This patch fixes the crash that happens when the driver tries to collect statistics
> from an already released "aq_vec" object.
> If the adapter is in the "down" state, we still allow the user to see statistics from HW.
> 
> V2: fixed braces around "aq_vec_free".
> 
> Fixes: 97bde5c4f909 ("net: ethernet: aquantia: Support for NIC-specific code")
> Signed-off-by: Pavel Belous 
> ---
>  drivers/net/ethernet/aquantia/atlantic/aq_nic.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
> index cdb0299..9ee1c50 100644
> --- a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
> +++ b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
> @@ -755,7 +755,7 @@ void aq_nic_get_stats(struct aq_nic_s *self, u64 *data)
>   count = 0U;
>  
>   for (i = 0U, aq_vec = self->aq_vec[0];
> - self->aq_vecs > i; ++i, aq_vec = self->aq_vec[i]) {
> + aq_vec && self->aq_vecs > i; ++i, aq_vec = self->aq_vec[i]) {
>   data += count;
>   aq_vec_get_sw_stats(aq_vec, data, );
>   }
> @@ -959,8 +959,10 @@ void aq_nic_free_hot_resources(struct aq_nic_s *self)
>   goto err_exit;
>  
>   for (i = AQ_DIMOF(self->aq_vec); i--;) {
> - if (self->aq_vec[i])
> + if (self->aq_vec[i]) {
>   aq_vec_free(self->aq_vec[i]);
> + self->aq_vec[i] = NULL;
> + }
>   }
>  
>  err_exit:;
> 

Resolves the ethtool crash.

Tested-by: David Arcari 



[PATCH 4/6] cxgb4: Replace seven seq_puts() calls by seq_putc()

2017-05-04 Thread SF Markus Elfring
From: Markus Elfring 
Date: Thu, 4 May 2017 21:40:54 +0200

Seven single characters (line breaks) should be put into a sequence.
Thus use the corresponding function "seq_putc".

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c | 24 +++---
 1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
index 1fa34b009891..2bc40d89f874 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
@@ -278,7 +278,7 @@ static int cim_ma_la_show(struct seq_file *seq, void *v, int idx)
const u32 *p = v;
 
if (v == SEQ_START_TOKEN) {
-   seq_puts(seq, "\n");
+   seq_putc(seq, '\n');
} else if (idx < CIM_MALA_SIZE) {
seq_printf(seq, "%02x%08x%08x%08x%08x\n",
   p[4], p[3], p[2], p[1], p[0]);
@@ -1196,7 +1196,7 @@ static int mboxlog_show(struct seq_file *seq, void *v)
 
seq_printf(seq, "  %08x %08x", hi, lo);
}
-   seq_puts(seq, "\n");
+   seq_putc(seq, '\n');
return 0;
 }
 
@@ -2112,9 +2112,7 @@ static int rss_config_show(struct seq_file *seq, void *v)
HASHTOEPLITZ_F));
seq_printf(seq, "  Udp4En:%3s\n", yesno(rssconf & UDPENABLE_F));
seq_printf(seq, "  Disable:   %3s\n", yesno(rssconf & DISABLE_F));
-
-   seq_puts(seq, "\n");
-
+   seq_putc(seq, '\n');
rssconf = t4_read_reg(adapter, TP_RSS_CONFIG_TNL_A);
seq_printf(seq, "TP_RSS_CONFIG_TNL: %#x\n", rssconf);
seq_printf(seq, "  MaskSize:  %3d\n", MASKSIZE_G(rssconf));
@@ -2126,25 +2124,19 @@ static int rss_config_show(struct seq_file *seq, void *v)
   yesno(rssconf & HASHETH_F));
}
seq_printf(seq, "  UseWireCh: %3s\n", yesno(rssconf & USEWIRECH_F));
-
-   seq_puts(seq, "\n");
-
+   seq_putc(seq, '\n');
rssconf = t4_read_reg(adapter, TP_RSS_CONFIG_OFD_A);
seq_printf(seq, "TP_RSS_CONFIG_OFD: %#x\n", rssconf);
seq_printf(seq, "  MaskSize:  %3d\n", MASKSIZE_G(rssconf));
seq_printf(seq, "  RRCplMapEn:%3s\n", yesno(rssconf &
RRCPLMAPEN_F));
seq_printf(seq, "  RRCplQueWidth: %3d\n", RRCPLQUEWIDTH_G(rssconf));
-
-   seq_puts(seq, "\n");
-
+   seq_putc(seq, '\n');
rssconf = t4_read_reg(adapter, TP_RSS_CONFIG_SYN_A);
seq_printf(seq, "TP_RSS_CONFIG_SYN: %#x\n", rssconf);
seq_printf(seq, "  MaskSize:  %3d\n", MASKSIZE_G(rssconf));
seq_printf(seq, "  UseWireCh: %3s\n", yesno(rssconf & USEWIRECH_F));
-
-   seq_puts(seq, "\n");
-
+   seq_putc(seq, '\n');
rssconf = t4_read_reg(adapter, TP_RSS_CONFIG_VRT_A);
seq_printf(seq, "TP_RSS_CONFIG_VRT: %#x\n", rssconf);
if (CHELSIO_CHIP_VERSION(adapter->params.chip) > CHELSIO_T5) {
@@ -2170,9 +2162,7 @@ static int rss_config_show(struct seq_file *seq, void *v)
seq_printf(seq, "  VfWrEn:%3s\n", yesno(rssconf & VFWREN_F));
seq_printf(seq, "  KeyWrEn:   %3s\n", yesno(rssconf & KEYWREN_F));
seq_printf(seq, "  KeyWrAddr: %3d\n", KEYWRADDR_G(rssconf));
-
-   seq_puts(seq, "\n");
-
+   seq_putc(seq, '\n');
rssconf = t4_read_reg(adapter, TP_RSS_CONFIG_CNG_A);
seq_printf(seq, "TP_RSS_CONFIG_CNG: %#x\n", rssconf);
seq_printf(seq, "  ChnCount3: %3s\n", yesno(rssconf & CHNCOUNT3_F));
-- 
2.12.2
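A minimal userspace sketch of why this substitution preserves behavior. struct mock_seq, mock_seq_puts() and mock_seq_putc() are hypothetical stand-ins for a seq_file buffer and the kernel's seq_puts()/seq_putc(), not their implementation: appending "\n" with either call produces the same bytes, and in this mock the single-character call skips the strlen() of a one-character string.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a seq_file output buffer. */
struct mock_seq {
	char buf[128];
	size_t count;
	int overflow;
};

static void mock_seq_putc(struct mock_seq *m, char c)
{
	if (m->count + 1 >= sizeof(m->buf)) {   /* keep room for the NUL */
		m->overflow = 1;
		return;
	}
	m->buf[m->count++] = c;
	m->buf[m->count] = '\0';
}

static void mock_seq_puts(struct mock_seq *m, const char *s)
{
	size_t len = strlen(s);   /* the extra work a putc-style call avoids */

	if (m->count + len >= sizeof(m->buf)) {
		m->overflow = 1;
		return;
	}
	memcpy(m->buf + m->count, s, len + 1);   /* copy including the NUL */
	m->count += len;
}
```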



[PATCH 6/6] cxgb4: Combine substrings for two messages

2017-05-04 Thread SF Markus Elfring
From: Markus Elfring 
Date: Thu, 4 May 2017 22:16:57 +0200

The script "checkpatch.pl" pointed information out like the following.

WARNING: quoted string split across lines

Thus fix two source code places.

Signed-off-by: Markus Elfring 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
index 32add8dfc253..f9384d8b680d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
@@ -1353,8 +1353,9 @@ static int mps_trc_show(struct seq_file *seq, void *v)
if (tp.port < 8) {
i = adap->chan_map[tp.port & 3];
if (i >= MAX_NPORTS) {
-   dev_err(adap->pdev_dev, "tracer %u is assigned "
-   "to non-existing port\n", trcidx);
+   dev_err(adap->pdev_dev,
+   "tracer %u is assigned to non-existing port\n",
+   trcidx);
return -EINVAL;
}
seq_printf(seq, "tracer is capturing %s %s, ",
@@ -1798,11 +1799,11 @@ static int mps_tcam_show(struct seq_file *seq, void *v)
  FW_LDST_CMD_IDX_V(idx));
ret = t4_wr_mbox(adap, adap->mbox, _cmd,
 sizeof(ldst_cmd), _cmd);
-   if (ret)
-   dev_warn(adap->pdev_dev, "Can't read MPS "
-"replication map for idx %d: %d\n",
+   if (ret) {
+   dev_warn(adap->pdev_dev,
+"Can't read MPS replication map for idx %d: %d\n",
 idx, -ret);
-   else {
+   } else {
mps_rplc = ldst_cmd.u.mps.rplc;
rplc[0] = ntohl(mps_rplc.rplc31_0);
rplc[1] = ntohl(mps_rplc.rplc63_32);
-- 
2.12.2
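Patches 6/6 and 2/6 rely on the C rule that adjacent string literals are concatenated at translation time, so joining them does not change the message text; the gain is that the complete message becomes greppable. The sketch also shows the flip side: concatenation inserts no separator, which is how a split such as "inserted," "forcing TWINAX\n" yields "inserted,forcing" (the joined literal in patch 2/6 keeps that missing space). The array names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Split the way checkpatch.pl warns about: the compiler joins adjacent
 * literals, so both arrays hold identical text. */
static const char split_msg[] = "tracer %u is assigned "
				"to non-existing port\n";
static const char joined_msg[] = "tracer %u is assigned to non-existing port\n";

/* Concatenation adds no separator: the missing space after the comma is
 * easy to overlook while the literal is split across lines. */
static const char split_nospace[] = "%s: unknown port module inserted,"
				    "forcing TWINAX\n";
```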



[PATCH 5/6] cxgb4: Use seq_puts() in cim_qcfg_show()

2017-05-04 Thread SF Markus Elfring
From: Markus Elfring 
Date: Thu, 4 May 2017 21:52:32 +0200

A string which does not contain a data format specification should be put
into a sequence. Thus use the corresponding function "seq_puts".

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
index 2bc40d89f874..32add8dfc253 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
@@ -357,9 +357,8 @@ static int cim_qcfg_show(struct seq_file *seq, void *v)
return i;
 
t4_read_cimq_cfg(adap, base, size, thres);
-
-   seq_printf(seq,
-  "  Queue  Base  Size Thres  RdPtr WrPtr  SOP  EOP Avail\n");
+   seq_puts(seq,
+"  Queue  Base  Size Thres  RdPtr WrPtr  SOP  EOP Avail\n");
for (i = 0; i < CIM_NUM_IBQ; i++, p += 4)
seq_printf(seq, "%7s %5x %5u %5u %6x  %4x %4u %4u %5u\n",
   qname[i], base[i], size[i], thres[i],
-- 
2.12.2
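The reasoning behind this patch, in a userspace sketch using standard snprintf(): a printf-style call whose format string contains no conversion specification reproduces the string verbatim, so a puts-style call gives the same output without parsing a format. A literal containing "%%" is where the two would diverge. CIMQ_HEADER and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Header line from cim_qcfg_show(); it has no '%' conversion. */
#define CIMQ_HEADER \
	"  Queue  Base  Size Thres  RdPtr WrPtr  SOP  EOP Avail\n"

/* Returns 1 if printf-style formatting yields the literal unchanged,
 * i.e. a puts-style call would print exactly the same text. */
static int header_formats_verbatim(void)
{
	char out[128];

	snprintf(out, sizeof(out), CIMQ_HEADER);
	return strcmp(out, CIMQ_HEADER) == 0;
}

/* "%%" is printed as a single '%', so this literal is NOT
 * interchangeable with a puts-style call. */
static int percent_formats_verbatim(void)
{
	char out[32];

	snprintf(out, sizeof(out), "50%% done\n");
	return strcmp(out, "50%% done\n") == 0;
}
```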



[PATCH 3/6] cxgb4vf: Adjust five checks for null pointers

2017-05-04 Thread SF Markus Elfring
From: Markus Elfring 
Date: Thu, 4 May 2017 21:20:25 +0200

The script “checkpatch.pl” pointed information out like the following.

Comparison to NULL could be written !…

Thus fix the affected source code places.

Signed-off-by: Markus Elfring 
---
 drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
index 9c2690aeb32b..682e844c5a7d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
@@ -491,7 +491,7 @@ static int fwevtq_handler(struct sge_rspq *rspq, const __be64 *rsp,
break;
}
tq = s->egr_map[eq_idx];
-   if (unlikely(tq == NULL)) {
+   if (unlikely(!tq)) {
dev_err(adapter->pdev_dev,
"Egress Update QID %d TXQ=NULL\n", qid);
break;
@@ -2939,7 +2939,7 @@ static int cxgb4vf_pci_probe(struct pci_dev *pdev,
 */
netdev = alloc_etherdev_mq(sizeof(struct port_info),
   MAX_PORT_QSETS);
-   if (netdev == NULL) {
+   if (!netdev) {
t4vf_free_vi(adapter, viid);
err = -ENOMEM;
goto err_free_dev;
@@ -3053,7 +3053,7 @@ static int cxgb4vf_pci_probe(struct pci_dev *pdev,
for_each_port(adapter, pidx) {
struct port_info *pi = netdev_priv(adapter->port[pidx]);
netdev = adapter->port[pidx];
-   if (netdev == NULL)
+   if (!netdev)
continue;
 
netif_set_real_num_tx_queues(netdev, pi->nqsets);
@@ -3120,7 +3120,7 @@ static int cxgb4vf_pci_probe(struct pci_dev *pdev,
 err_free_dev:
for_each_port(adapter, pidx) {
netdev = adapter->port[pidx];
-   if (netdev == NULL)
+   if (!netdev)
continue;
pi = netdev_priv(netdev);
t4vf_free_vi(adapter, pi->viid);
@@ -3197,7 +3197,7 @@ static void cxgb4vf_pci_remove(struct pci_dev *pdev)
struct net_device *netdev = adapter->port[pidx];
struct port_info *pi;
 
-   if (netdev == NULL)
+   if (!netdev)
continue;
 
pi = netdev_priv(netdev);
-- 
2.12.2
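In C a null pointer compares equal to 0, so !p and p == NULL test the same condition; checkpatch.pl only prefers the shorter spelling. In its common configuration the kernel's unlikely() expands to __builtin_expect(!!(x), 0); the sketch defines a local copy (my_unlikely, assuming GCC or Clang) and hypothetical helpers to show that both spellings behave identically under it.

```c
#include <assert.h>
#include <stddef.h>

/* Local copy of the kernel's branch hint (GCC/Clang builtin). */
#define my_unlikely(x) __builtin_expect(!!(x), 0)

/* Same condition, old spelling. */
static int is_missing_verbose(const void *p)
{
	if (my_unlikely(p == NULL))
		return 1;
	return 0;
}

/* Same condition, the spelling checkpatch.pl prefers. */
static int is_missing_terse(const void *p)
{
	if (my_unlikely(!p))
		return 1;
	return 0;
}
```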



[PATCH 2/6] cxgb4vf: Combine substrings for 24 messages

2017-05-04 Thread SF Markus Elfring
From: Markus Elfring 
Date: Thu, 4 May 2017 21:00:20 +0200

The script "checkpatch.pl" pointed information out like the following.

WARNING: quoted string split across lines

Thus fix the affected source code places.

Signed-off-by: Markus Elfring 
---
 .../net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c| 113 -
 1 file changed, 64 insertions(+), 49 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
index 4ac9316f3081..9c2690aeb32b 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
@@ -226,17 +226,20 @@ void t4vf_os_portmod_changed(struct adapter *adapter, int pidx)
dev_info(adapter->pdev_dev, "%s: %s port module inserted\n",
 dev->name, mod_str[pi->mod_type]);
else if (pi->mod_type == FW_PORT_MOD_TYPE_NOTSUPPORTED)
-   dev_info(adapter->pdev_dev, "%s: unsupported optical port "
-"module inserted\n", dev->name);
+   dev_info(adapter->pdev_dev,
+"%s: unsupported optical port module inserted\n",
+dev->name);
else if (pi->mod_type == FW_PORT_MOD_TYPE_UNKNOWN)
-   dev_info(adapter->pdev_dev, "%s: unknown port module inserted,"
-"forcing TWINAX\n", dev->name);
+   dev_info(adapter->pdev_dev,
+"%s: unknown port module inserted,forcing TWINAX\n",
+dev->name);
else if (pi->mod_type == FW_PORT_MOD_TYPE_ERROR)
dev_info(adapter->pdev_dev, "%s: transceiver module error\n",
 dev->name);
else
-   dev_info(adapter->pdev_dev, "%s: unknown module type %d "
-"inserted\n", dev->name, pi->mod_type);
+   dev_info(adapter->pdev_dev,
+"%s: unknown module type %d inserted\n",
+dev->name, pi->mod_type);
 }
 
 /*
@@ -2357,8 +2360,9 @@ static void size_nports_qsets(struct adapter *adapter)
 */
adapter->params.nports = vfres->nvi;
if (adapter->params.nports > MAX_NPORTS) {
-   dev_warn(adapter->pdev_dev, "only using %d of %d maximum"
-" allowed virtual interfaces\n", MAX_NPORTS,
+   dev_warn(adapter->pdev_dev,
+"only using %d of %d maximum allowed virtual interfaces\n",
+MAX_NPORTS,
 adapter->params.nports);
adapter->params.nports = MAX_NPORTS;
}
@@ -2370,9 +2374,9 @@ static void size_nports_qsets(struct adapter *adapter)
 */
pmask_nports = hweight32(adapter->params.vfres.pmask);
if (pmask_nports < adapter->params.nports) {
-   dev_warn(adapter->pdev_dev, "only using %d of %d provisioned"
-" virtual interfaces; limited by Port Access Rights"
-" mask %#x\n", pmask_nports, adapter->params.nports,
+   dev_warn(adapter->pdev_dev,
+"only using %d of %d provisioned virtual interfaces; limited by Port Access Rights mask %#x\n",
+pmask_nports, adapter->params.nports,
 adapter->params.vfres.pmask);
adapter->params.nports = pmask_nports;
}
@@ -2403,8 +2407,8 @@ static void size_nports_qsets(struct adapter *adapter)
adapter->sge.max_ethqsets = ethqsets;
 
if (adapter->sge.max_ethqsets < adapter->params.nports) {
-   dev_warn(adapter->pdev_dev, "only using %d of %d available"
-" virtual interfaces (too few Queue Sets)\n",
+   dev_warn(adapter->pdev_dev,
+"only using %d of %d available virtual interfaces (too few Queue Sets)\n",
 adapter->sge.max_ethqsets, adapter->params.nports);
adapter->params.nports = adapter->sge.max_ethqsets;
}
@@ -2448,38 +2452,44 @@ static int adap_init0(struct adapter *adapter)
 */
err = t4vf_get_dev_params(adapter);
if (err) {
-   dev_err(adapter->pdev_dev, "unable to retrieve adapter"
-   " device parameters: err=%d\n", err);
+   dev_err(adapter->pdev_dev,
+   "unable to retrieve adapter device parameters: err=%d\n",
+   err);
return err;
}
err = t4vf_get_vpd_params(adapter);
if (err) {
-   dev_err(adapter->pdev_dev, "unable to retrieve adapter"
-   " VPD parameters: err=%d\n", err);
+   dev_err(adapter->pdev_dev,
+   "unable to retrieve adapter VPD parameters: err=%d\n",
+   

[PATCH 1/6] cxgb4vf: Use seq_putc() in mboxlog_show()

2017-05-04 Thread SF Markus Elfring
From: Markus Elfring 
Date: Thu, 4 May 2017 20:02:04 +0200

A single character (line break) should be put into a sequence.
Thus use the corresponding function "seq_putc".

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 
---
 drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
index ac7a150c54e9..4ac9316f3081 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
@@ -1818,7 +1818,7 @@ static int mboxlog_show(struct seq_file *seq, void *v)
 
seq_printf(seq, "  %08x %08x", hi, lo);
}
-   seq_puts(seq, "\n");
+   seq_putc(seq, '\n');
return 0;
 }
 
-- 
2.12.2



[PATCH 0/6] cxgb4: Fine-tuning for some function implementations

2017-05-04 Thread SF Markus Elfring
From: Markus Elfring 
Date: Thu, 4 May 2017 22:23:45 +0200

A few update suggestions were taken into account
from static source code analysis.

Markus Elfring (6):
  Use seq_putc() in mboxlog_show()
  Combine substrings for 24 messages
  Adjust five checks for null pointers
  Replace seven seq_puts() calls by seq_putc()
  Use seq_puts() in cim_qcfg_show()
  Combine substrings for two messages

 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c |  42 +++
 .../net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c| 125 -
 2 files changed, 86 insertions(+), 81 deletions(-)

-- 
2.12.2



Re: [PATCH] aquantia: Fix "ethtool -S" crash when adapter down.

2017-05-04 Thread Joe Perches
On Thu, 2017-05-04 at 23:10 +0300, Pavel Belous wrote:
> diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
[]
> @@ -959,8 +959,10 @@ void aq_nic_free_hot_resources(struct aq_nic_s *self)
>   goto err_exit;
>  
>   for (i = AQ_DIMOF(self->aq_vec); i--;) {
> - if (self->aq_vec[i])
> + if (self->aq_vec[i]) {
>   aq_vec_free(self->aq_vec[i]);
> + self->aq_vec[i] = NULL;
> + }
>   }
>  
>  err_exit:;

unrelated style trivia:

err_exit:;

the error label then :; is pretty odd.

Also casting the return value to void is really odd.
A simple return instead of goto would be more common.

drivers/net/ethernet/aquantia/atlantic/aq_ring.c:311:err_exit:;
drivers/net/ethernet/aquantia/atlantic/aq_ring.c:326:err_exit:;
drivers/net/ethernet/aquantia/atlantic/aq_main.c:161:err_exit:;
drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c:277:err_exit:;
drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c:313:err_exit:;
drivers/net/ethernet/aquantia/atlantic/aq_nic.c:304:err_exit:;
drivers/net/ethernet/aquantia/atlantic/aq_nic.c:763:err_exit:;
drivers/net/ethernet/aquantia/atlantic/aq_nic.c:951:err_exit:;
drivers/net/ethernet/aquantia/atlantic/aq_nic.c:966:err_exit:;
drivers/net/ethernet/aquantia/atlantic/aq_vec.c:281:err_exit:;
drivers/net/ethernet/aquantia/atlantic/aq_vec.c:302:err_exit:;
drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c:251:err_exit:;
drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c:270:err_exit:;
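A minimal sketch of Joe's point, with a hypothetical struct nic and helper names that only loosely mirror aq_nic_free_hot_resources(): before C23, a label must be followed by a statement, so a label at the very end of a function needs the empty statement in "err_exit:;". When nothing has to be unwound, an early return gives the same control flow.

```c
#include <assert.h>
#include <stdlib.h>

#define NVEC 4

struct nic {
	int *vec[NVEC];
};

/* Driver style: jump to a label at the end of the function. The ';'
 * after the label is an empty statement, required before C23. */
static void free_vecs_goto(struct nic *self)
{
	unsigned int i;

	if (!self)
		goto err_exit;

	for (i = NVEC; i--;) {
		free(self->vec[i]);
		self->vec[i] = NULL;
	}

err_exit:;
}

/* Same control flow with an early return, as Joe suggests. */
static void free_vecs_return(struct nic *self)
{
	unsigned int i;

	if (!self)
		return;

	for (i = NVEC; i--;) {
		free(self->vec[i]);
		self->vec[i] = NULL;
	}
}
```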



[PATCH] net: ipv4: add code comment for clarification

2017-05-04 Thread Gustavo A. R. Silva
Add a code comment to make it clear that the position of the arguments
req->id.idiag_dport and req->id.idiag_sport is locked-in behavior
and should not be changed.

Addresses-Coverity-ID: 1357474
Cc: David Miller 
Cc: Joe Perches 
Signed-off-by: Gustavo A. R. Silva 
---
 net/ipv4/inet_diag.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 3828b3a..841800b 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -389,6 +389,12 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
  nlmsg_flags, unlh, net_admin);
 }
 
+/*
+ * Ignore the position of the arguments req->id.idiag_dport and
+ * req->id.idiag_sport in both calls to inet_lookup() and inet6_lookup()
+ * functions, once this is a locked in behavior exposed to user space.
+ * Changing this will break things for people.
+ */
 struct sock *inet_diag_find_one_icsk(struct net *net,
 struct inet_hashinfo *hashinfo,
 const struct inet_diag_req_v2 *req)
-- 
2.5.0
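A simplified, hypothetical model (not inet_lookup() itself) of how an intentional argument order can look swapped to a name-matching checker such as Coverity. Here the lookup's parameters are named from an incoming packet's point of view, while the caller's tuple is named from the socket's, so the caller passes its dst/dport into the saddr/sport slots. All types and functions are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical socket-centric tuple: src is the local end. */
struct tuple {
	uint32_t src, dst;
	uint16_t sport, dport;
};

struct fake_sock {
	struct tuple id;
};

/* Hypothetical lookup named from an incoming packet's point of view:
 * the packet's source is the socket's remote end. */
static struct fake_sock *lookup(struct fake_sock *tbl, size_t n,
				uint32_t saddr, uint16_t sport,
				uint32_t daddr, uint16_t dport)
{
	size_t i;

	for (i = 0; i < n; i++) {
		const struct tuple *t = &tbl[i].id;

		if (t->dst == saddr && t->dport == sport &&
		    t->src == daddr && t->sport == dport)
			return &tbl[i];
	}
	return NULL;
}

/* dst/dport go into the saddr/sport slots: "swapped" by name, but the
 * order this lookup expects. */
static struct fake_sock *find_one(struct fake_sock *tbl, size_t n,
				  const struct tuple *req)
{
	return lookup(tbl, n, req->dst, req->dport, req->src, req->sport);
}
```

In this model, reordering the call to match the parameter names would make the same request stop finding its socket.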



[PATCH] aquantia: Fix "ethtool -S" crash when adapter down.

2017-05-04 Thread Pavel Belous
From: Pavel Belous 

This patch fixes the crash that happens when the driver tries to collect statistics
from an already released "aq_vec" object.
If the adapter is in the "down" state, we still allow the user to see statistics from HW.

V2: fixed braces around "aq_vec_free".

Fixes: 97bde5c4f909 ("net: ethernet: aquantia: Support for NIC-specific code")
Signed-off-by: Pavel Belous 
---
 drivers/net/ethernet/aquantia/atlantic/aq_nic.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
index cdb0299..9ee1c50 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
@@ -755,7 +755,7 @@ void aq_nic_get_stats(struct aq_nic_s *self, u64 *data)
count = 0U;
 
for (i = 0U, aq_vec = self->aq_vec[0];
-   self->aq_vecs > i; ++i, aq_vec = self->aq_vec[i]) {
+   aq_vec && self->aq_vecs > i; ++i, aq_vec = self->aq_vec[i]) {
data += count;
aq_vec_get_sw_stats(aq_vec, data, );
}
@@ -959,8 +959,10 @@ void aq_nic_free_hot_resources(struct aq_nic_s *self)
goto err_exit;
 
for (i = AQ_DIMOF(self->aq_vec); i--;) {
-   if (self->aq_vec[i])
+   if (self->aq_vec[i]) {
aq_vec_free(self->aq_vec[i]);
+   self->aq_vec[i] = NULL;
+   }
}
 
 err_exit:;
-- 
2.7.4
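The fix combines two steps: clear each aq_vec slot after freeing it, and stop the stats loop at the first NULL entry. A minimal userspace sketch of that pattern follows; struct vec, struct nic, nic_free_vecs() and nic_get_stats() are hypothetical and only loosely mirror aq_nic_free_hot_resources() and aq_nic_get_stats(). Unlike the driver loop, it also bounds-checks the index before reading the next slot.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_VECS 4

struct vec {
	unsigned long long rx_packets;   /* software counter */
};

struct nic {
	struct vec *vec[MAX_VECS];
	unsigned int nvecs;
};

/* Free every vector and clear its slot, so later readers see NULL
 * instead of a dangling pointer. */
static void nic_free_vecs(struct nic *self)
{
	unsigned int i;

	for (i = MAX_VECS; i--;) {
		if (self->vec[i]) {
			free(self->vec[i]);
			self->vec[i] = NULL;
		}
	}
}

/* Sum software counters, stopping at the first missing vector.
 * Returns how many vectors were read. */
static unsigned int nic_get_stats(const struct nic *self,
				  unsigned long long *total)
{
	unsigned int i;
	const struct vec *v;

	*total = 0;
	for (i = 0U, v = self->vec[0]; v && self->nvecs > i;
	     ++i, v = (i < MAX_VECS) ? self->vec[i] : NULL)
		*total += v->rx_packets;
	return i;
}
```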



Re: [PATCH] iproute2: hide devices starting with period by default

2017-05-04 Thread David Ahern
On 5/4/17 1:10 PM, Florian Fainelli wrote:
> On 05/04/2017 09:37 AM, David Ahern wrote:
>> On 5/4/17 9:15 AM, Nicolas Dichtel wrote:
>>> On 24/02/2017 at 16:52, David Ahern wrote:
>>>> On 2/23/17 8:12 PM, David Miller wrote:
>>>>> This really needs to be a fundamental facility, so that it transparently
>>>>> works for NetworkManager, router daemons, everything.  Not just iproute2
>>>>> and "ls".
>>>>
>>>> I'll rebase my patch and send out as RFC.
>>>>
>>> David, did you finally send those patches?
>>>
>>
>> No, but for a few reasons.
>>
>> It is easy to hide devices in a dump:
>>
>> https://github.com/dsahern/linux/commit/48a80a00eac284e58bae04af10a5a932dd7aee00
>>
>>
>> But I think those devices should also not exist in sysfs or procfs which
>> overlaps what I would like to see for lightweight netdevices:
>>
>> https://github.com/dsahern/linux/commit/70574be699cf252e77f71e3df11192438689f976
> 
> Interesting, that does indeed solve the same problems the L2-only
> patch set was meant to address. I am not exactly sure if hiding the devices from
> procfs/sysfs would be appropriate in my case (dumb L2-only switch that
> only does 802.1q, for instance), but why not.
> 
> 
>>
>>
>> and to be complete, hidden devices should not be allowed to have a
>> network address or transmit packets which is the L2 only intent from
>> Florian:
>> https://www.spinics.net/lists/netdev/msg340808.html
>>
> 
> Do you plan on submitting the LWT patch set at some point?

Definitely. Maybe I can find some time this weekend.


Re: [net-ipv4] question about arguments position

2017-05-04 Thread Gustavo A. R. Silva


Quoting Joe Perches :

[]

> > +/*
> > + * Ignore the position of the arguments req->id.idiag_dport and
> > + * req->id.idiag_sport in both calls to inet_lookup() and inet6_lookup()
> > + * functions, once this is a locked in behavior exposed to user space.
> > + * Changing this will break things for people.
> > + */
> >   struct sock *inet_diag_find_one_icsk(struct net *net,
> >   struct inet_hashinfo *hashinfo,
> >   const struct inet_diag_req_v2 *req)
> >
>
> Seems sensible.  Thanks.

Should I resend it in a full and proper format, or can it be taken from here?


If you want it applied, it should be resent as a full patch
with your sign-off.


I'll send it shortly.

Thanks for clarifying
--
Gustavo A. R. Silva


Patch "netlink: Allow direct reclaim for fallback allocation" has been added to the 4.4-stable tree

2017-05-04 Thread gregkh

This is a note to let you know that I've just added the patch titled

netlink: Allow direct reclaim for fallback allocation

to the 4.4-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 netlink-allow-direct-reclaim-for-fallback-allocation.patch
and it can be found in the queue-4.4 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


>From ross.lagerw...@citrix.com  Thu May  4 12:37:51 2017
From: Ross Lagerwall 
Date: Wed, 3 May 2017 09:44:19 +0100
Subject: netlink: Allow direct reclaim for fallback allocation
To: 
Cc: Ross Lagerwall , "David S. Miller" 
, Greg Kroah-Hartman , Eric 
Dumazet , , 

Message-ID: <1493801059-2828-1-git-send-email-ross.lagerw...@citrix.com>

From: Ross Lagerwall 

The backport of d35c99ff77ec ("netlink: do not enter direct reclaim from
netlink_dump()") to the 4.4 branch (first in 4.4.32) mistakenly removed
direct reclaim from the initial large allocation _and_ the fallback
allocation which means that allocations can spuriously fail.
Fix the issue by adding back the direct reclaim flag to the fallback
allocation.

Fixes: 6d123f1d396b ("netlink: do not enter direct reclaim from netlink_dump()")
Signed-off-by: Ross Lagerwall 
Signed-off-by: Greg Kroah-Hartman 
---

Note that this is only for the 4.4 branch as the regression is only in
this branch. Consequently, there is no corresponding upstream commit.

I'm resending this to the linux-stable list since I now understand the
netdev maintainer only handles backports for the last couple of versions
of Linux.

 net/netlink/af_netlink.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2107,7 +2107,7 @@ static int netlink_dump(struct sock *sk)
if (!skb) {
alloc_size = alloc_min_size;
skb = netlink_alloc_skb(sk, alloc_size, nlk->portid,
-   (GFP_KERNEL & ~__GFP_DIRECT_RECLAIM));
+   GFP_KERNEL);
}
if (!skb)
goto errout_skb;


Patches currently in stable-queue which might be from ross.lagerw...@citrix.com are

queue-4.4/netlink-allow-direct-reclaim-for-fallback-allocation.patch
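The logic of the fix, as a userspace simulation: the first, larger allocation is opportunistic and may fail fast without direct reclaim, but the fallback to the minimum size must be allowed to reclaim, otherwise both attempts can fail under memory pressure. MOCK_GFP_* and mock_alloc() are hypothetical stand-ins, not the kernel's GFP values or allocator, and dump_alloc() only roughly follows the shape of netlink_dump().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical flag bits; real GFP values differ. */
#define MOCK_GFP_DIRECT_RECLAIM 0x1u
#define MOCK_GFP_KERNEL         (MOCK_GFP_DIRECT_RECLAIM | 0x2u)

/* Simulated allocator: under memory pressure, an allocation succeeds
 * only if it is allowed to enter direct reclaim. */
static void *mock_alloc(size_t size, unsigned int flags, bool pressure)
{
	if (pressure && !(flags & MOCK_GFP_DIRECT_RECLAIM))
		return NULL;
	return malloc(size);
}

/* Try a large buffer without reclaim, then fall back to the minimum
 * size. fallback_flags is the value the 4.4 fix changes. */
static void *dump_alloc(size_t big, size_t min, unsigned int fallback_flags,
			bool pressure)
{
	void *skb = mock_alloc(big, MOCK_GFP_KERNEL & ~MOCK_GFP_DIRECT_RECLAIM,
			       pressure);

	if (!skb)
		skb = mock_alloc(min, fallback_flags, pressure);
	return skb;
}
```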


Re: [RFC iproute2 0/8] RDMA tool

2017-05-04 Thread Leon Romanovsky
On Thu, May 04, 2017 at 03:26:13PM -0400, Dennis Dalessandro wrote:
> On 05/04/2017 02:45 PM, Leon Romanovsky wrote:
> > On Thu, May 04, 2017 at 06:30:27PM +, Bart Van Assche wrote:
> > > On Thu, 2017-05-04 at 21:25 +0300, Leon Romanovsky wrote:
> > > > On Thu, May 04, 2017 at 06:10:54PM +, Bart Van Assche wrote:
> > > > > On Thu, 2017-05-04 at 21:02 +0300, Leon Romanovsky wrote:
> > > > > > Following our discussion both in mailing list [1] and at the LPC 2016 [2],
> > > > > > we would like to propose this RDMA tool to be part of iproute2 package
> > > > > > and finally improve this situation.
> > > > >
> > > > > Hello Leon,
> > > > >
> > > > > Although I really appreciate your work: can you clarify why you would like to
> > > > > add *RDMA* functionality to an *IP routing* tool? I haven't found any motivation
> > > > > for adding RDMA functionality to iproute2 in [1].
> > > >
> > > > We are planning to reuse the same infrastructure provided by iproute2,
> > > > like netlink parsing, access to distributions, same CLI and same standards.
> > > >
> > > > Right now, RDMA is already tied to netdev: iWARP, RoCE, IPoIB, HFI-VNIC.
> > > > Many drivers (mlx, qed, i40, cxgb) are sharing code between net and
> > > > RDMA.
> > > >
> > > > I do expect that iproute2 will be installed on every machine with any
> > > > type of connection, including IB and OPA.
> > > >
> > > > So I think that it is enough to be part of that suite and not invent
> > > > our own for one specific tool.
> > >
> > > Hello Leon,
> > >
> > > Sorry but to me that sounds like a weak argument for including RDMA functionality
> > > in iproute2. There is already a library for communication over netlink sockets,
> > > namely libnl. Is there functionality that is in iproute2 but not in libnl and
> > > that is needed for the new tool? If so, have you considered creating a new
> > > library for that functionality?
> >
> > It is not hard to create a new tool, the hardest part is to ensure that it is
> > part of the distributions. Did you count how many months we have been trying to
> > add rdma-core to debian?
>
> I do agree that it is a strange pairing and am not really a fan. However at
> the end of the day it's just a name for a repo/package. If the iproute folks
> are fine to include rdma in their repo/package, great we can leverage their
> code for CLI and other common stuff.
>
> Now if the interface was something like "ip -FlagForRdma ..." I would object
> to that, but the interface is "rdma ... " so from the users' perspective these are
> different tools. They don't need to care that it was sourced from a common
> git repo.
>
> Just as an aside this already works a bit with OPA:
>
>  $ ./rdma link
> 1/1: hfi1_0/1: ifname NONE cap_mask 0x00410022 lid 0x1 lid_mask_count 0 link_layer InfiniBand
>  phys_state 5: LinkUp rate 100 Gb/sec (4X EDR) sm_lid 0x1 sm_sl 0 state 4: ACTIVE
>
> Leon I'll get you more feedback and testing, I've just been really bogged
> down this week, sorry.

Thanks Denny,

Before you start testing it, can you please provide your feedback
on my initial questions? Usability and the need for sysfs.

This is the initial phase, to understand whether the user experience of this tool fits
the RDMA and netdev communities' expectations. I would also like to get feedback on
whether it is really worth providing legacy sysfs support for old kernels, or whether I should
implement netlink from the beginning and abandon sysfs completely.

P.S. I believe this will give you the wrong output, because it parses the IB port cap_mask.
$./rdma link show hfi1_0/1 cap_mask

Thanks

>
> -Denny
>
>
>




Re: [Patch net v2] ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf

2017-05-04 Thread David Ahern
On 5/4/17 11:36 AM, Cong Wang wrote:
> For each netns (except init_net), we initialize its null entry
> in 3 places:
> 
> 1) The template itself, as we use kmemdup()
> 2) Code around dst_init_metrics() in ip6_route_net_init()
> 3) ip6_route_dev_notify(), which is supposed to initialize it after
>    loopback registers
> 
> Unfortunately the last one still happens in the wrong order because
> we expect to initialize net->ipv6.ip6_null_entry->rt6i_idev to
> net->loopback_dev's idev, so we have to do that after we add
> idev to it. However, this notifier has priority == 0, the same as
> ipv6_dev_notf, and ipv6_dev_notf is registered after
> ip6_route_dev_notifier, so it is actually called after
> ip6_route_dev_notifier.
> 
> Fix it by picking a smaller priority for ip6_route_dev_notifier.
> Also, we have to release the refcnt accordingly when unregistering
> loopback_dev because device exit functions are called before subsys
> exit functions.
> 
> Cc: David Ahern 
> Signed-off-by: Cong Wang 
> ---

Commit message needs a tie-in to the problem that Andrey reported. It
solves the same problem for namespaces other than init_net.

Acked-by: David Ahern 
Tested-by: David Ahern 
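Cong's explanation depends on how notifier chains order their callbacks: a higher priority runs first, and a new entry is inserted after existing entries of equal priority, so among equal priorities the earlier registration runs first. Below is a simplified userspace model of that insertion rule, modelled on the kernel's notifier_chain_register(); struct nb, chain_register() and chain_call() are hypothetical, and -10 is only an illustrative "smaller priority".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct nb {
	const char *name;
	int priority;
	struct nb *next;
};

/* Higher priority first; among equal priorities, earlier registrations
 * stay ahead of later ones. */
static void chain_register(struct nb **head, struct nb *n)
{
	while (*head && n->priority <= (*head)->priority)
		head = &(*head)->next;
	n->next = *head;
	*head = n;
}

/* Record the call order as a string of first letters, e.g. "AR". */
static void chain_call(const struct nb *head, char *order, size_t len)
{
	size_t i = 0;

	if (len == 0)
		return;
	for (; head && i + 1 < len; head = head->next)
		order[i++] = head->name[0];
	order[i] = '\0';
}
```

With both notifiers at priority 0, the route notifier (registered first) runs before addrconf, which is the ordering problem described in the patch; lowering its priority moves it behind addrconf.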


Re: [PATCH RESEND 4.4-only] netlink: Allow direct reclaim for fallback allocation

2017-05-04 Thread Greg Kroah-Hartman
On Wed, May 03, 2017 at 09:44:19AM +0100, Ross Lagerwall wrote:
> The backport of d35c99ff77ec ("netlink: do not enter direct reclaim from
> netlink_dump()") to the 4.4 branch (first in 4.4.32) mistakenly removed
> direct reclaim from the initial large allocation _and_ the fallback
> allocation which means that allocations can spuriously fail.
> Fix the issue by adding back the direct reclaim flag to the fallback
> allocation.
> 
> Fixes: 6d123f1d396b ("netlink: do not enter direct reclaim from netlink_dump()")
> Signed-off-by: Ross Lagerwall 
> ---
> 
> Note that this is only for the 4.4 branch as the regression is only in
> this branch. Consequently, there is no corresponding upstream commit.
> 
> I'm resending this to the linux-stable list since I now understand the
> netdev maintainer only handles backports for the last couple of versions
> of Linux.
> 

Many thanks for this fix, now queued up.

greg k-h


Re: [RFC iproute2 0/8] RDMA tool

2017-05-04 Thread Dennis Dalessandro

On 05/04/2017 02:45 PM, Leon Romanovsky wrote:

On Thu, May 04, 2017 at 06:30:27PM +, Bart Van Assche wrote:

On Thu, 2017-05-04 at 21:25 +0300, Leon Romanovsky wrote:

On Thu, May 04, 2017 at 06:10:54PM +, Bart Van Assche wrote:

On Thu, 2017-05-04 at 21:02 +0300, Leon Romanovsky wrote:

Following our discussion both in mailing list [1] and at the LPC 2016 [2],
we would like to propose this RDMA tool to be part of iproute2 package
and finally improve this situation.


Hello Leon,

Although I really appreciate your work: can you clarify why you would like to
add *RDMA* functionality to an *IP routing* tool? I haven't found any motivation
for adding RDMA functionality to iproute2 in [1].


We are planning to reuse the same infrastructure provided by iproute2,
like netlink parsing, access to distributions, same CLI and same standards.

Right now, RDMA is already tightened to netdev: iWARP, RoCE, IPoIB, HFI-VNIC.
Many drivers (mlx, qed, i40, cxgb) are sharing code between net and
RDMA.

I do expect that iproute2 will be installed on every machine with any
type of connection, including IB and OPA.

So I think that is reason enough to be part of that suite rather than
invent our own for one specific tool.


Hello Leon,

Sorry, but to me that sounds like a weak argument for including RDMA functionality
in iproute2. There is already a library for communication over netlink sockets,
namely libnl. Is there functionality that is in iproute2 but not in libnl and
that is needed for the new tool? If so, have you considered creating a new
library for that functionality?


It is not hard to create a new tool; the hardest part is ensuring that it is
part of the distributions. Have you counted how many months we have been trying to
add rdma-core to Debian?


I do agree that it is a strange pairing and am not really a fan. However,
at the end of the day it's just a name for a repo/package. If the
iproute folks are fine with including rdma in their repo/package, great: we
can leverage their code for the CLI and other common stuff.


Now, if the interface were something like "ip -FlagForRdma ...", I would
object to that, but the interface is "rdma ...", so from the user's
perspective they are different tools. They don't need to care that they were
sourced from a common git repo.


Just as an aside, this already works a bit with OPA:

 $ ./rdma link
1/1: hfi1_0/1: ifname NONE cap_mask 0x00410022 lid 0x1 lid_mask_count 0 link_layer InfiniBand
 phys_state 5: LinkUp rate 100 Gb/sec (4X EDR) sm_lid 0x1 sm_sl 0 state 4: ACTIVE


Leon, I'll get you more feedback and testing; I've just been really
bogged down this week, sorry.


-Denny





Re: [net-ipv4] question about arguments position

2017-05-04 Thread Gustavo A. R. Silva

Hi Joe,

Quoting Joe Perches :


On Thu, 2017-05-04 at 12:46 -0400, David Miller wrote:

From: "Gustavo A. R. Silva" 
Date: Thu, 04 May 2017 11:07:54 -0500

> While looking into Coverity ID 1357474 I ran into the following piece
> of code at net/ipv4/inet_diag.c:392:

Because it's been this way since at least 2005, it doesn't matter if
the order is correct or not.  What's there is the locked in behavior
exposed to userspace and changing it will break things for people.


Adding a few comments around the code about why
it is this way will help avoid future questions.


In the case of Coverity, I have already triaged and documented this issue,
so people can ignore it in the future.


Regarding the code comments, what about the following patch:

diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 3828b3a..7a56641 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -389,6 +389,12 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,

  nlmsg_flags, unlh, net_admin);
 }

+/*
+ * Ignore the position of the arguments req->id.idiag_dport and
+ * req->id.idiag_sport in both calls to inet_lookup() and inet6_lookup()
+ * functions, since this is a locked-in behavior exposed to user space.
+ * Changing this will break things for people.
+ */
 struct sock *inet_diag_find_one_icsk(struct net *net,
 struct inet_hashinfo *hashinfo,
 const struct inet_diag_req_v2 *req)

Thanks
--
Gustavo A. R. Silva






Re: [net-ipv4] question about arguments position

2017-05-04 Thread Joe Perches
On Thu, 2017-05-04 at 14:15 -0500, Gustavo A. R. Silva wrote:
> Quoting Joe Perches :
> 
> > On Thu, 2017-05-04 at 14:00 -0500, Gustavo A. R. Silva wrote:
> > > Regarding the code comments, what about the following patch:
> > 
> > []
> > > diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
> > 
> > []
> > > @@ -389,6 +389,12 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
> > >nlmsg_flags, unlh, net_admin);
> > >   }
> > > 
> > > +/*
> > > + * Ignore the position of the arguments req->id.idiag_dport and
> > > + * req->id.idiag_sport in both calls to inet_lookup() and inet6_lookup()
> > > + * functions, since this is a locked-in behavior exposed to user space.
> > > + * Changing this will break things for people.
> > > + */
> > >   struct sock *inet_diag_find_one_icsk(struct net *net,
> > >   struct inet_hashinfo *hashinfo,
> > >   const struct inet_diag_req_v2 *req)
> > > 
> > 
> > Seems sensible.  Thanks.
> 
> Should I resend it in a full and proper format, or can it be taken from here?

If you want it applied, it should be resent as a full patch
with your sign-off.




Re: [net-ipv4] question about arguments position

2017-05-04 Thread Gustavo A. R. Silva


Quoting Joe Perches :


On Thu, 2017-05-04 at 14:00 -0500, Gustavo A. R. Silva wrote:

Regarding the code comments, what about the following patch:

[]

diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c

[]

@@ -389,6 +389,12 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
   nlmsg_flags, unlh, net_admin);
  }

+/*
+ * Ignore the position of the arguments req->id.idiag_dport and
+ * req->id.idiag_sport in both calls to inet_lookup() and inet6_lookup()
+ * functions, since this is a locked-in behavior exposed to user space.
+ * Changing this will break things for people.
+ */
  struct sock *inet_diag_find_one_icsk(struct net *net,
  struct inet_hashinfo *hashinfo,
  const struct inet_diag_req_v2 *req)



Seems sensible.  Thanks.


Should I resend it in a full and proper format, or can it be taken from here?

Thanks
--
Gustavo A. R. Silva








Re: [PATCH] iproute2: hide devices starting with period by default

2017-05-04 Thread Florian Fainelli
On 05/04/2017 09:37 AM, David Ahern wrote:
> On 5/4/17 9:15 AM, Nicolas Dichtel wrote:
>> On 24/02/2017 at 16:52, David Ahern wrote:
>>> On 2/23/17 8:12 PM, David Miller wrote:
 This really needs to be a fundamental facility, so that it transparently
 works for NetworkManager, router daemons, everything.  Not just iproute2
 and "ls".
>>>
>>> I'll rebase my patch and send out as RFC.
>>>
>> David, did you finally send those patches?
>>
> 
> No, but for a few reasons.
> 
> It is easy to hide devices in a dump:
> 
> https://github.com/dsahern/linux/commit/48a80a00eac284e58bae04af10a5a932dd7aee00
> 
> 
> But I think those devices should also not exist in sysfs or procfs, which
> overlaps with what I would like to see for lightweight netdevices:
> 
> https://github.com/dsahern/linux/commit/70574be699cf252e77f71e3df11192438689f976

Interesting, that does indeed solve the same problems as the L2-only
patch set intended. I am not exactly sure if hiding the devices from
procfs/sysfs would be appropriate in my case (dumb L2 only switch that
only does 802.1q for instance), but why not.


> 
> 
> and to be complete, hidden devices should not be allowed to have a
> network address or transmit packets, which is the L2-only intent from
> Florian:
> https://www.spinics.net/lists/netdev/msg340808.html
> 

Do you plan on submitting the LWT patch set at some point?
-- 
Florian
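David notes that hiding devices from a dump is the easy part; the naming policy from the patch subject fits in a few lines. The userspace sketch below only illustrates that rule (names beginning with '.' are skipped unless the caller asks for everything); dev_visible() and count_visible() are hypothetical helpers, and this does not reproduce the linked kernel commits, which also have to deal with sysfs/procfs as David points out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy: a device appears in a default listing unless its
 * name starts with '.'; show_all (e.g. an explicit "show hidden" option)
 * overrides the rule. */
static bool dev_visible(const char *name, bool show_all)
{
	if (name == NULL)
		return false;
	return show_all || name[0] != '.';
}

/* Counts how many of the given names a default dump would print. */
static size_t count_visible(const char *const *names, size_t n, bool show_all)
{
	size_t visible = 0;

	for (size_t i = 0; i < n; i++)
		if (dev_visible(names[i], show_all))
			visible++;
	return visible;
}
```

Applying the rule in the kernel dump path, rather than in each tool, is what makes it work transparently for NetworkManager and routing daemons as well, per David Miller's earlier comment.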


Re: [PATCH] aquantia: Fix "ethtool -S" crash when adapter down.

2017-05-04 Thread Lino Sanfilippo
On 04.05.2017 20:37, Pavel Belous wrote:

> 
> Yes, even when the adapter is in the down state, the user can still see
> statistics from the HW.
> For example (adapter is down):
> 
> $ ethtool -S enp2s0
> NIC statistics:
>  InPackets: 3237727
>  InUCast: 3237214
>  InMCast: 391
>  InBCast: 122
>  InErrors: 0
>  OutPackets: 14157898
>  OutUCast: 14157089
>  OutMCast: 304
>  OutBCast: 505
>  InUCastOctects: 226714406
>  OutUCastOctects: 10463156
>  InMCastOctects: 58046
>  OutMCastOctects: 44817
>  InBCastOctects: 12857
>  OutBCastOctects: 41626
>  InOctects: 226785309
>  OutOctects: 10549599
>  InPacketsDma: 0
>  OutPacketsDma: 16
>  InOctetsDma: 0
>  OutOctetsDma: 2396
>  InDroppedDma: 0
>  Queue[0] InPackets: 0
>  Queue[0] OutPackets: 0
>  Queue[0] InJumboPackets: 0
>  Queue[0] InLroPackets: 0
>  Queue[0] InErrors: 0
>  Queue[1] InPackets: 0
>  Queue[1] OutPackets: 0
>  Queue[1] InJumboPackets: 0
>  Queue[1] InLroPackets: 0
>  Queue[1] InErrors: 0
>  Queue[2] InPackets: 0
>  Queue[2] OutPackets: 0
>  Queue[2] InJumboPackets: 0
>  Queue[2] InLroPackets: 0
>  Queue[2] InErrors: 0
>  Queue[3] InPackets: 0
>  Queue[3] OutPackets: 0
>  Queue[3] InJumboPackets: 0
>  Queue[3] InLroPackets: 0
>  Queue[3] InErrors: 0
> 
> Lino, David what do you think?
> If you agree I can re-submit the patch (with fixed braces).
> 

Well, my objection was related to how the bug is fixed. If in the end we have a
solution as suggested by David, it would be even better, of course. I don't think
that this is too hard to realize, since it is only the queue stats that are
missing. But if you prefer to solve it in two steps, sure, no objections from my
side :)

Regards,
Lino 



Re: [net-ipv4] question about arguments position

2017-05-04 Thread Joe Perches
On Thu, 2017-05-04 at 14:00 -0500, Gustavo A. R. Silva wrote:
> Regarding the code comments, what about the following patch:
[]
> diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
[]
> @@ -389,6 +389,12 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
>nlmsg_flags, unlh, net_admin);
>   }
> 
> +/*
> + * Ignore the position of the arguments req->id.idiag_dport and
> + * req->id.idiag_sport in both calls to inet_lookup() and inet6_lookup()
> + * functions, since this is a locked-in behavior exposed to user space.
> + * Changing this will break things for people.
> + */
>   struct sock *inet_diag_find_one_icsk(struct net *net,
>   struct inet_hashinfo *hashinfo,
>   const struct inet_diag_req_v2 *req)
> 

Seems sensible.  Thanks.


[GIT] Networking

2017-05-04 Thread David Miller

1) The wireless rate info fix from Johannes Berg.

2) When a RAW socket is in hdrincl mode, we need to make sure that the
   user provided at least a minimally sized ipv4/ipv6 header.  Fix from
   Alexander Potapenko.

3) We must emit IFLA_PHYS_PORT_NAME netlink attributes using
   nla_put_string() so that they are NUL terminated.

4) Fix a bug in TCP fastopen handling, wherein child sockets erroneously
   inherit the fastopen_req from the parent, and later can end up
   dereferencing freed memory or doing a double free.  From Eric Dumazet.

5) Don't clear out netdev stats at close time in tg3 driver, from
   YueHaibing.

6) Fix refcount leak in xt_CT, from Gao Feng.

7) In nft_set_bitmap() don't leak dummy elements, from Liping Zhang.

8) Fix deadlock due to taking the expectation lock twice, also from
   Liping Zhang.

9) Make xt_socket work again with ipv6, from Peter Tirsek.

10) Don't allow IPV6 to be used with IPVS if ipv6.disable=1, from
Paolo Abeni.

11) Make the BPF loader more flexible wrt. changes to the bpf MAP
entry layout.  From Jesper Dangaard Brouer.

12) Fix ethtool reported device name in aquantia driver, from Pavel
Belous.

13) Fix build failures due to the compile time size test not working
in netfilter conntrack.  From Geert Uytterhoeven.
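Item 3 hinges on a detail of netlink attribute encoding: nla_put_string() copies strlen() + 1 bytes, so the terminating NUL travels with the attribute, whereas an nla_put() of strlen() bytes does not carry it. The userspace sketch below mimics the struct nlattr layout and 4-byte attribute alignment; sim_nlattr, sim_nla_put() and sim_nla_put_string() are hypothetical stand-ins, not the kernel functions themselves.

```c
#include <stdint.h>
#include <string.h>

/* Same layout as struct nlattr in <linux/netlink.h>, redefined so the
 * sketch builds on its own. */
struct sim_nlattr {
	uint16_t nla_len;   /* header + payload, padding not included */
	uint16_t nla_type;
};

#define SIM_NLA_ALIGN(len) (((len) + 3) & ~(size_t)3)
#define SIM_NLA_HDRLEN     SIM_NLA_ALIGN(sizeof(struct sim_nlattr))

/* Appends one attribute at buf; returns bytes used (padded to 4),
 * or 0 if it does not fit. Padding bytes are zeroed. */
static size_t sim_nla_put(unsigned char *buf, size_t cap, uint16_t type,
			  const void *data, size_t len)
{
	size_t total = SIM_NLA_ALIGN(SIM_NLA_HDRLEN + len);
	struct sim_nlattr hdr;

	if (total > cap)
		return 0;
	hdr.nla_len = (uint16_t)(SIM_NLA_HDRLEN + len);
	hdr.nla_type = type;
	memcpy(buf, &hdr, sizeof(hdr));
	memcpy(buf + SIM_NLA_HDRLEN, data, len);
	memset(buf + SIM_NLA_HDRLEN + len, 0, total - SIM_NLA_HDRLEN - len);
	return total;
}

/* String helper: the payload includes the terminating NUL (strlen + 1),
 * mirroring what the kernel's nla_put_string() does. */
static size_t sim_nla_put_string(unsigned char *buf, size_t cap,
				 uint16_t type, const char *s)
{
	return sim_nla_put(buf, cap, type, s, strlen(s) + 1);
}
```

For "eth0" the string helper produces nla_len 9 with a NUL at the end of the payload, while the raw put produces nla_len 8 and no padding at all, so a reader treating the payload as a C string would run past it. Whether a raw put happens to be followed by a zero padding byte depends on the string length, which is why relying on it is fragile.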

Please pull, thanks a lot!

The following changes since commit 89c9fea3c8034cdb2fd745f551cde0b507fd6893:

  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial (2017-05-02 19:09:35 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 

for you to fetch changes up to 842be75c77cb72ee546a2b19da9c285fb3ded660:

  cfg80211: make RATE_INFO_BW_20 the default (2017-05-04 13:15:28 -0400)


Alexander Potapenko (1):
  ipv4, ipv6: ensure raw socket message is big enough to hold an IP header

Alexei Starovoitov (1):
  selftests/bpf: get rid of -D__x86_64__

Colin Ian King (1):
  net/sched: remove redundant null check on head

Dan Carpenter (1):
  netfilter: x_tables: unlock on error in xt_find_table_lock()

Daniel Borkmann (1):
  xdp: use common helper for netlink extended ack reporting

Daniele Palmas (1):
  net: usb: qmi_wwan: add Telit ME910 support

Dave Johnson (1):
  netfilter: Wrong icmp6 checksum for ICMPV6_TIME_EXCEED in reverse SNATv6 path

David Ahern (1):
  net: ipv6: Do not duplicate DAD on link up

David Cai (1):
  smsc911x: Adding support for Micochip LAN9250 Ethernet controller

David S. Miller (4):
  Merge branch 'sample-bpf-loader-fixes'
  Merge git://git.kernel.org/.../pablo/nf
  Merge branch 'ibmvnic-Updated-reset-handler-andcode-fixes'
  Merge branch 'qed-fixes'

Eric Dumazet (1):
  tcp: do not inherit fastopen_req from parent

Gao Feng (1):
  netfilter: xt_CT: fix refcnt leak on error path

Geert Uytterhoeven (2):
  test_bpf: Use ULL suffix for 64-bit constants
  netfilter: conntrack: Force inlining of build check to prevent build failure

Jarno Rajahalme (1):
  openvswitch: Delete conntrack entry clashing with an expectation.

Jesper Dangaard Brouer (4):
  samples/bpf: adjust rlimit RLIMIT_MEMLOCK for traceex2, tracex3 and tracex4
  samples/bpf: make bpf_load.c code compatible with ELF maps section changes
  samples/bpf: load_bpf.c make callback fixup more flexible
  samples/bpf: export map_data[] for more info on maps

Johannes Berg (1):
  cfg80211: make RATE_INFO_BW_20 the default

Linus Lüssing (1):
  bridge: ebtables: fix reception of frames DNAT-ed to bridge device/port

Liping Zhang (7):
  netfilter: nf_ct_helper: permit cthelpers with different names via nfnetlink
  netfilter: nft_set_bitmap: free dummy elements when destroy the set
  netfilter: ctnetlink: drop the incorrect cthelper module request
  netfilter: ctnetlink: fix deadlock due to acquire _expect_lock twice
  netfilter: ctnetlink: make it safer when updating ct->status
  netfilter: ctnetlink: acquire ct->lock before operating nf_ct_seqadj
  netfilter: nft_dynset: continue to next expr if _OP_ADD succeeded

Michal Schmidt (1):
  rtnetlink: NUL-terminate IFLA_PHYS_PORT_NAME string

Nathan Fontenot (10):
  ibmvnic: Move resource initialization to its own routine
  ibmvnic: Replace is_closed with state field
  ibmvnic: Updated reset handling
  ibmvnic: Delete napi's when releasing driver resources
  ibmvnic: Whitespace correction in release_rx_pools
  ibmvnic: Clean up tx pools when closing
  ibmvnic: Wait for any pending scrqs entries at driver close
  ibmvnic: Check for driver reset first in ibmvnic_xmit
  ibmvnic: Continue skb processing after skb completion error
  ibmvnic: Move queue restarting in ibmvnic_tx_complete

Pablo Neira Ayuso (3):
  Merge tag 'ipvs-fixes-for-v4.11' of http://git.kernel.org/.../horms/ipvs
  

Re: [RFC iproute2 0/8] RDMA tool

2017-05-04 Thread Leon Romanovsky
On Thu, May 04, 2017 at 06:30:27PM +, Bart Van Assche wrote:
> On Thu, 2017-05-04 at 21:25 +0300, Leon Romanovsky wrote:
> > On Thu, May 04, 2017 at 06:10:54PM +, Bart Van Assche wrote:
> > > On Thu, 2017-05-04 at 21:02 +0300, Leon Romanovsky wrote:
> > > > Following our discussion both on the mailing list [1] and at LPC 2016 [2],
> > > > we would like to propose this RDMA tool to be part of iproute2 package
> > > > and finally improve this situation.
> > >
> > > Hello Leon,
> > >
> > > Although I really appreciate your work: can you clarify why you would like to
> > > add *RDMA* functionality to an *IP routing* tool? I haven't found any motivation
> > > for adding RDMA functionality to iproute2 in [1].
> >
> > We are planning to reuse the same infrastructure provided by iproute2,
> > like netlink parsing, access to distributions, same CLI and same standards.
> >
> > Right now, RDMA is already tightly tied to netdev: iWARP, RoCE, IPoIB, HFI-VNIC.
> > Many drivers (mlx, qed, i40, cxgb) share code between net and
> > RDMA.
> >
> > I do expect that iproute2 will be installed on every machine with any
> > type of connection, including IB and OPA.
> >
> > So I think that is reason enough to be part of that suite rather than
> > invent our own for one specific tool.
>
> Hello Leon,
>
> Sorry, but to me that sounds like a weak argument for including RDMA functionality
> in iproute2. There is already a library for communication over netlink sockets,
> namely libnl. Is there functionality that is in iproute2 but not in libnl and
> that is needed for the new tool? If so, have you considered creating a new
> library for that functionality?

It is not hard to create a new tool; the hardest part is ensuring that it is
part of the distributions. Have you counted how many months we have been trying to
add rdma-core to Debian?

I have enough headaches with that and don't want another one.

Do you have a situation in mind where you would have an RDMA device without
iproute2 installed?

Thanks

>
> Thanks,
>
> Bart.




Re: [PATCH] aquantia: Fix "ethtool -S" crash when adapter down.

2017-05-04 Thread Pavel Belous



On 04.05.2017 21:17, Joe Perches wrote:

On Thu, 2017-05-04 at 20:08 +0300, Pavel Belous wrote:

I will prepare another patch with Lino and David M. comments.


I'm not submitting this because it'd just cause merge conflicts, but
something you could do one day is remove the AQ_DIMOF macro
and just use ARRAY_SIZE directly.
---
 drivers/net/ethernet/aquantia/atlantic/aq_nic.c  | 4 ++--
 drivers/net/ethernet/aquantia/atlantic/aq_utils.h| 2 --
 drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_a0.c| 2 +-
 drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c| 2 +-
 drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c | 2 +-
 5 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
index cdb02991f249..cffae53414ba 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
@@ -154,7 +154,7 @@ static void aq_nic_service_timer_cb(unsigned long param)

memset(_rx, 0U, sizeof(struct aq_ring_stats_rx_s));
memset(_tx, 0U, sizeof(struct aq_ring_stats_tx_s));
-   for (i = AQ_DIMOF(self->aq_vec); i--;) {
+   for (i = ARRAY_SIZE(self->aq_vec); i--;) {
if (self->aq_vec[i])
aq_vec_add_stats(self->aq_vec[i], _rx, _tx);
}
@@ -958,7 +958,7 @@ void aq_nic_free_hot_resources(struct aq_nic_s *self)
if (!self)
goto err_exit;

-   for (i = AQ_DIMOF(self->aq_vec); i--;) {
+   for (i = ARRAY_SIZE(self->aq_vec); i--;) {
if (self->aq_vec[i])
aq_vec_free(self->aq_vec[i]);
}
diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_utils.h b/drivers/net/ethernet/aquantia/atlantic/aq_utils.h
index f6012b34abe6..64a8c3c781ff 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_utils.h
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_utils.h
@@ -14,8 +14,6 @@

 #include "aq_common.h"

-#define AQ_DIMOF(_ARY_)  ARRAY_SIZE(_ARY_)
-
 struct aq_obj_s {
spinlock_t lock; /* spinlock for nic/rings processing */
atomic_t flags;
diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_a0.c b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_a0.c
index 4ee15ff06a44..96c3360e7060 100644
--- a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_a0.c
+++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_a0.c
@@ -182,7 +182,7 @@ static int hw_atl_a0_hw_rss_set(struct aq_hw_s *self,
((i * 3U) & 0xFU));
}

-   for (i = AQ_DIMOF(bitary); i--;) {
+   for (i = ARRAY_SIZE(bitary); i--;) {
rpf_rss_redir_tbl_wr_data_set(self, bitary[i]);
rpf_rss_redir_tbl_addr_set(self, i);
rpf_rss_redir_wr_en_set(self, 1U);
diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c
index 42150708191d..5a19eba31786 100644
--- a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c
+++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c
@@ -182,7 +182,7 @@ static int hw_atl_b0_hw_rss_set(struct aq_hw_s *self,
((i * 3U) & 0xFU));
}

-   for (i = AQ_DIMOF(bitary); i--;) {
+   for (i = ARRAY_SIZE(bitary); i--;) {
rpf_rss_redir_tbl_wr_data_set(self, bitary[i]);
rpf_rss_redir_tbl_addr_set(self, i);
rpf_rss_redir_wr_en_set(self, 1U);
diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c
index 8d6d8f5804da..922af5f36d37 100644
--- a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c
+++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c
@@ -385,7 +385,7 @@ int hw_atl_utils_get_mac_permanent(struct aq_hw_s *self,
aq_hw_read_reg(self, 0x0374U) +
(40U * 4U),
mac_addr,
-   AQ_DIMOF(mac_addr));
+   ARRAY_SIZE(mac_addr));
if (err < 0) {
mac_addr[0] = 0U;
mac_addr[1] = 0U;



Thank you,
I will do it a little bit later.

Regards,
Pavel
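Joe's cleanup relies on the kernel's ARRAY_SIZE() and on the driver's countdown idiom, for (i = ARRAY_SIZE(x); i--;), which visits indices n-1 down to 0. A userspace sketch of both follows. The macro here is simplified (the kernel version also adds __must_be_array() so that passing a pointer fails to compile), and sim_nic and sim_count_vecs() are hypothetical stand-ins for the driver's aq_nic_s and its aq_vec[] loops.

```c
#include <stddef.h>

/* Simplified ARRAY_SIZE(): element count of a true array. */
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

struct sim_nic {
	void *vec[8];   /* hypothetical stand-in for self->aq_vec[] */
};

/* Counts populated slots and records the visit order, showing that the
 * countdown loop covers every index from n-1 down to 0. */
static unsigned int sim_count_vecs(const struct sim_nic *nic, int order[8])
{
	unsigned int i, seen = 0, k = 0;

	for (i = ARRAY_SIZE(nic->vec); i--;) {
		order[k++] = (int)i;
		if (nic->vec[i])
			seen++;
	}
	return seen;
}
```

Because the condition tests i before decrementing it, the loop works with an unsigned index and stops after index 0; the final decrement wraps i, but that value is never used.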


Re: [PATCH] aquantia: Fix "ethtool -S" crash when adapter down.

2017-05-04 Thread Pavel Belous



On 04.05.2017 21:27, David Arcari wrote:

On 05/04/2017 01:09 PM, Pavel Belous wrote:



On 04.05.2017 19:51, David Miller wrote:

From: Lino Sanfilippo 
Date: Thu, 4 May 2017 18:48:12 +0200


Hi Pavel,

On 04.05.2017 18:33, Pavel Belous wrote:

From: Pavel Belous 

This patch fixes the crash that happens when driver tries to collect statistics
from already released "aq_vec" object.

Fixes: 97bde5c4f909 ("net: ethernet: aquantia: Support for NIC-specific code")
Signed-off-by: Pavel Belous 
---
 drivers/net/ethernet/aquantia/atlantic/aq_nic.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
index cdb0299..3a32573 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
@@ -755,7 +755,7 @@ void aq_nic_get_stats(struct aq_nic_s *self, u64 *data)
 count = 0U;

 for (i = 0U, aq_vec = self->aq_vec[0];
-self->aq_vecs > i; ++i, aq_vec = self->aq_vec[i]) {
+aq_vec && self->aq_vecs > i; ++i, aq_vec = self->aq_vec[i]) {
 data += count;
 aq_vec_get_sw_stats(aq_vec, data, );
 }
@@ -961,6 +961,7 @@ void aq_nic_free_hot_resources(struct aq_nic_s *self)
 for (i = AQ_DIMOF(self->aq_vec); i--;) {
 if (self->aq_vec[i])
 aq_vec_free(self->aq_vec[i]);
+self->aq_vec[i] = NULL;
 }

 err_exit:;



If the driver does not support statistics when the interface is down, wouldn't
it be clearer to check netif_running() in get_stats() instead?


Yes, much cleaner.

Much better would be to have a cached software copy so that statistics
can be reported regardless of whether the device is down or not.



Thank you.
I will think about how to do it better.


It appears that the adapter is still reporting the cumulative hardware stats
even while it's down. The user is just losing the per-queue stats.

Although the loss of the per-queue stats is not ideal, this patch still fixes a
crash.

It might be worthwhile to refactor this patch as a short term solution and then
subsequently produce a version that contains cached statistics.  Assuming that
is amenable to everyone of course.

-DA



Yes, even when the adapter is in the down state, the user can still see
statistics from the HW.

For example (adapter is down):

$ ethtool -S enp2s0
NIC statistics:
 InPackets: 3237727
 InUCast: 3237214
 InMCast: 391
 InBCast: 122
 InErrors: 0
 OutPackets: 14157898
 OutUCast: 14157089
 OutMCast: 304
 OutBCast: 505
 InUCastOctects: 226714406
 OutUCastOctects: 10463156
 InMCastOctects: 58046
 OutMCastOctects: 44817
 InBCastOctects: 12857
 OutBCastOctects: 41626
 InOctects: 226785309
 OutOctects: 10549599
 InPacketsDma: 0
 OutPacketsDma: 16
 InOctetsDma: 0
 OutOctetsDma: 2396
 InDroppedDma: 0
 Queue[0] InPackets: 0
 Queue[0] OutPackets: 0
 Queue[0] InJumboPackets: 0
 Queue[0] InLroPackets: 0
 Queue[0] InErrors: 0
 Queue[1] InPackets: 0
 Queue[1] OutPackets: 0
 Queue[1] InJumboPackets: 0
 Queue[1] InLroPackets: 0
 Queue[1] InErrors: 0
 Queue[2] InPackets: 0
 Queue[2] OutPackets: 0
 Queue[2] InJumboPackets: 0
 Queue[2] InLroPackets: 0
 Queue[2] InErrors: 0
 Queue[3] InPackets: 0
 Queue[3] OutPackets: 0
 Queue[3] InJumboPackets: 0
 Queue[3] InLroPackets: 0
 Queue[3] InErrors: 0

Lino, David what do you think?
If you agree I can re-submit the patch (with fixed braces).

Regards,
Pavel
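David Miller's suggestion of a cached software copy could look roughly like the userspace sketch below: per-queue counters are folded into a cache before a queue vector is released and its slot is cleared, so a later stats query reports cached plus live values whether or not the interface is up. Every type and function name here is hypothetical; this is not the aquantia driver's code, and a real driver would also need locking between the stats path and teardown.

```c
#include <stdint.h>

#define SIM_NQUEUES 4

struct sim_queue_stats {
	uint64_t rx_packets;
	uint64_t tx_packets;
};

struct sim_vec {                        /* stand-in for a per-queue vector */
	struct sim_queue_stats stats;
};

struct sim_nic {
	struct sim_vec *vec[SIM_NQUEUES];   /* NULL once resources are freed */
	struct sim_queue_stats cached[SIM_NQUEUES];
};

/* Called before a vector is released: fold its counters into the cache
 * and clear the slot so later readers never touch freed memory. */
static void sim_release_vec(struct sim_nic *nic, unsigned int q)
{
	struct sim_vec *v = nic->vec[q];

	if (!v)
		return;
	nic->cached[q].rx_packets += v->stats.rx_packets;
	nic->cached[q].tx_packets += v->stats.tx_packets;
	nic->vec[q] = 0;
	/* a real driver would free v here */
}

/* Reports cached + live counters for every queue; a NULL slot simply
 * contributes no live counts instead of ending the walk. */
static void sim_get_queue_stats(const struct sim_nic *nic,
				struct sim_queue_stats out[SIM_NQUEUES])
{
	for (unsigned int q = 0; q < SIM_NQUEUES; q++) {
		out[q] = nic->cached[q];
		if (nic->vec[q]) {
			out[q].rx_packets += nic->vec[q]->stats.rx_packets;
			out[q].tx_packets += nic->vec[q]->stats.tx_packets;
		}
	}
}
```

Unlike the loop condition in the posted patch, which stops at the first NULL aq_vec, this walks every queue slot, so the per-queue counters keep reporting the last known values after the interface goes down.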



Re: [RFC iproute2 0/8] RDMA tool

2017-05-04 Thread Bart Van Assche
On Thu, 2017-05-04 at 21:25 +0300, Leon Romanovsky wrote:
> On Thu, May 04, 2017 at 06:10:54PM +, Bart Van Assche wrote:
> > On Thu, 2017-05-04 at 21:02 +0300, Leon Romanovsky wrote:
> > > Following our discussion both on the mailing list [1] and at LPC 2016 [2],
> > > we would like to propose this RDMA tool to be part of iproute2 package
> > > and finally improve this situation.
> > 
> > Hello Leon,
> > 
> > Although I really appreciate your work: can you clarify why you would like to
> > add *RDMA* functionality to an *IP routing* tool? I haven't found any motivation
> > for adding RDMA functionality to iproute2 in [1].
> 
> We are planning to reuse the same infrastructure provided by iproute2,
> like netlink parsing, access to distributions, same CLI and same standards.
> 
> Right now, RDMA is already tightly tied to netdev: iWARP, RoCE, IPoIB, HFI-VNIC.
> Many drivers (mlx, qed, i40, cxgb) share code between net and
> RDMA.
> 
> I do expect that iproute2 will be installed on every machine with any
> type of connection, including IB and OPA.
> 
> So I think that is reason enough to be part of that suite rather than
> invent our own for one specific tool.

Hello Leon,

Sorry, but to me that sounds like a weak argument for including RDMA functionality
in iproute2. There is already a library for communication over netlink sockets,
namely libnl. Is there functionality that is in iproute2 but not in libnl and
that is needed for the new tool? If so, have you considered creating a new
library for that functionality?

Thanks,

Bart.

Re: Why do we need MSG_SENDPAGE_NOTLAST?

2017-05-04 Thread Eric Dumazet
On Thu, 2017-05-04 at 17:03 +, Ilya Lesokhin wrote:
> I don't understand the need for MSG_SENDPAGE_NOTLAST and I'm hoping
> someone can enlighten me.
> 
> According to commit 35f9c09 ('tcp: tcp_sendpages() should call
> tcp_push() once'):
> "We need to call tcp_flush() at the end of the last page processed in
> tcp_sendpages(), or else transmits can be deferred and future sends
> stall."
> 
> I don't understand why we need to differentiate between the user
> setting MSG_MORE and splice indicating that more data is going to be sent.
> If the user passed MSG_MORE and didn't push any extra data, isn't it
> the user's fault?
> Do we need it because poorly written applications were broken when
> MSG_MORE was added to tcp_sendpage? Or is there a deeper reason?
> 

The answer lies in how splice() works.

A user can issue one splice without MSG_MORE semantics, right?

Still, we want an implicit MORE behavior for all individual pages, but
the last one.
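This can be modeled in a few lines: within one splice() call, every page except the last carries an implicit "more is coming" marker, while MSG_MORE only reflects what the user asked for. The sketch below is a simplified userspace model; the SIM_ flag values and both helpers are hypothetical (the real MSG_MORE and MSG_SENDPAGE_NOTLAST are kernel socket flags), and sim_should_push() only answers whether data may go out immediately. In real TCP a final push with MSG_MORE still happens, but corked.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values for the model. */
#define SIM_MSG_MORE             0x1u
#define SIM_MSG_SENDPAGE_NOTLAST 0x2u

/* Flags for page idx of npages in a single splice call: every page but
 * the last carries NOTLAST, and MSG_MORE appears only if the caller
 * explicitly requested it. */
static unsigned int sim_page_flags(size_t idx, size_t npages, bool user_more)
{
	unsigned int flags = user_more ? SIM_MSG_MORE : 0;

	if (idx + 1 < npages)
		flags |= SIM_MSG_SENDPAGE_NOTLAST;
	return flags;
}

/* Data may be transmitted right away only when neither marker is set. */
static bool sim_should_push(unsigned int flags)
{
	return !(flags & (SIM_MSG_MORE | SIM_MSG_SENDPAGE_NOTLAST));
}
```

For a three-page splice without MSG_MORE, pages 0 and 1 are deferred and page 2 triggers the push; with MSG_MORE set by the user, even the last page is held back.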


> The reason I'm asking is that we are working on a kernel TLS
> implementation and I would like to know if we can coalesce multiple
> tls_sendpage calls with MSG_MORE into a single TLS record, or whether
> we must push out the record as soon as MSG_SENDPAGE_NOTLAST is cleared?

Make sure you handle partial writes (you want to coalesce 10 pages, but
the stack will only take 5 of them).
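The partial-write caveat means a coalescing sender has to keep whatever the lower layer did not accept and retry it later, rather than assume the whole batch went out. A minimal userspace sketch of that bookkeeping follows; sim_record, sim_flush() and the page budget are hypothetical and stand in for a socket whose send buffer only has room for part of the data.

```c
#include <stddef.h>

/* Hypothetical lower layer: accepts at most 'budget' pages per call,
 * like a socket whose send buffer is nearly full. Returns pages taken. */
static size_t sim_lower_send(size_t npages, size_t budget)
{
	return npages < budget ? npages : budget;
}

struct sim_record {
	size_t pending;   /* pages coalesced but not yet accepted below */
	size_t sent;      /* pages the lower layer has taken so far */
};

/* Tries to flush the coalesced pages; whatever the lower layer does not
 * accept stays pending so the caller can retry instead of losing it. */
static size_t sim_flush(struct sim_record *rec, size_t budget)
{
	size_t took = sim_lower_send(rec->pending, budget);

	rec->pending -= took;
	rec->sent += took;
	return took;
}
```

With 10 pages coalesced and room for only 5, the first flush leaves 5 pending, and a later flush completes the record.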





Re: [linux-sunxi] Re: [PATCH 2/4] dt-bindings: add binding for RTL8211E Ethernet PHY

2017-05-04 Thread Florian Fainelli
On 05/04/2017 11:26 AM, Icenowy Zheng wrote:
> 
> 
> On May 5, 2017 2:21:29 AM GMT+08:00, Florian Fainelli wrote:
>> On 05/04/2017 11:10 AM, icen...@aosc.io wrote:
>>> On 2017-04-22 08:22, Florian Fainelli wrote:
 On 04/21/2017 04:24 PM, Icenowy Zheng wrote:
> From: Icenowy Zheng 
>
> Some RTL8211E Ethernet PHY have an issue that needs a workaround
> indicated with device tree.
>
> Add the binding for a property that indicates this workaround.
>
> Signed-off-by: Icenowy Zheng 
> ---
>  .../devicetree/bindings/net/realtek,rtl8211e.txt   | 22 ++
>  1 file changed, 22 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/realtek,rtl8211e.txt
>
> diff --git a/Documentation/devicetree/bindings/net/realtek,rtl8211e.txt b/Documentation/devicetree/bindings/net/realtek,rtl8211e.txt
> new file mode 100644
> index ..c1913301bfe8
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/realtek,rtl8211e.txt
> @@ -0,0 +1,22 @@
> +Realtek RTL8211E Ethernet PHY
> +
> +One batch of RTL8211E is slightly broken and needs some special (and
> +full of magic numbers) tweaking in order to make GbE operate properly.
> +The only well-known board that used the broken batch is Pine64+.
> +Configure it through an Ethernet OF device node.
> +
> +Optional properties:
> +
> +- realtek,disable-rx-delay:
> +  If set, RX delay will be completely disabled (according to Realtek). This
> +  will affect the performance on non-broken boards.
> +  default: do not disable RX delay.

 Please don't introduce custom properties to do that; instead, correctly
 specify the "phy-mode" such that it is e.g. "rgmii-txid", which indicates
 that there should be no RX internal delay, but a TX internal delay added
 by the PHY.
>>>
>>> I checked the document; the meaning of "rgmii-txid" is not correct here.
>>>
>>> This doesn't affect the MAC, and the MAC should still add TX delay.
>>>
>>> The definition of "rgmii-txid" in
>>> Documentation/devicetree/bindings/net/ethernet.txt is "RGMII with
>>> internal TX delay provided by the PHY, the MAC should not add an TX delay
>>> in this case". However, this does not indicate that the MAC doesn't add TX
>>> delay; in fact it just completely disables the PHY's RX delay.
>>> The MAC should still add delay on both TX/RX, which is the semantics of
>>> standard "rgmii".
>>>
>>> So I cannot use "rgmii-txid" here, but should continue to use this
>>> custom property.
>>
>> This is absolutely not a correct understanding. The 'phy-mode' property
>> defines the contract between the MAC and PHY. It is defined from the
>> PHY's perspective of the delay, which means that the MAC has to either
>> also provide an adequate delay (RX or TX) or not (RX or TX). So if you
>> specified 'phy-mode' = "rgmii", this means that the MAC needs to add the
>> TX and RX delay, so implicitly this means that your MAC operates in
> 
> The MAC doesn't lose its responsibility to tweak RX/TX delays with this 
> property set.

No, it does not, but now there is no contract binding the MAC and the PHY
together as to what the appropriate delay configuration should be.
This is why using phydev->interface (directly inherited from 'phy-mode')
is important: it binds the PHY and MAC in a contract.

> 
> The situation is that the PHY's RX delay tweaking function is broken. But
> that doesn't mean that the PHY can take over *all* responsibility to tweak TX;
> it still needs the MAC to tweak TX.

Correct, so what part of my answer was not clear in that sense?

> 
>> "rgmii-id", if the property was defined from the perspective of the
>> MAC, which it is not.
>>
>> Both the Ethernet PHY driver and the MAC driver need to take care of
>> adjusting the delays based on the phydev->interface value.
>>
>> The property you are introducing here is absolutely not appropriate
>> because it is entirely redundant with what 'phy-mode' already defines,
>> except the latter also covers a lot more cases.


-- 
Florian
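The "contract" reading of 'phy-mode' can be written down as a table: the string says which RGMII delays the PHY inserts, and the MAC is expected to provide whichever delay the PHY does not. The sketch below encodes the four RGMII modes as described in Documentation/devicetree/bindings/net/ethernet.txt. The struct and function names are hypothetical, and real drivers act on phydev->interface (the kernel's PHY_INTERFACE_MODE_RGMII* values) rather than parsing strings.

```c
#include <stdbool.h>
#include <string.h>

/* Which RGMII clock delays one side inserts. */
struct rgmii_delays {
	bool rx;
	bool tx;
};

/* PHY side of the contract for a 'phy-mode' string; returns 0 on success,
 * -1 for anything that is not one of the four RGMII modes. */
static int rgmii_phy_delays(const char *mode, struct rgmii_delays *phy)
{
	if (strcmp(mode, "rgmii") == 0)
		*phy = (struct rgmii_delays){ .rx = false, .tx = false };
	else if (strcmp(mode, "rgmii-id") == 0)
		*phy = (struct rgmii_delays){ .rx = true, .tx = true };
	else if (strcmp(mode, "rgmii-rxid") == 0)
		*phy = (struct rgmii_delays){ .rx = true, .tx = false };
	else if (strcmp(mode, "rgmii-txid") == 0)
		*phy = (struct rgmii_delays){ .rx = false, .tx = true };
	else
		return -1;
	return 0;
}

/* MAC side: exactly the delays the PHY does not provide. */
static struct rgmii_delays rgmii_mac_delays(struct rgmii_delays phy)
{
	return (struct rgmii_delays){ .rx = !phy.rx, .tx = !phy.tx };
}
```

For "rgmii" this yields a MAC that adds both delays, matching the reading above. Because both drivers derive their behavior from the same value, a vendor property that changes only one side is invisible to the other, which is the missing contract pointed out in the reply.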


Re: [linux-sunxi] Re: [PATCH 2/4] dt-bindings: add binding for RTL8211E Ethernet PHY

2017-05-04 Thread Icenowy Zheng


On May 5, 2017 2:21:29 AM GMT+08:00, Florian Fainelli wrote:
>On 05/04/2017 11:10 AM, icen...@aosc.io wrote:
>> On 2017-04-22 08:22, Florian Fainelli wrote:
>>> On 04/21/2017 04:24 PM, Icenowy Zheng wrote:
 From: Icenowy Zheng 

 Some RTL8211E Ethernet PHY have an issue that needs a workaround
 indicated with device tree.

 Add the binding for a property that indicates this workaround.

 Signed-off-by: Icenowy Zheng 
 ---
  .../devicetree/bindings/net/realtek,rtl8211e.txt   | 22 ++
  1 file changed, 22 insertions(+)
  create mode 100644 Documentation/devicetree/bindings/net/realtek,rtl8211e.txt

 diff --git a/Documentation/devicetree/bindings/net/realtek,rtl8211e.txt b/Documentation/devicetree/bindings/net/realtek,rtl8211e.txt
 new file mode 100644
 index ..c1913301bfe8
 --- /dev/null
 +++ b/Documentation/devicetree/bindings/net/realtek,rtl8211e.txt
 @@ -0,0 +1,22 @@
 +Realtek RTL8211E Ethernet PHY
 +
 +One batch of RTL8211E is slightly broken and needs some special (and
 +full of magic numbers) tweaking in order to make GbE operate properly.
 +The only well-known board that used the broken batch is Pine64+.
 +Configure it through an Ethernet OF device node.
 +
 +Optional properties:
 +
 +- realtek,disable-rx-delay:
 +  If set, RX delay will be completely disabled (according to Realtek). This
 +  will affect the performance on non-broken boards.
 +  default: do not disable RX delay.
>>>
>>> Please don't introduce custom properties to do that; instead, correctly
>>> specify the "phy-mode" such that it is e.g. "rgmii-txid", which indicates
>>> that there should be no RX internal delay, but a TX internal delay added
>>> by the PHY.
>> 
>> I checked the documentation; the meaning of "rgmii-txid" is not correct here.
>> 
>> This doesn't affect the MAC, and the MAC should still add TX delay.
>> 
>> The definition of "rgmii-txid" in
>> Documentation/devicetree/bindings/net/ethernet.txt is "RGMII with
>> internal TX delay provided by the PHY, the MAC should not add an TX delay
>> in this case". However, this does not indicate that the MAC doesn't add TX
>> delay; in fact, it just disables the RX delay that the PHY would provide.
>> The MAC should still add delay on both TX and RX, which is the semantic of
>> standard "rgmii".
>> 
>> So I cannot use "rgmii-txid" here, but should continue to use this
>> custom property.
>
>This is absolutely not a correct understanding. The 'phy-mode' property
>defines the contract between the MAC and PHY. It is defined from the
>PHY's perspective of the delay, which means that the MAC has to either
>also provide an adequate delay (RX or TX) or not (RX or TX). So if you
>specified 'phy-mode' = "rgmii" this means that the MAC needs to add the
>TX and RX delay, so implicitly this means that your MAC operates in

The MAC doesn't lose its responsibility to tweak the RX/TX delays when this
property is set.

The situation is that the PHY's RX delay tweaking function is broken. But that
doesn't mean the PHY can take over *all* responsibility for tweaking TX; the MAC
still needs to tweak TX.

>"rgmii-id", if the property was defined from the perspective of the
>MAC, which it is not.
>
>Both the Ethernet PHY driver and the MAC driver need to take care of
>adjusting the delays based on the phydev->interface value.
>
>The property you are introducing here is absolutely not appropriate
>because it is entirely redundant with what 'phy-mode' already defines,
>except the latter also covers a lot more cases.


Re: [PATCH] aquantia: Fix "ethtool -S" crash when adapter down.

2017-05-04 Thread David Arcari
On 05/04/2017 01:09 PM, Pavel Belous wrote:
> 
> 
> On 04.05.2017 19:51, David Miller wrote:
>> From: Lino Sanfilippo 
>> Date: Thu, 4 May 2017 18:48:12 +0200
>>
>>> Hi Pavel,
>>>
>>> On 04.05.2017 18:33, Pavel Belous wrote:
 From: Pavel Belous 

 This patch fixes the crash that happens when the driver tries to collect
 statistics from the already released "aq_vec" object.

 Fixes: 97bde5c4f909 ("net: ethernet: aquantia: Support for NIC-specific code")
 Signed-off-by: Pavel Belous 
 ---
  drivers/net/ethernet/aquantia/atlantic/aq_nic.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

 diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
 index cdb0299..3a32573 100644
 --- a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
 +++ b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
 @@ -755,7 +755,7 @@ void aq_nic_get_stats(struct aq_nic_s *self, u64 *data)
  count = 0U;

  for (i = 0U, aq_vec = self->aq_vec[0];
 -self->aq_vecs > i; ++i, aq_vec = self->aq_vec[i]) {
 +aq_vec && self->aq_vecs > i; ++i, aq_vec = self->aq_vec[i]) {
  data += count;
  aq_vec_get_sw_stats(aq_vec, data, &count);
  }
 @@ -961,6 +961,7 @@ void aq_nic_free_hot_resources(struct aq_nic_s *self)
  for (i = AQ_DIMOF(self->aq_vec); i--;) {
  if (self->aq_vec[i])
  aq_vec_free(self->aq_vec[i]);
 +self->aq_vec[i] = NULL;
  }

  err_exit:;

>>>
>>> if the driver does not support statistics when the interface is down, wouldn't
>>> it be clearer to check netif_running() in get_stats() instead?
>>
>> Yes, much cleaner.
>>
>> Much better would be to have a cached software copy so that statistics
>> can be reported regardless of whether the device is down or not.
>>
> 
> Thank you.
> I will think about how to do it better.

It appears that the adapter still reports the cumulative hardware stats
even while it's down. The user is just losing the per-queue stats.

Although the loss of the per-queue stats is not ideal, this patch still fixes a
crash.

It might be worthwhile to refactor this patch as a short-term solution and then
subsequently produce a version that contains cached statistics, assuming that
is amenable to everyone, of course.

-DA


> 
> Regards,
> Pavel
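David Miller suggested keeping a cached software copy of the statistics so they can be reported whether or not the device is up. A minimal sketch of that idea follows; it uses hypothetical demo_* types rather than the driver's real aq_nic_s/aq_vec_s structures, and it also clears the per-queue pointers on release, as the original patch does in aq_nic_free_hot_resources().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NUM_VECS 4
#define DEMO_STATS_PER_VEC 2

/* Hypothetical stand-in for the driver's per-queue object (aq_vec). */
struct demo_vec {
	uint64_t rx_packets;
	uint64_t tx_packets;
};

/* Hypothetical NIC: vec[i] is NULL while the queues are released. */
struct demo_nic {
	struct demo_vec *vec[DEMO_NUM_VECS];
	uint64_t cached[DEMO_NUM_VECS * DEMO_STATS_PER_VEC];
};

/* Refresh the cache from every live queue, then report the cache. While
 * the interface is down, the last values seen stay visible instead of
 * dereferencing a released queue. */
static void demo_get_stats(struct demo_nic *nic, uint64_t *data)
{
	size_t i;

	assert(nic != NULL && data != NULL);
	for (i = 0; i < DEMO_NUM_VECS; i++) {
		const struct demo_vec *v = nic->vec[i];

		if (!v)
			continue;
		nic->cached[i * DEMO_STATS_PER_VEC] = v->rx_packets;
		nic->cached[i * DEMO_STATS_PER_VEC + 1] = v->tx_packets;
	}
	memcpy(data, nic->cached, sizeof(nic->cached));
}

/* Take a final snapshot, then drop the queue pointers (the real driver
 * would free the objects here) so later readers see NULL, not a stale
 * pointer. */
static void demo_release_vecs(struct demo_nic *nic)
{
	uint64_t scratch[DEMO_NUM_VECS * DEMO_STATS_PER_VEC];
	size_t i;

	demo_get_stats(nic, scratch);
	for (i = 0; i < DEMO_NUM_VECS; i++)
		nic->vec[i] = NULL;
}
```

Unlike the loop condition in the patch (`aq_vec && self->aq_vecs > i`), which stops at the first missing queue, this sketch skips missing queues and still reports their last cached values.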



Re: [RFC iproute2 0/8] RDMA tool

2017-05-04 Thread Leon Romanovsky
On Thu, May 04, 2017 at 06:10:54PM +, Bart Van Assche wrote:
> On Thu, 2017-05-04 at 21:02 +0300, Leon Romanovsky wrote:
> > Following our discussion both in mailing list [1] and at the LPC 2016 [2],
> > we would like to propose this RDMA tool to be part of iproute2 package
> > and finally improve this situation.
>
> Hello Leon,
>
> Although I really appreciate your work: can you clarify why you would like to
> add *RDMA* functionality to an *IP routing* tool? I haven't found any motivation
> for adding RDMA functionality to iproute2 in [1].

We are planning to reuse the same infrastructure provided by iproute2,
like netlink parsing, availability in distributions, the same CLI and the
same standards.

Right now, RDMA is already tightly tied to netdev: iWARP, RoCE, IPoIB, HFI-VNIC.
Many drivers (mlx, qed, i40, cxgb) share code between net and
RDMA.

I do expect that iproute2 will be installed on every machine with any
type of connection, including IB and OPA.

So I think it makes sense to be part of that suite rather than invent
our own for one specific tool.

Thanks

>
> Thanks,
>
> Bart.




Re: [PATCH 2/4] dt-bindings: add binding for RTL8211E Ethernet PHY

2017-05-04 Thread Florian Fainelli
On 05/04/2017 11:10 AM, icen...@aosc.io wrote:
> On 2017-04-22 08:22, Florian Fainelli wrote:
>> On 04/21/2017 04:24 PM, Icenowy Zheng wrote:
>>> From: Icenowy Zheng 
>>>
>>> Some RTL8211E Ethernet PHYs have an issue that needs a workaround
>>> indicated with device tree.
>>>
>>> Add the binding for a property that indicates this workaround.
>>>
>>> Signed-off-by: Icenowy Zheng 
>>> ---
>>>  .../devicetree/bindings/net/realtek,rtl8211e.txt   | 22 ++
>>>  1 file changed, 22 insertions(+)
>>>  create mode 100644 Documentation/devicetree/bindings/net/realtek,rtl8211e.txt
>>>
>>> diff --git a/Documentation/devicetree/bindings/net/realtek,rtl8211e.txt b/Documentation/devicetree/bindings/net/realtek,rtl8211e.txt
>>> new file mode 100644
>>> index ..c1913301bfe8
>>> --- /dev/null
>>> +++ b/Documentation/devicetree/bindings/net/realtek,rtl8211e.txt
>>> @@ -0,0 +1,22 @@
>>> +Realtek RTL8211E Ethernet PHY
>>> +
>>> +One batch of RTL8211E is slight broken, that needs some special (and
>>> +full of magic numbers) tweaking in order to make GbE to operate properly.
>>> +The only well-known board that used the broken batch is Pine64+.
>>> +Configure it through an Ethernet OF device node.
>>> +
>>> +Optional properties:
>>> +
>>> +- realtek,disable-rx-delay:
>>> +  If set, RX delay will be completely disabled (according to Realtek). This
>>> +  will affect the performance on non-broken boards.
>>> +  default: do not disable RX delay.
>>
>> Please don't introduce custom properties to do that; instead, correctly
>> specify the "phy-mode" such that it is e.g. "rgmii-txid", which indicates
>> that there should be no RX internal delay, but a TX internal delay added
>> by the PHY.
> 
> I checked the documentation; the meaning of "rgmii-txid" is not correct here.
> 
> This doesn't affect the MAC, and the MAC should still add TX delay.
> 
> The definition of "rgmii-txid" in
> Documentation/devicetree/bindings/net/ethernet.txt is "RGMII with
> internal TX delay provided by the PHY, the MAC should not add an TX delay
> in this case". However, this does not indicate that the MAC doesn't add TX
> delay; in fact, it just disables the RX delay that the PHY would provide.
> The MAC should still add delay on both TX and RX, which is the semantic of
> standard "rgmii".
> 
> So I cannot use "rgmii-txid" here, but should continue to use this
> custom property.

This is absolutely not a correct understanding. The 'phy-mode' property
defines the contract between the MAC and PHY. It is defined from the
PHY's perspective of the delay, which means that the MAC has to either
also provide an adequate delay (RX or TX) or not (RX or TX). So if you
specified 'phy-mode' = "rgmii" this means that the MAC needs to add the
TX and RX delay, so implicitly this means that your MAC operates in
"rgmii-id", if the property was defined from the perspective of the MAC,
which it is not.

Both the Ethernet PHY driver and the MAC driver need to take care of
adjusting the delays based on the phydev->interface value.

The property you are introducing here is absolutely not appropriate
because it is entirely redundant with what 'phy-mode' already defines,
except the latter also covers a lot more cases.
-- 
Florian
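Florian's reading of 'phy-mode' (the property describes the delays from the PHY's side, and the MAC covers whatever the PHY does not) fits in a small truth table. The sketch below assumes a hypothetical enum mirroring the four RGMII strings (the kernel's own phy_interface_t uses PHY_INTERFACE_MODE_RGMII, _RGMII_ID, _RGMII_RXID and _RGMII_TXID for these) and ignores any delay contributed by PCB trace length.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the four RGMII "phy-mode" strings:
 * "rgmii", "rgmii-id", "rgmii-rxid", "rgmii-txid". */
enum rgmii_mode { RGMII, RGMII_ID, RGMII_RXID, RGMII_TXID };

/* phy-mode describes the PHY's side: does the PHY insert the RX delay? */
static bool phy_adds_rx_delay(enum rgmii_mode m)
{
	assert((unsigned int)m <= RGMII_TXID);
	return m == RGMII_ID || m == RGMII_RXID;
}

/* ...and does the PHY insert the TX delay? */
static bool phy_adds_tx_delay(enum rgmii_mode m)
{
	assert((unsigned int)m <= RGMII_TXID);
	return m == RGMII_ID || m == RGMII_TXID;
}

/* The MAC covers whichever direction the PHY leaves undelayed
 * (PCB trace delays are deliberately ignored in this sketch). */
static bool mac_adds_rx_delay(enum rgmii_mode m)
{
	return !phy_adds_rx_delay(m);
}

static bool mac_adds_tx_delay(enum rgmii_mode m)
{
	return !phy_adds_tx_delay(m);
}
```

Under this reading, "rgmii" leaves both delays to the MAC, while the suggested "rgmii-txid" means the PHY adds only the TX delay and the MAC is expected to supply the RX delay.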


Re: [PATCH] aquantia: Fix "ethtool -S" crash when adapter down.

2017-05-04 Thread Joe Perches
On Thu, 2017-05-04 at 20:08 +0300, Pavel Belous wrote:
> I will prepare another patch addressing Lino's and David M.'s comments.

I'm not submitting this because it'd just cause merge conflicts,
but something you could do one day is remove the AQ_DIMOF macro
and just use ARRAY_SIZE directly.
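The loops touched by this cleanup use the countdown idiom `for (i = ARRAY_SIZE(x); i--;)`. A minimal userspace sketch of how it behaves, assuming a simplified ARRAY_SIZE (the kernel macro additionally adds `__must_be_array()` so that passing a pointer fails to compile) and a made-up demo array:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace version of the kernel's ARRAY_SIZE(). */
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Hypothetical demo data. */
static const unsigned int demo_vals[] = { 3, 5, 7, 11 };

/* Same countdown idiom as "for (i = ARRAY_SIZE(self->aq_vec); i--;)":
 * i-- tests i and then decrements it, so the body sees N-1 down to 0
 * and the loop exits after index 0 without running the body again. */
static size_t copy_reversed(unsigned int *out)
{
	size_t i, n = 0;

	assert(out != NULL);
	for (i = ARRAY_SIZE(demo_vals); i--;)
		out[n++] = demo_vals[i];
	return n;
}
```

Because the size comes from the array type itself, the loop bound stays correct if the array length changes, which is the main argument for using ARRAY_SIZE directly rather than a driver-private alias.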
---
 drivers/net/ethernet/aquantia/atlantic/aq_nic.c  | 4 ++--
 drivers/net/ethernet/aquantia/atlantic/aq_utils.h| 2 --
 drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_a0.c| 2 +-
 drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c| 2 +-
 drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c | 2 +-
 5 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
index cdb02991f249..cffae53414ba 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
@@ -154,7 +154,7 @@ static void aq_nic_service_timer_cb(unsigned long param)
 
memset(_rx, 0U, sizeof(struct aq_ring_stats_rx_s));
memset(_tx, 0U, sizeof(struct aq_ring_stats_tx_s));
-   for (i = AQ_DIMOF(self->aq_vec); i--;) {
+   for (i = ARRAY_SIZE(self->aq_vec); i--;) {
if (self->aq_vec[i])
aq_vec_add_stats(self->aq_vec[i], _rx, _tx);
}
@@ -958,7 +958,7 @@ void aq_nic_free_hot_resources(struct aq_nic_s *self)
if (!self)
goto err_exit;
 
-   for (i = AQ_DIMOF(self->aq_vec); i--;) {
+   for (i = ARRAY_SIZE(self->aq_vec); i--;) {
if (self->aq_vec[i])
aq_vec_free(self->aq_vec[i]);
}
diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_utils.h b/drivers/net/ethernet/aquantia/atlantic/aq_utils.h
index f6012b34abe6..64a8c3c781ff 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_utils.h
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_utils.h
@@ -14,8 +14,6 @@
 
 #include "aq_common.h"
 
-#define AQ_DIMOF(_ARY_)  ARRAY_SIZE(_ARY_)
-
 struct aq_obj_s {
spinlock_t lock; /* spinlock for nic/rings processing */
atomic_t flags;
diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_a0.c b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_a0.c
index 4ee15ff06a44..96c3360e7060 100644
--- a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_a0.c
+++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_a0.c
@@ -182,7 +182,7 @@ static int hw_atl_a0_hw_rss_set(struct aq_hw_s *self,
((i * 3U) & 0xFU));
}
 
-   for (i = AQ_DIMOF(bitary); i--;) {
+   for (i = ARRAY_SIZE(bitary); i--;) {
rpf_rss_redir_tbl_wr_data_set(self, bitary[i]);
rpf_rss_redir_tbl_addr_set(self, i);
rpf_rss_redir_wr_en_set(self, 1U);
diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c
index 42150708191d..5a19eba31786 100644
--- a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c
+++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c
@@ -182,7 +182,7 @@ static int hw_atl_b0_hw_rss_set(struct aq_hw_s *self,
((i * 3U) & 0xFU));
}
 
-   for (i = AQ_DIMOF(bitary); i--;) {
+   for (i = ARRAY_SIZE(bitary); i--;) {
rpf_rss_redir_tbl_wr_data_set(self, bitary[i]);
rpf_rss_redir_tbl_addr_set(self, i);
rpf_rss_redir_wr_en_set(self, 1U);
diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c
index 8d6d8f5804da..922af5f36d37 100644
--- a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c
+++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c
@@ -385,7 +385,7 @@ int hw_atl_utils_get_mac_permanent(struct aq_hw_s *self,
aq_hw_read_reg(self, 0x0374U) +
(40U * 4U),
mac_addr,
-   AQ_DIMOF(mac_addr));
+   ARRAY_SIZE(mac_addr));
if (err < 0) {
mac_addr[0] = 0U;
mac_addr[1] = 0U;



Re: [PATCH 2/4] dt-bindings: add binding for RTL8211E Ethernet PHY

2017-05-04 Thread icenowy

On 2017-04-22 08:22, Florian Fainelli wrote:

> On 04/21/2017 04:24 PM, Icenowy Zheng wrote:
>> From: Icenowy Zheng 
>>
>> Some RTL8211E Ethernet PHYs have an issue that needs a workaround
>> indicated with device tree.
>>
>> Add the binding for a property that indicates this workaround.
>>
>> Signed-off-by: Icenowy Zheng 
>> ---
>>  .../devicetree/bindings/net/realtek,rtl8211e.txt   | 22 ++
>>  1 file changed, 22 insertions(+)
>>  create mode 100644 Documentation/devicetree/bindings/net/realtek,rtl8211e.txt
>>
>> diff --git a/Documentation/devicetree/bindings/net/realtek,rtl8211e.txt b/Documentation/devicetree/bindings/net/realtek,rtl8211e.txt
>> new file mode 100644
>> index ..c1913301bfe8
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/net/realtek,rtl8211e.txt
>> @@ -0,0 +1,22 @@
>> +Realtek RTL8211E Ethernet PHY
>> +
>> +One batch of RTL8211E is slight broken, that needs some special (and
>> +full of magic numbers) tweaking in order to make GbE to operate properly.
>> +The only well-known board that used the broken batch is Pine64+.
>> +Configure it through an Ethernet OF device node.
>> +
>> +Optional properties:
>> +
>> +- realtek,disable-rx-delay:
>> +  If set, RX delay will be completely disabled (according to Realtek). This
>> +  will affect the performance on non-broken boards.
>> +  default: do not disable RX delay.
>
> Please don't introduce custom properties to do that; instead, correctly
> specify the "phy-mode" such that it is e.g. "rgmii-txid", which indicates
> that there should be no RX internal delay, but a TX internal delay added
> by the PHY.


I checked the documentation; the meaning of "rgmii-txid" is not correct here.

This doesn't affect the MAC, and the MAC should still add TX delay.

The definition of "rgmii-txid" in
Documentation/devicetree/bindings/net/ethernet.txt is "RGMII with
internal TX delay provided by the PHY, the MAC should not add an TX delay
in this case". However, this does not indicate that the MAC doesn't add TX
delay; in fact, it just disables the RX delay that the PHY would provide.
The MAC should still add delay on both TX and RX, which is the semantic of
standard "rgmii".

So I cannot use "rgmii-txid" here, but should continue to use this
custom property.


Re: [RFC iproute2 0/8] RDMA tool

2017-05-04 Thread Bart Van Assche
On Thu, 2017-05-04 at 21:02 +0300, Leon Romanovsky wrote:
> Following our discussion both in mailing list [1] and at the LPC 2016 [2],
> we would like to propose this RDMA tool to be part of iproute2 package
> and finally improve this situation.

Hello Leon,

Although I really appreciate your work: can you clarify why you would like to
add *RDMA* functionality to an *IP routing* tool? I haven't found any motivation
for adding RDMA functionality to iproute2 in [1].

Thanks,

Bart.

[RFC iproute2 8/8] rdma: Add link capability parsing

2017-05-04 Thread Leon Romanovsky
From: Leon Romanovsky 

Add parsing interface for the cap_mask

$ ./rdma/rdma link show mlx5_2/2 cap_mask
3/2: mlx5_2/2: sm off notice off trap on opt_ipd off auto_migr off sl_map on mkey_nvram off
	pkey_nvram off led_info off sm_disabled off sys_image_guid on pkey_sw_ext_port_trap off
	extended_speeds on cm on snmp_tunnel off reinit off device_mgmt off vendor_class on dr_notice off
	cap_mask_notice on boot_mgmt off link_latency off client_reg on ip_based_gids on
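The print_cap_mask() helper in this patch spells out one PRINT_PORT_CAP() call per bit. A table-driven variant is sketched below as a design alternative; it is a standalone illustration, not code from this series. It covers only a subset of the bits, reuses the bit positions from print_cap_mask(), and assumes the sysfs cap_mask attribute is a hex string such as "0x0000000a" (strtoul() with base 16 accepts an optional 0x prefix).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* One named bit of the port capability mask. */
struct cap_bit {
	const char *name;
	unsigned int bit;
};

/* Subset of the bits decoded by print_cap_mask(), same bit positions. */
static const struct cap_bit cap_bits[] = {
	{ "sm", 1 },      { "notice", 2 },    { "trap", 3 },
	{ "opt_ipd", 4 }, { "auto_migr", 5 }, { "sl_map", 6 },
	{ "cm", 16 },     { "ip_based_gids", 26 },
};

/* Parse the hex text read from sysfs. */
static uint32_t parse_cap_mask(const char *text)
{
	return (uint32_t)strtoul(text, NULL, 16);
}

/* Append " name on|off" for each table entry; if the buffer runs out,
 * drop the partial entry instead of printing half of it. Returns the
 * number of characters written. */
static size_t format_cap_mask(uint32_t cap_mask, char *buf, size_t len)
{
	size_t used = 0, i;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	for (i = 0; i < sizeof(cap_bits) / sizeof(cap_bits[0]); i++) {
		int n = snprintf(buf + used, len - used, " %s %s",
				 cap_bits[i].name,
				 ((cap_mask >> cap_bits[i].bit) & 1) ? "on" : "off");

		if (n < 0 || (size_t)n >= len - used) {
			buf[used] = '\0';
			break;
		}
		used += (size_t)n;
	}
	return used;
}
```

With this layout, supporting a new capability means adding one table row rather than another macro call.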

Signed-off-by: Leon Romanovsky 
---
 rdma/link.c  | 74 +---
 rdma/rdma.h  |  4 
 rdma/utils.c |  4 ++--
 3 files changed, 67 insertions(+), 15 deletions(-)

diff --git a/rdma/link.c b/rdma/link.c
index e86ff399..e9880914 100644
--- a/rdma/link.c
+++ b/rdma/link.c
@@ -14,13 +14,48 @@
 static int link_help(struct rdma *rd)
 {
pr_out("Usage: %s link show [ DEV | DEV/PORT ]\n", rd->filename);
+   pr_out("   %s link show [ DEV | DEV/PORT ] cap_mask\n", rd->filename);
pr_out("   %s link set DEV/PORT { type { eth | ib | auto } |\n", rd->filename);
pr_out("lb_unicast { on | off } |\n");
pr_out("lb_multicast { on | off } }\n");
return 0;
 }

-static void dev_one_show(const struct dev_map *dev_map, uint32_t port_idx_first, uint32_t port_idx_last)
+static void print_cap_mask(uint32_t cap_mask)
+{
+#define PRINT_PORT_CAP(name, val, offset)  (printf(" %s %s", name, (((val) >> (offset))&0x1)?"on":"off"))
+
+   /* Naive copy/paste from include/rdma/ib_verbs.h */
+   PRINT_PORT_CAP("sm", cap_mask, 1);
+   PRINT_PORT_CAP("notice", cap_mask, 2);
+   PRINT_PORT_CAP("trap", cap_mask, 3);
+   PRINT_PORT_CAP("opt_ipd", cap_mask, 4);
+   PRINT_PORT_CAP("auto_migr", cap_mask, 5);
+   PRINT_PORT_CAP("sl_map", cap_mask, 6);
+   PRINT_PORT_CAP("mkey_nvram", cap_mask, 7);
+   printf("\n\t");
+   PRINT_PORT_CAP("pkey_nvram", cap_mask, 8);
+   PRINT_PORT_CAP("led_info", cap_mask, 9);
+   PRINT_PORT_CAP("sm_disabled", cap_mask, 10);
+   PRINT_PORT_CAP("sys_image_guid", cap_mask, 11);
+   PRINT_PORT_CAP("pkey_sw_ext_port_trap", cap_mask, 12);
+   printf("\n\t");
+   PRINT_PORT_CAP("extended_speeds", cap_mask, 14);
+   PRINT_PORT_CAP("cm", cap_mask, 16);
+   PRINT_PORT_CAP("snmp_tunnel", cap_mask, 17);
+   PRINT_PORT_CAP("reinit", cap_mask, 18);
+   PRINT_PORT_CAP("device_mgmt", cap_mask, 19);
+   PRINT_PORT_CAP("vendor_class", cap_mask, 20);
+   PRINT_PORT_CAP("dr_notice", cap_mask, 21);
+   printf("\n\t");
+   PRINT_PORT_CAP("cap_mask_notice", cap_mask, 22);
+   PRINT_PORT_CAP("boot_mgmt", cap_mask, 23);
+   PRINT_PORT_CAP("link_latency", cap_mask, 24);
+   PRINT_PORT_CAP("client_reg", cap_mask, 25);
+   PRINT_PORT_CAP("ip_based_gids", cap_mask, 26);
+}
+static void dev_one_show(struct rdma *rd, const struct dev_map *dev_map,
+uint32_t port_idx_first, uint32_t port_idx_last)
 {
char *nodes[] = { "cap_mask",
  "lid",
@@ -35,23 +70,36 @@ static void dev_one_show(const struct dev_map *dev_map, uint32_t port_idx_first,

struct port_map *port_map;
char data[4096];
+   uint32_t cap_mask;
+   bool cap_mask_r = false;
int i, j;

+   rd_arg_inc(rd);
+   if (rd_argv_match(rd, "cap_mask"))
+   cap_mask_r = true;
+
for(j = port_idx_first ; j <= port_idx_last; j++) {
pr_out("%u/%u: %s/%u:", dev_map->idx, j, dev_map->dev_name, j);
-   list_for_each_entry(port_map, &dev_map->port_map_list, list)
-   if (j == port_map->idx)
-  printf(" ifname %s", (port_map->ifname)?:"NONE");
+   if (cap_mask_r) {
+   rdma_sysfs_read_ib(dev_map->dev_name, 1, nodes[0], data);
+   cap_mask = strtoul(data, NULL, 16);
+   print_cap_mask(cap_mask);
+   }
+   else {
+   list_for_each_entry(port_map, &dev_map->port_map_list, list)
+   if (j == port_map->idx)
+  printf(" ifname %s", (port_map->ifname)?:"NONE");

-   for (i = 0 ; nodes[i] ; i++) {
-   if (rdma_sysfs_read_ib(dev_map->dev_name, j, nodes[i], data))
-   continue;
+   for (i = 0 ; nodes[i] ; i++) {
+   if (rdma_sysfs_read_ib(dev_map->dev_name, j, nodes[i], data))
+   continue;

-   /* Split line before "phys_state" */
-   if (!strcmp(nodes[i], "phys_state"))
-   printf("\n\t");
+   /* Split 

[RFC iproute2 6/8] rdma: add stubs for future objects

2017-05-04 Thread Leon Romanovsky
From: Leon Romanovsky 

The following objects (monitor, providers, stats and protocols) are not
implemented yet; however, it is worth placing their stubs in the code.

This will serve as an initial starting point for other developers to
extend the RDMA tool.

Signed-off-by: Leon Romanovsky 
---
 rdma/Makefile|  2 +-
 rdma/monitor.c   | 22 ++
 rdma/protocols.c | 22 ++
 rdma/providers.c | 28 
 rdma/rdma.c  |  6 +-
 rdma/rdma.h  |  4 
 rdma/stats.c | 22 ++
 7 files changed, 104 insertions(+), 2 deletions(-)
 create mode 100644 rdma/monitor.c
 create mode 100644 rdma/protocols.c
 create mode 100644 rdma/providers.c
 create mode 100644 rdma/stats.c

diff --git a/rdma/Makefile b/rdma/Makefile
index 5cf0d29f..eb71da68 100644
--- a/rdma/Makefile
+++ b/rdma/Makefile
@@ -1,6 +1,6 @@
 include ../Config

-RDMA_OBJ = rdma.o utils.o dev.o link.o ipoib.o memory.o
+RDMA_OBJ = rdma.o utils.o dev.o link.o ipoib.o memory.o stats.o protocols.o providers.o monitor.o
 TARGETS=rdma

 all:   $(TARGETS) $(LIBS)
diff --git a/rdma/monitor.c b/rdma/monitor.c
new file mode 100644
index ..99d4b042
--- /dev/null
+++ b/rdma/monitor.c
@@ -0,0 +1,22 @@
+/*
+ * monitor.c   RDMA tool
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ *
+ * Authors: Leon Romanovsky 
+ */
+
+#include "rdma.h"
+
+int obj_monitor(struct rdma *rd)
+{
+   if (dev_map_init(rd)) {
+   pr_err("There are no RDMA devices\n");
+   return -ENOENT;
+   }
+
+   return 0;
+}
diff --git a/rdma/protocols.c b/rdma/protocols.c
new file mode 100644
index ..26de7d2b
--- /dev/null
+++ b/rdma/protocols.c
@@ -0,0 +1,22 @@
+/*
+ * protocols.c RDMA tool
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ *
+ * Authors: Leon Romanovsky 
+ */
+
+#include "rdma.h"
+
+int obj_protocols(struct rdma *rd)
+{
+   if (dev_map_init(rd)) {
+   pr_err("There are no RDMA devices\n");
+   return -ENOENT;
+   }
+
+   return 0;
+}
diff --git a/rdma/providers.c b/rdma/providers.c
new file mode 100644
index ..8d516cca
--- /dev/null
+++ b/rdma/providers.c
@@ -0,0 +1,28 @@
+/*
+ * providers.c RDMA tool
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ *
+ * Authors: Leon Romanovsky 
+ */
+
+#include "rdma.h"
+
+static void providers_help(char *filename)
+{
+   pr_out("Usage: %s providers show [ DEV ]\n", filename);
+}
+
+int obj_providers(struct rdma *rd)
+{
+   if (dev_map_init(rd)) {
+   pr_err("There are no RDMA devices\n");
+   return -ENOENT;
+   }
+
+   providers_help(rd->filename);
+   return 0;
+}
diff --git a/rdma/rdma.c b/rdma/rdma.c
index 094d490d..a0a3ec81 100644
--- a/rdma/rdma.c
+++ b/rdma/rdma.c
@@ -17,7 +17,7 @@
 static void help(char *name)
 {
pr_out("Usage: %s [ OPTIONS ] OBJECT { COMMAND | help }\n"
-  "where  OBJECT := { dev | link | ipoib | memory }\n"
+  "where  OBJECT := { dev | link | ipoib | memory | stats | protocols | providers | monitor }\n"
   "   OPTIONS := { -V[ersion] }\n", name);
 }

@@ -35,6 +35,10 @@ static int rd_cmd(struct rdma *rd)
{ "link",   obj_link },
{ "ipoib",  obj_ipoib },
{ "memory", obj_memory },
+   { "stats",  obj_stats },
+   { "providers",  obj_providers },
+   { "protocols",  obj_protocols },
+   { "monitor",obj_monitor },
{ "help",   obj_help },
{ 0 }
};
diff --git a/rdma/rdma.h b/rdma/rdma.h
index dcff066f..11d940d7 100644
--- a/rdma/rdma.h
+++ b/rdma/rdma.h
@@ -66,6 +66,10 @@ int obj_dev(struct rdma *rd);
 int obj_link(struct rdma *rd);
 int obj_ipoib(struct rdma *rd);
 int obj_memory(struct rdma *rd);
+int obj_protocols(struct rdma *rd);
+int obj_stats(struct rdma *rd);
+int obj_providers(struct rdma *rd);
+int obj_monitor(struct rdma *rd);

 /*
  * Parser interface
diff --git a/rdma/stats.c b/rdma/stats.c
new 

[RFC iproute2 V1 7/8] man: rdma.8: Document objects and commands

2017-05-04 Thread Leon Romanovsky
From: Leon Romanovsky 

Signed-off-by: Leon Romanovsky 
---
 man/man8/Makefile |   3 +-
 man/man8/rdma.8   | 109 ++
 2 files changed, 111 insertions(+), 1 deletion(-)
 create mode 100644 man/man8/rdma.8

diff --git a/man/man8/Makefile b/man/man8/Makefile
index f3318644..81979a07 100644
--- a/man/man8/Makefile
+++ b/man/man8/Makefile
@@ -19,7 +19,8 @@ MAN8PAGES = $(TARGETS) ip.8 arpd.8 lnstat.8 routel.8 rtacct.8 rtmon.8 rtpr.8 ss.
tc-simple.8 tc-skbedit.8 tc-vlan.8 tc-xt.8 tc-skbmod.8 tc-ife.8 \
tc-tunnel_key.8 tc-sample.8 \
devlink.8 devlink-dev.8 devlink-monitor.8 devlink-port.8 devlink-sb.8 \
-   ifstat.8
+   ifstat.8 \
+   rdma.8

 all: $(TARGETS)

diff --git a/man/man8/rdma.8 b/man/man8/rdma.8
new file mode 100644
index ..410b3d7d
--- /dev/null
+++ b/man/man8/rdma.8
@@ -0,0 +1,109 @@
+.TH RDMA 8 "28 Mar 2017" "iproute2" "Linux"
+.SH NAME
+rdma \- RDMA tool
+.SH SYNOPSIS
+.sp
+.ad l
+.in +8
+.ti -8
+.B rdma
+.RI "[ " OPTIONS " ] " OBJECT " { " COMMAND " | "
+.BR help " }"
+.sp
+
+.ti -8
+.IR OBJECT " := { "
+.BR dev " | " link " | " protocol " | " stats " | " monitor " | " memory " | " ipoib " | " provider "}"
+.sp
+
+.ti -8
+.IR OPTIONS " := { "
+\fB\-V\fR[\fIersion\fR] }
+
+.SH OPTIONS
+
+.TP
+.BR "\-V" , " -Version"
+Print the version of the
+.B rdma
+tool and exit.
+
+.SS
+.I OBJECT
+
+.TP
+.B dev
+- RDMA device.
+
+.TP
+.B link
+- RDMA port related.
+
+.TP
+.B protocol
+- RDMA protocol.
+
+.TP
+.B monitor
+- watch for netlink messages.
+
+.TP
+.B memory
+- configure memory related operations.
+
+.TP
+.B ipoib
+- configure IPoIB.
+
+.TP
+.B provider
+- provider specific configurations.
+
+.PP
+The names of all objects may be written in full or
+abbreviated form, for example
+.B stats
+can be abbreviated as
+.B stat
+or just
+.B s.
+
+.SS
+.I COMMAND
+
+Specifies the action to perform on the object.
+The set of possible actions depends on the object type.
+As a rule, it is possible to
+.B show
+(or
+.B list
+) objects, but some objects do not allow all of these operations
+or have some additional commands. The
+.B help
+command is available for all objects. It prints
+out a list of available commands and argument syntax conventions.
+.sp
+If no command is given, some default command is assumed.
+Usually it is
+.B list
+or, if the objects of this class cannot be listed,
+.BR "help" .
+
+.SH EXIT STATUS
+Exit status is 0 if command was successful or a positive integer upon failure.
+
+.SH SEE ALSO
+.BR rdma-link (8),
+.BR rdma-protocol (8),
+.BR rdma-monitor (8),
+.BR rdma-stats (8),
+.br
+
+.SH REPORTING BUGS
+Report any bugs to the Linux RDMA mailing list
+.B 
+where the development and maintenance is primarily done.
+You do not have to be subscribed to the list to send a message there.
+
+.SH AUTHOR
+Leon Romanovsky 
--
2.12.2
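The man page above states that object names may be written in full or abbreviated (for example `stats`, `stat` or `s`). How rd_cmd() resolves abbreviations is not part of these excerpts, and iproute2 has its own helpers for this; the following is only a standalone sketch of prefix-based dispatch, with hypothetical demo_* handlers standing in for obj_dev(), obj_link() and obj_stats().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical handlers standing in for the real obj_* functions. */
static int demo_dev(void)   { return 1; }
static int demo_link(void)  { return 2; }
static int demo_stats(void) { return 3; }

struct demo_obj {
	const char *name;
	int (*fn)(void);
};

/* Table order doubles as the tie-breaker for ambiguous prefixes. */
static const struct demo_obj demo_objs[] = {
	{ "dev",   demo_dev },
	{ "link",  demo_link },
	{ "stats", demo_stats },
	{ NULL,    NULL },
};

/* True if arg is a non-empty prefix of name, e.g. "s" or "stat" for "stats". */
static int is_abbrev(const char *arg, const char *name)
{
	size_t len = strlen(arg);

	return len > 0 && strncmp(arg, name, len) == 0;
}

/* Run the first object whose name arg abbreviates; -1 if none matches. */
static int demo_dispatch(const char *arg)
{
	const struct demo_obj *o;

	assert(arg != NULL);
	for (o = demo_objs; o->name; o++)
		if (is_abbrev(arg, o->name))
			return o->fn();
	return -1;
}
```

Because the first match wins, the order of entries decides which object a short prefix selects once two names share a leading letter, so a table like this needs its ordering chosen deliberately.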



[RFC iproute2 5/8] rdma: Add memory object

2017-05-04 Thread Leon Romanovsky
From: Leon Romanovsky 

The memory object gives the user the ability to manipulate general
memory properties of specific devices. These properties are broader in
scope than what the dev object can cover.

For example, on-demand paging (ODP) configurations are mostly software related.

Signed-off-by: Leon Romanovsky 
---
 rdma/Makefile |  2 +-
 rdma/memory.c | 30 ++
 rdma/rdma.c   |  3 ++-
 rdma/rdma.h   |  1 +
 4 files changed, 34 insertions(+), 2 deletions(-)
 create mode 100644 rdma/memory.c

diff --git a/rdma/Makefile b/rdma/Makefile
index dd702b9f..5cf0d29f 100644
--- a/rdma/Makefile
+++ b/rdma/Makefile
@@ -1,6 +1,6 @@
 include ../Config

-RDMA_OBJ = rdma.o utils.o dev.o link.o ipoib.o
+RDMA_OBJ = rdma.o utils.o dev.o link.o ipoib.o memory.o
 TARGETS=rdma

 all:   $(TARGETS) $(LIBS)
diff --git a/rdma/memory.c b/rdma/memory.c
new file mode 100644
index ..68fd5dd3
--- /dev/null
+++ b/rdma/memory.c
@@ -0,0 +1,30 @@
+/*
+ * memory.cRDMA tool
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ *
+ * Authors: Leon Romanovsky 
+ */
+
+#include "rdma.h"
+
+static void memory_help(char *filename)
+{
+   pr_out("Usage: %s memory show [ DEV ]\n", filename);
+   pr_out("   %s memory set DEV { odp { off | on } |\n", filename);
+   pr_out("   %s  memic SIZE }\n", filename);
+}
+
+int obj_memory(struct rdma *rd)
+{
+   if (dev_map_init(rd)) {
+   pr_err("There are no RDMA devices\n");
+   return -ENOENT;
+   }
+
+   memory_help(rd->filename);
+   return 0;
+}
diff --git a/rdma/rdma.c b/rdma/rdma.c
index ffd70899..094d490d 100644
--- a/rdma/rdma.c
+++ b/rdma/rdma.c
@@ -17,7 +17,7 @@
 static void help(char *name)
 {
pr_out("Usage: %s [ OPTIONS ] OBJECT { COMMAND | help }\n"
-  "where  OBJECT := { dev | link | ipoib }\n"
+  "where  OBJECT := { dev | link | ipoib | memory }\n"
   "   OPTIONS := { -V[ersion] }\n", name);
 }

@@ -34,6 +34,7 @@ static int rd_cmd(struct rdma *rd)
{ "dev",obj_dev },
{ "link",   obj_link },
{ "ipoib",  obj_ipoib },
+   { "memory", obj_memory },
{ "help",   obj_help },
{ 0 }
};
diff --git a/rdma/rdma.h b/rdma/rdma.h
index 1fef4eb8..dcff066f 100644
--- a/rdma/rdma.h
+++ b/rdma/rdma.h
@@ -65,6 +65,7 @@ struct rdma_obj {
 int obj_dev(struct rdma *rd);
 int obj_link(struct rdma *rd);
 int obj_ipoib(struct rdma *rd);
+int obj_memory(struct rdma *rd);

 /*
  * Parser interface
--
2.12.2



[RFC iproute2 2/8] rdma: Add dev object

2017-05-04 Thread Leon Romanovsky
From: Leon Romanovsky 

Device (dev) object represents struct ib_device to user space.

The supported commands are show, set and help.

Print all devices:
 # rdma dev
1: mlx5_0: board_id MT_2190110032 fw_pages 261002 fw_ver 12.17.2046 hca_type MT4115 hw_rev 0
	node_desc hpchead HCA-1 node_guid e41d:2d03:0066:dee6 node_type 1: CA reg_pages 0
	sys_image_guid e41d:2d03:0066:dee6
2: mlx5_1: board_id MT_2190110032 fw_pages 250793 fw_ver 12.17.2046 hca_type MT4115 hw_rev 0
	node_desc hpchead HCA-2 node_guid e41d:2d03:0066:dee7 node_type 1: CA reg_pages 0
	sys_image_guid e41d:2d03:0066:dee6
3: mlx5_2: board_id MT_1210110019 fw_pages 68067 fw_ver 10.16.1020 hca_type MT4113 hw_rev 0
	node_desc hpchead HCA-3 node_guid 0002:c903:0016:75b0 node_type 1: CA reg_pages 0
	sys_image_guid 0002:c903:0016:75b0

Print specific device:
 # rdma dev show mlx5_1
2: mlx5_1: board_id MT_2190110032 fw_pages 250793 fw_ver 12.17.2046 hca_type MT4115 hw_rev 0
	node_desc hpchead HCA-2 node_guid e41d:2d03:0066:dee7 node_type 1: CA reg_pages 0
	sys_image_guid e41d:2d03:0066:dee6
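The dev_one_show() function in this patch relies on rdma_sysfs_read_ib() from utils.c, which this excerpt does not include. A hypothetical stand-in is sketched below, assuming the usual /sys/class/infiniband/DEV/ATTR layout for device attributes and /sys/class/infiniband/DEV/ports/PORT/ATTR for per-port ones, with port 0 meaning "device-level" as the calls in dev_one_show() suggest.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the sysfs path for a device attribute (port == 0) or a per-port
 * attribute (port >= 1). Returns 0 on success, -1 if buf is too small. */
static int ib_sysfs_path(char *buf, size_t len, const char *dev,
			 unsigned int port, const char *attr)
{
	int n;

	assert(buf != NULL && len > 0);
	if (port == 0)
		n = snprintf(buf, len, "/sys/class/infiniband/%s/%s", dev, attr);
	else
		n = snprintf(buf, len, "/sys/class/infiniband/%s/ports/%u/%s",
			     dev, port, attr);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read the first line of an attribute into data and strip the trailing
 * newline. Returns 0 on success, -1 if the attribute cannot be read. */
static int ib_sysfs_read(const char *dev, unsigned int port,
			 const char *attr, char *data, size_t len)
{
	char path[256];
	FILE *f;

	if (ib_sysfs_path(path, sizeof(path), dev, port, attr))
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(data, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	data[strcspn(data, "\n")] = '\0';
	return 0;
}
```

Returning -1 for a missing file matches how dev_one_show() treats a failed read: the attribute is simply skipped, which is what lets the hfi1-specific names share one list with the generic ones.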

Signed-off-by: Leon Romanovsky 
---
 rdma/Makefile |   2 +-
 rdma/dev.c| 101 ++
 rdma/rdma.c   |   3 +-
 rdma/rdma.h   |   5 +++
 4 files changed, 109 insertions(+), 2 deletions(-)
 create mode 100644 rdma/dev.c

diff --git a/rdma/Makefile b/rdma/Makefile
index 65248b31..67e349b0 100644
--- a/rdma/Makefile
+++ b/rdma/Makefile
@@ -1,6 +1,6 @@
 include ../Config

-RDMA_OBJ = rdma.o utils.o
+RDMA_OBJ = rdma.o utils.o dev.o
 TARGETS=rdma

 all:   $(TARGETS) $(LIBS)
diff --git a/rdma/dev.c b/rdma/dev.c
new file mode 100644
index ..e6d71035
--- /dev/null
+++ b/rdma/dev.c
@@ -0,0 +1,101 @@
+/*
+ * dev.c   RDMA tool
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ *
+ * Authors: Leon Romanovsky 
+ */
+
+#include "rdma.h"
+
+static int dev_help(struct rdma *rd)
+{
+   pr_out("Usage: %s dev show [DEV]\n", rd->filename);
+   pr_out("   %s dev set DEV [ node_desc { DESCRIPTION } ]\n", rd->filename);
+   /* Add masking of device capabilities */
+   return 0;
+}
+
+static void dev_one_show(const struct dev_map *dev_map)
+{
+   char *nodes[] = { "board_id",
+ "fw_pages",
+ "fw_ver",
+ "hca_type",
+ "hw_rev",
+ "node_desc",
+ "node_guid",
+ "node_type",
+ "reg_pages",
+ "sys_image_guid",
+ /* hfi1 specific */
+ "nctxts",
+ "nfreectxts",
+ "serial",
+ "boardversion",
+ "tempsense",
+ NULL };
+
+   char data[4096];
+   int i;
+   pr_out("%u: %s:", dev_map->idx, dev_map->dev_name);
+   for (i = 0 ; nodes[i] ; i++) {
+   if (rdma_sysfs_read_ib(dev_map->dev_name, 0, nodes[i], data))
+   continue;
+
+   /* Split line before "node_desc" and "sys_image_guid" */
+   if (!strcmp(nodes[i], "node_desc") ||
+   !strcmp(nodes[i], "sys_image_guid"))
+   printf("\n\t");
+
+   pr_out(" %s %s", nodes[i], data);
+   }
+   pr_out("\n");
+}
+
+static int dev_show(struct rdma *rd)
+{
+   struct dev_map *dev_map;
+
+   if (rd_no_arg(rd)) {
+   list_for_each_entry(dev_map, &rd->dev_map_list, list)
+   dev_one_show(dev_map);
+   }
+   else {
+   dev_map = dev_map_lookup(rd, false);
+   if (!dev_map) {
+   pr_err("Wrong device name\n");
+   return -ENOENT;
+   }
+   dev_one_show(dev_map);
+   }
+   return 0;
+}
+
+static int dev_set(struct rdma *rd)
+{
+   /* Not implemented yet */
+   return 0;
+}
+
+int obj_dev(struct rdma *rd)
+{
+   const struct rdma_obj objs[] = {
+   { NULL, dev_show },
+   { "show",   dev_show },
+   { "list",   dev_show },
+   { "set",    dev_set },
+   { "help",   dev_help },
+   { 0 }
+   };
+
+   if (dev_map_init(rd)) {
+   pr_err("There are no RDMA devices\n");
+   return -ENOENT;
+   }
+
+   return rdma_exec_cmd(rd, objs, "dev command");
+}
diff --git a/rdma/rdma.c b/rdma/rdma.c
index bc7d1483..7c537c5e 100644
--- a/rdma/rdma.c

[RFC iproute2 3/8] rdma: Add link object

2017-05-04 Thread Leon Romanovsky
From: Leon Romanovsky 

Link object represents a port of struct ib_device.

Supported commands are show, set and help.

Print all links for all devices:
 # rdma link
1/1: mlx5_0/1: ifname ib0 cap_mask 0x2651e848 lid 0x13 lid_mask_count 0 link_layer InfiniBand
	phys_state 5: LinkUp rate 100 Gb/sec (4X EDR) sm_lid 0x2 sm_sl 0 state 4: ACTIVE
2/1: mlx5_1/1: ifname ib1 cap_mask 0x2651e848 lid 0x lid_mask_count 0 link_layer InfiniBand
	phys_state 3: Disabled rate 10 Gb/sec (4X) sm_lid 0x0 sm_sl 0 state 1: DOWN
3/1: mlx5_2/1: ifname ib2 cap_mask 0x26516848 lid 0x1a lid_mask_count 0 link_layer InfiniBand
	phys_state 5: LinkUp rate 56 Gb/sec (4X FDR) sm_lid 0x2 sm_sl 0 state 4: ACTIVE
3/2: mlx5_2/2: ifname ib3 cap_mask 0x26516848 lid 0x lid_mask_count 0 link_layer InfiniBand
	phys_state 3: Disabled rate 10 Gb/sec (4X) sm_lid 0x0 sm_sl 0 state 1: DOWN

Print all links for specific device:
 # rdma link show mlx5_2
3/1: mlx5_2/1: ifname ib2 cap_mask 0x26516848 lid 0x1a lid_mask_count 0 link_layer InfiniBand
	phys_state 5: LinkUp rate 56 Gb/sec (4X FDR) sm_lid 0x2 sm_sl 0 state 4: ACTIVE
3/2: mlx5_2/2: ifname ib3 cap_mask 0x26516848 lid 0x lid_mask_count 0 link_layer InfiniBand
	phys_state 3: Disabled rate 10 Gb/sec (4X) sm_lid 0x0 sm_sl 0 state 1: DOWN

Print specific link:
 # rdma link show mlx5_2/2
3/2: mlx5_2/2: ifname ib3 cap_mask 0x26516848 lid 0x lid_mask_count 0 link_layer InfiniBand
	phys_state 3: Disabled rate 10 Gb/sec (4X) sm_lid 0x0 sm_sl 0 state 1: DOWN

Set parameter:
 # rdma link set mlx5_2/2 type auto lb_unicast off

Signed-off-by: Leon Romanovsky 
---
 rdma/Makefile |   2 +-
 rdma/link.c   | 112 ++
 rdma/rdma.c   |   3 +-
 rdma/rdma.h   |   1 +
 4 files changed, 116 insertions(+), 2 deletions(-)
 create mode 100644 rdma/link.c

diff --git a/rdma/Makefile b/rdma/Makefile
index 67e349b0..cf54ed36 100644
--- a/rdma/Makefile
+++ b/rdma/Makefile
@@ -1,6 +1,6 @@
 include ../Config

-RDMA_OBJ = rdma.o utils.o dev.o
+RDMA_OBJ = rdma.o utils.o dev.o link.o
 TARGETS=rdma

 all:   $(TARGETS) $(LIBS)
diff --git a/rdma/link.c b/rdma/link.c
new file mode 100644
index ..e86ff399
--- /dev/null
+++ b/rdma/link.c
@@ -0,0 +1,112 @@
+/*
+ * link.c  RDMA tool
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ *
+ * Authors: Leon Romanovsky 
+ */
+
+#include "rdma.h"
+
+static int link_help(struct rdma *rd)
+{
+   pr_out("Usage: %s link show [ DEV | DEV/PORT ]\n", rd->filename);
+   pr_out("   %s link set DEV/PORT { type { eth | ib | auto } |\n", rd->filename);
+   pr_out("lb_unicast { on | off } |\n");
+   pr_out("lb_multicast { on | off } }\n");
+   return 0;
+}
+
+static void dev_one_show(const struct dev_map *dev_map, uint32_t port_idx_first, uint32_t port_idx_last)
+{
+   char *nodes[] = { "cap_mask",
+ "lid",
+ "lid_mask_count",
+ "link_layer",
+ "phys_state",
+ "rate",
+ "sm_lid",
+ "sm_sl",
+ "state",
+ NULL };
+
+   struct port_map *port_map;
+   char data[4096];
+   int i, j;
+
+   for(j = port_idx_first ; j <= port_idx_last; j++) {
+   pr_out("%u/%u: %s/%u:", dev_map->idx, j, dev_map->dev_name, j);
+   list_for_each_entry(port_map, &dev_map->port_map_list, list)
+   if (j == port_map->idx)
+  printf(" ifname %s", (port_map->ifname)?:"NONE");
+
+   for (i = 0 ; nodes[i] ; i++) {
+   if (rdma_sysfs_read_ib(dev_map->dev_name, j, nodes[i], data))
+   continue;
+
+   /* Split line before "phys_state" */
+   if (!strcmp(nodes[i], "phys_state"))
+   printf("\n\t");
+
+   pr_out(" %s %s", nodes[i], data);
+   }
+   pr_out("\n");
+   }
+}
+
+static int link_show(struct rdma *rd)
+{
+   struct dev_map *dev_map;
+
+   if (rd_no_arg(rd)) {
+   list_for_each_entry(dev_map, &rd->dev_map_list, list)
+   dev_one_show(dev_map, 1, dev_map->num_ports);
+   }
+   else {
+   uint32_t port_idx;
+   uint32_t num_ports;
+   dev_map = dev_map_lookup(rd, true);
+   port_idx = get_port_from_argv(rd);
+

[RFC iproute2 4/8] rdma: Add IPoIB object

2017-05-04 Thread Leon Romanovsky
From: Leon Romanovsky 

IPoIB object allows configuring and presenting information for the
IP-over-InfiniBand upper layer protocol.

Supported commands are show, set and help.

Signed-off-by: Leon Romanovsky 
---
 rdma/Makefile |  2 +-
 rdma/ipoib.c  | 54 ++
 rdma/rdma.c   |  3 ++-
 rdma/rdma.h   |  1 +
 4 files changed, 58 insertions(+), 2 deletions(-)
 create mode 100644 rdma/ipoib.c

diff --git a/rdma/Makefile b/rdma/Makefile
index cf54ed36..dd702b9f 100644
--- a/rdma/Makefile
+++ b/rdma/Makefile
@@ -1,6 +1,6 @@
 include ../Config

-RDMA_OBJ = rdma.o utils.o dev.o link.o
+RDMA_OBJ = rdma.o utils.o dev.o link.o ipoib.o
 TARGETS=rdma

 all:   $(TARGETS) $(LIBS)
diff --git a/rdma/ipoib.c b/rdma/ipoib.c
new file mode 100644
index ..dd0d0285
--- /dev/null
+++ b/rdma/ipoib.c
@@ -0,0 +1,54 @@
+/*
+ * ipoib.c RDMA tool
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ *
+ * Authors: Leon Romanovsky 
+ */
+
+#include "rdma.h"
+
+static int ipoib_help(struct rdma *rd)
+{
+   pr_out("Usage: %s ipoib show NAME | DEV | DEV/PORT\n", rd->filename);
+   pr_out("   %s ipoib set NAME [ accel { off | on } ]\n", rd->filename);
+   pr_out("   %s ipoib start NAME dev DEV\n", rd->filename);
+   pr_out("   %s ipoib stop NAME\n", rd->filename);
+   return 0;
+}
+
+static int ipoib_show(struct rdma *rd)
+{
+   if (rd_no_arg(rd))
+   ipoib_help(rd);
+
+   return 0;
+}
+
+static int ipoib_set(struct rdma *rd)
+{
+   /* Not supported yet */
+   return 0;
+}
+
+int obj_ipoib(struct rdma *rd)
+{
+   const struct rdma_obj objs[] = {
+   { NULL, ipoib_show },
+   { "show",   ipoib_show },
+   { "list",   ipoib_show },
+   { "set",    ipoib_set },
+   { "help",   ipoib_help },
+   { 0 }
+   };
+
+   if (dev_map_init(rd)) {
+   pr_err("There are no RDMA devices\n");
+   return -ENOENT;
+   }
+
+   return rdma_exec_cmd(rd, objs, "ipoib command");
+}
diff --git a/rdma/rdma.c b/rdma/rdma.c
index 55cbf0e3..ffd70899 100644
--- a/rdma/rdma.c
+++ b/rdma/rdma.c
@@ -17,7 +17,7 @@
 static void help(char *name)
 {
pr_out("Usage: %s [ OPTIONS ] OBJECT { COMMAND | help }\n"
-  "where  OBJECT := { dev | link }\n"
+  "where  OBJECT := { dev | link | ipoib }\n"
   "   OPTIONS := { -V[ersion] }\n", name);
 }

@@ -33,6 +33,7 @@ static int rd_cmd(struct rdma *rd)
{ NULL, obj_help },
{ "dev",obj_dev },
{ "link",   obj_link },
+   { "ipoib",  obj_ipoib },
{ "help",   obj_help },
{ 0 }
};
diff --git a/rdma/rdma.h b/rdma/rdma.h
index bdb77b5e..1fef4eb8 100644
--- a/rdma/rdma.h
+++ b/rdma/rdma.h
@@ -64,6 +64,7 @@ struct rdma_obj {
  */
 int obj_dev(struct rdma *rd);
 int obj_link(struct rdma *rd);
+int obj_ipoib(struct rdma *rd);

 /*
  * Parser interface
--
2.12.2



[RFC iproute2 0/8] RDMA tool

2017-05-04 Thread Leon Romanovsky
From: Leon Romanovsky 


This is the initial phase, to understand whether the user experience of this tool
fits the expectations of the RDMA and netdev communities. I would also like to get
feedback on whether it is really worth providing legacy sysfs support for old kernels,
or whether I should implement netlink from the beginning and abandon sysfs completely.
-

Hi,

Please find below the patch set with an initial implementation of a configuration
tool for the RDMA subsystem, which will supplement the tools that already exist
in the netdev community (ip, devlink, ethtool, ...).

Unlike the netdev community, where standard tools exist to configure and
present the abilities of different devices, the RDMA subsystem has historically lacked them.

Following our discussions on the mailing list [1] and at LPC 2016 [2],
we would like to propose this RDMA tool for inclusion in the iproute2 package,
to finally improve this situation.

The development of the tool was influenced by the ip and devlink tools. This applies
to the object->command interface and to the naming conventions.

In order to complete the object model, reuse existing code and make this
tool usable from day one, we decided to implement wrappers over the legacy sysfs
interface before implementing netlink functionality. As a nice bonus, this allows
the tool to be used with old kernels too.

It is important to mention that any future extension will have to be
done with netlink, so a small conversion to netlink will be unavoidable
for the already existing objects.

  # rdma -h
  Usage: rdma [ OPTIONS ] OBJECT { COMMAND | help }
  where  OBJECT := { dev | link | ipoib | memory | stats | protocols | providers | monitor }
  OPTIONS := { -V[ersion] }

* DEV object corresponds to a CA in the IBTA specification and will provide
  a way to configure/present settings relevant to a specific struct ib_device.

* LINK object represents a port in the IBTA specification and will give access to
  struct ib_port_immutable. From day one, it prints the netdev name of
  the corresponding IB port, which makes the ibdev2netdev script redundant.

* IPoIB object is intended specifically for the IP-over-InfiniBand upper
  layer protocol [3]. This ULP has mainly been configured through a combination
  of various sysfs knobs together with ethtool. Because this mixes different
  subsystems, it is challenging both to add new configuration settings and
  to expose old ones.

* MEMORY object will be used to configure memory-related settings,
  e.g. on-demand paging (ODP) and force-mr (force usage of MRs for
  RDMA READ/WRITE operations).

* STATS object is needed for everything related to statistics
  (per-PID, per-QP, per-device, etc.). Although RDMA devices provide
  an extensive set of counters, we decided to implement this object
  directly in netlink, because a filtering mechanism for these counters,
  which doesn't exist today, needs to be added.

* PROTOCOLS object is going to be used for device-specific handling
  of protocol-global settings (e.g. setting a device to RoCEv2 mode by
  default instead of RoCEv1, rather than doing it through configfs).

* PROVIDERS object gives the ability to get device-specific
  information, like supported kABI objects [4].

* MONITOR object is needed to debug netlink communication and will
  follow the standard functionality that exists in the ip and devlink tools.

There are a number of ULPs which are not covered by this tool yet:
 * HFI-VNIC - I have no access to the HW and believe that Intel
   will add native object support for it.
 * Storage-related ULPs (iSER and SRP) were not introduced either,
   because they have special tools (scsi-target-utils) to configure them.
   However, it will be pretty straightforward to introduce a new object
   if there is demand for it.

At the initial stage, we implemented the infrastructure to read legacy
sysfs entries (Patch #1) and initial man pages (Patch #7), and provided
examples of future objects (Patches #2-6) to allow parallel development.

Follow-up patches will focus on cleaning up the user interface, parsing other
relevant entries in a similar fashion to the link capability mask (Patch #8),
and providing a netlink interface.

These patches were tested with the following two setups:
 * Setup A:
   - Two Mellanox ConnectX-4 devices (one port)
   - One Mellanox Connect-IB device (two ports)
 * Setup B:
   - One Mellanox ConnectX-4 device (one port)
   - One Mellanox ConnectX-3 Pro device (two ports)

Please consider including the RDMA tool in the iproute2 package,
so that other participants will be able to speed up development.

[1] https://www.mail-archive.com/netdev@vger.kernel.org/msg148523.html
[2] http://www.medkio.com/talks/lpc_debug.pdf
[3] https://tools.ietf.org/html/rfc4392
[4] http://marc.info/?l=linux-rdma&m=149261526916544&w=2

TODO: Add json output

Cc: Stephen Hemminger 
Cc: Doug Ledford 
Cc: Jiri Pirko 
Cc: Ariel Almog 
Cc: Dennis Dalessandro 
Cc: Ram Amrani 
Cc: Bart Van Assche 

[RFC iproute2 1/8] rdma: Add basic infrastructure for RDMA tool

2017-05-04 Thread Leon Romanovsky
From: Leon Romanovsky 

On one side, RDMA devices are cross-functional devices, but on the
other, they are very tailored to specific markets.

Such diversity caused RDMA-related configuration to spread
across various tools, e.g. devlink, ip, ethtool, IB-specific and
vendor-specific solutions.

This patch adds the ability to fill in device and port information
by reading (legacy) sysfs entries.

All future configuration settings will be implemented in netlink format,
to align with the iproute2 package.

Signed-off-by: Leon Romanovsky 
---
 Makefile|   2 +-
 rdma/.gitignore |   1 +
 rdma/Makefile   |  15 +++
 rdma/rdma.c |  96 +
 rdma/rdma.h |  77 ++
 rdma/utils.c| 313 
 6 files changed, 503 insertions(+), 1 deletion(-)
 create mode 100644 rdma/.gitignore
 create mode 100644 rdma/Makefile
 create mode 100644 rdma/rdma.c
 create mode 100644 rdma/rdma.h
 create mode 100644 rdma/utils.c

diff --git a/Makefile b/Makefile
index 18de7dcb..c255063b 100644
--- a/Makefile
+++ b/Makefile
@@ -52,7 +52,7 @@ WFLAGS += -Wmissing-declarations -Wold-style-definition -Wformat=2
 CFLAGS := $(WFLAGS) $(CCOPTS) -I../include $(DEFINES) $(CFLAGS)
 YACCFLAGS = -d -t -v

-SUBDIRS=lib ip tc bridge misc netem genl tipc devlink man
+SUBDIRS=lib ip tc bridge misc netem genl tipc devlink rdma man

 LIBNETLINK=../lib/libnetlink.a ../lib/libutil.a
 LDLIBS += $(LIBNETLINK)
diff --git a/rdma/.gitignore b/rdma/.gitignore
new file mode 100644
index ..51fb172b
--- /dev/null
+++ b/rdma/.gitignore
@@ -0,0 +1 @@
+rdma
diff --git a/rdma/Makefile b/rdma/Makefile
new file mode 100644
index ..65248b31
--- /dev/null
+++ b/rdma/Makefile
@@ -0,0 +1,15 @@
+include ../Config
+
+RDMA_OBJ = rdma.o utils.o
+TARGETS=rdma
+
+all:   $(TARGETS) $(LIBS)
+
+rdma:  $(RDMA_OBJ)
+   $(QUIET_LINK)$(CC) $^ -o $@
+
+install: all
+   install -m 0755 $(TARGETS) $(DESTDIR)$(SBINDIR)
+
+clean:
+   rm -f $(RDMA_OBJ) $(TARGETS)
diff --git a/rdma/rdma.c b/rdma/rdma.c
new file mode 100644
index ..bc7d1483
--- /dev/null
+++ b/rdma/rdma.c
@@ -0,0 +1,96 @@
+/*
+ * rdma.c  RDMA tool
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ *
+ * Authors: Leon Romanovsky 
+ */
+
+#include 
+
+#include "rdma.h"
+#include "SNAPSHOT.h"
+
+static void help(char *name)
+{
+   pr_out("Usage: %s [ OPTIONS ] OBJECT { COMMAND | help }\n"
+  "where  OBJECT := { }\n"
+  "   OPTIONS := { -V[ersion] }\n", name);
+}
+
+static int obj_help(struct rdma *rd)
+{
+   help(rd->filename);
+   return 0;
+}
+
+static int rd_cmd(struct rdma *rd)
+{
+   const struct rdma_obj objs[] = {
+   { NULL, obj_help },
+   { "help",   obj_help },
+   { 0 }
+   };
+
+   return rdma_exec_cmd(rd, objs, "object");
+}
+
+static int rd_init(struct rdma *rd, int argc, char **argv, char *filename)
+{
+   rd->filename = filename;
+   rd->argc = argc;
+   rd->argv = argv;
+   INIT_LIST_HEAD(&rd->dev_map_list);
+   return 0;
+}
+static void rd_free(struct rdma *rd)
+{
+   dev_map_cleanup(rd);
+}
+int main(int argc, char **argv)
+{
+   char *filename;
+   static const struct option long_options[] = {
+   { "version", no_argument, NULL, 'V' },
+   { "help",    no_argument, NULL, 'h' },
+   { NULL, 0, NULL, 0 }
+   };
+   struct rdma rd;
+   int opt;
+   int err;
+
+   filename = basename(argv[0]);
+
+   while ((opt = getopt_long(argc, argv, "Vh",
+ long_options, NULL)) >= 0) {
+
+   switch (opt) {
+   case 'V':
+   printf("%s utility, iproute2-ss%s\n", filename, SNAPSHOT);
+   return EXIT_SUCCESS;
+   case 'h':
+   help(filename);
+   return EXIT_SUCCESS;
+   default:
+   pr_err("Unknown option.\n");
+   help(filename);
+   return EXIT_FAILURE;
+   }
+   }
+
+   argc -= optind;
+   argv += optind;
+
+   err = rd_init(&rd, argc, argv, filename);
+   if (err)
+   goto out;
+
+   err = rd_cmd(&rd);
+   /* Always cleanup */
+   rd_free(&rd);
+
+out:   return (err) ? EXIT_FAILURE:EXIT_SUCCESS;
+}
diff --git a/rdma/rdma.h b/rdma/rdma.h
new file mode 100644
index ..156bb74c
--- /dev/null
+++ b/rdma/rdma.h
@@ -0,0 +1,77 @@
+/*
+ * rdma.c  RDMA tool
+ *
+ *  This program 

Re: [RFC] iproute: Add support for extended ack to rtnl_talk

2017-05-04 Thread Leon Romanovsky
On Thu, May 04, 2017 at 09:45:58AM -0700, Stephen Hemminger wrote:
> On Thu, 4 May 2017 17:37:38 +0300
> Leon Romanovsky  wrote:
>
> > On Thu, May 04, 2017 at 11:36:36AM +0200, Daniel Borkmann wrote:
> > > On 05/04/2017 01:56 AM, Stephen Hemminger wrote:
> > > > Add support for extended ack error reporting via libmnl. It is
> > > > better to use the existing library than to copy/paste code from
> > > > the kernel. Also make arguments const where possible.
> > > >
> > > > Add a new function rtnl_talk_extack that takes a callback as an input
> > > > arg. If a netlink response contains extack attributes, the callback
> > > > is invoked with the err string, offset in the message and a pointer
> > > > to the message returned by the kernel.
> > > >
> > > > Adding a new function allows commands to be moved over to the
> > > > extended error reporting over time.
> > > >
> > > > For feedback, compile tested only.
> > >
> > > Just out of curiosity, what is the plan regarding converting iproute2
> > > over to libmnl (ip, tc, ss, ...)? In 2015, tipc tool was the first
> > > user merged that requires libmnl, the only other user today in the
> > > tree is devlink, which even seems to define its own libmnl library
> > > helpers. What is the clear benefit/rationale of outsourcing this to
> > > libmnl? I was always under the impression we should strive for as few
> > > dependencies as possible?
> >
> > And I would like to get direction for the RDMA tool [1] which I'm
> > working on now.
> >
> > The overall decision was to use netlink and put it under iproute2
> > umbrella. Currently, I have working RFC which is based on
> > legacy sysfs interface to ensure that we are converging on
> > user-experience even before moving to actual netlink defines.
> >
> > And I would like to continue working on the netlink interface, but which
> > lib interface should I base rdmatool's netlink code on?
> >
> > [1] https://www.mail-archive.com/netdev@vger.kernel.org/msg148523.html
> >
> > >
> > > I don't really like that we make extended ack reporting now dependent
> > > on libmnl, which further diverts from iproute's native nl library vs
> > > requiring to install another nl library, making the current status
> > > quo even worse ... :/
> > >
> > > Thanks,
> > > Daniel
>
> I would prefer new code use libmnl, but using libnetlink would also be ok.
> Any later conversion to libmnl would be mostly automated anyway.

Thanks, I'm copy/pasting devlink variation of libmnl :)

>
> The real objection was copy/pasting in the kernel netlink parser.
> That was unnecessary bloat.






Re: [net-ipv4] question about arguments position

2017-05-04 Thread Joe Perches
On Thu, 2017-05-04 at 12:46 -0400, David Miller wrote:
> From: "Gustavo A. R. Silva" 
> Date: Thu, 04 May 2017 11:07:54 -0500
> 
> > While looking into Coverity ID 1357474 I ran into the following piece
> > of code at net/ipv4/inet_diag.c:392:
> 
> Because it's been this way since at least 2005, it doesn't matter if
> the order is correct or not.  What's there is the locked in behavior
> exposed to userspace and changing it will break things for people.

Adding a few comments around the code about why
it is this way will help avoid future questions.


[Patch net v2] ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf

2017-05-04 Thread Cong Wang
For each netns (except init_net), we initialize its null entry
in 3 places:

1) The template itself, as we use kmemdup()
2) Code around dst_init_metrics() in ip6_route_net_init()
3) ip6_route_dev_notify(), which is supposed to initialize it after
   loopback registers

Unfortunately, the last one still happens in the wrong order: we expect
to initialize net->ipv6.ip6_null_entry->rt6i_idev to
net->loopback_dev's idev, so we have to do that after the idev has
been added to it. However, this notifier has priority 0, the same as
ipv6_dev_notf, and because ipv6_dev_notf is registered after
ip6_route_dev_notifier, it is actually called after
ip6_route_dev_notifier.

Fix it by picking a smaller priority for ip6_route_dev_notifier.
Also, we have to release the refcnt accordingly when unregistering
loopback_dev because device exit functions are called before subsys
exit functions.

Cc: David Ahern 
Signed-off-by: Cong Wang 
---
 include/net/addrconf.h |  2 ++
 net/ipv6/addrconf.c|  1 +
 net/ipv6/route.c   | 13 +++--
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 1aeb25d..6c0ee3c 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -20,6 +20,8 @@
 #define ADDRCONF_TIMER_FUZZ(HZ / 4)
 #define ADDRCONF_TIMER_FUZZ_MAX(HZ)
 
+#define ADDRCONF_NOTIFY_PRIORITY   0
+
 #include 
 #include 
 
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 77a4bd5..8d297a7 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3548,6 +3548,7 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
  */
 static struct notifier_block ipv6_dev_notf = {
.notifier_call = addrconf_notify,
+   .priority = ADDRCONF_NOTIFY_PRIORITY,
 };
 
 static void addrconf_type_change(struct net_device *dev, unsigned long event)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 2f11366..dc61b0b 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3709,7 +3709,10 @@ static int ip6_route_dev_notify(struct notifier_block *this,
struct net_device *dev = netdev_notifier_info_to_dev(ptr);
struct net *net = dev_net(dev);
 
-   if (event == NETDEV_REGISTER && (dev->flags & IFF_LOOPBACK)) {
+   if (!(dev->flags & IFF_LOOPBACK))
+   return NOTIFY_OK;
+
+   if (event == NETDEV_REGISTER) {
net->ipv6.ip6_null_entry->dst.dev = dev;
net->ipv6.ip6_null_entry->rt6i_idev = in6_dev_get(dev);
 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
@@ -3718,6 +3721,12 @@ static int ip6_route_dev_notify(struct notifier_block *this,
net->ipv6.ip6_blk_hole_entry->dst.dev = dev;
net->ipv6.ip6_blk_hole_entry->rt6i_idev = in6_dev_get(dev);
 #endif
+} else if (event == NETDEV_UNREGISTER) {
+   in6_dev_put(net->ipv6.ip6_null_entry->rt6i_idev);
+#ifdef CONFIG_IPV6_MULTIPLE_TABLES
+   in6_dev_put(net->ipv6.ip6_prohibit_entry->rt6i_idev);
+   in6_dev_put(net->ipv6.ip6_blk_hole_entry->rt6i_idev);
+#endif
}
 
return NOTIFY_OK;
@@ -4024,7 +4033,7 @@ static struct pernet_operations ip6_route_net_late_ops = {
 
 static struct notifier_block ip6_route_dev_notifier = {
.notifier_call = ip6_route_dev_notify,
-   .priority = 0,
+   .priority = ADDRCONF_NOTIFY_PRIORITY - 10,
 };
 
 void __init ip6_route_init_special_entries(void)
-- 
2.5.5



Re: [PATCH] cfg80211: make RATE_INFO_BW_20 the default

2017-05-04 Thread Linus Torvalds
On Thu, May 4, 2017 at 8:22 AM, David Miller  wrote:
> From: Johannes Berg 
>>
>> I figured I'd give Linus a chance to try or even apply it, but I
>> have no objection to you applying it either. I don't have anything else
>> yet right now, and sending a pull request for just a single patch
>> would be quite pointless.
>
> Ok, let's give Linus a chance to test the patch.

I'm having trouble recreating the warning. I have no idea why. It only
happened during ten minutes yesterday, and nothing in my wireless
setup has changed.

I wonder if *normally* my setup ends up connecting with a 40MHz band
or something, and I just happened to see the default uninitialized
case once.

I see that Jens reported that the patch works, although I'm wondering
how repeatable it was for him.  The patch obviously looks simple and
seems like an obviously GoodThing(tm) regardless.

   Linus


[PATCH] mac80211: Create ieee80211_if_process_skb from ieee80211_iface_work

2017-05-04 Thread Joe Perches
This function is pretty long and the skb handling is a bit long too.
Create a new function just for the skb processing.

This isolates the code and reduces indentation a bit too.

No change in object size.

$ size net/mac80211/iface.o*
   text    data     bss     dec     hex filename
  15736      24       0   15760    3d90 net/mac80211/iface.o.new
  15736      24       0   15760    3d90 net/mac80211/iface.o.old

Miscellanea:

o Use explicit casts to proper types instead of casts to (void *)
  and have the compiler do the implicit cast
o Rewrap comments

Signed-off-by: Joe Perches 
---
 net/mac80211/iface.c | 253 ++-
 1 file changed, 127 insertions(+), 126 deletions(-)

diff --git a/net/mac80211/iface.c b/net/mac80211/iface.c
index 3bd5b81f5d81..b51d3956feaa 100644
--- a/net/mac80211/iface.c
+++ b/net/mac80211/iface.c
@@ -1230,145 +1230,125 @@ static void ieee80211_if_setup_no_queue(struct net_device *dev)
dev->priv_flags |= IFF_NO_QUEUE;
 }
 
-static void ieee80211_iface_work(struct work_struct *work)
+static void ieee80211_if_process_skb(struct sk_buff *skb,
+struct ieee80211_sub_if_data *sdata,
+struct ieee80211_local *local)
 {
-   struct ieee80211_sub_if_data *sdata =
-   container_of(work, struct ieee80211_sub_if_data, work);
-   struct ieee80211_local *local = sdata->local;
-   struct sk_buff *skb;
-   struct sta_info *sta;
struct ieee80211_ra_tid *ra_tid;
struct ieee80211_rx_agg *rx_agg;
-
-   if (!ieee80211_sdata_running(sdata))
-   return;
-
-   if (test_bit(SCAN_SW_SCANNING, &local->scanning))
-   return;
-
-   if (!ieee80211_can_run_worker(local))
-   return;
-
-   /* first process frames */
-   while ((skb = skb_dequeue(&sdata->skb_queue))) {
-   struct ieee80211_mgmt *mgmt = (void *)skb->data;
-
-   if (skb->pkt_type == IEEE80211_SDATA_QUEUE_AGG_START) {
-   ra_tid = (void *)&skb->cb;
-   ieee80211_start_tx_ba_cb(&sdata->vif, ra_tid->ra,
-ra_tid->tid);
-   } else if (skb->pkt_type == IEEE80211_SDATA_QUEUE_AGG_STOP) {
-   ra_tid = (void *)&skb->cb;
-   ieee80211_stop_tx_ba_cb(&sdata->vif, ra_tid->ra,
-   ra_tid->tid);
-   } else if (skb->pkt_type == IEEE80211_SDATA_QUEUE_RX_AGG_START) {
-   rx_agg = (void *)&skb->cb;
-   mutex_lock(&local->sta_mtx);
-   sta = sta_info_get_bss(sdata, rx_agg->addr);
-   if (sta)
-   __ieee80211_start_rx_ba_session(sta,
-   0, 0, 0, 1, rx_agg->tid,
-   IEEE80211_MAX_AMPDU_BUF,
-   false, true);
-   mutex_unlock(&local->sta_mtx);
-   } else if (skb->pkt_type == IEEE80211_SDATA_QUEUE_RX_AGG_STOP) {
-   rx_agg = (void *)&skb->cb;
-   mutex_lock(&local->sta_mtx);
-   sta = sta_info_get_bss(sdata, rx_agg->addr);
-   if (sta)
-   __ieee80211_stop_rx_ba_session(sta,
-   rx_agg->tid,
-   WLAN_BACK_RECIPIENT, 0,
-   false);
-   mutex_unlock(&local->sta_mtx);
-   } else if (ieee80211_is_action(mgmt->frame_control) &&
-  mgmt->u.action.category == WLAN_CATEGORY_BACK) {
-   int len = skb->len;
-
-   mutex_lock(&local->sta_mtx);
-   sta = sta_info_get_bss(sdata, mgmt->sa);
-   if (sta) {
-   switch (mgmt->u.action.u.addba_req.action_code) {
-   case WLAN_ACTION_ADDBA_REQ:
-   ieee80211_process_addba_request(
-   local, sta, mgmt, len);
-   break;
-   case WLAN_ACTION_ADDBA_RESP:
-   ieee80211_process_addba_resp(local, sta,
-mgmt, len);
-   break;
-   case WLAN_ACTION_DELBA:
-   ieee80211_process_delba(sdata, sta,
+   struct sta_info *sta;
+   struct ieee80211_mgmt *mgmt = (void *)skb->data;
+
+   if (skb->pkt_type == IEEE80211_SDATA_QUEUE_AGG_START) {
+   ra_tid = (struct ieee80211_ra_tid *)&skb->cb;
+   

Re: [Patch net] ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf

2017-05-04 Thread Cong Wang
On Thu, May 4, 2017 at 7:04 AM, David Ahern  wrote:
> On 5/3/17 11:07 PM, Cong Wang wrote:
>> For each netns (except init_net), we initialize its null entry
>> in 3 places:
>>
>> 1) The template itself, as we use kmemdup()
>> 2) Code around dst_init_metrics() in ip6_route_net_init()
>> 3) ip6_route_dev_notify(), which is supposed to initialize it after
>> loopback registers
>>
>> Unfortunately the last one still happens in a wrong order because
>> we expect to initialize net->ipv6.ip6_null_entry->rt6i_idev to
>> net->loopback_dev's idev, so we have to do that after we add
>> idev to it. However, this notifier has priority == 0 same as
>> ipv6_dev_notf, and ipv6_dev_notf is registered after
>> ip6_route_dev_notifier so it is called actually after
>> ip6_route_dev_notifier.
>>
>> Fix it by specifying a smaller priority for ip6_route_dev_notifier.
>>
>> Signed-off-by: Cong Wang 
>> ---
>>  net/ipv6/route.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> index 2f11366..4dbf7e2 100644
>> --- a/net/ipv6/route.c
>> +++ b/net/ipv6/route.c
>> @@ -4024,7 +4024,7 @@ static struct pernet_operations ip6_route_net_late_ops = {
>>
>>  static struct notifier_block ip6_route_dev_notifier = {
>>   .notifier_call = ip6_route_dev_notify,
>> - .priority = 0,
>> + .priority = -10, /* Must be called after addrconf_notify!! */
>>  };
>>
>>  void __init ip6_route_init_special_entries(void)
>>
>
> And I see a refcnt problem with this change:
>
> root@kenny-jessie2:~# unshare -n
> root@kenny-jessie2:~# logout
> root@kenny-jessie2:~# unshare -n
>
> Message from syslogd@kenny-jessie2 at May  4 07:04:38 ...
>  kernel:[   62.581552] unregister_netdevice: waiting for lo to become free. Usage count = 1

Ah, looks like we need to put the refcnt for UNREGISTER too.

Will send v2 to include your ADDRCONF_NOTIFY_PRIORITY suggestion.


Re: [Patch net] ipv6: initialize route null entry in addrconf_init()

2017-05-04 Thread Cong Wang
On Thu, May 4, 2017 at 10:12 AM, David Ahern  wrote:
> On 5/4/17 10:51 AM, David Miller wrote:
>> From: Andrey Konovalov 
>> Date: Thu, 4 May 2017 14:28:37 +0200
>>
>>> On Thu, May 4, 2017 at 7:07 AM, Cong Wang  wrote:
>>>> Andrey reported a crash on init_net.ipv6.ip6_null_entry->rt6i_idev
>>>> since it is always NULL.
>>>>
>>>> This is clearly wrong, we have code to initialize it to loopback_dev,
>>>> unfortunately the order is still not correct.
>>>>
>>>> loopback_dev is registered very early during boot, we lose a chance
>>>> to re-initialize it in notifier. addrconf_init() is called after
>>>> ip6_route_init(), which means we have no chance to correct it.
>>>>
>>>> Fix it by moving this initialization explicitly after
>>>> ipv6_add_dev(init_net.loopback_dev) in addrconf_init().
>>>>
>>>> Reported-by: Andrey Konovalov 
>>>> Signed-off-by: Cong Wang 
>>>
>>> Hi Cong,
>>>
>>> This fixes the bug triggered by my reproducer.
>>>
>>> Thanks!
>>>
>>> Tested-by: Andrey Konovalov 
>>
>> Applied and queued up for -stable, thanks.
>>
>
> This is not the complete solution; it only fixes init_net. It still
> blows up when you do:
>
> unshare -n
> ./rt6_device_match
>
>
> same exact stack trace

This is why I sent
[Patch net] ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf


Re: [net-ipv4] question about arguments position

2017-05-04 Thread Gustavo A. R. Silva

Hi David,

Quoting David Miller :

> From: "Gustavo A. R. Silva" 
> Date: Thu, 04 May 2017 11:07:54 -0500
>
>> While looking into Coverity ID 1357474 I ran into the following piece
>> of code at net/ipv4/inet_diag.c:392:
>
> Because it's been this way since at least 2005, it doesn't matter if
> the order is correct or not.  What's there is the locked in behavior
> exposed to userspace and changing it will break things for people.

Oh, I see.

Thanks for clarifying.
--
Gustavo A. R. Silva


Re: [PATCH] cfg80211: make RATE_INFO_BW_20 the default

2017-05-04 Thread David Miller
From: Johannes Berg 
Date: Thu,  4 May 2017 08:42:30 +0200

> From: Johannes Berg 
> 
> Due to the way I did the RX bitrate conversions in mac80211 with
> spatch, going from setting flags to setting the value, many drivers now
> don't set the bandwidth value for 20 MHz, since with the flags it
> wasn't necessary to (there was no 20 MHz flag, only the others.)
> 
> Rather than go through and try to fix up all the drivers, instead
> renumber the enum so that 20 MHz, which is the typical bandwidth,
> actually has the value 0, making those drivers all work again.
> 
> If VHT was used with a driver not reporting it, e.g. iwlmvm,
> this manifested in hitting the bandwidth warning in
> cfg80211_calculate_bitrate_vht().
> 
> Reported-by: Linus Torvalds 
> Signed-off-by: Johannes Berg 

Since Jens Axboe had the same problem and tested this patch, I'm
tossing it into my tree.

Just FYI...


Re: [Patch net] ipv6: initialize route null entry in addrconf_init()

2017-05-04 Thread David Ahern
On 5/4/17 10:51 AM, David Miller wrote:
> From: Andrey Konovalov 
> Date: Thu, 4 May 2017 14:28:37 +0200
> 
>> On Thu, May 4, 2017 at 7:07 AM, Cong Wang  wrote:
>>> Andrey reported a crash on init_net.ipv6.ip6_null_entry->rt6i_idev
>>> since it is always NULL.
>>>
>>> This is clearly wrong, we have code to initialize it to loopback_dev,
>>> unfortunately the order is still not correct.
>>>
>>> loopback_dev is registered very early during boot, we lose a chance
>>> to re-initialize it in notifier. addrconf_init() is called after
>>> ip6_route_init(), which means we have no chance to correct it.
>>>
>>> Fix it by moving this initialization explicitly after
>>> ipv6_add_dev(init_net.loopback_dev) in addrconf_init().
>>>
>>> Reported-by: Andrey Konovalov 
>>> Signed-off-by: Cong Wang 
>>
>> Hi Cong,
>>
>> This fixes the bug triggered by my reproducer.
>>
>> Thanks!
>>
>> Tested-by: Andrey Konovalov 
> 
> Applied and queued up for -stable, thanks.
> 

This is not the complete solution; it only fixes init_net. It still
blows up when you do:

unshare -n
./rt6_device_match


same exact stack trace


Re: [PATCH] aquantia: Fix "ethtool -S" crash when adapter down.

2017-05-04 Thread Pavel Belous



On 04.05.2017 19:51, David Miller wrote:
> From: Lino Sanfilippo 
> Date: Thu, 4 May 2017 18:48:12 +0200
>
>> Hi Pavel,
>>
>> On 04.05.2017 18:33, Pavel Belous wrote:
>>> From: Pavel Belous 
>>>
>>> This patch fixes the crash that happens when driver tries to collect statistics
>>> from already released "aq_vec" object.
>>>
>>> Fixes: 97bde5c4f909 ("net: ethernet: aquantia: Support for NIC-specific code")
>>> Signed-off-by: Pavel Belous 
>>> ---
>>>  drivers/net/ethernet/aquantia/atlantic/aq_nic.c | 3 ++-
>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
>>> index cdb0299..3a32573 100644
>>> --- a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
>>> +++ b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
>>> @@ -755,7 +755,7 @@ void aq_nic_get_stats(struct aq_nic_s *self, u64 *data)
>>>  	count = 0U;
>>>  
>>>  	for (i = 0U, aq_vec = self->aq_vec[0];
>>> -		self->aq_vecs > i; ++i, aq_vec = self->aq_vec[i]) {
>>> +		aq_vec && self->aq_vecs > i; ++i, aq_vec = self->aq_vec[i]) {
>>>  		data += count;
>>>  		aq_vec_get_sw_stats(aq_vec, data, &count);
>>>  	}
>>> @@ -961,6 +961,7 @@ void aq_nic_free_hot_resources(struct aq_nic_s *self)
>>>  	for (i = AQ_DIMOF(self->aq_vec); i--;) {
>>>  		if (self->aq_vec[i])
>>>  			aq_vec_free(self->aq_vec[i]);
>>> +			self->aq_vec[i] = NULL;
>>>  	}
>>>  
>>>  err_exit:;
>>
>> If the driver does not support statistics when the interface is down, wouldn't it be
>> clearer to check netif_running() in get_stats() instead?
>
> Yes, much cleaner.
>
> Much better would be to have a cached software copy so that statistics
> can be reported regardless of whether the device is down or not.



Thank you.
I will think about how to do it better.

Regards,
Pavel


Re: struct ip vs struct iphdr

2017-05-04 Thread Girish Moodalbail

On 5/4/17 9:42 AM, Oleg wrote:
>   Hi, all.
>
> It seems struct ip and struct iphdr are similar: struct ip, despite its
> name, doesn't contain anything but the IP header.
>
> So, my noob question, what is the difference between them?

Also, see this:

http://stackoverflow.com/questions/42840636/difference-between-struct-ip-and-struct-iphdr

> Thanks.


Re: [PATCH] aquantia: Fix "ethtool -S" crash when adapter down.

2017-05-04 Thread Pavel Belous



On 04.05.2017 20:00, David Arcari wrote:
> Hi Pavel,
>
> On 05/04/2017 12:33 PM, Pavel Belous wrote:
>> From: Pavel Belous 
>>
>> This patch fixes the crash that happens when driver tries to collect statistics
>> from already released "aq_vec" object.
>>
>> Fixes: 97bde5c4f909 ("net: ethernet: aquantia: Support for NIC-specific code")
>> Signed-off-by: Pavel Belous 
>> ---
>>  drivers/net/ethernet/aquantia/atlantic/aq_nic.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
>> index cdb0299..3a32573 100644
>> --- a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
>> +++ b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
>> @@ -755,7 +755,7 @@ void aq_nic_get_stats(struct aq_nic_s *self, u64 *data)
>>  	count = 0U;
>>  
>>  	for (i = 0U, aq_vec = self->aq_vec[0];
>> -		self->aq_vecs > i; ++i, aq_vec = self->aq_vec[i]) {
>> +		aq_vec && self->aq_vecs > i; ++i, aq_vec = self->aq_vec[i]) {
>>  		data += count;
>>  		aq_vec_get_sw_stats(aq_vec, data, &count);
>>  	}
>> @@ -961,6 +961,7 @@ void aq_nic_free_hot_resources(struct aq_nic_s *self)
>>  	for (i = AQ_DIMOF(self->aq_vec); i--;) {
>>  		if (self->aq_vec[i])
>>  			aq_vec_free(self->aq_vec[i]);
>> +			self->aq_vec[i] = NULL;
>
> I think you intended to add { } to the if statement.  The code compiles as
> is, but the indentation is not correct.
>
> -DA

Oh. Sorry about that. I did not see the loss of braces during merge.
I will prepare another patch with Lino's and David M.'s comments.

Regards,
Pavel.

>>  	}
>>  
>>  err_exit:;


Why do we need MSG_SENDPAGE_NOTLAST?

2017-05-04 Thread Ilya Lesokhin
I don't understand the need for MSG_SENDPAGE_NOTLAST and I'm hoping someone can
enlighten me.

According to commit 35f9c09 ('tcp: tcp_sendpages() should call tcp_push() once'):
"We need to call tcp_flush() at the end of the last page processed in
tcp_sendpages(), or else transmits can be deferred and future sends
stall."

I don't understand why we need to differentiate between the user setting
MSG_MORE and splice indicating that more data is going to be sent.
If the user passed MSG_MORE and didn't push any extra data, isn't it the
user's fault? Do we need it because poorly written applications broke when
MSG_MORE was added to tcp_sendpage()? Or is there a deeper reason?

The reason I'm asking is that we are working on a kernel TLS implementation,
and I would like to know whether we can coalesce multiple tls_sendpage() calls
with MSG_MORE into a single TLS record, or whether we must push out the record
as soon as MSG_SENDPAGE_NOTLAST is cleared.

Thanks,
Ilya



Re: [PATCH] aquantia: Fix "ethtool -S" crash when adapter down.

2017-05-04 Thread David Arcari
Hi Pavel,

On 05/04/2017 12:33 PM, Pavel Belous wrote:
> From: Pavel Belous 
> 
> This patch fixes the crash that happens when driver tries to collect statistics
> from already released "aq_vec" object.
> 
> Fixes: 97bde5c4f909 ("net: ethernet: aquantia: Support for NIC-specific code")
> Signed-off-by: Pavel Belous 
> ---
>  drivers/net/ethernet/aquantia/atlantic/aq_nic.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
> index cdb0299..3a32573 100644
> --- a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
> +++ b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
> @@ -755,7 +755,7 @@ void aq_nic_get_stats(struct aq_nic_s *self, u64 *data)
>  	count = 0U;
>  
>  	for (i = 0U, aq_vec = self->aq_vec[0];
> -		self->aq_vecs > i; ++i, aq_vec = self->aq_vec[i]) {
> +		aq_vec && self->aq_vecs > i; ++i, aq_vec = self->aq_vec[i]) {
>  		data += count;
>  		aq_vec_get_sw_stats(aq_vec, data, &count);
>  	}
> @@ -961,6 +961,7 @@ void aq_nic_free_hot_resources(struct aq_nic_s *self)
>  	for (i = AQ_DIMOF(self->aq_vec); i--;) {
>  		if (self->aq_vec[i])
>  			aq_vec_free(self->aq_vec[i]);
> +			self->aq_vec[i] = NULL;

I think you intended to add { } to the if statement.  The code compiles as
is, but the indentation is not correct.

-DA

>  	}
>  
>  err_exit:;
> 



Re: struct ip vs struct iphdr

2017-05-04 Thread Sowmini Varadhan
On (05/04/17 19:42), Oleg wrote:
> 
>   Hi, all.
> 
> It seems struct ip and struct iphdr are similar: struct ip, despite its
> name, doesn't contain anything but the IP header.
> 
> So, my noob question, what is the difference between them?
> 
> Thanks.

BSD vs Linux?

struct ip is a BSD-ism, intended to be used if you were porting
some BSD app.

--Sowmini


Re: [Patch net] ipv6: initialize route null entry in addrconf_init()

2017-05-04 Thread David Miller
From: Andrey Konovalov 
Date: Thu, 4 May 2017 14:28:37 +0200

> On Thu, May 4, 2017 at 7:07 AM, Cong Wang  wrote:
>> Andrey reported a crash on init_net.ipv6.ip6_null_entry->rt6i_idev
>> since it is always NULL.
>>
>> This is clearly wrong, we have code to initialize it to loopback_dev,
>> unfortunately the order is still not correct.
>>
>> loopback_dev is registered very early during boot, we lose a chance
>> to re-initialize it in notifier. addrconf_init() is called after
>> ip6_route_init(), which means we have no chance to correct it.
>>
>> Fix it by moving this initialization explicitly after
>> ipv6_add_dev(init_net.loopback_dev) in addrconf_init().
>>
>> Reported-by: Andrey Konovalov 
>> Signed-off-by: Cong Wang 
> 
> Hi Cong,
> 
> This fixes the bug triggered by my reproducer.
> 
> Thanks!
> 
> Tested-by: Andrey Konovalov 

Applied and queued up for -stable, thanks.


Re: [PATCH] aquantia: Fix "ethtool -S" crash when adapter down.

2017-05-04 Thread David Miller
From: Lino Sanfilippo 
Date: Thu, 4 May 2017 18:48:12 +0200

> Hi Pavel,
> 
> On 04.05.2017 18:33, Pavel Belous wrote:
>> From: Pavel Belous 
>> 
>> This patch fixes the crash that happens when driver tries to collect statistics
>> from already released "aq_vec" object.
>> 
>> Fixes: 97bde5c4f909 ("net: ethernet: aquantia: Support for NIC-specific code")
>> Signed-off-by: Pavel Belous 
>> ---
>>  drivers/net/ethernet/aquantia/atlantic/aq_nic.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
>> index cdb0299..3a32573 100644
>> --- a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
>> +++ b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
>> @@ -755,7 +755,7 @@ void aq_nic_get_stats(struct aq_nic_s *self, u64 *data)
>>  	count = 0U;
>>  
>>  	for (i = 0U, aq_vec = self->aq_vec[0];
>> -		self->aq_vecs > i; ++i, aq_vec = self->aq_vec[i]) {
>> +		aq_vec && self->aq_vecs > i; ++i, aq_vec = self->aq_vec[i]) {
>>  		data += count;
>>  		aq_vec_get_sw_stats(aq_vec, data, &count);
>>  	}
>> @@ -961,6 +961,7 @@ void aq_nic_free_hot_resources(struct aq_nic_s *self)
>>  	for (i = AQ_DIMOF(self->aq_vec); i--;) {
>>  		if (self->aq_vec[i])
>>  			aq_vec_free(self->aq_vec[i]);
>> +			self->aq_vec[i] = NULL;
>>  	}
>>  
>>  err_exit:;
>> 
> 
> If the driver does not support statistics when the interface is down, wouldn't it be
> clearer to check netif_running() in get_stats() instead?

Yes, much cleaner.

Much better would be to have a cached software copy so that statistics
can be reported regardless of whether the device is down or not.


struct ip vs struct iphdr

2017-05-04 Thread Oleg
  Hi, all.

It seems struct ip and struct iphdr are similar: struct ip, despite its
name, doesn't contain anything but the IP header.

So, my noob question, what is the difference between them?

Thanks.

-- 
Олег Неманов (Oleg Nemanov)


Re: [PATCH] aquantia: Fix "ethtool -S" crash when adapter down.

2017-05-04 Thread Lino Sanfilippo
Hi Pavel,

On 04.05.2017 18:33, Pavel Belous wrote:
> From: Pavel Belous 
> 
> This patch fixes the crash that happens when driver tries to collect statistics
> from already released "aq_vec" object.
> 
> Fixes: 97bde5c4f909 ("net: ethernet: aquantia: Support for NIC-specific code")
> Signed-off-by: Pavel Belous 
> ---
>  drivers/net/ethernet/aquantia/atlantic/aq_nic.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
> index cdb0299..3a32573 100644
> --- a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
> +++ b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
> @@ -755,7 +755,7 @@ void aq_nic_get_stats(struct aq_nic_s *self, u64 *data)
>  	count = 0U;
>  
>  	for (i = 0U, aq_vec = self->aq_vec[0];
> -		self->aq_vecs > i; ++i, aq_vec = self->aq_vec[i]) {
> +		aq_vec && self->aq_vecs > i; ++i, aq_vec = self->aq_vec[i]) {
>  		data += count;
>  		aq_vec_get_sw_stats(aq_vec, data, &count);
>  	}
> @@ -961,6 +961,7 @@ void aq_nic_free_hot_resources(struct aq_nic_s *self)
>  	for (i = AQ_DIMOF(self->aq_vec); i--;) {
>  		if (self->aq_vec[i])
>  			aq_vec_free(self->aq_vec[i]);
> +			self->aq_vec[i] = NULL;
>  	}
>  
>  err_exit:;
> 

If the driver does not support statistics when the interface is down, wouldn't it be
clearer to check netif_running() in get_stats() instead?

Regards,
Lino



Fw: [Bug 195217] siocsifflags - irda doesn't work (MCS7780)

2017-05-04 Thread Stephen Hemminger
Apparently IRDA is broken by VMAP_STACK

Begin forwarded message:

Date: Thu, 04 May 2017 12:16:15 +
From: bugzilla-dae...@bugzilla.kernel.org
To: step...@networkplumber.org
Subject: [Bug 195217] siocsifflags - irda doesn't work (MCS7780)


https://bugzilla.kernel.org/show_bug.cgi?id=195217

Jerome (jb_69...@yahoo.com) changed:

   What|Removed |Added

  Component|Other   |USB
 Kernel Version|4.9.X, 4.10.X, 4.11-rcX |4.9.0 --> 4.11.0
Product|Networking  |Drivers
   Severity|blocking|high

--- Comment #2 from Jerome (jb_69...@yahoo.com) ---
Hi,

If I compile the kernel by setting CONFIG_VMAP_STACK = N, "ifconfig irda0 up"
is working correctly. (Tested with kernel 4.11.0)

I suppose the problem comes either from the MCS7780 driver (does it use the
stack in a way that breaks the rules?) or from VMAP_STACK.

Can you help me better understand the problem?

Regards,
Jerome.

-- 
You are receiving this mail because:
You are the assignee for the bug.


Re: [net-ipv4] question about arguments position

2017-05-04 Thread David Miller
From: "Gustavo A. R. Silva" 
Date: Thu, 04 May 2017 11:07:54 -0500

> While looking into Coverity ID 1357474 I ran into the following piece
> of code at net/ipv4/inet_diag.c:392:

Because it's been this way since at least 2005, it doesn't matter if
the order is correct or not.  What's there is the locked in behavior
exposed to userspace and changing it will break things for people.


Re: [RFC] iproute: Add support for extended ack to rtnl_talk

2017-05-04 Thread Stephen Hemminger
On Thu, 4 May 2017 17:37:38 +0300
Leon Romanovsky  wrote:

> On Thu, May 04, 2017 at 11:36:36AM +0200, Daniel Borkmann wrote:
> > On 05/04/2017 01:56 AM, Stephen Hemminger wrote:  
> > > Add support for extended ack error reporting via libmnl. Using an
> > > existing library is a better alternative than copy/pasting code
> > > from the kernel. Also make arguments const where possible.
> > >
> > > Add a new function rtnl_talk_extack that takes a callback as an input
> > > arg. If a netlink response contains extack attributes, the callback
> > > is invoked with the err string, offset in the message and a pointer
> > > to the message returned by the kernel.
> > >
> > > Adding a new function allows commands to be moved over to the
> > > extended error reporting over time.
> > >
> > > For feedback, compile tested only.  
> >
> > Just out of curiosity, what is the plan regarding converting iproute2
> > over to libmnl (ip, tc, ss, ...)? In 2015, tipc tool was the first
> > user merged that requires libmnl, the only other user today in the
> > tree is devlink, which even seems to define its own libmnl library
> > helpers. What is the clear benefit/rationale of outsourcing this to
> > libmnl? I was always under the impression we should strive for as few
> > dependencies as possible?  
> 
> And I would like to get direction for the RDMA tool [1] which I'm
> working on now.
> 
> The overall decision was to use netlink and put it under the iproute2
> umbrella. Currently, I have a working RFC which is based on the
> legacy sysfs interface to ensure that we are converging on
> user experience even before moving to actual netlink defines.
> 
> And I would like to continue working on the netlink interface, but which
> library interface should I base rdmatool's netlink code on?
> 
> [1] https://www.mail-archive.com/netdev@vger.kernel.org/msg148523.html
> 
> >
> > I don't really like that we make extended ack reporting now dependent
> > on libmnl, which further diverts from iproute's native nl library vs
> > requiring to install another nl library, making the current status
> > quo even worse ... :/
> >
> > Thanks,
> > Daniel  

I would prefer new code use libmnl, but using libnetlink would also be ok.
Any later conversion to libmnl would be mostly automated anyway.

The real objection was copy/pasting in the kernel netlink parser.
That was unnecessary bloat.




Re: [PATCH net-next 9/9] ipvlan: introduce individual MAC addresses

2017-05-04 Thread Jiri Benc
On Thu, 4 May 2017 09:37:00 +, Chiappero, Marco wrote:
> This looks conceptually wrong. Yes, ipvlan works at L3 (which is an
> implementation detail anyway), but slaves are Ethernet interfaces and
> should behave as much as possible as such regardless, with an
> individual MAC address assigned.

Isn't the proper fix then converting ipvlan interfaces to be L3 only
interfaces? I.e., ARPHRD_NONE? There's not much ipvlan can do with
arbitrary Ethernet frames anyway. Of course, a flag to switch to the
new behavior would be needed in order to preserve backwards
compatibility.

This patchset looks very wrong. For proper support of multiple MAC
addresses, we have macvlan and it's pointless to add that to ipvlan.
And doing some kind of weird MAC NAT in ipvlan just to satisfy broken
tools that can't cope with multiple interfaces with the same MAC address
is wrong, too. Those tools are already broken anyway; there's nothing
preventing anyone from setting the same MAC address on multiple interfaces.
I suppose those tools don't work with bonding and bridge, either?

> So, either we fix this by forcing slaves to stay in sync with master,

Yes, that's the correct behavior. Well, at least as correct as one can
get with ipvlan's broken design of pretending that an interface is L2
when in fact, it is not.

 Jiri


Re: [RFC] iproute: Add support for extended ack to rtnl_talk

2017-05-04 Thread Stephen Hemminger
On Thu, 04 May 2017 10:41:03 -0400 (EDT)
David Miller  wrote:

> From: David Ahern 
> Date: Thu, 4 May 2017 08:27:35 -0600
> 
> > On 5/4/17 3:36 AM, Daniel Borkmann wrote:  
> >> What is the clear benefit/rationale of outsourcing this to
> >> libmnl? I always was the impression we should strive for as little
> >> dependencies as possible?  
> > 
> > +1  
> 
> Agreed, all else being equal iproute2 should be as self contained
> as possible since it is such a fundamental tool.

Sorry, the old netlink code is more difficult to understand than libmnl.
Having dependency on a library is not a problem. There already is
an alternative implementation of ip commands in busybox for those
people trying to work in small environments.


Re: [RFC] iproute: Add support for extended ack to rtnl_talk

2017-05-04 Thread Stephen Hemminger
On Thu, 04 May 2017 11:36:36 +0200
Daniel Borkmann  wrote:

> On 05/04/2017 01:56 AM, Stephen Hemminger wrote:
> > Add support for extended ack error reporting via libmnl. Using an
> > existing library is a better alternative than copy/pasting code
> > from the kernel. Also make arguments const where possible.
> >
> > Add a new function rtnl_talk_extack that takes a callback as an input
> > arg. If a netlink response contains extack attributes, the callback
> > is invoked with the err string, offset in the message and a pointer
> > to the message returned by the kernel.
> >
> > Adding a new function allows commands to be moved over to the
> > extended error reporting over time.
> >
> > For feedback, compile tested only.  
> 
> Just out of curiosity, what is the plan regarding converting iproute2
> over to libmnl (ip, tc, ss, ...)? In 2015, tipc tool was the first
> user merged that requires libmnl, the only other user today in the
> tree is devlink, which even seems to define its own libmnl library
> helpers. What is the clear benefit/rationale of outsourcing this to
> libmnl? I was always under the impression we should strive for as few
> dependencies as possible?
> 
> I don't really like that we make extended ack reporting now dependent
> on libmnl, which further diverts from iproute's native nl library vs
> requiring to install another nl library, making the current status
> quo even worse ... :/
> 
> Thanks,
> Daniel

No rush for migration, just slow migration as time permits.
This would be a good kernel-janitor-type project.


Re: [PATCH net v2 0/3] qed*: Bug fix series.

2017-05-04 Thread David Miller
From: Sudarsana Reddy Kalluru 
Date: Thu, 4 May 2017 08:15:02 -0700

> From: Sudarsana Reddy Kalluru 
> 
> The series contains minor bug fixes for qed/qede drivers.
> 
> Please consider applying it to 'net' branch.

Series applied, thanks.


Re: [PATCH] iproute2: hide devices starting with period by default

2017-05-04 Thread David Ahern
On 5/4/17 9:15 AM, Nicolas Dichtel wrote:
> Le 24/02/2017 à 16:52, David Ahern a écrit :
>> On 2/23/17 8:12 PM, David Miller wrote:
>>> This really need to be a fundamental facility, so that it transparently
>>> works for NetworkManager, router daemons, everything.  Not just iproute2
>>> and "ls".
>>
>> I'll rebase my patch and send out as RFC.
>>
> David, did you finally send those patches?
> 

No, for a few reasons.

It is easy to hide devices in a dump:

https://github.com/dsahern/linux/commit/48a80a00eac284e58bae04af10a5a932dd7aee00


But I think those devices should also not exist in sysfs or procfs which
overlaps what I would like to see for lightweight netdevices:

https://github.com/dsahern/linux/commit/70574be699cf252e77f71e3df11192438689f976


And to be complete, hidden devices should not be allowed to have a
network address or transmit packets, which matches Florian's L2-only
intent:
https://www.spinics.net/lists/netdev/msg340808.html



[PATCH] aquantia: Fix "ethtool -S" crash when adapter down.

2017-05-04 Thread Pavel Belous
From: Pavel Belous 

This patch fixes the crash that happens when driver tries to collect statistics
from already released "aq_vec" object.

Fixes: 97bde5c4f909 ("net: ethernet: aquantia: Support for NIC-specific code")
Signed-off-by: Pavel Belous 
---
 drivers/net/ethernet/aquantia/atlantic/aq_nic.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
index cdb0299..3a32573 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
@@ -755,7 +755,7 @@ void aq_nic_get_stats(struct aq_nic_s *self, u64 *data)
 	count = 0U;
 
 	for (i = 0U, aq_vec = self->aq_vec[0];
-		self->aq_vecs > i; ++i, aq_vec = self->aq_vec[i]) {
+		aq_vec && self->aq_vecs > i; ++i, aq_vec = self->aq_vec[i]) {
 		data += count;
 		aq_vec_get_sw_stats(aq_vec, data, &count);
 	}
@@ -961,6 +961,7 @@ void aq_nic_free_hot_resources(struct aq_nic_s *self)
 	for (i = AQ_DIMOF(self->aq_vec); i--;) {
 		if (self->aq_vec[i])
 			aq_vec_free(self->aq_vec[i]);
+			self->aq_vec[i] = NULL;
 	}
 
 err_exit:;
-- 
2.7.4


