date:20161111

Re: [PATCH iproute2 0/2] tc: flower: Support matching on SCTP ports

2016-11-11 Thread Stephen Hemminger

On Thu,  3 Nov 2016 13:26:39 +0100
Simon Horman  wrote:

> Hi,
> 
> this short series adds support for matching on SCTP ports in the same way
> that matching on TCP and UDP ports is already supported. It corresponds to
> a net-next patch to add the same support to the kernel.
> 
> Example usage:
> 
> tc qdisc add dev eth0 ingress
> 
> tc filter add dev eth0 protocol ip parent : \
> flower indev eth0 ip_proto sctp dst_port 80 \
> action drop
> 
> 
> Simon Horman (2):
>   tc: update headers for TCA_FLOWER_KEY_SCTP_*
>   tc: flower: Support matching on SCTP ports
> 
>  include/linux/pkt_cls.h |  5 
>  tc/f_flower.c   | 65 
> +++--
>  2 files changed, 36 insertions(+), 34 deletions(-)
> 

Applied, thanks.

Re: [PATCH iproute2] tc: flower: Fix usage message

2016-11-11 Thread Stephen Hemminger

On Wed,  2 Nov 2016 17:09:58 +0200
Paul Blakey  wrote:

> Remove left over usage from removal of eth_type argument.
> 
> Fixes: 488b41d020fb ('tc: flower no need to specify the ethertype')
> Signed-off-by: Paul Blakey 
> ---

Applied, thanks.
Then I changed the usage message to quiet the checkpatch long-line nags.

Re: [PATCH v4] iproute2: macvlan: add "source" mode

2016-11-11 Thread Stephen Hemminger

On Thu, 27 Oct 2016 12:44:36 +0200
Michael Braun  wrote:

> Adjusting iproute2 utility to support new macvlan link type mode called
> "source".
> 
> Example of commands that can be applied:
>   ip link add link eth0 name macvlan0 type macvlan mode source
>   ip link set link dev macvlan0 type macvlan macaddr add 00:11:11:11:11:11
>   ip link set link dev macvlan0 type macvlan macaddr del 00:11:11:11:11:11
>   ip link set link dev macvlan0 type macvlan macaddr flush
>   ip -details link show dev macvlan0
> 
> Based on previous work of Stefan Gula 
> 
> Signed-off-by: Michael Braun 
> 
> Cc: ste...@gmail.com
> 
> v4:
>  - add MACADDR_SET support
>  - skip FLAG_UNICAST / FLAG_UNICAST_ALL as this is not upstream
>  - fix man page

The patch looks good, but needs to be cleaned up.

Does not apply to current iproute2 git, and also has minor checkpatch issue.

--- ip/iplink_macvlan.c
+++ ip/iplink_macvlan.c
@@ -46,7 +51,14 @@ static void explain(struct link_util *lu)
 
 static int mode_arg(const char *arg)
 {
-fprintf(stderr, "Error: argument of \"mode\" must be \"private\", \"vepa\", \"bridge\" or \"passthru\", not \"%s\"\n",
+   fprintf(stderr, "Error: argument of \"mode\" must be \"private\", \"vepa\", \"bridge\", \"passthru\" or \"source\", not \"%s\"\n",
+   arg);
+   return -1;
+}
+
+static int flag_arg(const char *arg)
+{
+   fprintf(stderr, "Error: argument of \"flag\" must be \"nopromisc\" or \"null\", not \"%s\"\n",
arg);
return -1;
 }


WARNING: unnecessary whitespace before a quoted newline
#59: FILE: ip/iplink_macvlan.c:35:
+   "MODE_FLAG: null | nopromisc \n"

Re: [PATCH] iproute2: ss: escape all null bytes in abstract unix domain socket

2016-11-11 Thread Stephen Hemminger

On Sat, 29 Oct 2016 22:20:19 +0300
Isaac Boukris  wrote:

> Abstract unix domain socket may embed null characters,
> these should be translated to '@' when printed by ss the
> same way the null prefix is currently being translated.
> 
> Signed-off-by: Isaac Boukris 

Applied
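
For readers of the archive, the translation described in the commit message can be sketched in user space as follows. This is a minimal illustration under stated assumptions, not the code from the applied patch: the helper name escape_abstract_name() and its signature are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from ss: copy an abstract unix socket
 * name of `len` bytes into `out`, replacing every null byte (leading or
 * embedded) with '@', then NUL-terminate the result.
 * `out` must have room for len + 1 bytes.
 */
static void escape_abstract_name(const char *name, size_t len, char *out)
{
	size_t i;

	for (i = 0; i < len; i++)
		out[i] = name[i] == '\0' ? '@' : name[i];
	out[len] = '\0';
}
```

The length has to be passed explicitly because the name may contain null bytes anywhere, so strlen() on the raw name would stop too early.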

Re: [PATCH iproute2] ip: update link types to show 6lowpan and ieee802.15.4 monitor

2016-11-11 Thread Stephen Hemminger

On Fri, 28 Oct 2016 11:42:03 +0200
Stefan Schmidt  wrote:

> Both types have been missing here and thus ip always showed
> only the numbers.
> 
> Based on a suggestion from Alexander Aring.
> 
> Signed-off-by: Stefan Schmidt 

Applied

Re: [PATCH net-next] bpf: fix range arithmetic for bpf map access

2016-11-11 Thread Alexei Starovoitov

On Fri, Nov 11, 2016 at 04:47:39PM -0500, Josef Bacik wrote:
> I made some invalid assumptions with BPF_AND and BPF_MOD that could result in
> invalid accesses to bpf map entries.  Fix this up by doing a few things
> 
> 1) Kill BPF_MOD support.  This doesn't actually get used by the compiler in real
> life and just adds extra complexity.
> 
> 2) Fix the logic for BPF_AND, don't allow AND of negative numbers and set the
> minimum value to 0 for positive AND's.
> 
> 3) Don't do operations on the ranges if they are set to the limits, as they are
> by definition undefined, and allowing arithmetic operations on those values
> could make them appear valid when they really aren't.
> 
> This fixes the testcase provided by Jann as well as a few other theoretical
> problems.
> 
> Reported-by: Jann Horn 
> Signed-off-by: Josef Bacik 
> ---
>  include/linux/bpf_verifier.h |  3 +-
>  kernel/bpf/verifier.c| 70 
> +---
>  2 files changed, 49 insertions(+), 24 deletions(-)
> 
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index ac5b393..15ceb7f 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -22,7 +22,8 @@ struct bpf_reg_state {
>* Used to determine if any memory access using this register will
>* result in a bad access.
>*/
> - u64 min_value, max_value;
> + s64 min_value;
> + u64 max_value;
>   u32 id;
>   union {
>   /* valid when type == CONST_IMM | PTR_TO_STACK | UNKNOWN_VALUE */
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 89f787c..709fe0e 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -234,8 +234,8 @@ static void print_verifier_state(struct bpf_verifier_state *state)
>   reg->map_ptr->value_size,
>   reg->id);
>   if (reg->min_value != BPF_REGISTER_MIN_RANGE)
> - verbose(",min_value=%llu",
> - (unsigned long long)reg->min_value);
> + verbose(",min_value=%lld",
> + (long long)reg->min_value);
>   if (reg->max_value != BPF_REGISTER_MAX_RANGE)
>   verbose(",max_value=%llu",
>   (unsigned long long)reg->max_value);
> @@ -778,7 +778,7 @@ static int check_mem_access(struct bpf_verifier_env *env, u32 regno, int off,
>* index'es we need to make sure that whatever we use
>* will have a set floor within our range.
>*/
> - if ((s64)reg->min_value < 0) {
> + if (reg->min_value < 0) {
>   verbose("R%d min value is negative, either use unsigned index or do a if (index >=0) check.\n",
>   regno);
>   return -EACCES;
> @@ -1490,7 +1490,8 @@ static void check_reg_overflow(struct bpf_reg_state *reg)
>  {
>   if (reg->max_value > BPF_REGISTER_MAX_RANGE)
>   reg->max_value = BPF_REGISTER_MAX_RANGE;
> - if ((s64)reg->min_value < BPF_REGISTER_MIN_RANGE)
> + if (reg->min_value < BPF_REGISTER_MIN_RANGE ||
> + reg->min_value > BPF_REGISTER_MAX_RANGE)
>   reg->min_value = BPF_REGISTER_MIN_RANGE;
>  }
>  
> @@ -1498,7 +1499,8 @@ static void adjust_reg_min_max_vals(struct bpf_verifier_env *env,
>   struct bpf_insn *insn)
>  {
>   struct bpf_reg_state *regs = env->cur_state.regs, *dst_reg;
> - u64 min_val = BPF_REGISTER_MIN_RANGE, max_val = BPF_REGISTER_MAX_RANGE;
> + s64 min_val = BPF_REGISTER_MIN_RANGE;
> + u64 max_val = BPF_REGISTER_MAX_RANGE;
>   u8 opcode = BPF_OP(insn->code);
>  
>   dst_reg = &regs[insn->dst_reg];
> @@ -1532,22 +1534,43 @@ static void adjust_reg_min_max_vals(struct bpf_verifier_env *env,
>   return;
>   }
>  
> + /* If one of our values was at the end of our ranges then we can't just
> +  * do our normal operations to the register, we need to set the values
> +  * to the min/max since they are undefined.
> +  */
> + if (min_val == BPF_REGISTER_MIN_RANGE)
> + dst_reg->min_value = BPF_REGISTER_MIN_RANGE;
> + if (max_val == BPF_REGISTER_MAX_RANGE)
> + dst_reg->max_value = BPF_REGISTER_MAX_RANGE;
> +
>   switch (opcode) {
>   case BPF_ADD:
> - dst_reg->min_value += min_val;
> - dst_reg->max_value += max_val;
> + if (dst_reg->min_value != BPF_REGISTER_MIN_RANGE)
> + dst_reg->min_value += min_val;
> + if (dst_reg->max_value != BPF_REGISTER_MAX_RANGE)
> + dst_reg->max_value += max_val;
>   break;
>   case BPF_SUB:
> - dst_reg->min_value -= min_val;
> -

Re: TCP performance problems - GSO/TSO, MSS, 8139cp related

2016-11-11 Thread David Miller

From: Russell King - ARM Linux 
Date: Fri, 11 Nov 2016 22:33:08 +

> "The new buffer management algorithm provides capabilities of Microsoft
> Large-Send offload" and as yet I haven't found anything that describes
> what this is or how it works.

For once I will give Microsoft a big shout out here.

This, and everything a Microsoft networking driver interfaces to, is
_very_ much documented in extreme detail in the Microsoft NDIS
(Network Driver Interface Specification).

Microsoft's networking driver interfaces and expectations are
documented 1,000 times better than that of Linux.

Re: Source address fib invalidation on IPv6

2016-11-11 Thread Jason A. Donenfeld

Hi David,

On Fri, Nov 11, 2016 at 11:14 PM, David Ahern  wrote:
> What do you mean by 'valid dst'? ipv6 returns net->ipv6.ip6_null_entry on 
> lookup failures so yes dst is non-NULL but that does not mean the lookup 
> succeeded.

What I mean is that it returns an ordinary dst, as if that source
address _hadn't_ been removed from the interface, even though I just
removed it. Is this buggy behavior? If so, let me know and I'll try to
track it down. The expected behavior, as far as I can see, would be
the same that ip_route_output_flow has -- returning -EINVAL when the
saddr isn't valid. At the moment, when the saddr is invalid,
ipv6_stub->ipv6_dst_lookup returns 0 and dst contains a real entry.

Regards,
Jason

Re: Long delays creating a netns after deleting one (possibly RCU related)

2016-11-11 Thread Cong Wang

On Fri, Nov 11, 2016 at 4:23 PM, Paul E. McKenney
 wrote:
>
> Ah!  This net_mutex is different than RTNL.  Should synchronize_net() be
> modified to check for net_mutex being held in addition to the current
> checks for RTNL being held?
>

Good point!

Like commit be3fc413da9eb17cce0991f214ab0, checking
for net_mutex for this case seems to be an optimization, I assume
synchronize_rcu_expedited() and synchronize_rcu() have the same
behavior...

diff --git a/net/core/dev.c b/net/core/dev.c
index eaad4c2..3415b6b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7762,7 +7762,7 @@ EXPORT_SYMBOL(free_netdev);
 void synchronize_net(void)
 {
might_sleep();
-   if (rtnl_is_locked())
+   if (rtnl_is_locked() || lockdep_is_held(&net_mutex))
synchronize_rcu_expedited();
else
synchronize_rcu();
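
The branch in the diff above can be modelled in user space to make the intended behaviour explicit. This is only a sketch: the two boolean parameters stand in for rtnl_is_locked() and for "net_mutex is held", and the function and enum names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Which grace-period primitive the model would pick. */
enum gp_kind { GP_NORMAL, GP_EXPEDITED };

/*
 * User-space model of the proposed check. An expedited grace period is
 * chosen whenever either lock is reported as held, so that the holder
 * blocks other tasks for as short a time as possible.
 */
static enum gp_kind pick_grace_period(bool rtnl_locked, bool net_mutex_held)
{
	if (rtnl_locked || net_mutex_held)
		return GP_EXPEDITED;
	return GP_NORMAL;
}
```

In the reported trace the cleanup worker holds net_mutex but has already dropped RTNL, which corresponds to pick_grace_period(false, true) here.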

Re: Long delays creating a netns after deleting one (possibly RCU related)

2016-11-11 Thread Paul E. McKenney

On Fri, Nov 11, 2016 at 01:11:01PM +, Rolf Neugebauer wrote:
> On Thu, Nov 10, 2016 at 9:24 PM, Paul E. McKenney
>  wrote:
> > On Thu, Nov 10, 2016 at 09:37:47AM -0800, Cong Wang wrote:
> >> (Cc'ing Paul)
> >>
> >> On Wed, Nov 9, 2016 at 7:42 AM, Rolf Neugebauer
> >>  wrote:
> >> > Hi
> >> >
> >> > We noticed some long delays starting docker containers on some newer
> >> > kernels (starting with 4.5.x and still present in 4.9-rc4, 4.4.x is
> >> > fine). We narrowed this down to the creation of a network namespace
> >> > being delayed directly after removing another one (details and
> >> > reproduction below). We have seen delays of up to 60s on some systems.
> >> >
> >> > - The delay is proportional to the number of CPUs (online or offline).
> >> > We first discovered it with a Hyper-V Linux VM. Hyper-V advertises up
> >> > to 240 offline vCPUs even if one configures the VM with only, say 2
> >> > vCPUs. We see linear increase in delay when we change NR_CPUS in the
> >> > kernel config.
> >> >
> >> > - The delay is also dependent on some tunnel network interfaces being
> >> > present (which we had compiled in in one of our kernel configs).
> >> >
> >> > - We can reproduce this issue with stock kernels from
> >> > http://kernel.ubuntu.com/~kernel-ppa/mainline/ running in Hyper-V VMs
> >> > as well as other hypervisors like qemu and hyperkit where we have good
> >> > control over the number of CPUs.
> >> >
> >> > A simple test is:
> >> > modprobe ipip
> >> > modprobe ip_gre
> >> > modprobe ip_vti
> >> > echo -n "add netns foo ===> "; /usr/bin/time -f "%E" ip netns add foo
> >> > echo -n "del netns foo ===> "; /usr/bin/time -f "%E" ip netns delete foo
> >> > echo -n "add netns bar ===> "; /usr/bin/time -f "%E" ip netns add bar
> >> > echo -n "del netns bar ===> "; /usr/bin/time -f "%E" ip netns delete bar
> >> >
> >> > with an output like:
> >> > add netns foo ===> 0:00.00
> >> > del netns foo ===> 0:00.01
> >> > add netns bar ===> 0:08.53
> >> > del netns bar ===> 0:00.01
> >> >
> >> > This is on a 4.9-rc4 kernel from the above URL configured with
> >> > NR_CPUS=256 running in a Hyper-V VM (kernel config attached).
> >> >
> >> > Below is a dump of the work queues while the second 'ip netns add' is
> >> > hanging. The state of the work queues does not seem to change while
> >> > the command is delayed and the pattern shown is consistent across
> >> > different kernel versions.
> >> >
> >> > Is this a known issue and/or is someone working on a fix?
> >>
> >> Not to me.
> >>
> >>
> >> >
> >> > [  610.356272] sysrq: SysRq : Show Blocked State
> >> > [  610.356742]   taskPC stack   pid father
> >> > [  610.357252] kworker/u480:1  D0  1994  2 0x
> >> > [  610.357752] Workqueue: netns cleanup_net
> >> > [  610.358239]  9892f1065800  9892ee1e1e00
> >> > 9892f8e59340
> >> > [  610.358705]  9892f4526900 bf0104b5ba88 be486df3
> >> > bf0104b5ba60
> >> > [  610.359168]  00ffbdcbe663 9892f8e59340 000100012e70
> >> > 9892ee1e1e00
> >> > [  610.359677] Call Trace:
> >> > [  610.360169]  [] ? __schedule+0x233/0x6e0
> >> > [  610.360723]  [] schedule+0x36/0x80
> >> > [  610.361194]  [] schedule_timeout+0x22a/0x3f0
> >> > [  610.361789]  [] ? __schedule+0x23b/0x6e0
> >> > [  610.362260]  [] wait_for_completion+0xb4/0x140
> >> > [  610.362736]  [] ? wake_up_q+0x80/0x80
> >> > [  610.363306]  [] __wait_rcu_gp+0xc8/0xf0
> >> > [  610.363782]  [] synchronize_sched+0x5c/0x80
> >> > [  610.364137]  [] ? call_rcu_bh+0x20/0x20
> >> > [  610.364742]  [] ?
> >> > trace_raw_output_rcu_utilization+0x60/0x60
> >> > [  610.365337]  [] synchronize_net+0x1c/0x30
> >>
> >> This is a worker which holds the net_mutex and is waiting for
> >> a RCU grace period to elapse.

Ah!  This net_mutex is different than RTNL.  Should synchronize_net() be
modified to check for net_mutex being held in addition to the current
checks for RTNL being held?

Thanx, Paul

> >> > [  610.365846]  [] netif_napi_del+0x23/0x80
> >> > [  610.367494]  [] ip_tunnel_dev_free+0x68/0xf0 
> >> > [ip_tunnel]
> >> > [  610.368007]  [] netdev_run_todo+0x230/0x330
> >> > [  610.368454]  [] rtnl_unlock+0xe/0x10
> >> > [  610.369001]  [] ip_tunnel_delete_net+0xdf/0x120 
> >> > [ip_tunnel]
> >> > [  610.369500]  [] ipip_exit_net+0x2c/0x30 [ipip]
> >> > [  610.369997]  [] ops_exit_list.isra.4+0x38/0x60
> >> > [  610.370636]  [] cleanup_net+0x1c4/0x2b0
> >> > [  610.371130]  [] process_one_work+0x1fc/0x4b0
> >> > [  610.371812]  [] worker_thread+0x4b/0x500
> >> > [  610.373074]  [] ? process_one_work+0x4b0/0x4b0
> >> > [  610.373622]  [] ? process_one_work+0x4b0/0x4b0
> >> > [  610.374100]  [] kthread+0xd9/0xf0
> >> > [  610.374574]  [] ? kthread_park+0x60/0x60
> >> > [  610.375198]  [] ret_from_fork+0x25/0x30
> >> > [  610.375678] ip  D0  2149   2148

Re: [PATCH net-next v2 2/7] vxlan: simplify exception handling

2016-11-11 Thread Pravin Shelar

On Fri, Nov 11, 2016 at 3:14 AM, Jiri Benc  wrote:
> On Thu, 10 Nov 2016 11:21:19 -0800, Pravin Shelar wrote:
>> One additional variable is not bad but look at what has happened in
>> vxlan_xmit_one(). There are already more than 20 variables defined. It
>> is hard to read code in this case.
>
> I agree that the function is horrible.
>
> What I was thinking about was separating the vxlan data and control
> plane. The vxlan data plane would perform encapsulation and
> decapsulation based on lwtunnel infrastructure and the rest of the
> "classical" vxlan would be just one of the users of that. Basically
> replacing vxlan_rdst by ip_tunnel_info, among other things.
>
> That would make the vxlan code much much cleaner.
>
I have a patch which does something similar for geneve. But it is tricky
to do it for vxlan.

>> anyways I can add another variable to the function. I do not feel that
>> strongly about this.
>
> Me neither, actually. I prefer another variable but I won't oppose the
> patchset just based on that if you choose differently.
>
I have updated the patches already. I will post them soon.

Re: TCP performance problems - GSO/TSO, MSS, 8139cp related

2016-11-11 Thread Russell King - ARM Linux

On Fri, Nov 11, 2016 at 09:23:43PM +, David Woodhouse wrote:
> It's also *fairly* unlikely that the kernel in the guest has developed
> a bug and isn't setting gso_size sanely. I'm more inclined to suspect
> that qemu isn't properly emulating those bits. But at first glance at
> the code, it looks like *that's* been there for the last decade too...

I take issue with that, having looked at the qemu rtl8139 code:

if ((txdw0 & CP_TX_LGSEN) && ip_protocol == IP_PROTO_TCP)
{
int large_send_mss = (txdw0 >> 16) & CP_TC_LGSEN_MSS_MASK;

DPRINTF("+++ C+ mode offloaded task TSO MTU=%d IP data %d "
"frame data %d specified MSS=%d\n", ETH_MTU,
ip_data_len, saved_size - ETH_HLEN, large_send_mss);

That's the only reference to "large_send_mss" there, other than that,
the MSS value that gets stuck into the field by 8139cp.c is completely
unused.  Instead, qemu does this:

eth_payload_data = saved_buffer + ETH_HLEN;
eth_payload_len  = saved_size   - ETH_HLEN;

ip = (ip_header*)eth_payload_data;

hlen = IP_HEADER_LENGTH(ip);
ip_data_len = be16_to_cpu(ip->ip_len) - hlen;

tcp_header *p_tcp_hdr = (tcp_header*)(eth_payload_data + hlen);
int tcp_hlen = TCP_HEADER_DATA_OFFSET(p_tcp_hdr);

/* ETH_MTU = ip header len + tcp header len + payload */
int tcp_data_len = ip_data_len - tcp_hlen;
int tcp_chunk_size = ETH_MTU - hlen - tcp_hlen;

for (tcp_send_offset = 0; tcp_send_offset < tcp_data_len; tcp_send_offset += tcp_chunk_size)
{

It uses a fixed value of ETH_MTU to calculate the size of the TCP
data chunks, and this is not surprisingly the well known:

#define ETH_MTU 1500

Qemu seems to be buggy - it ignores the MSS value, and always tries to
send 1500 byte frames.
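
To put numbers on this, here is a small sketch of the chunk-size arithmetic. The header lengths in the usage note (20-byte IP header, 32-byte TCP header with timestamp options) are taken from the traces elsewhere in this thread; the function names are hypothetical and this is not qemu code.

```c
#include <assert.h>

#define ETH_MTU 1500	/* the fixed value used by the quoted qemu code */

/* Payload bytes per segment when the chunk size is derived from ETH_MTU. */
static int chunk_from_mtu(int ip_hlen, int tcp_hlen)
{
	return ETH_MTU - ip_hlen - tcp_hlen;
}

/* Number of segments needed to send data_len bytes in pieces of chunk. */
static int segment_count(int data_len, int chunk)
{
	return (data_len + chunk - 1) / chunk;
}

/* Largest IP packet (headers plus payload) produced for a given chunk. */
static int max_ip_len(int ip_hlen, int tcp_hlen, int chunk)
{
	return ip_hlen + tcp_hlen + chunk;
}
```

With those header lengths, chunk_from_mtu(20, 32) is 1448 and the resulting IP packets are 1500 bytes long, whereas a 1440-byte chunk gives 1492-byte packets. Both figures appear in the captures quoted in this thread.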

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

Re: TCP performance problems - GSO/TSO, MSS, 8139cp related

2016-11-11 Thread Russell King - ARM Linux

On Fri, Nov 11, 2016 at 09:23:43PM +, David Woodhouse wrote:
> On Fri, 2016-11-11 at 21:05 +, Russell King - ARM Linux wrote:
> > 
> > 18:59:38.782818 IP (tos 0x0, ttl 52, id 35619, offset 0, flags [DF], proto 
> > TCP (6), length 60)
> >     84.xx.xxx.196.61236 > 195.92.253.2.http: Flags [S], cksum 0x88db 
> > (correct), seq 158975430, win 29200, options [mss 1452,sackOK,TS val 
> > 1377914597 ecr 0,nop,wscale 7], length 0
> 
> ... (MSS 1452)
> 
> > 18:59:38.816371 IP (tos 0x0, ttl 64, id 25879, offset 0, flags [DF], proto 
> > TCP (6), length 1500)
> >     195.92.253.2.http > 84.xx.xxx.196.61236: Flags [.], seq 1:1449, ack 
> > 154, win 235, options [nop,nop,TS val 3363137952 ecr 1377914627], length 
> > 1448: HTTP, length: 1448
> > 18:59:38.816393 IP (tos 0x0, ttl 64, id 25880, offset 0, flags [DF], proto 
> > TCP (6), length 1484)
> >     195.92.253.2.http > 84.xx.xxx.196.61236: Flags [.], seq 1449:2881, ack 
> > 154, win 235, options [nop,nop,TS val 3363137952 ecr 1377914627], length 
> > 1432: HTTP
> 
> Can you instrument cp_start_xmit() in 8139cp.c and get it to print the
> value of 'mss' when this happens?

Well, I'm not going to fiddle in such a way with a public box... that
would be utter madness.  I'll fiddle with mvneta locally on 4.9-rc
instead - and yes, I know that's not the F23 4.4 kernel, so doesn't
really tell us very much.

I _could_ ask bryce to setup another VM on ZenV for me to play with,
but we'll have to wait for bryce to be around for that... I don't
want to break zenv or zeniv. :)

> All we do is take that value from skb_shinfo(skb)->gso_size, shift it a
> bit, and shove it in the descriptor ring. There's not much scope for a
> driver-specific bug.

Unless there's a different interpretation of what the MSS field in the
driver means...

Looking at mvneta, which works correctly,

- On mvneta (192.168.1.59):

21:39:38.535549 IP (tos 0x0, ttl 64, id 27668, offset 0, flags [DF], proto TCP 
(6), length 7252)
192.168.1.59.55170 > 192.168.1.18.5001: Flags [.], seq 25:7225, ack 1, win 
229, options [nop,nop,TS val 62231754 ecr 1387514367], length 7200

- On laptop (192.168.1.18):

21:39:38.537442 IP (tos 0x0, ttl 64, id 27668, offset 0, flags [DF], proto TCP 
(6), length 1492)
192.168.1.59.55170 > 192.168.1.18.commplex-link: Flags [.], seq 25:1465, 
ack 1, win 229, options [nop,nop,TS val 62231754 ecr 1387514367], length 1440
21:39:38.537453 IP (tos 0x0, ttl 64, id 27669, offset 0, flags [DF], proto TCP 
(6), length 1492)
192.168.1.59.55170 > 192.168.1.18.commplex-link: Flags [.], seq 1465:2905, 
ack 1, win 229, options [nop,nop,TS val 62231754 ecr 1387514367], length 1440
21:39:38.537461 IP (tos 0x0, ttl 64, id 27670, offset 0, flags [DF], proto TCP 
(6), length 1492)
192.168.1.59.55170 > 192.168.1.18.commplex-link: Flags [.], seq 2905:4345, 
ack 1, win 229, options [nop,nop,TS val 62231754 ecr 1387514367], length 1440
21:39:38.537464 IP (tos 0x0, ttl 64, id 9968, offset 0, flags [DF], proto TCP 
(6), length 52)
192.168.1.18.commplex-link > 192.168.1.59.55170: Flags [.], cksum 0x83c4 
(incorrect -> 0xa338), ack 1465, win 249, options [nop,nop,TS val 1387514368 
ecr 62231754], length 0
21:39:38.537465 IP (tos 0x0, ttl 64, id 27671, offset 0, flags [DF], proto TCP 
(6), length 1492)
192.168.1.59.55170 > 192.168.1.18.commplex-link: Flags [.], seq 4345:5785, 
ack 1, win 229, options [nop,nop,TS val 62231754 ecr 1387514367], length 1440
21:39:38.537469 IP (tos 0x0, ttl 64, id 27672, offset 0, flags [DF], proto TCP 
(6), length 1492)
192.168.1.59.55170 > 192.168.1.18.commplex-link: Flags [.], seq 5785:7225, 
ack 1, win 229, options [nop,nop,TS val 62231754 ecr 1387514367], length 1440

which is all correct.  Now, these packets have a larger TCP header
due to the options:

0x:  0022 6815 37dd 0050 4321 0201 0800 4500  ."h.7..PC!E.
 ^mac   ^iphdr
0x0010:  05d4 6c14 4000 4006 4572 c0a8 013b c0a8  ..l.@.@.Er...;..
0x0020:  0112 d782 1389 4cb4 f8f4 7454 ef10 8010  ..L...tT
  ^tcphdr
0x0030:  00e5 2a80  0101 080a 03b5 94ca 52b3  ..*...R.
^tcpopts
0x0040:  c9ff    0001  1389   
  ^start of data
0x0050:      fc18 3435 3637 3839  ..456789
0x0060:  3031 3233 3435 3637 3839 3031 3233 3435  0123456789012345

So the data starts at 66 (0x42) into this packet, followed by 1440 bytes
of data.  Looking at drivers/net/ethernet/marvell/mvneta.c, the only
way this can happen is if skb_shinfo(skb)->gso_size is 1440.  I'll
instrument mvneta to dump this value...
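
The offset and length quoted above follow from three header fields in the dump: the IPv4 version/IHL byte (0x45), the IPv4 total length (0x05d4) and the TCP data-offset byte (0x80). A minimal sketch of that arithmetic, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

#define ETH_HLEN 14	/* Ethernet header without a VLAN tag */

/* IPv4 header length in bytes from the version/IHL byte (0x45 -> 20). */
static int ipv4_hlen(uint8_t ver_ihl)
{
	return (ver_ihl & 0x0f) * 4;
}

/* TCP header length in bytes from the data-offset byte (0x80 -> 32). */
static int tcp_hlen(uint8_t doff_byte)
{
	return (doff_byte >> 4) * 4;
}

/* Offset of the first TCP payload byte within the Ethernet frame. */
static int payload_offset(uint8_t ver_ihl, uint8_t doff_byte)
{
	return ETH_HLEN + ipv4_hlen(ver_ihl) + tcp_hlen(doff_byte);
}

/* TCP payload length given the IPv4 total-length field. */
static int payload_len(uint16_t ip_tot_len, uint8_t ver_ihl, uint8_t doff_byte)
{
	return ip_tot_len - ipv4_hlen(ver_ihl) - tcp_hlen(doff_byte);
}
```

For the packet above: 14 + 20 + 32 = 66 bytes of headers, and 1492 - 20 - 32 = 1440 bytes of TCP data.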

While waiting for the kernel to build, I've been reading the TCP code,
and found this:

/* Compute the current effective MSS, taking SACKs and IP options,
 * and even PMTU discovery events into account.
 */
unsigned int tcp_current_mss(struct sock *sk)
...

[PATCH] net: alx: use new api ethtool_{get|set}_link_ksettings

2016-11-11 Thread Philippe Reynes

The ethtool API {get|set}_settings is deprecated.
Move this driver to the new API {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/atheros/alx/ethtool.c |   59 ---
 1 files changed, 35 insertions(+), 24 deletions(-)
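
The patch builds the legacy 32-bit masks first and then converts them with ethtool_convert_legacy_u32_to_link_mode(), and goes the other way with ethtool_convert_link_mode_to_legacy_u32(). The idea behind such a conversion can be modelled in user space as below. This is a sketch under assumptions, not the kernel helpers: the bitmap is modelled as two 32-bit words (MODE_WORDS is hypothetical) and the function names are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MODE_WORDS 2	/* hypothetical bitmap size: 2 x 32 bits */

/* Store a legacy 32-bit mask in the low word of a wider, zeroed bitmap. */
static void legacy_to_link_mode(uint32_t dst[MODE_WORDS], uint32_t legacy)
{
	memset(dst, 0, MODE_WORDS * sizeof(dst[0]));
	dst[0] = legacy;
}

/*
 * Narrow a bitmap back to 32 bits. Returns false when a bit above the
 * first word is set, since it cannot be represented in the legacy mask.
 */
static bool link_mode_to_legacy(uint32_t *legacy, const uint32_t src[MODE_WORDS])
{
	bool fits = true;
	int i;

	for (i = 1; i < MODE_WORDS; i++)
		if (src[i])
			fits = false;
	*legacy = src[0];
	return fits;
}
```

Keeping the driver logic on plain u32 values and converting only at the boundary, as the patch does, keeps the diff small for a driver that only supports modes that fit in the legacy mask.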

diff --git a/drivers/net/ethernet/atheros/alx/ethtool.c b/drivers/net/ethernet/atheros/alx/ethtool.c
index 08e22df..2f4eabf 100644
--- a/drivers/net/ethernet/atheros/alx/ethtool.c
+++ b/drivers/net/ethernet/atheros/alx/ethtool.c
@@ -125,64 +125,75 @@ static u32 alx_get_supported_speeds(struct alx_hw *hw)
return supported;
 }
 
-static int alx_get_settings(struct net_device *netdev, struct ethtool_cmd *ecmd)
+static int alx_get_link_ksettings(struct net_device *netdev,
+ struct ethtool_link_ksettings *cmd)
 {
struct alx_priv *alx = netdev_priv(netdev);
struct alx_hw *hw = &alx->hw;
+   u32 supported, advertising;
 
-   ecmd->supported = SUPPORTED_Autoneg |
+   supported = SUPPORTED_Autoneg |
  SUPPORTED_TP |
  SUPPORTED_Pause |
  SUPPORTED_Asym_Pause;
if (alx_hw_giga(hw))
-   ecmd->supported |= SUPPORTED_1000baseT_Full;
-   ecmd->supported |= alx_get_supported_speeds(hw);
+   supported |= SUPPORTED_1000baseT_Full;
+   supported |= alx_get_supported_speeds(hw);
 
-   ecmd->advertising = ADVERTISED_TP;
+   advertising = ADVERTISED_TP;
if (hw->adv_cfg & ADVERTISED_Autoneg)
-   ecmd->advertising |= hw->adv_cfg;
+   advertising |= hw->adv_cfg;
 
-   ecmd->port = PORT_TP;
-   ecmd->phy_address = 0;
+   cmd->base.port = PORT_TP;
+   cmd->base.phy_address = 0;
 
if (hw->adv_cfg & ADVERTISED_Autoneg)
-   ecmd->autoneg = AUTONEG_ENABLE;
+   cmd->base.autoneg = AUTONEG_ENABLE;
else
-   ecmd->autoneg = AUTONEG_DISABLE;
-   ecmd->transceiver = XCVR_INTERNAL;
+   cmd->base.autoneg = AUTONEG_DISABLE;
 
if (hw->flowctrl & ALX_FC_ANEG && hw->adv_cfg & ADVERTISED_Autoneg) {
if (hw->flowctrl & ALX_FC_RX) {
-   ecmd->advertising |= ADVERTISED_Pause;
+   advertising |= ADVERTISED_Pause;
 
if (!(hw->flowctrl & ALX_FC_TX))
-   ecmd->advertising |= ADVERTISED_Asym_Pause;
+   advertising |= ADVERTISED_Asym_Pause;
} else if (hw->flowctrl & ALX_FC_TX) {
-   ecmd->advertising |= ADVERTISED_Asym_Pause;
+   advertising |= ADVERTISED_Asym_Pause;
}
}
 
-   ethtool_cmd_speed_set(ecmd, hw->link_speed);
-   ecmd->duplex = hw->duplex;
+   cmd->base.speed = hw->link_speed;
+   cmd->base.duplex = hw->duplex;
+
+   ethtool_convert_legacy_u32_to_link_mode(cmd->link_modes.supported,
+   supported);
+   ethtool_convert_legacy_u32_to_link_mode(cmd->link_modes.advertising,
+   advertising);
 
return 0;
 }
 
-static int alx_set_settings(struct net_device *netdev, struct ethtool_cmd *ecmd)
+static int alx_set_link_ksettings(struct net_device *netdev,
+ const struct ethtool_link_ksettings *cmd)
 {
struct alx_priv *alx = netdev_priv(netdev);
struct alx_hw *hw = &alx->hw;
u32 adv_cfg;
+   u32 advertising;
 
ASSERT_RTNL();
 
-   if (ecmd->autoneg == AUTONEG_ENABLE) {
-   if (ecmd->advertising & ~alx_get_supported_speeds(hw))
+   ethtool_convert_link_mode_to_legacy_u32(&advertising,
+   cmd->link_modes.advertising);
+
+   if (cmd->base.autoneg == AUTONEG_ENABLE) {
+   if (advertising & ~alx_get_supported_speeds(hw))
return -EINVAL;
-   adv_cfg = ecmd->advertising | ADVERTISED_Autoneg;
+   adv_cfg = advertising | ADVERTISED_Autoneg;
} else {
-   adv_cfg = alx_speed_to_ethadv(ethtool_cmd_speed(ecmd),
- ecmd->duplex);
+   adv_cfg = alx_speed_to_ethadv(cmd->base.speed,
+ cmd->base.duplex);
 
if (!adv_cfg || adv_cfg == ADVERTISED_1000baseT_Full)
return -EINVAL;
@@ -300,8 +311,6 @@ static int alx_get_sset_count(struct net_device *netdev, int sset)
 }
 
 const struct ethtool_ops alx_ethtool_ops = {
-   .get_settings   = alx_get_settings,
-   .set_settings   = alx_set_settings,
.get_pauseparam = alx_get_pauseparam,
.set_pauseparam = alx_set_pauseparam,
.get_msglevel   = alx_get_msglevel,
@@ -310,4 +319,6 @@ static int alx_get_sset_count(struct net_device *netdev, int sset)

Re: Source address fib invalidation on IPv6

2016-11-11 Thread David Ahern

On 11/11/16 12:29 PM, Jason A. Donenfeld wrote:
> Hi folks,
> 
> If I'm replying to a UDP packet, I generally want to use a source
> address that's the same as the destination address of the packet to
> which I'm replying. For example:
> 
> Peer A sends packet: src = 10.0.0.1,  dst = 10.0.0.3
> Peer B replies with: src = 10.0.0.3, dst = 10.0.0.1
> 
> But let's complicate things. Let's say Peer B has multiple IPs on an
> interface: 10.0.0.2, 10.0.0.3. The default route uses 10.0.0.2. In
> this case what do you think should happen?
> 
> Case 1:
> Peer A sends packet: src = 10.0.0.1,  dst = 10.0.0.3
> Peer B replies with: src = 10.0.0.2, dst = 10.0.0.1
> 
> Case 2:
> Peer A sends packet: src = 10.0.0.1,  dst = 10.0.0.3
> Peer B replies with: src = 10.0.0.3, dst = 10.0.0.1
> 
> Intuition tells me the answer is "Case 2". If you agree, keep reading.
> If you disagree, stop reading here, and instead correct my poor
> intuition.
> 
> So, assuming "Case 2", when Peer B receives the first packet, he notes
> that packet's destination address, so that he can use it as a source
> address next. When replying, Peer B sets the stored source address and
> calls the routing function:
> 
> struct flowi4 fl = {
>.saddr = from_daddr_of_previous_packet,
>.daddr = from_saddr_of_previous_packet,
> };
> rt = ip_route_output_flow(sock_net(sock), &fl, sock);
> 
> What if, however, by the time Peer B chooses to reply, his interface
> no longer has that source address? No problem, because
> ip_route_output_flow will return -EINVAL in that case. So, we can do
> this:
> 
> struct flowi4 fl = {
>.saddr = from_daddr_of_previous_packet,
>.daddr = from_saddr_of_previous_packet,
> };
> rt = ip_route_output_flow(sock_net(sock), &fl, sock);
> if (unlikely(IS_ERR(rt))) {
>     fl.saddr = 0;
>     rt = ip_route_output_flow(sock_net(sock), &fl, sock);
> }
> 
> And then all is good in the neighborhood. This solution works. Done.
> 
> But what about IPv6? That's where we get into trouble:
> 
> struct flowi6 fl = {
>.saddr = from_daddr_of_previous_packet,
>.daddr = from_saddr_of_previous_packet,
> };
> ret = ipv6_stub->ipv6_dst_lookup(sock_net(sock), sock, &dst, &fl);
> 
> In this case, IPv6 returns a valid dst, when no interface has the
> source address anymore! So, there's no way to know whether or not the
> source address for replying has gone stale. We don't have a means of
> falling back to inaddr_any for the source address.

What do you mean by 'valid dst'? ipv6 returns net->ipv6.ip6_null_entry on 
lookup failures so yes dst is non-NULL but that does not mean the lookup 
succeeded.

For example take a look at ip6_dst_lookup_tail():
if (!*dst)
*dst = ip6_route_output_flags(net, sk, fl6, flags);

err = (*dst)->error;
if (err)
goto out_err_release;


perhaps I should add dst->error to the fib tracepoints ...
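
The point about a non-NULL dst not implying success can be shown with a small user-space mock. Everything here is hypothetical: struct mock_dst stands in for the kernel's dst_entry, and the -ENETUNREACH stored in the mock null entry is an assumption chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct dst_entry. */
struct mock_dst {
	int error;	/* 0 for a usable route, negative errno otherwise */
};

/* Plays the role of a "null entry": non-NULL, but flagged as failed. */
static struct mock_dst mock_null_entry = { .error = -ENETUNREACH };
static struct mock_dst mock_good_entry = { .error = 0 };

/* Hypothetical lookup: never returns NULL, even when nothing matches. */
static struct mock_dst *mock_lookup(int have_route)
{
	return have_route ? &mock_good_entry : &mock_null_entry;
}

/* Mirrors the quoted pattern: test dst->error, not the pointer. */
static int lookup_status(int have_route)
{
	struct mock_dst *dst = mock_lookup(have_route);

	return dst->error;
}
```

A caller that only checks the pointer would treat mock_lookup(0) as a success; checking the error field is what distinguishes the two cases.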

> 
> Primary question: is this behavior a bug? Or is this some consequence
> of a fundamental IPv6 difference with v4? Or is something else
> happening here?
> 
> Thanks,
> Jason
>

[PATCH net-next] bpf: fix range arithmetic for bpf map access

2016-11-11 Thread Josef Bacik

I made some invalid assumptions with BPF_AND and BPF_MOD that could result in
invalid accesses to bpf map entries.  Fix this up by doing a few things

1) Kill BPF_MOD support.  This doesn't actually get used by the compiler in real
life and just adds extra complexity.

2) Fix the logic for BPF_AND, don't allow AND of negative numbers and set the
minimum value to 0 for positive AND's.

3) Don't do operations on the ranges if they are set to the limits, as they are
by definition undefined, and allowing arithmetic operations on those values
could make them appear valid when they really aren't.

This fixes the testcase provided by Jann as well as a few other theoretical
problems.

Reported-by: Jann Horn 
Signed-off-by: Josef Bacik 
---
 include/linux/bpf_verifier.h |  3 +-
 kernel/bpf/verifier.c| 70 +---
 2 files changed, 49 insertions(+), 24 deletions(-)
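
As a reading aid, points 2) and 3) of the description can be modelled in user space roughly as follows. This is a sketch, not the verifier code: the REG_*_RANGE values are assumed sentinels, the struct and function names are hypothetical, and the upper-bound handling in range_and() is an assumption that goes beyond what the description states.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sentinel limits for this sketch; the real constants may differ. */
#define REG_MIN_RANGE (-(int64_t)1024 * 1024 * 1024)
#define REG_MAX_RANGE ((uint64_t)1024 * 1024 * 1024)

struct reg_range {
	int64_t min_value;	/* signed, so "negative" needs no cast */
	uint64_t max_value;
};

/* ADD-style update that leaves "unknown" bounds untouched (point 3). */
static void range_add(struct reg_range *dst, int64_t min_val, uint64_t max_val)
{
	/* An unknown operand bound makes the result bound unknown too. */
	if (min_val == REG_MIN_RANGE)
		dst->min_value = REG_MIN_RANGE;
	if (max_val == REG_MAX_RANGE)
		dst->max_value = REG_MAX_RANGE;

	if (dst->min_value != REG_MIN_RANGE)
		dst->min_value += min_val;
	if (dst->max_value != REG_MAX_RANGE)
		dst->max_value += max_val;
}

/* AND-style update (point 2): no negative operands, floor of 0 otherwise. */
static void range_and(struct reg_range *dst, int64_t min_val, uint64_t max_val)
{
	if (min_val < 0)
		dst->min_value = REG_MIN_RANGE;	/* negative operand: give up */
	else
		dst->min_value = 0;	/* x & non-negative mask is >= 0 */
	/* Assumption: the mask's upper bound also bounds the result. */
	dst->max_value = max_val;
}
```

The sentinel checks matter because adding to a bound that is already "unknown" would produce a value that looks like a real, tracked bound.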

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index ac5b393..15ceb7f 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -22,7 +22,8 @@ struct bpf_reg_state {
 * Used to determine if any memory access using this register will
 * result in a bad access.
 */
-   u64 min_value, max_value;
+   s64 min_value;
+   u64 max_value;
u32 id;
union {
/* valid when type == CONST_IMM | PTR_TO_STACK | UNKNOWN_VALUE */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 89f787c..709fe0e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -234,8 +234,8 @@ static void print_verifier_state(struct bpf_verifier_state *state)
reg->map_ptr->value_size,
reg->id);
if (reg->min_value != BPF_REGISTER_MIN_RANGE)
-   verbose(",min_value=%llu",
-   (unsigned long long)reg->min_value);
+   verbose(",min_value=%lld",
+   (long long)reg->min_value);
if (reg->max_value != BPF_REGISTER_MAX_RANGE)
verbose(",max_value=%llu",
(unsigned long long)reg->max_value);
@@ -778,7 +778,7 @@ static int check_mem_access(struct bpf_verifier_env *env, 
u32 regno, int off,
 * index'es we need to make sure that whatever we use
 * will have a set floor within our range.
 */
-   if ((s64)reg->min_value < 0) {
+   if (reg->min_value < 0) {
verbose("R%d min value is negative, either use 
unsigned index or do a if (index >=0) check.\n",
regno);
return -EACCES;
@@ -1490,7 +1490,8 @@ static void check_reg_overflow(struct bpf_reg_state *reg)
 {
if (reg->max_value > BPF_REGISTER_MAX_RANGE)
reg->max_value = BPF_REGISTER_MAX_RANGE;
-   if ((s64)reg->min_value < BPF_REGISTER_MIN_RANGE)
+   if (reg->min_value < BPF_REGISTER_MIN_RANGE ||
+   reg->min_value > BPF_REGISTER_MAX_RANGE)
reg->min_value = BPF_REGISTER_MIN_RANGE;
 }
 
@@ -1498,7 +1499,8 @@ static void adjust_reg_min_max_vals(struct 
bpf_verifier_env *env,
struct bpf_insn *insn)
 {
struct bpf_reg_state *regs = env->cur_state.regs, *dst_reg;
-   u64 min_val = BPF_REGISTER_MIN_RANGE, max_val = BPF_REGISTER_MAX_RANGE;
+   s64 min_val = BPF_REGISTER_MIN_RANGE;
+   u64 max_val = BPF_REGISTER_MAX_RANGE;
u8 opcode = BPF_OP(insn->code);
 
dst_reg = &regs[insn->dst_reg];
@@ -1532,22 +1534,43 @@ static void adjust_reg_min_max_vals(struct 
bpf_verifier_env *env,
return;
}
 
+   /* If one of our values was at the end of our ranges then we can't just
+* do our normal operations to the register, we need to set the values
+* to the min/max since they are undefined.
+*/
+   if (min_val == BPF_REGISTER_MIN_RANGE)
+   dst_reg->min_value = BPF_REGISTER_MIN_RANGE;
+   if (max_val == BPF_REGISTER_MAX_RANGE)
+   dst_reg->max_value = BPF_REGISTER_MAX_RANGE;
+
switch (opcode) {
case BPF_ADD:
-   dst_reg->min_value += min_val;
-   dst_reg->max_value += max_val;
+   if (dst_reg->min_value != BPF_REGISTER_MIN_RANGE)
+   dst_reg->min_value += min_val;
+   if (dst_reg->max_value != BPF_REGISTER_MAX_RANGE)
+   dst_reg->max_value += max_val;
break;
case BPF_SUB:
-   dst_reg->min_value -= min_val;
-   dst_reg->max_value -= max_val;
+   if (dst_reg->min_value != BPF_REGISTER_MIN_RANGE)
+   dst_reg->min_value -= min_val;
+

Re: TCP performance problems - GSO/TSO, MSS, 8139cp related

2016-11-11 Thread David Woodhouse

On Fri, 2016-11-11 at 21:05 +, Russell King - ARM Linux wrote:
> 
> 18:59:38.782818 IP (tos 0x0, ttl 52, id 35619, offset 0, flags [DF], proto 
> TCP (6), length 60)
>     84.xx.xxx.196.61236 > 195.92.253.2.http: Flags [S], cksum 0x88db 
> (correct), seq 158975430, win 29200, options [mss 1452,sackOK,TS val 
> 1377914597 ecr 0,nop,wscale 7], length 0

... (MSS 1452)

> 18:59:38.816371 IP (tos 0x0, ttl 64, id 25879, offset 0, flags [DF], proto 
> TCP (6), length 1500)
>     195.92.253.2.http > 84.xx.xxx.196.61236: Flags [.], seq 1:1449, ack 154, 
> win 235, options [nop,nop,TS val 3363137952 ecr 1377914627], length 1448: 
> HTTP, length: 1448
> 18:59:38.816393 IP (tos 0x0, ttl 64, id 25880, offset 0, flags [DF], proto 
> TCP (6), length 1484)
>     195.92.253.2.http > 84.xx.xxx.196.61236: Flags [.], seq 1449:2881, ack 
> 154, win 235, options [nop,nop,TS val 3363137952 ecr 1377914627], length 
> 1432: HTTP

Can you instrument cp_start_xmit() in 8139cp.c and get it to print the
value of 'mss' when this happens?

All we do is take that value from skb_shinfo(skb)->gso_size, shift it a
bit, and shove it in the descriptor ring. There's not much scope for a
driver-specific bug.

It's also *fairly* unlikely that the kernel in the guest has developed
a bug and isn't setting gso_size sanely. I'm more inclined to suspect
that qemu isn't properly emulating those bits. But at first glance at
the code, it looks like *that's* been there for the last decade too...
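
For readers following the quoted capture, the segment sizes can be checked by hand. The sketch below assumes a 20-byte IPv4 header without options and a 32-byte TCP header (20 bytes plus the 12-byte nop,nop,TS option block visible in the trace); the helper names are made up for illustration and are not taken from 8139cp or qemu.

```c
#include <assert.h>

/* Assumed header sizes: IPv4 without options, TCP with the
 * 12-byte "nop,nop,TS" option block seen in the capture. */
enum { IPV4_HDR = 20, TCP_HDR = 20, TCP_TS_OPT = 12 };

/* Hypothetical helper: TCP payload carried by one IPv4 packet,
 * given the "length" tcpdump prints for the IP datagram. */
static int tcp_payload(int ip_total_len)
{
	return ip_total_len - IPV4_HDR - TCP_HDR - TCP_TS_OPT;
}

/* Hypothetical helper: largest payload per segment for a given
 * MSS once the timestamp option has taken its 12 bytes. */
static int payload_for_mss(int mss)
{
	return mss - TCP_TS_OPT;
}
```

Under these assumptions the two on-wire packets (IP lengths 1500 and 1484) carry 1448 + 1432 = 2880 bytes, i.e. one 2880-byte super-packet split at 1448. A split at 1448 is what an MSS of 1460 would give; an MSS of 1452 would give 1440-byte segments.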

-- 
dwmw2


do bridge members need to be listed in /proc/net/dev_mcast

2016-11-11 Thread Brian J. Murrell

Hi.

I have a Linux router running 3.18.23 with IPv6 as well as IPv4
interfaces.  It doesn't seem to be hearing IPv6 multicast packets
though.

For example, it won't hear and respond to either router or neighbour
discovery packets unless I put the interface in promiscuous mode with
tcpdump.  I'm a bit stumped at what could cause that.

The interface that is not hearing the IPv6 multicast packets is a
bridge with an ethernet interface and two wifi interfaces as members:

# brctl show br-lan
bridge name     bridge id               STP enabled     interfaces
br-lan          7fff.26d42cb3eadf       no              eth0.1
                                                        wlan0
                                                        wlan1

The bridge does have the right multicast addresses configured in
/proc/net/dev_mcast:

8    br-lan  1 0 0001
8    br-lan  1 0 0002
8    br-lan  1 0 01005e01
8    br-lan  1 0 ff01
8    br-lan  1 0 ffb3eadf
8    br-lan  1 0 ff00
8    br-lan  1 0 01005e05
8    br-lan  1 0 01005e06

But what is interesting is that the wlan{0,1} interfaces that are in
the br-lan bridge are in the /proc/net/dev_mcast also:

15   wlan1   2 0 0001
15   wlan1   2 0 0002
15   wlan1   2 0 01005e01
15   wlan1   2 0 fff51e4c
15   wlan1   2 0 ff00
16   wlan0   2 0 0001
16   wlan0   2 0 0002
16   wlan0   2 0 01005e01
16   wlan0   2 0 fff51e4a
16   wlan0   2 0 ff00

But the ethernet member, eth0.1 is not.

Is it sufficient to have a bridge interface in /proc/net/dev_mcast or
do all of its member interfaces need the respective multicast
addresses listed in that file also?  It just seems odd to me that the
wlan interfaces are there but the ethernet interface is not.

If it is sufficient to have just the bridge in /proc/net/dev_mcast what
else could be causing this "deafness" to multicast that is resolved by
putting the interface into promiscuous mode?

Cheers,
b.



TCP performance problems - GSO/TSO, MSS, 8139cp related

2016-11-11 Thread Russell King - ARM Linux

Hi,

I seem to have found a severe performance issue somewhere in the
networking code.

This involves ZenIV.linux.org.uk, which is a qemu-kvm guest instance
on ZenV, which is configured to use macvtap for ZenIV to gain its
network access, with ZenIV using the 8139cp driver.

My initial testing was from my laptop (running 4.5.7), through a
router box (also running 4.5.7) and out my FTTC link, across the
Internet to ZenV (4.4.8-300.fc23.x86_64) and then onto the ZenIV
(also 4.4.8-300.fc23.x86_64) guest.  Thinking that it may be an
issue with my crappy FTTC, I switched the routing at my end over
the ADSL line, which showed the same issues.

Eventually, what fixed it was disabling both TSO and GSO in the
ZenIV guest.

Now, both my FTTC and ADSL links have a reduced MTU, and I'm having
to use TCPMSS on the router box to clamp the MSS - which gets
clamped to 1452, 8 bytes lower than the usual 1460 for standard
ethernet.
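
As a rough check of those numbers: the MSS is the link MTU minus the fixed IPv4 and TCP headers. The sketch below assumes 20-byte headers each and an MTU of 1492 on the reduced links; the message does not state the MTU, so that value is only inferred from the clamp of 1452, and the helper name is hypothetical.

```c
#include <assert.h>

/* Assumed fixed header sizes: IPv4 and TCP, both without options. */
enum { IPV4_HDR_LEN = 20, TCP_HDR_LEN = 20 };

/* Hypothetical helper: MSS that fits a given link MTU. */
static int mss_for_mtu(int mtu)
{
	return mtu - IPV4_HDR_LEN - TCP_HDR_LEN;
}
```

With these assumptions a 1500-byte MTU gives the usual 1460, and a 1492-byte MTU gives the clamped 1452.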

With TSO on, I see the guest sending TCP packets with a 2880 byte
payload:

17:36:07.006009 IP (tos 0x0, ttl 52, id 17517, offset 0, flags [DF], proto TCP 
(6), length 60)
84.xx.xxx.196.60846 > 195.92.253.2.http: Flags [S], cksum 0x2c25 (correct), 
seq 356291023, win 29200, options [mss 1452,sackOK,TS val 1372902818 ecr 
0,nop,wscale 7], length 0
17:36:07.006122 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), 
length 60)
195.92.253.2.http > 84.xx.xxx.196.60846: Flags [S.], cksum 0xed7f 
(incorrect -> 0x674a), seq 2784716623, ack 356291024, win 28960, options [mss 
1460,sackOK,TS val 3358126141 ecr 1372902818,nop,wscale 7], length 0
17:36:07.035531 IP (tos 0x0, ttl 52, id 17518, offset 0, flags [DF], proto TCP 
(6), length 52)
84.xx.xxx.196.60846 > 195.92.253.2.http: Flags [.], cksum 0x0634 (correct), 
ack 1, win 229, options [nop,nop,TS val 1372902848 ecr 3358126141], length 0
17:36:07.038233 IP (tos 0x0, ttl 52, id 17519, offset 0, flags [DF], proto TCP 
(6), length 205)
84.xx.xxx.196.60846 > 195.92.253.2.http: Flags [P.], cksum 0x3a1e 
(correct), seq 1:154, ack 1, win 229, options [nop,nop,TS val 1372902848 ecr 
3358126141], length 153: HTTP, length: 153
17:36:07.038356 IP (tos 0x0, ttl 64, id 38669, offset 0, flags [DF], proto TCP 
(6), length 52)
195.92.253.2.http > 84.xx.xxx.196.60846: Flags [.], cksum 0xed77 (incorrect 
-> 0x0575), ack 154, win 235, options [nop,nop,TS val 3358126173 ecr 
1372902848], length 0
17:36:07.039255 IP (tos 0x0, ttl 64, id 38670, offset 0, flags [DF], proto TCP 
(6), length 2932)
195.92.253.2.http > 84.xx.xxx.196.60846: Flags [.], seq 1:2881, ack 154, 
win 235, options [nop,nop,TS val 3358126174 ecr 1372902848], length 2880: HTTP, 
length: 2880
17:36:07.039442 IP (tos 0x0, ttl 64, id 38672, offset 0, flags [DF], proto TCP 
(6), length 2932)
195.92.253.2.http > 84.xx.xxx.196.60846: Flags [.], seq 2881:5761, ack 154, 
win 235, options [nop,nop,TS val 3358126174 ecr 1372902848], length 2880: HTTP
17:36:07.039579 IP (tos 0x0, ttl 64, id 38674, offset 0, flags [DF], proto TCP 
(6), length 2932)
195.92.253.2.http > 84.xx.xxx.196.60846: Flags [.], seq 5761:8641, ack 154, 
win 235, options [nop,nop,TS val 3358126174 ecr 1372902848], length 2880: HTTP
...etc...

On the macvtap side, however, which is post-segmentation by the
virtualised 8139cp hardware (this taken at a later time):

18:59:38.782818 IP (tos 0x0, ttl 52, id 35619, offset 0, flags [DF], proto TCP 
(6), length 60)
84.xx.xxx.196.61236 > 195.92.253.2.http: Flags [S], cksum 0x88db (correct), 
seq 158975430, win 29200, options [mss 1452,sackOK,TS val 1377914597 ecr 
0,nop,wscale 7], length 0
18:59:38.783270 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), 
length 60)
195.92.253.2.http > 84.xx.xxx.196.61236: Flags [S.], cksum 0x575d 
(correct), seq 4091022471, ack 158975431, win 28960, options [mss 
1460,sackOK,TS val 3363137919 ecr 1377914597,nop,wscale 7], length 0
18:59:38.812089 IP (tos 0x0, ttl 52, id 35620, offset 0, flags [DF], proto TCP 
(6), length 52)
84.xx.xxx.196.61236 > 195.92.253.2.http: Flags [.], cksum 0xf646 (correct), 
ack 1, win 229, options [nop,nop,TS val 1377914627 ecr 3363137919], length 0
18:59:38.814623 IP (tos 0x0, ttl 52, id 35621, offset 0, flags [DF], proto TCP 
(6), length 205)
84.xx.xxx.196.61236 > 195.92.253.2.http: Flags [P.], cksum 0x2a31 
(correct), seq 1:154, ack 1, win 229, options [nop,nop,TS val 1377914627 ecr 
3363137919], length 153: HTTP, length: 153
18:59:38.815025 IP (tos 0x0, ttl 64, id 25878, offset 0, flags [DF], proto TCP 
(6), length 52)
195.92.253.2.http > 84.xx.xxx.196.61236: Flags [.], cksum 0xf588 (correct), 
ack 154, win 235, options [nop,nop,TS val 3363137950 ecr 1377914627], length 0
18:59:38.816371 IP (tos 0x0, ttl 64, id 25879, offset 0, flags [DF], proto TCP 
(6), length 1500)
195.92.253.2.http > 84.xx.xxx.196.61236: Flags [.], seq 1:1449, ack 154, 
win 235, options [nop,nop,TS val 3363137952 ecr 1377914627], length 1448: HTTP, 
length: 1448
18:59:38.816393 IP (tos 0x0,

Re: [PATCH v2 1/6] qed: Add support for hardware offloaded iSCSI.

2016-11-11 Thread Arun Easi

On Fri, 11 Nov 2016, 7:57am, Hannes Reinecke wrote:

> On 11/08/2016 07:56 AM, Manish Rangankar wrote:
> > From: Yuval Mintz 
> > 
> > This adds the backbone required for the various HW initializations
> > which are necessary for the iSCSI driver (qedi) for QLogic FastLinQ
> > 4 line of adapters - FW notification, resource initializations, etc.
> > 
> > Signed-off-by: Arun Easi 
> > Signed-off-by: Yuval Mintz 
> > ---
> >  drivers/net/ethernet/qlogic/Kconfig|   15 +
> >  drivers/net/ethernet/qlogic/qed/Makefile   |1 +
> >  drivers/net/ethernet/qlogic/qed/qed.h  |7 +-
> >  drivers/net/ethernet/qlogic/qed/qed_dev.c  |   12 +
> >  drivers/net/ethernet/qlogic/qed/qed_int.h  |1 -
> >  drivers/net/ethernet/qlogic/qed/qed_iscsi.c| 1276
> > 
> >  drivers/net/ethernet/qlogic/qed/qed_iscsi.h|   52 +
> >  drivers/net/ethernet/qlogic/qed/qed_l2.c   |1 -
> >  drivers/net/ethernet/qlogic/qed/qed_ll2.c  |4 +-
> >  drivers/net/ethernet/qlogic/qed/qed_reg_addr.h |2 +
> >  drivers/net/ethernet/qlogic/qed/qed_spq.c  |   15 +
> >  include/linux/qed/qed_if.h |2 +
> >  include/linux/qed/qed_iscsi_if.h   |  229 +
> >  13 files changed, 1613 insertions(+), 4 deletions(-)
> >  create mode 100644 drivers/net/ethernet/qlogic/qed/qed_iscsi.c
> >  create mode 100644 drivers/net/ethernet/qlogic/qed/qed_iscsi.h
> >  create mode 100644 include/linux/qed/qed_iscsi_if.h
> > 
> > diff --git a/drivers/net/ethernet/qlogic/Kconfig
> > b/drivers/net/ethernet/qlogic/Kconfig
> > index 32f2a45..2832570 100644
> > --- a/drivers/net/ethernet/qlogic/Kconfig
> > +++ b/drivers/net/ethernet/qlogic/Kconfig
> > @@ -110,4 +110,19 @@ config QEDE
> >  config QED_RDMA
> > bool
> > 
> > +config QED_ISCSI
> > +   bool
> > +
> > +config QEDI
> > +   tristate "QLogic QED 25/40/100Gb iSCSI driver"
> > +   depends on QED
> > +   select QED_LL2
> > +   select QED_ISCSI
> > +   default n
> > +   ---help---
> > + This provides a temporary node that allows the compilation
> > + and logical testing of the hardware offload iSCSI support
> > + for QLogic QED. This would be replaced by the 'real' option
> > + once the QEDI driver is added [+relocated].
> > +
> >  endif # NET_VENDOR_QLOGIC
> > diff --git a/drivers/net/ethernet/qlogic/qed/Makefile
> > b/drivers/net/ethernet/qlogic/qed/Makefile
> > index 967acf3..597e15c 100644
> > --- a/drivers/net/ethernet/qlogic/qed/Makefile
> > +++ b/drivers/net/ethernet/qlogic/qed/Makefile
> > @@ -6,3 +6,4 @@ qed-y := qed_cxt.o qed_dev.o qed_hw.o qed_init_fw_funcs.o
> > qed_init_ops.o \
> >  qed-$(CONFIG_QED_SRIOV) += qed_sriov.o qed_vf.o
> >  qed-$(CONFIG_QED_LL2) += qed_ll2.o
> >  qed-$(CONFIG_QED_RDMA) += qed_roce.o
> > +qed-$(CONFIG_QED_ISCSI) += qed_iscsi.o
> > diff --git a/drivers/net/ethernet/qlogic/qed/qed.h
> > b/drivers/net/ethernet/qlogic/qed/qed.h
> > index 50b8a01..15286c1 100644
> > --- a/drivers/net/ethernet/qlogic/qed/qed.h
> > +++ b/drivers/net/ethernet/qlogic/qed/qed.h
> > @@ -35,6 +35,7 @@
> > 
> >  #define QED_WFQ_UNIT   100
> > 
> > +#define ISCSI_BDQ_ID(_port_id) (_port_id)
> >  #define QED_WID_SIZE(1024)
> >  #define QED_PF_DEMS_SIZE(4)
> > 
> > @@ -392,6 +393,7 @@ struct qed_hwfn {
> > boolusing_ll2;
> > struct qed_ll2_info *p_ll2_info;
> > struct qed_rdma_info*p_rdma_info;
> > +   struct qed_iscsi_info   *p_iscsi_info;
> > struct qed_pf_paramspf_params;
> > 
> > bool b_rdma_enabled_in_prs;
> > @@ -593,6 +595,8 @@ struct qed_dev {
> > /* Linux specific here */
> > struct  qede_dev*edev;
> > struct  pci_dev *pdev;
> > +   u32 flags;
> > +#define QED_FLAG_STORAGE_STARTED   (BIT(0))
> > int msg_enable;
> > 
> > struct pci_params   pci_params;
> > @@ -606,6 +610,7 @@ struct qed_dev {
> > union {
> > struct qed_common_cb_ops*common;
> > struct qed_eth_cb_ops   *eth;
> > +   struct qed_iscsi_cb_ops *iscsi;
> > } protocol_ops;
> > void*ops_cookie;
> > 
> > @@ -615,7 +620,7 @@ struct qed_dev {
> > struct qed_cb_ll2_info  *ll2;
> > u8  ll2_mac_address[ETH_ALEN];
> >  #endif
> > -
> > +   DECLARE_HASHTABLE(connections, 10);
> > const struct firmware   *firmware;
> > 
> > u32 rdma_max_sge;
> 10 connections? Only?
> Hmm.

10 is the hash bits => 2^10 hash buckets, allowing for a large number of 
connections. qedi driver currently uses 1k connections per port.
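
To make the bits-versus-buckets point concrete, here is a small userspace sketch. It is not the kernel's hashtable.h; the multiplicative hash, its constant and the helper names are illustrative assumptions chosen only to show how a 10-bit table maps keys onto 1024 buckets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the second argument of DECLARE_HASHTABLE(connections, 10). */
#define CONN_HASH_BITS 10

/* Number of buckets in a table sized by a bit count. */
static size_t hash_buckets(unsigned int bits)
{
	return (size_t)1 << bits;
}

/* Illustrative multiplicative hash: keeps the top 'bits' bits of the
 * 32-bit product, so the result is always below hash_buckets(bits).
 * 'bits' must be between 1 and 32. */
static uint32_t bucket_index(uint32_t key, unsigned int bits)
{
	return (uint32_t)(key * UINT32_C(0x61C88647)) >> (32 - bits);
}
```

Each bucket holds a list, so the number of connections is not limited to the bucket count; more entries than buckets only lengthens the chains.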

Thanks for the reviews, Hannes.

Regards,
-Arun

> 
> Other than that:
> 
> Reviewed-by: Hannes Reinecke 
> 
> Cheers,
> 
> Hannes
>

[RFC PATCH net-next] net: ethtool: add support for forward error correction modes

2016-11-11 Thread Casey Leedom

N.B.  Sorry I'm not able to respond to the original message since I
wasn't subscribed to netdev when it was sent a couple of weeks ago.

This feature is something that Chelsio's cxgb4 driver needs.
As we've tested our adapters against a number of switches,
we've discovered a few which use varying defaults for FEC.
And when Auto-Negotiation isn't used (or even possible with
Optical Links), we need to be able to control turning FEC on/off.

For our part, we default FEC Off for Optical Transceivers.
For Copper, we read the Cable's EEPROM to determine
how to default FEC.  For some switches this works, but
for at least one where that switch enables FEC for Optical
Transceivers, it doesn't.  For that switch we had to hard
wire FEC on.  Obviously that's not a good solution and we
need an administrative interface so the system
administrator can configure our adapter to use the
appropriate FEC setting to match that of the switch.

So this is basically a long-winded ACK of Vidya's patch
and we would immediately implement this new ethtool API
as soon as it's available.

Casey

[PATCH] icmp: Restore resistence to abnormal messages

2016-11-11 Thread Vicente Jimenez Aguilar

Restore network resistance to abnormal ICMP fragmentation needed messages
with next hop MTU equal to (or exceeding) dropped packet size

Fixes: 46517008e116 ("ipv4: Kill ip_rt_frag_needed().")
Signed-off-by: Vicente Jimenez Aguilar 
---
 net/ipv4/icmp.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 38abe70..4c90d76 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -773,6 +773,7 @@ static bool icmp_tag_validation(int proto)
 static bool icmp_unreach(struct sk_buff *skb)
 {
const struct iphdr *iph;
+   unsigned short old_mtu;
struct icmphdr *icmph;
struct net *net;
u32 info = 0;
@@ -819,6 +820,12 @@ static bool icmp_unreach(struct sk_buff *skb)
/* fall through */
case 0:
info = ntohs(icmph->un.frag.mtu);
+   /* Handle weird case where next hop MTU is
+* equal to or exceeding dropped packet size
+*/
+   old_mtu = ntohs(iph->tot_len);
+   if (info >= old_mtu)
+   info = old_mtu - 2;
}
break;
case ICMP_SR_FAILED:
-- 
2.9.3
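
The clamping rule in the hunk above can be exercised outside the kernel. This is a minimal sketch that copies only the arithmetic of the patch; the function name and the plain integer parameters are assumptions for illustration, with both values already in host byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical standalone version of the check added to icmp_unreach():
 * 'info' is the next-hop MTU reported in the ICMP message, 'tot_len' is
 * the total length of the packet that was dropped. */
static uint32_t clamp_frag_needed_mtu(uint32_t info, uint16_t tot_len)
{
	/* A next-hop MTU that is not smaller than the dropped packet
	 * cannot be right, so fall back to just below the packet size. */
	if (info >= tot_len)
		info = tot_len - 2;
	return info;
}
```

A sane report (1400 for a 1500-byte packet) passes through unchanged, while a report equal to or above the packet size is pulled down to tot_len - 2.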

Source address fib invalidation on IPv6

2016-11-11 Thread Jason A. Donenfeld

Hi folks,

If I'm replying to a UDP packet, I generally want to use a source
address that's the same as the destination address of the packet to
which I'm replying. For example:

Peer A sends packet: src = 10.0.0.1,  dst = 10.0.0.3
Peer B replies with: src = 10.0.0.3, dst = 10.0.0.1

But let's complicate things. Let's say Peer B has multiple IPs on an
interface: 10.0.0.2, 10.0.0.3. The default route uses 10.0.0.2. In
this case what do you think should happen?

Case 1:
Peer A sends packet: src = 10.0.0.1,  dst = 10.0.0.3
Peer B replies with: src = 10.0.0.2, dst = 10.0.0.1

Case 2:
Peer A sends packet: src = 10.0.0.1,  dst = 10.0.0.3
Peer B replies with: src = 10.0.0.3, dst = 10.0.0.1

Intuition tells me the answer is "Case 2". If you agree, keep reading.
If you disagree, stop reading here, and instead correct my poor
intuition.

So, assuming "Case 2", when Peer B receives the first packet, he notes
that packet's destination address, so that he can use it as a source
address next. When replying, Peer B sets the stored source address and
calls the routing function:

struct flowi4 fl = {
   .saddr = from_daddr_of_previous_packet,
   .daddr = from_saddr_of_previous_packet,
};
rt = ip_route_output_flow(sock_net(sock), &fl, sock);

What if, however, by the time Peer B chooses to reply, his interface
no longer has that source address? No problem, because
ip_route_output_flow will return -EINVAL in that case. So, we can do
this:

struct flowi4 fl = {
   .saddr = from_daddr_of_previous_packet,
   .daddr = from_saddr_of_previous_packet,
};
rt = ip_route_output_flow(sock_net(sock), &fl, sock);
if (unlikely(IS_ERR(rt))) {
    fl.saddr = 0;
    rt = ip_route_output_flow(sock_net(sock), &fl, sock);
}

And then all is good in the neighborhood. This solution works. Done.
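
The retry pattern above can be modelled in plain userspace C. Everything in this sketch is a stand-in: route_lookup() is a toy that only checks a list of local addresses and returns -EINVAL for a stale source, which is the behaviour described for ip_route_output_flow(), and it does not call any kernel API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct flow {
	uint32_t saddr;
	uint32_t daddr;
};

/* Toy lookup (assumption, not the kernel routine): a zero saddr means
 * "pick one for me"; a non-zero saddr must be a local address. */
static int route_lookup(const uint32_t *local, size_t n, struct flow *fl)
{
	size_t i;

	if (n == 0)
		return -ENETUNREACH;
	if (fl->saddr == 0) {
		fl->saddr = local[0];	/* default source address */
		return 0;
	}
	for (i = 0; i < n; i++)
		if (local[i] == fl->saddr)
			return 0;
	return -EINVAL;			/* source address went stale */
}

/* Try the remembered source first, then fall back to "any".
 * Returns the source address used, or 0 if no route was found. */
static uint32_t reply_source(const uint32_t *local, size_t n,
			     uint32_t prev_daddr, uint32_t prev_saddr)
{
	struct flow fl = { .saddr = prev_daddr, .daddr = prev_saddr };

	if (route_lookup(local, n, &fl) < 0) {
		fl.saddr = 0;
		if (route_lookup(local, n, &fl) < 0)
			return 0;
	}
	return fl.saddr;
}
```

While 10.0.0.3 is still configured the reply keeps it as source; once it is gone the lookup fails and the retry settles on the default 10.0.0.2.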

But what about IPv6? That's where we get into trouble:

struct flowi6 fl = {
   .saddr = from_daddr_of_previous_packet,
   .daddr = from_saddr_of_previous_packet,
};
ret = ipv6_stub->ipv6_dst_lookup(sock_net(sock), sock, &dst, &fl);

In this case, IPv6 returns a valid dst, when no interface has the
source address anymore! So, there's no way to know whether or not the
source address for replying has gone stale. We don't have a means of
falling back to inaddr_any for the source address.

Primary question: is this behavior a bug? Or is this some consequence
of a fundamental IPv6 difference with v4? Or is something else
happening here?

Thanks,
Jason

Re: [net-next PATCH] net: dummy: Introduce dummy virtual functions

2016-11-11 Thread kbuild test robot

Hi Phil,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Phil-Sutter/net-dummy-Introduce-dummy-virtual-functions/20161112-013558
config: m68k-sun3_defconfig (attached as .config)
compiler: m68k-linux-gcc (GCC) 4.9.0
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=m68k 

All errors (new ones prefixed by >>):

   drivers/net/dummy.c:53:2: error: unknown field 'sriov' specified in 
initializer
 .sriov = &pdev_sriov,
 ^
   drivers/net/dummy.c:53:2: warning: initialization makes integer from pointer 
without a cast
   drivers/net/dummy.c:53:2: warning: (near initialization for 
'pci_pdev.is_virtfn')
   drivers/net/dummy.c:53:2: error: initializer element is not computable at 
load time
   drivers/net/dummy.c:53:2: error: (near initialization for 
'pci_pdev.is_virtfn')
>> drivers/net/dummy.c:54:14: error: 'pci_bus_type' undeclared here (not in a 
>> function)
 .dev.bus = &pci_bus_type,
 ^

vim +/pci_bus_type +54 drivers/net/dummy.c

47  static int num_vfs;
48  
49  static struct pci_sriov pdev_sriov;
50  
51  static struct pci_dev pci_pdev = {
52  .is_physfn = 1,
  > 53  .sriov = &pdev_sriov,
  > 54  .dev.bus = &pci_bus_type,
55  };
56  
57  struct vf_data_storage {

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation



Re: [net-next PATCH] net: dummy: Introduce dummy virtual functions

2016-11-11 Thread kbuild test robot

Hi Phil,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Phil-Sutter/net-dummy-Introduce-dummy-virtual-functions/20161112-013558
config: xtensa-common_defconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=xtensa 

All error/warnings (new ones prefixed by >>):

>> drivers/net/dummy.c:53:2: error: unknown field 'sriov' specified in 
>> initializer
 .sriov = &pdev_sriov,
 ^
>> drivers/net/dummy.c:53:2: warning: initialization makes integer from pointer 
>> without a cast
   drivers/net/dummy.c:53:2: warning: (near initialization for 
'pci_pdev.is_virtfn')
>> drivers/net/dummy.c:53:2: error: initializer element is not computable at 
>> load time
   drivers/net/dummy.c:53:2: error: (near initialization for 
'pci_pdev.is_virtfn')

vim +/sriov +53 drivers/net/dummy.c

47  static int num_vfs;
48  
49  static struct pci_sriov pdev_sriov;
50  
51  static struct pci_dev pci_pdev = {
52  .is_physfn = 1,
  > 53  .sriov = &pdev_sriov,
54  .dev.bus = &pci_bus_type,
55  };
56  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation



Re: [PATCH] bpf: fix range arithmetic for bpf map access

2016-11-11 Thread Josef Bacik


On 11/11/2016 11:36 AM, Jann Horn wrote:
> On Fri, Nov 11, 2016 at 1:18 AM, Josef Bacik wrote:
> > ---
> > Sorry Jann, I saw your response last night and then promptly forgot about it,
> > here's the git-send-email version.
> > ---
>
> A note: This doesn't seem to apply cleanly to current net-next (or I'm
> too stupid to use "git am"), so I'm applying it on
> f41cd11d64b2b21012eb4abffbe579bc0b90467f, which is net-next from a few
> days ago.

Yeah Dave pulled in a cleanup fix like right after I rebased onto net-next, I'll
rebase again before I send the next patch.

> > I made some invalid assumptions with BPF_AND and BPF_MOD that could result in
> > invalid accesses to bpf map entries.  Fix this up by doing a few things
> >
> > 1) Kill BPF_MOD support.  This doesn't actually get used by the compiler in real
> > life and just adds extra complexity.
>
> Yay! As a security person, I am very much in favor of killing unused features.

I almost dropped AND but thought better of it ;).

> > 2) Fix the logic for BPF_AND.  If the min value is negative then that is the new
> > minimum, otherwise it is unconditionally 0.
> >
> > 3) Don't do operations on the ranges if they are set to the limits, as they are
> > by definition undefined, and allowing arithmetic operations on those values
> > could make them appear valid when they really aren't.
> >
> > This fixes the testcase provided by Jann as well as a few other theoretical
> > problems.
> >
> > Reported-by: Jann Horn
> > Signed-off-by: Josef Bacik
>
> A nit: check_mem_access() still has an explicit cast of reg->min_value to s64, I
> think that's not necessary anymore?

Yup just missed that, I'll fix it.

> > case BPF_AND:
> > -   /* & is special since it could end up with 0 bits set. */
> > -   dst_reg->min_value &= min_val;
> > +   /* & is special since it's could be any value within our range,
> > +* including 0.  But if the thing we're AND'ing against is
> > +* negative and we're negative then that's the minimum value,
> > +* otherwise the minimum will always be 0.
> > +*/
> > +   if (min_val < 0 && dst_reg->min_value < 0)
> > +   dst_reg->min_value = min_t(s64, dst_reg->min_value,
> > +  min_val);
> > +   else
> > +   dst_reg->min_value = 0;
> > dst_reg->max_value = max_val;
>
> I'm not sure whether this is correct when dealing with signed numbers.
> Let's say I have -2 and -3 (as u32: 0xfffffffe and 0xfffffffd) and AND them
> together. The result is 0xfffffffc, or -4, right? So if I just compute
> the AND of constant numbers -2 and -3 (known to the verifier), the
> verifier would compute minimum -3 while the actual value is -4, right?
>
> If I am correct about this, I think it might make sense to just reset
> the state to unknown in the `min_val < 0 && dst_reg->min_value < 0` case.
> That shouldn't occur in legitimate programs, right?

Yeah actually I think you are right, we'll just assume that AND'ing negative
values means you did something wrong and set it to the RANGE_MIN value.  Thanks!
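
The -2 & -3 example is easy to confirm in userspace. The sketch below is only an illustration of the arithmetic and of the conservative rule agreed on above; UNKNOWN_MIN is a hypothetical stand-in for the verifier's "unknown" lower limit, not the kernel's BPF_REGISTER_MIN_RANGE value, and the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sentinel meaning "lower bound unknown". */
#define UNKNOWN_MIN INT64_MIN

/* Plain two's-complement AND of two signed 64-bit values. */
static int64_t and_value(int64_t a, int64_t b)
{
	return a & b;
}

/* The bound the first version of the patch would have produced when
 * both operands are negative: the smaller of the two minimums. */
static int64_t naive_and_min(int64_t min_a, int64_t min_b)
{
	return min_a < min_b ? min_a : min_b;
}

/* Conservative bound: if both minimums are negative, give up and
 * report "unknown"; otherwise at least one operand has a clear sign
 * bit, so the result of the AND cannot be negative. */
static int64_t safe_and_min(int64_t min_a, int64_t min_b)
{
	if (min_a < 0 && min_b < 0)
		return UNKNOWN_MIN;
	return 0;
}
```

Here and_value(-2, -3) is -4, which is below the naive bound of -3, so the naive rule would have let an out-of-range value through.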


Josef

[PATCH iproute2 v1 1/2] iproute2: avoid exit in case of error.

2016-11-11 Thread David Decotigny

Be consistent with how non-0 print_route() return values are handled
elsewhere: return -1.


---
 ip/iproute.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index 98bfad6..dae793b 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -1743,7 +1743,7 @@ static int iproute_get(int argc, char **argv)
 
if (print_route(NULL, &req.n, (void *)stdout) < 0) {
fprintf(stderr, "An error :-)\n");
-   exit(1);
+   return -1;
}
 
if (req.n.nlmsg_type != RTM_NEWROUTE) {
-- 
2.8.0.rc3.226.g39d4020

[PATCH iproute2 v1 2/2] iproute2: a non-expected rtnl message is an error

2016-11-11 Thread David Decotigny


---
 ip/iproute.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index dae793b..10d0afe 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -320,7 +320,7 @@ int print_route(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
if (n->nlmsg_type != RTM_NEWROUTE && n->nlmsg_type != RTM_DELROUTE) {
fprintf(stderr, "Not a route: %08x %08x %08x\n",
n->nlmsg_len, n->nlmsg_type, n->nlmsg_flags);
-   return 0;
+   return -1;
}
if (filter.flushb && n->nlmsg_type != RTM_NEWROUTE)
return 0;
-- 
2.8.0.rc3.226.g39d4020

[PATCH v2 net-next 6/6] bpf: Add tests for the LRU bpf_htab

2016-11-11 Thread Martin KaFai Lau

This patch adds some unit tests and a test_lru_dist program.

test_lru_dist reads the numeric keys from a file.
The files used here were generated by a modified fio-genzipf tool
originating from the fio test suite.  The sample data files can be
found here: https://github.com/iamkafai/bpf-lru

The zipf.* data files have 100k numeric keys, and the keys
range from 1 to 100k.

test_lru_dist outputs the number of unique keys (nr_unique).
For example, nr_unique:61239 below means that 61239 of the 100k keys
are unique.  nr_misses counts the keys that could not be found in the
LRU map, so nr_misses must be >= nr_unique.  test_lru_dist also
simulates a perfect LRU map as a comparison:

[root@arch-fb-vm1 ~]# ~/devshare/fb-kernel/linux/samples/bpf/test_lru_dist \
/root/zipf.100k.a1_01.out 4000 1
...
test_parallel_lru_dist (map_type:9 map_flags:0x0):
task:0 BPF LRU: nr_unique:23093(/10) nr_misses:31603(/10)
task:0 Perfect LRU: nr_unique:23093(/10 nr_misses:34328(/10)

test_parallel_lru_dist (map_type:9 map_flags:0x2):
task:0 BPF LRU: nr_unique:23093(/10) nr_misses:31710(/10)
task:0 Perfect LRU: nr_unique:23093(/10 nr_misses:34328(/10)

[root@arch-fb-vm1 ~]# ~/devshare/fb-kernel/linux/samples/bpf/test_lru_dist \
/root/zipf.100k.a0_01.out 4 1
...
test_parallel_lru_dist (map_type:9 map_flags:0x0):
task:0 BPF LRU: nr_unique:61239(/10) nr_misses:67054(/10)
task:0 Perfect LRU: nr_unique:61239(/10 nr_misses:66993(/10)
...
test_parallel_lru_dist (map_type:9 map_flags:0x2):
task:0 BPF LRU: nr_unique:61239(/10) nr_misses:67068(/10)
task:0 Perfect LRU: nr_unique:61239(/10 nr_misses:66993(/10)
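
For readers who want to see why nr_misses can never drop below nr_unique, here is a small userspace model of a "perfect" LRU. It is a sketch written for this note, not the code in test_lru_dist.c; the array-based cache, the size limits and all names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_KEY 64	/* keys are assumed to be in [0, MAX_KEY) */
#define MAX_CAP 16	/* largest cache size this toy supports */

struct lru_stats {
	unsigned int nr_unique;
	unsigned int nr_misses;
};

/* Replay 'n' keys through an exact LRU cache holding 'cap' entries. */
static struct lru_stats perfect_lru(const unsigned int *keys, size_t n,
				    size_t cap)
{
	unsigned int cache[MAX_CAP];	/* cache[0] is most recently used */
	unsigned char seen[MAX_KEY];
	struct lru_stats st = { 0, 0 };
	size_t used = 0, i, j;

	memset(seen, 0, sizeof(seen));
	if (cap < 1)
		cap = 1;
	if (cap > MAX_CAP)
		cap = MAX_CAP;

	for (i = 0; i < n; i++) {
		unsigned int k = keys[i];

		if (k >= MAX_KEY)
			continue;	/* outside the range this toy handles */
		if (!seen[k]) {
			seen[k] = 1;
			st.nr_unique++;
		}

		for (j = 0; j < used; j++)
			if (cache[j] == k)
				break;

		if (j == used) {	/* miss */
			st.nr_misses++;
			if (used < cap)
				used++;
			j = used - 1;	/* new tail slot, or the evicted LRU entry */
		}

		/* Move entries [0, j) down one slot and put k in front. */
		memmove(&cache[1], &cache[0], j * sizeof(cache[0]));
		cache[0] = k;
	}
	return st;
}
```

The first access to any key is always a miss, which gives the nr_misses >= nr_unique bound; a cache that is too small for the working set adds further misses on top, as the cyclic 1,2,3 pattern shows with a two-entry cache.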

LRU map has also been added to map_perf_test:
/* Global LRU */
[root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
./map_perf_test 16 $i | awk '{r += $3}END{print r " updates"}'; done
 1 cpus: 2934082 updates
 4 cpus: 7391434 updates
 8 cpus: 6500576 updates

/* Percpu LRU */
[root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
./map_perf_test 32 $i | awk '{r += $3}END{print r " updates"}'; done
  1 cpus: 2896553 updates
  4 cpus: 9766395 updates
  8 cpus: 17460553 updates

Signed-off-by: Martin KaFai Lau 
---
 samples/bpf/Makefile   |   2 +
 samples/bpf/map_perf_test_kern.c   |  39 ++
 samples/bpf/map_perf_test_user.c   |  32 ++
 samples/bpf/test_lru_dist.c| 538 ++
 tools/testing/selftests/bpf/Makefile   |   6 +-
 tools/testing/selftests/bpf/test_lru_map.c | 583 +
 6 files changed, 1197 insertions(+), 3 deletions(-)
 create mode 100644 samples/bpf/test_lru_dist.c
 create mode 100644 tools/testing/selftests/bpf/test_lru_map.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 5c53fdb..efd08c0 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -2,6 +2,7 @@
 obj- := dummy.o
 
 # List of programs to build
+hostprogs-y := test_lru_dist
 hostprogs-y += sock_example
 hostprogs-y += fds_example
 hostprogs-y += sockex1
@@ -27,6 +28,7 @@ hostprogs-y += test_current_task_under_cgroup
 hostprogs-y += trace_event
 hostprogs-y += sampleip
 
+test_lru_dist-objs := test_lru_dist.o libbpf.o
 sock_example-objs := sock_example.o libbpf.o
 fds_example-objs := bpf_load.o libbpf.o fds_example.o
 sockex1-objs := bpf_load.o libbpf.o sockex1_user.o
diff --git a/samples/bpf/map_perf_test_kern.c b/samples/bpf/map_perf_test_kern.c
index 311538e..7ee1574 100644
--- a/samples/bpf/map_perf_test_kern.c
+++ b/samples/bpf/map_perf_test_kern.c
@@ -19,6 +19,21 @@ struct bpf_map_def SEC("maps") hash_map = {
.max_entries = MAX_ENTRIES,
 };
 
+struct bpf_map_def SEC("maps") lru_hash_map = {
+   .type = BPF_MAP_TYPE_LRU_HASH,
+   .key_size = sizeof(u32),
+   .value_size = sizeof(long),
+   .max_entries = 1,
+};
+
+struct bpf_map_def SEC("maps") percpu_lru_hash_map = {
+   .type = BPF_MAP_TYPE_LRU_HASH,
+   .key_size = sizeof(u32),
+   .value_size = sizeof(long),
+   .max_entries = 1,
+   .map_flags = BPF_F_NO_COMMON_LRU,
+};
+
 struct bpf_map_def SEC("maps") percpu_hash_map = {
.type = BPF_MAP_TYPE_PERCPU_HASH,
.key_size = sizeof(u32),
@@ -53,6 +68,7 @@ int stress_hmap(struct pt_regs *ctx)
value = bpf_map_lookup_elem(&hash_map, &key);
if (value)
bpf_map_delete_elem(&hash_map, &key);
+
return 0;
 }
 
@@ -96,5 +112,28 @@ int stress_percpu_hmap_alloc(struct pt_regs *ctx)
bpf_map_delete_elem(&percpu_hash_map_alloc, &key);
return 0;
 }
+
+SEC("kprobe/sys_getpid")
+int stress_lru_hmap_alloc(struct pt_regs *ctx)
+{
+   u32 key = bpf_get_prandom_u32();
+   long val = 1;
+
+   bpf_map_update_elem(&lru_hash_map, &key, &val, BPF_ANY);
+
+   return 0;
+}
+
+SEC("kprobe/sys_getppid")
+int stress_percpu_lru_hmap_alloc(struct pt_regs *ctx)
+{
+   u32 key = bpf_get_prandom_u32();
+   long val = 1;
+
+

[PATCH v2 net-next 3/6] bpf: Refactor codes handling percpu map

2016-11-11 Thread Martin KaFai Lau

Refactor the code that populates the value
of an htab_elem in a BPF_MAP_TYPE_PERCPU_HASH
bpf_map.
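
As an illustration of the two copy modes involved, here is a minimal
userspace sketch, not the kernel code.  It assumes a fixed CPU count and
uses plain memcpy() where the kernel uses bpf_long_memcpy(); an array of
pointers stands in for the __percpu pointer.  NR_CPUS_MODEL, ROUND_UP8 and
model_copy_value are hypothetical names introduced only for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model parameters, not taken from the patch. */
#define NR_CPUS_MODEL 4
#define ROUND_UP8(x) (((x) + 7) & ~(size_t)7)

/*
 * Model of the copy: either write value_size bytes into the current
 * CPU's slot, or walk a flat buffer that holds one 8-byte-rounded
 * slot per CPU and copy each slot to the matching CPU.
 * Each percpu[] slot must be at least ROUND_UP8(value_size) bytes.
 */
static void model_copy_value(unsigned char *percpu[], int this_cpu,
			     const unsigned char *value, size_t value_size,
			     int onallcpus)
{
	if (!onallcpus) {
		/* copy true value_size bytes to the current CPU only */
		memcpy(percpu[this_cpu], value, value_size);
		return;
	}

	size_t size = ROUND_UP8(value_size);
	size_t off = 0;

	for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
		memcpy(percpu[cpu], value + off, size);
		off += size;
	}
}
```

With a 4-byte value, the all-CPUs case therefore expects a flat buffer of
NR_CPUS_MODEL * 8 bytes, one rounded-up slot per CPU.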

Signed-off-by: Martin KaFai Lau 
---
 kernel/bpf/hashtab.c | 47 +--
 1 file changed, 21 insertions(+), 26 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 570eeca..a5e3915 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -420,6 +420,24 @@ static void free_htab_elem(struct bpf_htab *htab, struct 
htab_elem *l)
}
 }
 
+static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr,
+   void *value, bool onallcpus)
+{
+   if (!onallcpus) {
+   /* copy true value_size bytes */
+   memcpy(this_cpu_ptr(pptr), value, htab->map.value_size);
+   } else {
+   u32 size = round_up(htab->map.value_size, 8);
+   int off = 0, cpu;
+
+   for_each_possible_cpu(cpu) {
+   bpf_long_memcpy(per_cpu_ptr(pptr, cpu),
+   value + off, size);
+   off += size;
+   }
+   }
+}
+
 static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 void *value, u32 key_size, u32 hash,
 bool percpu, bool onallcpus,
@@ -479,18 +497,8 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab 
*htab, void *key,
}
}
 
-   if (!onallcpus) {
-   /* copy true value_size bytes */
-   memcpy(this_cpu_ptr(pptr), value, htab->map.value_size);
-   } else {
-   int off = 0, cpu;
+   pcpu_copy_value(htab, pptr, value, onallcpus);
 
-   for_each_possible_cpu(cpu) {
-   bpf_long_memcpy(per_cpu_ptr(pptr, cpu),
-   value + off, size);
-   off += size;
-   }
-   }
if (!prealloc)
htab_elem_set_ptr(l_new, key_size, pptr);
} else {
@@ -606,22 +614,9 @@ static int __htab_percpu_map_update_elem(struct bpf_map 
*map, void *key,
goto err;
 
if (l_old) {
-   void __percpu *pptr = htab_elem_get_ptr(l_old, key_size);
-   u32 size = htab->map.value_size;
-
/* per-cpu hash map can update value in-place */
-   if (!onallcpus) {
-   memcpy(this_cpu_ptr(pptr), value, size);
-   } else {
-   int off = 0, cpu;
-
-   size = round_up(size, 8);
-   for_each_possible_cpu(cpu) {
-   bpf_long_memcpy(per_cpu_ptr(pptr, cpu),
-   value + off, size);
-   off += size;
-   }
-   }
+   pcpu_copy_value(htab, htab_elem_get_ptr(l_old, key_size),
+   value, onallcpus);
} else {
l_new = alloc_htab_elem(htab, key, value, key_size,
hash, true, onallcpus, false);
-- 
2.5.1

[PATCH v2 net-next 4/6] bpf: Add BPF_MAP_TYPE_LRU_HASH

2016-11-11 Thread Martin KaFai Lau

Provide an LRU version of the existing BPF_MAP_TYPE_HASH.

Signed-off-by: Martin KaFai Lau 
---
 include/uapi/linux/bpf.h |   8 ++
 kernel/bpf/hashtab.c | 266 ---
 2 files changed, 260 insertions(+), 14 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e2f38e0..ed8c679 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -85,6 +85,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_PERCPU_ARRAY,
BPF_MAP_TYPE_STACK_TRACE,
BPF_MAP_TYPE_CGROUP_ARRAY,
+   BPF_MAP_TYPE_LRU_HASH,
 };
 
 enum bpf_prog_type {
@@ -106,6 +107,13 @@ enum bpf_prog_type {
 #define BPF_EXIST  2 /* update existing element */
 
 #define BPF_F_NO_PREALLOC  (1U << 0)
+/* Instead of having one common LRU list in the
+ * BPF_MAP_TYPE_LRU_HASH map, use a percpu LRU list
+ * which can scale and perform better.
+ * Note, the LRU nodes (including free nodes) cannot be moved
+ * across different LRU lists.
+ */
+#define BPF_F_NO_COMMON_LRU    (1U << 1)
 
 union bpf_attr {
struct { /* anonymous struct used by BPF_MAP_CREATE command */
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index a5e3915..60e9e85 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include "percpu_freelist.h"
+#include "bpf_lru_list.h"
 
 struct bucket {
struct hlist_head head;
@@ -25,7 +26,10 @@ struct bpf_htab {
struct bpf_map map;
struct bucket *buckets;
void *elems;
-   struct pcpu_freelist freelist;
+   union {
+   struct pcpu_freelist freelist;
+   struct bpf_lru lru;
+   };
void __percpu *extra_elems;
atomic_t count; /* number of elements in this hashtable */
u32 n_buckets;  /* number of hash buckets */
@@ -48,11 +52,19 @@ struct htab_elem {
union {
struct rcu_head rcu;
enum extra_elem_state state;
+   struct bpf_lru_node lru_node;
};
u32 hash;
char key[0] __aligned(8);
 };
 
+static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
+
+static bool htab_is_lru(const struct bpf_htab *htab)
+{
+   return htab->map.map_type == BPF_MAP_TYPE_LRU_HASH;
+}
+
 static inline void htab_elem_set_ptr(struct htab_elem *l, u32 key_size,
 void __percpu *pptr)
 {
@@ -87,7 +99,22 @@ static void htab_free_elems(struct bpf_htab *htab)
vfree(htab->elems);
 }
 
-static int prealloc_elems_and_freelist(struct bpf_htab *htab)
+static struct htab_elem *prealloc_lru_pop(struct bpf_htab *htab, void *key,
+ u32 hash)
+{
+   struct bpf_lru_node *node = bpf_lru_pop_free(&htab->lru, hash);
+   struct htab_elem *l;
+
+   if (node) {
+   l = container_of(node, struct htab_elem, lru_node);
+   memcpy(l->key, key, htab->map.key_size);
+   return l;
+   }
+
+   return NULL;
+}
+
+static int prealloc_init(struct bpf_htab *htab)
 {
int err = -ENOMEM, i;
 
@@ -110,12 +137,27 @@ static int prealloc_elems_and_freelist(struct bpf_htab 
*htab)
}
 
 skip_percpu_elems:
-   err = pcpu_freelist_init(&htab->freelist);
+   if (htab_is_lru(htab))
+   err = bpf_lru_init(&htab->lru,
+  htab->map.map_flags & BPF_F_NO_COMMON_LRU,
+  offsetof(struct htab_elem, hash) -
+  offsetof(struct htab_elem, lru_node),
+  htab_lru_map_delete_node,
+  htab);
+   else
+   err = pcpu_freelist_init(&htab->freelist);
+
if (err)
goto free_elems;
 
-   pcpu_freelist_populate(&htab->freelist, htab->elems, htab->elem_size,
-  htab->map.max_entries);
+   if (htab_is_lru(htab))
+   bpf_lru_populate(&htab->lru, htab->elems,
+offsetof(struct htab_elem, lru_node),
+htab->elem_size, htab->map.max_entries);
+   else
+   pcpu_freelist_populate(&htab->freelist, htab->elems,
+  htab->elem_size, htab->map.max_entries);
+
return 0;
 
 free_elems:
@@ -123,6 +165,16 @@ static int prealloc_elems_and_freelist(struct bpf_htab 
*htab)
return err;
 }
 
+static void prealloc_destroy(struct bpf_htab *htab)
+{
+   htab_free_elems(htab);
+
+   if (htab_is_lru(htab))
+   bpf_lru_destroy(&htab->lru);
+   else
+   pcpu_freelist_destroy(&htab->freelist);
+}
+
 static int alloc_extra_elems(struct bpf_htab *htab)
 {
void __percpu *pptr;
@@ -144,14 +196,34 @@ static int alloc_extra_elems(struct bpf_htab *htab)
 static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 {
bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_HASH;
+

[PATCH v2 net-next 5/6] bpf: Add BPF_MAP_TYPE_LRU_PERCPU_HASH

2016-11-11 Thread Martin KaFai Lau

Provide an LRU version of the existing BPF_MAP_TYPE_PERCPU_HASH.

Signed-off-by: Martin KaFai Lau 
---
 include/uapi/linux/bpf.h |   3 +-
 kernel/bpf/hashtab.c | 129 ---
 kernel/bpf/syscall.c |   8 ++-
 3 files changed, 131 insertions(+), 9 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ed8c679..7d9b283 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -86,6 +86,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_STACK_TRACE,
BPF_MAP_TYPE_CGROUP_ARRAY,
BPF_MAP_TYPE_LRU_HASH,
+   BPF_MAP_TYPE_LRU_PERCPU_HASH,
 };
 
 enum bpf_prog_type {
@@ -108,7 +109,7 @@ enum bpf_prog_type {
 
 #define BPF_F_NO_PREALLOC  (1U << 0)
 /* Instead of having one common LRU list in the
- * BPF_MAP_TYPE_LRU_HASH map, use a percpu LRU list
+ * BPF_MAP_TYPE_LRU_[PERCPU_]HASH map, use a percpu LRU list
  * which can scale and perform better.
  * Note, the LRU nodes (including free nodes) cannot be moved
  * across different LRU lists.
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 60e9e85..60e76fc 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -62,7 +62,14 @@ static bool htab_lru_map_delete_node(void *arg, struct 
bpf_lru_node *node);
 
 static bool htab_is_lru(const struct bpf_htab *htab)
 {
-   return htab->map.map_type == BPF_MAP_TYPE_LRU_HASH;
+   return htab->map.map_type == BPF_MAP_TYPE_LRU_HASH ||
+   htab->map.map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH;
+}
+
+static bool htab_is_percpu(const struct bpf_htab *htab)
+{
+   return htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH ||
+   htab->map.map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH;
 }
 
 static inline void htab_elem_set_ptr(struct htab_elem *l, u32 key_size,
@@ -85,7 +92,7 @@ static void htab_free_elems(struct bpf_htab *htab)
 {
int i;
 
-   if (htab->map.map_type != BPF_MAP_TYPE_PERCPU_HASH)
+   if (!htab_is_percpu(htab))
goto free_elems;
 
for (i = 0; i < htab->map.max_entries; i++) {
@@ -122,7 +129,7 @@ static int prealloc_init(struct bpf_htab *htab)
if (!htab->elems)
return -ENOMEM;
 
-   if (htab->map.map_type != BPF_MAP_TYPE_PERCPU_HASH)
+   if (!htab_is_percpu(htab))
goto skip_percpu_elems;
 
for (i = 0; i < htab->map.max_entries; i++) {
@@ -195,8 +202,10 @@ static int alloc_extra_elems(struct bpf_htab *htab)
 /* Called from syscall */
 static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 {
-   bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_HASH;
-   bool lru = attr->map_type == BPF_MAP_TYPE_LRU_HASH;
+   bool percpu = (attr->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
+  attr->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH);
+   bool lru = (attr->map_type == BPF_MAP_TYPE_LRU_HASH ||
+   attr->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH);
/* percpu_lru means each cpu has its own LRU list.
 * it is different from BPF_MAP_TYPE_PERCPU_HASH where
 * the map's value itself is percpu.  percpu_lru has
@@ -823,12 +832,84 @@ static int __htab_percpu_map_update_elem(struct bpf_map 
*map, void *key,
return ret;
 }
 
+static int __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
+void *value, u64 map_flags,
+bool onallcpus)
+{
+   struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+   struct htab_elem *l_new = NULL, *l_old;
+   struct hlist_head *head;
+   unsigned long flags;
+   struct bucket *b;
+   u32 key_size, hash;
+   int ret;
+
+   if (unlikely(map_flags > BPF_EXIST))
+   /* unknown flags */
+   return -EINVAL;
+
+   WARN_ON_ONCE(!rcu_read_lock_held());
+
+   key_size = map->key_size;
+
+   hash = htab_map_hash(key, key_size);
+
+   b = __select_bucket(htab, hash);
+   head = &b->head;
+
+   /* For LRU, we need to alloc before taking bucket's
+* spinlock because LRU's elem alloc may need
+* to remove older elem from htab and this removal
+* operation will need a bucket lock.
+*/
+   if (map_flags != BPF_EXIST) {
+   l_new = prealloc_lru_pop(htab, key, hash);
+   if (!l_new)
+   return -ENOMEM;
+   }
+
+   /* bpf_map_update_elem() can be called in_irq() */
+   raw_spin_lock_irqsave(&b->lock, flags);
+
+   l_old = lookup_elem_raw(head, hash, key, key_size);
+
+   ret = check_flags(htab, l_old, map_flags);
+   if (ret)
+   goto err;
+
+   if (l_old) {
+   bpf_lru_node_set_ref(&l_old->lru_node);
+
+   /* per-cpu hash map can update value in-place */
+   pcpu_copy_value(htab, htab_elem_get_ptr(l_old, key_size),
+   value,

[PATCH v2 net-next 1/6] bpf: LRU List

2016-11-11 Thread Martin KaFai Lau

Introduce bpf_lru_list, which will provide LRU capability to
the bpf_htab in a later patch.

* General Thoughts:
1. Target use case.  Reads are more frequent than updates
   (i.e. bpf_lookup_elem() is called more often than bpf_update_elem()).
   If a bpf_prog does a bpf_lookup_elem() first and then an in-place
   update, it still counts as a read operation as far as the LRU list
   is concerned.
2. It may be useful to think of it as an LRU cache.
3. Optimize the read case:
   3.1 No lock in the read case
   3.2 The LRU maintenance is only done during bpf_update_elem()
4. If there is a percpu LRU list, it will lose the system-wide LRU
   property.  A completely isolated percpu LRU list has the best
   performance, but the memory utilization is not ideal considering
   that the workload may be imbalanced.
5. Hence, this patch starts the LRU implementation with a global LRU
   list, with operations batched before accessing the global LRU list.
   As an LRU cache, where #read >> #update/#insert operations, it will
   work well.
6. There is a local list (for each cpu) which is named
   'struct bpf_lru_locallist'.  This local list is not used to maintain
   the LRU ordering.  Instead, the local list is used to batch enough
   operations before acquiring the lock of the global LRU list.  More
   details on this later.
7. A later patch allows a percpu LRU list, selected by specifying a
   map attribute, for scalability reasons and for use cases that need to
   prepare for the worst (and pathological) case, like a DoS attack.
   The percpu LRU lists are completely isolated from each other and the
   LRU nodes (including free nodes) cannot be moved across the lists.  The
   following description is for the global LRU list but is mostly
   applicable to the percpu LRU list as well.

* Global LRU List:
1. It has three sub-lists: active-list, inactive-list and free-list.
2. The two-list idea, active and inactive, is borrowed from the
   page cache.
3. All nodes are pre-allocated and all sit on the free-list (of the
   global LRU list) at the beginning.  The pre-allocation reasoning
   is similar to that of the existing BPF_MAP_TYPE_HASH.  However,
   opting out of prealloc (BPF_F_NO_PREALLOC) is not supported in
   the LRU map.

* Active/Inactive List (of the global LRU list):
1. The active list, as its name says, maintains the active set of
   the nodes.  We can think of it as the working set, or the more
   frequently accessed nodes.  The access frequency is approximated by
   a ref-bit.  The ref-bit is set during bpf_lookup_elem().
2. The inactive list, as its name also says, maintains a less
   active set of nodes.  They are the candidates to be removed
   from the bpf_htab when we are running out of free nodes.
3. The ordering of these two lists acts as a rough clock.
   The tail of the inactive list holds the older nodes, which
   should be released first if the bpf_htab needs a free element.

* Rotating the Active/Inactive List (of the global LRU list):
1. It is the basic operation to maintain the LRU property of
   the global list.
2. The active list is only rotated when the inactive list is running
   low.  This idea is similar to the current page cache.
   Inactive running low is currently defined as
   "# of inactive < # of active".
3. The active list rotation always starts from the tail.  It moves
   a node without the ref-bit set to the head of the inactive list.
   It moves a node with the ref-bit set back to the head of the active
   list and then clears its ref-bit.
4. The inactive rotation is pretty simple.
   It walks the inactive list and moves a node back to the head of the
   active list if its ref-bit is set.  The ref-bit is cleared after moving
   to the active list.
   If the node does not have the ref-bit set, it is left where it is
   because it is already in the inactive list.

* Shrinking the Inactive List (of the global LRU list):
1. Shrinking is the operation to get free nodes when the bpf_htab is
   full.
2. It usually only shrinks the inactive list to get free nodes.
3. During shrinking, it walks the inactive list from the tail and
   deletes the nodes without the ref-bit set from the bpf_htab.
4. If no free node is found after step (3), it forcefully gets
   one node from the tail of the inactive or active list.  Forcefully,
   in the sense that it ignores the ref-bit.
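
The active-list rotation and the shrink steps described above can be
sketched in userspace C roughly as follows.  This is only an illustrative
model under stated assumptions, not the code from this patch: there is no
locking and no bpf_htab, "deleting from the htab" is modelled as moving a
node to the free list, and every name here (struct node, struct lru,
rotate_active, shrink_inactive, ...) is hypothetical.  Step (3) of the
shrink description does not say what happens to inactive nodes that have
the ref-bit set, so the sketch simply leaves them in place.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; names are not taken from the patch. */
enum list_type { LIST_ACTIVE, LIST_INACTIVE, LIST_FREE, NR_LISTS };

struct node {
	struct node *prev, *next;
	enum list_type type;
	bool ref;
	int id;
};

struct lru {
	struct node heads[NR_LISTS];	/* circular list heads */
	unsigned int counts[NR_LISTS];
};

static void lru_init(struct lru *l)
{
	for (int i = 0; i < NR_LISTS; i++) {
		l->heads[i].prev = l->heads[i].next = &l->heads[i];
		l->counts[i] = 0;
	}
}

/* Insert n at the head of list t. */
static void node_push_head(struct lru *l, struct node *n, enum list_type t)
{
	struct node *h = &l->heads[t];

	n->type = t;
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
	l->counts[t]++;
}

/* Unlink n from its current list and put it at the head of list t. */
static void node_move(struct lru *l, struct node *n, enum list_type t)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	l->counts[n->type]--;
	node_push_head(l, n, t);
}

/* Rotate the active list from its tail, only if inactive is running low. */
static void rotate_active(struct lru *l, unsigned int nr_scans)
{
	unsigned int n = l->counts[LIST_ACTIVE];

	if (l->counts[LIST_INACTIVE] >= n)
		return;
	if (n > nr_scans)
		n = nr_scans;
	while (n--) {
		struct node *tail = l->heads[LIST_ACTIVE].prev;

		if (tail->ref) {
			tail->ref = false;	/* second chance */
			node_move(l, tail, LIST_ACTIVE);
		} else {
			node_move(l, tail, LIST_INACTIVE);
		}
	}
}

/* Free up to target unreferenced inactive nodes; force one if none found. */
static unsigned int shrink_inactive(struct lru *l, unsigned int target)
{
	unsigned int freed = 0, n = l->counts[LIST_INACTIVE];
	struct node *cur = l->heads[LIST_INACTIVE].prev;

	while (n-- && freed < target) {
		struct node *prev = cur->prev;

		if (!cur->ref) {
			node_move(l, cur, LIST_FREE);
			freed++;
		}
		cur = prev;
	}
	if (!freed) {
		/* ignore the ref-bit: take the tail of inactive, else active */
		enum list_type t = l->counts[LIST_INACTIVE] ?
				   LIST_INACTIVE : LIST_ACTIVE;

		if (l->counts[t]) {
			struct node *victim = l->heads[t].prev;

			victim->ref = false;
			node_move(l, victim, LIST_FREE);
			freed = 1;
		}
	}
	return freed;
}
```

The ref-bit gives a referenced node one extra trip around the active list
before it can drop to the inactive list, which is what makes the two lists
behave like a rough clock.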

* Local List:
1. Each CPU has a 'struct bpf_lru_locallist'.  The purpose is to
   batch enough operations before acquiring the lock of the
   global LRU.
2. A local list has two sub-lists, free-list and pending-list.
3. During bpf_update_elem(), it will try to get a node from the
   free-list of the current CPU's local list.
4. If the local free-list is empty, it will acquire nodes from the
   global LRU list.  The global LRU list can either satisfy it
   from its global free-list or by shrinking the global inactive
   list.  Since the global LRU list lock has already been acquired,
   it will try to move at most LOCAL_FREE_TARGET elements
   to the local free list.
5. When a new element is added to the bpf_htab, it will
   first sit at the

[PATCH v2 net-next 0/6] bpf: LRU map

2016-11-11 Thread Martin KaFai Lau

Hi,

This patch set adds an LRU map implementation to the existing BPF map
family.

The first few patches introduce the basic BPF LRU list
implementation.

The later patches introduce the LRU versions of the
existing BPF_MAP_TYPE_[PERCPU_]HASH maps by leveraging
the BPF LRU list.

v2:
- Added a percpu LRU list option which can be specified as
  a map attribute.

  [Note: percpu LRU list has nothing to do with the map's value]

- Removed the cpu variable from the struct bpf_lru_locallist
  since it is not needed.

- Changed the __bpf_lru_node_move_out to __bpf_lru_node_move_to_free in
  patch 1 to prepare the percpu LRU list in patch 2.

- Moved the test_lru_map under selftests

- Refactored a few things in the test codes

Thanks,
-- Martin

[PATCH v2 net-next 2/6] bpf: Add percpu LRU list

2016-11-11 Thread Martin KaFai Lau

Instead of having a common LRU list, this patch allows a
percpu LRU list, which can be selected by specifying a map
attribute.  The map attribute will be added in a later
patch.

While the common use case for LRU is #reads >> #updates,
a percpu LRU list allows a bpf prog to absorb an unusual number of
#updates in a pathological case (e.g. an external-traffic-facing machine
which could be under attack).

Each percpu LRU is isolated from the others.  The LRU nodes (including
free nodes) cannot be moved across different LRU lists.

Here is the update performance comparison between the
common LRU list and the percpu LRU list (the test code is
in the last patch):

[root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
./map_perf_test 16 $i | awk '{r += $3}END{print r " updates"}'; done
 1 cpus: 2934082 updates
 4 cpus: 7391434 updates
 8 cpus: 6500576 updates

[root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
./map_perf_test 32 $i | awk '{r += $3}END{print r " updates"}'; done
  1 cpus: 2896553 updates
  4 cpus: 9766395 updates
  8 cpus: 17460553 updates

Signed-off-by: Martin KaFai Lau 
---
 kernel/bpf/bpf_lru_list.c | 162 +-
 kernel/bpf/bpf_lru_list.h |   8 ++-
 2 files changed, 151 insertions(+), 19 deletions(-)

diff --git a/kernel/bpf/bpf_lru_list.c b/kernel/bpf/bpf_lru_list.c
index 73f6709..bfebff0 100644
--- a/kernel/bpf/bpf_lru_list.c
+++ b/kernel/bpf/bpf_lru_list.c
@@ -13,6 +13,9 @@
 #define LOCAL_FREE_TARGET  (128)
 #define LOCAL_NR_SCANS LOCAL_FREE_TARGET
 
+#define PERCPU_FREE_TARGET (16)
+#define PERCPU_NR_SCANS        PERCPU_FREE_TARGET
+
 /* Helpers to get the local list index */
 #define LOCAL_LIST_IDX(t)  ((t) - BPF_LOCAL_LIST_T_OFFSET)
 #define LOCAL_FREE_LIST_IDX    LOCAL_LIST_IDX(BPF_LRU_LOCAL_LIST_T_FREE)
@@ -396,7 +399,40 @@ struct bpf_lru_node *__local_list_pop_pending(struct 
bpf_lru *lru,
return NULL;
 }
 
-struct bpf_lru_node *bpf_lru_pop_free(struct bpf_lru *lru, u32 hash)
+static struct bpf_lru_node *bpf_percpu_lru_pop_free(struct bpf_lru *lru,
+   u32 hash)
+{
+   struct list_head *free_list;
+   struct bpf_lru_node *node = NULL;
+   struct bpf_lru_list *l;
+   unsigned long flags;
+   int cpu = raw_smp_processor_id();
+
+   l = per_cpu_ptr(lru->percpu_lru, cpu);
+
+   raw_spin_lock_irqsave(&l->lock, flags);
+
+   __bpf_lru_list_rotate(lru, l);
+
+   free_list = &l->lists[BPF_LRU_LIST_T_FREE];
+   if (list_empty(free_list))
+   __bpf_lru_list_shrink(lru, l, PERCPU_FREE_TARGET, free_list,
+ BPF_LRU_LIST_T_FREE);
+
+   if (!list_empty(free_list)) {
+   node = list_first_entry(free_list, struct bpf_lru_node, list);
+   *(u32 *)((void *)node + lru->hash_offset) = hash;
+   node->ref = 0;
+   __bpf_lru_node_move(l, node, BPF_LRU_LIST_T_INACTIVE);
+   }
+
+   raw_spin_unlock_irqrestore(&l->lock, flags);
+
+   return node;
+}
+
+static struct bpf_lru_node *bpf_common_lru_pop_free(struct bpf_lru *lru,
+   u32 hash)
 {
struct bpf_lru_locallist *loc_l, *steal_loc_l;
struct bpf_common_lru *clru = &lru->common_lru;
@@ -458,7 +494,16 @@ struct bpf_lru_node *bpf_lru_pop_free(struct bpf_lru *lru, 
u32 hash)
return node;
 }
 
-void bpf_lru_push_free(struct bpf_lru *lru, struct bpf_lru_node *node)
+struct bpf_lru_node *bpf_lru_pop_free(struct bpf_lru *lru, u32 hash)
+{
+   if (lru->percpu)
+   return bpf_percpu_lru_pop_free(lru, hash);
+   else
+   return bpf_common_lru_pop_free(lru, hash);
+}
+
+static void bpf_common_lru_push_free(struct bpf_lru *lru,
+struct bpf_lru_node *node)
 {
unsigned long flags;
 
@@ -490,8 +535,31 @@ void bpf_lru_push_free(struct bpf_lru *lru, struct 
bpf_lru_node *node)
bpf_lru_list_push_free(&lru->common_lru.lru_list, node);
 }
 
-void bpf_lru_populate(struct bpf_lru *lru, void *buf, u32 node_offset,
- u32 elem_size, u32 nr_elems)
+static void bpf_percpu_lru_push_free(struct bpf_lru *lru,
+struct bpf_lru_node *node)
+{
+   struct bpf_lru_list *l;
+   unsigned long flags;
+
+   l = per_cpu_ptr(lru->percpu_lru, node->cpu);
+
+   raw_spin_lock_irqsave(&l->lock, flags);
+
+   __bpf_lru_node_move(l, node, BPF_LRU_LIST_T_FREE);
+
+   raw_spin_unlock_irqrestore(&l->lock, flags);
+}
+
+void bpf_lru_push_free(struct bpf_lru *lru, struct bpf_lru_node *node)
+{
+   if (lru->percpu)
+   bpf_percpu_lru_push_free(lru, node);
+   else
+   bpf_common_lru_push_free(lru, node);
+}
+
+void bpf_common_lru_populate(struct bpf_lru *lru, void *buf, u32 node_offset,
+u32 elem_size, u32

[Patch net-next] net: fix sleeping for sk_wait_event()

2016-11-11 Thread Cong Wang

Similar to commit 14135f30e33c ("inet: fix sleeping inside
inet_wait_for_connect()"), sk_wait_event() needs to be fixed too:
release_sock() can block, and it changes the process state back to
running after the sleep, which breaks the previous prepare_to_wait().

Switch to the new wait API.

Cc: Eric Dumazet 
Cc: Peter Zijlstra 
Signed-off-by: Cong Wang 
---
 crypto/algif_aead.c |  9 -
 crypto/algif_skcipher.c | 18 +-
 include/net/sock.h  |  8 +---
 net/core/sock.c |  8 
 net/core/stream.c   | 28 ++--
 net/decnet/af_decnet.c  | 16 
 net/llc/af_llc.c| 24 
 net/phonet/pep.c|  9 -
 net/tipc/socket.c   | 24 
 net/vmw_vsock/virtio_transport_common.c | 10 +-
 10 files changed, 77 insertions(+), 77 deletions(-)

diff --git a/crypto/algif_aead.c b/crypto/algif_aead.c
index 80a0f1a..8948392 100644
--- a/crypto/algif_aead.c
+++ b/crypto/algif_aead.c
@@ -132,28 +132,27 @@ static void aead_wmem_wakeup(struct sock *sk)
 
 static int aead_wait_for_data(struct sock *sk, unsigned flags)
 {
+   DEFINE_WAIT_FUNC(wait, woken_wake_function);
struct alg_sock *ask = alg_sk(sk);
struct aead_ctx *ctx = ask->private;
long timeout;
-   DEFINE_WAIT(wait);
int err = -ERESTARTSYS;
 
if (flags & MSG_DONTWAIT)
return -EAGAIN;
 
sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
-
+   add_wait_queue(sk_sleep(sk), &wait);
for (;;) {
if (signal_pending(current))
break;
-   prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
timeout = MAX_SCHEDULE_TIMEOUT;
-   if (sk_wait_event(sk, &timeout, !ctx->more)) {
+   if (sk_wait_event(sk, &timeout, !ctx->more, &wait)) {
err = 0;
break;
}
}
-   finish_wait(sk_sleep(sk), &wait);
+   remove_wait_queue(sk_sleep(sk), &wait);
 
sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk);
 
diff --git a/crypto/algif_skcipher.c b/crypto/algif_skcipher.c
index 28556fc..1e38aaa 100644
--- a/crypto/algif_skcipher.c
+++ b/crypto/algif_skcipher.c
@@ -199,26 +199,26 @@ static void skcipher_free_sgl(struct sock *sk)
 
 static int skcipher_wait_for_wmem(struct sock *sk, unsigned flags)
 {
-   long timeout;
-   DEFINE_WAIT(wait);
+   DEFINE_WAIT_FUNC(wait, woken_wake_function);
int err = -ERESTARTSYS;
+   long timeout;
 
if (flags & MSG_DONTWAIT)
return -EAGAIN;
 
sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk);
 
+   add_wait_queue(sk_sleep(sk), &wait);
for (;;) {
if (signal_pending(current))
break;
-   prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
timeout = MAX_SCHEDULE_TIMEOUT;
-   if (sk_wait_event(sk, &timeout, skcipher_writable(sk))) {
+   if (sk_wait_event(sk, &timeout, skcipher_writable(sk), &wait)) {
err = 0;
break;
}
}
-   finish_wait(sk_sleep(sk), &wait);
+   remove_wait_queue(sk_sleep(sk), &wait);
 
return err;
 }
@@ -242,10 +242,10 @@ static void skcipher_wmem_wakeup(struct sock *sk)
 
 static int skcipher_wait_for_data(struct sock *sk, unsigned flags)
 {
+   DEFINE_WAIT_FUNC(wait, woken_wake_function);
struct alg_sock *ask = alg_sk(sk);
struct skcipher_ctx *ctx = ask->private;
long timeout;
-   DEFINE_WAIT(wait);
int err = -ERESTARTSYS;
 
if (flags & MSG_DONTWAIT) {
@@ -254,17 +254,17 @@ static int skcipher_wait_for_data(struct sock *sk, 
unsigned flags)
 
sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
 
+   add_wait_queue(sk_sleep(sk), &wait);
for (;;) {
if (signal_pending(current))
break;
-   prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
timeout = MAX_SCHEDULE_TIMEOUT;
-   if (sk_wait_event(sk, &timeout, ctx->used)) {
+   if (sk_wait_event(sk, &timeout, ctx->used, &wait)) {
err = 0;
break;
}
}
-   finish_wait(sk_sleep(sk), &wait);
+   remove_wait_queue(sk_sleep(sk), &wait);
 
sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk);
 
diff --git a/include/net/sock.h b/include/net/sock.h
index cf617ee..9d905ed 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -915,14 +915,16 @@ static inline void sock_rps_reset_rxhash(struct sock *sk)
 #endif
 }
 
-#define sk_wait_event(__sk, __timeo, __condition)  \
+#define sk_wait_event(__sk, __timeo, __condition, __wait)  \
({  int

Re: [PATCH net-next] ibmveth: v1 calculate correct gso_size and set gso_type

2016-11-11 Thread Brian King

On 10/27/2016 10:26 AM, Eric Dumazet wrote:
> On Wed, 2016-10-26 at 11:09 +1100, Jon Maxwell wrote:
>> We recently encountered a bug where a few customers using ibmveth on the 
>> same LPAR hit an issue where a TCP session hung when large receive was
>> enabled. Closer analysis revealed that the session was stuck because the 
>> one side was advertising a zero window repeatedly.
>>
>> We narrowed this down to the fact the ibmveth driver did not set gso_size 
>> which is translated by TCP into the MSS later up the stack. The MSS is 
>> used to calculate the TCP window size and as that was abnormally large, 
>> it was calculating a zero window, even although the sockets receive buffer 
>> was completely empty. 
>>
>> We were able to reproduce this and worked with IBM to fix this. Thanks Tom 
>> and Marcelo for all your help and review on this.
>>
>> The patch fixes both our internal reproduction tests and our customers tests.
>>
>> Signed-off-by: Jon Maxwell 
>> ---
>>  drivers/net/ethernet/ibm/ibmveth.c | 20 
>>  1 file changed, 20 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/ibm/ibmveth.c 
>> b/drivers/net/ethernet/ibm/ibmveth.c
>> index 29c05d0..c51717e 100644
>> --- a/drivers/net/ethernet/ibm/ibmveth.c
>> +++ b/drivers/net/ethernet/ibm/ibmveth.c
>> @@ -1182,6 +1182,8 @@ static int ibmveth_poll(struct napi_struct *napi, int 
>> budget)
>>  int frames_processed = 0;
>>  unsigned long lpar_rc;
>>  struct iphdr *iph;
>> +bool large_packet = 0;
>> +u16 hdr_len = ETH_HLEN + sizeof(struct tcphdr);
>>  
>>  restart_poll:
>>  while (frames_processed < budget) {
>> @@ -1236,10 +1238,28 @@ static int ibmveth_poll(struct napi_struct *napi, 
>> int budget)
>>  iph->check = 0;
>>  iph->check = 
>> ip_fast_csum((unsigned char *)iph, iph->ihl);
>>  adapter->rx_large_packets++;
>> +large_packet = 1;
>>  }
>>  }
>>  }
>>  
>> +if (skb->len > netdev->mtu) {
>> +iph = (struct iphdr *)skb->data;
>> +if (be16_to_cpu(skb->protocol) == ETH_P_IP &&
>> +iph->protocol == IPPROTO_TCP) {
>> +hdr_len += sizeof(struct iphdr);
>> +skb_shinfo(skb)->gso_type = 
>> SKB_GSO_TCPV4;
>> +skb_shinfo(skb)->gso_size = netdev->mtu 
>> - hdr_len;
>> +} else if (be16_to_cpu(skb->protocol) == 
>> ETH_P_IPV6 &&
>> +   iph->protocol == IPPROTO_TCP) {
>> +hdr_len += sizeof(struct ipv6hdr);
>> +skb_shinfo(skb)->gso_type = 
>> SKB_GSO_TCPV6;
>> +skb_shinfo(skb)->gso_size = netdev->mtu 
>> - hdr_len;
>> +}
>> +if (!large_packet)
>> +adapter->rx_large_packets++;
>> +}
>> +
>>  
> 
> This might break forwarding and PMTU discovery.
> 
> You force gso_size to device mtu, regardless of real MSS used by the TCP
> sender.
> 
> Don't you have the MSS provided in RX descriptor, instead of guessing
> the value ?

Eric,

We are currently pursuing making changes to the Power Virtual I/O Server to
provide the MSS to the ibmveth driver. However, this will take time to go
through test and ultimately get released. Although imperfect, this patch does
help a real customer hitting this issue right now. Would you object to this
patch getting merged as is, with the understanding that when we get the change
in the Virtual I/O Server released, we will revert this interim change and
apply the new method?

Thanks,

Brian


-- 
Brian King
Power Linux I/O
IBM Linux Technology Center

Re: [patch net-next 5/8] Introduce sample tc action

2016-11-11 Thread David Miller

From: John Fastabend 
Date: Fri, 11 Nov 2016 06:52:31 -0800

> On 16-11-11 04:43 AM, Simon Horman wrote:
>> On Fri, Nov 11, 2016 at 08:28:50AM +, Yotam Gigi wrote:
>> 
>> ...
>> 
>>> John, as a result of your question I realized that our hardware does do
>>> randomized sampling that I was not aware of. I will use the extensibility of
>>> the API and implement a random keyword, that will be offloaded in our
>>> hardware. Those changes will be sent on v2.
>>>
>>> Eventually, your question was very relevant :) Thanks!
>> 
>> Perhaps I am missing the point but why not just make random the default and
>> implement the inverse as an extension if it turns out to be needed in
>> future?
>> 
> 
> +1 just implement the random one.

Agreed.

Re: [PATCH 0/2] bnx2: Hard reset bnx2 chip at probe stage

2016-11-11 Thread Michael Chan

On Fri, Nov 11, 2016 at 6:02 AM, Baoquan He  wrote:
> On 11/11/16 at 09:46pm, Baoquan He wrote:
>> Hi bnx2 experts,
>>
>> In commit 3e1be7a ("bnx2: Reset device during driver initialization"),
>> firmware requesting code was moved from open stage to probe stage.
>> The reason is that in the kdump kernel the hardware IOMMU needs the device
>> to be reset in the driver probe stage; otherwise in-flight DMA from the 1st
>> kernel will continue and look up the newly created io-page tables.
>> So we need to reset the device to stop in-flight DMA as early as possible.
>>
>> But with commit 3e1be7a merged, people reported their bnx2 driver init
>> failed because of failed firmware loading. After discussion, it's found
>> that they built bnx2 driver into kernel, and that makes probe function
>> bnx2_init_one be called in do_initcalls(). But at this time the initramfs
>> has not been uncompressed yet and mounted, kernel can't detect firmware.
>>
>> So there's only one way to cover both. Try to hard reset the bnx2 device
>> at probe stage, without involving firmware issues. I tried to add function
>> bnx2_hard_reset_chip() to do this and it's only called in kdump kernel.
>> The thing is I am not quite familiar with bnx2 chip spec, just abstract
>> code from bnx2_reset_chip, the testing result is good.
>
> Here I changed to send BNX2_MISC_COMMAND_HD_RESET in BNX2_CHIP_5709
> case.
>

From my old 5709 documentation:

Bit 6 HD_RESET:  Writing this bit as 1 will cause the chip to do a
hard reset like bit 5 except the sticky bits in the PCI function are
not reset.

Bit 5 POR_RESET: Writing this bit as 1 will cause the chip to do an
internal reset exactly like a power-up reset.  There is no protection
for this request and it may cause any current PCI cycle to lock up.
This reset is intended for use under manufacturing conditions only.

So it sounds like doing HD_RESET can potentially cause a PCI bus lock up.

Why not just disable DMA gracefully as done at the beginning of
bnx2_reset_chip()?

[net-next PATCH] net: dummy: Introduce dummy virtual functions

2016-11-11 Thread Phil Sutter

The idea for this was born when testing VF support in iproute2, which was
impeded by hardware requirements. In fact, not every VF-capable hardware
driver implements all netdev ops, so testing the interface is still hard
to do even with a well-stocked hardware shelf.

To overcome this and allow for testing the user-kernel interface, this
patch allows turning dummy into a PF with a configurable number of VFs.

Due to the assumption that all PFs are PCI devices, this implementation
is not completely straightforward: in order to allow
rtnl_fill_ifinfo() to see the dummy VFs, a fake PCI parent device is
attached to the dummy netdev. This has to happen at the right spot so
register_netdevice() does not get confused. This patch abuses the
ndo_fix_features callback for that. In the ndo_uninit callback, the fake
parent is removed again for the same purpose.

Joint work with Sabrina Dubroca.

Signed-off-by: Sabrina Dubroca 
Signed-off-by: Phil Sutter 
---
 drivers/net/dummy.c | 195 +++-
 1 file changed, 193 insertions(+), 2 deletions(-)

diff --git a/drivers/net/dummy.c b/drivers/net/dummy.c
index 69fc8409a9733..39d0d5354414a 100644
--- a/drivers/net/dummy.c
+++ b/drivers/net/dummy.c
@@ -34,6 +34,8 @@
 #include 
 #include 
 #include 
+#include 
+#include "../pci/pci.h"/* for struct pci_sriov */
 #include 
 #include 
 #include 
@@ -42,6 +44,33 @@
 #define DRV_VERSION"1.0"
 
 static int numdummies = 1;
+static int num_vfs;
+
+static struct pci_sriov pdev_sriov;
+
+static struct pci_dev pci_pdev = {
+   .is_physfn = 1,
+   .sriov = &pdev_sriov,
+   .dev.bus = &pci_bus_type,
+};
+
+struct vf_data_storage {
+   unsigned char vf_mac[ETH_ALEN];
+   u16 pf_vlan; /* When set, guest VLAN config not allowed. */
+   u16 pf_qos;
+   __be16 vlan_proto;
+   u16 min_tx_rate;
+   u16 max_tx_rate;
+   u8 spoofchk_enabled;
+   bool rss_query_enabled;
+   u8 trusted;
+   int link_state;
+};
+
+struct dummy_priv {
+   int num_vfs;
+   struct vf_data_storage *vfinfo;
+};
 
 /* fake multicast ability */
 static void set_multicast_list(struct net_device *dev)
@@ -91,15 +120,29 @@ static netdev_tx_t dummy_xmit(struct sk_buff *skb, struct net_device *dev)
 
 static int dummy_dev_init(struct net_device *dev)
 {
+   struct dummy_priv *priv = netdev_priv(dev);
+
dev->dstats = netdev_alloc_pcpu_stats(struct pcpu_dstats);
if (!dev->dstats)
return -ENOMEM;
 
+   priv->num_vfs = num_vfs;
+   priv->vfinfo = NULL;
+
+   if (!num_vfs)
+   return 0;
+
+   priv->vfinfo = kcalloc(num_vfs, sizeof(struct vf_data_storage),
+  GFP_KERNEL);
+   if (!priv->vfinfo)
+   return -ENOMEM;
+
return 0;
 }
 
 static void dummy_dev_uninit(struct net_device *dev)
 {
+   dev->dev.parent = NULL;
free_percpu(dev->dstats);
 }
 
@@ -112,6 +155,129 @@ static int dummy_change_carrier(struct net_device *dev, bool new_carrier)
return 0;
 }
 
+/* fake, just to set fake PCI parent after netdev_register_kobject() */
+static netdev_features_t dummy_fix_features(struct net_device *dev,
+   netdev_features_t features)
+{
+   struct dummy_priv *priv = netdev_priv(dev);
+
+   if (priv->num_vfs)
+   dev->dev.parent = &pci_pdev.dev;
+
+   return features;
+}
+
+static int dummy_set_vf_mac(struct net_device *dev, int vf, u8 *mac)
+{
+   struct dummy_priv *priv = netdev_priv(dev);
+
+   if (!is_valid_ether_addr(mac) || (vf >= priv->num_vfs))
+   return -EINVAL;
+
+   memcpy(priv->vfinfo[vf].vf_mac, mac, ETH_ALEN);
+
+   return 0;
+}
+
+static int dummy_set_vf_vlan(struct net_device *dev, int vf,
+u16 vlan, u8 qos, __be16 vlan_proto)
+{
+   struct dummy_priv *priv = netdev_priv(dev);
+
+   if ((vf >= priv->num_vfs) || (vlan > 4095) || (qos > 7))
+   return -EINVAL;
+
+   priv->vfinfo[vf].pf_vlan = vlan;
+   priv->vfinfo[vf].pf_qos = qos;
+   priv->vfinfo[vf].vlan_proto = vlan_proto;
+
+   return 0;
+}
+
+static int dummy_set_vf_rate(struct net_device *dev, int vf, int min, int max)
+{
+   struct dummy_priv *priv = netdev_priv(dev);
+
+   if (vf >= priv->num_vfs)
+   return -EINVAL;
+
+   priv->vfinfo[vf].min_tx_rate = min;
+   priv->vfinfo[vf].max_tx_rate = max;
+
+   return 0;
+}
+
+static int dummy_set_vf_spoofchk(struct net_device *dev, int vf, bool val)
+{
+   struct dummy_priv *priv = netdev_priv(dev);
+
+   if (vf >= priv->num_vfs)
+   return -EINVAL;
+
+   priv->vfinfo[vf].spoofchk_enabled = val;
+
+   return 0;
+}
+
+static int dummy_set_vf_rss_query_en(struct net_device *dev, int vf, bool val)
+{
+   struct dummy_priv *priv = netdev_priv(dev);
+
+   if (vf >= priv->num_vfs)
+

wl1251 & mac address & calibration data

2016-11-11 Thread Pali Rohár

Hi! I would like to reopen the discussion about the MAC address and
calibration data for the wl1251 wireless chip...

Problem: The MAC address and calibration data for the wl1251 chip on the
Nokia N900 are stored on the second NAND partition (mtd1) in a special
proprietary format which is used only on the Nokia N900 (and probably on
the N8x0 and N9 too). The wireless driver wl1251.ko cannot work without
the MAC address and calibration data.

Without a MAC address the driver generates a random one at every kernel
boot, which causes a couple of problems (unstable identifier of the
wireless device due to udev permanent storage rules; unpredictable
behaviour for DHCP MAC address assignment, MAC address filtering, ...).

Currently there is no way to set the (permanent) MAC address of a network
interface from userspace. And it does not make sense to implement a large
parser in the Linux kernel for the proprietary format of the second NAND
partition, where the MAC address is stored, for just one device: the
Nokia N900.

The driver wl1251.ko loads calibration data via request_firmware() from
the file wl1251-nvs.bin. There is an "example" calibration file in the
linux-firmware repository, but it is not suitable for normal usage as
real calibration data are specific to each device.

So the questions are:

1) How to set the MAC address from userspace for that wl1251 interface?
In userspace I can write a parser for that proprietary format of the NAND
partition and extract the MAC address from it.

2) How to send calibration data to the wl1251 driver? Those are again
stored in a proprietary format and I can write a userspace parser for it.

-- 
Pali Rohár
pali.ro...@gmail.com



RE: [patch net-next 5/8] Introduce sample tc action

2016-11-11 Thread Yotam Gigi

>-Original Message-
>From: Simon Horman [mailto:simon.hor...@netronome.com]
>Sent: Friday, November 11, 2016 2:44 PM
>To: Yotam Gigi 
>Cc: John Fastabend ; Jiri Pirko ;
>netdev@vger.kernel.org; da...@davemloft.net; Ido Schimmel
>; Elad Raz ; Nogah Frankel
>; Or Gerlitz ;
>j...@mojatatu.com; geert+rene...@glider.be; step...@networkplumber.org;
>xiyou.wangc...@gmail.com; li...@roeck-us.net; ro...@cumulusnetworks.com
>Subject: Re: [patch net-next 5/8] Introduce sample tc action
>
>On Fri, Nov 11, 2016 at 08:28:50AM +, Yotam Gigi wrote:
>
>...
>
>> John, as a result of your question I realized that our hardware does do
>> randomized sampling that I was not aware of. I will use the extensibility of
>> the API and implement a random keyword, that will be offloaded in our
>> hardware. Those changes will be sent on v2.
>>
>> Eventually, your question was very relevant :) Thanks!
>
>Perhaps I am missing the point but why not just make random the default and
>implement the inverse as an extension if it turns out to be needed in
>future?

That makes sense. It does seem to me that the average user would prefer
random sampling over deterministic sampling.

We will consider that. Thanks for the comment!

Re: [PATCH v2 5/6] qedi: Add support for iSCSI session management.

2016-11-11 Thread Hannes Reinecke


On 11/08/2016 07:57 AM, Manish Rangankar wrote:

This patch adds support for iscsi_transport LLD Login,
Logout, NOP-IN/NOP-OUT, Async, Reject PDU processing
and Firmware async event handling support.

Signed-off-by: Nilesh Javali 
Signed-off-by: Adheer Chandravanshi 
Signed-off-by: Chad Dupuis 
Signed-off-by: Saurav Kashyap 
Signed-off-by: Arun Easi 
Signed-off-by: Manish Rangankar 
---
 drivers/scsi/qedi/qedi_fw.c| 1106 +++
 drivers/scsi/qedi/qedi_gbl.h   |   67 ++
 drivers/scsi/qedi/qedi_iscsi.c | 1611 
 drivers/scsi/qedi/qedi_iscsi.h |  232 ++
 drivers/scsi/qedi/qedi_main.c  |  166 +
 5 files changed, 3182 insertions(+)
 create mode 100644 drivers/scsi/qedi/qedi_fw.c
 create mode 100644 drivers/scsi/qedi/qedi_gbl.h
 create mode 100644 drivers/scsi/qedi/qedi_iscsi.c
 create mode 100644 drivers/scsi/qedi/qedi_iscsi.h


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

[PATCH net 1/2] ibmvnic: Unmap ibmvnic_statistics structure

2016-11-11 Thread Thomas Falcon

This structure was mapped but never subsequently unmapped.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index f6c9b6d..921c40f 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -3844,6 +3844,9 @@ static int ibmvnic_remove(struct vio_dev *dev)
if (adapter->debugfs_dir && !IS_ERR(adapter->debugfs_dir))
debugfs_remove_recursive(adapter->debugfs_dir);
 
+   dma_unmap_single(&dev->dev, adapter->stats_token,
+sizeof(struct ibmvnic_statistics), DMA_FROM_DEVICE);
+
if (adapter->ras_comps)
dma_free_coherent(&dev->dev,
  adapter->ras_comp_num *
-- 
1.8.3.1

[PATCH net 2/2] ibmvnic: Fix size of debugfs name buffer

2016-11-11 Thread Thomas Falcon

This mistake was causing debugfs directory creation
failures when multiple ibmvnic devices were probed.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 921c40f..4f3281a 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -3705,7 +3705,7 @@ static int ibmvnic_probe(struct vio_dev *dev, const struct vio_device_id *id)
struct net_device *netdev;
unsigned char *mac_addr_p;
struct dentry *ent;
-   char buf[16]; /* debugfs name buf */
+   char buf[17]; /* debugfs name buf */
int rc;
 
dev_dbg(&dev->dev, "entering ibmvnic_probe for UA 0x%x\n",
-- 
1.8.3.1

Re: [PATCH v2 6/6] qedi: Add support for data path.

2016-11-11 Thread Hannes Reinecke


On 11/08/2016 07:57 AM, Manish Rangankar wrote:

This patch adds support for data path and TMF handling.

Signed-off-by: Nilesh Javali 
Signed-off-by: Adheer Chandravanshi 
Signed-off-by: Chad Dupuis 
Signed-off-by: Saurav Kashyap 
Signed-off-by: Arun Easi 
Signed-off-by: Manish Rangankar 
---
 drivers/scsi/qedi/qedi_fw.c| 1272 
 drivers/scsi/qedi/qedi_gbl.h   |6 +
 drivers/scsi/qedi/qedi_iscsi.c |   13 +
 drivers/scsi/qedi/qedi_main.c  |4 +
 4 files changed, 1295 insertions(+)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

Re: [PATCH v2 4/6] qedi: Add LL2 iSCSI interface for offload iSCSI.

2016-11-11 Thread Hannes Reinecke


On 11/08/2016 07:57 AM, Manish Rangankar wrote:

This patch adds support for iscsiuio interface using Light L2 (LL2) qed
interface.

Signed-off-by: Nilesh Javali 
Signed-off-by: Adheer Chandravanshi 
Signed-off-by: Chad Dupuis 
Signed-off-by: Saurav Kashyap 
Signed-off-by: Arun Easi 
Signed-off-by: Manish Rangankar 
---
 drivers/scsi/qedi/qedi.h  |  73 +
 drivers/scsi/qedi/qedi_main.c | 357 ++
 2 files changed, 430 insertions(+)


Oh well; and I thought we could do away with the iscsiuio thingie ...

Sigh.

But nevertheless,

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

Re: [PATCH v2 3/6] qedi: Add QLogic FastLinQ offload iSCSI driver framework.

2016-11-11 Thread Hannes Reinecke


On 11/08/2016 07:57 AM, Manish Rangankar wrote:

The QLogic FastLinQ Driver for iSCSI (qedi) is the iSCSI specific module
for 41000 Series Converged Network Adapters by QLogic.

This patch consists of following changes:
  - MAINTAINERS Makefile and Kconfig changes for qedi,
  - PCI driver registration,
  - iSCSI host level initialization,
  - Debugfs and log level infrastructure.

Signed-off-by: Nilesh Javali 
Signed-off-by: Adheer Chandravanshi 
Signed-off-by: Chad Dupuis 
Signed-off-by: Saurav Kashyap 
Signed-off-by: Arun Easi 
Signed-off-by: Manish Rangankar 
---
 MAINTAINERS |6 +
 drivers/net/ethernet/qlogic/Kconfig |   12 -
 drivers/scsi/Kconfig|1 +
 drivers/scsi/Makefile   |1 +
 drivers/scsi/qedi/Kconfig   |   10 +
 drivers/scsi/qedi/Makefile  |5 +
 drivers/scsi/qedi/qedi.h|  291 +++
 drivers/scsi/qedi/qedi_dbg.c|  143 
 drivers/scsi/qedi/qedi_dbg.h|  144 
 drivers/scsi/qedi/qedi_debugfs.c|  244 ++
 drivers/scsi/qedi/qedi_hsi.h|   52 ++
 drivers/scsi/qedi/qedi_main.c   | 1616 +++
 drivers/scsi/qedi/qedi_sysfs.c  |   52 ++
 drivers/scsi/qedi/qedi_version.h|   14 +
 14 files changed, 2579 insertions(+), 12 deletions(-)
 create mode 100644 drivers/scsi/qedi/Kconfig
 create mode 100644 drivers/scsi/qedi/Makefile
 create mode 100644 drivers/scsi/qedi/qedi.h
 create mode 100644 drivers/scsi/qedi/qedi_dbg.c
 create mode 100644 drivers/scsi/qedi/qedi_dbg.h
 create mode 100644 drivers/scsi/qedi/qedi_debugfs.c
 create mode 100644 drivers/scsi/qedi/qedi_hsi.h
 create mode 100644 drivers/scsi/qedi/qedi_main.c
 create mode 100644 drivers/scsi/qedi/qedi_sysfs.c
 create mode 100644 drivers/scsi/qedi/qedi_version.h

diff --git a/MAINTAINERS b/MAINTAINERS
index e5c17a9..04eec14 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9934,6 +9934,12 @@ F:   drivers/net/ethernet/qlogic/qed/
 F: include/linux/qed/
 F: drivers/net/ethernet/qlogic/qede/

+QLOGIC QL41xxx ISCSI DRIVER
+M: qlogic-storage-upstr...@cavium.com
+L: linux-s...@vger.kernel.org
+S: Supported
+F: drivers/scsi/qedi/
+
 QNX4 FILESYSTEM
 M: Anders Larsen 
 W: http://www.alarsen.net/linux/qnx4fs/
diff --git a/drivers/net/ethernet/qlogic/Kconfig b/drivers/net/ethernet/qlogic/Kconfig
index 2832570..3cfd105 100644
--- a/drivers/net/ethernet/qlogic/Kconfig
+++ b/drivers/net/ethernet/qlogic/Kconfig
@@ -113,16 +113,4 @@ config QED_RDMA
 config QED_ISCSI
bool

-config QEDI
-   tristate "QLogic QED 25/40/100Gb iSCSI driver"
-   depends on QED
-   select QED_LL2
-   select QED_ISCSI
-   default n
-   ---help---
- This provides a temporary node that allows the compilation
- and logical testing of the hardware offload iSCSI support
- for QLogic QED. This would be replaced by the 'real' option
- once the QEDI driver is added [+relocated].
-
 endif # NET_VENDOR_QLOGIC
diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig
index 3e2bdb9..5cf03db 100644
--- a/drivers/scsi/Kconfig
+++ b/drivers/scsi/Kconfig
@@ -1254,6 +1254,7 @@ config SCSI_QLOGICPTI

 source "drivers/scsi/qla2xxx/Kconfig"
 source "drivers/scsi/qla4xxx/Kconfig"
+source "drivers/scsi/qedi/Kconfig"

 config SCSI_LPFC
tristate "Emulex LightPulse Fibre Channel Support"
diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
index 38d938d..da9e312 100644
--- a/drivers/scsi/Makefile
+++ b/drivers/scsi/Makefile
@@ -132,6 +132,7 @@ obj-$(CONFIG_PS3_ROM)   += ps3rom.o
 obj-$(CONFIG_SCSI_CXGB3_ISCSI) += libiscsi.o libiscsi_tcp.o cxgbi/
 obj-$(CONFIG_SCSI_CXGB4_ISCSI) += libiscsi.o libiscsi_tcp.o cxgbi/
 obj-$(CONFIG_SCSI_BNX2_ISCSI)  += libiscsi.o bnx2i/
+obj-$(CONFIG_QEDI)  += libiscsi.o qedi/
 obj-$(CONFIG_BE2ISCSI) += libiscsi.o be2iscsi/
 obj-$(CONFIG_SCSI_ESAS2R)  += esas2r/
 obj-$(CONFIG_SCSI_PMCRAID) += pmcraid.o
diff --git a/drivers/scsi/qedi/Kconfig b/drivers/scsi/qedi/Kconfig
new file mode 100644
index 000..23ca8a2
--- /dev/null
+++ b/drivers/scsi/qedi/Kconfig
@@ -0,0 +1,10 @@
+config QEDI
+   tristate "QLogic QEDI 25/40/100Gb iSCSI Initiator Driver Support"
+   depends on PCI && SCSI
+   depends on QED
+   select SCSI_ISCSI_ATTRS
+   select QED_LL2
+   select QED_ISCSI
+   ---help---
+   This driver supports iSCSI offload for the QLogic FastLinQ
+   41000 Series Converged Network Adapters.
diff --git a/drivers/scsi/qedi/Makefile b/drivers/scsi/qedi/Makefile
new file mode 100644
index 000..2b3e16b
--- /dev/null
+++ b/drivers/scsi/qedi/Makefile
@@ -0,0 +1,5 @@
+obj-$(CONFIG_QEDI) := qedi.o
+qedi-y := qedi_main.o qedi_iscsi.o qedi_fw.o qedi_sysfs.o \
+

Re: [PATCH] bpf: fix range arithmetic for bpf map access

2016-11-11 Thread Jann Horn

On Fri, Nov 11, 2016 at 1:18 AM, Josef Bacik  wrote:
> ---
> Sorry Jann, I saw your response last night and then promptly forgot about it,
> here's the git-send-email version.
> ---

A note: This doesn't seem to apply cleanly to current net-next (or I'm too
stupid to use "git am"), so I'm applying it on
f41cd11d64b2b21012eb4abffbe579bc0b90467f, which is net-next from a few days
ago.

> I made some invalid assumptions with BPF_AND and BPF_MOD that could result in
> invalid accesses to bpf map entries.  Fix this up by doing a few things
>
> 1) Kill BPF_MOD support.  This doesn't actually get used by the compiler in
> real life and just adds extra complexity.

Yay! As a security person, I am very much in favor of killing unused features.

> 2) Fix the logic for BPF_AND.  If the min value is negative then that is the
> new minimum, otherwise it is unconditionally 0.
>
> 3) Don't do operations on the ranges if they are set to the limits, as they
> are by definition undefined, and allowing arithmetic operations on those
> values could make them appear valid when they really aren't.
>
> This fixes the testcase provided by Jann as well as a few other theoretical
> problems.
>
> Reported-by: Jann Horn 
> Signed-off-by: Josef Bacik 

A nit: check_mem_access() still has an explicit cast of reg->min_value to s64, I
think that's not necessary anymore?

> case BPF_AND:
> -   /* & is special since it could end up with 0 bits set. */
> -   dst_reg->min_value &= min_val;
> +   /* & is special since it's could be any value within our range,
> +* including 0.  But if the thing we're AND'ing against is
> +* negative and we're negative then that's the minimum value,
> +* otherwise the minimum will always be 0.
> +*/
> +   if (min_val < 0 && dst_reg->min_value < 0)
> +   dst_reg->min_value = min_t(s64, dst_reg->min_value,
> +  min_val);
> +   else
> +   dst_reg->min_value = 0;
> dst_reg->max_value = max_val;

I'm not sure whether this is correct when dealing with signed numbers.
Let's say I have -2 and -3 (as u32: 0xfffffffe and 0xfffffffd) and AND them
together. The result is 0xfffffffc, or -4, right? So if I just compute the
AND of constant numbers -2 and -3 (known to the verifier), the verifier
would compute minimum -3 while the actual value is -4, right?

If I am correct about this, I think it might make sense to just reset the
state to unknown in the `min_val < 0 && dst_reg->min_value < 0` case. That
shouldn't occur in legitimate programs, right?
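
The arithmetic in question can be checked with a minimal userspace sketch.
This is not verifier code: the helper names and_min_patch() and and_s64()
are hypothetical, the first only mirrors the min_t() branch of the quoted
hunk, and two's-complement signed integers are assumed.

```c
#include <stdint.h>
#include <stdio.h>

/* Lower bound the quoted hunk would pick for an AND
 * (hypothetical helper mirroring its min_t() branch). */
static int64_t and_min_patch(int64_t dst_min, int64_t src_min)
{
	if (src_min < 0 && dst_min < 0)
		return dst_min < src_min ? dst_min : src_min;
	return 0;
}

/* Actual bitwise AND of two signed values. */
static int64_t and_s64(int64_t a, int64_t b)
{
	return a & b;
}

/* Print the real result next to the computed lower bound. */
static void show(int64_t a, int64_t b)
{
	printf("%lld & %lld = %lld, computed minimum = %lld\n",
	       (long long)a, (long long)b,
	       (long long)and_s64(a, b),
	       (long long)and_min_patch(a, b));
}
```

Calling show(-2, -3) prints both numbers side by side; whenever the real
result is smaller than the computed minimum, the tracked range no longer
covers the value.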

Re: [PATCH 2/3] vhost: better detection of available buffers

2016-11-11 Thread Michael S. Tsirkin

On Fri, Nov 11, 2016 at 12:18:50PM +0800, Jason Wang wrote:
> 
> 
> On 2016年11月11日 11:41, Michael S. Tsirkin wrote:
> > On Fri, Nov 11, 2016 at 10:18:37AM +0800, Jason Wang wrote:
> > > >
> > > >
> > > >On 2016年11月10日 03:57, Michael S. Tsirkin wrote:
> > > > > >On Wed, Nov 09, 2016 at 03:38:32PM +0800, Jason Wang wrote:
> > > > > > > >We should use vq->last_avail_idx instead of vq->avail_idx in the
> > > > > > > >checking of vhost_vq_avail_empty() since latter is the cached 
> > > > > > > >avail
> > > > > > > >index from guest but we want to know if there's pending available
> > > > > > > >buffers in the virtqueue.
> > > > > > > >
> > > > > > > >Signed-off-by: Jason Wang
> > > > > >I'm not sure why is this patch here. Is it related to
> > > > > >batching somehow?
> > > >
> > > >Yes, we need to know whether or not there's still buffers left in the
> > > >virtqueue, so need to check last_avail_idx. Otherwise, we're checking if
> > > >guest has submitted new buffers.
> > > >
> > > > > >
> > > > > >
> > > > > > > >---
> > > > > > > >   drivers/vhost/vhost.c | 2 +-
> > > > > > > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > > >
> > > > > > > >diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > > > >index c6f2d89..fdf4cdf 100644
> > > > > > > >--- a/drivers/vhost/vhost.c
> > > > > > > >+++ b/drivers/vhost/vhost.c
> > > > > > > >@@ -2230,7 +2230,7 @@ bool vhost_vq_avail_empty(struct vhost_dev 
> > > > > > > >*dev, struct vhost_virtqueue *vq)
> > > > > > > > if (r)
> > > > > > > > return false;
> > > > > > > >-return vhost16_to_cpu(vq, avail_idx) == vq->avail_idx;
> > > > > > > > >+return vhost16_to_cpu(vq, avail_idx) == vq->last_avail_idx;
> > > > > > > >   }
> > > > > > > >   EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
> > > > > >That might be OK for TX but it's probably wrong for RX
> > > > > >where the fact that used != avail does not mean
> > > > > >we have enough space to store the packet.
> > > >
> > > >Right, but it's no harm since it was just a hint, handle_rx() can handle
> > > >this situation.
> > Means busy polling will cause useless load on the CPU though.
> > 
> 
> Right, but it's not easy to have a 100% correct hint here. Needs more thought.

What's wrong with what we have? It polls until value changes.

-- 
MST

Re: [PATCH 1/3] tuntap: rx batching

2016-11-11 Thread Michael S. Tsirkin

On Fri, Nov 11, 2016 at 12:28:38PM +0800, Jason Wang wrote:
> 
> 
> On 2016年11月11日 12:17, John Fastabend wrote:
> > On 16-11-10 07:31 PM, Michael S. Tsirkin wrote:
> > > >On Fri, Nov 11, 2016 at 10:07:44AM +0800, Jason Wang wrote:
> > > > >>
> > > > >>
> > > > >>On 2016年11月10日 00:38, Michael S. Tsirkin wrote:
> > > > > >>>On Wed, Nov 09, 2016 at 03:38:31PM +0800, Jason Wang wrote:
> > > > > > > The backlog was used for tuntap rx, but it can only process one
> > > > > > > packet at a time since it is scheduled synchronously during
> > > > > > > sendmsg() in process context. This leads to bad cache utilization,
> > > > > > > so this patch tries to do some batching before calling rx NAPI.
> > > > > > > This is done through:
> > > > > > > 
> > > > > > > - accept MSG_MORE as a hint from sendmsg() caller, if it was set,
> > > > > > >   batch the packet temporarily in a linked list and submit them all
> > > > > > >   once MSG_MORE is cleared.
> > > > > > > - implement a tuntap specific NAPI handler for processing this kind
> > > > > > >   of possible batching. (This could be done by extending backlog to
> > > > > > >   support skb like, but using a tun specific one looks cleaner and
> > > > > > >   easier for future extension).
> > > > > > 
> > > > > > Signed-off-by: Jason Wang
> > > > > >>>So why do we need an extra queue?
> > > > >>
> > > > > >>The idea was borrowed from backlog to allow some kind of bulking and
> > > > > >>avoid spinlock on each dequeuing.
> > > > >>
> > > > > >>>   This is not what hardware devices do.
> > > > > >>>How about adding the packet to queue unconditionally, deferring
> > > > > >>>signalling until we get sendmsg without MSG_MORE?
> > > > >>
> > > > >>Then you need touch spinlock when dequeuing each packet.
> > > >
> > Random thought, I have a cmpxchg ring I am using for the qdisc work that
> > could possibly replace the spinlock implementation. I haven't figured
> > out the resizing API yet because I did not need it but I assume it could
> > help here and let you dequeue multiple skbs in one operation.
> > 
> > I can post the latest version if useful or an older version is
> > somewhere on patchworks as well.
> > 
> > .John
> > 
> > 
> 
> Looks useful here, and I can compare the performance if you post.
> 
> A question is can we extend the skb_array to support that?
> 
> Thanks

I'd like to start with a simple patch adding napi with one queue, then add
optimization patches on top.

One issue that comes to mind is that write queue limits
are byte based; they do not count packets, unlike the tun rx queue.



-- 
MST

Re: [PATCH net-next] sfc: clear napi_hash state when copying channels

2016-11-11 Thread Bert Kenward

Apologies, this was meant for net, not net-next. Shall I resubmit?

Bert.

Re: [PATCH v2 2/6] qed: Add iSCSI out of order packet handling.

2016-11-11 Thread Hannes Reinecke


On 11/08/2016 07:56 AM, Manish Rangankar wrote:

From: Yuval Mintz 

This patch adds out of order packet handling for hardware offloaded
iSCSI. Out of order packet handling requires driver buffer allocation
and assistance.

Signed-off-by: Arun Easi 
Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/qlogic/qed/Makefile   |   2 +-
 drivers/net/ethernet/qlogic/qed/qed.h  |   1 +
 drivers/net/ethernet/qlogic/qed/qed_dev.c  |  14 +-
 drivers/net/ethernet/qlogic/qed/qed_ll2.c  | 503 -
 drivers/net/ethernet/qlogic/qed/qed_ll2.h  |   9 +
 drivers/net/ethernet/qlogic/qed/qed_ooo.c  | 501 
 drivers/net/ethernet/qlogic/qed/qed_ooo.h  | 173 ++
 drivers/net/ethernet/qlogic/qed/qed_roce.c |   1 +
 drivers/net/ethernet/qlogic/qed/qed_spq.c  |   9 +
 9 files changed, 1201 insertions(+), 12 deletions(-)
 create mode 100644 drivers/net/ethernet/qlogic/qed/qed_ooo.c
 create mode 100644 drivers/net/ethernet/qlogic/qed/qed_ooo.h


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

Re: [PATCH v2 1/6] qed: Add support for hardware offloaded iSCSI.

2016-11-11 Thread Hannes Reinecke


On 11/08/2016 07:56 AM, Manish Rangankar wrote:

From: Yuval Mintz 

This adds the backbone required for the various HW initializations
which are necessary for the iSCSI driver (qedi) for QLogic FastLinQ
4 line of adapters - FW notification, resource initializations, etc.

Signed-off-by: Arun Easi 
Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/qlogic/Kconfig|   15 +
 drivers/net/ethernet/qlogic/qed/Makefile   |1 +
 drivers/net/ethernet/qlogic/qed/qed.h  |7 +-
 drivers/net/ethernet/qlogic/qed/qed_dev.c  |   12 +
 drivers/net/ethernet/qlogic/qed/qed_int.h  |1 -
 drivers/net/ethernet/qlogic/qed/qed_iscsi.c| 1276 
 drivers/net/ethernet/qlogic/qed/qed_iscsi.h|   52 +
 drivers/net/ethernet/qlogic/qed/qed_l2.c   |1 -
 drivers/net/ethernet/qlogic/qed/qed_ll2.c  |4 +-
 drivers/net/ethernet/qlogic/qed/qed_reg_addr.h |2 +
 drivers/net/ethernet/qlogic/qed/qed_spq.c  |   15 +
 include/linux/qed/qed_if.h |2 +
 include/linux/qed/qed_iscsi_if.h   |  229 +
 13 files changed, 1613 insertions(+), 4 deletions(-)
 create mode 100644 drivers/net/ethernet/qlogic/qed/qed_iscsi.c
 create mode 100644 drivers/net/ethernet/qlogic/qed/qed_iscsi.h
 create mode 100644 include/linux/qed/qed_iscsi_if.h

diff --git a/drivers/net/ethernet/qlogic/Kconfig b/drivers/net/ethernet/qlogic/Kconfig
index 32f2a45..2832570 100644
--- a/drivers/net/ethernet/qlogic/Kconfig
+++ b/drivers/net/ethernet/qlogic/Kconfig
@@ -110,4 +110,19 @@ config QEDE
 config QED_RDMA
bool

+config QED_ISCSI
+   bool
+
+config QEDI
+   tristate "QLogic QED 25/40/100Gb iSCSI driver"
+   depends on QED
+   select QED_LL2
+   select QED_ISCSI
+   default n
+   ---help---
+ This provides a temporary node that allows the compilation
+ and logical testing of the hardware offload iSCSI support
+ for QLogic QED. This would be replaced by the 'real' option
+ once the QEDI driver is added [+relocated].
+
 endif # NET_VENDOR_QLOGIC
diff --git a/drivers/net/ethernet/qlogic/qed/Makefile b/drivers/net/ethernet/qlogic/qed/Makefile
index 967acf3..597e15c 100644
--- a/drivers/net/ethernet/qlogic/qed/Makefile
+++ b/drivers/net/ethernet/qlogic/qed/Makefile
@@ -6,3 +6,4 @@ qed-y := qed_cxt.o qed_dev.o qed_hw.o qed_init_fw_funcs.o qed_init_ops.o \
 qed-$(CONFIG_QED_SRIOV) += qed_sriov.o qed_vf.o
 qed-$(CONFIG_QED_LL2) += qed_ll2.o
 qed-$(CONFIG_QED_RDMA) += qed_roce.o
+qed-$(CONFIG_QED_ISCSI) += qed_iscsi.o
diff --git a/drivers/net/ethernet/qlogic/qed/qed.h b/drivers/net/ethernet/qlogic/qed/qed.h
index 50b8a01..15286c1 100644
--- a/drivers/net/ethernet/qlogic/qed/qed.h
+++ b/drivers/net/ethernet/qlogic/qed/qed.h
@@ -35,6 +35,7 @@

 #define QED_WFQ_UNIT   100

+#define ISCSI_BDQ_ID(_port_id) (_port_id)
 #define QED_WID_SIZE(1024)
 #define QED_PF_DEMS_SIZE(4)

@@ -392,6 +393,7 @@ struct qed_hwfn {
boolusing_ll2;
struct qed_ll2_info *p_ll2_info;
struct qed_rdma_info*p_rdma_info;
+   struct qed_iscsi_info   *p_iscsi_info;
struct qed_pf_paramspf_params;

bool b_rdma_enabled_in_prs;
@@ -593,6 +595,8 @@ struct qed_dev {
/* Linux specific here */
struct  qede_dev*edev;
struct  pci_dev *pdev;
+   u32 flags;
+#define QED_FLAG_STORAGE_STARTED   (BIT(0))
int msg_enable;

struct pci_params   pci_params;
@@ -606,6 +610,7 @@ struct qed_dev {
union {
struct qed_common_cb_ops*common;
struct qed_eth_cb_ops   *eth;
+   struct qed_iscsi_cb_ops *iscsi;
} protocol_ops;
void*ops_cookie;

@@ -615,7 +620,7 @@ struct qed_dev {
struct qed_cb_ll2_info  *ll2;
u8  ll2_mac_address[ETH_ALEN];
 #endif
-
+   DECLARE_HASHTABLE(connections, 10);
const struct firmware   *firmware;

u32 rdma_max_sge;

10 connections? Only?
Hmm.
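
For reference on the DECLARE_HASHTABLE(connections, 10) line in the hunk
above: as far as I recall from include/linux/hashtable.h, the second
argument is a bit count, so the table gets 1 << 10 buckets rather than 10
entries. A minimal userspace sketch follows; the struct and macro
definitions are stand-ins assumed to mirror the kernel ones, and demo_dev
and demo_bucket_count() are hypothetical names.

```c
#include <stddef.h>

/* Userspace stand-ins, assumed to mirror the kernel definitions. */
struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

/* The second argument is the log2 of the number of buckets. */
#define DECLARE_HASHTABLE(name, bits) struct hlist_head name[1 << (bits)]

/* Hypothetical container, only for illustration. */
struct demo_dev {
	DECLARE_HASHTABLE(connections, 10);
};

/* Number of buckets in the table declared above. */
static size_t demo_bucket_count(void)
{
	return sizeof(((struct demo_dev *)0)->connections) /
	       sizeof(struct hlist_head);
}
```

Each bucket is a chained list, so the bucket count does not by itself cap
the number of connections.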

Other than that:

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

[PATCH net-next] sfc: clear napi_hash state when copying channels

2016-11-11 Thread Bert Kenward

efx_copy_channel() doesn't correctly clear the napi_hash related state.
This means that when napi_hash_add is called for that channel nothing is
done, and we are left with a copy of the napi_hash_node from the old
channel. When we later call napi_hash_del() on this channel we have a
stale napi_hash_node.

Corruption is only seen when there are multiple entries in one of the
napi_hash lists. This is made more likely by having a very large number
of channels. Testing was carried out with 512 channels - 32 channels on
each of 16 ports.

This failure typically appears as protection faults within napi_by_id()
or napi_hash_add(). efx_copy_channel() is only used when tx or rx ring
sizes are changed (ethtool -G).

Fixes: 36763266bbe8 ("sfc: Add support for busy polling")
Signed-off-by: Bert Kenward 
---
 drivers/net/ethernet/sfc/efx.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index 3cf3557..6b89e4a 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -485,6 +485,9 @@ efx_copy_channel(const struct efx_channel *old_channel)
*channel = *old_channel;
 
channel->napi_dev = NULL;
+   INIT_HLIST_NODE(&channel->napi_str.napi_hash_node);
+   channel->napi_str.napi_id = 0;
+   channel->napi_str.state = 0;
memset(&channel->eventq, 0, sizeof(channel->eventq));
 
for (j = 0; j < EFX_TXQ_TYPES; j++) {
-- 
2.7.4
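
The failure mode described in the commit message can be sketched in
userspace. This is a simplified model, not the sfc or NAPI code: the types
and helpers below are hypothetical, and the "hashed" flag is an assumption
standing in for whatever state makes the add a no-op.

```c
#include <stdbool.h>
#include <stddef.h>

struct node { struct node *next; };	/* simplified list node */
struct list { struct node *first; };

struct channel {
	int id;
	bool hashed;		/* stand-in for the "already hashed" state */
	struct node hash_node;
};

/* Add to the list unless the channel already claims to be hashed. */
static bool hash_add(struct list *l, struct channel *c)
{
	if (c->hashed)
		return false;	/* nothing is done */
	c->hashed = true;
	c->hash_node.next = l->first;
	l->first = &c->hash_node;
	return true;
}

/* Plain struct copy: the hash state of the old channel comes along. */
static struct channel copy_channel_buggy(const struct channel *old)
{
	return *old;
}

/* Copy, then reset the hash related state as the patch does. */
static struct channel copy_channel_fixed(const struct channel *old)
{
	struct channel c = *old;

	c.hashed = false;
	c.hash_node.next = NULL;
	return c;
}
```

With the buggy copy, the later add is skipped and the copy keeps pointing
into the old list; with the fixed copy, the add proceeds normally.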

[patch net v2 1/2] mlxsw: spectrum: Fix refcount bug on span entries

2016-11-11 Thread Jiri Pirko

From: Yotam Gigi 

When binding port to a newly created span entry, its refcount is
initialized to zero even though it has a bound port. That leads
to unexpected behaviour when the user tries to delete that port
from the span entry.

Fix this by initializing the reference count to 1.

Also add a warning to put function.

Fixes: 763b4b70afcd ("mlxsw: spectrum: Add support in matchall mirror TC offloading")
Signed-off-by: Yotam Gigi 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
v1->v2:
- fix this rather by initializing refcount to 1
- fix typo in description
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 1ec0a4c..dda5761 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -231,7 +231,7 @@ mlxsw_sp_span_entry_create(struct mlxsw_sp_port *port)
 
span_entry->used = true;
span_entry->id = index;
-   span_entry->ref_count = 0;
+   span_entry->ref_count = 1;
span_entry->local_port = local_port;
return span_entry;
 }
@@ -270,6 +270,7 @@ static struct mlxsw_sp_span_entry
 
span_entry = mlxsw_sp_span_entry_find(port);
if (span_entry) {
+   /* Already exists, just take a reference */
span_entry->ref_count++;
return span_entry;
}
@@ -280,6 +281,7 @@ static struct mlxsw_sp_span_entry
 static int mlxsw_sp_span_entry_put(struct mlxsw_sp *mlxsw_sp,
   struct mlxsw_sp_span_entry *span_entry)
 {
+   WARN_ON(!span_entry->ref_count);
if (--span_entry->ref_count == 0)
mlxsw_sp_span_entry_destroy(mlxsw_sp, span_entry);
return 0;
-- 
2.7.4
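
Note on the patch above: the get/put convention it restores can be sketched in a few lines. This is a simplified illustration with hypothetical names (struct entry, entry_get, entry_put), not the mlxsw code. It assumes a single slot instead of a table, and uses assert() where the patch uses WARN_ON().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a span entry; not the mlxsw structure. */
struct entry {
	bool used;
	int ref_count;
};

/* Creating an entry on behalf of a caller already gives that caller
 * one reference, so the count starts at 1 rather than 0. */
static void entry_create(struct entry *e)
{
	e->used = true;
	e->ref_count = 1;
}

/* Look up or create: an existing entry just gains a reference. */
static struct entry *entry_get(struct entry *e)
{
	if (e->used)
		e->ref_count++;
	else
		entry_create(e);
	return e;
}

/* Drop a reference; the entry is released when the last one goes. */
static void entry_put(struct entry *e)
{
	assert(e->ref_count > 0);   /* plays the role of the WARN_ON() */
	if (--e->ref_count == 0)
		e->used = false;
}
```

With the count starting at 0, the first put would decrement to -1 and the entry would never be released, which is the behaviour the commit message describes.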

[patch net v2 2/2] mlxsw: spectrum_router: Correctly dump neighbour activity

2016-11-11 Thread Jiri Pirko

From: Arkadi Sharshevsky 

The device's neighbour table is periodically dumped in order to update
the kernel about active neighbours. A single dump session may span
multiple queries, until the response carries fewer records than requested
or a record (which can contain up to four neighbour entries) is not full.
The current code stops the session only when the number of returned
records is zero, which can result in an infinite loop at a high packet rate.

Fix this by stopping the session according to the above logic.

Fixes: c723c735fa6b ("mlxsw: spectrum_router: Periodically update the kernel's neigh table")
Signed-off-by: Arkadi Sharshevsky 
Signed-off-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
v1->v2:
- remove an extra space
- fix the description
---
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 040737e..cbeeddd 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -800,6 +800,26 @@ static void mlxsw_sp_router_neigh_rec_process(struct mlxsw_sp *mlxsw_sp,
}
 }
 
+static bool mlxsw_sp_router_rauhtd_is_full(char *rauhtd_pl)
+{
+   u8 num_rec, last_rec_index, num_entries;
+
+   num_rec = mlxsw_reg_rauhtd_num_rec_get(rauhtd_pl);
+   last_rec_index = num_rec - 1;
+
+   if (num_rec < MLXSW_REG_RAUHTD_REC_MAX_NUM)
+   return false;
+   if (mlxsw_reg_rauhtd_rec_type_get(rauhtd_pl, last_rec_index) ==
+   MLXSW_REG_RAUHTD_TYPE_IPV6)
+   return true;
+
+   num_entries = mlxsw_reg_rauhtd_ipv4_rec_num_entries_get(rauhtd_pl,
+   last_rec_index);
+   if (++num_entries == MLXSW_REG_RAUHTD_IPV4_ENT_PER_REC)
+   return true;
+   return false;
+}
+
 static int mlxsw_sp_router_neighs_update_rauhtd(struct mlxsw_sp *mlxsw_sp)
 {
char *rauhtd_pl;
@@ -826,7 +846,7 @@ static int mlxsw_sp_router_neighs_update_rauhtd(struct mlxsw_sp *mlxsw_sp)
for (i = 0; i < num_rec; i++)
mlxsw_sp_router_neigh_rec_process(mlxsw_sp, rauhtd_pl,
  i);
-   } while (num_rec);
+   } while (mlxsw_sp_router_rauhtd_is_full(rauhtd_pl));
rtnl_unlock();
 
kfree(rauhtd_pl);
-- 
2.7.4
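
Note on the patch above: the stop condition from the commit message can be written as a standalone predicate. The sketch below is an illustration under assumptions: struct dump_resp is a hypothetical flattened view of one response, the two limits are placeholder values rather than the register definitions, and entry counts are plain counts (the patch increments the value it reads from the register before comparing).

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder limits; the real values come from the register layout. */
#define REC_MAX_NUM       32
#define IPV4_ENT_PER_REC  4

enum rec_type { REC_IPV4, REC_IPV6 };

/* Hypothetical, simplified view of one dump response. */
struct dump_resp {
	int num_rec;             /* records returned by this query */
	enum rec_type last_type; /* type of the last record */
	int last_num_entries;    /* entries in the last record (IPv4 only) */
};

/* A response is "full" only if it holds the maximum number of records
 * and its last record has no room left; only then can more data follow. */
static bool resp_is_full(const struct dump_resp *r)
{
	assert(r->num_rec >= 0 && r->num_rec <= REC_MAX_NUM);
	if (r->num_rec < REC_MAX_NUM)
		return false;
	if (r->last_type == REC_IPV6)   /* treated as always full, as in the patch */
		return true;
	return r->last_num_entries == IPV4_ENT_PER_REC;
}
```

A dump loop would then issue another query only while resp_is_full() returns true, so a response with spare room always ends the session.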

[patch net v2 0/2] mlxsw: Couple of fixes

2016-11-11 Thread Jiri Pirko

From: Jiri Pirko 

Please, queue-up both for stable. Thanks!

---
v1->v2:
- patch 1:
 - fix this rather by initializing refcount to 1
 - fix typo in description
- patch 2:
 - remove an extra space
 - fix the description

Arkadi Sharshevsky (1):
  mlxsw: spectrum_router: Correctly dump neighbour activity

Yotam Gigi (1):
  mlxsw: spectrum: Fix refcount bug on span entries

 drivers/net/ethernet/mellanox/mlxsw/spectrum.c |  4 +++-
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 22 +-
 2 files changed, 24 insertions(+), 2 deletions(-)

-- 
2.7.4

[PATCH] net: ioctl SIOCSIFADDR minor cleanup

2016-11-11 Thread yuan linyu

From: yuan linyu 

1. Setting the interface address label to the ioctl request's device name is
   enough.
2. Once the address passes the inet_abc_len() check, prefixlen is always less
   than 31.

Signed-off-by: yuan linyu 
---
 net/ipv4/devinet.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 062a67c..d491a7a 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -1063,10 +1063,7 @@ int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg)
    if (!ifa)
    break;
    INIT_HLIST_NODE(&ifa->hash);
-   if (colon)
-   memcpy(ifa->ifa_label, ifr.ifr_name, IFNAMSIZ);
-   else
-   memcpy(ifa->ifa_label, dev->name, IFNAMSIZ);
+   memcpy(ifa->ifa_label, ifr.ifr_name, IFNAMSIZ);
    } else {
    ret = 0;
    if (ifa->ifa_local == sin->sin_addr.s_addr)
@@ -1081,8 +1078,7 @@ int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg)
    if (!(dev->flags & IFF_POINTOPOINT)) {
    ifa->ifa_prefixlen = inet_abc_len(ifa->ifa_address);
    ifa->ifa_mask = inet_make_mask(ifa->ifa_prefixlen);
-   if ((dev->flags & IFF_BROADCAST) &&
-   ifa->ifa_prefixlen < 31)
+   if (dev->flags & IFF_BROADCAST)
    ifa->ifa_broadcast = ifa->ifa_address |
     ~ifa->ifa_mask;
    } else {
-- 
2.7.4
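
Note on the patch above: the second point of the commit message rests on classful prefix lengths only ever being 0, 8, 16 or 24. The userspace sketch below illustrates that, assuming addresses in host byte order. classful_prefix_len() and make_mask() are illustrative helpers written for this note, not the kernel's inet_abc_len() and inet_make_mask().

```c
#include <assert.h>
#include <stdint.h>

/* Classful prefix length for an IPv4 address in host byte order.
 * Returns -1 for addresses outside classes A, B and C (for example
 * multicast), which is the case the earlier check rejects. */
static int classful_prefix_len(uint32_t haddr)
{
	if ((haddr & 0xff000000u) == 0)            /* 0.0.0.0/8 */
		return 0;
	if ((haddr & 0x80000000u) == 0)            /* class A: 0xxxxxxx */
		return 8;
	if ((haddr & 0xc0000000u) == 0x80000000u)  /* class B: 10xxxxxx */
		return 16;
	if ((haddr & 0xe0000000u) == 0xc0000000u)  /* class C: 110xxxxx */
		return 24;
	return -1;
}

/* Netmask (host byte order) for a prefix length in 0..32. */
static uint32_t make_mask(int len)
{
	assert(len >= 0 && len <= 32);
	return len ? ~((UINT32_C(1) << (32 - len)) - 1) : 0;
}
```

For 192.168.1.1 this gives a prefix length of 24, and the broadcast address is the address OR-ed with the inverted mask, that is 192.168.1.255. Since no accepted address yields 31 or 32, a separate "prefixlen < 31" test adds nothing on this path.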

Re: [patch net-next 5/8] Introduce sample tc action

2016-11-11 Thread John Fastabend

On 16-11-11 04:43 AM, Simon Horman wrote:
> On Fri, Nov 11, 2016 at 08:28:50AM +, Yotam Gigi wrote:
> 
> ...
> 
>> John, as a result of your question I realized that our hardware does do
>> randomized sampling that I was not aware of. I will use the extensibility of
>> the API and implement a random keyword, that will be offloaded in our
>> hardware. Those changes will be sent on v2.
>>
>> Eventually, your question was very relevant :) Thanks!
> 
> Perhaps I am missing the point but why not just make random the default and
> implement the inverse as an extension if it turns out to be needed in
> future?
> 

+1 just implement the random one.

.John

Re: [PATCH] usbnet: prevent device rpm suspend in usbnet_probe function

2016-11-11 Thread Mathias Nyman


On 10.11.2016 13:22, Oliver Neukum wrote:

On Thu, 2016-11-10 at 12:09 +0100, Bjørn Mork wrote:

Kai-Heng Feng  writes:

On Wed, Nov 9, 2016 at 8:32 PM, Bjørn Mork  wrote:

Oliver Neukum  writes:


On Tue, 2016-11-08 at 13:44 -0500, Alan Stern wrote:


These problems could very well be caused by running at SuperSpeed
(USB-3) instead of high speed (USB-2).


Yes, it's running at SuperSpeed, on a Kabylake laptop.

It does not have this issue on a Broadwell laptop, also running at SuperSpeed.


Then I must join Oliver, being very surprised by where in the stack you
attempt to fix the issue.  What you write above indicates a problem in
pci bridge or usb host controller, doesn't it?


Indeed. And this means we need an XHCI specialist.
Mathias, we have a failure specific to one implementation of XHCI.




Could be related to resume signalling time.
Does the xhci fix for it in 4.9-rc3 help?

commit 7d3b016a6f5a0fa610dfd02b05654c08fa4ae514
xhci: use default USB_RESUME_TIMEOUT when resuming ports.

It doesn't directly explain why it would work on Broadwell but not Kabylake,
but it resolved very similar cases.

If not, then adding dynamic debug for xhci could show something.

-Mathias

Re: AF_VSOCK loopback

2016-11-11 Thread Jorgen S. Hansen

Hi Stefan,

All datagram communication in VMCI based AF_VSOCK is going through the host - 
also for loopback communication. The only difference wrt loopback is that the 
VMCI queue pairs implementing the shared queues for the stream protocols aren't 
registered with the hypervisor - they are created specifying the 
VMCI_QPFLAG_LOCAL flag, and exist only as local guest memory.

So in the current form, there isn't much loopback code in the vmci AF_VSOCK 
implementation, so it doesn't seem like there would be much to share either.

Thanks,
Jørgen


From: Stefan Hajnoczi 
Sent: Thursday, November 10, 2016 3:43 PM
To: Jorgen S. Hansen
Cc: cav...@redhat.com; netdev@vger.kernel.org
Subject: AF_VSOCK loopback

Hi Jorgen,
Cathy Avery found that the AF_VSOCK VMCI transport does loopback inside
the guest (but not on the host?).  The virtio transport currently does
no loopback.

The loopback scenario I'm thinking of is where process A listens on port
1234 and process B on the same machine connects to port 1234 both with
the same CID.

I'd like to make the virtio transport compatible with VMCI transport
semantics so AF_VSOCK behaves the same regardless of the transport.
This means loopback must be added to virtio-vsock.

The core net/vmw_vsock/af_vsock.c code does not implement loopback.  How
does VMCI do loopback?  Are the loopback packets reflected back from the
host?  Or does the guest driver notice the loopback and avoid passing
packets to the host in the first place?

Maybe we can make the loopback code common in af_vsock.c if that avoids
code duplication.

Thanks,
Stefan

[PATCH] net: ethernet: ti: davinci_cpdma: don't stop ctlr if it was stopped

2016-11-11 Thread Ivan Khoronzhuk

No need to stop the ctlr if it is already stopped; doing so can cause
timeout warnings. Steps to reproduce:
- ifconfig eth0 down
- ethtool -L eth0 rx 8 tx 8
- ethtool -L eth0 rx 1 tx 1

Signed-off-by: Ivan Khoronzhuk 
---

Based on net-next/master

 drivers/net/ethernet/ti/davinci_cpdma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index 56395ce..56708a7 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -387,7 +387,7 @@ int cpdma_ctlr_stop(struct cpdma_ctlr *ctlr)
int i;
 
spin_lock_irqsave(&ctlr->lock, flags);
-   if (ctlr->state == CPDMA_STATE_TEARDOWN) {
+   if (ctlr->state != CPDMA_STATE_ACTIVE) {
spin_unlock_irqrestore(&ctlr->lock, flags);
return -EINVAL;
}
-- 
1.9.1
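
Note on the patch above: the changed guard amounts to "only an active controller may be stopped". A minimal sketch of that state check follows. The enum, struct and function names are hypothetical simplifications, locking is omitted, and the teardown counter merely stands in for the per-channel work that produces the timeouts.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified, hypothetical controller states. */
enum ctlr_state { STATE_IDLE, STATE_ACTIVE, STATE_TEARDOWN };

struct ctlr {
	enum ctlr_state state;
	int teardowns;          /* how many times teardown work was run */
};

/* Stop the controller only if it is currently active; stopping an idle
 * or already tearing-down controller is rejected up front. */
static int ctlr_stop(struct ctlr *c)
{
	assert(c != NULL);
	if (c->state != STATE_ACTIVE)
		return -EINVAL;

	c->state = STATE_TEARDOWN;
	c->teardowns++;         /* stands in for the channel teardown work */
	c->state = STATE_IDLE;
	return 0;
}
```

Checking for "not active" rather than "in teardown" also covers the idle state, which is the case hit when the interface is already down.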

Re: [PATCH 0/2] bnx2: Hard reset bnx2 chip at probe stage

2016-11-11 Thread Baoquan He

On 11/11/16 at 09:46pm, Baoquan He wrote:
> Hi bnx2 experts,
> 
> In commit 3e1be7a ("bnx2: Reset device during driver initialization"),
> the firmware requesting code was moved from the open stage to the probe
> stage. The reason is that in a kdump kernel with a hardware iommu the
> device needs to be reset in the driver probe stage; otherwise in-flight
> DMA from the 1st kernel keeps going and looks up the newly created
> io-page tables. So we need to reset the device to stop in-flight DMA as
> early as possible.
> 
> But with commit 3e1be7a merged, people reported that their bnx2 driver
> init failed because firmware loading failed. After discussion, it was
> found that they built the bnx2 driver into the kernel, which makes the
> probe function bnx2_init_one be called in do_initcalls(). At that time
> the initramfs has not been uncompressed and mounted yet, so the kernel
> can't find the firmware.
> 
> So there's only one way to cover both cases: hard reset the bnx2 device
> at the probe stage without involving firmware. I added the function
> bnx2_hard_reset_chip() to do this; it is only called in a kdump kernel.
> I am not very familiar with the bnx2 chip spec and just extracted the
> code from bnx2_reset_chip, but the testing result is good.

Here I changed to send BNX2_MISC_COMMAND_HD_RESET in BNX2_CHIP_5709
case.

> 
> Any suggestions are welcomed and much appreciated!
> 
> Baoquan He (2):
>   Revert "bnx2: Reset device during driver initialization"
>   bnx2: Hard reset bnx2 chip at probe stage
> 
>  drivers/net/ethernet/broadcom/bnx2.c | 70 
> +---
>  1 file changed, 65 insertions(+), 5 deletions(-)
> 
> -- 
> 2.5.5
>

Re: [PATCH 1/2] Revert "bnx2: Reset device during driver initialization"

2016-11-11 Thread Paul Menzel


Dear Baoquan,


On 11/11/16 14:46, Baoquan He wrote:

This reverts commit 3e1be7ad2d38c6bd6aeef96df9bd0a7822f4e51c.


Thanks a lot.


When people build bnx2 driver into kernel, it will fail to detect
and load firmware because firmware is contained in initramfs and
initramfs has not been uncompressed yet during do_initcalls. So
revert commit 3e1be7a and work out a new way in the later patch.


Just to note, that the other reason is, that in some installations 
people don’t have the firmware in initramfs at all or don’t use an 
initramfs.



Signed-off-by: Baoquan He 


Please mark this for inclusion into the stable Linux kernel.

Acked-by: Paul Menzel 


Thanks,

Paul

[PATCH 0/2] bnx2: Hard reset bnx2 chip at probe stage

2016-11-11 Thread Baoquan He

Hi bnx2 experts,

In commit 3e1be7a ("bnx2: Reset device during driver initialization"),
the firmware requesting code was moved from the open stage to the probe
stage. The reason is that in a kdump kernel with a hardware iommu the
device needs to be reset in the driver probe stage; otherwise in-flight
DMA from the 1st kernel keeps going and looks up the newly created
io-page tables. So we need to reset the device to stop in-flight DMA as
early as possible.

But with commit 3e1be7a merged, people reported that their bnx2 driver
init failed because firmware loading failed. After discussion, it was
found that they built the bnx2 driver into the kernel, which makes the
probe function bnx2_init_one be called in do_initcalls(). At that time
the initramfs has not been uncompressed and mounted yet, so the kernel
can't find the firmware.

So there's only one way to cover both cases: hard reset the bnx2 device
at the probe stage without involving firmware. I added the function
bnx2_hard_reset_chip() to do this; it is only called in a kdump kernel.
I am not very familiar with the bnx2 chip spec and just extracted the
code from bnx2_reset_chip, but the testing result is good.

Any suggestions are welcomed and much appreciated!

Baoquan He (2):
  Revert "bnx2: Reset device during driver initialization"
  bnx2: Hard reset bnx2 chip at probe stage

 drivers/net/ethernet/broadcom/bnx2.c | 70 +---
 1 file changed, 65 insertions(+), 5 deletions(-)

-- 
2.5.5

[PATCH 2/2] bnx2: Hard reset bnx2 chip at probe stage

2016-11-11 Thread Baoquan He

In commit 3e1be7a ("bnx2: Reset device during driver initialization"),
the firmware requesting code was moved to the driver probe stage, into
function bnx2_init_one. But if the bnx2 driver is built into the kernel,
it fails to request firmware because the firmware is contained in the
initramfs, which has not been uncompressed and mounted yet when
do_initcalls is called.

So in order to reset the device at the probe stage, we have to hard reset
the bnx2 chip without involving firmware handling. This patch adds the
function bnx2_hard_reset_chip, which only tries to hard reset the bnx2
chip and is only called in a kdump kernel.

Signed-off-by: Baoquan He 
---
 drivers/net/ethernet/broadcom/bnx2.c | 62 
 1 file changed, 62 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c
index c557972..84e3f12 100644
--- a/drivers/net/ethernet/broadcom/bnx2.c
+++ b/drivers/net/ethernet/broadcom/bnx2.c
@@ -49,6 +49,7 @@
 #include 
 #include 
 #include 
+#include <linux/crash_dump.h>
 
 #if IS_ENABLED(CONFIG_CNIC)
 #define BCM_CNIC 1
@@ -4765,6 +4766,58 @@ bnx2_setup_msix_tbl(struct bnx2 *bp)
 }
 
 static int
+bnx2_hard_reset_chip(struct bnx2 *bp)
+{
+   u32 val;
+   int i, rc = 0;
+
+   if (BNX2_CHIP(bp) == BNX2_CHIP_5709) {
+   BNX2_WR(bp, BNX2_MISC_COMMAND, BNX2_MISC_COMMAND_HD_RESET);
+   BNX2_RD(bp, BNX2_MISC_COMMAND);
+   udelay(5);
+
+   val = BNX2_PCICFG_MISC_CONFIG_REG_WINDOW_ENA |
+ BNX2_PCICFG_MISC_CONFIG_TARGET_MB_WORD_SWAP;
+
+   BNX2_WR(bp, BNX2_PCICFG_MISC_CONFIG, val);
+
+   } else {
+   val = BNX2_PCICFG_MISC_CONFIG_CORE_RST_REQ |
+ BNX2_PCICFG_MISC_CONFIG_REG_WINDOW_ENA |
+ BNX2_PCICFG_MISC_CONFIG_TARGET_MB_WORD_SWAP;
+
+   /* Chip reset. */
+   BNX2_WR(bp, BNX2_PCICFG_MISC_CONFIG, val);
+
+   /* Reading back any register after chip reset will hang the
+* bus on 5706 A0 and A1.  The msleep below provides plenty
+* of margin for write posting.
+*/
+   if ((BNX2_CHIP_ID(bp) == BNX2_CHIP_ID_5706_A0) ||
+   (BNX2_CHIP_ID(bp) == BNX2_CHIP_ID_5706_A1))
+   msleep(20);
+
+   /* Reset takes approximate 30 usec */
+   for (i = 0; i < 10; i++) {
+   val = BNX2_RD(bp, BNX2_PCICFG_MISC_CONFIG);
+   if ((val & (BNX2_PCICFG_MISC_CONFIG_CORE_RST_REQ |
+   BNX2_PCICFG_MISC_CONFIG_CORE_RST_BSY)) == 0)
+   break;
+   udelay(10);
+   }
+
+   if (val & (BNX2_PCICFG_MISC_CONFIG_CORE_RST_REQ |
+  BNX2_PCICFG_MISC_CONFIG_CORE_RST_BSY)) {
+   pr_err("Chip reset did not complete\n");
+   return -EBUSY;
+   }
+   }
+
+   return rc;
+}
+
+
+static int
 bnx2_reset_chip(struct bnx2 *bp, u32 reset_code)
 {
u32 val;
@@ -8580,6 +8633,15 @@ bnx2_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 
pci_set_drvdata(pdev, dev);
 
+
+   /*
+* Kdump kernel need reset device at probe stage if hardware iommu
+* is deployed. Otherwise in-flight DMA will continue going until
+* reset is done in open stage.
+*/
+   if (is_kdump_kernel())
+   bnx2_hard_reset_chip(bp);
+
memcpy(dev->dev_addr, bp->mac_addr, ETH_ALEN);
 
dev->hw_features = NETIF_F_IP_CSUM | NETIF_F_SG |
-- 
2.5.5
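
Note on the patch above: the non-5709 branch follows a common "request reset, then poll with a bounded number of retries" pattern. The sketch below shows only that pattern in userspace. The bit definitions are placeholders and not the bnx2 register layout, and read_reg_fn, wait_reset_done() and fake_reg() are hypothetical helpers introduced so the loop can be exercised without hardware.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define RST_REQ (1u << 0)  /* placeholder bit positions, not the bnx2 ones */
#define RST_BSY (1u << 1)

/* Hypothetical register read callback. */
typedef uint32_t (*read_reg_fn)(void *ctx);

/* Poll until both reset bits are clear, giving up after max_polls reads. */
static int wait_reset_done(read_reg_fn read_reg, void *ctx, int max_polls)
{
	int i;

	assert(read_reg != NULL);
	for (i = 0; i < max_polls; i++) {
		uint32_t val = read_reg(ctx);

		if ((val & (RST_REQ | RST_BSY)) == 0)
			return 0;
		/* a driver would delay between polls here */
	}
	return -EBUSY;
}

/* Fake register: reports busy until the countdown in *ctx reaches zero. */
static uint32_t fake_reg(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return RST_BSY;
	}
	return 0;
}
```

Bounding the loop matters at probe time in a kdump kernel: a chip that never clears the busy bit then produces an error return instead of a hang.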

[PATCH 1/2] Revert "bnx2: Reset device during driver initialization"

2016-11-11 Thread Baoquan He

This reverts commit 3e1be7ad2d38c6bd6aeef96df9bd0a7822f4e51c.

When people build bnx2 driver into kernel, it will fail to detect
and load firmware because firmware is contained in initramfs and
initramfs has not been uncompressed yet during do_initcalls. So
revert commit 3e1be7a and work out a new way in the later patch.

Signed-off-by: Baoquan He 
---
 drivers/net/ethernet/broadcom/bnx2.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c
index b3791b3..c557972 100644
--- a/drivers/net/ethernet/broadcom/bnx2.c
+++ b/drivers/net/ethernet/broadcom/bnx2.c
@@ -6361,6 +6361,10 @@ bnx2_open(struct net_device *dev)
struct bnx2 *bp = netdev_priv(dev);
int rc;
 
+   rc = bnx2_request_firmware(bp);
+   if (rc < 0)
+   goto out;
+
netif_carrier_off(dev);
 
bnx2_disable_int(bp);
@@ -6429,6 +6433,7 @@ bnx2_open(struct net_device *dev)
bnx2_free_irq(bp);
bnx2_free_mem(bp);
bnx2_del_napi(bp);
+   bnx2_release_firmware(bp);
goto out;
 }
 
@@ -8575,12 +8580,6 @@ bnx2_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 
pci_set_drvdata(pdev, dev);
 
-   rc = bnx2_request_firmware(bp);
-   if (rc < 0)
-   goto error;
-
-
-   bnx2_reset_chip(bp, BNX2_DRV_MSG_CODE_RESET);
memcpy(dev->dev_addr, bp->mac_addr, ETH_ALEN);
 
dev->hw_features = NETIF_F_IP_CSUM | NETIF_F_SG |
@@ -8613,7 +8612,6 @@ bnx2_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
return 0;
 
 error:
-   bnx2_release_firmware(bp);
pci_iounmap(pdev, bp->regview);
pci_release_regions(pdev);
pci_disable_device(pdev);
-- 
2.5.5

[PATCH v2] net: ethernet: ti: davinci_cpdma: fix fixed prio cpdma ctlr configuration

2016-11-11 Thread Ivan Khoronzhuk

The dma ctlr is reset to 0 during cpdma soft reset, thus the cpdma ctlr
cannot be configured after cpdma is stopped. So the content of the cpdma
ctlr has to be restored during the off/on procedure. The cpdma ctlr off/on
procedure happens on interface down/up and when changing the number
of channels with ethtool. In order not to restore the content in many
places, move it to cpdma_ctlr_start().

Signed-off-by: Ivan Khoronzhuk 
---

Based on net-next/master

Since v1:
- don't use redundant parameters for cpdma, make prio fixed to be constant

 drivers/net/ethernet/ti/cpsw.c  |   4 --
 drivers/net/ethernet/ti/davinci_cpdma.c | 102 +---
 2 files changed, 53 insertions(+), 53 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index b1ddf89..39d06e8 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -1376,10 +1376,6 @@ static int cpsw_ndo_open(struct net_device *ndev)
  ALE_ALL_PORTS, ALE_ALL_PORTS, 0, 0);
 
if (!cpsw_common_res_usage_state(cpsw)) {
-   /* setup tx dma to fixed prio and zero offset */
-   cpdma_control_set(cpsw->dma, CPDMA_TX_PRIO_FIXED, 1);
-   cpdma_control_set(cpsw->dma, CPDMA_RX_BUFFER_OFFSET, 0);
-
/* disable priority elevation */
__raw_writel(0, &cpsw->regs->ptype);
 
diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index c3f35f1..c2fb1b6 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -124,6 +124,29 @@ struct cpdma_chan {
int int_set, int_clear, td;
 };
 
+struct cpdma_control_info {
+   u32 reg;
+   u32 shift, mask;
+   int access;
+#define ACCESS_RO  BIT(0)
+#define ACCESS_WO  BIT(1)
+#define ACCESS_RW  (ACCESS_RO | ACCESS_WO)
+};
+
+static struct cpdma_control_info controls[] = {
+   [CPDMA_CMD_IDLE]  = {CPDMA_DMACONTROL,  3,  1,  ACCESS_WO},
+   [CPDMA_COPY_ERROR_FRAMES] = {CPDMA_DMACONTROL,  4,  1,  ACCESS_RW},
+   [CPDMA_RX_OFF_LEN_UPDATE] = {CPDMA_DMACONTROL,  2,  1,  ACCESS_RW},
+   [CPDMA_RX_OWNERSHIP_FLIP] = {CPDMA_DMACONTROL,  1,  1,  ACCESS_RW},
+   [CPDMA_TX_PRIO_FIXED] = {CPDMA_DMACONTROL,  0,  1,  ACCESS_RW},
+   [CPDMA_STAT_IDLE] = {CPDMA_DMASTATUS,   31, 1,  ACCESS_RO},
+   [CPDMA_STAT_TX_ERR_CODE]  = {CPDMA_DMASTATUS,   20, 0xf,ACCESS_RW},
+   [CPDMA_STAT_TX_ERR_CHAN]  = {CPDMA_DMASTATUS,   16, 0x7,ACCESS_RW},
+   [CPDMA_STAT_RX_ERR_CODE]  = {CPDMA_DMASTATUS,   12, 0xf,ACCESS_RW},
+   [CPDMA_STAT_RX_ERR_CHAN]  = {CPDMA_DMASTATUS,   8,  0x7,ACCESS_RW},
+   [CPDMA_RX_BUFFER_OFFSET]  = {CPDMA_RXBUFFOFS,   0,  0xffff, ACCESS_RW},
+};
+
 #define tx_chan_num(chan)  (chan)
 #define rx_chan_num(chan)  ((chan) + CPDMA_MAX_CHANNELS)
 #define is_rx_chan(chan)   ((chan)->chan_num >= CPDMA_MAX_CHANNELS)
@@ -253,6 +276,31 @@ static void cpdma_desc_free(struct cpdma_desc_pool *pool,
gen_pool_free(pool->gen_pool, (unsigned long)desc, pool->desc_size);
 }
 
+static int _cpdma_control_set(struct cpdma_ctlr *ctlr, int control, int value)
+{
+   struct cpdma_control_info *info = &controls[control];
+   u32 val;
+
+   if (!ctlr->params.has_ext_regs)
+   return -ENOTSUPP;
+
+   if (ctlr->state != CPDMA_STATE_ACTIVE)
+   return -EINVAL;
+
+   if (control < 0 || control >= ARRAY_SIZE(controls))
+   return -ENOENT;
+
+   if ((info->access & ACCESS_WO) != ACCESS_WO)
+   return -EPERM;
+
+   val  = dma_reg_read(ctlr, info->reg);
+   val &= ~(info->mask << info->shift);
+   val |= (value & info->mask) << info->shift;
+   dma_reg_write(ctlr, info->reg, val);
+
+   return 0;
+}
+
 struct cpdma_ctlr *cpdma_ctlr_create(struct cpdma_params *params)
 {
struct cpdma_ctlr *ctlr;
@@ -324,6 +372,10 @@ int cpdma_ctlr_start(struct cpdma_ctlr *ctlr)
if (ctlr->channels[i])
cpdma_chan_start(ctlr->channels[i]);
}
+
+   _cpdma_control_set(ctlr, CPDMA_TX_PRIO_FIXED, 1);
+   _cpdma_control_set(ctlr, CPDMA_RX_BUFFER_OFFSET, 0);
+
spin_unlock_irqrestore(&ctlr->lock, flags);
return 0;
 }
@@ -874,29 +926,6 @@ int cpdma_chan_int_ctrl(struct cpdma_chan *chan, bool enable)
return 0;
 }
 
-struct cpdma_control_info {
-   u32 reg;
-   u32 shift, mask;
-   int access;
-#define ACCESS_RO  BIT(0)
-#define ACCESS_WO  BIT(1)
-#define ACCESS_RW  (ACCESS_RO | ACCESS_WO)
-};
-
-static struct cpdma_control_info controls[] = {
-   [CPDMA_CMD_IDLE]  = {CPDMA_DMACONTROL,  3,  1,  ACCESS_WO},
-   [CPDMA_COPY_ERROR_FRAMES] = {CPDMA_DMACONTROL,  4,  1,  ACCESS_RW},
-
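
Note on the patch above: _cpdma_control_set() updates one field of a register with a read-modify-write using a shift and an unshifted mask, as listed in the controls[] table. The arithmetic can be checked in isolation with the sketch below. field_set() and field_get() are illustrative helpers written for this note and operate on a plain integer instead of a device register.

```c
#include <assert.h>
#include <stdint.h>

/* Replace the field described by (shift, mask) in reg with value.
 * mask is the unshifted field mask, so value is truncated to the
 * field width before being placed. */
static uint32_t field_set(uint32_t reg, unsigned int shift,
			  uint32_t mask, uint32_t value)
{
	assert(shift < 32);
	reg &= ~(mask << shift);
	reg |= (value & mask) << shift;
	return reg;
}

/* Extract the field described by (shift, mask) from reg. */
static uint32_t field_get(uint32_t reg, unsigned int shift, uint32_t mask)
{
	assert(shift < 32);
	return (reg >> shift) & mask;
}
```

For example, setting a one-bit field at shift 0 to 1 on a zeroed register yields 0x1, and writing 0x5 into a four-bit field at shift 12 of 0xffffffff yields 0xffff5fff while leaving all other bits untouched.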

[PATCH] vhost/scsi: Remove unused but set variable

2016-11-11 Thread Tobias Klauser

Remove the unused but set variable se_tpg in vhost_scsi_nexus_cb() to
fix the following GCC warning when building with 'W=1':

  drivers/vhost/scsi.c:1752:26: warning: variable ‘se_tpg’ set but not used

Signed-off-by: Tobias Klauser 
---
 drivers/vhost/scsi.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 6e29d053843d..e2be447752c2 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1749,7 +1749,6 @@ static int vhost_scsi_nexus_cb(struct se_portal_group *se_tpg,
 static int vhost_scsi_make_nexus(struct vhost_scsi_tpg *tpg,
const char *name)
 {
-   struct se_portal_group *se_tpg;
struct vhost_scsi_nexus *tv_nexus;
 
mutex_lock(&tpg->tv_tpg_mutex);
@@ -1758,7 +1757,6 @@ static int vhost_scsi_make_nexus(struct vhost_scsi_tpg *tpg,
pr_debug("tpg->tpg_nexus already exists\n");
return -EEXIST;
}
-   se_tpg = &tpg->se_tpg;
 
tv_nexus = kzalloc(sizeof(struct vhost_scsi_nexus), GFP_KERNEL);
if (!tv_nexus) {
-- 
2.11.0.rc0.7.gbe5a750

[PATCH] vhost/vsock: Remove unused but set variable

2016-11-11 Thread Tobias Klauser

Remove the unused but set variable vq in vhost_transport_send_pkt() to
fix the following GCC warning when building with 'W=1':

  drivers/vhost/vsock.c:198:26: warning: variable ‘vq’ set but not used

Signed-off-by: Tobias Klauser 
---
 drivers/vhost/vsock.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index e3b30ea9ece5..9c3c68b9a49e 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -195,7 +195,6 @@ static int
 vhost_transport_send_pkt(struct virtio_vsock_pkt *pkt)
 {
struct vhost_vsock *vsock;
-   struct vhost_virtqueue *vq;
int len = pkt->len;
 
/* Find the vhost_vsock according to guest context id  */
@@ -205,8 +204,6 @@ vhost_transport_send_pkt(struct virtio_vsock_pkt *pkt)
return -ENODEV;
}
 
-   vq = &vsock->vqs[VSOCK_VQ_RX];
-
if (pkt->reply)
atomic_inc(&vsock->queued_replies);
 
-- 
2.11.0.rc0.7.gbe5a750

[iproute PATCH 10/18] ss: Make user_ent_hash_build_init local to user_ent_hash_build()

2016-11-11 Thread Phil Sutter

By having it statically defined, there is no need for it to be global.

Signed-off-by: Phil Sutter 
---
 misc/ss.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 477910a842726..3e5c93bb7c6f9 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -99,8 +99,6 @@ int show_bpf;
 int show_proc_ctx;
 int show_sock_ctx;
 int show_header = 1;
-/* If show_users & show_proc_ctx only do user_ent_hash_build() once */
-int user_ent_hash_build_init;
 int follow_events;
 
 int netid_width;
@@ -400,6 +398,7 @@ static void user_ent_hash_build(void)
char *pid_context;
char *sock_context;
const char *no_ctx = "unavailable";
+   static int user_ent_hash_build_init;
 
/* If show_users & show_proc_ctx set only do this once */
if (user_ent_hash_build_init != 0)
-- 
2.10.0
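
Note on the patch above: a function-local static keeps its value between calls just like a file-scope variable, but is visible only inside the function that uses it. The sketch below shows the resulting "do the work once" guard. table_build() and build_calls are hypothetical names, and the counter exists only so the effect can be observed.

```c
#include <assert.h>

static int build_calls;  /* counts how often the expensive work really ran */

/* Build a lookup table on first use; later calls return immediately. */
static void table_build(void)
{
	static int table_built;  /* zero-initialised, persists across calls */

	if (table_built)
		return;
	table_built = 1;

	build_calls++;           /* stands in for the expensive construction */
	assert(build_calls == 1);
}
```

Calling table_build() any number of times runs the construction exactly once, and no other function can reset or inspect the flag by accident.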

[iproute PATCH 17/18] ss: Make sstate_namel local to scan_state()

2016-11-11 Thread Phil Sutter

Signed-off-by: Phil Sutter 
---
 misc/ss.c | 29 ++---
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 14492da256c61..6e669f7b0593c 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -634,21 +634,6 @@ static unsigned long long cookie_sk_get(const uint32_t *cookie)
return (((unsigned long long)cookie[1] << 31) << 1) | cookie[0];
 }
 
-static const char *sstate_namel[] = {
-   "UNKNOWN",
-   [SS_ESTABLISHED] = "established",
-   [SS_SYN_SENT] = "syn-sent",
-   [SS_SYN_RECV] = "syn-recv",
-   [SS_FIN_WAIT1] = "fin-wait-1",
-   [SS_FIN_WAIT2] = "fin-wait-2",
-   [SS_TIME_WAIT] = "time-wait",
-   [SS_CLOSE] = "unconnected",
-   [SS_CLOSE_WAIT] = "close-wait",
-   [SS_LAST_ACK] = "last-ack",
-   [SS_LISTEN] =   "listening",
-   [SS_CLOSING] = "closing",
-};
-
 struct sockstat {
struct sockstat*next;
unsigned inttype;
@@ -3698,6 +3683,20 @@ static void usage(void)
 
 static int scan_state(const char *state)
 {
+   static const char * const sstate_namel[] = {
+   "UNKNOWN",
+   [SS_ESTABLISHED] = "established",
+   [SS_SYN_SENT] = "syn-sent",
+   [SS_SYN_RECV] = "syn-recv",
+   [SS_FIN_WAIT1] = "fin-wait-1",
+   [SS_FIN_WAIT2] = "fin-wait-2",
+   [SS_TIME_WAIT] = "time-wait",
+   [SS_CLOSE] = "unconnected",
+   [SS_CLOSE_WAIT] = "close-wait",
+   [SS_LAST_ACK] = "last-ack",
+   [SS_LISTEN] =   "listening",
+   [SS_CLOSING] = "closing",
+   };
int i;
 
if (strcasecmp(state, "close") == 0 ||
-- 
2.10.0

Re: [patch net 2/2] mlxsw: spectrum_router: Correctly dump neighbour activity

2016-11-11 Thread Jiri Pirko

Fri, Nov 11, 2016 at 01:54:05PM CET, ido...@idosch.org wrote:
>On Fri, Nov 11, 2016 at 11:20:42AM +0100, Jiri Pirko wrote:
>> From: Arkadi Sharshevsky 
>> 
>> During neighbour activity check the device's table is dumped by multiple
>> query requests. The query session should end when the response carries
>> less records than requested or when a given record is not full. Current
>> code only stops the dumping process if the number of returned records is
>> zero, which can result in infinite loop in case of activity.
>> 
>> Fix this by stopping the dumping process according to the above logic.
>> 
>> Fixes: c723c735fa6b ("mlxsw: spectrum_router: Periodically update the kernel's neigh table")
>> Signed-off-by: Arkadi Sharshevsky 
>> Signed-off-by: Ido Schimmel 
>> Signed-off-by: Jiri Pirko 
>> ---
>>  .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 22 
>> +-
>>  1 file changed, 21 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>> index 040737e..d437457 100644
>> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>> @@ -800,6 +800,26 @@ static void mlxsw_sp_router_neigh_rec_process(struct mlxsw_sp *mlxsw_sp,
>>  }
>>  }
>>  
>> +static bool mlxsw_sp_router_rauhtd_is_full(char *rauhtd_pl)
>> +{
>> +u8 num_rec, last_rec_index, num_entries;
>> +
>> +num_rec = mlxsw_reg_rauhtd_num_rec_get(rauhtd_pl);
>> +last_rec_index = num_rec - 1;
>> +
>> +if (num_rec < MLXSW_REG_RAUHTD_REC_MAX_NUM)
>> +return false;
>> +if (mlxsw_reg_rauhtd_rec_type_get(rauhtd_pl, last_rec_index) ==
>> +MLXSW_REG_RAUHTD_TYPE_IPV6)
>> +return true;
>> +
>> +num_entries = mlxsw_reg_rauhtd_ipv4_rec_num_entries_get(rauhtd_pl,
>> +last_rec_index);
>> +if (++num_entries ==  MLXSW_REG_RAUHTD_IPV4_ENT_PER_REC)
>
>Jiri, I just noticed we have an extra space after the '=='. Can you
>please remove it in v2? Sorry for not spotting this earlier.


Will do.

[iproute PATCH 14/18] ss: Get rid of single-fielded struct snmpstat

2016-11-11 Thread Phil Sutter

A struct with only a single field does not make much sense. Besides
that, it was used by print_summary() only.

Signed-off-by: Phil Sutter 
---
 misc/ss.c | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index b06b9e6fa9884..85fc6096a986f 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -3472,10 +3472,6 @@ static int handle_follow_request(struct filter *f)
return ret;
 }
 
-struct snmpstat {
-   int tcp_estab;
-};
-
 static int get_snmp_int(char *proto, char *key, int *result)
 {
char buf[1024];
@@ -3596,11 +3592,11 @@ static int get_sockstat(struct ssummary *s)
 static int print_summary(void)
 {
struct ssummary s;
-   struct snmpstat sn;
+   int tcp_estab;
 
if (get_sockstat(&s) < 0)
perror("ss: get_sockstat");
-   if (get_snmp_int("Tcp:", "CurrEstab", &sn.tcp_estab) < 0)
+   if (get_snmp_int("Tcp:", "CurrEstab", &tcp_estab) < 0)
perror("ss: get_snmpstat");
 
get_slabstat(&slabstat);
@@ -3609,7 +3605,7 @@ static int print_summary(void)
 
printf("TCP:   %d (estab %d, closed %d, orphaned %d, synrecv %d, timewait %d/%d), ports %d\n",
   s.tcp_total + slabstat.tcp_syns + s.tcp_tws,
-  sn.tcp_estab,
+  tcp_estab,
   s.tcp_total - (s.tcp4_hashed+s.tcp6_hashed-s.tcp_tws),
   s.tcp_orphans,
   slabstat.tcp_syns,
-- 
2.10.0

Re: Long delays creating a netns after deleting one (possibly RCU related)

2016-11-11 Thread Rolf Neugebauer

On Thu, Nov 10, 2016 at 9:24 PM, Paul E. McKenney
 wrote:
> On Thu, Nov 10, 2016 at 09:37:47AM -0800, Cong Wang wrote:
>> (Cc'ing Paul)
>>
>> On Wed, Nov 9, 2016 at 7:42 AM, Rolf Neugebauer
>>  wrote:
>> > Hi
>> >
>> > We noticed some long delays starting docker containers on some newer
>> > kernels (starting with 4.5.x and still present in 4.9-rc4, 4.4.x is
>> > fine). We narrowed this down to the creation of a network namespace
>> > being delayed directly after removing another one (details and
>> > reproduction below). We have seen delays of up to 60s on some systems.
>> >
>> > - The delay is proportional to the number of CPUs (online or offline).
>> > We first discovered it with a Hyper-V Linux VM. Hyper-V advertises up
>> > to 240 offline vCPUs even if one configures the VM with only, say 2
>> > vCPUs. We see linear increase in delay when we change NR_CPUS in the
>> > kernel config.
>> >
>> > - The delay is also dependent on some tunnel network interfaces being
>> > present (which we had compiled in in one of our kernel configs).
>> >
>> > - We can reproduce this issue with stock kernels from
>> > http://kernel.ubuntu.com/~kernel-ppa/mainline/ running in Hyper-V VMs
>> > as well as other hypervisors like qemu and hyperkit where we have good
>> > control over the number of CPUs.
>> >
>> > A simple test is:
>> > modprobe ipip
>> > modprobe ip_gre
>> > modprobe ip_vti
>> > echo -n "add netns foo ===> "; /usr/bin/time -f "%E" ip netns add foo
>> > echo -n "del netns foo ===> "; /usr/bin/time -f "%E" ip netns delete foo
>> > echo -n "add netns bar ===> "; /usr/bin/time -f "%E" ip netns add bar
>> > echo -n "del netns bar ===> "; /usr/bin/time -f "%E" ip netns delete bar
>> >
>> > with an output like:
>> > add netns foo ===> 0:00.00
>> > del netns foo ===> 0:00.01
>> > add netns bar ===> 0:08.53
>> > del netns bar ===> 0:00.01
>> >
>> > This is on a 4.9-rc4 kernel from the above URL configured with
>> > NR_CPUS=256 running in a Hyper-V VM (kernel config attached).
>> >
>> > Below is a dump of the work queues while the second 'ip netns add' is
>> > hanging. The state of the work queues does not seem to change while
>> > the command is delayed and the pattern shown is consistent across
>> > different kernel versions.
>> >
>> > Is this a known issue and/or is someone working on a fix?
>>
>> Not to me.
>>
>>
>> >
>> > [  610.356272] sysrq: SysRq : Show Blocked State
>> > [  610.356742]   taskPC stack   pid father
>> > [  610.357252] kworker/u480:1  D0  1994  2 0x
>> > [  610.357752] Workqueue: netns cleanup_net
>> > [  610.358239]  9892f1065800  9892ee1e1e00
>> > 9892f8e59340
>> > [  610.358705]  9892f4526900 bf0104b5ba88 be486df3
>> > bf0104b5ba60
>> > [  610.359168]  00ffbdcbe663 9892f8e59340 000100012e70
>> > 9892ee1e1e00
>> > [  610.359677] Call Trace:
>> > [  610.360169]  [] ? __schedule+0x233/0x6e0
>> > [  610.360723]  [] schedule+0x36/0x80
>> > [  610.361194]  [] schedule_timeout+0x22a/0x3f0
>> > [  610.361789]  [] ? __schedule+0x23b/0x6e0
>> > [  610.362260]  [] wait_for_completion+0xb4/0x140
>> > [  610.362736]  [] ? wake_up_q+0x80/0x80
>> > [  610.363306]  [] __wait_rcu_gp+0xc8/0xf0
>> > [  610.363782]  [] synchronize_sched+0x5c/0x80
>> > [  610.364137]  [] ? call_rcu_bh+0x20/0x20
>> > [  610.364742]  [] ?
>> > trace_raw_output_rcu_utilization+0x60/0x60
>> > [  610.365337]  [] synchronize_net+0x1c/0x30
>>
>> This is a worker which holds the net_mutex and is waiting for
>> a RCU grace period to elapse.
>>
>>
>> > [  610.365846]  [] netif_napi_del+0x23/0x80
>> > [  610.367494]  [] ip_tunnel_dev_free+0x68/0xf0 
>> > [ip_tunnel]
>> > [  610.368007]  [] netdev_run_todo+0x230/0x330
>> > [  610.368454]  [] rtnl_unlock+0xe/0x10
>> > [  610.369001]  [] ip_tunnel_delete_net+0xdf/0x120 
>> > [ip_tunnel]
>> > [  610.369500]  [] ipip_exit_net+0x2c/0x30 [ipip]
>> > [  610.369997]  [] ops_exit_list.isra.4+0x38/0x60
>> > [  610.370636]  [] cleanup_net+0x1c4/0x2b0
>> > [  610.371130]  [] process_one_work+0x1fc/0x4b0
>> > [  610.371812]  [] worker_thread+0x4b/0x500
>> > [  610.373074]  [] ? process_one_work+0x4b0/0x4b0
>> > [  610.373622]  [] ? process_one_work+0x4b0/0x4b0
>> > [  610.374100]  [] kthread+0xd9/0xf0
>> > [  610.374574]  [] ? kthread_park+0x60/0x60
>> > [  610.375198]  [] ret_from_fork+0x25/0x30
>> > [  610.375678] ip  D0  2149   2148 0x
>> > [  610.376185]  9892f0a99000  9892f0a66900
>> > [  610.376185]  9892f8e59340
>> > [  610.376185]  9892f4526900 bf0101173db8 be486df3
>> > [  610.376753]  0005fecffd76
>> > [  610.376762]  00ff9892f11d9820 9892f8e59340 9892
>> > 9892f0a66900
>> > [  610.377274] Call Trace:
>> > [  610.377789]  [] ? __schedule+0x233/0x6e0
>> > [  610.378306]  [] schedule+0x36/0x80
>> > [  610.378992]  [] schedule_preempt_disabled+0xe/0x10

[iproute PATCH 02/18] ss: Drop empty lines in UDP output

2016-11-11 Thread Phil Sutter

When dumping UDP sockets and show_tcpinfo (-i) is active but not
show_mem (-m), print_tcpinfo() does not output anything leading to an
empty line being printed after every socket. Fix this by skipping the
call to print_tcpinfo() and the previous newline printing in that case.

Signed-off-by: Phil Sutter 
---
 misc/ss.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/misc/ss.c b/misc/ss.c
index c20bfbdb01c62..4698683c4e84a 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -2240,7 +2240,7 @@ static int inet_show_sock(struct nlmsghdr *nlh,
}
}
 
-   if (show_mem || show_tcpinfo) {
+   if (show_mem || (show_tcpinfo && protocol != IPPROTO_UDP)) {
printf("\n\t");
tcp_show_info(nlh, r, tb);
}
-- 
2.10.0
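
For readers following along, the changed condition can be read as a small predicate. This is only a sketch: needs_info_line() is a hypothetical helper that does not exist in misc/ss.c, and the two flags are passed as parameters here instead of being read from the show_mem and show_tcpinfo globals.

```c
#include <netinet/in.h>   /* IPPROTO_UDP, IPPROTO_TCP */
#include <stdbool.h>

/* Hypothetical helper, not part of misc/ss.c: returns true when the
 * extra "\n\t" line should be started for a socket of this protocol. */
static bool needs_info_line(bool show_mem, bool show_tcpinfo, int protocol)
{
	/* -i alone produces no output for UDP, so it must not trigger
	 * the newline; -m still does for every protocol. */
	return show_mem || (show_tcpinfo && protocol != IPPROTO_UDP);
}
```

With only -i given, a UDP socket yields false and a TCP socket yields true; with -m given the result is true regardless of protocol.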

Re: [patch net 1/2] mlxsw: spectrum: Fix refcount bug on span entries

2016-11-11 Thread Jiri Pirko

Fri, Nov 11, 2016 at 01:49:28PM CET, ido...@idosch.org wrote:
>On Fri, Nov 11, 2016 at 11:20:41AM +0100, Jiri Pirko wrote:
>> From: Yotam Gigi 
>> 
>> When binding port to a newly created span entry, its refcount is set 0
>> even though it has a bound port. That leeds to unexpected behaviour when
>
>s/leeds/leads/
>
>> the user tries to delete that port from the span entry.
>> 
>> Change the binding process to increase the refcount of the bound entry
>> even if the entry is newly created, and add warning on the process of
>> removing bound port from entry when its refcount is 0.
>> 
>> Fixes: 763b4b70afcd3 ("mlxsw: spectrum: Add support in matchall mirror TC 
>> offloading")
>
>You only need the first 12 characters.
>
>> Signed-off-by: Yotam Gigi 
>> Signed-off-by: Jiri Pirko 
>> ---
>>  drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 11 ++-
>>  1 file changed, 6 insertions(+), 5 deletions(-)
>> 
>> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
>> b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
>> index 1ec0a4c..d75c1ff 100644
>> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
>> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
>> @@ -269,17 +269,18 @@ static struct mlxsw_sp_span_entry
>>  struct mlxsw_sp_span_entry *span_entry;
>>  
>>  span_entry = mlxsw_sp_span_entry_find(port);
>> -if (span_entry) {
>> -span_entry->ref_count++;
>> -return span_entry;
>> -}
>> +if (!span_entry)
>> +span_entry = mlxsw_sp_span_entry_create(port);
>>  
>> -return mlxsw_sp_span_entry_create(port);
>> +span_entry->ref_count++;
>
>mlxsw_sp_span_entry_create() can return NULL. You can look at
>mlxsw_sp_fib_entry_get() for reference.

Right, missed that. Will fix. Thanks.
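
The point raised in the review can be sketched outside the driver as a get-or-create helper that only touches the reference count once it holds a valid entry. Everything below is a toy illustration under assumed names: struct span_entry, the four-slot table and the span_entry_*() functions are hypothetical stand-ins, not the mlxsw code.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the driver's span entry. */
struct span_entry {
	int port;
	int ref_count;
};

#define TABLE_SIZE 4
static struct span_entry *table[TABLE_SIZE];   /* toy table, illustration only */

static struct span_entry *span_entry_find(int port)
{
	for (size_t i = 0; i < TABLE_SIZE; i++)
		if (table[i] && table[i]->port == port)
			return table[i];
	return NULL;
}

/* May fail (table full or allocation failure) and return NULL. */
static struct span_entry *span_entry_create(int port)
{
	for (size_t i = 0; i < TABLE_SIZE; i++) {
		if (!table[i]) {
			table[i] = calloc(1, sizeof(*table[i]));
			if (table[i])
				table[i]->port = port;
			return table[i];
		}
	}
	return NULL;
}

static struct span_entry *span_entry_get(int port)
{
	struct span_entry *entry = span_entry_find(port);

	if (!entry) {
		entry = span_entry_create(port);
		if (!entry)
			return NULL;   /* never dereference a failed create */
	}
	entry->ref_count++;    /* counted for new and existing entries alike */
	return entry;
}
```

A newly created entry ends up with ref_count 1, a second get on the same port raises it to 2, and a get that cannot create an entry returns NULL without dereferencing anything.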

[iproute PATCH 07/18] ss: Eliminate unix_use_proc()

2016-11-11 Thread Phil Sutter

This function is now used in only a single place, so replace the
call to it with its content, which makes that specific part of
unix_show() consistent with e.g. tcp_show().

Signed-off-by: Phil Sutter 
---
 misc/ss.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 36b18ff2ce3cb..8e021731cf71c 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -2808,11 +2808,6 @@ static bool unix_type_skip(struct sockstat *s, struct 
filter *f)
return false;
 }
 
-static bool unix_use_proc(void)
-{
-   return getenv("PROC_NET_UNIX") || getenv("PROC_ROOT");
-}
-
 static void unix_stats_print(struct sockstat *s, struct filter *f)
 {
char port_name[30] = {};
@@ -2934,7 +2929,8 @@ static int unix_show(struct filter *f)
if (!filter_af_get(f, AF_UNIX))
return 0;
 
-   if (!unix_use_proc() && unix_show_netlink(f) == 0)
+   if (!getenv("PROC_NET_UNIX") && !getenv("PROC_ROOT")
+   && unix_show_netlink(f) == 0)
return 0;
 
if ((fp = net_unix_open()) == NULL)
-- 
2.10.0
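
The inlined condition is the De Morgan form of the removed helper. A minimal sketch of that equivalence, assuming the two environment lookups are passed in as plain pointers (NULL meaning "not set"); both function names are hypothetical and exist only for this comparison:

```c
#include <stdbool.h>
#include <stddef.h>

/* Old shape: ask a helper whether procfs is forced, then negate. */
static bool try_netlink_old(const char *proc_net_unix, const char *proc_root)
{
	bool use_proc = proc_net_unix || proc_root;

	return !use_proc;
}

/* New shape: the helper's body inlined, with the negation pushed inward. */
static bool try_netlink_new(const char *proc_net_unix, const char *proc_root)
{
	return !proc_net_unix && !proc_root;
}
```

In ss the arguments would correspond to getenv("PROC_NET_UNIX") and getenv("PROC_ROOT"); netlink is tried only when neither is set.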

[iproute PATCH 13/18] ss: Get rid of useless goto in handle_follow_request()

2016-11-11 Thread Phil Sutter

Signed-off-by: Phil Sutter 
---
 misc/ss.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index c3a5148e05013..b06b9e6fa9884 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -3443,7 +3443,7 @@ static int generic_show_sock(const struct sockaddr_nl 
*addr,
 
 static int handle_follow_request(struct filter *f)
 {
-   int ret = -1;
+   int ret = 0;
int groups = 0;
struct rtnl_handle rth;
 
@@ -3466,10 +3466,8 @@ static int handle_follow_request(struct filter *f)
rth.local.nl_pid = 0;
 
if (rtnl_dump_filter(&rth, generic_show_sock, f))
-   goto Exit;
+   ret = -1;
 
-   ret = 0;
-Exit:
rtnl_close(&rth);
return ret;
 }
-- 
2.10.0
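
Since this patch carries no description, here is a rough sketch of the resulting control flow: start from success, flip the return value on failure, and let both paths run the same cleanup. The dump() and close_handle() functions are hypothetical stand-ins for rtnl_dump_filter() and rtnl_close(), and the counter exists only to make the cleanup observable.

```c
/* Hypothetical stand-in for rtnl_dump_filter(): nonzero means failure. */
static int dump(int fail)
{
	return fail;
}

static int close_calls;   /* counts cleanups, for illustration only */

/* Hypothetical stand-in for rtnl_close(). */
static void close_handle(void)
{
	close_calls++;
}

static int follow_request(int fail)
{
	int ret = 0;          /* assume success up front */

	if (dump(fail))
		ret = -1;     /* record the failure, keep going */

	close_handle();       /* shared by both paths, no label needed */
	return ret;
}
```

Both the success and the failure path reach close_handle() exactly once, which is what the goto/label pair achieved before.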

[iproute PATCH 03/18] ss: Add missing tab when printing UNIX details

2016-11-11 Thread Phil Sutter

When dumping UNIX sockets and show_details is active but not show_mem
(ss -xne), the socket details are printed without being prefixed by tab.
Fix this by printing the tab character when either one of '-e' or '-m'
has been specified.

Signed-off-by: Phil Sutter 
---
 misc/ss.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 4698683c4e84a..d0b4f879c4d9f 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -2907,10 +2907,10 @@ static int unix_show_sock(const struct sockaddr_nl 
*addr, struct nlmsghdr *nlh,
 
unix_stats_print(, f);
 
-   if (show_mem) {
+   if (show_mem || show_details)
printf("\t");
+   if (show_mem)
print_skmeminfo(tb, UNIX_DIAG_MEMINFO);
-   }
if (show_details) {
if (tb[UNIX_DIAG_SHUTDOWN]) {
unsigned char mask;
-- 
2.10.0

[iproute PATCH 06/18] ss: Drop list traversal from unix_stats_print()

2016-11-11 Thread Phil Sutter

Although this complicates the dedicated procfs-based code path in
unix_show() a bit, it's the only sane way to get rid of unix_show_sock()
output diverging from other socket types in that it prints all socket
details in a new line.

As a side effect, this allows eliminating all procfs-specific code in
the same function.

Signed-off-by: Phil Sutter 
---
 misc/ss.c | 137 +-
 1 file changed, 64 insertions(+), 73 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index b3475cc96ae7b..36b18ff2ce3cb 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -2788,15 +2788,13 @@ int unix_state_map[] = { SS_CLOSE, SS_SYN_SENT,
 
 #define MAX_UNIX_REMEMBER (1024*1024/sizeof(struct sockstat))
 
-static void unix_list_free(struct sockstat *list)
+static void unix_list_drop_first(struct sockstat **list)
 {
-   while (list) {
-   struct sockstat *s = list;
+   struct sockstat *s = *list;
 
-   list = list->next;
-   free(s->name);
-   free(s);
-   }
+   (*list) = (*list)->next;
+   free(s->name);
+   free(s);
 }
 
 static bool unix_type_skip(struct sockstat *s, struct filter *f)
@@ -2815,61 +2813,18 @@ static bool unix_use_proc(void)
return getenv("PROC_NET_UNIX") || getenv("PROC_ROOT");
 }
 
-static void unix_stats_print(struct sockstat *list, struct filter *f)
+static void unix_stats_print(struct sockstat *s, struct filter *f)
 {
-   struct sockstat *s;
-   char *peer;
-   bool use_proc = unix_use_proc();
char port_name[30] = {};
 
-   for (s = list; s; s = s->next) {
-   if (!(f->states & (1 << s->state)))
-   continue;
-   if (unix_type_skip(s, f))
-   continue;
-
-   peer = "*";
-   if (s->peer_name)
-   peer = s->peer_name;
-
-   if (s->rport && use_proc) {
-   struct sockstat *p;
-
-   for (p = list; p; p = p->next) {
-   if (s->rport == p->lport)
-   break;
-   }
-
-   if (!p) {
-   peer = "?";
-   } else {
-   peer = p->name ? : "*";
-   }
-   }
-
-   if (use_proc && f->f) {
-   struct sockstat st = {
-   .local.family = AF_UNIX,
-   .remote.family = AF_UNIX,
-   };
-
-   memcpy(st.local.data, &s->name, sizeof(s->name));
-   if (strcmp(peer, "*"))
-   memcpy(st.remote.data, &peer, sizeof(peer));
-   if (run_ssfilter(f->f, &st) == 0)
-   continue;
-   }
-
-   sock_state_print(s);
+   sock_state_print(s);
 
-   sock_addr_print(s->name ?: "*", " ",
-   int_to_str(s->lport, port_name), NULL);
-   sock_addr_print(peer, " ", int_to_str(s->rport, port_name),
-   NULL);
+   sock_addr_print(s->name ?: "*", " ",
+   int_to_str(s->lport, port_name), NULL);
+   sock_addr_print(s->peer_name ?: "*", " ",
+   int_to_str(s->rport, port_name), NULL);
 
-   proc_ctx_print(s);
-   printf("\n");
-   }
+   proc_ctx_print(s);
 }
 
 static int unix_show_sock(const struct sockaddr_nl *addr, struct nlmsghdr *nlh,
@@ -2916,8 +2871,6 @@ static int unix_show_sock(const struct sockaddr_nl *addr, 
struct nlmsghdr *nlh,
 
unix_stats_print(, f);
 
-   if (show_mem || show_details)
-   printf("\t");
if (show_mem)
print_skmeminfo(tb, UNIX_DIAG_MEMINFO);
if (show_details) {
@@ -2928,8 +2881,7 @@ static int unix_show_sock(const struct sockaddr_nl *addr, 
struct nlmsghdr *nlh,
printf(" %c-%c", mask & 1 ? '-' : '<', mask & 2 ? '-' : 
'>');
}
}
-   if (show_mem || show_details)
-   printf("\n");
+   printf("\n");
 
return 0;
 }
@@ -3020,6 +2972,11 @@ static int unix_show(struct filter *f)
if (u->type == SOCK_DGRAM && u->state == SS_CLOSE && 
u->rport)
u->state = SS_ESTABLISHED;
}
+   if (unix_type_skip(u, f) ||
+   !(f->states & (1 << u->state))) {
+   free(u);
+   continue;
+   }
 
if (!newformat) {
u->rport = 0;
@@ -3027,6 +2984,42 @@ static int unix_show(struct filter *f)
u->wq = 0;
}
 
+   if (name[0]) {
+   u->name = strdup(name);
+
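
The single-element drop that replaces unix_list_free() can be sketched on its own. This is an illustration under assumptions: struct node is a hypothetical two-member stand-in for struct sockstat, and list_push() is a helper added only so the sketch has something to drop.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct sockstat: link plus an owned string. */
struct node {
	struct node *next;
	char *name;
};

/* Unlink and free the head. The caller must ensure *list is not NULL. */
static void list_drop_first(struct node **list)
{
	struct node *n = *list;

	*list = n->next;
	free(n->name);
	free(n);
}

/* Helper for the sketch only: prepend a copy of name, return new head. */
static struct node *list_push(struct node *head, const char *name)
{
	struct node *n = calloc(1, sizeof(*n));
	size_t len = strlen(name) + 1;

	if (!n)
		return head;
	n->name = malloc(len);
	if (n->name)
		memcpy(n->name, name, len);
	n->next = head;
	return n;
}
```

Taking a pointer to the head pointer lets the caller's list variable advance in place, so freeing a whole list becomes a loop of `while (list) list_drop_first(&list);`.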

[iproute PATCH 18/18] ss: unix_show: No need to initialize members of calloc'ed structs

2016-11-11 Thread Phil Sutter

Signed-off-by: Phil Sutter 
---
 misc/ss.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 6e669f7b0593c..1e3ccf28c4e84 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -2877,8 +2877,6 @@ static int unix_show(struct filter *f)
 
if (!(u = calloc(1, sizeof(*u))))
break;
-   u->name = NULL;
-   u->peer_name = NULL;
 
if (sscanf(buf, "%x: %x %x %x %x %x %d %s",
   >rport, >rq, >wq, , >type,
-- 
2.10.0
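
As a short illustration of why the two assignments are redundant: calloc() hands back zero-filled memory, so pointer members already read as NULL on the platforms Linux runs on. struct entry below is a hypothetical reduction of struct sockstat to the members that matter here.

```c
#include <stdlib.h>

/* Hypothetical reduction of struct sockstat. */
struct entry {
	char *name;
	char *peer_name;
	int state;
};

static struct entry *entry_alloc(void)
{
	/* Zero-filled allocation: no explicit "= NULL" needed afterwards.
	 * (Strictly, ISO C does not promise that all-bits-zero is a null
	 * pointer, but every Linux target behaves that way.) */
	return calloc(1, sizeof(struct entry));
}
```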

[iproute PATCH 15/18] ss: Make unix_state_map local to unix_show()

2016-11-11 Thread Phil Sutter

Also make it const, since there won't be any write access happening.

Signed-off-by: Phil Sutter 
---
 misc/ss.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 85fc6096a986f..cf4187310816e 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -2727,9 +2727,6 @@ outerr:
} while (0);
 }
 
-int unix_state_map[] = { SS_CLOSE, SS_SYN_SENT,
-SS_ESTABLISHED, SS_CLOSING };
-
 #define MAX_UNIX_REMEMBER (1024*1024/sizeof(struct sockstat))
 
 static void unix_list_drop_first(struct sockstat **list)
@@ -2869,6 +2866,8 @@ static int unix_show(struct filter *f)
int  newformat = 0;
int  cnt;
struct sockstat *list = NULL;
+   const int unix_state_map[] = { SS_CLOSE, SS_SYN_SENT,
+  SS_ESTABLISHED, SS_CLOSING };
 
if (!filter_af_get(f, AF_UNIX))
return 0;
-- 
2.10.0

[iproute PATCH 16/18] ss: Make sstate_name local to sock_state_print()

2016-11-11 Thread Phil Sutter

Signed-off-by: Phil Sutter 
---
 misc/ss.c | 29 ++---
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index cf4187310816e..14492da256c61 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -634,21 +634,6 @@ static unsigned long long cookie_sk_get(const uint32_t 
*cookie)
return (((unsigned long long)cookie[1] << 31) << 1) | cookie[0];
 }
 
-static const char *sstate_name[] = {
-   "UNKNOWN",
-   [SS_ESTABLISHED] = "ESTAB",
-   [SS_SYN_SENT] = "SYN-SENT",
-   [SS_SYN_RECV] = "SYN-RECV",
-   [SS_FIN_WAIT1] = "FIN-WAIT-1",
-   [SS_FIN_WAIT2] = "FIN-WAIT-2",
-   [SS_TIME_WAIT] = "TIME-WAIT",
-   [SS_CLOSE] = "UNCONN",
-   [SS_CLOSE_WAIT] = "CLOSE-WAIT",
-   [SS_LAST_ACK] = "LAST-ACK",
-   [SS_LISTEN] =   "LISTEN",
-   [SS_CLOSING] = "CLOSING",
-};
-
 static const char *sstate_namel[] = {
"UNKNOWN",
[SS_ESTABLISHED] = "established",
@@ -769,6 +754,20 @@ static char *proto_name(int protocol)
 static void sock_state_print(struct sockstat *s)
 {
const char *sock_name;
+   static const char * const sstate_name[] = {
+   "UNKNOWN",
+   [SS_ESTABLISHED] = "ESTAB",
+   [SS_SYN_SENT] = "SYN-SENT",
+   [SS_SYN_RECV] = "SYN-RECV",
+   [SS_FIN_WAIT1] = "FIN-WAIT-1",
+   [SS_FIN_WAIT2] = "FIN-WAIT-2",
+   [SS_TIME_WAIT] = "TIME-WAIT",
+   [SS_CLOSE] = "UNCONN",
+   [SS_CLOSE_WAIT] = "CLOSE-WAIT",
+   [SS_LAST_ACK] = "LAST-ACK",
+   [SS_LISTEN] =   "LISTEN",
+   [SS_CLOSING] = "CLOSING",
+   };
 
switch (s->local.family) {
case AF_UNIX:
-- 
2.10.0
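
The pattern used here, a function-local static const table filled with designated initializers, can be sketched as follows. The ST_* enum values are assumptions made up for this example and do not match the SS_* constants in ss.c, and the bounds check is an addition of the sketch, not something the patch introduces.

```c
#include <stddef.h>

/* Hypothetical state numbers, chosen only for this sketch. */
enum { ST_UNKNOWN, ST_ESTABLISHED, ST_SYN_SENT, ST_CLOSE = 7, ST_MAX };

static const char *state_name(int state)
{
	/* Visible only inside this function; slots that are not named
	 * explicitly are left as NULL by the compiler. */
	static const char * const names[ST_MAX] = {
		"UNKNOWN",
		[ST_ESTABLISHED] = "ESTAB",
		[ST_SYN_SENT] = "SYN-SENT",
		[ST_CLOSE] = "UNCONN",
	};

	if (state < 0 || state >= ST_MAX || !names[state])
		return names[ST_UNKNOWN];
	return names[state];
}
```

Because the table is static, it is built once at compile time and not on every call, so moving it into the function costs nothing at run time.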

[iproute PATCH 11/18] ss: Make some variables function-local

2016-11-11 Thread Phil Sutter

addrp_width and screen_width are used in main() only, so no need to have
them globally available.

Signed-off-by: Phil Sutter 
---
 misc/ss.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 3e5c93bb7c6f9..d546a00eb2c24 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -103,10 +103,8 @@ int follow_events;
 
 int netid_width;
 int state_width;
-int addrp_width;
 int addr_width;
 int serv_width;
-int screen_width;
 
 static const char *TCP_PROTO = "tcp";
 static const char *UDP_PROTO = "udp";
@@ -3784,6 +3782,7 @@ int main(int argc, char *argv[])
FILE *filter_fp = NULL;
int ch;
int state_filter = 0;
+   int addrp_width, screen_width = 80;
 
while ((ch = getopt_long(argc, argv, 
"dhaletuwxnro460spbEf:miA:D:F:vVzZN:KH",
 long_opts, NULL)) != EOF) {
@@ -4067,7 +4066,6 @@ int main(int argc, char *argv[])
if (current_filter.states&(current_filter.states-1))
state_width = 10;
 
-   screen_width = 80;
if (isatty(STDOUT_FILENO)) {
struct winsize w;
 
-- 
2.10.0

[iproute PATCH 08/18] ss: Turn generic_proc_open() wrappers into macros

2016-11-11 Thread Phil Sutter

Signed-off-by: Phil Sutter 
---
 misc/ss.c | 89 ++-
 1 file changed, 19 insertions(+), 70 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 8e021731cf71c..e9fecd39a8493 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -306,76 +306,25 @@ static FILE *generic_proc_open(const char *env, const 
char *name)
 
return fopen(p, "r");
 }
-
-static FILE *net_tcp_open(void)
-{
-   return generic_proc_open("PROC_NET_TCP", "net/tcp");
-}
-
-static FILE *net_tcp6_open(void)
-{
-   return generic_proc_open("PROC_NET_TCP6", "net/tcp6");
-}
-
-static FILE *net_udp_open(void)
-{
-   return generic_proc_open("PROC_NET_UDP", "net/udp");
-}
-
-static FILE *net_udp6_open(void)
-{
-   return generic_proc_open("PROC_NET_UDP6", "net/udp6");
-}
-
-static FILE *net_raw_open(void)
-{
-   return generic_proc_open("PROC_NET_RAW", "net/raw");
-}
-
-static FILE *net_raw6_open(void)
-{
-   return generic_proc_open("PROC_NET_RAW6", "net/raw6");
-}
-
-static FILE *net_unix_open(void)
-{
-   return generic_proc_open("PROC_NET_UNIX", "net/unix");
-}
-
-static FILE *net_packet_open(void)
-{
-   return generic_proc_open("PROC_NET_PACKET", "net/packet");
-}
-
-static FILE *net_netlink_open(void)
-{
-   return generic_proc_open("PROC_NET_NETLINK", "net/netlink");
-}
-
-static FILE *slabinfo_open(void)
-{
-   return generic_proc_open("PROC_SLABINFO", "slabinfo");
-}
-
-static FILE *net_sockstat_open(void)
-{
-   return generic_proc_open("PROC_NET_SOCKSTAT", "net/sockstat");
-}
-
-static FILE *net_sockstat6_open(void)
-{
-   return generic_proc_open("PROC_NET_SOCKSTAT6", "net/sockstat6");
-}
-
-static FILE *net_snmp_open(void)
-{
-   return generic_proc_open("PROC_NET_SNMP", "net/snmp");
-}
-
-static FILE *ephemeral_ports_open(void)
-{
-   return generic_proc_open("PROC_IP_LOCAL_PORT_RANGE", 
"sys/net/ipv4/ip_local_port_range");
-}
+#define net_tcp_open() generic_proc_open("PROC_NET_TCP", "net/tcp")
+#define net_tcp6_open()generic_proc_open("PROC_NET_TCP6", 
"net/tcp6")
+#define net_udp_open() generic_proc_open("PROC_NET_UDP", "net/udp")
+#define net_udp6_open()generic_proc_open("PROC_NET_UDP6", 
"net/udp6")
+#define net_raw_open() generic_proc_open("PROC_NET_RAW", "net/raw")
+#define net_raw6_open()generic_proc_open("PROC_NET_RAW6", 
"net/raw6")
+#define net_unix_open()generic_proc_open("PROC_NET_UNIX", 
"net/unix")
+#define net_packet_open()  generic_proc_open("PROC_NET_PACKET", \
+   "net/packet")
+#define net_netlink_open() generic_proc_open("PROC_NET_NETLINK", \
+   "net/netlink")
+#define slabinfo_open()generic_proc_open("PROC_SLABINFO", 
"slabinfo")
+#define net_sockstat_open()generic_proc_open("PROC_NET_SOCKSTAT", \
+   "net/sockstat")
+#define net_sockstat6_open()   generic_proc_open("PROC_NET_SOCKSTAT6", \
+   "net/sockstat6")
+#define net_snmp_open()generic_proc_open("PROC_NET_SNMP", 
"net/snmp")
+#define ephemeral_ports_open() generic_proc_open("PROC_IP_LOCAL_PORT_RANGE", \
+   "sys/net/ipv4/ip_local_port_range")
 
 struct user_ent {
struct user_ent *next;
-- 
2.10.0
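
A rough sketch of the wrapper-macro idea, reduced to path construction so it can be tried without a real /proc. The lookup order shown (per-file override first, then a root directory defaulting to "/proc") is an assumption for illustration; only the last line of generic_proc_open() is visible in this patch. proc_path() and the *_path() macros are hypothetical names.

```c
#include <stdio.h>

/* Build the path into buf and return snprintf()'s result.
 * override: full path replacing everything, or NULL
 * root:     alternative procfs root, or NULL for "/proc" */
static int proc_path(const char *override, const char *root,
		     const char *name, char *buf, size_t len)
{
	if (override)
		return snprintf(buf, len, "%s", override);
	return snprintf(buf, len, "%s/%s", root ? root : "/proc", name);
}

/* One-line wrappers that only pin the file name, in the spirit of the
 * macros introduced by the patch. */
#define net_tcp_path(o, r, b, l)	proc_path((o), (r), "net/tcp", (b), (l))
#define net_udp_path(o, r, b, l)	proc_path((o), (r), "net/udp", (b), (l))
```

Compared to one static function per file, the macros cost no extra symbols and keep each mapping on a single line, at the price of losing a distinct function name in backtraces.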

[iproute PATCH 01/18] ss: Mark fall through in arg parsing switch()

2016-11-11 Thread Phil Sutter

As there is a certain chance of overlooking this, better add a comment
to draw readers' attention.

Signed-off-by: Phil Sutter 
---
 misc/ss.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/misc/ss.c b/misc/ss.c
index dd77b8153b6da..c20bfbdb01c62 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -4025,6 +4025,7 @@ int main(int argc, char *argv[])
exit(0);
case 'z':
show_sock_ctx++;
+   /* fall through */
case 'Z':
if (is_selinux_enabled() <= 0) {
fprintf(stderr, "ss: SELinux is not 
enabled.\n");
-- 
2.10.0
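
For illustration, the effect of the intentional fall-through can be sketched with a reduced option parser. struct opts and parse_flag() are hypothetical; in particular, incrementing show_proc_ctx under 'Z' is an assumption of this sketch, since the hunk above only shows the SELinux check in that case.

```c
/* Hypothetical reduction of the ss option state. */
struct opts {
	int show_sock_ctx;
	int show_proc_ctx;
};

static void parse_flag(struct opts *o, int ch)
{
	switch (ch) {
	case 'z':
		o->show_sock_ctx++;
		/* fall through */
	case 'Z':
		/* shared by -z and -Z */
		o->show_proc_ctx++;
		break;
	default:
		break;
	}
}
```

The comment documents that -z deliberately implies everything -Z does, and it is also the form compilers recognise when warning about implicit fall-through.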

[iproute PATCH 00/18] ss: Minor code review

2016-11-11 Thread Phil Sutter

This is a series of misc changes to ss code which happened as fall-out
when working on a unified output formatter (still unfinished).

Phil Sutter (18):
  ss: Mark fall through in arg parsing switch()
  ss: Drop empty lines in UDP output
  ss: Add missing tab when printing UNIX details
  ss: Use sockstat->type in all socket types
  ss: introduce proc_ctx_print()
  ss: Drop list traversal from unix_stats_print()
  ss: Eliminate unix_use_proc()
  ss: Turn generic_proc_open() wrappers into macros
  ss: Make tmr_name local to tcp_timer_print()
  ss: Make user_ent_hash_build_init local to user_ent_hash_build()
  ss: Make some variables function-local
  ss: Make slabstat_ids local to get_slabstat()
  ss: Get rid of useless goto in handle_follow_request()
  ss: Get rid of single-fielded struct snmpstat
  ss: Make unix_state_map local to unix_show()
  ss: Make sstate_name local to sock_state_print()
  ss: Make sstate_namel local to scan_state()
  ss: unix_show: No need to initialize members of calloc'ed structs

 misc/ss.c | 524 ++
 1 file changed, 220 insertions(+), 304 deletions(-)

-- 
2.10.0

[iproute PATCH 09/18] ss: Make tmr_name local to tcp_timer_print()

2016-11-11 Thread Phil Sutter

It's used only there, so no need to have it globally defined.

Signed-off-by: Phil Sutter 
---
 misc/ss.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index e9fecd39a8493..477910a842726 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -830,15 +830,6 @@ static void sock_addr_print(const char *addr, char *delim, 
const char *port,
sock_addr_print_width(addr_width, addr, delim, serv_width, port, 
ifname);
 }
 
-static const char *tmr_name[] = {
-   "off",
-   "on",
-   "keepalive",
-   "timewait",
-   "persist",
-   "unknown"
-};
-
 static const char *print_ms_timer(int timeout)
 {
static char buf[64];
@@ -1879,6 +1870,15 @@ static void tcp_stats_print(struct tcpstat *s)
 
 static void tcp_timer_print(struct tcpstat *s)
 {
+   static const char * const tmr_name[] = {
+   "off",
+   "on",
+   "keepalive",
+   "timewait",
+   "persist",
+   "unknown"
+   };
+
if (s->timer) {
if (s->timer > 4)
s->timer = 5;
-- 
2.10.0

[iproute PATCH 05/18] ss: introduce proc_ctx_print()

2016-11-11 Thread Phil Sutter

This consolidates identical code in three places. While the function
name is not quite perfect as there is different proc_ctx printing code
in netlink_show_one() as well, I sadly didn't find a more suitable one.

Signed-off-by: Phil Sutter 
---
 misc/ss.c | 49 ++---
 1 file changed, 14 insertions(+), 35 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index e6053467aaf82..b3475cc96ae7b 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -1719,14 +1719,9 @@ void *parse_markmask(const char *markmask)
return res;
 }
 
-static void inet_stats_print(struct sockstat *s)
+static void proc_ctx_print(struct sockstat *s)
 {
-   char *buf = NULL;
-
-   sock_state_print(s);
-
-   inet_addr_print(&s->local, s->lport, s->iface);
-   inet_addr_print(&s->remote, s->rport, 0);
+   char *buf;
 
if (show_proc_ctx || show_sock_ctx) {
if (find_entry(s->ino, &buf,
@@ -1743,6 +1738,16 @@ static void inet_stats_print(struct sockstat *s)
}
 }
 
+static void inet_stats_print(struct sockstat *s)
+{
+   sock_state_print(s);
+
+   inet_addr_print(&s->local, s->lport, s->iface);
+   inet_addr_print(&s->remote, s->rport, 0);
+
+   proc_ctx_print(s);
+}
+
 static int proc_parse_inet_addr(char *loc, char *rem, int family, struct
sockstat * s)
 {
@@ -2814,7 +2819,6 @@ static void unix_stats_print(struct sockstat *list, 
struct filter *f)
 {
struct sockstat *s;
char *peer;
-   char *ctx_buf = NULL;
bool use_proc = unix_use_proc();
char port_name[30] = {};
 
@@ -2863,19 +2867,7 @@ static void unix_stats_print(struct sockstat *list, 
struct filter *f)
sock_addr_print(peer, " ", int_to_str(s->rport, port_name),
NULL);
 
-   if (show_proc_ctx || show_sock_ctx) {
-   if (find_entry(s->ino, &ctx_buf,
-   (show_proc_ctx & show_sock_ctx) ?
-   PROC_SOCK_CTX : PROC_CTX) > 0) {
-   printf(" users:(%s)", ctx_buf);
-   free(ctx_buf);
-   }
-   } else if (show_users) {
-   if (find_entry(s->ino, &ctx_buf, USERS) > 0) {
-   printf(" users:(%s)", ctx_buf);
-   free(ctx_buf);
-   }
-   }
+   proc_ctx_print(s);
printf("\n");
}
 }
@@ -3071,7 +3063,6 @@ static int unix_show(struct filter *f)
 
 static int packet_stats_print(struct sockstat *s, const struct filter *f)
 {
-   char *buf = NULL;
const char *addr, *port;
char ll_name[16];
 
@@ -3098,19 +3089,7 @@ static int packet_stats_print(struct sockstat *s, const 
struct filter *f)
sock_addr_print(addr, ":", port, NULL);
sock_addr_print("", "*", "", NULL);
 
-   if (show_proc_ctx || show_sock_ctx) {
-   if (find_entry(s->ino, &buf,
-   (show_proc_ctx & show_sock_ctx) ?
-   PROC_SOCK_CTX : PROC_CTX) > 0) {
-   printf(" users:(%s)", buf);
-   free(buf);
-   }
-   } else if (show_users) {
-   if (find_entry(s->ino, &buf, USERS) > 0) {
-   printf(" users:(%s)", buf);
-   free(buf);
-   }
-   }
+   proc_ctx_print(s);
 
if (show_details)
sock_details_print(s);
-- 
2.10.0

[iproute PATCH 04/18] ss: Use sockstat->type in all socket types

2016-11-11 Thread Phil Sutter

Unix sockets used that field already to hold info about the socket type.
By replicating this approach in all other socket types, we can get rid
of protocol parameter in inet_stats_print() and have sock_state_print()
figure things out by itself.

Signed-off-by: Phil Sutter 
---
 misc/ss.c | 124 +++---
 1 file changed, 70 insertions(+), 54 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index d0b4f879c4d9f..e6053467aaf82 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -793,8 +793,57 @@ struct tcpstat {
struct tcp_bbr_info *bbr_info;
 };
 
-static void sock_state_print(struct sockstat *s, const char *sock_name)
+static const char *unix_netid_name(int type)
+{
+   switch (type) {
+   case SOCK_STREAM:
+   return "u_str";
+   case SOCK_SEQPACKET:
+   return "u_seq";
+   case SOCK_DGRAM:
+   default:
+   return "u_dgr";
+   }
+}
+
+static char *proto_name(int protocol)
 {
+   switch (protocol) {
+   case 0:
+   return "raw";
+   case IPPROTO_UDP:
+   return "udp";
+   case IPPROTO_TCP:
+   return "tcp";
+   case IPPROTO_DCCP:
+   return "dccp";
+   }
+
+   return "???";
+}
+
+static void sock_state_print(struct sockstat *s)
+{
+   const char *sock_name;
+
+   switch (s->local.family) {
+   case AF_UNIX:
+   sock_name = unix_netid_name(s->type);
+   break;
+   case AF_INET:
+   case AF_INET6:
+   sock_name = proto_name(s->type);
+   break;
+   case AF_PACKET:
+   sock_name = s->type == SOCK_RAW ? "p_raw" : "p_dgr";
+   break;
+   case AF_NETLINK:
+   sock_name = "nl";
+   break;
+   default:
+   sock_name = "unknown";
+   }
+
if (netid_width)
printf("%-*s ", netid_width, sock_name);
if (state_width)
@@ -1670,27 +1719,11 @@ void *parse_markmask(const char *markmask)
return res;
 }
 
-static char *proto_name(int protocol)
-{
-   switch (protocol) {
-   case 0:
-   return "raw";
-   case IPPROTO_UDP:
-   return "udp";
-   case IPPROTO_TCP:
-   return "tcp";
-   case IPPROTO_DCCP:
-   return "dccp";
-   }
-
-   return "???";
-}
-
-static void inet_stats_print(struct sockstat *s, int protocol)
+static void inet_stats_print(struct sockstat *s)
 {
char *buf = NULL;
 
-   sock_state_print(s, proto_name(protocol));
+   sock_state_print(s);
 
inet_addr_print(&s->local, s->lport, s->iface);
inet_addr_print(&s->remote, s->rport, 0);
@@ -1948,8 +1981,9 @@ static int tcp_show_line(char *line, const struct filter 
*f, int family)
s.rto   = (double)rto;
s.ssthresh  = s.ssthresh == -1 ? 0 : s.ssthresh;
s.rto   = s.rto != 3 * hz  ? s.rto / hz : 0;
+   s.ss.type   = IPPROTO_TCP;
 
-   inet_stats_print(&s.ss, IPPROTO_TCP);
+   inet_stats_print(&s.ss);
 
if (show_options)
tcp_timer_print(&s);
@@ -2201,8 +2235,7 @@ static void parse_diag_msg(struct nlmsghdr *nlh, struct 
sockstat *s)
 }
 
 static int inet_show_sock(struct nlmsghdr *nlh,
- struct sockstat *s,
- int protocol)
+ struct sockstat *s)
 {
struct rtattr *tb[INET_DIAG_MAX+1];
struct inet_diag_msg *r = NLMSG_DATA(nlh);
@@ -2211,9 +2244,9 @@ static int inet_show_sock(struct nlmsghdr *nlh,
 nlh->nlmsg_len - NLMSG_LENGTH(sizeof(*r)));
 
if (tb[INET_DIAG_PROTOCOL])
-   protocol = *(__u8 *)RTA_DATA(tb[INET_DIAG_PROTOCOL]);
+   s->type = *(__u8 *)RTA_DATA(tb[INET_DIAG_PROTOCOL]);
 
-   inet_stats_print(s, protocol);
+   inet_stats_print(s);
 
if (show_options) {
struct tcpstat t = {};
@@ -2240,7 +2273,7 @@ static int inet_show_sock(struct nlmsghdr *nlh,
}
}
 
-   if (show_mem || (show_tcpinfo && protocol != IPPROTO_UDP)) {
+   if (show_mem || (show_tcpinfo && s->type != IPPROTO_UDP)) {
printf("\n\t");
tcp_show_info(nlh, r, tb);
}
@@ -2414,6 +2447,7 @@ static int show_one_inet_sock(const struct sockaddr_nl 
*addr,
return 0;
 
parse_diag_msg(h, &s);
+   s.type = diag_arg->protocol;
 
if (diag_arg->f->f && run_ssfilter(diag_arg->f->f, &s) == 0)
return 0;
@@ -2428,7 +2462,7 @@ static int show_one_inet_sock(const struct sockaddr_nl 
*addr,
}
}
 
-   err = inet_show_sock(h, &s, diag_arg->protocol);
+   err = inet_show_sock(h, &s);
if (err < 0)
return err;
 
@@ -2534,11 +2568,12 @@ static int tcp_show_netlink_file(struct filter *f)
}
 
parse_diag_msg(h, );
+   s.type =

[iproute PATCH 12/18] ss: Make slabstat_ids local to get_slabstat()

2016-11-11 Thread Phil Sutter

Signed-off-by: Phil Sutter 
---
 misc/ss.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index d546a00eb2c24..c3a5148e05013 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -580,21 +580,19 @@ struct slabstat {
 
 static struct slabstat slabstat;
 
-static const char *slabstat_ids[] = {
-
-   "sock",
-   "tcp_bind_bucket",
-   "tcp_tw_bucket",
-   "tcp_open_request",
-   "skbuff_head_cache",
-};
-
 static int get_slabstat(struct slabstat *s)
 {
char buf[256];
FILE *fp;
int cnt;
static int slabstat_valid;
+   static const char * const slabstat_ids[] = {
+   "sock",
+   "tcp_bind_bucket",
+   "tcp_tw_bucket",
+   "tcp_open_request",
+   "skbuff_head_cache",
+   };
 
if (slabstat_valid)
return 0;
-- 
2.10.0

Re: BUG() can be hit in tcp_collapse()

2016-11-11 Thread Vladis Dronov

Hello, Eric,

> Another sk_filter() is used in tcp v6.
> So the correct patch would be :

Thank you very much for your research. I'm happy my report
has resulted in the proposed patch.

Best regards,
Vladis Dronov | Red Hat, Inc. | Product Security Engineer

Re: [patch net 2/2] mlxsw: spectrum_router: Correctly dump neighbour activity

2016-11-11 Thread Ido Schimmel

On Fri, Nov 11, 2016 at 11:20:42AM +0100, Jiri Pirko wrote:
> From: Arkadi Sharshevsky 
> 
> During neighbour activity check the device's table is dumped by multiple
> query requests. The query session should end when the response carries
> less records than requested or when a given record is not full. Current
> code only stops the dumping process if the number of returned records is
> zero, which can result in infinite loop in case of activity.
> 
> Fix this by stopping the dumping process according to the above logic.
> 
> Fixes: c723c735fa6b ("mlxsw: spectrum_router: Periodically update the kernel's neigh table")
> Signed-off-by: Arkadi Sharshevsky 
> Signed-off-by: Ido Schimmel 
> Signed-off-by: Jiri Pirko 
> ---
>  .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 22 
> +-
>  1 file changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c 
> b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> index 040737e..d437457 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> @@ -800,6 +800,26 @@ static void mlxsw_sp_router_neigh_rec_process(struct 
> mlxsw_sp *mlxsw_sp,
>   }
>  }
>  
> +static bool mlxsw_sp_router_rauhtd_is_full(char *rauhtd_pl)
> +{
> + u8 num_rec, last_rec_index, num_entries;
> +
> + num_rec = mlxsw_reg_rauhtd_num_rec_get(rauhtd_pl);
> + last_rec_index = num_rec - 1;
> +
> + if (num_rec < MLXSW_REG_RAUHTD_REC_MAX_NUM)
> + return false;
> + if (mlxsw_reg_rauhtd_rec_type_get(rauhtd_pl, last_rec_index) ==
> + MLXSW_REG_RAUHTD_TYPE_IPV6)
> + return true;
> +
> + num_entries = mlxsw_reg_rauhtd_ipv4_rec_num_entries_get(rauhtd_pl,
> + last_rec_index);
> + if (++num_entries ==  MLXSW_REG_RAUHTD_IPV4_ENT_PER_REC)

Jiri, I just noticed we have an extra space after the '=='. Can you
please remove it in v2? Sorry for not spotting this earlier.

> + return true;
> + return false;
> +}
> +
>  static int mlxsw_sp_router_neighs_update_rauhtd(struct mlxsw_sp *mlxsw_sp)
>  {
>   char *rauhtd_pl;
> @@ -826,7 +846,7 @@ static int mlxsw_sp_router_neighs_update_rauhtd(struct 
> mlxsw_sp *mlxsw_sp)
>   for (i = 0; i < num_rec; i++)
>   mlxsw_sp_router_neigh_rec_process(mlxsw_sp, rauhtd_pl,
> i);
> - } while (num_rec);
> + } while (mlxsw_sp_router_rauhtd_is_full(rauhtd_pl));
>   rtnl_unlock();
>  
>   kfree(rauhtd_pl);
> -- 
> 2.7.4
>
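
The stop condition described in the quoted commit message can be illustrated with a small user-space sketch. It assumes hypothetical names and limits (`REC_MAX_NUM`, `ENT_PER_REC`, `struct dump_resp`, `resp_is_full()`) rather than the real mlxsw register accessors, and it stores the actual entry count per IPv4 record. As in the quoted helper, an IPv6 record is treated as always full.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limits; the driver uses its own MLXSW_REG_RAUHTD_* constants. */
#define REC_MAX_NUM 32
#define ENT_PER_REC 4

enum rec_type { REC_IPV4, REC_IPV6 };

struct rec {
	enum rec_type type;
	unsigned int num_entries;	/* only meaningful for IPv4 records */
};

struct dump_resp {
	unsigned int num_rec;
	struct rec recs[REC_MAX_NUM];
};

/*
 * Another query is worthwhile only when the response is completely full:
 * it carries the maximum number of records and the last record is full.
 */
static bool resp_is_full(const struct dump_resp *resp)
{
	const struct rec *last;

	assert(resp->num_rec <= REC_MAX_NUM);

	/* Fewer records than requested: the table has been drained. */
	if (resp->num_rec < REC_MAX_NUM)
		return false;

	last = &resp->recs[resp->num_rec - 1];
	if (last->type == REC_IPV6)
		return true;

	/* A partially filled last IPv4 record also ends the session. */
	return last->num_entries == ENT_PER_REC;
}
```

A caller would keep querying with `do { ... } while (resp_is_full(&resp));`, so a partially filled response ends the session even if ongoing activity means the record count never drops to zero.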

[PATCH] netfilter: x_tables: simplify IS_ERR_OR_NULL to NULL test

2016-11-11 Thread Julia Lawall

Since commit 7926dbfa4bc1 ("netfilter: don't use
mutex_lock_interruptible()"), the function xt_find_table_lock can only
return NULL on an error.  Simplify the call sites and update the
comment before the function.
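
The reasoning above can be sketched in user-space C. Everything here is an assumption made for illustration: `find_table()` stands in for a lookup that reports failure only as NULL, `is_err_or_null()` imitates the kernel's IS_ERR_OR_NULL() check, and the two `get_info_*()` variants are hypothetical call sites, not the netfilter functions.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_ERRNO 4095

struct table {
	const char *name;
};

static struct table filter_table = { "filter" };

/* User-space imitation of the kernel's IS_ERR_OR_NULL() test. */
static int is_err_or_null(const void *p)
{
	return !p || (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical lookup: failure is only ever reported as NULL. */
static struct table *find_table(const char *name)
{
	if (strcmp(name, filter_table.name) == 0)
		return &filter_table;
	return NULL;
}

/* Call site written defensively, as before the simplification. */
static int get_info_old(const char *name)
{
	struct table *t = find_table(name);

	if (is_err_or_null(t))
		return t ? (int)(intptr_t)t : -ENOENT;
	return 0;
}

/* Same call site once the lookup is known to return NULL or a table. */
static int get_info_new(const char *name)
{
	struct table *t = find_table(name);

	if (!t)
		return -ENOENT;
	return 0;
}
```

Because the lookup never returns an encoded error pointer, both variants return the same value for every input, and the error-pointer branch in the old form is dead code.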


The semantic patch that changes the code is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression t,e;
@@

t = \(xt_find_table_lock(...)\|
  try_then_request_module(xt_find_table_lock(...),...)\)
... when != t=e
- ! IS_ERR_OR_NULL(t)
+ t

@@
expression t,e;
@@

t = \(xt_find_table_lock(...)\|
  try_then_request_module(xt_find_table_lock(...),...)\)
... when != t=e
- IS_ERR_OR_NULL(t)
+ !t

@@
expression t,e,e1;
@@

t = \(xt_find_table_lock(...)\|
  try_then_request_module(xt_find_table_lock(...),...)\)
... when != t=e
?- t ? PTR_ERR(t) : e1
+ e1
... when any

// </smpl>

Signed-off-by: Julia Lawall 

---
 net/ipv4/netfilter/arp_tables.c |   20 ++--
 net/ipv4/netfilter/ip_tables.c  |   20 ++--
 net/ipv6/netfilter/ip6_tables.c |   20 ++--
 net/netfilter/x_tables.c|2 +-
 4 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index e76ab23..39004da 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -805,7 +805,7 @@ static int get_info(struct net *net, void __user *user,
 #endif
t = try_then_request_module(xt_find_table_lock(net, NFPROTO_ARP, name),
"arptable_%s", name);
-   if (!IS_ERR_OR_NULL(t)) {
+   if (t) {
struct arpt_getinfo info;
const struct xt_table_info *private = t->private;
 #ifdef CONFIG_COMPAT
@@ -834,7 +834,7 @@ static int get_info(struct net *net, void __user *user,
xt_table_unlock(t);
module_put(t->me);
} else
-   ret = t ? PTR_ERR(t) : -ENOENT;
+   ret = -ENOENT;
 #ifdef CONFIG_COMPAT
if (compat)
xt_compat_unlock(NFPROTO_ARP);
@@ -859,7 +859,7 @@ static int get_entries(struct net *net, struct arpt_get_entries __user *uptr,
get.name[sizeof(get.name) - 1] = '\0';
 
t = xt_find_table_lock(net, NFPROTO_ARP, get.name);
-   if (!IS_ERR_OR_NULL(t)) {
+   if (t) {
const struct xt_table_info *private = t->private;
 
if (get.size == private->size)
@@ -871,7 +871,7 @@ static int get_entries(struct net *net, struct arpt_get_entries __user *uptr,
module_put(t->me);
xt_table_unlock(t);
} else
-   ret = t ? PTR_ERR(t) : -ENOENT;
+   ret = -ENOENT;
 
return ret;
 }
@@ -898,8 +898,8 @@ static int __do_replace(struct net *net, const char *name,
 
t = try_then_request_module(xt_find_table_lock(net, NFPROTO_ARP, name),
"arptable_%s", name);
-   if (IS_ERR_OR_NULL(t)) {
-   ret = t ? PTR_ERR(t) : -ENOENT;
+   if (!t) {
+   ret = -ENOENT;
goto free_newinfo_counters_untrans;
}
 
@@ -1014,8 +1014,8 @@ static int do_add_counters(struct net *net, const void __user *user,
return PTR_ERR(paddc);
 
t = xt_find_table_lock(net, NFPROTO_ARP, tmp.name);
-   if (IS_ERR_OR_NULL(t)) {
-   ret = t ? PTR_ERR(t) : -ENOENT;
+   if (!t) {
+   ret = -ENOENT;
goto free;
}
 
@@ -1404,7 +1404,7 @@ static int compat_get_entries(struct net *net,
 
xt_compat_lock(NFPROTO_ARP);
t = xt_find_table_lock(net, NFPROTO_ARP, get.name);
-   if (!IS_ERR_OR_NULL(t)) {
+   if (t) {
const struct xt_table_info *private = t->private;
struct xt_table_info info;
 
@@ -1419,7 +1419,7 @@ static int compat_get_entries(struct net *net,
module_put(t->me);
xt_table_unlock(t);
} else
-   ret = t ? PTR_ERR(t) : -ENOENT;
+   ret = -ENOENT;
 
xt_compat_unlock(NFPROTO_ARP);
return ret;
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index de4fa03..46815c8 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -973,7 +973,7 @@ static int get_info(struct net *net, void __user *user,
 #endif
t = try_then_request_module(xt_find_table_lock(net, AF_INET, name),
"iptable_%s", name);
-   if (!IS_ERR_OR_NULL(t)) {
+   if (t) {
struct ipt_getinfo info;
const struct xt_table_info *private = t->private;
 #ifdef CONFIG_COMPAT
@@ -1003,7 +1003,7 @@ static int get_info(struct net *net, void __user *user,
xt_table_unlock(t);
module_put(t->me);
} else
-   ret = t ? PTR_ERR(t) : -ENOENT;
+   ret = -ENOENT;
 #ifdef CONFIG_COMPAT
if (compat)

Re: [patch net 1/2] mlxsw: spectrum: Fix refcount bug on span entries

2016-11-11 Thread Ido Schimmel

On Fri, Nov 11, 2016 at 11:20:41AM +0100, Jiri Pirko wrote:
> From: Yotam Gigi 
> 
> When binding port to a newly created span entry, its refcount is set 0
> even though it has a bound port. That leeds to unexpected behaviour when

s/leeds/leads/

> the user tries to delete that port from the span entry.
> 
> Change the binding process to increase the refcount of the bound entry
> even if the entry is newly created, and add warning on the process of
> removing bound port from entry when its refcount is 0.
> 
> Fixes: 763b4b70afcd3 ("mlxsw: spectrum: Add support in matchall mirror TC offloading")

You only need the first 12 characters.

> Signed-off-by: Yotam Gigi 
> Signed-off-by: Jiri Pirko 
> ---
>  drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
> b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
> index 1ec0a4c..d75c1ff 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
> @@ -269,17 +269,18 @@ static struct mlxsw_sp_span_entry
>   struct mlxsw_sp_span_entry *span_entry;
>  
>   span_entry = mlxsw_sp_span_entry_find(port);
> - if (span_entry) {
> - span_entry->ref_count++;
> - return span_entry;
> - }
> + if (!span_entry)
> + span_entry = mlxsw_sp_span_entry_create(port);
>  
> - return mlxsw_sp_span_entry_create(port);
> + span_entry->ref_count++;

mlxsw_sp_span_entry_create() can return NULL. You can look at
mlxsw_sp_fib_entry_get() for reference.

> + return span_entry;
>  }
>  
>  static int mlxsw_sp_span_entry_put(struct mlxsw_sp *mlxsw_sp,
>  struct mlxsw_sp_span_entry *span_entry)
>  {
> + WARN_ON(!span_entry->ref_count);
> +
>   if (--span_entry->ref_count == 0)
>   mlxsw_sp_span_entry_destroy(mlxsw_sp, span_entry);
>   return 0;
> -- 
> 2.7.4
>
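
The review point about a possible NULL return can be illustrated with a user-space sketch of a get/put pair. The fixed-size table, `SPAN_MAX`, and the `used` flag are assumptions made for this example and do not mirror the driver's data structures. The early return in `span_entry_put()` is a stand-in for the warning added by the patch.

```c
#include <stdbool.h>
#include <stddef.h>

#define SPAN_MAX 2	/* hypothetical table size */

struct span_entry {
	int port;
	unsigned int ref_count;
	bool used;
};

static struct span_entry span_table[SPAN_MAX];

static struct span_entry *span_entry_find(int port)
{
	for (size_t i = 0; i < SPAN_MAX; i++)
		if (span_table[i].used && span_table[i].port == port)
			return &span_table[i];
	return NULL;
}

/* Returns NULL when no free slot is left. */
static struct span_entry *span_entry_create(int port)
{
	for (size_t i = 0; i < SPAN_MAX; i++) {
		if (!span_table[i].used) {
			span_table[i].used = true;
			span_table[i].port = port;
			span_table[i].ref_count = 0;
			return &span_table[i];
		}
	}
	return NULL;
}

/* Take a reference, creating the entry on first use. */
static struct span_entry *span_entry_get(int port)
{
	struct span_entry *entry = span_entry_find(port);

	if (!entry) {
		entry = span_entry_create(port);
		if (!entry)
			return NULL;	/* creation failed: nothing to count */
	}
	entry->ref_count++;
	return entry;
}

/* Drop a reference and release the slot when the last one goes away. */
static void span_entry_put(struct span_entry *entry)
{
	if (entry->ref_count == 0)
		return;	/* unbalanced put; this sketch just refuses */
	if (--entry->ref_count == 0)
		entry->used = false;
}
```

With the NULL check in place, a newly created entry starts with a reference count of 1, and a failed creation is reported to the caller instead of being dereferenced.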

Re: [patch net-next 5/8] Introduce sample tc action

2016-11-11 Thread Simon Horman

On Fri, Nov 11, 2016 at 08:28:50AM +, Yotam Gigi wrote:

...

> John, as a result of your question I realized that our hardware does do
> randomized sampling that I was not aware of. I will use the extensibility of
> the API and implement a random keyword, that will be offloaded in our
> hardware. Those changes will be sent on v2.
>
> Eventually, your question was very relevant :) Thanks!

Perhaps I am missing the point but why not just make random the default and
implement the inverse as an extension if it turns out to be needed in
future?

[PATCH v2] net: ethernet: ti: davinci_cpdma: free memory while channel destroy

2016-11-11 Thread Ivan Khoronzhuk

Memory is not freed when a channel is created and then destroyed. It
was assumed that the memory would be freed when the driver is removed.
But a channel can be created and destroyed many times when the number
of channels is changed with ethtool.

Reviewed-by: Grygorii Strashko 
Signed-off-by: Ivan Khoronzhuk 

---

Based on net-next/master

 drivers/net/ethernet/ti/davinci_cpdma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index 05afc05..07fc92d 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -586,7 +586,7 @@ int cpdma_chan_destroy(struct cpdma_chan *chan)
cpdma_chan_stop(chan);
ctlr->channels[chan->chan_num] = NULL;
ctlr->chan_num--;
-
+   devm_kfree(ctlr->dev, chan);
cpdma_chan_split_pool(ctlr);
 
spin_unlock_irqrestore(&ctlr->lock, flags);
-- 
1.9.1
