Re: [PATCH net] ipv6: gre: support SIT encapsulation

2015-10-26 Thread David Miller
From: Eric Dumazet 
Date: Sat, 24 Oct 2015 05:47:44 -0700

> From: Eric Dumazet 
> 
> gre_gso_segment() chokes if SIT frames were aggregated by GRO engine.
> 
> Fixes: 61c1db7fae21e ("ipv6: sit: add GSO/TSO support")
> Signed-off-by: Eric Dumazet 

Applied and queued up for -stable, thanks Eric.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ipv6 route: Use flag instead of calling fib6_get_table() twice

2015-10-26 Thread David Miller
From: Masashi Honma 
Date: Sun, 25 Oct 2015 11:44:27 +0900

> The fib6_get_table() is called twice to show the warning.
> This patch reduces calling the function.
> 
> Signed-off-by: Masashi Honma 

I think the added cost of passing a reference to an on-stack variable
exceeds the value of whatever you think you are improving here.

The code as-is is fine as far as I'm concerned.


Re: [PATCH net] macvtap: unbreak receiving of gro skb with frag list

2015-10-26 Thread Jason Wang


On 10/26/2015 04:30 PM, Michael S. Tsirkin wrote:
> On Mon, Oct 26, 2015 at 02:53:38PM +0800, Jason Wang wrote:
>>
>> On 10/26/2015 02:09 PM, Michael S. Tsirkin wrote:
>>> On Mon, Oct 26, 2015 at 11:15:57AM +0800, Jason Wang wrote:
>>>> On 10/23/2015 09:37 PM, Michael S. Tsirkin wrote:
>>>>> On Fri, Oct 23, 2015 at 12:57:05AM -0400, Jason Wang wrote:
>>>>>> We don't have fraglist support in TAP_FEATURES. This will lead to
>>>>>> software segmentation of gro skbs with frag list. Fix this by adding
>>>>>> frag list support to TAP_FEATURES.
>>>>>>
>>>>>> With this patch single session of netperf receiving was restored from
>>>>>> about 5Gb/s to about 12Gb/s on mlx4.
>>>>>>
>>>>>> Fixes a567dd6252 ("macvtap: simplify usage of tap_features")
>>>>>> Cc: Vlad Yasevich 
>>>>>> Cc: Michael S. Tsirkin 
>>>>>> Signed-off-by: Jason Wang 
>>>>> Thanks!
>>>>> Does this mean we should look at re-adding NETIF_F_FRAGLIST
>>>>> to virtio-net as well?
>>>> Not sure I get the point, but probably not. This is for receiving and
>>>> skb_copy_datagram_iter() can deal with frag list.
>>> Point is:
>>> - bridge within guest
>>> - assigned device creating gro skbs with frag list bridged to virtio
>> I see, but this problem does not look specific to virtio. Most cards do
>> not support frag list.
> These will be slower when used with a bridge then, won't they?

For forwarding, not sure. GRO has latency and cpu overhead anyway.

Anyway I can try to add the support for this.


Re: [PATCH] bpf: sample: define aarch64 specific registers

2015-10-26 Thread Alexei Starovoitov
On Mon, Oct 26, 2015 at 05:02:19PM -0700, Yang Shi wrote:
> Define aarch64 specific registers for building bpf samples correctly.
> 
> Signed-off-by: Yang Shi 

looks good to me.
Acked-by: Alexei Starovoitov 



Re: [PATCH net-next] bpf: make tracing helpers gpl only

2015-10-26 Thread David Miller
From: Alexei Starovoitov 
Date: Fri, 23 Oct 2015 14:58:19 -0700

> exported perf symbols are GPL only, mark eBPF helper functions
> used in tracing as GPL only as well.
> 
> Suggested-by: Peter Zijlstra 
> Signed-off-by: Alexei Starovoitov 

Applied.


Re: [GIT PULL v2] ARCNET: code simplification and features

2015-10-26 Thread David Miller
From: Michael Grzeschik 
Date: Mon, 26 Oct 2015 09:23:14 +0100

> This series includes code simplification. The main changes are the correct
> xceiver handling (enable/disable) of the com20020 cards. The driver now
> handles link status change detection. The EAE PCI-ARCNET cards now make use
> of the rotary encoded subdevice indexing and got support for led triggers on
> transmit and reconnection events.

Pulled, thanks Michael.


Re: [PATCH net-next] seccomp, ptrace: add support for dumping seccomp filters

2015-10-26 Thread Alexei Starovoitov
On Tue, Oct 27, 2015 at 09:23:59AM +0900, Tycho Andersen wrote:
> This patch adds support for dumping a process' (classic BPF) seccomp
> filters via ptrace.
> 
> PTRACE_SECCOMP_GET_FILTER allows the tracer to dump the user's classic BPF
> seccomp filters. addr should be an integer which represents the ith seccomp
> filter (0 is the most recently installed filter). data should be a struct
> sock_filter * with enough room for the ith filter, or NULL, in which case
> the filter is not saved. The return value for this command is the number of
> BPF instructions the program represents, or negative in the case of errors.
> Command specific errors are ENOENT: which indicates that there is no ith
> filter in this seccomp tree, and EMEDIUMTYPE, which indicates that the ith
> filter was not installed as a classic BPF filter.
> 
> A caveat with this approach is that there is no way to get explicitly at
> the hierarchy of seccomp filters, and users need to memcmp() filters to
> decide which are inherited. This means that a task which installs two of
> the same filter can potentially confuse users of this interface.
> 
> v2: * make save_orig const
> * check that the orig_prog exists (not necessary right now, but when
>grows eBPF support it will be)
> * s/n/filter_off and make it an unsigned long to match ptrace
> * count "down" the tree instead of "up" when passing a filter offset
> 
> v3: * don't take the current task's lock for inspecting its seccomp mode
> * use a 0x42** constant for the ptrace command value
> 
> v4: * don't copy to userspace while holding spinlocks
> 
> v5: * add another condition to WARN_ON
> 
> v6: * rebase on net-next
> 
> Signed-off-by: Tycho Andersen 
> Acked-by: Kees Cook 
> CC: Will Drewry 
> Reviewed-by: Oleg Nesterov 
> CC: Andy Lutomirski 
> CC: Pavel Emelyanov 
> CC: Serge E. Hallyn 
> CC: Alexei Starovoitov 
> CC: Daniel Borkmann 

Looks fine.
Acked-by: Alexei Starovoitov 
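
The two-step use of the command described in the patch (ask for the
instruction count with a NULL buffer, then fetch the filter) could look
roughly like the userspace sketch below. It is only an illustration:
dump_seccomp_filter() is a hypothetical helper, the 0x420c fallback is an
assumption that should be checked against the installed kernel headers, and
the tracee is assumed to be already attached and stopped by a sufficiently
privileged tracer.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <linux/filter.h>

/* Assumption: request number used when the libc headers lack the symbol. */
#ifndef PTRACE_SECCOMP_GET_FILTER
#define PTRACE_SECCOMP_GET_FILTER 0x420c
#endif

/*
 * Hypothetical helper: dump the idx-th classic BPF seccomp filter of an
 * attached, stopped tracee. Returns a malloc()ed array and stores the
 * instruction count in *len, or returns NULL with errno set.
 */
static struct sock_filter *dump_seccomp_filter(pid_t pid, unsigned long idx,
					       long *len)
{
	struct sock_filter *insns;
	long n;

	/* With data == NULL the call only reports the instruction count. */
	n = ptrace(PTRACE_SECCOMP_GET_FILTER, pid, (void *)idx, NULL);
	if (n < 0)
		return NULL;	/* e.g. ENOENT: there is no idx-th filter */

	insns = calloc(n ? (size_t)n : 1, sizeof(*insns));
	if (!insns)
		return NULL;

	/* Second call copies the instructions into our buffer. */
	n = ptrace(PTRACE_SECCOMP_GET_FILTER, pid, (void *)idx, insns);
	if (n < 0) {
		int saved = errno;

		free(insns);
		errno = saved;
		return NULL;
	}
	*len = n;
	return insns;
}
```

A caller would walk idx upward from 0 until the helper fails with ENOENT,
and compare the dumped arrays with memcmp() to guess which filters were
inherited, as the caveat in the commit message notes.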



Re: [PATCH net-next] bridge: set is_local and is_static before fdb entry is added to the fdb hashtable

2015-10-26 Thread kbuild test robot
Hi Roopa,

[auto build test ERROR on net-next/master -- if it's inappropriate base, please 
suggest rules for selecting the more suitable base]

url:
https://github.com/0day-ci/linux/commits/Roopa-Prabhu/bridge-set-is_local-and-is_static-before-fdb-entry-is-added-to-the-fdb-hashtable/20151027-120635
config: i386-randconfig-x009-201543 (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   net/bridge/br_fdb.c: In function 'br_fdb_external_learn_add':
>> net/bridge/br_fdb.c:1103:9: error: too few arguments to function 'fdb_create'
  fdb = fdb_create(head, p, addr, vid);
^
   net/bridge/br_fdb.c:495:37: note: declared here
static struct net_bridge_fdb_entry *fdb_create(struct hlist_head *head,
^

vim +/fdb_create +1103 net/bridge/br_fdb.c

3aeb6617 Jiri Pirko    2015-01-15  1097  	ASSERT_RTNL();
cf6b8e1e Scott Feldman 2014-11-28  1098  	spin_lock_bh(&br->hash_lock);
cf6b8e1e Scott Feldman 2014-11-28  1099  
cf6b8e1e Scott Feldman 2014-11-28  1100  	head = &br->hash[br_mac_hash(addr, vid)];
cf6b8e1e Scott Feldman 2014-11-28  1101  	fdb = fdb_find(head, addr, vid);
cf6b8e1e Scott Feldman 2014-11-28  1102  	if (!fdb) {
cf6b8e1e Scott Feldman 2014-11-28 @1103  		fdb = fdb_create(head, p, addr, vid);
cf6b8e1e Scott Feldman 2014-11-28  1104  		if (!fdb) {
cf6b8e1e Scott Feldman 2014-11-28  1105  			err = -ENOMEM;
cf6b8e1e Scott Feldman 2014-11-28  1106  			goto err_unlock;

:: The code at line 1103 was first introduced by commit
:: cf6b8e1eedffd9ef9a22c0c9453d752b07daf89a bridge: add API to notify 
bridge driver of learned FBD on offloaded device

:: TO: Scott Feldman 
:: CC: David S. Miller 

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


Re: [PATCH net-next] ipv6: icmp: include addresses in debug messages

2015-10-26 Thread David Miller
From: Bjørn Mork 
Date: Sat, 24 Oct 2015 14:00:20 +0200

> Messages like "icmp6_send: no reply to icmp error" are close
> to useless. Adding source and destination addresses to provide
> some more clue.
> 
> Signed-off-by: Bjørn Mork 

This is fine, applied, thanks.


Re: [PATCH net] ipv6: no CHECKSUM_PARTIAL on skbs with extension headers and recalc checksum during fragmentation

2015-10-26 Thread Hannes Frederic Sowa


On Mon, Oct 26, 2015, at 15:19, Tom Herbert wrote:
> > We already concluded that drivers do have this problem and not the stack
> > above ip6_fragment. The places I am aware of I fixed in this patch. Also
> > IPv4 to me seems unaffected, albeit one can certainly clean up the logic
> > in net-next.
> >
> I don't understand why checksum for IP fragments is a driver problem.
> When fragments are sent to driver they should never have
> CHECKSUM_PARTIAL set (or maybe that is what you are seeing?).

Because either the drivers or the hardware does not correctly iterate
over the extension headers to fetch the final nexthdr field which is
used to compute the checksum. This is different from IPv4.

I can only guess e.g. from the e1000e driver:

case cpu_to_be16(ETH_P_IPV6):
/* XXX not handling all IPV6 headers */
if (ipv6_hdr(skb)->nexthdr == IPPROTO_TCP)
cmd_len |= E1000_TXD_CMD_TCP;
break;
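
To make "iterate over the extension headers" concrete, here is a minimal
userspace sketch of such a walk. It is not driver code:
ipv6_final_nexthdr() is a hypothetical helper, it works on a raw byte
buffer rather than an skb, and it only follows the hop-by-hop, routing,
destination-options and fragment headers.

```c
#include <stddef.h>
#include <stdint.h>
#include <netinet/in.h>

/*
 * Hypothetical helper: return the upper-layer protocol of the IPv6 packet
 * in pkt[0..len), or -1 if the header chain is truncated. Anything other
 * than the four header types below is treated as the final protocol.
 */
static int ipv6_final_nexthdr(const uint8_t *pkt, size_t len)
{
	size_t off = 40;		/* size of the fixed IPv6 header */
	uint8_t nexthdr;

	if (len < off)
		return -1;
	nexthdr = pkt[6];		/* Next Header field of the fixed header */

	for (;;) {
		size_t hdrlen;

		switch (nexthdr) {
		case IPPROTO_HOPOPTS:
		case IPPROTO_ROUTING:
		case IPPROTO_DSTOPTS:
			if (len < off + 2)
				return -1;
			/* Length is in 8-octet units, not counting the first. */
			hdrlen = ((size_t)pkt[off + 1] + 1) * 8;
			break;
		case IPPROTO_FRAGMENT:
			hdrlen = 8;	/* fragment header has a fixed size */
			break;
		default:
			return nexthdr;
		}
		if (len < off + hdrlen)
			return -1;
		nexthdr = pkt[off];	/* first octet names the next header */
		off += hdrlen;
	}
}
```

A check like the e1000e one above, which only looks at the Next Header
field of the fixed header, would see IPPROTO_HOPOPTS for a TCP packet that
carries a hop-by-hop header, whereas this walk reports IPPROTO_TCP.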

> > Do you want to move the skb_checksum_help() check to the front of
> > ip_fragment in ipv4 now too?
> >
> Yes, it seems to me we should never fragment a packet with
> CHECKSUM_PARTIAL. This seems to currently be possible in IP output (v4
> & v6). I would imagine that we don't see this bug trip (too often?)
> because most uses of UDP go through the user space fragmentation
> code, and UDP packets sent through kernel path probably don't have a
> frag_list so they go through slow path.

I agree.

> Also, as Eric suggested, it looks like your patch is doing two things
> (fixing csum/fragmentation and disabling csum_partial for any
> extension headers)-- these should be separate patches.

I will now split the patches as a consensus seems reached here (and
write a commit message :) ). I will audit the IPv4 paths, too.

Thanks,
Hannes


[PATCH 1/1] commit c6825c0976fa7893692e0e43b09740b419b23c09 upstream.

2015-10-26 Thread Ani Sinha
netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get

Let's look at destroy_conntrack:

hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
...
nf_conntrack_free(ct)
kmem_cache_free(net->ct.nf_conntrack_cachep, ct);

net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU.

The hash is protected by rcu, so readers look up conntracks without
locks.
A conntrack is removed from the hash, but in this moment a few readers
still can use the conntrack. Then this conntrack is released and another
thread creates conntrack with the same address and the equal tuple.
After this a reader starts to validate the conntrack:
* It's not dying, because a new conntrack was created
* nf_ct_tuple_equal() returns true.

But this conntrack is not initialized yet, so it can not be used by two
threads concurrently. In this case BUG_ON may be triggered from
nf_nat_setup_info().

Florian Westphal suggested to check the confirm bit too. I think it's
right.

task 1                  task 2                  task 3
                        nf_conntrack_find_get
                         nf_conntrack_find
destroy_conntrack
 hlist_nulls_del_rcu
 nf_conntrack_free
 kmem_cache_free
                                                __nf_conntrack_alloc
                                                 kmem_cache_alloc
                                                 memset(&ct->tuplehash[IP_CT_DIR_MAX],
                         if (nf_ct_is_dying(ct))
                         if (!nf_ct_tuple_equal()

I'm not sure that I have ever seen this race condition in real life.
Currently we are investigating a bug, which is reproduced on a few nodes.
In our case one conntrack is initialized from a few tasks concurrently,
we don't have any other explanation for this.

<2>[46267.083061] kernel BUG at net/ipv4/netfilter/nf_nat_core.c:322!
...
<4>[46267.083951] RIP: 0010:[]  [] 
nf_nat_setup_info+0x564/0x590 [nf_nat]
...
<4>[46267.085549] Call Trace:
<4>[46267.085622]  [] alloc_null_binding+0x5b/0xa0 
[iptable_nat]
<4>[46267.085697]  [] nf_nat_rule_find+0x5c/0x80 [iptable_nat]
<4>[46267.085770]  [] nf_nat_fn+0x111/0x260 [iptable_nat]
<4>[46267.085843]  [] nf_nat_out+0x48/0xd0 [iptable_nat]
<4>[46267.085919]  [] nf_iterate+0x69/0xb0
<4>[46267.085991]  [] ? ip_finish_output+0x0/0x2f0
<4>[46267.086063]  [] nf_hook_slow+0x74/0x110
<4>[46267.086133]  [] ? ip_finish_output+0x0/0x2f0
<4>[46267.086207]  [] ? dst_output+0x0/0x20
<4>[46267.086277]  [] ip_output+0xa4/0xc0
<4>[46267.086346]  [] raw_sendmsg+0x8b4/0x910
<4>[46267.086419]  [] inet_sendmsg+0x4a/0xb0
<4>[46267.086491]  [] ? sock_update_classid+0x3a/0x50
<4>[46267.086562]  [] sock_sendmsg+0x117/0x140
<4>[46267.086638]  [] ? _spin_unlock_bh+0x1b/0x20
<4>[46267.086712]  [] ? autoremove_wake_function+0x0/0x40
<4>[46267.086785]  [] ? do_ip_setsockopt+0x90/0xd80
<4>[46267.086858]  [] ? call_function_interrupt+0xe/0x20
<4>[46267.086936]  [] ? ub_slab_ptr+0x20/0x90
<4>[46267.087006]  [] ? ub_slab_ptr+0x20/0x90
<4>[46267.087081]  [] ? kmem_cache_alloc+0xd8/0x1e0
<4>[46267.087151]  [] sys_sendto+0x139/0x190
<4>[46267.087229]  [] ? sock_setsockopt+0x16d/0x6f0
<4>[46267.087303]  [] ? audit_syscall_entry+0x1d7/0x200
<4>[46267.087378]  [] ? __audit_syscall_exit+0x265/0x290
<4>[46267.087454]  [] ? compat_sys_setsockopt+0x75/0x210
<4>[46267.087531]  [] compat_sys_socketcall+0x13f/0x210
<4>[46267.087607]  [] ia32_sysret+0x0/0x5
<4>[46267.087676] Code: 91 20 e2 01 75 29 48 89 de 4c 89 f7 e8 56 fa ff ff 85 
c0 0f 84 68 fc ff ff 0f b6 4d c6 41 8b 45 00 e9 4d fb ff ff e8 7c 19 e9 e0 <0f> 
0b eb fe f6 05 17 91 20 e2 80 74 ce 80 3d 5f 2e 00 00 00 74
<1>[46267.088023] RIP  [] nf_nat_setup_info+0x564/0x590

Cc: Eric Dumazet 
Cc: Florian Westphal 
Cc: Pablo Neira Ayuso 
Cc: Patrick McHardy 
Cc: Jozsef Kadlecsik 
Cc: "David S. Miller" 
Cc: Cyrill Gorcunov 
Signed-off-by: Andrey Vagin 
Acked-by: Eric Dumazet 
Signed-off-by: Pablo Neira Ayuso 
Signed-off-by: Ani Sinha 
---
 net/netfilter/nf_conntrack_core.c | 21 +
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 9a46908..fd0f7a3 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -309,6 +309,21 @@ static void death_by_timeout(unsigned long ul_conntrack)
nf_ct_put(ct);
 }
 
+static inline bool
+nf_ct_key_equal(struct nf_conntrack_tuple_hash *h,
+   const struct nf_conntrack_tuple *tuple,
+   u16 zone)
+{
+   struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
+
+   /* A conntrack can be recreated with the equal tuple,
+* so we need to check that the conntrack is confirmed
+*/
+   return nf_ct_tuple_equal(tuple, 
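
The lookup pattern the commit message argues for (match, take a reference,
then match again, with the second match also requiring a fully initialised
object) can be sketched in userspace with C11 atomics. This is a
simplified illustration, not the kernel code: struct entry, the integer
tuple and all function names are hypothetical stand-ins, and RCU and
memory-ordering details are ignored.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a conntrack entry. */
struct entry {
	atomic_int refcnt;	/* 0 means the object is being freed */
	int tuple;		/* lookup key */
	atomic_bool confirmed;	/* set only once initialisation is complete */
};

/* Take a reference unless the count has already dropped to zero. */
static bool get_unless_zero(struct entry *e)
{
	int old = atomic_load(&e->refcnt);

	while (old != 0)
		if (atomic_compare_exchange_weak(&e->refcnt, &old, old + 1))
			return true;
	return false;
}

/* Key comparison that also insists on a fully set-up object. */
static bool key_equal(struct entry *e, int tuple)
{
	return e->tuple == tuple && atomic_load(&e->confirmed);
}

/*
 * Lookup over memory that may be recycled under the reader: the slot can
 * be freed and reused between the first comparison and the reference
 * grab, so the key is compared a second time once the object is pinned.
 */
static struct entry *find_get(struct entry *slot, int tuple)
{
	if (!key_equal(slot, tuple))
		return NULL;
	if (!get_unless_zero(slot))
		return NULL;		/* dying: treat as not found */
	if (!key_equal(slot, tuple)) {
		atomic_fetch_sub(&slot->refcnt, 1);
		return NULL;		/* recycled: caller should retry */
	}
	return slot;
}
```

Without the confirmed test, a recycled entry that happens to carry the same
tuple would pass both comparisons while another thread is still setting it
up, which is the situation described above.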

Re: [PATCH net-next V17 2/3] Check for vlan ethernet types for 8021.q or 802.1ad

2015-10-26 Thread Albino B Neto
2015-10-25 22:11 GMT-02:00 Thomas F Herbert :
> Signed-off-by: Thomas F Herbert 
> ---
>  include/linux/if_vlan.h | 16 
>  1 file changed, 16 insertions(+)
>
> diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
> index 67ce5bd..d2494b5 100644
> --- a/include/linux/if_vlan.h
> +++ b/include/linux/if_vlan.h
> @@ -627,6 +627,22 @@ static inline netdev_features_t vlan_features_check(const struct sk_buff *skb,
>
> return features;
>  }
> +/**
> + * eth_type_vlan - check for valid vlan ether type.
> + * @ethertype: ether type to check
> + *
> + * Returns true if the ether type is a vlan ether type.
> + */
> +static inline bool eth_type_vlan(__be16 ethertype)
> +{
> +   switch (ethertype) {
> +   case htons(ETH_P_8021Q):
> +   case htons(ETH_P_8021AD):
> +   return true;
> +   default:
> +   return false;
> +   }
> +}
>
>  /**
>   * compare_vlan_header - Compare two vlan headers

Description ?

   Albino


Re: [PATCH v3 net-next] bpf: fix bpf_perf_event_read() helper

2015-10-26 Thread Wangnan (F)



On 2015/10/26 20:32, Peter Zijlstra wrote:
> On Sun, Oct 25, 2015 at 09:23:36AM -0700, Alexei Starovoitov wrote:
>> bpf_perf_event_read() muxes of -EINVAL into return value, but it's non
>> ambiguous to the program whether it got an error or real counter value.
> How can that be, the (u64)-EINVAL value is a valid counter value..
> unlikely maybe, but still quite possible.

In our real usecase we simply treat a return value larger than 0x7fff
as an error result. We can make it even larger, for example, to 0x.
Resulting values can be pre-processed by a script to filter potential
error results out, so it is not a very big problem for our real usecases.

For a better interface, I suggest

 u64 bpf_perf_event_read(bool *perror);

which still returns the counter value through its return value but puts the
error code on the stack. Then a BPF program can pass NULL to the function if
it doesn't want to deal with the error code.

Thank you.
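
The difference between the two conventions discussed here can be shown
with a small userspace sketch. Everything in it is hypothetical:
looks_like_error() uses an IS_ERR_VALUE-style cut-off of 4095 as an
assumption, and read_counter() only mimics the shape of the proposed
out-parameter interface, taking the raw value and failure flag as
arguments instead of reading a real perf counter.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* In-band convention: the top 4095 values of the u64 range mean "error". */
static bool looks_like_error(uint64_t v)
{
	return v >= (uint64_t)-4095;
}

/*
 * Out-of-band convention, shaped like the proposal in the mail. "raw" and
 * "failed" stand in for whatever a real helper would obtain from perf.
 */
static uint64_t read_counter(uint64_t raw, bool failed, bool *perror)
{
	if (perror)
		*perror = failed;	/* callers passing NULL ignore errors */
	return failed ? 0 : raw;
}
```

With the in-band check, a genuine counter value equal to (u64)-EINVAL is
misreported as an error, which is Peter's objection. With the separate
flag the same value comes back unchanged and the flag stays false.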



Re: [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)

2015-10-26 Thread Neal Cardwell
On Fri, Oct 23, 2015 at 4:50 PM, Bendik Rønning Opstad
 wrote:
>@@ -2409,6 +2412,15 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
...
> +   case TCP_RDB:
> +   if (val < 0 || val > 1) {
> +   err = -EINVAL;
> +   } else {
> +   tp->rdb = val;
> +   tp->nonagle = val;

The semantics of the tp->nonagle bits are already a bit complex. My
sense is that having a setsockopt of TCP_RDB transparently modify the
nagle behavior is going to add more extra complexity and unanticipated
behavior than is warranted given the slight possible gain in
convenience to the app writer. What about a model where the
application user just needs to remember to call
setsockopt(TCP_NODELAY) if they want the TCP_RDB behavior to be
sensible? I see your nice tests at

   
https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b

are already doing that. And my sense is that likewise most
well-engineered "thin stream" apps will already be using
setsockopt(TCP_NODELAY). Is that workable?

neal
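
For reference, the application-side call Neal refers to is the ordinary
TCP_NODELAY socket option. A minimal sketch follows; the two wrapper
function names are hypothetical, and TCP_RDB itself is deliberately left
out because it only exists in the RFC patch.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable Nagle on a TCP socket; returns 0 on success, -1 with errno set. */
static int tcp_disable_nagle(int fd)
{
	int one = 1;

	return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}

/* Read the flag back; returns 1 or 0 for set or clear, -1 on error. */
static int tcp_nagle_disabled(int fd)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) < 0)
		return -1;
	return val != 0;
}
```

Under the model suggested above, an application would make this call
itself before enabling RDB, so the new option would not have to touch
tp->nonagle.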


Re: [PATCH V3 1/2] xen-netback: limit xen vif max queues number to online cpus

2015-10-26 Thread Wei Liu
On Fri, Oct 23, 2015 at 05:44:44PM +0800, Joe Jin wrote:
> Should not allocate xen vif queues number more than online cpus.

I think it's absolutely fine for administrators to override the value
should they choose to.

Wei.


Re: [PATCH net] ipv6: no CHECKSUM_PARTIAL on skbs with extension headers and recalc checksum during fragmentation

2015-10-26 Thread Tom Herbert
> We already concluded that drivers do have this problem and not the stack
> above ip6_fragment. The places I am aware of I fixed in this patch. Also
> IPv4 to me seems unaffected, albeit one can certainly clean up the logic
> in net-next.
>
I don't understand why checksum for IP fragments is a driver problem.
When fragments are sent to driver they should never have
CHECKSUM_PARTIAL set (or maybe that is what you are seeing?).

> Do you want to move the skb_checksum_help() check to the front of
> ip_fragment in ipv4 now too?
>
Yes, it seems to me we should never fragment a packet with
CHECKSUM_PARTIAL. This seems to currently be possible in IP output (v4
& v6). I would imagine that we don't see this bug trip (too often?)
because most uses of UDP go through the user space fragmentation
code, and UDP packets sent through kernel path probably don't have a
frag_list so they go through slow path.

Also, as Eric suggested, it looks like your patch is doing two things
(fixing csum/fragmentation and disabling csum_partial for any
extension headers)-- these should be separate patches.

Thanks,
Tom

> My patch fixed the part above ip6_fragment (in ip6_append_data) and made
> sure we don't send out packets with wrong checksums if we get to
> ip6_fragment directly.
>
> Bye,
> Hannes


[PATCH net-next v2 1/1] sfc: replace spinlocks with bit ops for busy poll locking

2015-10-26 Thread Shradha Shah
From: Bert Kenward 

This patch reduces the overhead of locking for busy poll.
Previously the state was protected by a lock, whereas now
it's manipulated solely with atomic operations.

Signed-off-by: Shradha Shah 
---
 drivers/net/ethernet/sfc/efx.c|   4 +-
 drivers/net/ethernet/sfc/net_driver.h | 129 +++---
 2 files changed, 58 insertions(+), 75 deletions(-)

diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index 974637d..6e11ee6 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -2062,7 +2062,7 @@ static void efx_init_napi_channel(struct efx_channel *channel)
netif_napi_add(channel->napi_dev, &channel->napi_str,
   efx_poll, napi_weight);
napi_hash_add(&channel->napi_str);
-   efx_channel_init_lock(channel);
+   efx_channel_busy_poll_init(channel);
 }
 
 static void efx_init_napi(struct efx_nic *efx)
@@ -2125,7 +2125,7 @@ static int efx_busy_poll(struct napi_struct *napi)
if (!netif_running(efx->net_dev))
return LL_FLUSH_FAILED;
 
-   if (!efx_channel_lock_poll(channel))
+   if (!efx_channel_try_lock_poll(channel))
return LL_FLUSH_BUSY;
 
old_rx_packets = channel->rx_queue.rx_packets;
diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h
index ad56231..229e68c 100644
--- a/drivers/net/ethernet/sfc/net_driver.h
+++ b/drivers/net/ethernet/sfc/net_driver.h
@@ -431,21 +431,8 @@ struct efx_channel {
struct net_device *napi_dev;
struct napi_struct napi_str;
 #ifdef CONFIG_NET_RX_BUSY_POLL
-   unsigned int state;
-   spinlock_t state_lock;
-#define EFX_CHANNEL_STATE_IDLE 0
-#define EFX_CHANNEL_STATE_NAPI (1 << 0)  /* NAPI owns this channel */
-#define EFX_CHANNEL_STATE_POLL (1 << 1)  /* poll owns this channel */
-#define EFX_CHANNEL_STATE_DISABLED (1 << 2)  /* channel is disabled */
-#define EFX_CHANNEL_STATE_NAPI_YIELD   (1 << 3)  /* NAPI yielded this channel */
-#define EFX_CHANNEL_STATE_POLL_YIELD   (1 << 4)  /* poll yielded this channel */
-#define EFX_CHANNEL_OWNED \
-   (EFX_CHANNEL_STATE_NAPI | EFX_CHANNEL_STATE_POLL)
-#define EFX_CHANNEL_LOCKED \
-   (EFX_CHANNEL_OWNED | EFX_CHANNEL_STATE_DISABLED)
-#define EFX_CHANNEL_USER_PEND \
-   (EFX_CHANNEL_STATE_POLL | EFX_CHANNEL_STATE_POLL_YIELD)
-#endif /* CONFIG_NET_RX_BUSY_POLL */
+   unsigned long busy_poll_state;
+#endif
struct efx_special_buffer eventq;
unsigned int eventq_mask;
unsigned int eventq_read_ptr;
@@ -480,98 +467,94 @@ struct efx_channel {
 };
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
-static inline void efx_channel_init_lock(struct efx_channel *channel)
+enum efx_channel_busy_poll_state {
+   EFX_CHANNEL_STATE_IDLE = 0,
+   EFX_CHANNEL_STATE_NAPI = BIT(0),
+   EFX_CHANNEL_STATE_NAPI_REQ_BIT = 1,
+   EFX_CHANNEL_STATE_NAPI_REQ = BIT(1),
+   EFX_CHANNEL_STATE_POLL_BIT = 2,
+   EFX_CHANNEL_STATE_POLL = BIT(2),
+   EFX_CHANNEL_STATE_DISABLE_BIT = 3,
+};
+
+static inline void efx_channel_busy_poll_init(struct efx_channel *channel)
 {
-   spin_lock_init(&channel->state_lock);
+   WRITE_ONCE(channel->busy_poll_state, EFX_CHANNEL_STATE_IDLE);
 }
 
 /* Called from the device poll routine to get ownership of a channel. */
 static inline bool efx_channel_lock_napi(struct efx_channel *channel)
 {
-   bool rc = true;
-
-   spin_lock_bh(&channel->state_lock);
-   if (channel->state & EFX_CHANNEL_LOCKED) {
-   WARN_ON(channel->state & EFX_CHANNEL_STATE_NAPI);
-   channel->state |= EFX_CHANNEL_STATE_NAPI_YIELD;
-   rc = false;
-   } else {
-   /* we don't care if someone yielded */
-   channel->state = EFX_CHANNEL_STATE_NAPI;
+   unsigned long prev, old = READ_ONCE(channel->busy_poll_state);
+
+   while (1) {
+   switch (old) {
+   case EFX_CHANNEL_STATE_POLL:
+   /* Ensure efx_channel_try_lock_poll() wont starve us */
+   set_bit(EFX_CHANNEL_STATE_NAPI_REQ_BIT,
+   &channel->busy_poll_state);
+   /* fallthrough */
+   case EFX_CHANNEL_STATE_POLL | EFX_CHANNEL_STATE_NAPI_REQ:
+   return false;
+   default:
+   break;
+   }
+   prev = cmpxchg(&channel->busy_poll_state, old,
+  EFX_CHANNEL_STATE_NAPI);
+   if (unlikely(prev != old)) {
+   /* This is likely to mean we've just entered polling
+* state. Go back round to set the REQ bit.
+*/
+   old = prev;
+   continue;
+   }
+   return true;
}
-   spin_unlock_bh(&channel->state_lock);
-   return rc;
 }
 
 static 
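
The idea of the patch, replacing a spinlock-protected state word with a
state machine driven by compare-and-swap, can be sketched in userspace
with C11 atomics. This is a simplified illustration under stated
assumptions, not the driver code: struct chan, the STATE_* names and the
four functions are hypothetical, the disable bit is left out, and the
default sequentially consistent ordering is used throughout.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum {
	STATE_IDLE     = 0,
	STATE_NAPI     = 1u << 0,	/* NAPI owns the channel */
	STATE_NAPI_REQ = 1u << 1,	/* NAPI wants it; busy poll backs off */
	STATE_POLL     = 1u << 2,	/* busy poll owns the channel */
};

struct chan {
	atomic_ulong state;
};

/* Busy poll only ever takes a completely idle channel. */
static bool try_lock_poll(struct chan *c)
{
	unsigned long idle = STATE_IDLE;

	return atomic_compare_exchange_strong(&c->state, &idle, STATE_POLL);
}

static void unlock_poll(struct chan *c)
{
	/* Leaves NAPI_REQ in place so the next try_lock_poll() still fails. */
	atomic_fetch_and(&c->state, ~(unsigned long)STATE_POLL);
}

/* NAPI either takes the channel or leaves a request for the poller. */
static bool lock_napi(struct chan *c)
{
	unsigned long old = atomic_load(&c->state);

	for (;;) {
		if (old & STATE_POLL) {
			atomic_fetch_or(&c->state, STATE_NAPI_REQ);
			return false;
		}
		/* On failure "old" is refreshed with the current state. */
		if (atomic_compare_exchange_weak(&c->state, &old, STATE_NAPI))
			return true;
	}
}

static void unlock_napi(struct chan *c)
{
	atomic_store(&c->state, STATE_IDLE);
}
```

The request bit is what keeps busy polling from starving NAPI: once it is
set, the channel is no longer idle, so try_lock_poll() keeps failing until
NAPI has taken and released the channel.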

[PATCH net-next] ipv6: recreate ipv6 link-local addresses when increasing MTU over IPV6_MIN_MTU

2015-10-26 Thread Hannes Frederic Sowa
Take into consideration that the interface might be disabled for IPv6,
thus switch event type.

Signed-off-by: Hannes Frederic Sowa 
---
 net/ipv6/addrconf.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index d0c685c..c2dcebe 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3149,6 +3149,7 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
 
case NETDEV_UP:
case NETDEV_CHANGE:
+netdev_change:
if (dev->flags & IFF_SLAVE)
break;
 
@@ -3244,8 +3245,10 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
 
if (!idev && dev->mtu >= IPV6_MIN_MTU) {
idev = ipv6_add_dev(dev);
-   if (!IS_ERR(idev))
-   break;
+   if (!IS_ERR(idev)) {
+   event = NETDEV_UP;
+   goto netdev_change;
+   }
}
 
/*
-- 
2.5.0



Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-26 Thread Alexander Duyck

On 10/25/2015 10:36 PM, Lan Tianyu wrote:
> On 2015-10-24 02:36, Alexander Duyck wrote:
>> I was thinking about it and I am pretty sure the dummy write approach is
>> problematic at best.  Specifically the issue is that while you are
>> performing a dummy write you risk pulling in descriptors for data that
>> hasn't been dummy written to yet.  So when you resume and restore your
>> descriptors you will have ones that may contain Rx descriptors
>> indicating they contain data when after the migration they don't.
>
> How about changing sequence? dummy writing Rx packet data first and then
> its desc. This can ensure that RX data is migrated before its desc and
> prevent such case.

No.  I think you are missing the fact that there are 256 descriptors per
page.  As such if you dirty just 1 you will be pulling in 255 more, of
which you may or may not have pulled in the receive buffer for.

So for example if you have the descriptor ring size set to 256 then that
means you are going to get whatever the descriptor ring has since you
will be marking the entire ring dirty with every packet processed,
however you cannot guarantee that you are going to get all of the
receive buffers unless you go through and flush the entire ring prior to
migrating.

This is why I have said you will need to do something to force the rings
to be flushed such as initiating a PM suspend prior to migrating.  You
need to do something to stop the DMA and flush the remaining Rx buffers
if you want to have any hope of being able to migrate the Rx in a
consistent state.  Beyond that the only other thing you have to worry
about are the Rx buffers that have already been handed off to the
stack.  However those should be handled if you do a suspend and somehow
flag pages as dirty when they are unmapped from the DMA.

- Alex
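
The "256 descriptors per page" figure is plain arithmetic, sketched below.
Both constants are assumptions chosen to match that figure (4 KiB pages
and 16-byte Rx descriptors), and the two function names are hypothetical.

```c
#define PAGE_SIZE_BYTES	4096u	/* assumed 4 KiB pages */
#define RX_DESC_BYTES	16u	/* assumed size of one Rx descriptor */

/* Number of descriptors that share one page of ring memory. */
static unsigned int descs_per_page(void)
{
	return PAGE_SIZE_BYTES / RX_DESC_BYTES;
}

/* Index of the ring page that holds descriptor "idx". */
static unsigned int desc_page(unsigned int idx)
{
	return idx / descs_per_page();
}
```

Since dirty tracking works on whole pages, dirtying descriptor 0 also
marks descriptors 1 through 255 for migration, whether or not their
receive buffers have been dirtied yet.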


[PATCH net-next] sock: don't enable netstamp for af_unix sockets

2015-10-26 Thread Hannes Frederic Sowa
netstamp_needed is toggled for all socket families if they request
timestamping. But some protocols don't need the lower-layer timestamping
code at all. This patch starts disabling it for af-unix.

E.g. systemd enables timestamping during boot-up on the journald af-unix
sockets, thus causing the system to globally enable timestamping in the
lower networking stack. Still, it is very probable that timestamping
gets activated, by e.g. dhclient or various NTP implementations.

Reported-by: Jesper Dangaard Brouer 
Signed-off-by: Hannes Frederic Sowa 
---
 net/core/sock.c | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index dcc7d62..0ef30aa 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -422,13 +422,25 @@ static void sock_warn_obsolete_bsdism(const char *name)
}
 }
 
+static bool sock_needs_netstamp(const struct sock *sk)
+{
+   switch (sk->sk_family) {
+   case AF_UNSPEC:
+   case AF_UNIX:
+   return false;
+   default:
+   return true;
+   }
+}
+
 #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
 
 static void sock_disable_timestamp(struct sock *sk, unsigned long flags)
 {
if (sk->sk_flags & flags) {
sk->sk_flags &= ~flags;
-   if (!(sk->sk_flags & SK_FLAGS_TIMESTAMP))
+   if (sock_needs_netstamp(sk) &&
+   !(sk->sk_flags & SK_FLAGS_TIMESTAMP))
net_disable_timestamp();
}
 }
@@ -1582,7 +1594,8 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
if (newsk->sk_prot->sockets_allocated)
sk_sockets_allocated_inc(newsk);
 
-   if (newsk->sk_flags & SK_FLAGS_TIMESTAMP)
+   if (sock_needs_netstamp(sk) &&
+   newsk->sk_flags & SK_FLAGS_TIMESTAMP)
net_enable_timestamp();
}
 out:
@@ -2510,7 +2523,8 @@ void sock_enable_timestamp(struct sock *sk, int flag)
 * time stamping, but time stamping might have been on
 * already because of the other one
 */
-   if (!(previous_flags & SK_FLAGS_TIMESTAMP))
+   if (sock_needs_netstamp(sk) &&
+   !(previous_flags & SK_FLAGS_TIMESTAMP))
net_enable_timestamp();
}
 }
-- 
2.5.0



Re: [PATCH net-next] sock: don't enable netstamp for af_unix sockets

2015-10-26 Thread Richard Cochran
On Mon, Oct 26, 2015 at 01:51:37PM +0100, Hannes Frederic Sowa wrote:
> netstamp_needed is toggled for all socket families if they request
> timestamping. But some protocols don't need the lower-layer timestamping
> code at all. This patch starts disabling it for af-unix.

What problem is this patch trying to solve?

Thanks,
Richard


Re: [PATCH net-next] sock: don't enable netstamp for af_unix sockets

2015-10-26 Thread Hannes Frederic Sowa
Hello,

On Mon, Oct 26, 2015, at 14:19, Richard Cochran wrote:
> On Mon, Oct 26, 2015 at 01:51:37PM +0100, Hannes Frederic Sowa wrote:
> > netstamp_needed is toggled for all socket families if they request
> > timestamping. But some protocols don't need the lower-layer timestamping
> > code at all. This patch starts disabling it for af-unix.
> 
> What problem is this patch trying to solve?

netstamp_needed is a static-key which enables timestamping code in the
networking stack receive functions for every packet, while it is not
needed for AF_UNIX/LOCAL. So it is merely a small performance
enhancement.

Bye,
Hannes
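
Editorial note: the idea in the reply above can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel code: `mini_sock`, the `FAM_*` constants and the plain integer `netstamp_users` are hypothetical stand-ins for `struct sock`, the address families and the `netstamp_needed` static key. It shows that a local (AF_UNIX-like) socket asking for timestamps no longer bumps the global count, while an inet socket still does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; these are not the kernel definitions. */
enum { FAM_UNSPEC = 0, FAM_UNIX = 1, FAM_INET = 2 };

struct mini_sock {
	int family;
	bool timestamp;		/* this socket asked for timestamps */
};

/* Models the number of users behind the netstamp_needed static key. */
static int netstamp_users;

/* Mirrors the switch in the patch: local families never need netstamp. */
static bool needs_netstamp(const struct mini_sock *sk)
{
	switch (sk->family) {
	case FAM_UNSPEC:
	case FAM_UNIX:
		return false;
	default:
		return true;
	}
}

static void enable_timestamp(struct mini_sock *sk)
{
	if (sk->timestamp)
		return;
	sk->timestamp = true;
	if (needs_netstamp(sk))
		netstamp_users++;
}

static void disable_timestamp(struct mini_sock *sk)
{
	if (!sk->timestamp)
		return;
	sk->timestamp = false;
	if (needs_netstamp(sk))
		netstamp_users--;
}

/* One local and one inet socket enable timestamps; returns the user count. */
static int demo_netstamp(void)
{
	struct mini_sock journald = { FAM_UNIX, false };
	struct mini_sock ntp = { FAM_INET, false };

	netstamp_users = 0;
	enable_timestamp(&journald);
	enable_timestamp(&ntp);
	return netstamp_users;
}
```

Enable and disable both consult the same predicate, so the count stays balanced for every family.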


[PATCH] nfc: nci: non-static functions can not be inline

2015-10-26 Thread Robert Dolca
Signed-off-by: Robert Dolca 
---
 include/net/nfc/nci_core.h |  8 
 net/nfc/nci/core.c | 16 
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/net/nfc/nci_core.h b/include/net/nfc/nci_core.h
index 530df66..1e3db2b 100644
--- a/include/net/nfc/nci_core.h
+++ b/include/net/nfc/nci_core.h
@@ -351,13 +351,13 @@ static inline int nci_set_vendor_cmds(struct nci_dev *ndev,
 
 void nci_rsp_packet(struct nci_dev *ndev, struct sk_buff *skb);
 void nci_ntf_packet(struct nci_dev *ndev, struct sk_buff *skb);
-inline int nci_prop_rsp_packet(struct nci_dev *ndev, __u16 opcode,
+int nci_prop_rsp_packet(struct nci_dev *ndev, __u16 opcode,
struct sk_buff *skb);
-inline int nci_prop_ntf_packet(struct nci_dev *ndev, __u16 opcode,
+int nci_prop_ntf_packet(struct nci_dev *ndev, __u16 opcode,
struct sk_buff *skb);
-inline int nci_core_rsp_packet(struct nci_dev *ndev, __u16 opcode,
+int nci_core_rsp_packet(struct nci_dev *ndev, __u16 opcode,
struct sk_buff *skb);
-inline int nci_core_ntf_packet(struct nci_dev *ndev, __u16 opcode,
+int nci_core_ntf_packet(struct nci_dev *ndev, __u16 opcode,
struct sk_buff *skb);
 void nci_rx_data_packet(struct nci_dev *ndev, struct sk_buff *skb);
 int nci_send_cmd(struct nci_dev *ndev, __u16 opcode, __u8 plen, void *payload);
diff --git a/net/nfc/nci/core.c b/net/nfc/nci/core.c
index ecf420d..0767cc1 100644
--- a/net/nfc/nci/core.c
+++ b/net/nfc/nci/core.c
@@ -1314,29 +1314,29 @@ static int nci_op_ntf_packet(struct nci_dev *ndev, __u16 ntf_opcode,
return op->ntf(ndev, skb);
 }
 
-inline int nci_prop_rsp_packet(struct nci_dev *ndev, __u16 opcode,
-  struct sk_buff *skb)
+int nci_prop_rsp_packet(struct nci_dev *ndev, __u16 opcode,
+   struct sk_buff *skb)
 {
return nci_op_rsp_packet(ndev, opcode, skb, ndev->ops->prop_ops,
 ndev->ops->n_prop_ops);
 }
 
-inline int nci_prop_ntf_packet(struct nci_dev *ndev, __u16 opcode,
-  struct sk_buff *skb)
+int nci_prop_ntf_packet(struct nci_dev *ndev, __u16 opcode,
+   struct sk_buff *skb)
 {
return nci_op_ntf_packet(ndev, opcode, skb, ndev->ops->prop_ops,
 ndev->ops->n_prop_ops);
 }
 
-inline int nci_core_rsp_packet(struct nci_dev *ndev, __u16 opcode,
-  struct sk_buff *skb)
+int nci_core_rsp_packet(struct nci_dev *ndev, __u16 opcode,
+   struct sk_buff *skb)
 {
return nci_op_rsp_packet(ndev, opcode, skb, ndev->ops->core_ops,
  ndev->ops->n_core_ops);
 }
 
-inline int nci_core_ntf_packet(struct nci_dev *ndev, __u16 opcode,
-  struct sk_buff *skb)
+int nci_core_ntf_packet(struct nci_dev *ndev, __u16 opcode,
+   struct sk_buff *skb)
 {
return nci_op_ntf_packet(ndev, opcode, skb, ndev->ops->core_ops,
 ndev->ops->n_core_ops);
-- 
1.9.1



Re: [PATCH v3 net-next] bpf: fix bpf_perf_event_read() helper

2015-10-26 Thread Peter Zijlstra
On Sun, Oct 25, 2015 at 09:23:36AM -0700, Alexei Starovoitov wrote:
> bpf_perf_event_read() muxes of -EINVAL into return value, but it's non
> ambiguous to the program whether it got an error or real counter value.

How can that be, the (u64)-EINVAL value is a valid counter value..
unlikely maybe, but still quite possible.
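
Editorial note: a small sketch of the ambiguity raised above. The function names are hypothetical and this is not the BPF helper itself; it only assumes a reader that folds `-EINVAL` into an unsigned 64-bit return value. A real counter that happens to hold the same bit pattern cannot be told apart from the error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical reader: returns the counter value, or -EINVAL cast to
 * an unsigned 64-bit value when the read is not valid.
 */
static uint64_t read_counter_muxed(bool valid, uint64_t counter)
{
	if (!valid)
		return (uint64_t)-EINVAL;
	return counter;
}

/* True if a successful read of real_counter looks exactly like the error. */
static bool indistinguishable(uint64_t real_counter)
{
	return read_counter_muxed(true, real_counter) ==
	       read_counter_muxed(false, 0);
}
```

Returning the status and the value through separate channels (for example a status return plus an output pointer) would remove the overlap.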


Re: [PATCH net-next] ipv6: recreate ipv6 link-local addresses when increasing MTU over IPV6_MIN_MTU

2015-10-26 Thread Hannes Frederic Sowa
Hi Alex,

On Mon, Oct 26, 2015, at 16:52, Alexander Duyck wrote:
> On 10/26/2015 07:36 AM, Hannes Frederic Sowa wrote:
> > Take into consideration that the interface might be disabled for IPv6,
> > thus switch event type.
> >
> > Signed-off-by: Hannes Frederic Sowa 
> > ---
> >   net/ipv6/addrconf.c | 7 +--
> >   1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> > index d0c685c..c2dcebe 100644
> > --- a/net/ipv6/addrconf.c
> > +++ b/net/ipv6/addrconf.c
> > @@ -3149,6 +3149,7 @@ static int addrconf_notify(struct notifier_block 
> > *this, unsigned long event,
> >   
> > case NETDEV_UP:
> > case NETDEV_CHANGE:
> > +netdev_change:
> > if (dev->flags & IFF_SLAVE)
> > break;
> >   
> > @@ -3244,8 +3245,10 @@ static int addrconf_notify(struct notifier_block 
> > *this, unsigned long event,
> >   
> > if (!idev && dev->mtu >= IPV6_MIN_MTU) {
> > idev = ipv6_add_dev(dev);
> > -   if (!IS_ERR(idev))
> > -   break;
> > +   if (!IS_ERR(idev)) {
> > +   event = NETDEV_UP;
> > +   goto netdev_change;
> > +   }
> > }
> >   
> > /*
> 
> Seems like this code isn't quite correct.  You are calling ipv6_add_dev 
> for slave devices, and if I understand things correctly I don't believe 
> that was happening before and may be an unintended side effect.

Hmm, could you quickly show me where I get into this situation? I made
sure I enter the NETDEV_UP part before the IFF_SLAVE test and the
disable_ipv6 test.

> You might want to instead just make it so that you only do the jump, and 
> perhaps change the code in the NETDEV_UP/NETDEV_CHANGE section so that 
> you test for NETDEV_CHANGE instead of NETDEV_UP.  That should be enough 
> to get the effect you are looking for and I believe there would be no 
> change to behaviour other than adding IPv6 link-local addresses when the 
> MTU is increased.
> 
> Give me a bit and I can submit an alternative that may actually work out 
> a bit better I think.

If you go the NETDEV_CHANGE route instead of NETDEV_UP, you end up with
the IF_READY flag already set from ipv6_add_dev and thus won't do any
initialization of the device.

Sure, I'll wait.

Bye,
Hannes


Re: [PATCH net-next] ipv6: recreate ipv6 link-local addresses when increasing MTU over IPV6_MIN_MTU

2015-10-26 Thread Alexander Duyck

On 10/26/2015 07:36 AM, Hannes Frederic Sowa wrote:

Take into consideration that the interface might be disabled for IPv6,
thus switch event type.

Signed-off-by: Hannes Frederic Sowa 
---
  net/ipv6/addrconf.c | 7 +--
  1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index d0c685c..c2dcebe 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3149,6 +3149,7 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
  
  	case NETDEV_UP:

case NETDEV_CHANGE:
+netdev_change:
if (dev->flags & IFF_SLAVE)
break;
  
@@ -3244,8 +3245,10 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
  
  		if (!idev && dev->mtu >= IPV6_MIN_MTU) {

idev = ipv6_add_dev(dev);
-   if (!IS_ERR(idev))
-   break;
+   if (!IS_ERR(idev)) {
+   event = NETDEV_UP;
+   goto netdev_change;
+   }
}
  
  		/*


Seems like this code isn't quite correct.  You are calling ipv6_add_dev 
for slave devices, and if I understand things correctly I don't believe 
that was happening before and may be an unintended side effect.


You might want to instead just make it so that you only do the jump, and 
perhaps change the code in the NETDEV_UP/NETDEV_CHANGE section so that 
you test for NETDEV_CHANGE instead of NETDEV_UP.  That should be enough 
to get the effect you are looking for and I believe there would be no 
change to behaviour other than adding IPv6 link-local addresses when the 
MTU is increased.


Give me a bit and I can submit an alternative that may actually work out 
a bit better I think.


- Alex


Re: Missing IPv4 routes

2015-10-26 Thread Alexander Duyck

On 10/24/2015 06:32 AM, Brian Rak wrote:



On 10/23/2015 6:32 PM, Alexander Duyck wrote:

On 10/23/2015 02:34 PM, Brian Rak wrote:

I've got a weird situation here.  I have a route that the kernel knows
about, but won't display via the general RTM_GETROUTE call, but will
display if I query for that particular route:

# ip -4 route show | grep 108.61.171.x


The use of 'x' here is going to make things confusing.  I assume you
are using a value of 0 here, or is this a route to a specific IP
address that you have.  If not you should be using a 0 for all bits
that would be outside of your subnet mask.


This is a route to a particular IP address:

# ip route show | grep  108.61.171.247
# ip route get  108.61.171.247
108.61.171.247 dev SRVID630287
 cache


Okay, makes sense.


# ip route get 108.61.171.x
108.61.171.x dev MYIF
 cache


The 'x' being the actual value here should work as this will perform a
lookup as I recall.


# cat /proc/net/route | grep 108.61.171.x


The IPs are in network order and as just hex so this won't work.


# cat /proc/net/route  | grep -i 6c3dac


The byte ordering you are using is backwards here from what I can
tell.  So it should be ac3d6c you are checking for, not the other way
around.  So for example if I was using 192.168.1.x I would want to
look for 01A8C0.
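
Editorial note: the byte-order point can be checked with a short sketch. It assumes a little-endian host, where /proc/net/route prints the big-endian 32-bit address with %08X so the octets appear reversed; `proc_route_hex` is a hypothetical helper, not part of any tool. Note that 171 is 0xAB, so under this assumption 108.61.171.247 would appear as F7AB3D6C.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format the dotted quad a.b.c.d the way it would appear in
 * /proc/net/route on a little-endian host: octets reversed to
 * d c b a, each printed as two upper-case hex digits.
 */
static void proc_route_hex(unsigned a, unsigned b, unsigned c, unsigned d,
			   char out[9])
{
	snprintf(out, 9, "%02X%02X%02X%02X",
		 d & 0xffu, c & 0xffu, b & 0xffu, a & 0xffu);
}
```

The resulting string can be passed to `grep -i` against /proc/net/route to look for an exact destination.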

Oops.  This also doesn't show the route, which it should:

# cat /proc/net/route  | grep SRVID630287
#



So does this device have no routes on it then?  I'm just wanting to 
confirm the behaviour you are seeing since my concern was mostly about a 
bug I had introduced where we were losing one route if a dump was broken 
up over multiple pages.  It seems like that isn't the case.





# ip route add 108.61.171.x dev MYIF
RTNETLINK answers: File exists
# ip route del 108.61.171.x  < it deletes successfully once
# ip route del 108.61.171.x
RTNETLINK answers: No such process



So at least we have the routes in the FIB.  It looks like this just
might be a display issue.


This is on a machine running 4.1.3, but I have seen it on earlier
versions in the past.

I don't have great reproduction steps here, I've seen this 4-5 times in
the past few months (on different hardware).  So far, I haven't really
found any way of fixing it (deleting and readding the route has no
effect).  I thought at first this might be related to
e55ffaf457bcc8ec4e9d9f56f955971f834d65b3, but as far as I can tell that
only relates to /proc/net/route.

Any suggestions on further troubleshooting here?  I'm all out of ideas
(and since I can't easily reproduce it yet, I can't reboot to a newer
kernel to see if it goes away)


How many routes do you have on your system?  I'm just wondering if it
might be possible that the route could be at a boundary for the dump
call and if it might be possibly losing the data there. Although I
would expect

ip -4 route show | wc -l shows 67


Also have you tried double checking to verify that grep isn't somehow
missing the line?

Yes, so we noticed this issue because BIRD stopped picking up the
route.  BIRD's trying to grab these via netlink:
https://github.com/BIRD/bird/blob/master/sysdep/linux/netlink.c#L1045 ,
so I don't believe this is just an issue with grep missing the route.  I
also wrote a simple  python script with pyroute2, which also missed the
route.

I was doing some testing to see if I could add routes for nearby IPs,
and ended up somehow correcting the issue:

# ip route show | grep SRVID630287
# ip route add 108.61.171.200/32 dev SRVID630287
# ip route show | grep SRVID630287
108.61.171.200 dev SRVID630287  scope link
108.61.171.247 dev SRVID630287  scope link
# ip route del 108.61.171.200/32 dev SRVID630287
# ip route show | grep SRVID630287
108.61.171.247 dev SRVID630287  scope link

Does that make any sense?


It might if there is a hole in what is being displayed.  One thing you 
might try doing is to generate two dumps, one with your additional route 
and one without and then try doing a diff between the two.  Then you 
might look at adding a few more routes to see if that forces the missing 
route to appear but perhaps causes another route to disappear from the dump.


With that test we should be able to identify the behaviour since it 
sounds like an issue where the route is there in memory, but for 
whatever reason it isn't being displayed.  If we can identify a hole 
that these routes are falling into we might be able to determine what is 
causing the issue.


- Alex


[PATCH v2] net/mlx4: Memcpy at slave_event should copy sizeof mlx4_eqe

2015-10-26 Thread clsoto
From: Carol L Soto 

If the caps.eqe_size is bigger than the struct mlx4_eqe then there
is a potential for corrupting data at the master context. We can see
the message "Master failed to generate an EQE for slave: X" when the
event_eqe array wraps and we can see potential oops at the function
mlx4_GEN_EQE. Also correct a memset of cmd_eqe to use the sizeof
mlx4_eqe instead of eqe_size. 

Fixes: 08ff32352d6f ('mlx4: 64-byte CQE/EQE support')
Signed-off-by: Carol L Soto 

---
 drivers/net/ethernet/mellanox/mlx4/cmd.c | 2 +-
 drivers/net/ethernet/mellanox/mlx4/eq.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index 0a32020..2177e56 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -2398,7 +2398,7 @@ int mlx4_multi_func_init(struct mlx4_dev *dev)
}
}
 
-   memset(&priv->mfunc.master.cmd_eqe, 0, dev->caps.eqe_size);
+   memset(&priv->mfunc.master.cmd_eqe, 0, sizeof(struct mlx4_eqe));
priv->mfunc.master.cmd_eqe.type = MLX4_EVENT_TYPE_CMD;
INIT_WORK(&priv->mfunc.master.comm_work,
  mlx4_master_comm_channel);
diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c b/drivers/net/ethernet/mellanox/mlx4/eq.c
index c344884..603d1c3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/eq.c
@@ -196,7 +196,7 @@ static void slave_event(struct mlx4_dev *dev, u8 slave, struct mlx4_eqe *eqe)
return;
}
 
-   memcpy(s_eqe, eqe, dev->caps.eqe_size - 1);
+   memcpy(s_eqe, eqe, sizeof(struct mlx4_eqe) - 1);
s_eqe->slave_id = slave;
/* ensure all information is written before setting the ownersip bit */
dma_wmb();
-- 
1.8.3.1
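
Editorial note: a userspace sketch of why the copy length matters. The 32-byte `struct eqe`, the ring size and the helper names are assumptions made for illustration, not the mlx4 definitions. When the device-reported entry size is larger than the software struct, copying "entry size - 1" bytes into an array of structs spills into the next slot; bounding the copy by the struct size keeps each slot intact.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 32-byte event entry; the last byte holds the owner bit. */
struct eqe {
	unsigned char payload[31];
	unsigned char owner;
};

#define RING_SIZE 4

/* Copy one event into a ring slot, leaving the owner byte untouched. */
static void ring_store(struct eqe ring[RING_SIZE], unsigned slot,
		       const struct eqe *src)
{
	/* Bound the copy by the slot type, not by a device-reported size. */
	memcpy(&ring[slot % RING_SIZE], src, sizeof(struct eqe) - 1);
}

/* Bytes that a copy of hw_eqe_size - 1 would write past one slot. */
static size_t overrun_bytes(size_t hw_eqe_size)
{
	size_t n = hw_eqe_size - 1;

	return n > sizeof(struct eqe) ? n - sizeof(struct eqe) : 0;
}
```

With a 64-byte hardware entry the unbounded copy would write 31 bytes into the neighbouring slot, or past the end of the array for the last slot.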



[PATCH v1 1/3] virtio-net: Using single MSIX IRQ for TX/RX Q pair

2015-10-26 Thread Ravi Kerur
Ported earlier patch from Jason Wang (dated 12/26/2014).

This patch tries to reduce the number of MSIX irqs required for
virtio-net by sharing one MSIX irq per TX/RX queue pair through
channels. If the transport supports channels, the number of MSIX irqs
is roughly halved.

Signed-off-by: Ravi Kerur 
---
 drivers/net/virtio_net.c | 29 -
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d8838ded..d705cce 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -72,6 +72,9 @@ struct send_queue {
 
/* Name of the send queue: output.$index */
char name[40];
+
+   /* Name of the channel, shared with irq. */
+   char channel_name[40];
 };
 
 /* Internal representation of a receive virtqueue */
@@ -1529,6 +1532,8 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
int ret = -ENOMEM;
int i, total_vqs;
const char **names;
+   const char **channel_names;
+   unsigned *channels;
 
/* We expect 1 RX virtqueue followed by 1 TX virtqueue, followed by
 * possible N-1 RX/TX queue pairs used in multiqueue mode, followed by
@@ -1548,6 +1553,17 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
if (!names)
goto err_names;
 
+   channel_names = kmalloc_array(vi->max_queue_pairs,
+ sizeof(*channel_names),
+ GFP_KERNEL);
+   if (!channel_names)
+   goto err_channel_names;
+
+   channels = kmalloc_array(total_vqs, sizeof(*channels),
+GFP_KERNEL);
+   if (!channels)
+   goto err_channels;
+
/* Parameters for control virtqueue, if any */
if (vi->has_cvq) {
callbacks[total_vqs - 1] = NULL;
@@ -1562,10 +1578,15 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
sprintf(vi->sq[i].name, "output.%d", i);
names[rxq2vq(i)] = vi->rq[i].name;
names[txq2vq(i)] = vi->sq[i].name;
+   sprintf(vi->sq[i].channel_name, "txrx.%d", i);
+   channel_names[i] = vi->sq[i].channel_name;
+   channels[rxq2vq(i)] = i;
+   channels[txq2vq(i)] = i;
}
 
ret = vi->vdev->config->find_vqs(vi->vdev, total_vqs, vqs, callbacks,
-names);
+names, channels, channel_names,
+vi->max_queue_pairs);
if (ret)
goto err_find;
 
@@ -1580,6 +1601,8 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
vi->sq[i].vq = vqs[txq2vq(i)];
}
 
+   kfree(channels);
+   kfree(channel_names);
kfree(names);
kfree(callbacks);
kfree(vqs);
@@ -1587,6 +1610,10 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
return 0;
 
 err_find:
+   kfree(channels);
+err_channels:
+   kfree(channel_names);
+err_channel_names:
kfree(names);
 err_names:
kfree(callbacks);
-- 
1.9.1
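
Editorial note: the queue-to-channel mapping built in the patch can be sketched as follows. The `rxq2vq()`/`txq2vq()` formulas are an assumption that matches the "1 RX virtqueue followed by 1 TX virtqueue" layout described in the code comment, `build_channels` is a hypothetical helper, and the control virtqueue is left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed queue layout: RX then TX for each pair. */
static int rxq2vq(int rxq) { return rxq * 2; }
static int txq2vq(int txq) { return txq * 2 + 1; }

/*
 * Build the virtqueue -> channel table for `pairs` queue pairs, so that
 * the RX and TX queue of pair i both map to channel i.
 * Returns NULL on allocation failure; the caller frees the result.
 */
static unsigned *build_channels(int pairs)
{
	unsigned *channels = calloc((size_t)pairs * 2, sizeof(*channels));
	int i;

	if (!channels)
		return NULL;
	for (i = 0; i < pairs; i++) {
		channels[rxq2vq(i)] = (unsigned)i;
		channels[txq2vq(i)] = (unsigned)i;
	}
	return channels;
}
```

For three queue pairs this yields the table 0, 0, 1, 1, 2, 2: six virtqueues served by three shared vectors.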



[PATCH v1 2/3] virtio: vp_find_vqs accept channel setting params

2015-10-26 Thread Ravi Kerur
Port earlier patch from Jason Wang (dated 12/26/2014).

This patch lets the vp_find_vqs function accept channel parameters.
Transports that do not currently support channels ignore all of
these parameters. A device that does not use channels can simply
pass NULL to the transport.

Signed-off-by: Ravi Kerur 
---
 drivers/block/virtio_blk.c |  3 ++-
 drivers/char/virtio_console.c  |  3 ++-
 drivers/gpu/drm/virtio/virtgpu_kms.c   |  3 ++-
 drivers/misc/mic/card/mic_virtio.c |  5 -
 drivers/net/caif/caif_virtio.c |  3 ++-
 drivers/remoteproc/remoteproc_virtio.c |  9 ++---
 drivers/rpmsg/virtio_rpmsg_bus.c   |  3 ++-
 drivers/s390/virtio/kvm_virtio.c   |  5 -
 drivers/s390/virtio/virtio_ccw.c   |  5 -
 drivers/scsi/virtio_scsi.c |  3 ++-
 drivers/virtio/virtio_balloon.c|  3 ++-
 drivers/virtio/virtio_input.c  |  3 ++-
 drivers/virtio/virtio_mmio.c   |  9 ++---
 drivers/virtio/virtio_pci_modern.c |  8 ++--
 include/linux/virtio_config.h  | 11 +--
 15 files changed, 55 insertions(+), 21 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index e93899c..7fb70b3 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -418,7 +418,8 @@ static int init_vq(struct virtio_blk *vblk)
}
 
/* Discover virtqueues and write information to configuration.  */
-   err = vdev->config->find_vqs(vdev, num_vqs, vqs, callbacks, names);
+   err = vdev->config->find_vqs(vdev, num_vqs, vqs, callbacks, names,
+NULL, NULL, 0);
if (err)
goto err_find_vqs;
 
diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index d2406fe..b316820 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -1923,7 +1923,8 @@ static int init_vqs(struct ports_device *portdev)
/* Find the queues. */
err = portdev->vdev->config->find_vqs(portdev->vdev, nr_queues, vqs,
  io_callbacks,
- (const char **)io_names);
+ (const char **)io_names,
+ NULL, NULL, 0);
if (err)
goto free;
 
diff --git a/drivers/gpu/drm/virtio/virtgpu_kms.c b/drivers/gpu/drm/virtio/virtgpu_kms.c
index 782766c..4e521c2f 100644
--- a/drivers/gpu/drm/virtio/virtgpu_kms.c
+++ b/drivers/gpu/drm/virtio/virtgpu_kms.c
@@ -100,7 +100,8 @@ int virtio_gpu_driver_load(struct drm_device *dev, unsigned long flags)
  virtio_gpu_config_changed_work_func);
 
ret = vgdev->vdev->config->find_vqs(vgdev->vdev, 2, vqs,
-   callbacks, names);
+   callbacks, names,
+   NULL, NULL, 0);
if (ret) {
DRM_ERROR("failed to find virt queues\n");
goto err_vqs;
diff --git a/drivers/misc/mic/card/mic_virtio.c b/drivers/misc/mic/card/mic_virtio.c
index e486a0c..09c3f85 100644
--- a/drivers/misc/mic/card/mic_virtio.c
+++ b/drivers/misc/mic/card/mic_virtio.c
@@ -311,7 +311,10 @@ unmap:
 static int mic_find_vqs(struct virtio_device *vdev, unsigned nvqs,
struct virtqueue *vqs[],
vq_callback_t *callbacks[],
-   const char *names[])
+   const char *names[],
+   unsigned channels[],
+   const char *channel_names[],
+   unsigned nchannels)
 {
struct mic_vdev *mvdev = to_micvdev(vdev);
struct mic_device_ctrl __iomem *dc = mvdev->dc;
diff --git a/drivers/net/caif/caif_virtio.c b/drivers/net/caif/caif_virtio.c
index b306210..150809d 100644
--- a/drivers/net/caif/caif_virtio.c
+++ b/drivers/net/caif/caif_virtio.c
@@ -679,7 +679,8 @@ static int cfv_probe(struct virtio_device *vdev)
goto err;
 
/* Get the TX virtio ring. This is a "guest side vring". */
-   err = vdev->config->find_vqs(vdev, 1, &cfv->vq_tx, &vq_cbs, &names);
+   err = vdev->config->find_vqs(vdev, 1, &cfv->vq_tx, &vq_cbs, &names,
+NULL, NULL, 0);
if (err)
goto err;
 
diff --git a/drivers/remoteproc/remoteproc_virtio.c b/drivers/remoteproc/remoteproc_virtio.c
index e1a1023..16b3532 100644
--- a/drivers/remoteproc/remoteproc_virtio.c
+++ b/drivers/remoteproc/remoteproc_virtio.c
@@ -145,9 +145,12 @@ static void rproc_virtio_del_vqs(struct virtio_device *vdev)
 }
 
 static int rproc_virtio_find_vqs(struct virtio_device *vdev, unsigned nvqs,
-  struct virtqueue *vqs[],
-  vq_callback_t *callbacks[],
-  const char *names[])
+struct virtqueue 

[PATCH v1 3/3] virtio-pci: Introduce channels

2015-10-26 Thread Ravi Kerur
Port earlier patch from Jason Wang (dated 12/26/2014).

This patch introduces virtio pci channels, which are groups of
virtqueues that share a single MSIX irq. This can be used to reduce
the number of irqs needed by a virtio device.

A channel is in fact a list of virtqueues, and vp_channel_interrupt()
is introduced to traverse the list. The current strategy is kept but
is converted to channels internally:

- per vq vectors are implemented through a per vq channel
- sharing interrupts is implemented through a single channel for all
  virtqueues

This is done by letting vp_try_to_find_vqs() accept the array of
channel names and the channel that each vq belongs to.

Signed-off-by: Ravi Kerur 
---
 drivers/virtio/virtio_pci_common.c | 208 +++--
 drivers/virtio/virtio_pci_common.h |  23 ++--
 2 files changed, 146 insertions(+), 85 deletions(-)

diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
index 78f804a..5c0594e 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -76,6 +76,23 @@ static irqreturn_t vp_vring_interrupt(int irq, void *opaque)
return ret;
 }
 
+static irqreturn_t vp_channel_interrupt(int irq, void *opaque)
+{
+   struct virtio_pci_channel *vp_channel = opaque;
+   struct virtio_pci_vq_info *info;
+   irqreturn_t ret = IRQ_NONE;
+   unsigned long flags;
+
+   spin_lock_irqsave(&vp_channel->lock, flags);
+   list_for_each_entry(info, &vp_channel->virtqueues, node) {
+   if (vring_interrupt(irq, info->vq) == IRQ_HANDLED)
+   ret = IRQ_HANDLED;
+   }
+   spin_unlock_irqrestore(&vp_channel->lock, flags);
+
+   return ret;
+}
+
 /* A small wrapper to also acknowledge the interrupt when it's handled.
  * I really need an EIO hook for the vring so I can ack the interrupt once we
  * know that we'll be handling the IRQ but before we invoke the callback since
@@ -112,8 +129,12 @@ static void vp_free_vectors(struct virtio_device *vdev)
vp_dev->intx_enabled = 0;
}
 
-   for (i = 0; i < vp_dev->msix_used_vectors; ++i)
-   free_irq(vp_dev->msix_entries[i].vector, vp_dev);
+   if (vp_dev->msix_used_vectors)
+   free_irq(vp_dev->msix_entries[0].vector, vp_dev);
+
+   for (i = 1; i < vp_dev->msix_used_vectors; ++i)
+   free_irq(vp_dev->msix_entries[i].vector,
+&vp_dev->channels[i - 1]);
 
for (i = 0; i < vp_dev->msix_vectors; i++)
if (vp_dev->msix_affinity_masks[i])
@@ -137,8 +158,7 @@ static void vp_free_vectors(struct virtio_device *vdev)
vp_dev->msix_affinity_masks = NULL;
 }
 
-static int vp_request_msix_vectors(struct virtio_device *vdev, int nvectors,
-  bool per_vq_vectors)
+static int vp_request_msix_vectors(struct virtio_device *vdev, int nvectors)
 {
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
const char *name = dev_name(&vp_dev->vdev.dev);
@@ -175,8 +195,8 @@ static int vp_request_msix_vectors(struct virtio_device *vdev, int nvectors,
vp_dev->msix_enabled = 1;
 
/* Set the vector used for configuration */
-   v = vp_dev->msix_used_vectors;
-   snprintf(vp_dev->msix_names[v], sizeof *vp_dev->msix_names,
+   v = 0;
+   snprintf(vp_dev->msix_names[0], sizeof(*vp_dev->msix_names),
 "%s-config", name);
err = request_irq(vp_dev->msix_entries[v].vector,
  vp_config_changed, 0, vp_dev->msix_names[v],
@@ -192,18 +212,6 @@ static int vp_request_msix_vectors(struct virtio_device *vdev, int nvectors,
goto error;
}
 
-   if (!per_vq_vectors) {
-   /* Shared vector for all VQs */
-   v = vp_dev->msix_used_vectors;
-   snprintf(vp_dev->msix_names[v], sizeof *vp_dev->msix_names,
-"%s-virtqueues", name);
-   err = request_irq(vp_dev->msix_entries[v].vector,
- vp_vring_interrupt, 0, vp_dev->msix_names[v],
- vp_dev);
-   if (err)
-   goto error;
-   ++vp_dev->msix_used_vectors;
-   }
return 0;
 error:
vp_free_vectors(vdev);
@@ -228,6 +236,7 @@ static struct virtqueue *vp_setup_vq(struct virtio_device *vdev, unsigned index,
 u16 msix_vec)
 {
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+   struct virtio_pci_channel *vp_channel;
struct virtio_pci_vq_info *info = kmalloc(sizeof *info, GFP_KERNEL);
struct virtqueue *vq;
unsigned long flags;
@@ -242,9 +251,16 @@ static struct virtqueue *vp_setup_vq(struct virtio_device *vdev, unsigned index,
 
info->vq = vq;
if (callback) {
-   spin_lock_irqsave(&vp_dev->lock, flags);
-   list_add(&info->node, 
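
Editorial note: the channel interrupt described in this patch can be sketched in userspace as below. All type and function names here are hypothetical, a hand-rolled singly linked list stands in for the kernel list, and the locking is omitted. The handler walks every virtqueue that shares the vector and reports "handled" if any of them had work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum irq_result { IRQ_NONE_SKETCH = 0, IRQ_HANDLED_SKETCH = 1 };

struct vq_sketch {
	bool has_work;		/* pretend the ring has pending buffers */
	int serviced;		/* how many times work was processed */
	struct vq_sketch *next;	/* next virtqueue in the same channel */
};

struct channel_sketch {
	struct vq_sketch *head;	/* all virtqueues sharing one vector */
};

/* Per-virtqueue handler: consumes pending work if there is any. */
static enum irq_result vq_interrupt(struct vq_sketch *vq)
{
	if (!vq->has_work)
		return IRQ_NONE_SKETCH;
	vq->has_work = false;
	vq->serviced++;
	return IRQ_HANDLED_SKETCH;
}

/* Channel handler: poll every member, handled if any member was. */
static enum irq_result channel_interrupt(struct channel_sketch *ch)
{
	enum irq_result ret = IRQ_NONE_SKETCH;
	struct vq_sketch *vq;

	for (vq = ch->head; vq != NULL; vq = vq->next)
		if (vq_interrupt(vq) == IRQ_HANDLED_SKETCH)
			ret = IRQ_HANDLED_SKETCH;
	return ret;
}
```

The cost of sharing is that every interrupt polls all members of the channel, including those with nothing to do.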

Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

2015-10-26 Thread Johannes Weiner
On Fri, Oct 23, 2015 at 06:59:57AM -0700, David Miller wrote:
> From: Michal Hocko 
> Date: Fri, 23 Oct 2015 15:19:56 +0200
> 
> > On Thu 22-10-15 00:21:33, Johannes Weiner wrote:
> >> Socket memory can be a significant share of overall memory consumed by
> >> common workloads. In order to provide reasonable resource isolation
> >> out-of-the-box in the unified hierarchy, this type of memory needs to
> >> be accounted and tracked per default in the memory controller.
> > 
> > What about users who do not want to pay an additional overhead for the
> > accounting? How can they disable it?
> 
> Yeah, this really cannot pass.
> 
> This extra overhead will be seen by %99. of users, since entities
> (especially distributions) just flip on all of these config options by
> default.

Okay, there are several layers to this issue.

If you boot a machine with a CONFIG_MEMCG distribution kernel and
don't create any cgroups, I agree there shouldn't be any overhead.

I already sent a patch to generally remove memory accounting on the
system or root level. I can easily update this patch here to not have
any socket buffer accounting overhead for systems that don't actively
use cgroups. Would you be okay with a branch on sk->sk_memcg in the
network accounting path? I'd leave that NULL on the system level then.
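
Editorial note: a minimal sketch of the proposed branch, with hypothetical types. Only the field name `sk_memcg` comes from the message; the plain counter stands in for the real charging logic and its per-cpu batching. Sockets that do not belong to a memory-tracking cgroup carry a NULL pointer and skip the accounting entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the memory cgroup and the socket. */
struct memcg_sketch {
	long charged_pages;
};

struct sock_sketch {
	struct memcg_sketch *sk_memcg;	/* NULL at the system/root level */
};

/* Charge socket buffer pages only when a cgroup tracks this socket. */
static void charge_skmem(struct sock_sketch *sk, long nr_pages)
{
	if (!sk->sk_memcg)
		return;		/* no tracking cgroup: one branch, no accounting */
	sk->sk_memcg->charged_pages += nr_pages;
}
```

In this sketch the untracked path costs a single pointer test per call.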

Then there is of course the case when you create cgroups for process
organization but don't care about memory accounting. Systemd comes to
mind. Or even if you create cgroups to track other resources like CPU
but don't care about memory. The unified hierarchy no longer enables
controllers on new cgroups per default, so unless you create a cgroup
and specifically tell it to account and track memory, you won't have
the socket memory accounting overhead, either.

Then there is the third case, where you create a control group to
specifically manage and limit the memory consumption of a workload. In
that scenario, a major memory consumer like socket buffers, which can
easily grow until OOM, should definitely be included in the tracking
in order to properly contain both untrusted (possibly malicious) and
trusted (possibly buggy) workloads. This is not a hole we can
reasonably leave unpatched for general purpose resource management.

Now you could argue that there might exist specialized workloads that
need to account anonymous pages and page cache, but not socket memory
buffers. Or any other combination of pick-and-choose consumers. But
honestly, nowadays all our paths are lockless, and the counting is an
atomic-add-return with a per-cpu batch cache. I don't think there is a
compelling case for an elaborate interface to make individual memory
consumers configurable inside the memory controller.

So in summary, would you be okay with this patch if networking only
called into the memory controller when you explicitly create a cgroup
AND tell it to track the memory footprint of the workload in it?


Re: Missing IPv4 routes

2015-10-26 Thread Brian Rak



On 10/26/2015 11:28 AM, Alexander Duyck wrote:

On 10/24/2015 06:32 AM, Brian Rak wrote:



On 10/23/2015 6:32 PM, Alexander Duyck wrote:

On 10/23/2015 02:34 PM, Brian Rak wrote:

I've got a weird situation here.  I have a route that the kernel knows
about, but won't display via the general RTM_GETROUTE call, but will
display if I query for that particular route:

# ip -4 route show | grep 108.61.171.x


The use of 'x' here is going to make things confusing.  I assume you
are using a value of 0 here, or is this a route to a specific IP
address that you have.  If not you should be using a 0 for all bits
that would be outside of your subnet mask.


This is a route to a particular IP address:

# ip route show | grep  108.61.171.247
# ip route get  108.61.171.247
108.61.171.247 dev SRVID630287
 cache


Okay, makes sense.


# ip route get 108.61.171.x
108.61.171.x dev MYIF
 cache


The 'x' being the actual value here should work as this will perform a
lookup as I recall.


# cat /proc/net/route | grep 108.61.171.x


The IPs are in network order and as just hex so this won't work.


# cat /proc/net/route  | grep -i 6c3dac


The byte ordering you are using is backwards here from what I can
tell.  So it should be ac3d6c you are checking for, not the other way
around.  So for example if I was using 192.168.1.x I would want to
look for 01A8C0.

Oops.  This also doesn't show the route, which it should:

# cat /proc/net/route  | grep SRVID630287
#
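
The byte ordering discussed above can be checked with a small userspace helper. This is a sketch that assumes a little-endian host (for example x86), where /proc/net/route prints the network-order address as a host-order 32-bit hex word, so the octets appear reversed. `route_word` is a hypothetical name, not part of iproute2 or the kernel.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: convert a dotted-quad IPv4 address into the 32-bit
 * value that /proc/net/route would print with %08X on a little-endian
 * host. Returns 0 on success and -1 if the string is not a valid address.
 */
int route_word(const char *dotted, uint32_t *word)
{
    unsigned int a, b, c, d;
    char extra;

    assert(dotted != NULL && word != NULL);

    /* Exactly four fields must match; a fifth match means trailing junk. */
    if (sscanf(dotted, "%u.%u.%u.%u%c", &a, &b, &c, &d, &extra) != 4)
        return -1;
    if (a > 255 || b > 255 || c > 255 || d > 255)
        return -1;

    /* The first octet ends up in the least significant byte. */
    *word = ((uint32_t)d << 24) | ((uint32_t)c << 16) |
            ((uint32_t)b << 8) | (uint32_t)a;
    return 0;
}
```

Printing the result with `printf("%08X\n", (unsigned int)word)` gives the string to grep for. For 192.168.1.0 that is 0001A8C0, which contains the 01A8C0 mentioned above. Since 171 is 0xAB, the host route 108.61.171.247 would show up as F7AB3D6C on such a machine.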



So does this device have no routes on it then?  I'm just wanting to 
confirm the behaviour you are seeing since my concern was mostly about 
a bug I had introduced where we were losing one route if a dump was 
broken up over multiple pages.  It seems like that isn't the case.

These devices usually only have a single IPv4 route, so seeing no other 
routes there is what I'd expect.







# ip route add 108.61.171.x dev MYIF
RTNETLINK answers: File exists
# ip route del 108.61.171.x  < it deletes successfully once
# ip route del 108.61.171.x
RTNETLINK answers: No such process



So at least we have the routes in the FIB.  It looks like this just
might be a display issue.


This is on a machine running 4.1.3, but I have seen it on earlier
versions in the past.

I don't have great reproduction steps here, I've seen this 4-5 times in
the past few months (on different hardware).  So far, I haven't really
found any way of fixing it (deleting and readding the route has no
effect).  I thought at first this might be related to
e55ffaf457bcc8ec4e9d9f56f955971f834d65b3, but as far as I can tell that
only relates to /proc/net/route.

Any suggestions on further troubleshooting here?  I'm all out of ideas
(and since I can't easily reproduce it yet, I can't reboot to a newer
kernel to see if it goes away)


How many routes do you have on your system?  I'm just wondering if it
might be possible that the route could be at a boundary for the dump
call and if it might be possibly losing the data there. Although I
would expect

ip -4 route show | wc -l shows 67


Also have you tried double checking to verify that grep isn't somehow
missing the line?

Yes, so we noticed this issue because BIRD stopped picking up the
route.  BIRD's trying to grab these via netlink:
https://github.com/BIRD/bird/blob/master/sysdep/linux/netlink.c#L1045 ,
so I don't believe this is just an issue with grep missing the route.  I
also wrote a simple  python script with pyroute2, which also missed the
route.

I was doing some testing to see if I could add routes for nearby IPs,
and ended up somehow correcting the issue:

# ip route show | grep SRVID630287
# ip route add 108.61.171.200/32 dev SRVID630287
# ip route show | grep SRVID630287
108.61.171.200 dev SRVID630287  scope link
108.61.171.247 dev SRVID630287  scope link
# ip route del 108.61.171.200/32 dev SRVID630287
# ip route show | grep SRVID630287
108.61.171.247 dev SRVID630287  scope link

Does that make any sense?


It might if there is a hole in what is being displayed.  One thing you 
might try doing is to generate two dumps, one with your additional 
route and one without and then try doing a diff between the two.  Then 
you might look at adding a few more routes to see if that forces the 
missing route to appear but perhaps causes another route to disappear 
from the dump.


With that test we should be able to identify the behaviour since it 
sounds like an issue where the route is there in memory, but for 
whatever reason it isn't being displayed.  If we can identify a hole 
that these routes are falling into we might be able to determine what 
is causing the issue.


I had added some other routes randomly here (1.1.1.1/32, 
200.200.200.200/32, 108.61.172.249/32).  I didn't see this route 
reappear until I added one in the same /24, but I wasn't checking for 
other routes going missing.


I'm not entirely sure how, but adding that one extra route 
(108.61.171.200/32 dev SRVID630287) appears to have permanently fixed 
the issue.  Even 

Re: [PATCH net] RDS-TCP: Recover correctly from pskb_pull()/pksb_trim() failure in rds_tcp_data_recv

2015-10-26 Thread santosh shilimkar

On 10/26/2015 9:46 AM, Sowmini Varadhan wrote:


Either of pskb_pull() or pskb_trim() may fail under low memory conditions.
If rds_tcp_data_recv() ignores such failures, the application will
receive corrupted data because the skb has not been correctly
carved to the RDS datagram size.

Avoid this by handling pskb_pull/pskb_trim failure in the same
manner as the skb_clone failure: bail out of rds_tcp_data_recv(), and
retry via the deferred call to rds_send_worker() that gets set up on
ENOMEM from rds_tcp_read_sock()

Signed-off-by: Sowmini Varadhan 
---

Good one. Probably we should get this fix in stable versions as
well. It seems to be applicable for all v2.6.32+ stable versions.

FWIW,
Acked-by: Santosh Shilimkar 

Regards,
Santosh


Re: [PATCH net-next 6/8] xen-netback: pass an L4 or L3 skb hash value to the frontend

2015-10-26 Thread Wei Liu
On Wed, Oct 21, 2015 at 11:36:23AM +0100, Paul Durrant wrote:
> If the frontend indicates it's capable (see netif.h for details) and an
> skb has an L4 or L3 hash value then pass the value to the frontend in
> a xen_netif_extra_info segment.
> 
> Signed-off-by: Paul Durrant 
> Cc: Ian Campbell 
> Cc: Wei Liu 

Reviewed-by: Wei Liu 

>  static int xenvif_rx_ring_slots_needed(struct xenvif *vif)
>  {
> - if (vif->gso_mask)
> - return DIV_ROUND_UP(vif->dev->gso_max_size, PAGE_SIZE) + 1;
> + int needed;
> +
> + if (vif->gso_mask || vif->gso_prefix_mask)

It seems like this line should become a patch for -stable?

>   xenvif_add_frag_responses(queue, status,
> diff --git a/drivers/net/xen-netback/xenbus.c 
> b/drivers/net/xen-netback/xenbus.c
> index 2fa8a16..a31bcee 100644
> --- a/drivers/net/xen-netback/xenbus.c
> +++ b/drivers/net/xen-netback/xenbus.c
> @@ -1037,6 +1037,11 @@ static int read_xenbus_vif_flags(struct backend_info 
> *be)
>   val = 0;
>   vif->multicast_control = !!val;
>  
> + if (xenbus_scanf(XBT_NIL, dev->otherend, "feature-hash",
> +  "%d", &val) < 0)
> + val = 0;

Again, feel free to retain my reviewed-by if this changes in next
version.

Wei.

> + vif->hash_extra = !!val;
> +
>   return 0;
>  }
>  
> -- 
> 2.1.4


Re: [PATCH net-next 8/8] xen-netback: add support for toeplitz hashing

2015-10-26 Thread Wei Liu
On Wed, Oct 21, 2015 at 11:36:25AM +0100, Paul Durrant wrote:
> This patch adds all the necessary infrastructure to allow a frontend to
> specify toeplitz hashing of network packets on its receive side. (See
> netif.h for details of the xenbus protocol).
> 
> The toeplitz hash algorithm itself was based on pseudo-code provided by
> Microsoft at:
> 
> https://msdn.microsoft.com/en-us/library/windows/hardware/ff570725.aspx
> 
> Signed-off-by: Paul Durrant 
> Cc: Ian Campbell 
> Cc: Wei Liu 
[...]
>  
> diff --git a/drivers/net/xen-netback/interface.c 
> b/drivers/net/xen-netback/interface.c
> index 0c7da7b..38eee4f 100644
> --- a/drivers/net/xen-netback/interface.c
> +++ b/drivers/net/xen-netback/interface.c
> @@ -142,17 +142,122 @@ void xenvif_wake_queue(struct xenvif_queue *queue)
>   netif_tx_wake_queue(netdev_get_tx_queue(dev, id));
>  }
>  

I skipped the hash implementation because I don't think I know enough to
tell if it is correct or not, and protocol negotiation because I think
that's going to change in next version.
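
For readers in the same position, the algorithm itself is small. The following is a minimal userspace sketch of a Toeplitz hash as it is commonly described for receive-side scaling. It is not the code from the patch, and the function name and the requirement that the key be at least four bytes longer than the input are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toeplitz hash sketch: walk the input bit by bit, most significant bit
 * first. For every set input bit, XOR the current 32-bit window of the
 * key into the result, then slide the window one bit along the key.
 *
 * Assumption of this sketch: key_len >= len + 4, so the window never
 * runs past the end of the key.
 */
uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
                       const uint8_t *data, size_t len)
{
    uint32_t window, hash = 0;
    size_t i;
    int bit;

    assert(key != NULL && key_len >= 4 && key_len >= len + 4);
    assert(data != NULL || len == 0);

    /* The window starts out as the first 32 bits of the key. */
    window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
             ((uint32_t)key[2] << 8) | (uint32_t)key[3];

    for (i = 0; i < len; i++) {
        for (bit = 7; bit >= 0; bit--) {
            if (data[i] & (1u << bit))
                hash ^= window;

            /* Shift in the next key bit, which lives 32 bits ahead. */
            window <<= 1;
            if (key[i + 4] & (1u << bit))
                window |= 1u;
        }
    }
    return hash;
}
```

Two properties make a quick sanity check possible without external test vectors: an all-zero input hashes to 0, and the hash is linear under XOR, so the hash of two inputs XORed together equals the XOR of their hashes.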

> +
> +
> +static void xen_net_read_toeplitz_key(struct xenvif *vif,
> +   const char *node)
> +{
> + struct xenbus_device *dev = xenvif_to_xenbus_device(vif);
> + char *str, *token;
> + u8 key[40];

This should use the macro.

> + unsigned int n, i;
> +
> + str = xenbus_read(XBT_NIL, node, "key", NULL);
> + if (IS_ERR(str))
> + goto fail1;
> +
> + memset(key, 0, sizeof(key));
> +
> + n = 0;
> + while ((token = strsep(&str, ",")) != NULL) {
> + int rc;
> +
> + if (n >= ARRAY_SIZE(vif->hash_params.toeplitz.key)) {
> + pr_err("%s: key too big\n",
> +dev->nodename);
> + goto fail2;
> + }
> +
> + rc = kstrtou8(token, 0, &key[n]);
> + if (rc < 0) {
> + pr_err("%s: invalid key value (%s at index %u)\n",
> +dev->nodename, token, n);
> + goto fail2;
> + }
> +
> + n++;
> + }
> +
> + for (i = 0; i < ARRAY_SIZE(vif->hash_params.toeplitz.key); i++)
> + vif->hash_params.toeplitz.key[i] = key[i];
> +
> + kfree(str);
> + return;
> +
> +fail2:
> + kfree(str);
> +fail1:
> + vif->hash_params.toeplitz.types = 0;
> +}
> +
[...]
> +
> +static void xen_hash_changed(struct xenbus_watch *watch,
> +  const char **vec, unsigned int len)
> +{
> + struct xenvif *vif = container_of(watch, struct xenvif, hash_watch);
> +
> + xen_net_read_hash(vif);

I think the same question for previous patch applies here, too.

Is there any concern of correctness and security implication that you
just change the hash without stopping the vif?

Wei.


Re: [PATCH net-next 1/8] xen-netback: re-import canonical netif header

2015-10-26 Thread Wei Liu
On Wed, Oct 21, 2015 at 11:36:18AM +0100, Paul Durrant wrote:
> The canonical netif header (in the Xen source repo) and the Linux variant
> have diverged significantly. Recently much documentation has been added to
> the canonical header and new definitions and types to support packet hash
> configuration. Subsequent patches in this series add support for packet
> hash configuration in xen-netback so this patch re-imports the canonical
> header in readiness.
> 
> To maintain compatibility and some style consistency with the old Linux
> variant, the header was stripped of its emacs boilerplate, and
> post-processed and copied into place with the following commands:
> 
> ed -s netif.h << EOF
> H
> ,s/NETTXF_/XEN_NETTXF_/g
> ,s/NETRXF_/XEN_NETRXF_/g
> ,s/NETIF_RSP/XEN_NETIF_RSP/g
> ,s/netif_tx/xen_netif_tx/g
> ,s/netif_rx/xen_netif_rx/g
> ,s/netif_extra_info/xen_netif_extra_info/g
> w
> EOF
> 
> indent --linux-style netif.h -o include/xen/interface/io/netif.h
> 
> Signed-off-by: Paul Durrant 
> Cc: Konrad Rzeszutek Wilk 
> Cc: Boris Ostrovsky 
> Cc: David Vrabel 
> Cc: Wei Liu 
> ---
> 
> Whilst awaiting review of my patches to the canonical netif.h, import has
> been done from my staging branch using:
> 
> wget 
> http://xenbits.xen.org/gitweb/?p=people/pauldu/xen.git;a=blob_plain;f=xen/include/public/io/netif.h;hb=refs/heads/netif

There is on-going discussion on this so I'm going to skip this patch for
now.

Wei.


Re: [PATCH net-next 2/8] xen-netback: remove GSO information from xenvif_rx_meta

2015-10-26 Thread Wei Liu
On Wed, Oct 21, 2015 at 11:36:19AM +0100, Paul Durrant wrote:
> The code in net_rx_action() that builds rx responses has direct access
> to the skb so there is no need to copy this information into the meta
> structure.
> 
> This patch removes the extraneous fields, saves space in the array and
> removes many lines of code.
> 
> Signed-off-by: Paul Durrant 
> Cc: Ian Campbell 
> Cc: Wei Liu 

Reviewed-by: Wei Liu 


Re: [PATCH net-next 3/8] xen-netback: support multiple extra info segments passed from frontend

2015-10-26 Thread Wei Liu
On Wed, Oct 21, 2015 at 11:36:20AM +0100, Paul Durrant wrote:
> The code does not currently allow a frontend to pass multiple extra info
> segments to the backend in a tx request. A subsequent patch in this series
> needs this functionality so it is added here, without any other
> modification, for better bisectability.
> 
> Signed-off-by: Paul Durrant 
> Cc: Ian Campbell 
> Cc: Wei Liu 

Reviewed-by: Wei Liu 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next] ipv6: recreate ipv6 link-local addresses when increasing MTU over IPV6_MIN_MTU

2015-10-26 Thread Hannes Frederic Sowa
Hi Alex,

On Mon, Oct 26, 2015, at 18:07, Alexander Duyck wrote:
> >> Seems like this code isn't quite correct.  You are calling ipv6_add_dev
> >> for slave devices, and if I understand things correctly I don't believe
> >> that was happening before and may be an unintended side effect.
> > Hmm, could you quickly help me where I get into this situation? I made
> > sure I enter the NETDEV_UP part before the IFF_SLAVE test and
> > disable_ipv6 te
> 
> I think I was getting a bit ahead of myself.  I was looking over the 
> NETDEV_UP code and thinking that we could just fall into that path since 
> it is already calling ipv6_add_dev.  However now I am wondering if maybe 
> we need to look at adding an idev allocation somewhere before the 
> disable_ipv6 check.  I assume that is why you were allocating the idev 
> before you were getting into NETDEV_UP?

The original bug report was:

If user reduces the MTU below IPV6_MIN_MTU we addrconf_ifdown the
interface but don't reinitialize the interface if the MTU is increased
later on.

> >> You might want to instead just make it so that you only do the jump, and
> >> perhaps change the code in the NETDEV_UP/NETDEV_CHANGE section so that
> >> you test for NETDEV_CHANGE instead of NETDEV_UP.  That should be enough
> >> to get the effect you are looking for and I believe there would be no
> >> change to behaviour other than adding IPv6 link-local addresses when the
> >> MTU is increased.
> >>
> >> Give me a bit and I can submit an alternative that may actually work out
> >> a bit better I think.
> > If you go the NETDEV_CHANGE route instead of NETDEV_UP, you end up with
> > the IF_READY flag already set from ipv6_add_dev and thus won't do any
> > initialization of the device.
> 
> What I meant was that you don't need to change the event.  If you change 
> the check inside the NETDEV_UP/CHANGE code path so that it tests for 
> event != NETDEV_CHANGE instead of event == NETDEV_UP you don't need to 
> change the event type.

Yeah, that would be possible, too. I just find an equality test easier to
follow. ;)

> > Sure, I wait.
> 
> Might be a bit longer.  I just realized that I think there is another 
> bug here where you are going through the NETDEV_UP path even though the 
> interface isn't up.  I'll run through some testing this morning to work 
> out the kinks.

Ok, cool. I'll have another look at it, too.

Bye,
Hannes


Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy

2015-10-26 Thread Johannes Weiner
On Thu, Oct 22, 2015 at 09:45:10PM +0300, Vladimir Davydov wrote:
> Hi Johannes,
> 
> On Thu, Oct 22, 2015 at 12:21:28AM -0400, Johannes Weiner wrote:
> ...
> > Patch #5 adds accounting and tracking of socket memory to the unified
> > hierarchy memory controller, as described above. It uses the existing
> > per-cpu charge caches and triggers high limit reclaim asynchronously.
> > 
> > Patch #8 uses the vmpressure extension to equalize pressure between
> > the pages tracked natively by the VM and socket buffer pages. As the
> > pool is shared, it makes sense that while natively tracked pages are
> > under duress the network transmit windows are also not increased.
> 
> First of all, I've no experience in networking, so I'm likely to be
> mistaken. Nevertheless I beg to disagree that this patch set is a step
> in the right direction. Here goes why.
> 
> I admit that your idea to get rid of explicit tcp window control knobs
> and size it dynamically basing on memory pressure instead does sound
> tempting, but I don't think it'd always work. The problem is that in
> contrast to, say, dcache, we can't shrink tcp buffers AFAIU, we can only
> stop growing them. Now suppose a system hasn't experienced memory
> pressure for a while. If we don't have explicit tcp window limit, tcp
> buffers on such a system might have eaten almost all available memory
> (because of network load/problems). If a user workload that needs a
> significant amount of memory is started suddenly then, the network code
> will receive a notification and surely stop growing buffers, but all
> those buffers accumulated won't disappear instantly. As a result, the
> workload might be unable to find enough free memory and have no choice
> but invoke OOM killer. This looks unexpected from the user POV.

I'm not getting rid of those knobs, I'm just reusing the old socket
accounting infrastructure in an attempt to make the memory accounting
feature useful to more people in cgroups v2 (unified hierarchy).

We can always come back to think about per-cgroup tcp window limits in
the unified hierarchy, my patches don't get in the way of this. I'm
not removing the knobs in cgroups v1 and I'm not preventing them in v2.

But regardless of tcp window control, we need to account socket memory
in the main memory accounting pool where pressure is shared (to the
best of our abilities) between all accounted memory consumers.

From an interface standpoint alone, I don't think it's reasonable to
ask users per default to limit different consumers on a case by case
basis. I certainly have no problem with finetuning for scenarios you
describe above, but with memory.current, memory.high, memory.max we
are providing a generic interface to account and contain memory
consumption of workloads. This has to include all major memory
consumers to make semantical sense.

But also, there are people right now for whom the socket buffers cause
system OOM, but for whom the existing memcg hard tcp window limit
absolutely wrecks network performance. It's not usable
the way it is. It'd be much better to have the socket buffers exert
pressure on the shared pool, and then propagate the overall pressure
back to individual consumers with reclaim, shrinkers, vmpressure etc.


Re: [PATCH net-next] ipv6: recreate ipv6 link-local addresses when increasing MTU over IPV6_MIN_MTU

2015-10-26 Thread Hannes Frederic Sowa
Hello Alex,

On Mon, Oct 26, 2015, at 16:52, Alexander Duyck wrote:
> Seems like this code isn't quite correct.  You are calling ipv6_add_dev 
> for slave devices, and if I understand things correctly I don't believe 
> that was happening before and may be an unintended side effect.

Ah, btw., autoconf and ipv6 operation on IFF_SLAVE devices is actually
desired nowadays and I don't think we can change this. See also:


Bye,
Hannes


[PATCH net] RDS-TCP: Recover correctly from pskb_pull()/pksb_trim() failure in rds_tcp_data_recv

2015-10-26 Thread Sowmini Varadhan

Either of pskb_pull() or pskb_trim() may fail under low memory conditions.
If rds_tcp_data_recv() ignores such failures, the application will
receive corrupted data because the skb has not been correctly
carved to the RDS datagram size.

Avoid this by handling pskb_pull/pskb_trim failure in the same
manner as the skb_clone failure: bail out of rds_tcp_data_recv(), and
retry via the deferred call to rds_send_worker() that gets set up on
ENOMEM from rds_tcp_read_sock()

Signed-off-by: Sowmini Varadhan 
---
 net/rds/tcp_recv.c |   11 +--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/net/rds/tcp_recv.c b/net/rds/tcp_recv.c
index fbc5ef8..27a9921 100644
--- a/net/rds/tcp_recv.c
+++ b/net/rds/tcp_recv.c
@@ -214,8 +214,15 @@ static int rds_tcp_data_recv(read_descriptor_t *desc, 
struct sk_buff *skb,
}
 
to_copy = min(tc->t_tinc_data_rem, left);
-   pskb_pull(clone, offset);
-   pskb_trim(clone, to_copy);
+   if (!pskb_pull(clone, offset) ||
+   pskb_trim(clone, to_copy)) {
+   pr_warn("rds_tcp_data_recv: pull/trim failed "
+   "left %zu data_rem %zu skb_len %d\n",
+   left, tc->t_tinc_data_rem, skb->len);
+   kfree_skb(clone);
+   desc->error = -ENOMEM;
+   goto out;
+   }
skb_queue_tail(&tinc->ti_skb_list, clone);
 
rdsdebug("skb %p data %p len %d off %u to_copy %zu -> "
-- 
1.7.1
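
The check in the hunk above combines two different return conventions: pskb_pull() returns a pointer (NULL on failure), while pskb_trim() returns 0 on success and a negative errno otherwise. The userspace model below sketches only that pattern. `buf_model`, `buf_pull`, `buf_trim` and `carve` are hypothetical stand-ins with failure flags for testing, not the kernel helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical buffer model; the fail_* flags simulate allocation failure. */
struct buf_model {
    unsigned char *data;
    size_t len;
    int fail_pull;
    int fail_trim;
};

/* Pointer convention: new data pointer on success, NULL on failure. */
static unsigned char *buf_pull(struct buf_model *b, size_t n)
{
    if (b->fail_pull || n > b->len)
        return NULL;
    b->data += n;
    b->len -= n;
    return b->data;
}

/* Errno convention: 0 on success, negative errno on failure. */
static int buf_trim(struct buf_model *b, size_t n)
{
    if (b->fail_trim)
        return -ENOMEM;
    if (n < b->len)
        b->len = n;
    return 0;
}

/*
 * Carve the bytes [offset, offset + to_copy) out of b. Either step may
 * fail, and each signals failure differently, hence the mixed test.
 * On failure the buffer may be partially modified and must be discarded
 * by the caller, which is what the patch does with kfree_skb(clone).
 */
int carve(struct buf_model *b, size_t offset, size_t to_copy)
{
    assert(b != NULL);

    if (!buf_pull(b, offset) || buf_trim(b, to_copy))
        return -ENOMEM;
    return 0;
}
```

The point of the model is that ignoring either return value would hand a wrongly sized buffer to the next stage, which is the corruption the changelog describes.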



Re: [PATCH v2] net/mlx4: Memcpy at slave_event should copy sizeof mlx4_eqe

2015-10-26 Thread Or Gerlitz
On Mon, Oct 26, 2015 at 5:15 PM,   wrote:
> From: Carol L Soto 
>
> If the caps.eqe_size is bigger than the struct mlx4_eqe then there
> is a potential for corrupting data at the master context. We can see
> the message "Master failed to generate an EQE for slave: X" when the
> event_eqe array wraps and we can see potential oops at the function
> mlx4_GEN_EQE. Also correct a memset of cmd_eqe to use the sizeof
> mlx4_eqe instead of eqe_size.
>
> Fixes: 08ff32352d6f ('mlx4: 64-byte CQE/EQE support')
> Signed-off-by: Carol L Soto 

Thanks Carol, I'd like to review this a bit more tomorrow and will
send it with another fix/es to net
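
For reference, the size mismatch described in the quoted changelog can be illustrated with a small userspace model. This is not the mlx4 code: `event_model`, `ring_model`, `ring_push` and the 32-byte software entry size are assumptions made for the sketch. The idea is that the copy length has to come from the destination type, not from the larger hardware entry size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RING_SLOTS 4

/* Assumed software-side entry: smaller than the hardware entry. */
struct event_model {
    unsigned char raw[32];
};

struct ring_model {
    struct event_model slot[RING_SLOTS];
    unsigned int prod; /* producer index, wraps modulo RING_SLOTS */
};

/*
 * Queue one hardware entry. The hardware entry may be larger than a
 * software slot (for example 64 bytes), so the copy is bounded by
 * sizeof(*dst). Copying hw_entry_size bytes instead would overwrite the
 * following slot, or memory after the ring when the index wraps.
 */
void ring_push(struct ring_model *ring, const void *hw_entry,
               size_t hw_entry_size)
{
    struct event_model *dst = &ring->slot[ring->prod % RING_SLOTS];

    assert(ring != NULL && hw_entry != NULL);
    assert(hw_entry_size >= sizeof(*dst));

    memcpy(dst, hw_entry, sizeof(*dst));
    ring->prod++;
}
```

With a 64-byte source entry, only the first slot changes after one push and its neighbour stays untouched.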


Re: [PATCH net-next 4/8] xen-netback: accept an L4 or L3 skb hash value from the frontend

2015-10-26 Thread Wei Liu
On Wed, Oct 21, 2015 at 11:36:21AM +0100, Paul Durrant wrote:
> This patch adds an indication that netback is capable of handling hash
> values passed from the frontend (see netif.h for details), and the code
> necessary to process the additional xen_netif_extra_info segment and
> set a hash on the skb.
> 
> Signed-off-by: Paul Durrant 
> Cc: Ian Campbell 
> Cc: Wei Liu 

Reviewed-by: Wei Liu 

[...]
>  
> + /* We support hash values. */
> + err = xenbus_printf(xbt, dev->nodename,
> + "feature-hash", "%d", 1);
> + if (err) {
> + message = "writing feature-hash";
> + goto abort_transaction;

Feel free to retain my reviewed-by if this changes in next version.

Wei.

> + }
> +
>   err = xenbus_transaction_end(xbt, 0);
>   } while (err == -EAGAIN);
>  
> -- 
> 2.1.4


Re: [PATCH net-next 7/8] xen-netback: add support for a multi-queue hash mapping table

2015-10-26 Thread Wei Liu
On Wed, Oct 21, 2015 at 11:36:24AM +0100, Paul Durrant wrote:
> Advertise the capability to handle a hash mapping specified by the
> frontend (see netif.h for details).
> 
> Add an ndo_select() entry point so that, of the frontend does specify a

"if the frontend ..."

> hash mapping, the skb hash is extracted and mapped to a queue. If no
> mapping is specified then the fallback queue selection function is
> called so there is no change in behaviour.
> 
> Signed-off-by: Paul Durrant 
[...]
> +static void xen_hash_mapping_changed(struct xenbus_watch *watch,
> +  const char **vec, unsigned int len)
> +{
> + struct xenvif *vif = container_of(watch, struct xenvif,
> +   hash_mapping_watch);
> +
> + xen_net_read_multi_queue_hash_mapping(vif);

Is it safe / correct to not stop the vif before changing mapping table?

Wei.


[PATCH] forcedeth: fix unilateral interrupt disabling in netpoll path

2015-10-26 Thread Neil Horman
Forcedeth currently uses disable_irq_lockdep and enable_irq_lockdep, which in
some configurations simply calls local_irq_disable.  This causes errant warnings
in the netpoll path as in netpoll_send_skb_on_dev, where we disable irqs using
local_irq_save, leading to the following warning:

WARNING: at net/core/netpoll.c:352 netpoll_send_skb_on_dev+0x243/0x250() (Not
tainted)
Hardware name:
netpoll_send_skb_on_dev(): eth0 enabled interrupts in poll
(nv_start_xmit_optimized+0x0/0x860 [forcedeth])
Modules linked in: netconsole(+) configfs ipv6 iptable_filter ip_tables ppdev
parport_pc parport sg microcode serio_raw edac_core edac_mce_amd k8temp
snd_hda_codec_realtek snd_hda_codec_generic forcedeth snd_hda_intel
snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore
snd_page_alloc i2c_nforce2 i2c_core shpchp ext4 jbd2 mbcache sr_mod cdrom sd_mod
crc_t10dif pata_amd ata_generic pata_acpi sata_nv dm_mirror dm_region_hash
dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 1940, comm: modprobe Not tainted 2.6.32-573.7.1.el6.x86_64.debug #1
Call Trace:
 [] ? warn_slowpath_common+0x91/0xe0
 [] ? warn_slowpath_fmt+0x46/0x60
 [] ? nv_start_xmit_optimized+0x0/0x860 [forcedeth]
 [] ? netpoll_send_skb_on_dev+0x243/0x250
 [] ? netpoll_send_udp+0x229/0x270
 [] ? write_msg+0x39/0x110 [netconsole]
 [] ? write_msg+0xbb/0x110 [netconsole]
 [] ? __call_console_drivers+0x75/0x90
 [] ? _call_console_drivers+0x4a/0x80
 [] ? release_console_sem+0xe5/0x250
 [] ? register_console+0x190/0x3e0
 [] ? init_netconsole+0x1a6/0x216 [netconsole]
 [] ? init_netconsole+0x0/0x216 [netconsole]
 [] ? do_one_initcall+0xc0/0x280
 [] ? sys_init_module+0xe3/0x260
 [] ? system_call_fastpath+0x16/0x1b
---[ end trace f349c7af88e6a6d5 ]---
console [netcon0] enabled
netconsole: network logging started

Fix it by modifying the forcedeth code to use
disable_irq_nosync_lockdep_irqsave instead, which saves and restores irq
state properly.  This also saves us a little code in the process.

Tested by the reporter, with successful results.

Patch applies to the head of the net tree

Signed-off-by: Neil Horman 
CC: "David S. Miller" 
Reported-by: Vasily Averin 
---
 drivers/net/ethernet/nvidia/forcedeth.c | 24 +++-
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index a41bb5e..75e88f4 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -4076,6 +4076,8 @@ static void nv_do_nic_poll(unsigned long data)
struct fe_priv *np = netdev_priv(dev);
u8 __iomem *base = get_hwbase(dev);
u32 mask = 0;
+   unsigned long flags;
+   unsigned int irq = 0;
 
/*
 * First disable irq(s) and then
@@ -4085,25 +4087,27 @@ static void nv_do_nic_poll(unsigned long data)
 
if (!using_multi_irqs(dev)) {
if (np->msi_flags & NV_MSI_X_ENABLED)
-   
disable_irq_lockdep(np->msi_x_entry[NV_MSI_X_VECTOR_ALL].vector);
+   irq = np->msi_x_entry[NV_MSI_X_VECTOR_ALL].vector;
else
-   disable_irq_lockdep(np->pci_dev->irq);
+   irq = np->pci_dev->irq;
mask = np->irqmask;
} else {
if (np->nic_poll_irq & NVREG_IRQ_RX_ALL) {
-   
disable_irq_lockdep(np->msi_x_entry[NV_MSI_X_VECTOR_RX].vector);
+   irq = np->msi_x_entry[NV_MSI_X_VECTOR_RX].vector;
mask |= NVREG_IRQ_RX_ALL;
}
if (np->nic_poll_irq & NVREG_IRQ_TX_ALL) {
-   
disable_irq_lockdep(np->msi_x_entry[NV_MSI_X_VECTOR_TX].vector);
+   irq = np->msi_x_entry[NV_MSI_X_VECTOR_TX].vector;
mask |= NVREG_IRQ_TX_ALL;
}
if (np->nic_poll_irq & NVREG_IRQ_OTHER) {
-   
disable_irq_lockdep(np->msi_x_entry[NV_MSI_X_VECTOR_OTHER].vector);
+   irq = np->msi_x_entry[NV_MSI_X_VECTOR_OTHER].vector;
mask |= NVREG_IRQ_OTHER;
}
}
-   /* disable_irq() contains synchronize_irq, thus no irq handler can run 
now */
+
+   disable_irq_nosync_lockdep_irqsave(irq, &flags);
+   synchronize_irq(irq);
 
if (np->recover_error) {
np->recover_error = 0;
@@ -4156,28 +4160,22 @@ static void nv_do_nic_poll(unsigned long data)
nv_nic_irq_optimized(0, dev);
else
nv_nic_irq(0, dev);
-   if (np->msi_flags & NV_MSI_X_ENABLED)
-   
enable_irq_lockdep(np->msi_x_entry[NV_MSI_X_VECTOR_ALL].vector);
-   else
-   enable_irq_lockdep(np->pci_dev->irq);
} else {
if 

Re: [PATCH net-next] ipv6: recreate ipv6 link-local addresses when increasing MTU over IPV6_MIN_MTU

2015-10-26 Thread Alexander Duyck

On 10/26/2015 09:05 AM, Hannes Frederic Sowa wrote:

Hi Alex,

On Mon, Oct 26, 2015, at 16:52, Alexander Duyck wrote:

On 10/26/2015 07:36 AM, Hannes Frederic Sowa wrote:

Take into consideration that the interface might be disabled for IPv6,
thus switch event type.

Signed-off-by: Hannes Frederic Sowa 
---
   net/ipv6/addrconf.c | 7 +--
   1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index d0c685c..c2dcebe 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3149,6 +3149,7 @@ static int addrconf_notify(struct notifier_block *this, 
unsigned long event,
   
   	case NETDEV_UP:

case NETDEV_CHANGE:
+netdev_change:
if (dev->flags & IFF_SLAVE)
break;
   
@@ -3244,8 +3245,10 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
   
   		if (!idev && dev->mtu >= IPV6_MIN_MTU) {

idev = ipv6_add_dev(dev);
-   if (!IS_ERR(idev))
-   break;
+   if (!IS_ERR(idev)) {
+   event = NETDEV_UP;
+   goto netdev_change;
+   }
}
   
   		/*

Seems like this code isn't quite correct.  You are calling ipv6_add_dev
for slave devices, and if I understand things correctly I don't believe
that was happening before and may be an unintended side effect.

Hmm, could you quickly help me where I get into this situation? I made
sure I enter the NETDEV_UP part before the IFF_SLAVE test and
disable_ipv6 te


I think I was getting a bit ahead of myself.  I was looking over the 
NETDEV_UP code and thinking that we could just fall into that path since 
it is already calling ipv6_add_dev.  However now I am wondering if maybe 
we need to look at adding an idev allocation somewhere before the 
disable_ipv6 check.  I assume that is why you were allocating the idev 
before you were getting into NETDEV_UP?



>> You might want to instead just make it so that you only do the jump, and
>> perhaps change the code in the NETDEV_UP/NETDEV_CHANGE section so that
>> you test for NETDEV_CHANGE instead of NETDEV_UP.  That should be enough
>> to get the effect you are looking for and I believe there would be no
>> change to behaviour other than adding IPv6 link-local addresses when the
>> MTU is increased.
>>
>> Give me a bit and I can submit an alternative that may actually work out
>> a bit better I think.
>
> If you go the NETDEV_CHANGE route instead of NETDEV_UP, you end up with
> the IF_READY flag already set from ipv6_add_dev and thus won't do any
> initialization of the device.


What I meant was that you don't need to change the event.  If you change 
the check inside the NETDEV_UP/CHANGE code path so that it tests for 
event != NETDEV_CHANGE instead of event == NETDEV_UP you don't need to 
change the event type.
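
A minimal sketch of the suggestion above, not the kernel code itself. The
demo_* names are hypothetical; the only thing modelled is which branch of the
NETDEV_UP/NETDEV_CHANGE block an event would take under the old and the
proposed test.

```c
/* Hypothetical stand-ins for the notifier events discussed in this thread. */
enum demo_event {
	DEMO_NETDEV_UP,
	DEMO_NETDEV_CHANGE,
	DEMO_NETDEV_CHANGEMTU,
};

/* The two branches of the NETDEV_UP/NETDEV_CHANGE block. */
enum demo_branch {
	DEMO_BRANCH_BRING_UP,		/* sets IF_READY, initializes the device */
	DEMO_BRANCH_LINK_CHANGE,	/* handles carrier/qdisc changes only */
};

/* Old test: only NETDEV_UP takes the bring-up branch. */
static enum demo_branch demo_branch_old(enum demo_event event)
{
	return event == DEMO_NETDEV_UP ? DEMO_BRANCH_BRING_UP
				       : DEMO_BRANCH_LINK_CHANGE;
}

/* Proposed test: everything except NETDEV_CHANGE takes the bring-up branch. */
static enum demo_branch demo_branch_new(enum demo_event event)
{
	return event != DEMO_NETDEV_CHANGE ? DEMO_BRANCH_BRING_UP
					   : DEMO_BRANCH_LINK_CHANGE;
}
```

With the second form, an MTU change that jumps into this block reaches the
bring-up branch without the event variable being rewritten to NETDEV_UP.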



> Sure, I'll wait.


Might be a bit longer.  I just realized that I think there is another 
bug here where you are going through the NETDEV_UP path even though the 
interface isn't up.  I'll run through some testing this morning to work 
out the kinks.


- Alex


Re: [PATCH v2] net/mlx4: Memcpy at slave_event should copy sizeof mlx4_eqe

2015-10-26 Thread Carol Soto



On 10/26/2015 12:02 PM, Or Gerlitz wrote:

> On Mon, Oct 26, 2015 at 5:15 PM,   wrote:
>> From: Carol L Soto 
>>
>> If the caps.eqe_size is bigger than the struct mlx4_eqe then there
>> is a potential for corrupting data at the master context. We can see
>> the message "Master failed to generate an EQE for slave: X" when the
>> event_eqe array wraps and we can see potential oops at the function
>> mlx4_GEN_EQE. Also correct a memset of cmd_eqe to use the sizeof
>> mlx4_eqe instead of eqe_size.
>>
>> Fixes: 08ff32352d6f ('mlx4: 64-byte CQE/EQE support')
>> Signed-off-by: Carol L Soto 
>
> Thanks Carol, I'd like to review this a bit more tomorrow and will
> send it with another fix/es to net

Sure thanks.
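
For readers following along, the overrun described in the patch text can be
sketched roughly as below. This is an illustration under assumptions, not the
mlx4 code: demo_eqe, demo_ring and demo_queue_event are hypothetical names,
and the 32-byte entry size is chosen only for the example. The point is that
the copy length comes from the size of the software entry, never from a
larger hardware-reported entry size.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_RING_SLOTS 4

/* Hypothetical stand-in for the software event entry (fixed 32 bytes). */
struct demo_eqe {
	unsigned char raw[32];
};

struct demo_ring {
	struct demo_eqe slot[DEMO_RING_SLOTS];
	unsigned int prod;	/* producer index, wraps modulo DEMO_RING_SLOTS */
};

/*
 * Queue one event. hw_eqe_size may be larger than sizeof(struct demo_eqe)
 * (for example 64), so the copy is bounded by the destination size.
 * Copying hw_eqe_size bytes instead would spill into the next slot, or
 * past the end of the array when the producer index wraps.
 * Returns the number of bytes copied.
 */
static size_t demo_queue_event(struct demo_ring *ring, const void *hw_eqe,
			       size_t hw_eqe_size)
{
	struct demo_eqe *dst = &ring->slot[ring->prod % DEMO_RING_SLOTS];
	size_t len = hw_eqe_size < sizeof(*dst) ? hw_eqe_size : sizeof(*dst);

	memcpy(dst, hw_eqe, len);
	ring->prod++;
	return len;
}
```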



[net-next PATCH v2] ipv6: recreate ipv6 link-local addresses when increasing MTU over IPV6_MIN_MTU

2015-10-26 Thread Alexander Duyck
This change makes it so that we reinitialize the interface if the MTU is
increased back above IPV6_MIN_MTU and the interface is up.

Cc: Hannes Frederic Sowa 
Signed-off-by: Alexander Duyck 
---
 net/ipv6/addrconf.c |   46 +++---
 1 file changed, 27 insertions(+), 19 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index d0c685cdc345..d72fa90d6feb 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3147,6 +3147,32 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
 		}
 		break;
 
+	case NETDEV_CHANGEMTU:
+		/* if MTU under IPV6_MIN_MTU stop IPv6 on this interface. */
+		if (dev->mtu < IPV6_MIN_MTU) {
+			addrconf_ifdown(dev, 1);
+			break;
+		}
+
+		if (idev) {
+			rt6_mtu_change(dev, dev->mtu);
+			idev->cnf.mtu6 = dev->mtu;
+			break;
+		}
+
+		/* allocate new idev */
+		idev = ipv6_add_dev(dev);
+		if (IS_ERR(idev))
+			break;
+
+		/* device is still not ready */
+		if (!(idev->if_flags & IF_READY))
+			break;
+
+		run_pending = 1;
+
+		/* fall through */
+
 	case NETDEV_UP:
 	case NETDEV_CHANGE:
 		if (dev->flags & IFF_SLAVE)
@@ -3170,7 +3196,7 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
 				idev->if_flags |= IF_READY;
 				run_pending = 1;
 			}
-		} else {
+		} else if (event == NETDEV_CHANGE) {
 			if (!addrconf_qdisc_ok(dev)) {
 				/* device is still not ready. */
 				break;
@@ -3235,24 +3261,6 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
 		}
 		break;
 
-	case NETDEV_CHANGEMTU:
-		if (idev && dev->mtu >= IPV6_MIN_MTU) {
-			rt6_mtu_change(dev, dev->mtu);
-			idev->cnf.mtu6 = dev->mtu;
-			break;
-		}
-
-		if (!idev && dev->mtu >= IPV6_MIN_MTU) {
-			idev = ipv6_add_dev(dev);
-			if (!IS_ERR(idev))
-				break;
-		}
-
-		/*
-		 * if MTU under IPV6_MIN_MTU.
-		 * Stop IPv6 on this interface.
-		 */
-
 	case NETDEV_DOWN:
 	case NETDEV_UNREGISTER:
 		/*



Re: [PATCH v1 2/3] virtio: vp_find_vqs accept channel setting params

2015-10-26 Thread kbuild test robot
Hi Ravi,

[auto build test ERROR on char-misc/char-misc-next -- if it's inappropriate 
base, please suggest rules for selecting the more suitable base]

url:
https://github.com/0day-ci/linux/commits/Ravi-Kerur/virtio-net-Using-single-MSIX-IRQ-for-TX-RX-Q-pair/20151027-015503
config: x86_64-randconfig-x016-201543 (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

Note: the linux-review/Ravi-Kerur/virtio-net-Using-single-MSIX-IRQ-for-TX-RX-Q-pair/20151027-015503
      HEAD 69781953042f14dfe510f90e63b4366d729daf9e builds fine.
      It only hurts bisectability.

All error/warnings (new ones prefixed by >>):

>> drivers/misc/mic/card/mic_virtio.c:372:14: warning: initialization from 
>> incompatible pointer type [-Wincompatible-pointer-types]
 .find_vqs = mic_find_vqs,
 ^
   drivers/misc/mic/card/mic_virtio.c:372:14: note: (near initialization for 
'mic_vq_config_ops.find_vqs')
--
   drivers/virtio/virtio_pci_modern.c: In function 'vp_modern_find_vqs':
>> drivers/virtio/virtio_pci_modern.c:428:11: error: too many arguments to 
>> function 'vp_find_vqs'
 int rc = vp_find_vqs(vdev, nvqs, vqs, callbacks, names,
  ^
   In file included from drivers/virtio/virtio_pci_modern.c:21:0:
   drivers/virtio/virtio_pci_common.h:139:5: note: declared here
int vp_find_vqs(struct virtio_device *vdev, unsigned nvqs,
^
   drivers/virtio/virtio_pci_modern.c: At top level:
>> drivers/virtio/virtio_pci_modern.c:474:14: warning: initialization from 
>> incompatible pointer type [-Wincompatible-pointer-types]
 .find_vqs = vp_modern_find_vqs,
 ^
   drivers/virtio/virtio_pci_modern.c:474:14: note: (near initialization for 
'virtio_pci_config_nodev_ops.find_vqs')
   drivers/virtio/virtio_pci_modern.c:489:14: warning: initialization from 
incompatible pointer type [-Wincompatible-pointer-types]
 .find_vqs = vp_modern_find_vqs,
 ^
   drivers/virtio/virtio_pci_modern.c:489:14: note: (near initialization for 
'virtio_pci_config_ops.find_vqs')

vim +/vp_find_vqs +428 drivers/virtio/virtio_pci_modern.c

   422  			      unsigned channels[],
   423  			      const char *channel_names[],
   424  			      unsigned nchannels)
   425  {
   426  	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
   427  	struct virtqueue *vq;
 > 428  	int rc = vp_find_vqs(vdev, nvqs, vqs, callbacks, names,
   429  			     NULL, NULL, 0);
   430  
   431  	if (rc)
   432  		return rc;
   433  
   434  	/* Select and activate all queues. Has to be done last: once we do
   435  	 * this, there's no way to go back except reset.
   436  	 */
   437  	list_for_each_entry(vq, &vdev->vqs, list) {
   438  		vp_iowrite16(vq->index, &vp_dev->common->queue_select);
   439  		vp_iowrite16(1, &vp_dev->common->queue_enable);
   440  	}
   441  
   442  	return 0;
   443  }
   444  
   445  static void del_vq(struct virtio_pci_vq_info *info)
   446  {
   447  	struct virtqueue *vq = info->vq;
   448  	struct virtio_pci_device *vp_dev = to_vp_device(vq->vdev);
   449  
   450  	vp_iowrite16(vq->index, &vp_dev->common->queue_select);
   451  
   452  	if (vp_dev->msix_enabled) {
   453  		vp_iowrite16(VIRTIO_MSI_NO_VECTOR,
   454  			     &vp_dev->common->queue_msix_vector);
   455  		/* Flush the write out to device */
   456  		vp_ioread16(&vp_dev->common->queue_msix_vector);
   457  	}
   458  
   459  	if (!vp_dev->notify_base)
   460  		pci_iounmap(vp_dev->pci_dev, (void __force __iomem *)vq->priv);
   461  
   462  	vring_del_virtqueue(vq);
   463  
   464  	free_pages_exact(info->queue, vring_pci_size(info->num));
   465  }
   466  
   467  static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
   468  	.get		= NULL,
   469  	.set		= NULL,
   470  	.generation	= vp_generation,
   471  	.get_status	= vp_get_status,
   472  	.set_status	= vp_set_status,
   473  	.reset		= vp_reset,
 > 474  	.find_vqs	= vp_modern_find_vqs,
   475  	.del_vqs	= vp_del_vqs,
   476  	.get_features	= vp_get_features,
   477  	.finalize_features = vp_finalize_features,

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


Re: [PATCH v1 2/3] virtio: vp_find_vqs accept channel setting params

2015-10-26 Thread kbuild test robot
Hi Ravi,

[auto build test WARNING on char-misc/char-misc-next -- if it's inappropriate 
base, please suggest rules for selecting the more suitable base]

url:
https://github.com/0day-ci/linux/commits/Ravi-Kerur/virtio-net-Using-single-MSIX-IRQ-for-TX-RX-Q-pair/20151027-015503
config: x86_64-randconfig-x015-201543 (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All warnings (new ones prefixed by >>):

>> drivers/virtio/virtio_mmio.c:524:14: warning: initialization from 
>> incompatible pointer type [-Wincompatible-pointer-types]
 .find_vqs = vm_find_vqs,
 ^
   drivers/virtio/virtio_mmio.c:524:14: note: (near initialization for 
'virtio_mmio_config_ops.find_vqs')

vim +524 drivers/virtio/virtio_mmio.c

edfd52e6 Pawel Moll         2011-10-24  508  }
edfd52e6 Pawel Moll         2011-10-24  509  
66846048 Rick Jones         2011-11-14  510  static const char *vm_bus_name(struct virtio_device *vdev)
66846048 Rick Jones         2011-11-14  511  {
66846048 Rick Jones         2011-11-14  512  	struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
edfd52e6 Pawel Moll         2011-10-24  513  
66846048 Rick Jones         2011-11-14  514  	return vm_dev->pdev->name;
66846048 Rick Jones         2011-11-14  515  }
edfd52e6 Pawel Moll         2011-10-24  516  
93503932 Stephen Hemminger  2013-02-10  517  static const struct virtio_config_ops virtio_mmio_config_ops = {
edfd52e6 Pawel Moll         2011-10-24  518  	.get		= vm_get,
edfd52e6 Pawel Moll         2011-10-24  519  	.set		= vm_set,
87e7bf14 Michael S. Tsirkin 2015-03-12  520  	.generation	= vm_generation,
edfd52e6 Pawel Moll         2011-10-24  521  	.get_status	= vm_get_status,
edfd52e6 Pawel Moll         2011-10-24  522  	.set_status	= vm_set_status,
edfd52e6 Pawel Moll         2011-10-24  523  	.reset		= vm_reset,
edfd52e6 Pawel Moll         2011-10-24 @524  	.find_vqs	= vm_find_vqs,
edfd52e6 Pawel Moll         2011-10-24  525  	.del_vqs	= vm_del_vqs,
edfd52e6 Pawel Moll         2011-10-24  526  	.get_features	= vm_get_features,
edfd52e6 Pawel Moll         2011-10-24  527  	.finalize_features = vm_finalize_features,
66846048 Rick Jones         2011-11-14  528  	.bus_name	= vm_bus_name,
edfd52e6 Pawel Moll         2011-10-24  529  };
edfd52e6 Pawel Moll         2011-10-24  530  
edfd52e6 Pawel Moll         2011-10-24  531  
edfd52e6 Pawel Moll         2011-10-24  532  

:: The code at line 524 was first introduced by commit
:: edfd52e6367270c90f3fd7cc302b375ffa89f91e virtio: Add platform bus driver 
for memory mapped virtio device

:: TO: Pawel Moll 
:: CC: Rusty Russell 

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


Re: [PATCH 1/2] iwlwifi: pcie: allow to build an A-MSDU using TSO core

2015-10-26 Thread Emmanuel Grumbach
Hi Eric,

>
> When the op_mode sends an skb whose payload is bigger than
> MSS, PCIe will create an A-MSDU out of it. PCIe assumes
> that the skb that is coming from the op_mode can fit in one
> A-MSDU. It is the op_mode's responsibility to make sure
> that this guarantee holds.
>
> Additional headers need to be built for the subframes.
> The TSO core code takes care of the IP / TCP headers and
> the driver takes care of the 802.11 subframe headers.
>
> These headers are stored on a per-cpu page that is re-used
> for all the packets handled on that same CPU. Each skb
> holds a reference to that page and releases the page when
> it is reclaimed. When the page gets full, it is released
> and a new one is allocated.
>
> Since any SKB that doesn't go through the fast-xmit path
> of mac80211 will be segmented, we can assume here that the
> packet is not WEP / TKIP and has a proper SNAP header.
>
> Signed-off-by: Emmanuel Grumbach 

Assuming your review queue works as a FIFO and you reviewed the TSO
helper patch, I can assume you ACK this one? :)
Or at least, don't NACK it :)
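
The header-page scheme in the quoted commit message can be sketched roughly
as follows. This is a single-threaded illustration under assumptions, not the
iwlwifi implementation: the demo_* names are hypothetical, malloc/free stand
in for page allocation, and the per-CPU aspect is left out. The allocator
holds one reference on the current page, each packet takes one more, and a
full page is dropped by the allocator and replaced.

```c
#include <stddef.h>
#include <stdlib.h>

#define DEMO_PAGE_SIZE 4096

/* Hypothetical header page: reference counted, filled front to back. */
struct demo_hdr_page {
	unsigned int refs;	/* one for the allocator plus one per packet */
	size_t used;		/* bytes already handed out */
	unsigned char data[DEMO_PAGE_SIZE];
};

/* Drop one reference; the page is freed when the last holder lets go. */
static void demo_page_put(struct demo_hdr_page *page)
{
	if (page && --page->refs == 0)
		free(page);
}

/*
 * Reserve len bytes of header room for one packet. *cur is the page the
 * allocator currently fills (NULL if none). On success the packet's
 * reference is returned through *owner and must later be released with
 * demo_page_put().
 */
static unsigned char *demo_hdr_alloc(struct demo_hdr_page **cur, size_t len,
				     struct demo_hdr_page **owner)
{
	struct demo_hdr_page *page = *cur;

	if (len > DEMO_PAGE_SIZE)
		return NULL;

	/* Page is full: the allocator releases it and starts a new one. */
	if (page && page->used + len > DEMO_PAGE_SIZE) {
		*cur = NULL;
		demo_page_put(page);
		page = NULL;
	}

	if (!page) {
		page = calloc(1, sizeof(*page));
		if (!page)
			return NULL;
		page->refs = 1;	/* the allocator's own reference */
		*cur = page;
	}

	page->refs++;		/* the packet's reference */
	*owner = page;
	page->used += len;
	return page->data + page->used - len;
}
```

A page released by the allocator stays valid until the last packet that
points into it has been reclaimed.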

