Re: [PATCH net] sctp: partial chunk should be drop without sending abort packet

2015-08-27 Thread lucien xin

 Actually, silently dropping this is _very_ bad.  The reason is that you've
 already processed the leading chunks and may have potentially queued a
 response...  Now, you reach the end of the packet and find that the last
 chunk is partial.  You end up dropping the packet, but still handing the
 responses.  This actually led to some very interesting issues we were seeing.

 It is better to terminate the association in this case.

 -vlad


Makes sense. I just cannot be sure we are following the RFC here; after all,
it does not clearly say that we should send an ABORT in this case.
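The concern above (leading chunks already acted on before the partial trailing chunk is noticed) can be illustrated with a validation pre-pass. This is a minimal user-space sketch, not the kernel's receive path: the names `walk_chunks` and `chunk_walk_result` are made up for illustration, and only the chunk layout (type, flags, 16-bit length that includes the 4-byte header, 4-byte padding) is taken from RFC 4960.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Result of validating the chunk layout of one received packet. */
enum chunk_walk_result {
	CHUNKS_OK,	/* every chunk fits inside the packet */
	CHUNKS_PARTIAL	/* some chunk header or body is cut short */
};

/*
 * Walk the chunk area of a packet (everything after the common header).
 * Each chunk starts with type (1 byte), flags (1 byte) and a 16-bit
 * big-endian length that covers the 4-byte chunk header itself; chunks
 * are padded to a multiple of 4 bytes.
 */
static enum chunk_walk_result walk_chunks(const uint8_t *buf, size_t len,
					  size_t *complete_chunks)
{
	size_t off = 0;

	assert(buf != NULL && complete_chunks != NULL);
	*complete_chunks = 0;

	while (off < len) {
		size_t left = len - off;
		size_t chunk_len, padded;

		if (left < 4)
			return CHUNKS_PARTIAL;	/* truncated chunk header */

		chunk_len = ((size_t)buf[off + 2] << 8) | buf[off + 3];
		if (chunk_len < 4 || chunk_len > left)
			return CHUNKS_PARTIAL;	/* bogus or truncated body */

		(*complete_chunks)++;
		padded = (chunk_len + 3) & ~(size_t)3;
		/* Padding of the final chunk may be absent; stop cleanly. */
		off += padded < left ? padded : left;
	}
	return CHUNKS_OK;
}
```

A caller that runs this before dispatching any chunk can pick one policy for the whole packet (drop it, or abort the association) without having queued responses for the leading chunks first.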
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH net-next 0/2] Add new switchdev device class

2015-08-27 Thread John W. Linville
On Thu, Aug 27, 2015 at 12:16:44AM -0700, sfel...@gmail.com wrote:
 From: Scott Feldman sfel...@gmail.com
 
 In the switchdev model, we use netdevs to represent switchdev ports, but we
 have no representation for the switch itself.  So, introduce a new switchdev
 device class so we can define semantics and programming interfaces for the
 switch itself.  Switchdev device class isn't tied to any particular bus.
 
 This patch set is just the skeleton to get us started.  It adds the sysfs
 object registration for the new class and defines a class-level attr foo.
 With the new class, we could hook PM functions, for example, to handle power
 transitions at the switch level.  I registered rocker and get:
 
$ ls /sys/class/switchdev/525400123501/
foo  power  subsystem  uevent
 
 So what next?  I'd rather not build APIs around sysfs, so we need a netlink
 API we can build on top of this.  It's not really rtnl.  Maybe genl would work?
 Whatever it is, we'd need to teach iproute2 about a new 'switch' command.
 
 Netlink API would allow us to represent switch-wide objects such as registers,
 tables, stats, firmware, and maybe even control.  I think with netlink
 TLVs, we can create a framework for these objects but still allow the switch
 driver to provide switch-specific info.  For example, a table object:
 
 [TABLES]
   [TABLE]
     [FIELDS]
       [FIELD]
         [ID, TYPE]
     [DATA]
       [ID, VALUE]
 
 Maybe iproute2 has pretty-printers for specific switches like ethtool has for
 reg dumps.
 
 I don't know about how this overlaps with DSA platform_class.  Florian?
 
 Comments?

I think this makes a lot of sense, for many of the reasons you cite
later in the thread.  Switches are complex devices with multiple
facets that are difficult to map directly to existing abstractions
without creating artificial adaptations or leaving something out.
Giving the switch itself a representation in the device tree seems
like the right way to go.
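The nested table object proposed in the quoted message can be sketched as length-prefixed TLVs. This is only an illustration in the spirit of netlink nested attributes: it does not use the kernel's or libnl's helpers, the attribute numbers and the `tlv_*` function names are hypothetical, and values are stored in host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical attribute types; the thread does not define any numbering. */
enum { ATTR_TABLES = 1, ATTR_TABLE, ATTR_FIELDS, ATTR_FIELD, ATTR_ID, ATTR_TYPE };

struct tlv_buf {
	uint8_t data[256];
	size_t used;
};

/* Open an attribute: a 4-byte header (u16 length, u16 type). The length
 * covers the header plus everything nested inside and is fixed up later. */
static size_t tlv_begin(struct tlv_buf *b, uint16_t type)
{
	size_t start = b->used;
	uint16_t len = 4;

	assert(b->used + 4 <= sizeof(b->data));
	memcpy(b->data + start, &len, 2);
	memcpy(b->data + start + 2, &type, 2);
	b->used += 4;
	return start;
}

/* Close an attribute by writing its final length into the header. */
static void tlv_end(struct tlv_buf *b, size_t start)
{
	uint16_t len = (uint16_t)(b->used - start);

	memcpy(b->data + start, &len, 2);
}

/* Leaf attribute carrying one 32-bit value (already 4-byte aligned). */
static void tlv_put_u32(struct tlv_buf *b, uint16_t type, uint32_t value)
{
	size_t start = tlv_begin(b, type);

	assert(b->used + 4 <= sizeof(b->data));
	memcpy(b->data + b->used, &value, 4);
	b->used += 4;
	tlv_end(b, start);
}

/* Read back the length field of the attribute that starts at 'off'. */
static uint16_t tlv_len_at(const struct tlv_buf *b, size_t off)
{
	uint16_t len;

	memcpy(&len, b->data + off, 2);
	return len;
}

/* Build [TABLES][TABLE][FIELDS][FIELD][ID, TYPE] for a single field. */
static void build_one_table(struct tlv_buf *b, uint32_t field_id,
			    uint32_t field_type)
{
	size_t tables = tlv_begin(b, ATTR_TABLES);
	size_t table = tlv_begin(b, ATTR_TABLE);
	size_t fields = tlv_begin(b, ATTR_FIELDS);
	size_t field = tlv_begin(b, ATTR_FIELD);

	tlv_put_u32(b, ATTR_ID, field_id);
	tlv_put_u32(b, ATTR_TYPE, field_type);

	tlv_end(b, field);
	tlv_end(b, fields);
	tlv_end(b, table);
	tlv_end(b, tables);
}
```

Because every container carries its own length, a generic tool could skip attribute types it does not understand while a switch-specific pretty-printer decodes the ones it knows.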

John
-- 
John W. Linville        Someday the world will need a hero, and you
linvi...@tuxdriver.com  might be all we have.  Be ready.


Re: [PATCH net v2] sctp: start t5 timer only when peer.rwnd is 0 and local.state is SHUTDOWN_PENDING

2015-08-27 Thread lucien xin

 No, these are 2 distinct instances.  In one instance, the peer is reachable
 and is able to communicate 0 rwnd state to us.  Thus we are being nice and
 granting the peer more time to exit the 0 window state.

 In the other state, the peer is unreachable and we just happen to hit the
 0-window condition based on some estimations of the peer window.  In this
 case, we should be subject to the Max.RTX and terminate the association sooner.

 -vlad

okay, I got you,

we can see that the local side updates peer.rwnd in sctp_packet_append_data() and
sctp_retransmit_mark(), and it does that according to a_rwnd and outstanding, so the
root problem is that it's hard to know whether the peer really closed its window;
maybe so much outstanding data just led to that.

what we can do is to trust that peer.rwnd is the real window of the peer.
from another angle, even though it may not be real, at least we can reduce
*the other state* you mentioned by doing this. especially, if there is only one
small packet that keeps retransmitting in SHUTDOWN_PENDING state,
peer.rwnd is more likely to be the real peer window.

I saw that the bsd code does not care about Max.Retrans in SHUTDOWN_PENDING,
instead it just starts the T5 timer. but now that we chose Max.Retrans + T5, it's
better to handle more of the unreachable cases by using Max.Retrans. I also hope
we can do it better there as Marcelo said, but for now I cannot see how. :)
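The policy being discussed (per the subject line: start T5 only when peer.rwnd is 0 and the local state is SHUTDOWN_PENDING, otherwise stay subject to Max.Retrans) can be written down as a small decision function. This is a sketch of that reading of the thread, not the patch itself; the struct, enum and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum assoc_state { STATE_ESTABLISHED, STATE_SHUTDOWN_PENDING };

enum rtx_action {
	ACTION_KEEP_RETRANSMITTING,	/* error count still below the limit */
	ACTION_START_T5,		/* give a zero-window peer more time */
	ACTION_ABORT_ASSOC		/* treat the peer as unreachable */
};

struct assoc_view {
	enum assoc_state state;
	uint32_t peer_rwnd;		/* sender's view of the peer window */
	unsigned int error_count;	/* consecutive retransmission timeouts */
	unsigned int max_retrans;	/* Max.Retrans threshold */
};

/* Decide what to do when a retransmission timer fires. */
static enum rtx_action on_rtx_timeout(const struct assoc_view *a)
{
	assert(a != NULL);

	if (a->error_count < a->max_retrans)
		return ACTION_KEEP_RETRANSMITTING;

	/* Limit reached: only a closed window during shutdown earns T5. */
	if (a->state == STATE_SHUTDOWN_PENDING && a->peer_rwnd == 0)
		return ACTION_START_T5;

	return ACTION_ABORT_ASSOC;
}
```

The weak point, as the message says, is that `peer_rwnd == 0` here may be a local estimate rather than something the peer actually advertised.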


Re: [PATCH net v2] sctp: start t5 timer only when peer.rwnd is 0 and local.state is SHUTDOWN_PENDING

2015-08-27 Thread Vlad Yasevich
On 08/27/2015 09:19 AM, lucien xin wrote:

 No, these are 2 distinct instances.  In one instance, the peer is reachable
 and is able to communicate 0 rwnd state to us.  Thus we are being nice and
 granting the peer more time to exit the 0 window state.

 In the other state, the peer is unreachable and we just happen to hit the
 0-window condition based on some estimations of the peer window.  In this
 case, we should be subject to the Max.RTX and terminate the association sooner.

 -vlad

 okay, I got you,
 
 we can see that the local side updates peer.rwnd in sctp_packet_append_data() and
 sctp_retransmit_mark(), and it does that according to a_rwnd and outstanding, so the
 root problem is that it's hard to know whether the peer really closed its window;
 maybe so much outstanding data just led to that.
 
 what we can do is to trust that peer.rwnd is the real window of the peer.
 from another angle, even though it may not be real, at least we can reduce
 *the other state* you mentioned by doing this. especially, if there is only one
 small packet that keeps retransmitting in SHUTDOWN_PENDING state,
 peer.rwnd is more likely to be the real peer window.
 
 I saw that the bsd code does not care about Max.Retrans in SHUTDOWN_PENDING,
 instead it just starts the T5 timer. but now that we chose Max.Retrans + T5, it's
 better to handle more of the unreachable cases by using Max.Retrans. I also hope
 we can do it better there as Marcelo said, but for now I cannot see how. :)
 

So one potential way is to have peer.rwnd and peer.a_rwnd, where peer.a_rwnd is
the window advertised by peer and peer.rwnd is our estimation based on
peer.a_rwnd.  This way we will always know where we stand.

Although I am not sure yet if we want to grow the peer structure any more.

Another way is to have an estimate or 0-window probe bit/flag on the send side
and set it when we do 0-window probe.  This way we'd know that when 0-window
probe bit is set, peer returned 0 window.

Just some thoughts.
-vlad
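Both ideas above can be sketched together: keep the advertised window next to the local estimate, plus a single flag that records whether the peer itself reported a zero window. The field and function names below are hypothetical and the arithmetic is simplified; it is not the layout of the kernel's peer structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct peer_window {
	uint32_t a_rwnd;		/* last window the peer advertised */
	uint32_t rwnd;			/* local estimate of what is left */
	bool zero_window_announced;	/* peer itself reported a 0 window */
};

/* A SACK arrived: record the advertised window and rebuild the estimate
 * from it, subtracting data that is still outstanding. */
static void peer_window_on_sack(struct peer_window *w, uint32_t a_rwnd,
				uint32_t outstanding)
{
	assert(w != NULL);
	w->a_rwnd = a_rwnd;
	w->rwnd = a_rwnd > outstanding ? a_rwnd - outstanding : 0;
	w->zero_window_announced = (a_rwnd == 0);
}

/* Data was sent: only the estimate shrinks, the advertised value and
 * the flag keep reflecting what the peer last told us. */
static void peer_window_on_send(struct peer_window *w, uint32_t bytes)
{
	assert(w != NULL);
	w->rwnd = w->rwnd > bytes ? w->rwnd - bytes : 0;
}
```

With this split, an estimate of 0 caused purely by outstanding data leaves `zero_window_announced` false, which is the distinction the T5 decision needs.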


Re: [PATCH net v4] sctp: asconf's process should verify address parameter is in the beginning

2015-08-27 Thread Vlad Yasevich
On 08/27/2015 04:26 AM, Xin Long wrote:
 in sctp_process_asconf(), we get address parameter from the beginning of
 the addip params. but we never check if it's really there. if the addr
 param is not there, it still can pass sctp_verify_asconf(), then to be
 handled by sctp_process_asconf(), it will not be safe.
 
 so add a check in sctp_verify_asconf() to ensure the address parameter is in
 the beginning, or return false to send abort.
 
 note that this can also detect multiple address parameters, and reject it.
 
 Signed-off-by: Xin Long lucien@gmail.com
 Signed-off-by: Marcelo Ricardo Leitner mleit...@redhat.com

Looks good to me.

Acked-by: Vlad Yasevich vyasev...@gmail.com

-vlad

 ---
  net/sctp/sm_make_chunk.c | 7 +++
  1 file changed, 7 insertions(+)
 
 diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
 index 06320c8..a655ddc 100644
 --- a/net/sctp/sm_make_chunk.c
 +++ b/net/sctp/sm_make_chunk.c
 @@ -3132,11 +3132,18 @@ bool sctp_verify_asconf(const struct sctp_association *asoc,
  		case SCTP_PARAM_IPV4_ADDRESS:
  			if (length != sizeof(sctp_ipv4addr_param_t))
  				return false;
 +			/* ensure there is only one addr param and it's in the
 +			 * beginning of addip_hdr params, or we reject it.
 +			 */
 +			if (param.v != addip->addip_hdr.params)
 +				return false;
  			addr_param_seen = true;
  			break;
  		case SCTP_PARAM_IPV6_ADDRESS:
  			if (length != sizeof(sctp_ipv6addr_param_t))
  				return false;
 +			if (param.v != addip->addip_hdr.params)
 +				return false;
  			addr_param_seen = true;
  			break;
   case SCTP_PARAM_ADD_IP:
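The rule the patch enforces (exactly one address parameter, and it must be the first parameter) can be reproduced in a self-contained parser. This is a user-space sketch rather than the kernel function: `verify_params` and `be16` are illustrative names, and only the parameter type codes (5 and 6 from RFC 4960, 0xC001 from RFC 5061) and the fixed address parameter lengths are taken from the RFCs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PARAM_IPV4_ADDRESS	5	/* RFC 4960, total length 8 */
#define PARAM_IPV6_ADDRESS	6	/* RFC 4960, total length 20 */
#define PARAM_ADD_IP		0xC001	/* RFC 5061 */

static uint16_t be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* Return true only if exactly one address parameter is present and it
 * sits at offset 0 of the parameter area. */
static bool verify_params(const uint8_t *params, size_t len)
{
	size_t off = 0;
	bool addr_param_seen = false;

	assert(params != NULL);
	while (off + 4 <= len) {
		uint16_t type = be16(params + off);
		uint16_t plen = be16(params + off + 2);

		if (plen < 4 || plen > len - off)
			return false;

		if (type == PARAM_IPV4_ADDRESS || type == PARAM_IPV6_ADDRESS) {
			if (plen != (type == PARAM_IPV4_ADDRESS ? 8 : 20))
				return false;
			/* Not first, which also catches a second address. */
			if (off != 0)
				return false;
			addr_param_seen = true;
		}
		off += (plen + 3u) & ~3u;	/* parameters are 4-byte padded */
	}
	return addr_param_seen;
}
```

Comparing the offset against 0 plays the role of the `param.v != addip->addip_hdr.params` pointer comparison in the diff: any address parameter found later in the walk is rejected, so duplicates fail as well.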
 



Re: ip_rcv_finish() NULL pointer and possibly related Oopses

2015-08-27 Thread Eric Dumazet
On Wed, 2015-08-26 at 13:54 -0700, Michael Marineau wrote:
 On Wed, Aug 26, 2015 at 4:49 AM, Chuck Ebbert cebbert.l...@gmail.com wrote:
  On Wed, 26 Aug 2015 08:46:59 +
  Shaun Crampton shaun.cramp...@metaswitch.com wrote:
 
  Testing our app at scale on Google's GCE, running ~1000 CoreOS hosts: over
  approximately 1 hour, I see about 1 in 50 hosts hit one of the Oopses
  below and then reboot (I'm not sure if the different oopses are related to
  each other).
 
  The app is Project Calico, which is a datacenter networking fabric.
  calico-felix, the process named below, is our per-host agent.  The
  per-host agent is responsible for reading the network information from a
  central server and applying "ip route" and iptables updates to the
  kernel.  We're running on CoreOS, with about 100 docker containers/veth
  pairs running on each host.  calico-felix is running inside one of those
  containers. We also run the BIRD BGP stack to redistribute routes around
  the datacenter.  The errors happen more frequently while Calico is under
  load.
 
  I'm not sure where to go from here.  I can reproduce these issues easily
  at that scale but I haven't managed to boil it down to a small-scale repro
  scenario for further investigation (yet).
 
 
  What in the world is going on with those call traces? E.g.:
 
  [ 4513.712008]  IRQ
  [ 4513.712008]  [81486751] ? ip_rcv_finish+0x81/0x360
  [ 4513.712008]  [814870e4] ip_rcv+0x2a4/0x400
  [ 4513.712008]  [814866d0] ? inet_del_offload+0x40/0x40
  [ 4513.712008]  [814491b3] __netif_receive_skb_core+0x6c3/0x9a0
  [ 4513.712008]  [8143b667] ? build_skb+0x17/0x90
  [ 4513.712008]  [814494a8] __netif_receive_skb+0x18/0x60
  [ 4513.712008]  [81449523] netif_receive_skb_internal+0x33/0xa0
  [ 4513.712008]  [814495ac] netif_receive_skb_sk+0x1c/0x70
  [ 4513.712008]  [a00f772b] 0xa00f772b
  [ 4513.712008]  [814491b3] ? __netif_receive_skb_core+0x6c3/0x9a0
  [ 4513.712008]  [a00f7d81] 0xa00f7d81
  [ 4513.712008]  [81449979] net_rx_action+0x159/0x340
  [ 4513.712008]  [810715f4] __do_softirq+0xf4/0x290
  [ 4513.712008]  [810719fd] irq_exit+0xad/0xc0
  [ 4513.712008]  [815528ba] do_IRQ+0x5a/0xf0
  [ 4513.712008]  [815507ae] common_interrupt+0x6e/0x6e
  [ 4513.712008]  EOI
 
  There are two functions in the call trace that the kernel knows
  nothing about. How did they get in there?
 
  And there is really executable code in there, as can be seen from a
  later trace:
 
  [ 4123.003006]  IRQ
  [ 4123.003006]  [8147d477] nf_iterate+0x57/0x80
  [ 4123.003006]  [8147d537] nf_hook_slow+0x97/0x100
  [ 4123.003006]  [81486e32] ip_local_deliver+0x92/0xa0
  [ 4123.003006]  [81486a30] ? ip_rcv_finish+0x360/0x360
  [ 4123.003006]  [81486751] ip_rcv_finish+0x81/0x360
  [ 4123.003006]  [814870e4] ip_rcv+0x2a4/0x400
  [ 4123.003006]  [814866d0] ? inet_del_offload+0x40/0x40
  [ 4123.003006]  [814491b3] __netif_receive_skb_core+0x6c3/0x9a0
  [ 4123.003006]  [8143b667] ? build_skb+0x17/0x90
  [ 4123.003006]  [814494a8] __netif_receive_skb+0x18/0x60
  [ 4123.003006]  [81449523] netif_receive_skb_internal+0x33/0xa0
  [ 4123.003006]  [814495ac] netif_receive_skb_sk+0x1c/0x70
  [ 4123.003006]  [a00d472b] 0xa00d472b
  [ 4123.003006]  [a00d4d81] 0xa00d4d81
  [ 4123.003006]  [81449979] net_rx_action+0x159/0x340
  [ 4123.003006]  [810715f4] __do_softirq+0xf4/0x290
  [ 4123.003006]  [810719fd] irq_exit+0xad/0xc0
  [ 4123.003006]  [815528ba] do_IRQ+0x5a/0xf0
  [ 4123.003006]  [815507ae] common_interrupt+0x6e/0x6e
  [ 4123.003006]  EOI
  [ 4123.003006]  [81483a3d] ? __ip_route_output_key+0x31d/0x860
  [ 4123.003006]  [814e2e95] ? xfrm_lookup_route+0x5/0x70
  [ 4123.003006]  [81484224] ? ip_route_output_flow+0x54/0x60
  [ 4123.003006]  [8148ca6a] ip_queue_xmit+0x36a/0x3d0
  [ 4123.003006]  [814a4799] tcp_transmit_skb+0x4b9/0x990
  [ 4123.003006]  [814a4d85] tcp_write_xmit+0x115/0xe90
  [ 4123.003006]  [814a5d72] __tcp_push_pending_frames+0x32/0xd0
  [ 4123.003006]  [8149443f] tcp_push+0xef/0x120
  [ 4123.003006]  [81497cb5] tcp_sendmsg+0xc5/0xb20
  [ 4123.003006]  [810d74c9] ? lock_hrtimer_base.isra.22+0x29/0x50
  [ 4123.003006]  [814c2d04] inet_sendmsg+0x64/0xa0
  [ 4123.003006]  [811e94b5] ? __fget_light+0x25/0x70
  [ 4123.003006]  [8142d74d] sock_sendmsg+0x3d/0x50
  [ 4123.003006]  [8142dc12] SYSC_sendto+0x102/0x1a0
  [ 4123.003006]  [8110f864] ? __audit_syscall_entry+0xb4/0x110
  [ 4123.003006]  [810224fc] ? do_audit_syscall_entry+0x6c/0x70
  [ 4123.003006]  [81023cf3] ? syscall_trace_enter_phase1+0x103/0x160
  [ 4123.003006]  [8142e75e] 

Re: [patch net-next v2 2/3] mlxsw: adjust transmit fail log message level in __mlxsw_emad_transmit

2015-08-27 Thread Joe Perches
On Thu, 2015-08-27 at 17:59 +0200, Jiri Pirko wrote:
 When transmit fails, it is an error, not a warning.
[]
 diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c
[]
 @@ -376,8 +376,8 @@ static int __mlxsw_emad_transmit(struct mlxsw_core *mlxsw_core,
  
  	err = mlxsw_core_skb_transmit(mlxsw_core->driver_priv, skb, tx_info);
  	if (err) {
 -		dev_warn(mlxsw_core->bus_info->dev, "Failed to transmit EMAD (tid=%llx)\n",
 -			 mlxsw_core->emad.tid);
 +		dev_err(mlxsw_core->bus_info->dev, "Failed to transmit EMAD (tid=%llx)\n",
 +			mlxsw_core->emad.tid);

dev_err_ratelimited?
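The point of the suggestion is that a transmit path failing repeatedly would otherwise flood the log. The general idea behind a rate-limited log call, a burst allowance per time window, can be sketched in user space as follows. This is not the kernel's implementation; the struct, its fields and `ratelimit_allow` are illustrative, and time is passed in explicitly so the behaviour is easy to follow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct ratelimit {
	unsigned long interval;	/* window length in ticks */
	unsigned int burst;	/* messages allowed per window */
	unsigned long begin;	/* start of the current window */
	unsigned int printed;	/* messages emitted in this window */
	bool started;		/* false until the first call */
};

/* Return true if a message may be emitted at time 'now'. */
static bool ratelimit_allow(struct ratelimit *rl, unsigned long now)
{
	assert(rl != NULL);

	/* Open a new window on first use or once the old one expired. */
	if (!rl->started || now - rl->begin >= rl->interval) {
		rl->started = true;
		rl->begin = now;
		rl->printed = 0;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}
	return false;	/* suppressed until the window rolls over */
}
```

With `interval = 10` and `burst = 2`, the first two failures in a window are logged and the rest are dropped until ten ticks have passed since the window opened.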




Re: pull-request: wireless-drivers-next 2015-08-26

2015-08-27 Thread David Miller
From: Kalle Valo kv...@codeaurora.org
Date: Wed, 26 Aug 2015 18:13:02 +0300

 here's one more smaller pull request I would like to still get to 4.3.
 Nothing really special except the new firmware API 17 support for
 iwlwifi and qca6164 support for ath10k which would be good to have in
 4.3.
 
 Please let me know if you have any problems.

Pulled, thanks.


Re: [PATCH net v2] sctp: start t5 timer only when peer.rwnd is 0 and local.state is SHUTDOWN_PENDING

2015-08-27 Thread Vlad Yasevich
On 08/27/2015 10:49 AM, lucien xin wrote:

 So one potential way is to have peer.rwnd and peer.a_rwnd, where peer.a_rwnd is
 the window advertised by peer and peer.rwnd is our estimation based on
 peer.a_rwnd.  This way we will always know where we stand.

 Although I am not sure yet if we want to grow the peer structure any more.

 Another way is to have an estimate or 0-window probe bit/flag on the send side
 and set it when we do 0-window probe.  This way we'd know that when 0-window
 probe bit is set, peer returned 0 window.

 I think updating the 0-window state may happen in sctp_process_init() and
 sctp_outq_sack(). I don't think a 0-window can be probed, because an
 unreachable peer and a closed window both produce no reply from the peer.
 But we can update the 0-window bit in those two functions. I just do not
 know where there is an available bit we can use if we won't change the
 peer structure.

You can ignore INIT as the window will never be 0 (not allowed).

The updates could happen at the end of sctp_outq_sack().  There are some spare
bits in peer if you want to go that way.

-vlad


 
 Just some thoughts.
 -vlad



[PATCH] IGMP: Inhibit reports for local multicast groups

2015-08-27 Thread Philip Downey
The range of addresses between 224.0.0.0 and 224.0.0.255 inclusive, is
reserved for the use of routing protocols and other low-level topology
discovery or maintenance protocols, such as gateway discovery and
group membership reporting.  Multicast routers should not forward any
multicast datagram with destination addresses in this range,
regardless of its TTL.

Currently, IGMP reports are generated for this reserved range of
addresses even though a router will ignore this information since it
has no purpose.  However, the presence of reserved group addresses in
an IGMP membership report uses up network bandwidth and can also
obscure addresses of interest when inspecting membership reports using
packet inspection or debug messages.

Although the RFCs for the various versions of IGMP (e.g. RFC 3376 for
v3) do not specify that the reserved addresses be excluded from
membership reports, excluding them should do no harm.  In particular
there should be no adverse effect in any IGMP snooping functionality
since 224.0.0.x is specifically excluded as per RFC 4541 (IGMP and MLD
Snooping Switches Considerations) section 2.1.2. Data Forwarding
Rules:

2) Packets with a destination IP (DIP) address in the 224.0.0.X
   range which are not IGMP must be forwarded on all ports.

IGMP reports for local multicast groups can now be optionally
inhibited by means of a system control variable (by setting the value
to zero) e.g.:
echo 0 > /proc/sys/net/ipv4/igmp_link_local_mcast_reports

To retain backwards compatibility the previous behaviour is retained
by default on system boot or reverted by setting the value back to
non-zero e.g.:
echo 1 > /proc/sys/net/ipv4/igmp_link_local_mcast_reports
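The filtering rule described above boils down to a prefix test on 224.0.0.0/24 combined with the sysctl value. A minimal user-space sketch follows; the function names are illustrative, addresses are in host byte order for simplicity (the kernel works on network-order `__be32` values), and the unconditional skip of 224.0.0.1 mirrors the existing IGMP_ALL_HOSTS checks visible in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ALL_HOSTS_GROUP	0xe0000001u	/* 224.0.0.1 */

/* True for 224.0.0.0 through 224.0.0.255 (host byte order). */
static bool is_local_multicast(uint32_t addr)
{
	return (addr & 0xffffff00u) == 0xe0000000u;
}

/* Decide whether a membership report should mention 'group'.
 * 'llm_reports' stands in for the sysctl: non-zero keeps the old
 * behaviour, zero inhibits reports for the link-local range. */
static bool should_report(uint32_t group, int llm_reports)
{
	if (group == ALL_HOSTS_GROUP)
		return false;	/* never reported, with or without the patch */
	if (is_local_multicast(group) && !llm_reports)
		return false;
	return true;
}
```

For example, 224.0.0.5 is reported when the value is 1 and suppressed when it is 0, while 224.0.1.0 lies outside the reserved range and is reported either way.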

Signed-off-by: Philip Downey pdow...@brocade.com
---
 include/linux/igmp.h   |1 +
 net/ipv4/igmp.c|   26 +-
 net/ipv4/sysctl_net_ipv4.c |7 +++
 3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/linux/igmp.h b/include/linux/igmp.h
index 193ad48..9084292 100644
--- a/include/linux/igmp.h
+++ b/include/linux/igmp.h
@@ -37,6 +37,7 @@ static inline struct igmpv3_query *
return (struct igmpv3_query *)skb_transport_header(skb);
 }
 
+extern int sysctl_igmp_llm_reports;
 extern int sysctl_igmp_max_memberships;
 extern int sysctl_igmp_max_msf;
 extern int sysctl_igmp_qrv;
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 9fdfd9d..d38b8b6 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -110,6 +110,9 @@
 #define IP_MAX_MEMBERSHIPS 20
 #define IP_MAX_MSF 10
 
+/* IGMP reports for link-local multicast groups are enabled by default */
+int sysctl_igmp_llm_reports __read_mostly = 1;
+
 #ifdef CONFIG_IP_MULTICAST
 /* Parameter names and values are taken from igmp-v2-06 draft */
 
@@ -437,6 +440,8 @@ static struct sk_buff *add_grec(struct sk_buff *skb, struct ip_mc_list *pmc,
 
 	if (pmc->multiaddr == IGMP_ALL_HOSTS)
 		return skb;
+	if (ipv4_is_local_multicast(pmc->multiaddr) && !sysctl_igmp_llm_reports)
+		return skb;
 
 	isquery = type == IGMPV3_MODE_IS_INCLUDE ||
 		  type == IGMPV3_MODE_IS_EXCLUDE;
@@ -545,6 +550,9 @@ static int igmpv3_send_report(struct in_device *in_dev, struct ip_mc_list *pmc)
 		for_each_pmc_rcu(in_dev, pmc) {
 			if (pmc->multiaddr == IGMP_ALL_HOSTS)
 				continue;
+			if (ipv4_is_local_multicast(pmc->multiaddr) &&
+			     !sysctl_igmp_llm_reports)
+				continue;
 			spin_lock_bh(&pmc->lock);
 			if (pmc->sfcount[MCAST_EXCLUDE])
 				type = IGMPV3_MODE_IS_EXCLUDE;
@@ -678,7 +686,11 @@ static int igmp_send_report(struct in_device *in_dev, struct ip_mc_list *pmc,
 
 	if (type == IGMPV3_HOST_MEMBERSHIP_REPORT)
 		return igmpv3_send_report(in_dev, pmc);
-	else if (type == IGMP_HOST_LEAVE_MESSAGE)
+
+	if (ipv4_is_local_multicast(group) && !sysctl_igmp_llm_reports)
+		return 0;
+
+	if (type == IGMP_HOST_LEAVE_MESSAGE)
 		dst = IGMP_ALL_ROUTER;
 	else
 		dst = group;
@@ -851,6 +863,8 @@ static bool igmp_heard_report(struct in_device *in_dev, __be32 group)
 
 	if (group == IGMP_ALL_HOSTS)
 		return false;
+	if (ipv4_is_local_multicast(group) && !sysctl_igmp_llm_reports)
+		return false;
 
 	rcu_read_lock();
 	for_each_pmc_rcu(in_dev, im) {
@@ -957,6 +971,9 @@ static bool igmp_heard_query(struct in_device *in_dev, struct sk_buff *skb,
 			continue;
 		if (im->multiaddr == IGMP_ALL_HOSTS)
 			continue;
+		if (ipv4_is_local_multicast(im->multiaddr) &&
+		    !sysctl_igmp_llm_reports)
+			continue;
 		spin_lock_bh(&im->lock);
if 

Re: [PATCH v5 net-next 7/8] geneve: Consolidate Geneve functionality in single module.

2015-08-27 Thread John W. Linville
On Wed, Aug 26, 2015 at 11:46:54PM -0700, Pravin B Shelar wrote:
 geneve_core module handles send and receive functionality.
 This way OVS could use the Geneve API. Now with use of
 tunnel metadata mode OVS can directly use Geneve netdevice.
 So there is no need for separate module for Geneve. Following
 patch consolidates Geneve protocol processing in single module.
 
 Signed-off-by: Pravin B Shelar pshe...@nicira.com

Acked-by: John W. Linville linvi...@tuxdriver.com

-- 
John W. Linville        Someday the world will need a hero, and you
linvi...@tuxdriver.com  might be all we have.  Be ready.


Re: [v2] net: phy: fixed: propagate fixed link values to struct

2015-08-27 Thread David Miller
From: Madalin Bucur madalin.bu...@freescale.com
Date: Wed, 26 Aug 2015 17:58:47 +0300

 The fixed link values parsed from the device tree are stored in
 the struct fixed_phy member status. The struct phy_device members
 speed, duplex were not updated.
 
 Signed-off-by: Madalin Bucur madalin.bu...@freescale.com
 ---
 v2: always setting phy-link, thanks Stas

Applied.


[PATCH net-next 1/1] lan78xx: Change default internal PHY ID to 1

2015-08-27 Thread Woojung.Huh
Change default internal PHY ID to 1.

Signed-off-by: Woojung Huh woojung@microchip.com
---
 drivers/net/usb/lan78xx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index 39364a4..5197a5f 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -57,7 +57,7 @@
 #define DEFAULT_RX_CSUM_ENABLE		(true)
 #define DEFAULT_TSO_CSUM_ENABLE		(true)
 #define DEFAULT_VLAN_FILTER_ENABLE	(true)
-#define INTERNAL_PHY_ID			(2)	/* 2: GMII */
+#define INTERNAL_PHY_ID			(1)
 #define TX_OVERHEAD			(8)
 #define RXW_PADDING			2
 
-- 
2.1.4


[PATCH net-next v2] net: Add ethernet header for pass through VRF device

2015-08-27 Thread David Ahern
The change to use a custom dst broke tcpdump captures on the VRF device:

$ tcpdump -n -i vrf10
...
05:32:29.009362 IP 10.2.1.254 > 10.2.1.2: ICMP echo request, id 21989, seq 1, length 64
05:32:29.009855 00:00:40:01:8d:36 > 45:00:00:54:d6:6f, ethertype Unknown (0x0a02), length 84:
	0x0000:  0102 0a02 01fe 0000 9181 55e5 0001 bd11  ..........U.....
	0x0010:  da55 0000 0000 bb5d 0700 0000 0000 1011  .U.....]........
	0x0020:  1213 1415 1617 1819 1a1b 1c1d 1e1f 2021  ...............!
	0x0030:  2223 2425 2627 2829 2a2b 2c2d 2e2f 3031  "#$%&'()*+,-./01
	0x0040:  3233 3435 3637                           234567

Local packets going through the VRF device are missing an ethernet header.
Fix by adding one and then stripping it off before pushing back to the IP
stack. With this patch you get the expected dumps:

...
05:36:15.713944 IP 10.2.1.254 > 10.2.1.2: ICMP echo request, id 23795, seq 1, length 64
05:36:15.714160 IP 10.2.1.2 > 10.2.1.254: ICMP echo reply, id 23795, seq 1, length 64
...
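The broken capture before the fix is what one would expect when a bare IPv4 header is decoded as an Ethernet header: the first 14 bytes of the IP header get shown as two MAC addresses and an ethertype. The sketch below reassembles the 20-byte IPv4 header from the bytes visible in that capture line (the two "MAC addresses", the "ethertype" 0x0a02 and the first six dumped bytes) and parses it the way a link-layer decoder would. The names `eth_view` and `parse_eth` are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct eth_view {
	uint8_t dst[6];
	uint8_t src[6];
	uint16_t ethertype;
};

/* Interpret the first 14 bytes of a frame as an Ethernet header. */
static struct eth_view parse_eth(const uint8_t *frame, size_t len)
{
	struct eth_view v;
	size_t i;

	assert(frame != NULL && len >= 14);
	for (i = 0; i < 6; i++) {
		v.dst[i] = frame[i];
		v.src[i] = frame[6 + i];
	}
	v.ethertype = (uint16_t)((frame[12] << 8) | frame[13]);
	return v;
}

/* IPv4 header of the echo reply, rebuilt from the capture line. */
static const uint8_t bare_ip_header[20] = {
	0x45, 0x00, 0x00, 0x54, 0xd6, 0x6f,	/* shown as destination MAC */
	0x00, 0x00, 0x40, 0x01, 0x8d, 0x36,	/* shown as source MAC */
	0x0a, 0x02,				/* shown as ethertype 0x0a02 */
	0x01, 0x02, 0x0a, 0x02, 0x01, 0xfe	/* rest of 10.2.1.2, then 10.2.1.254 */
};
```

Feeding `bare_ip_header` to `parse_eth()` yields destination 45:00:00:54:d6:6f, source 00:00:40:01:8d:36 and ethertype 0x0a02, which matches the capture: 0x45 is the IPv4 version/IHL byte, 0x40 0x01 are TTL 64 and protocol ICMP, and 0x0a02 is the start of source address 10.2.1.2.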

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
v2
- per DaveM's comment modelled after ip_finish_output2 which uses
  dst_neigh_output rather than dev_hard_header

 drivers/net/vrf.c | 48 +---
 1 file changed, 45 insertions(+), 3 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index b3d9c5546c79..e7094fbd7568 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -27,6 +27,7 @@
 #include <linux/hashtable.h>
 
 #include <linux/inetdevice.h>
+#include <net/arp.h>
 #include <net/ip.h>
 #include <net/ip_fib.h>
 #include <net/ip6_route.h>
@@ -219,6 +220,9 @@ static netdev_tx_t vrf_process_v4_outbound(struct sk_buff *skb,
 
 static netdev_tx_t is_ip_tx_frame(struct sk_buff *skb, struct net_device *dev)
 {
+	/* strip the ethernet header added for pass through VRF device */
+	__skb_pull(skb, skb_network_offset(skb));
+
 	switch (skb->protocol) {
 	case htons(ETH_P_IP):
 		return vrf_process_v4_outbound(skb, dev);
@@ -248,9 +252,47 @@ static netdev_tx_t vrf_xmit(struct sk_buff *skb, struct net_device *dev)
 	return ret;
 }
 
-static netdev_tx_t vrf_finish(struct sock *sk, struct sk_buff *skb)
+/* modelled after ip_finish_output2 */
+static int vrf_finish_output(struct sock *sk, struct sk_buff *skb)
 {
-	return dev_queue_xmit(skb);
+	struct dst_entry *dst = skb_dst(skb);
+	struct rtable *rt = (struct rtable *)dst;
+	struct net_device *dev = dst->dev;
+	unsigned int hh_len = LL_RESERVED_SPACE(dev);
+	struct neighbour *neigh;
+	u32 nexthop;
+	int ret = -EINVAL;
+
+	/* Be paranoid, rather than too clever. */
+	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
+		struct sk_buff *skb2;
+
+		skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));
+		if (!skb2) {
+			ret = -ENOMEM;
+			goto err;
+		}
+		if (skb->sk)
+			skb_set_owner_w(skb2, skb->sk);
+
+		consume_skb(skb);
+		skb = skb2;
+	}
+
+	rcu_read_lock_bh();
+
+	nexthop = (__force u32)rt_nexthop(rt, ip_hdr(skb)->daddr);
+	neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
+	if (unlikely(!neigh))
+		neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
+	if (!IS_ERR(neigh))
+		ret = dst_neigh_output(dst, neigh, skb);
+
+	rcu_read_unlock_bh();
+err:
+	if (unlikely(ret < 0))
+		vrf_tx_error(skb->dev, skb);
+	return ret;
 }
 
 static int vrf_output(struct sock *sk, struct sk_buff *skb)
@@ -264,7 +306,7 @@ static int vrf_output(struct sock *sk, struct sk_buff *skb)
 
 	return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING, sk, skb,
 			    NULL, dev,
-			    vrf_finish,
+			    vrf_finish_output,
 			    !(IPCB(skb)->flags & IPSKB_REROUTED));
 }
 
-- 
1.9.1



Re: [PATCH net 1/1] sfc: only use vadaptor stats if firmware is capable

2015-08-27 Thread David Miller
From: Shradha Shah ss...@solarflare.com
Date: Wed, 26 Aug 2015 16:39:03 +0100

 From: Bert Kenward bkenw...@solarflare.com
 
 Some of the stats handling code differs based on SR-IOV support,
 and SRIOV support is only available if full-featured firmware is
 used.
 Do not use vadaptor stats if firmware mode is not set to
 full-featured.
 
 Signed-off-by: Shradha Shah ss...@solarflare.com

Applied, thank you.


Re: [PATCH net v2] sctp: start t5 timer only when peer.rwnd is 0 and local.state is SHUTDOWN_PENDING

2015-08-27 Thread lucien xin

 So one potential way is to have peer.rwnd and peer.a_rwnd, where peer.a_rwnd is
 the window advertised by peer and peer.rwnd is our estimation based on
 peer.a_rwnd.  This way we will always know where we stand.

 Although I am not sure yet if we want to grow the peer structure any more.

 Another way is to have an estimate or 0-window probe bit/flag on the send side
 and set it when we do 0-window probe.  This way we'd know that when 0-window
 probe bit is set, peer returned 0 window.

I think updating the 0-window state may happen in sctp_process_init() and
sctp_outq_sack(). I don't think a 0-window can be probed, because an
unreachable peer and a closed window both produce no reply from the peer.
But we can update the 0-window bit in those two functions. I just do not
know where there is an available bit we can use if we won't change the
peer structure.

 Just some thoughts.
 -vlad


Re: ip_rcv_finish() NULL pointer and possibly related Oopses

2015-08-27 Thread Michael Marineau
On Thu, Aug 27, 2015 at 9:30 AM, Eric Dumazet eric.duma...@gmail.com wrote:
 On Thu, 2015-08-27 at 09:16 -0700, Michael Marineau wrote:


 Oh, interesting. Looks like that patch didn't get CC'd to stable
 though, is there a reason for that or just oversight?

 We never CC stable for networking patches.

 David Miller prefers to take care of this himself.

Ah, right, sorry. forgot about that. :)


 ( this is in Documentation/networking/netdev-FAQ.txt )

 Q: How can I tell what patches are queued up for backporting to the
various stable releases?

 A: Normally Greg Kroah-Hartman collects stable commits himself, but
for networking, Dave collects up patches he deems critical for the
networking subsystem, and then hands them off to Greg.

There is a patchworks queue that you can see here:
 http://patchwork.ozlabs.org/bundle/davem/stable/?state=*

It contains the patches which Dave has selected, but not yet handed
off to Greg.  If Greg already has the patch, then it will be here:
 http://git.kernel.org/cgit/linux/kernel/git/stable/stable-queue.git

A quick way to find whether the patch is in this stable-queue is
to simply clone the repo, and then git grep the mainline commit ID, e.g.

 stable-queue$ git grep -l 284041ef21fdf2e
 releases/3.0.84/ipv6-fix-possible-crashes-in-ip6_cork_release.patch
 releases/3.4.51/ipv6-fix-possible-crashes-in-ip6_cork_release.patch
 releases/3.9.8/ipv6-fix-possible-crashes-in-ip6_cork_release.patch
 stable/stable-queue$





Re: ip_rcv_finish() NULL pointer and possibly related Oopses

2015-08-27 Thread Eric Dumazet
On Thu, 2015-08-27 at 09:16 -0700, Michael Marineau wrote:

 
 Oh, interesting. Looks like that patch didn't get CC'd to stable
 though, is there a reason for that or just oversight?

We never CC stable for networking patches.

David Miller prefers to take care of this himself.

( this is in Documentation/networking/netdev-FAQ.txt )

Q: How can I tell what patches are queued up for backporting to the
   various stable releases?

A: Normally Greg Kroah-Hartman collects stable commits himself, but
   for networking, Dave collects up patches he deems critical for the
   networking subsystem, and then hands them off to Greg.

   There is a patchworks queue that you can see here:
http://patchwork.ozlabs.org/bundle/davem/stable/?state=*

   It contains the patches which Dave has selected, but not yet handed
   off to Greg.  If Greg already has the patch, then it will be here:
http://git.kernel.org/cgit/linux/kernel/git/stable/stable-queue.git

   A quick way to find whether the patch is in this stable-queue is
   to simply clone the repo, and then git grep the mainline commit ID, e.g.

stable-queue$ git grep -l 284041ef21fdf2e
releases/3.0.84/ipv6-fix-possible-crashes-in-ip6_cork_release.patch
releases/3.4.51/ipv6-fix-possible-crashes-in-ip6_cork_release.patch
releases/3.9.8/ipv6-fix-possible-crashes-in-ip6_cork_release.patch
stable/stable-queue$





Re: [PATCH net v2] sctp: start t5 timer only when peer.rwnd is 0 and local.state is SHUTDOWN_PENDING

2015-08-27 Thread lucien xin

 So one potential way is to have peer.rwnd and peer.a_rwnd, where
 peer.a_rwnd is the window advertised by the peer and peer.rwnd is our
 estimate based on peer.a_rwnd.
 This way we will always know where we stand.

 Although I am not sure yet if we want to grow the peer structure any more.

 Another way is to have an estimate or 0-window probe bit/flag on the
 send side and set it when we do a 0-window probe.  This way we'd know
 that when the 0-window probe bit is set, the peer returned a 0 window.

 I think updating 0-window may happen in sctp_process_init() and
 sctp_outq_sack(). I don't think a 0-window can be probed, because an
 unreachable peer and a closing window both produce no reply from the
 peer. But we can update the 0-window bit in those two functions. I just
 do not know where there is an available bit we can use if we won't
 change the peer structure.

 You can ignore INIT as the window will never be 0 (not allowed).

 The updates could happen at the end of sctp_outq_sack().  There are some
 spare bits in peer if you want to go that way.

I found this one, *addip_disabled_mask*, but it's really bad to use it.
We should have put an extensible mask or pointer there before.

As to a_rwnd, it's a good idea, but if we only use it here, it may not be
worth doing. If a lot of places need it, we can add it.
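
A minimal sketch of the two options discussed above, written as a simplified userspace model rather than kernel code. The struct and function names (peer_state, peer_on_sack, zero_window_announced) are hypothetical and do not reflect the real layout of the SCTP peer structure.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model of the per-peer window state. */
struct peer_state {
	uint32_t rwnd;		/* our running estimate of the peer's window */
	uint32_t a_rwnd;	/* option 1: last window advertised by the peer */
	unsigned int zero_window_announced:1;	/* option 2: one spare bit */
};

/*
 * Called when a SACK arrives. sack_a_rwnd is the window the peer
 * advertised, outstanding is the number of bytes still in flight.
 */
static void peer_on_sack(struct peer_state *p, uint32_t sack_a_rwnd,
			 uint32_t outstanding)
{
	p->a_rwnd = sack_a_rwnd;
	p->zero_window_announced = (sack_a_rwnd == 0);
	/* The estimate is the advertised window minus data in flight. */
	p->rwnd = sack_a_rwnd > outstanding ? sack_a_rwnd - outstanding : 0;
}

/*
 * True only if the peer itself reported a closed window, as opposed to
 * our estimate reaching 0 because of data in flight.
 */
static bool peer_closed_window(const struct peer_state *p)
{
	return p->zero_window_announced;
}
```

With either field, the sender can tell "rwnd is 0 because the peer said so" apart from "rwnd is 0 because of our own accounting", which is the distinction the t5 timer decision needs.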


Re: [PATCH net-next] bridge: Add netlink support for vlan_protocol attribute

2015-08-27 Thread Nikolay Aleksandrov

 On Aug 26, 2015, at 11:00 PM, Toshiaki Makita makita.toshi...@lab.ntt.co.jp wrote:
 
 This enables bridge vlan_protocol to be configured through netlink.
 
 When CONFIG_BRIDGE_VLAN_FILTERING is disabled, kernel behaves the
 same way as this feature is not implemented.
 
 Signed-off-by: Toshiaki Makita makita.toshi...@lab.ntt.co.jp
 ---
 include/uapi/linux/if_link.h |  1 +
 net/bridge/br_netlink.c  | 34 ++
 net/bridge/br_private.h  |  1 +
 net/bridge/br_vlan.c | 35 +--
 4 files changed, 57 insertions(+), 14 deletions(-)
 

Nice, looks good. I have a similar patch as well and was going to ask: wouldn't
it be better to make empty stubs which return an error when vlan filtering isn't
configured and drop the ifdefs in the netlink handling code?
Similar to how the vlan_filtering netlink attribute is handled in commit
a7854037da00 ("bridge: netlink: add support for vlan_filtering attribute").

A potential problem would be the return of the protocol, but I think if 0 is
returned that can be handled.

Cheers,
 Nik
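
A minimal sketch of the stub pattern suggested above, as a standalone model. All names here (model_bridge, model_set_vlan_proto, MODEL_VLAN_FILTERING) are hypothetical stand-ins, not the bridge code's real helpers or config symbol.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the bridge private data. */
struct model_bridge {
	uint16_t vlan_proto;
};

#ifdef MODEL_VLAN_FILTERING
/* Real implementations, built only when the feature is configured. */
static int model_set_vlan_proto(struct model_bridge *br, uint16_t proto)
{
	br->vlan_proto = proto;
	return 0;
}

static uint16_t model_get_vlan_proto(const struct model_bridge *br)
{
	return br->vlan_proto;
}
#else
/* Empty stubs: setting fails cleanly, getting reports 0. */
static int model_set_vlan_proto(struct model_bridge *br, uint16_t proto)
{
	(void)br;
	(void)proto;
	return -EOPNOTSUPP;
}

static uint16_t model_get_vlan_proto(const struct model_bridge *br)
{
	(void)br;
	return 0;
}
#endif

/* The attribute handler needs no #ifdef: it just calls the helper. */
static int model_handle_vlan_proto_attr(struct model_bridge *br,
					uint16_t proto)
{
	return model_set_vlan_proto(br, proto);
}
```

The ifdef then lives in one place next to the helpers, and the netlink code path is the same whether or not the feature is compiled in.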



Re: ip_rcv_finish() NULL pointer and possibly related Oopses

2015-08-27 Thread Michael Marineau
On Thu, Aug 27, 2015 at 6:00 AM, Eric Dumazet eric.duma...@gmail.com wrote:
 On Wed, 2015-08-26 at 13:54 -0700, Michael Marineau wrote:
 On Wed, Aug 26, 2015 at 4:49 AM, Chuck Ebbert cebbert.l...@gmail.com wrote:
  On Wed, 26 Aug 2015 08:46:59 +
  Shaun Crampton shaun.cramp...@metaswitch.com wrote:
 
  Testing our app at scale on Google's GCE, running ~1000 CoreOS hosts: over
  approximately 1 hour, I see about 1 in 50 hosts hit one of the Oopses
  below and then reboot (I'm not sure if the different oopses are related to
  each other).
 
  The app is Project Calico, which is a datacenter networking fabric.
  calico-felix, the process named below, is our per-host agent.  The
  per-host agent is responsible for reading the network information from a
  central server and applying "ip route" and iptables updates to the
  kernel.  We're running on CoreOS, with about 100 docker containers/veth
  pairs running on each host.  calico-felix is running inside one of those
  containers. We also run the BIRD BGP stack to redistribute routes around
  the datacenter.  The errors happen more frequently while Calico is under
  load.
 
  I'm not sure where to go from here.  I can reproduce these issues easily
  at that scale but I haven't managed to boil it down to a small-scale repro
  scenario for further investigation (yet).
 
 
  What in the world is going on with those call traces? E.g.:
 
  [ 4513.712008]  IRQ
  [ 4513.712008]  [81486751] ? ip_rcv_finish+0x81/0x360
  [ 4513.712008]  [814870e4] ip_rcv+0x2a4/0x400
  [ 4513.712008]  [814866d0] ? inet_del_offload+0x40/0x40
  [ 4513.712008]  [814491b3] __netif_receive_skb_core+0x6c3/0x9a0
  [ 4513.712008]  [8143b667] ? build_skb+0x17/0x90
  [ 4513.712008]  [814494a8] __netif_receive_skb+0x18/0x60
  [ 4513.712008]  [81449523] netif_receive_skb_internal+0x33/0xa0
  [ 4513.712008]  [814495ac] netif_receive_skb_sk+0x1c/0x70
  [ 4513.712008]  [a00f772b] 0xa00f772b
  [ 4513.712008]  [814491b3] ? __netif_receive_skb_core+0x6c3/0x9a0
  [ 4513.712008]  [a00f7d81] 0xa00f7d81
  [ 4513.712008]  [81449979] net_rx_action+0x159/0x340
  [ 4513.712008]  [810715f4] __do_softirq+0xf4/0x290
  [ 4513.712008]  [810719fd] irq_exit+0xad/0xc0
  [ 4513.712008]  [815528ba] do_IRQ+0x5a/0xf0
  [ 4513.712008]  [815507ae] common_interrupt+0x6e/0x6e
  [ 4513.712008]  EOI
 
  There are two functions in the call trace that the kernel knows
  nothing about. How did they get in there?
 
  And there is really executable code in there, as can be seen from a
  later trace:
 
  [ 4123.003006]  IRQ
  [ 4123.003006]  [8147d477] nf_iterate+0x57/0x80
  [ 4123.003006]  [8147d537] nf_hook_slow+0x97/0x100
  [ 4123.003006]  [81486e32] ip_local_deliver+0x92/0xa0
  [ 4123.003006]  [81486a30] ? ip_rcv_finish+0x360/0x360
  [ 4123.003006]  [81486751] ip_rcv_finish+0x81/0x360
  [ 4123.003006]  [814870e4] ip_rcv+0x2a4/0x400
  [ 4123.003006]  [814866d0] ? inet_del_offload+0x40/0x40
  [ 4123.003006]  [814491b3] __netif_receive_skb_core+0x6c3/0x9a0
  [ 4123.003006]  [8143b667] ? build_skb+0x17/0x90
  [ 4123.003006]  [814494a8] __netif_receive_skb+0x18/0x60
  [ 4123.003006]  [81449523] netif_receive_skb_internal+0x33/0xa0
  [ 4123.003006]  [814495ac] netif_receive_skb_sk+0x1c/0x70
  [ 4123.003006]  [a00d472b] 0xa00d472b
  [ 4123.003006]  [a00d4d81] 0xa00d4d81
  [ 4123.003006]  [81449979] net_rx_action+0x159/0x340
  [ 4123.003006]  [810715f4] __do_softirq+0xf4/0x290
  [ 4123.003006]  [810719fd] irq_exit+0xad/0xc0
  [ 4123.003006]  [815528ba] do_IRQ+0x5a/0xf0
  [ 4123.003006]  [815507ae] common_interrupt+0x6e/0x6e
  [ 4123.003006]  EOI
  [ 4123.003006]  [81483a3d] ? __ip_route_output_key+0x31d/0x860
  [ 4123.003006]  [814e2e95] ? xfrm_lookup_route+0x5/0x70
  [ 4123.003006]  [81484224] ? ip_route_output_flow+0x54/0x60
  [ 4123.003006]  [8148ca6a] ip_queue_xmit+0x36a/0x3d0
  [ 4123.003006]  [814a4799] tcp_transmit_skb+0x4b9/0x990
  [ 4123.003006]  [814a4d85] tcp_write_xmit+0x115/0xe90
  [ 4123.003006]  [814a5d72] __tcp_push_pending_frames+0x32/0xd0
  [ 4123.003006]  [8149443f] tcp_push+0xef/0x120
  [ 4123.003006]  [81497cb5] tcp_sendmsg+0xc5/0xb20
  [ 4123.003006]  [810d74c9] ? lock_hrtimer_base.isra.22+0x29/0x50
  [ 4123.003006]  [814c2d04] inet_sendmsg+0x64/0xa0
  [ 4123.003006]  [811e94b5] ? __fget_light+0x25/0x70
  [ 4123.003006]  [8142d74d] sock_sendmsg+0x3d/0x50
  [ 4123.003006]  [8142dc12] SYSC_sendto+0x102/0x1a0
  [ 4123.003006]  [8110f864] ? __audit_syscall_entry+0xb4/0x110
  [ 4123.003006]  [810224fc] ? do_audit_syscall_entry+0x6c/0x70
  [ 4123.003006]  [81023cf3] ?
  

[PATCH net-next] net/xen-netfront: only napi_synchronize() if running

2015-08-27 Thread Charles (Chas) Williams
From: Chas Williams 3ch...@gmail.com

If an interface isn't running, napi_synchronize() will hang forever.

[  392.248403] rmmod   R  running task0   359343 0x
[  392.257671]  88003760fc88 880037193b40 880037193160 88003760fc88
[  392.267644]  88003761 88003760fcd8 000100014c22 81f75c40
[  392.277524]  00bc7010 88003760fca8 81796927 81f75c40
[  392.287323] Call Trace:
[  392.291599]  [81796927] schedule+0x37/0x90
[  392.298553]  [8179985b] schedule_timeout+0x14b/0x280
[  392.306421]  [810f91b9] ? irq_free_descs+0x69/0x80
[  392.314006]  [811084d0] ? internal_add_timer+0xb0/0xb0
[  392.322125]  [81109d07] msleep+0x37/0x50
[  392.329037]  [a00ec79a] xennet_disconnect_backend.isra.24+0xda/0x390 [xen_netfront]
[  392.339658]  [a00ecadc] xennet_remove+0x2c/0x80 [xen_netfront]
[  392.348516]  [81481c69] xenbus_dev_remove+0x59/0xc0
[  392.356257]  [814e7217] __device_release_driver+0x87/0x120
[  392.364645]  [814e7cf8] driver_detach+0xb8/0xc0
[  392.371989]  [814e6e69] bus_remove_driver+0x59/0xe0
[  392.379883]  [814e84f0] driver_unregister+0x30/0x70
[  392.387495]  [814814b2] xenbus_unregister_driver+0x12/0x20
[  392.395908]  [a00ed89b] netif_exit+0x10/0x775 [xen_netfront]
[  392.404877]  [81124e08] SyS_delete_module+0x1d8/0x230
[  392.412804]  [8179a8ee] system_call_fastpath+0x12/0x71

Signed-off-by: Chas Williams 3ch...@gmail.com
---
 drivers/net/xen-netfront.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index f948c46..5ff0cfd 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -1348,7 +1348,8 @@ static void xennet_disconnect_backend(struct netfront_info *info)
 		queue->tx_evtchn = queue->rx_evtchn = 0;
 		queue->tx_irq = queue->rx_irq = 0;
 
-		napi_synchronize(&queue->napi);
+		if (netif_running(info->netdev))
+			napi_synchronize(&queue->napi);
 
 		xennet_release_tx_bufs(queue);
 		xennet_release_rx_bufs(queue);
-- 
2.1.0
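
A rough userspace model of the hang described in this patch. It assumes, as far as I understand the NAPI code, that napi_synchronize() waits for the "scheduled" state bit to clear and that a disabled NAPI instance keeps that bit set. Every name below (model_napi, model_close, model_disconnect) is hypothetical, and the loop is bounded so the example terminates where the kernel would sleep forever.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the NAPI "scheduled" state bit. */
struct model_napi {
	bool sched;
};

struct model_netdev {
	bool running;
	struct model_napi napi;
};

/* Closing the interface disables NAPI, which leaves the bit set. */
static void model_close(struct model_netdev *dev)
{
	dev->running = false;
	dev->napi.sched = true;
}

/*
 * Wait for the scheduled bit to clear. Returns the number of polls
 * waited, or -1 after max_polls to stand in for "hangs forever".
 */
static int model_napi_synchronize(const struct model_napi *n, int max_polls)
{
	int polls = 0;

	while (n->sched) {
		if (polls >= max_polls)
			return -1;
		polls++;
	}
	return polls;
}

/* Disconnect path with the guard this patch adds. */
static int model_disconnect(const struct model_netdev *dev, int max_polls)
{
	if (!dev->running)
		return 0;	/* nothing to wait for */
	return model_napi_synchronize(&dev->napi, max_polls);
}
```

On a closed interface the unguarded wait never sees the bit clear, while the guarded path skips the wait entirely.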





Re: ip_rcv_finish() NULL pointer and possibly related Oopses

2015-08-27 Thread David Miller
From: Michael Marineau michael.marin...@coreos.com
Date: Thu, 27 Aug 2015 09:16:06 -0700

 On Thu, Aug 27, 2015 at 6:00 AM, Eric Dumazet eric.duma...@gmail.com wrote:
 Make sure you backported commit
 10e2eb878f3ca07ac2f05fa5ca5e6c4c9174a27a
 (udp: fix dst races with multicast early demux)
 
 Oh, interesting. Looks like that patch didn't get CC'd to stable
 though, is there a reason for that or just oversight?

All networking bug fixes are submitted to -stable by hand by me at a
time of my choosing.  We do not use the CC: stable facility, as I
feel it pushes patches into -stable way too quickly and before the
change gets sufficient exposure for regressions in Linus's tree.

The patch in question got submitted last night.


Re: [RFC PATCH net-next 0/2] Add new switchdev device class

2015-08-27 Thread Scott Feldman
On Thu, Aug 27, 2015 at 2:06 AM, Andrew Lunn and...@lunn.ch wrote:
 On Thu, Aug 27, 2015 at 01:42:24AM -0700, Scott Feldman wrote:
 On Thu, Aug 27, 2015 at 12:45 AM, Andrew Lunn and...@lunn.ch wrote:
  I don't know about how this overlaps with DSA platform_class.  Florian?
 
  There is some overlap with DSA, but the current DSA model, with
  respect to probing, is broken. So this might be interesting as a way
  towards fix that.
 
  One thing to keep in mind is the D in DSA. You talk about switch,
  singular. DSA has a number of switches in a cluster. We currently
  export a single switchdev interface for the cluster, but there are
  some properties which are per switch, e.g. temperature, eeprom
  contents, statistics, power management etc.

 Export a single 'switchdev' or 'netdev' for the cluster?  I hope that
 was a typo.

 I probably expressed that badly. The hardware i have on my desk has
 three Marvell switches in a chain, with one end of the chain connected
 to a host Ethernet interface.

 From the switchdev ops level, you don't see anything of this
 chain. But some of the operations do need to be aware of this chain,
 for example vlans which span multiple chips in this chain.

 With switchdev device class, you'd instantiate one per
 phy switch, and have per-switch props (temp, eeprom, etc) thru each
 switchdev instance.

 O.K. This is fine, but we need people to understand that a switchdev
 device class represents some middle layer in the hierarchy, not the
 top layer. Otherwise false assumptions might be made.

So with kobj, a device can have a parent.  So I experimented with my
RFC patch and changed register_switchdev to take a parent switchdev
arg, which is NULL for leaf switchdevs:

int register_switchdev(struct switchdev *sdev, const char *name,
		       struct switchdev *parent)
{
	struct device *dev = &sdev->dev;
	int err;

	device_initialize(dev);

	dev->class = &switchdev_class;
	if (parent)
		dev->parent = &parent->dev;

	err = dev_set_name(dev, "%s", name);
	if (err)
		return err;

	return device_add(dev);
}

Then I tried this with rocker and it works as expected.  On module
load, I create the master switchdev, and then on PCI probe for each
phys switch dev, I put the slave switchdev in the master using
register_switchdev.  Here's one slave in the master rockers switch:

tree /sys/class/switchdev/rockers
/sys/class/switchdev/rockers
├── 525400123501
│   ├── device -> ../../rocker
│   ├── foo
│   ├── power
│   │   ├── async
│   │   ├── autosuspend_delay_ms
│   │   ├── control
│   │   ├── runtime_active_kids
│   │   ├── runtime_active_time
│   │   ├── runtime_enabled
│   │   ├── runtime_status
│   │   ├── runtime_suspended_time
│   │   └── runtime_usage
│   ├── subsystem -> ../../../../../class/switchdev
│   └── uevent
├── foo
├── power
│   ├── async
│   ├── autosuspend_delay_ms
│   ├── control
│   ├── runtime_active_kids
│   ├── runtime_active_time
│   ├── runtime_enabled
│   ├── runtime_status
│   ├── runtime_suspended_time
│   └── runtime_usage
├── subsystem -> ../../../../class/switchdev
└── uevent

With this, we can stack switchdevs, I guess as high as we want.  Does
this look usable for DSA?   An attr set on the master would get pushed
down to the leaves.  We can do it with the same style of recursive
algos we use for switchdev port attrs.
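
A small sketch of the recursive push-down mentioned above, as a standalone model. The model_switchdev type, its fixed-size child array, and model_set_foo are all hypothetical. The RFC patch only registers a parent pointer and does not define how attrs propagate.

```c
#include <stddef.h>

#define MODEL_MAX_CHILDREN 4

/* Hypothetical model of a stacked switchdev with one attr, "foo". */
struct model_switchdev {
	const char *name;
	int foo;
	struct model_switchdev *children[MODEL_MAX_CHILDREN];
	size_t nr_children;
};

/* Attach a child under a parent. Returns -1 if the parent is full. */
static int model_add_child(struct model_switchdev *parent,
			   struct model_switchdev *child)
{
	if (parent->nr_children >= MODEL_MAX_CHILDREN)
		return -1;
	parent->children[parent->nr_children++] = child;
	return 0;
}

/* Set the attr on this device, then recurse into every child. */
static void model_set_foo(struct model_switchdev *sdev, int value)
{
	size_t i;

	sdev->foo = value;
	for (i = 0; i < sdev->nr_children; i++)
		model_set_foo(sdev->children[i], value);
}
```

Setting the attr on the master reaches every leaf regardless of how deep the stack is, which is the property a multi-chip DSA cluster would rely on.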


[PATCH net-next] r8169: Add software counter for multicast packages

2015-08-27 Thread Corinna Vinschen
The multicast hardware counter on 8168/8111 chips is only 32 bit while the
statistics in struct rtnl_link_stats64 are 64 bit.  Given that statistics
are requested on an irregular basis, an overflow of the hardware counter
can go unnoticed.  To count even very large numbers of multicast packets
reliably, add a software counter and remove previously applied code to
fill the multicast field requested by @rtl8169_get_stats64 with the values
read from the rx_multicast hardware counter.

Signed-off-by: Corinna Vinschen vinsc...@redhat.com
---
 drivers/net/ethernet/realtek/r8169.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index d6d39df..24dcbe6 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -754,7 +754,6 @@ struct rtl8169_tc_offsets {
 	bool	inited;
 	__le64	tx_errors;
 	__le32	tx_multi_collision;
-	__le32	rx_multicast;
 	__le16	tx_aborted;
 };
 
@@ -2326,7 +2325,6 @@ static bool rtl8169_init_counter_offsets(struct net_device *dev)
 
 	tp->tc_offset.tx_errors = tp->counters.tx_errors;
 	tp->tc_offset.tx_multi_collision = tp->counters.tx_multi_collision;
-	tp->tc_offset.rx_multicast = tp->counters.rx_multicast;
 	tp->tc_offset.tx_aborted = tp->counters.tx_aborted;
 	tp->tc_offset.inited = true;
 
@@ -7480,6 +7478,9 @@ process_pkt:
 			tp->rx_stats.packets++;
 			tp->rx_stats.bytes += pkt_size;
 			u64_stats_update_end(&tp->rx_stats.syncp);
+
+			if (skb->pkt_type == PACKET_MULTICAST)
+				dev->stats.multicast++;
 		}
 release_descriptor:
 		desc->opts2 = 0;
@@ -7790,7 +7791,6 @@ rtl8169_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
 		stats->rx_bytes	= tp->rx_stats.bytes;
 	} while (u64_stats_fetch_retry_irq(&tp->rx_stats.syncp, start));
 
-
 	do {
 		start = u64_stats_fetch_begin_irq(&tp->tx_stats.syncp);
 		stats->tx_packets = tp->tx_stats.packets;
@@ -7804,6 +7804,7 @@ rtl8169_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
 	stats->rx_crc_errors	= dev->stats.rx_crc_errors;
 	stats->rx_fifo_errors	= dev->stats.rx_fifo_errors;
 	stats->rx_missed_errors = dev->stats.rx_missed_errors;
+	stats->multicast	= dev->stats.multicast;
 
 	/*
 	 * Fetch additonal counter values missing in stats collected by driver
@@ -7819,8 +7820,6 @@ rtl8169_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
 		le64_to_cpu(tp->tc_offset.tx_errors);
 	stats->collisions = le32_to_cpu(tp->counters.tx_multi_collision) -
 		le32_to_cpu(tp->tc_offset.tx_multi_collision);
-	stats->multicast = le32_to_cpu(tp->counters.rx_multicast) -
-		le32_to_cpu(tp->tc_offset.rx_multicast);
 	stats->tx_aborted_errors = le16_to_cpu(tp->counters.tx_aborted) -
 		le16_to_cpu(tp->tc_offset.tx_aborted);
 
-- 
2.1.0
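
To illustrate the overflow problem this patch describes, here is a small standalone sketch. The helper names (delta_from_hw32, sw_stats, sw_count_multicast) are hypothetical and are not driver functions. The point is only that a reader sampling a 32-bit counter sees values modulo 2^32, while a 64-bit software counter keeps the full total.

```c
#include <stdint.h>

/*
 * What a reader can derive from two samples of a 32-bit hardware
 * counter. The subtraction is modulo 2^32, so any full wrap that
 * happened between the two samples is silently lost.
 */
static uint64_t delta_from_hw32(uint32_t prev, uint32_t now)
{
	return (uint32_t)(now - prev);
}

/* Hypothetical 64-bit software counter, bumped per received packet. */
struct sw_stats {
	uint64_t multicast;
};

static void sw_count_multicast(struct sw_stats *s, uint64_t packets)
{
	s->multicast += packets;
}
```

For example, if 2^32 + 5 multicast packets arrive between two statistics requests, the hardware counter difference is 5, while the software counter reports 4294967301.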



Re: [PATCH -next v3 1/2] device property: Return -ENXIO if there is no suitable FW interface

2015-08-27 Thread Jeremy Linton

On 08/26/2015 10:27 PM, Guenter Roeck wrote:

Return -ENXIO if device property array access functions don't find
a suitable firmware interface.

This lets drivers decide if they should use available platform data
instead.


Works fine on an ACPI based ARM system.

Thanks for taking care of this.

Tested-by: Jeremy Linton jeremy.lin...@arm.com






[patch net-next v2 3/3] mlxsw: Make mailboxes 4KB aligned

2015-08-27 Thread Jiri Pirko
From: Ido Schimmel ido...@mellanox.com

The HW-SW contract requires mailboxes passed to the firmware to be 4KB
aligned. Previously, these mailboxes were mapped using streaming DMA
routines, which do not guarantee the bus addresses to be 4KB aligned.
Under certain conditions this constraint was indeed violated and errors
were observed.

By using consistent DMA mapping routines together with a mailbox size of
4KB we are guaranteed not to violate the constraint.

Signed-off-by: Ido Schimmel ido...@mellanox.com
Signed-off-by: Jiri Pirko j...@mellanox.com
---
 drivers/net/ethernet/mellanox/mlxsw/pci.c | 83 +++
 1 file changed, 50 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.c b/drivers/net/ethernet/mellanox/mlxsw/pci.c
index 045f98f..462cea3 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/pci.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c
@@ -46,6 +46,7 @@
 #include <linux/log2.h>
 #include <linux/debugfs.h>
 #include <linux/seq_file.h>
+#include <linux/string.h>
 
 #include "pci.h"
 #include "core.h"
@@ -174,6 +175,8 @@ struct mlxsw_pci {
 		struct mlxsw_pci_mem_item *items;
 	} fw_area;
 	struct {
+		struct mlxsw_pci_mem_item out_mbox;
+		struct mlxsw_pci_mem_item in_mbox;
 		struct mutex lock; /* Lock access to command registers */
 		bool nopoll;
 		wait_queue_head_t wait;
@@ -1341,6 +1344,32 @@ static irqreturn_t mlxsw_pci_eq_irq_handler(int irq, void *dev_id)
 	return IRQ_HANDLED;
 }
 
+static int mlxsw_pci_mbox_alloc(struct mlxsw_pci *mlxsw_pci,
+				struct mlxsw_pci_mem_item *mbox)
+{
+	struct pci_dev *pdev = mlxsw_pci->pdev;
+	int err = 0;
+
+	mbox->size = MLXSW_CMD_MBOX_SIZE;
+	mbox->buf = pci_alloc_consistent(pdev, MLXSW_CMD_MBOX_SIZE,
+					 &mbox->mapaddr);
+	if (!mbox->buf) {
+		dev_err(&pdev->dev, "Failed allocating memory for mailbox\n");
+		err = -ENOMEM;
+	}
+
+	return err;
+}
+
+static void mlxsw_pci_mbox_free(struct mlxsw_pci *mlxsw_pci,
+				struct mlxsw_pci_mem_item *mbox)
+{
+	struct pci_dev *pdev = mlxsw_pci->pdev;
+
+	pci_free_consistent(pdev, MLXSW_CMD_MBOX_SIZE, mbox->buf,
+			    mbox->mapaddr);
+}
+
 static int mlxsw_pci_init(void *bus_priv, struct mlxsw_core *mlxsw_core,
 			  const struct mlxsw_config_profile *profile)
 {
@@ -1358,6 +1387,15 @@ static int mlxsw_pci_init(void *bus_priv, struct mlxsw_core *mlxsw_core,
 	mbox = mlxsw_cmd_mbox_alloc();
 	if (!mbox)
 		return -ENOMEM;
+
+	err = mlxsw_pci_mbox_alloc(mlxsw_pci, &mlxsw_pci->cmd.in_mbox);
+	if (err)
+		goto mbox_put;
+
+	err = mlxsw_pci_mbox_alloc(mlxsw_pci, &mlxsw_pci->cmd.out_mbox);
+	if (err)
+		goto err_out_mbox_alloc;
+
 	err = mlxsw_cmd_query_fw(mlxsw_core, mbox);
 	if (err)
 		goto err_query_fw;
@@ -1420,6 +1458,9 @@ err_fw_area_init:
 err_doorbell_page_bar:
 err_iface_rev:
 err_query_fw:
+	mlxsw_pci_mbox_free(mlxsw_pci, &mlxsw_pci->cmd.out_mbox);
+err_out_mbox_alloc:
+	mlxsw_pci_mbox_free(mlxsw_pci, &mlxsw_pci->cmd.in_mbox);
 mbox_put:
 	mlxsw_cmd_mbox_free(mbox);
 	return err;
@@ -1432,6 +1473,8 @@ static void mlxsw_pci_fini(void *bus_priv)
 	free_irq(mlxsw_pci->msix_entry.vector, mlxsw_pci);
 	mlxsw_pci_aqs_fini(mlxsw_pci);
 	mlxsw_pci_fw_area_fini(mlxsw_pci);
+	mlxsw_pci_mbox_free(mlxsw_pci, &mlxsw_pci->cmd.out_mbox);
+	mlxsw_pci_mbox_free(mlxsw_pci, &mlxsw_pci->cmd.in_mbox);
 }
 
 static struct mlxsw_pci_queue *
@@ -1524,8 +1567,8 @@ static int mlxsw_pci_cmd_exec(void *bus_priv, u16 opcode, u8 opcode_mod,
 			      u8 *p_status)
 {
 	struct mlxsw_pci *mlxsw_pci = bus_priv;
-	dma_addr_t in_mapaddr = 0;
-	dma_addr_t out_mapaddr = 0;
+	dma_addr_t in_mapaddr = mlxsw_pci->cmd.in_mbox.mapaddr;
+	dma_addr_t out_mapaddr = mlxsw_pci->cmd.out_mbox.mapaddr;
 	bool evreq = mlxsw_pci->cmd.nopoll;
 	unsigned long timeout = msecs_to_jiffies(MLXSW_PCI_CIR_TIMEOUT_MSECS);
 	bool *p_wait_done = &mlxsw_pci->cmd.wait_done;
@@ -1537,27 +1580,11 @@ static int mlxsw_pci_cmd_exec(void *bus_priv, u16 opcode, u8 opcode_mod,
 	if (err)
 		return err;
 
-	if (in_mbox) {
-		in_mapaddr = pci_map_single(mlxsw_pci->pdev, in_mbox,
-					    in_mbox_size, PCI_DMA_TODEVICE);
-		if (unlikely(pci_dma_mapping_error(mlxsw_pci->pdev,
-						   in_mapaddr))) {
-			err = -EIO;
-			goto err_in_mbox_map;
-		}
-	}
+	if (in_mbox)
+		memcpy(mlxsw_pci->cmd.in_mbox.buf, in_mbox, in_mbox_size);

[patch net-next v2 2/3] mlxsw: adjust transmit fail log message level in __mlxsw_emad_transmit

2015-08-27 Thread Jiri Pirko
From: Jiri Pirko j...@mellanox.com

When transmit fails, it is an error, not a warning.

Signed-off-by: Jiri Pirko j...@mellanox.com
Signed-off-by: Ido Schimmel ido...@mellanox.com
Signed-off-by: Elad Raz el...@mellanox.com
---
 drivers/net/ethernet/mellanox/mlxsw/core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c
index 0415ff6..dbcaf5d 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -376,8 +376,8 @@ static int __mlxsw_emad_transmit(struct mlxsw_core *mlxsw_core,
 
 	err = mlxsw_core_skb_transmit(mlxsw_core->driver_priv, skb, tx_info);
 	if (err) {
-		dev_warn(mlxsw_core->bus_info->dev, "Failed to transmit EMAD (tid=%llx)\n",
-			 mlxsw_core->emad.tid);
+		dev_err(mlxsw_core->bus_info->dev, "Failed to transmit EMAD (tid=%llx)\n",
+			mlxsw_core->emad.tid);
 		dev_kfree_skb(skb);
 		return err;
 	}
-- 
1.9.3



[patch net-next v2 0/3] mlxsw: small driver update

2015-08-27 Thread Jiri Pirko
From: Jiri Pirko j...@mellanox.com

Ido Schimmel (2):
  mlxsw: Remove duplicate included header
  mlxsw: Make mailboxes 4KB aligned

Jiri Pirko (1):
  mlxsw: adjust transmit fail log message level in __mlxsw_emad_transmit

 drivers/net/ethernet/mellanox/mlxsw/core.c |  5 +-
 drivers/net/ethernet/mellanox/mlxsw/pci.c  | 83 ++
 2 files changed, 52 insertions(+), 36 deletions(-)

-- 
1.9.3



[patch net-next v2 1/3] mlxsw: Remove duplicate included header

2015-08-27 Thread Jiri Pirko
From: Ido Schimmel ido...@mellanox.com

Signed-off-by: Ido Schimmel ido...@mellanox.com
Signed-off-by: Jiri Pirko j...@mellanox.com
Signed-off-by: Elad Raz el...@mellanox.com
---
 drivers/net/ethernet/mellanox/mlxsw/core.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c
index 09325b7..0415ff6 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -48,7 +48,6 @@
 #include <linux/skbuff.h>
 #include <linux/etherdevice.h>
 #include <linux/types.h>
-#include <linux/wait.h>
 #include <linux/string.h>
 #include <linux/gfp.h>
 #include <linux/random.h>
-- 
1.9.3



Re: [PATCH v5 net-next 7/8] geneve: Consolidate Geneve functionality in single module.

2015-08-27 Thread Jesse Gross
On Wed, Aug 26, 2015 at 11:46 PM, Pravin B Shelar pshe...@nicira.com wrote:
 geneve_core module handles send and receive functionality.
 This way OVS could use the Geneve API. Now with use of
 tunnel metadata mode OVS can directly use Geneve netdevice.
 So there is no need for separate module for Geneve. Following
 patch consolidates Geneve protocol processing in single module.

 Signed-off-by: Pravin B Shelar pshe...@nicira.com

Reviewed-by: Jesse Gross je...@nicira.com


Re: ip_rcv_finish() NULL pointer and possibly related Oopses

2015-08-27 Thread Michael Marineau
On Thu, Aug 27, 2015 at 9:40 AM, David Miller da...@davemloft.net wrote:
 From: Michael Marineau michael.marin...@coreos.com
 Date: Thu, 27 Aug 2015 09:16:06 -0700

 On Thu, Aug 27, 2015 at 6:00 AM, Eric Dumazet eric.duma...@gmail.com wrote:
 Make sure you backported commit
 10e2eb878f3ca07ac2f05fa5ca5e6c4c9174a27a
 (udp: fix dst races with multicast early demux)

 Oh, interesting. Looks like that patch didn't get CC'd to stable
 though, is there a reason for that or just oversight?

 All networking bug fixes are submitted to -stable by hand by me at a
 time of my choosing.  We do not use the CC: stable facility, as I
 feel it pushes patches into -stable way too quickly and before the
 change gets sufficient exposure for regressions in Linus's tree.

 The patch in question got submitted last night.

Great, thank you!


[RFC v3] netlink: add NETLINK_CAP_ACK socket option

2015-08-27 Thread Christophe Ricard
Since commit c05cdb1b864f ("netlink: allow large data transfers from
user-space"), the kernel may fail to allocate the necessary room for the
acknowledgment message back to userspace. This patch introduces a new
socket option that trims off the payload of the original netlink message.

The netlink message header is still included, so the user can guess from
the sequence number what is the message that has triggered the
acknowledgment.

Cc: sta...@vger.kernel.org
Signed-off-by: Pablo Neira Ayuso pa...@netfilter.org
Signed-off-by: Christophe Ricard christophe-h.ric...@st.com
---
 include/uapi/linux/netlink.h |  1 +
 net/netlink/af_netlink.c | 27 ---
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/netlink.h b/include/uapi/linux/netlink.h
index cf6a65c..6f3fe16 100644
--- a/include/uapi/linux/netlink.h
+++ b/include/uapi/linux/netlink.h
@@ -110,6 +110,7 @@ struct nlmsgerr {
 #define NETLINK_TX_RING			7
 #define NETLINK_LISTEN_ALL_NSID		8
 #define NETLINK_LIST_MEMBERSHIPS	9
+#define NETLINK_CAP_ACK			10
 
 struct nl_pktinfo {
 	__u32	group;
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 67d2104..131d1a4 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -84,6 +84,7 @@ struct listeners {
 #define NETLINK_F_BROADCAST_SEND_ERROR	0x4
 #define NETLINK_F_RECV_NO_ENOBUFS	0x8
 #define NETLINK_F_LISTEN_ALL_NSID	0x10
+#define NETLINK_F_CAP_ACK		0x20
 
 static inline int netlink_is_kernel(struct sock *sk)
 {
@@ -2258,6 +2259,13 @@ static int netlink_setsockopt(struct socket *sock, int level, int optname,
 			nlk->flags &= ~NETLINK_F_LISTEN_ALL_NSID;
 		err = 0;
 		break;
+	case NETLINK_CAP_ACK:
+		if (val)
+			nlk->flags |= NETLINK_F_CAP_ACK;
+		else
+			nlk->flags &= ~NETLINK_F_CAP_ACK;
+		err = 0;
+		break;
 	default:
 		err = -ENOPROTOOPT;
 	}
@@ -2332,6 +2340,16 @@ static int netlink_getsockopt(struct socket *sock, int level, int optname,
 		netlink_table_ungrab();
 		break;
 	}
+	case NETLINK_CAP_ACK:
+		if (len < sizeof(int))
+			return -EINVAL;
+		len = sizeof(int);
+		val = nlk->flags & NETLINK_F_CAP_ACK ? 1 : 0;
+		if (put_user(len, optlen) ||
+		    put_user(val, optval))
+			return -EFAULT;
+		err = 0;
+		break;
 	default:
 		err = -ENOPROTOOPT;
 	}
@@ -2873,9 +2891,12 @@ void netlink_ack(struct sk_buff *in_skb, struct nlmsghdr *nlh, int err)
 	struct nlmsghdr *rep;
 	struct nlmsgerr *errmsg;
 	size_t payload = sizeof(*errmsg);
+	struct netlink_sock *nlk = nlk_sk(NETLINK_CB(in_skb).sk);
 
-	/* error messages get the original request appened */
-	if (err)
+	/* Error messages get the original request appened, unless the user
+	 * requests to cap the error message.
+	 */
+	if (!(nlk->flags & NETLINK_F_CAP_ACK) && err)
 		payload += nlmsg_len(nlh);
 
 	skb = netlink_alloc_skb(in_skb->sk, nlmsg_total_size(payload),
@@ -2898,7 +2919,7 @@ void netlink_ack(struct sk_buff *in_skb, struct nlmsghdr *nlh, int err)
 			  NLMSG_ERROR, payload, 0);
 	errmsg = nlmsg_data(rep);
 	errmsg->error = err;
-	memcpy(&errmsg->msg, nlh, err ? nlh->nlmsg_len : sizeof(*nlh));
+	memcpy(&errmsg->msg, nlh, payload > sizeof(*errmsg) ? nlh->nlmsg_len : sizeof(*nlh));
 	netlink_unicast(in_skb->sk, skb, NETLINK_CB(in_skb).portid, MSG_DONTWAIT);
 }
 EXPORT_SYMBOL(netlink_ack);
-- 
2.1.4
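
The size logic in netlink_ack() above can be modelled in isolation. This is a sketch only: model_ack_payload and the MODEL_* constants are hypothetical names, and the header size of 16 bytes is an assumption matching struct nlmsghdr on Linux.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed size of struct nlmsghdr (u32 + u16 + u16 + u32 + u32). */
#define MODEL_NLMSGHDR_LEN	16
/* An error message carries the error code plus the request header. */
#define MODEL_ERRMSG_LEN	(sizeof(int) + MODEL_NLMSGHDR_LEN)

/*
 * Payload size of the ACK. The original request payload is appended
 * only for errors, and only when the socket did not ask for capped
 * ACKs.
 */
static size_t model_ack_payload(int err, bool cap_ack,
				size_t request_payload_len)
{
	size_t payload = MODEL_ERRMSG_LEN;

	if (!cap_ack && err)
		payload += request_payload_len;

	return payload;
}
```

With the option set, the ACK size no longer depends on the size of the request, so a large request cannot make the acknowledgment allocation fail. Per the setsockopt hunk above, userspace would enable this with the NETLINK_CAP_ACK option on the netlink socket.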



Re: [PATCH] bgmac: support up to 3 cores (devices) on a bus

2015-08-27 Thread David Miller
From: Rafał Miłecki zaj...@gmail.com
Date: Wed, 26 Aug 2015 17:53:45 +0200

 Broadcom buses may have more than 1 Ethernet device. This is used e.g.
 to have few interfaces connected to different switch ports. So far we
 saw chipsets with only 2 devices (e.g. BCM4706) but recent ones have
 up to 3 (e.g. Netgear R8000 uses 3rd interface for most of switch
 traffic, lower interfaces are for some kind of offloading).
 
 Signed-off-by: Rafał Miłecki zaj...@gmail.com

Applied to net-next, thanks.


Re: [PATCH net-next] cxgb4: continue in debug mode, if probe fails

2015-08-27 Thread Thadeu Lima de Souza Cascardo
On Wed, Aug 26, 2015 at 10:30:35PM +0530, Hariprasad Shenai wrote:
 If adapter is flashed with incorrect firmware, probe can fail.
 If probe fails, continue in debug mode, so one can also use the debug
 interface to update the firmware via ethtool.
 
 Signed-off-by: Hariprasad Shenai haripra...@chelsio.com

What do you mean by incorrect firmware? I know the driver can cope with older
firmware if force_old_init is used, for example. Isn't it possible to detect
those old firmware versions and do the same as force_old_init does?

Cascardo.


[RFC v3] netlink_ack: send a capped message in case of error

2015-08-27 Thread Christophe Ricard
Hi,

After Jiri Benc's feedback on my second proposal, please find a reworked patch,
still based on Pablo Neira Ayuso's proposal.

While working on this patch, I found that the sender's socket is saved by
netlink_unicast_kernel in NETLINK_CB(skb).sk.
This information now spares me from looking up the socket for every ack.

Also, I believe it would be good to get this into stable, as it is somewhat of
a bug fix.

Do you have any other comments?

Best Regards
Christophe

Christophe Ricard (1):
  netlink: add NETLINK_CAP_ACK socket option

 include/uapi/linux/netlink.h |  1 +
 net/netlink/af_netlink.c | 27 ---
 2 files changed, 25 insertions(+), 3 deletions(-)

-- 
2.1.4



Re: [PATCH net-next] route: fix breakage after moving lwtunnel state

2015-08-27 Thread David Miller
From: Jiri Benc jb...@redhat.com
Date: Wed, 26 Aug 2015 18:19:26 +0200

 Please let me know if you disagree with my analysis above.

I agree with your analysis, thanks for taking the time to investigate.

As for the 40% degradation, as Thomas mentioned it's because of the
extra FIB lookups.


Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-27 Thread David Miller
From: Raghavendra K T raghavendra...@linux.vnet.ibm.com
Date: Wed, 26 Aug 2015 23:07:33 +0530

 @@ -4641,10 +4647,12 @@ static inline void __snmp6_fill_stats64(u64 *stats, 
 void __percpu *mib,
  static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int 
 attrtype,
int bytes)
  {
 + u64 buff[IPSTATS_MIB_MAX] = {0,};
 +
   switch (attrtype) {
   case IFLA_INET6_STATS:
 - __snmp6_fill_stats64(stats, idev->stats.ipv6,

I would suggest using an explicit memset() here, it makes the overhead incurred
by this scheme clearer.

Thanks.
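
To illustrate the suggestion, here is a small userspace sketch of a fill routine that zeroes its scratch buffer with an explicit memset() rather than an initializer. The names fill_stats_sketch() and MIB_MAX_SKETCH are made up for this note, and the flat per-CPU array is an assumption standing in for the kernel's percpu data.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MIB_MAX_SKETCH 36	/* placeholder size, not the kernel's IPSTATS_MIB_MAX */

/*
 * Hypothetical fill routine: percpu holds ncpus consecutive rows of
 * MIB_MAX_SKETCH counters. The scratch buffer is cleared explicitly,
 * so the cost of zeroing it on every call is visible in the code.
 */
static void fill_stats_sketch(uint64_t *stats, const uint64_t *percpu, int ncpus)
{
	uint64_t buff[MIB_MAX_SKETCH];
	int cpu, i;

	assert(stats != NULL && percpu != NULL && ncpus >= 0);
	memset(buff, 0, sizeof(buff));
	for (cpu = 0; cpu < ncpus; cpu++)
		for (i = 0; i < MIB_MAX_SKETCH; i++)
			buff[i] += percpu[cpu * MIB_MAX_SKETCH + i];
	memcpy(stats, buff, sizeof(buff));
}
```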


Re: [PATCH net-next] route: fix breakage after moving lwtunnel state

2015-08-27 Thread Tom Herbert
On Wed, Aug 26, 2015 at 3:13 PM, Thomas Graf tg...@suug.ch wrote:
 On 08/26/15 at 06:19pm, Jiri Benc wrote:
 might be a noise. However, there's definitely room for performance
 improvement here, the lwtunnel vxlan throughput is at about ~40% of the
 non-vxlan throughput. I did not spend too much time on analyzing this, yet,
 but it's clear the dst_entry layout is not our biggest concern here.

 I'm currently working on reducing the overhead for VXLAN and Gre and
 effectively Geneve once Pravin's work is in. The main disadvantage
 of lwt based flow tunneling is the additional fib_lookup() performed
 for each packet. It seems tempting to cache the tunnel endpoint dst in
 the lwt state of the overlay route. It will usually point to the same
 dst for every packet. The cache behaviour is dependent on there being no
 fib rules and on the route being a single nexthop route.

Or set nexthop appropriately. This is what we do for ILA. Works great
without any other dst references, but might put too much weight on the
administrator to configure nexthop per encapsulating destination.

Tom

 Did you test with a card that features UDP encapsulation offloads?


Re: [PATCH net-next] cxgb4: continue in debug mode, if probe fails

2015-08-27 Thread David Miller
From: Hariprasad Shenai haripra...@chelsio.com
Date: Wed, 26 Aug 2015 22:30:35 +0530

 If adapter is flashed with incorrect firmware, probe can fail.
 If probe fails, continue in debug mode, so one can also use the debug
 interface to update the firmware via ethtool.
 
 Signed-off-by: Hariprasad Shenai haripra...@chelsio.com

If the init fails, there are all of these software data structures that
have not been allocated at all, which ->open() blindly assumes it can
dereference.

Sorry, you're going to have to do a lot more work than this to allow
this situation to proceed.


Re: [PATCHv6 net-next 00/10] OVS conntrack support

2015-08-27 Thread David Miller
From: Joe Stringer joestrin...@nicira.com
Date: Wed, 26 Aug 2015 11:31:43 -0700

 The goal of this series is to allow OVS to send packets through the Linux
 kernel connection tracker, and subsequently match on fields populated by
 conntrack. This functionality is enabled through a new
 CONFIG_OPENVSWITCH_CONNTRACK option.
 
 This version addresses the feedback from v5, primarily checking the behaviour
 is correct with different configurations such as disabling
 CONFIG_OPENVSWITCH_CONNTRACK or disabling individual conntrack features like
 connlabels.
 
 The branch below has been updated with the corresponding userspace pieces:
 https://github.com/joestringer/ovs dev/ct_20150818

Series applied, thanks.


[net-next:master 1369/1375] net/openvswitch/actions.c:705:16: error: implicit declaration of function 'nf_get_ipv6_ops'

2015-08-27 Thread kbuild test robot
tree:   git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head:   21c721fd0b991b1871ea5dd517be1b5375c5f8f7
commit: 7f8a436eaa2c3ddd8e1ff2fbca267e6275085536 [1369/1375] openvswitch: Add 
conntrack action
config: i386-randconfig-i0-201534 (attached as .config)
reproduce:
  git checkout 7f8a436eaa2c3ddd8e1ff2fbca267e6275085536
  # save the attached .config to linux build tree
  make ARCH=i386 

All error/warnings (new ones prefixed by >>):

   net/openvswitch/actions.c: In function 'ovs_fragment':
>> net/openvswitch/actions.c:705:16: error: implicit declaration of function 'nf_get_ipv6_ops' [-Werror=implicit-function-declaration]
  const struct nf_ipv6_ops *v6ops = nf_get_ipv6_ops();
   ^
>> net/openvswitch/actions.c:705:37: warning: initialization makes pointer from integer without a cast
  const struct nf_ipv6_ops *v6ops = nf_get_ipv6_ops();
^
>> net/openvswitch/actions.c:707:19: error: storage size of 'ovs_rt' isn't known
  struct rt6_info ovs_rt;
  ^
>> net/openvswitch/actions.c:724:8: error: dereferencing pointer to incomplete type
  v6ops->fragment(skb->sk, skb, ovs_vport_output);
   ^
>> net/openvswitch/actions.c:707:19: warning: unused variable 'ovs_rt' [-Wunused-variable]
  struct rt6_info ovs_rt;
  ^
   cc1: some warnings being treated as errors

vim +/nf_get_ipv6_ops +705 net/openvswitch/actions.c

   699  skb_dst_set_noref(skb, &ovs_dst);
   700  IPCB(skb)->frag_max_size = mru;
   701  
   702  ip_do_fragment(skb->sk, skb, ovs_vport_output);
   703  refdst_drop(orig_dst);
   704  } else if (ethertype == htons(ETH_P_IPV6)) {
 > 705  const struct nf_ipv6_ops *v6ops = nf_get_ipv6_ops();
   706  unsigned long orig_dst;
 > 707  struct rt6_info ovs_rt;
   708  
   709  if (!v6ops) {
   710  kfree_skb(skb);
   711  return;
   712  }
   713  
   714  prepare_frag(vport, skb);
   715  memset(&ovs_rt, 0, sizeof(ovs_rt));
   716  dst_init(&ovs_rt.dst, &ovs_dst_ops, NULL, 1,
   717   DST_OBSOLETE_NONE, DST_NOCOUNT);
   718  ovs_rt.dst.dev = vport->dev;
   719  
   720  orig_dst = skb->_skb_refdst;
   721  skb_dst_set_noref(skb, &ovs_rt.dst);
   722  IP6CB(skb)->frag_max_size = mru;
   723  
 > 724  v6ops->fragment(skb->sk, skb, ovs_vport_output);
   725  refdst_drop(orig_dst);
   726  } else {
   727  WARN_ONCE(1, "Failed fragment ->%s: eth=%04x, MRU=%d, MTU=%d.",

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation
#
# Automatically generated file; DO NOT EDIT.
# Linux/i386 4.2.0-rc7 Kernel Configuration
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_PERF_EVENTS_INTEL_UNCORE=y
CONFIG_OUTPUT_FORMAT=elf32-i386
CONFIG_ARCH_DEFCONFIG=arch/x86/configs/i386_defconfig
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_X86_32_LAZY_GS=y
CONFIG_ARCH_HWEIGHT_CFLAGS=-fcall-saved-ecx -fcall-saved-edx
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_PGTABLE_LEVELS=3
CONFIG_DEFCONFIG_LIST=/lib/modules/$UNAME_RELEASE/.config
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_BROKEN_ON_SMP=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=
CONFIG_LOCALVERSION_AUTO=y
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
# CONFIG_KERNEL_GZIP is not set
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
CONFIG_KERNEL_LZ4=y
CONFIG_DEFAULT_HOSTNAME=(none)
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
# CONFIG_POSIX_MQUEUE is not set
# CONFIG_CROSS_MEMORY_ATTACH is not set
CONFIG_FHANDLE=y
CONFIG_USELIB=y
# CONFIG_AUDIT is not set
CONFIG_HAVE_ARCH_AUDITSYSCALL=y

#
# IRQ 

[PATCH net-next 2/2] net/mlx4_core: Fix unintialized variable used in error path

2015-08-27 Thread clsoto
From: Carol L Soto cls...@linux.vnet.ibm.com

The uninitialized value name in mlx4_en_activate_cq was used in order
to print an error message. Fixing it by replacing it with cq->vector.

Signed-off-by: Matan Barak mat...@mellanox.com
Signed-off-by: Carol L Soto cls...@linux.vnet.ibm.com
---
 drivers/net/ethernet/mellanox/mlx4/en_cq.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_cq.c 
b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
index 63769df..a1918e2 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
@@ -100,7 +100,6 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct 
mlx4_en_cq *cq,
 {
struct mlx4_en_dev *mdev = priv->mdev;
int err = 0;
-   char name[25];
int timestamp_en = 0;
bool assigned_eq = false;
 
@@ -119,8 +118,8 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct 
mlx4_en_cq *cq,
err = mlx4_assign_eq(mdev->dev, priv->port,
 &cq->vector);
if (err) {
-   mlx4_err(mdev, "Failed assigning an EQ to %s\n",
-name);
+   mlx4_err(mdev, "Failed assigning an EQ to CQ vector %d\n",
+cq->vector);
goto free_eq;
}
 
-- 
1.8.3.1



[PATCH net-next 1/2] net/mlx4_core: Capping number of requested MSIXs to MAX_MSIX

2015-08-27 Thread clsoto
From: Carol L Soto cls...@linux.vnet.ibm.com

We currently manage IRQs in pool_bm which is a bit field
of MAX_MSIX bits. Thus, allocating more than MAX_MSIX
interrupts can't be managed in pool_bm.
Fixing this by capping number of requested MSIXs to
MAX_MSIX.

Signed-off-by: Matan Barak mat...@mellanox.com
Signed-off-by: Carol L Soto cls...@linux.vnet.ibm.com
---
 drivers/net/ethernet/mellanox/mlx4/main.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c 
b/drivers/net/ethernet/mellanox/mlx4/main.c
index 121c579..006757f 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -2669,9 +2669,14 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
 
if (msi_x) {
int nreq = dev->caps.num_ports * num_online_cpus() + 1;
+   bool shared_ports = false;
 
nreq = min_t(int, dev->caps.num_eqs - dev->caps.reserved_eqs,
 nreq);
+   if (nreq > MAX_MSIX) {
+   nreq = MAX_MSIX;
+   shared_ports = true;
+   }
 
entries = kcalloc(nreq, sizeof *entries, GFP_KERNEL);
if (!entries)
@@ -2694,6 +2699,9 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
bitmap_zero(priv->eq_table.eq[MLX4_EQ_ASYNC].actv_ports.ports,
dev->caps.num_ports);
 
+   if (MLX4_IS_LEGACY_EQ_MODE(dev->caps))
+   shared_ports = true;
+
for (i = 0; i < dev->caps.num_comp_vectors + 1; i++) {
if (i == MLX4_EQ_ASYNC)
continue;
@@ -2701,7 +2709,7 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
priv->eq_table.eq[i].irq =
entries[i + 1 - !!(i > MLX4_EQ_ASYNC)].vector;
 
-   if (MLX4_IS_LEGACY_EQ_MODE(dev->caps)) {
+   if (shared_ports) {

bitmap_fill(priv->eq_table.eq[i].actv_ports.ports,
dev->caps.num_ports);
/* We don't set affinity hint when there
-- 
1.8.3.1



[net-next PATCH 4/4] net: sched: simplify attach_one_default_qdisc()

2015-08-27 Thread Phil Sutter
Now that noqueue qdisc can be attached just like any other qdisc, no
special treatment is necessary anymore when attaching it as default
qdisc.

This change has the added benefit that 'tc qdisc show' prints noqueue
instead of nothing for devices defaulting to noqueue.

Signed-off-by: Phil Sutter p...@nwl.cc
---
 net/sched/sch_generic.c | 41 -
 1 file changed, 12 insertions(+), 29 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index d5c7c0d..cb5d4ad 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -435,24 +435,6 @@ struct Qdisc_ops noqueue_qdisc_ops __read_mostly = {
.owner  =   THIS_MODULE,
 };
 
-static struct Qdisc noqueue_qdisc;
-static struct netdev_queue noqueue_netdev_queue = {
-   .qdisc  =   &noqueue_qdisc,
-   .qdisc_sleeping =   &noqueue_qdisc,
-};
-
-static struct Qdisc noqueue_qdisc = {
-   .enqueue=   NULL,
-   .dequeue=   noop_dequeue,
-   .flags  =   TCQ_F_BUILTIN,
-   .ops=   &noqueue_qdisc_ops,
-   .list   =   LIST_HEAD_INIT(noqueue_qdisc.list),
-   .q.lock =   __SPIN_LOCK_UNLOCKED(noqueue_qdisc.q.lock),
-   .dev_queue  =   &noqueue_netdev_queue,
-   .busylock   =   __SPIN_LOCK_UNLOCKED(noqueue_qdisc.busylock),
-};
-
-
 static const u8 prio2band[TC_PRIO_MAX + 1] = {
1, 2, 2, 2, 1, 2, 0, 0 , 1, 1, 1, 1, 1, 1, 1, 1
 };
@@ -743,18 +725,19 @@ static void attach_one_default_qdisc(struct net_device 
*dev,
 struct netdev_queue *dev_queue,
 void *_unused)
 {
-   struct Qdisc *qdisc = &noqueue_qdisc;
+   struct Qdisc *qdisc;
+   const struct Qdisc_ops *ops = default_qdisc_ops;
 
-   if (!(dev->priv_flags & IFF_NO_QUEUE)) {
-   qdisc = qdisc_create_dflt(dev_queue,
- default_qdisc_ops, TC_H_ROOT);
-   if (!qdisc) {
-   netdev_info(dev, "activation failed\n");
-   return;
-   }
-   if (!netif_is_multiqueue(dev))
-   qdisc->flags |= TCQ_F_ONETXQUEUE;
+   if (dev->priv_flags & IFF_NO_QUEUE)
+   ops = &noqueue_qdisc_ops;
+
+   qdisc = qdisc_create_dflt(dev_queue, ops, TC_H_ROOT);
+   if (!qdisc) {
+   netdev_info(dev, "activation failed\n");
+   return;
}
+   if (!netif_is_multiqueue(dev))
+   qdisc->flags |= TCQ_F_ONETXQUEUE;
dev_queue->qdisc_sleeping = qdisc;
 }
 
@@ -790,7 +773,7 @@ static void transition_one_qdisc(struct net_device *dev,
clear_bit(__QDISC_STATE_DEACTIVATED, &new_qdisc->state);
 
rcu_assign_pointer(dev_queue->qdisc, new_qdisc);
-   if (need_watchdog_p && new_qdisc != &noqueue_qdisc) {
+   if (need_watchdog_p) {
dev_queue->trans_start = 0;
*need_watchdog_p = 1;
}
-- 
2.1.2



[net-next PATCH 3/4] net: sched: register noqueue qdisc

2015-08-27 Thread Phil Sutter
This way users can attach noqueue just like any other qdisc using tc
without having to mess with tx_queue_len first.

Signed-off-by: Phil Sutter p...@nwl.cc
---
 include/net/sch_generic.h |  1 +
 net/sched/sch_api.c   |  1 +
 net/sched/sch_generic.c   | 12 +++-
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 2eab08c..444faa8 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -340,6 +340,7 @@ extern struct Qdisc noop_qdisc;
 extern struct Qdisc_ops noop_qdisc_ops;
 extern struct Qdisc_ops pfifo_fast_ops;
 extern struct Qdisc_ops mq_qdisc_ops;
+extern struct Qdisc_ops noqueue_qdisc_ops;
 extern const struct Qdisc_ops *default_qdisc_ops;
 
 struct Qdisc_class_common {
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index f06aa01..6a35551 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1947,6 +1947,7 @@ static int __init pktsched_init(void)
register_qdisc(&bfifo_qdisc_ops);
register_qdisc(&pfifo_head_drop_qdisc_ops);
register_qdisc(&mq_qdisc_ops);
+   register_qdisc(&noqueue_qdisc_ops);
 
rtnl_register(PF_UNSPEC, RTM_NEWQDISC, tc_modify_qdisc, NULL, NULL);
rtnl_register(PF_UNSPEC, RTM_DELQDISC, tc_get_qdisc, NULL, NULL);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index f501b74..d5c7c0d 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -416,9 +416,19 @@ struct Qdisc noop_qdisc = {
 };
 EXPORT_SYMBOL(noop_qdisc);
 
-static struct Qdisc_ops noqueue_qdisc_ops __read_mostly = {
+static int noqueue_init(struct Qdisc *qdisc, struct nlattr *opt)
+{
+   /* register_qdisc() assigns a default of noop_enqueue if unset,
+* but __dev_queue_xmit() treats noqueue only as such
+* if this is NULL - so clear it here. */
+   qdisc->enqueue = NULL;
+   return 0;
+}
+
+struct Qdisc_ops noqueue_qdisc_ops __read_mostly = {
.id =   "noqueue",
.priv_size  =   0,
+   .init   =   noqueue_init,
.enqueue=   noop_enqueue,
.dequeue=   noop_dequeue,
.peek   =   noop_dequeue,
-- 
2.1.2



[net-next PATCH 0/4] fixup IFF_NO_QUEUE conversion

2015-08-27 Thread Phil Sutter
This series serves two purposes:

On one hand it fixes a quite embarrassing bug around the warning I added for
drivers still setting tx_queue_len = 0 to achieve noqueue operation. It turned
out to be quite useless as due to using alloc_netdev(), many in-kernel drivers
fell into the trap by accident, as well. Instead this place serves pretty well
as a sanitizing point to set IFF_NO_QUEUE for drivers not initializing
tx_queue_len, which in turn allows to drop all special treatment of the latter
being zero since that can not happen anymore without IFF_NO_QUEUE being set.

On the other hand, it provides a better solution for Eric Dumazet's concern
regarding how to assign noqueue to an interface which does not default to it
already. In order to make this possible, noqueue is being registered so users
can 'tc qd add dev eth0 root noqueue'. In addition, it resolves the ugly
situation of 'tc qd show' not showing noqueue. Finally, the former changes
allow for some code cleanup.

Phil Sutter (4):
  net: fix IFF_NO_QUEUE for drivers using alloc_netdev
  net: sched: ignore tx_queue_len when assigning default qdisc
  net: sched: register noqueue qdisc
  net: sched: simplify attach_one_default_qdisc()

 include/net/sch_generic.h |  1 +
 net/core/dev.c|  2 +-
 net/sched/sch_api.c   |  1 +
 net/sched/sch_generic.c   | 54 ---
 4 files changed, 26 insertions(+), 32 deletions(-)

-- 
2.1.2



[net-next PATCH 2/4] net: sched: ignore tx_queue_len when assigning default qdisc

2015-08-27 Thread Phil Sutter
Since alloc_netdev_mqs() sets IFF_NO_QUEUE for drivers not initializing
tx_queue_len, it is safe to assume that if tx_queue_len is zero,
dev->priv_flags always contains IFF_NO_QUEUE.

Signed-off-by: Phil Sutter p...@nwl.cc
---
 net/sched/sch_generic.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 942fea8..f501b74 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -735,7 +735,7 @@ static void attach_one_default_qdisc(struct net_device *dev,
 {
struct Qdisc *qdisc = &noqueue_qdisc;
 
-   if (dev->tx_queue_len && !(dev->priv_flags & IFF_NO_QUEUE)) {
+   if (!(dev->priv_flags & IFF_NO_QUEUE)) {
qdisc = qdisc_create_dflt(dev_queue,
  default_qdisc_ops, TC_H_ROOT);
if (!qdisc) {
@@ -756,7 +756,6 @@ static void attach_default_qdiscs(struct net_device *dev)
txq = netdev_get_tx_queue(dev, 0);
 
if (!netif_is_multiqueue(dev) ||
-   dev->tx_queue_len == 0 ||
dev->priv_flags & IFF_NO_QUEUE) {
netdev_for_each_tx_queue(dev, attach_one_default_qdisc, NULL);
dev->qdisc = txq->qdisc_sleeping;
-- 
2.1.2



[net-next PATCH 1/4] net: fix IFF_NO_QUEUE for drivers using alloc_netdev

2015-08-27 Thread Phil Sutter
Printing a warning in alloc_netdev_mqs() if tx_queue_len is zero and
IFF_NO_QUEUE not set is not appropriate since drivers may use one of the
alloc_netdev* macros instead of alloc_etherdev*, thereby not
intentionally leaving tx_queue_len uninitialized. Instead check here if
tx_queue_len is zero and set IFF_NO_QUEUE, so the value of tx_queue_len
can be ignored in net/sched/sch_generic.c.

Fixes: 906470c ("net: warn if drivers set tx_queue_len = 0")
Signed-off-by: Phil Sutter p...@nwl.cc
---
 net/core/dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index b1f3f48..68156ef 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6998,7 +6998,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, 
const char *name,
setup(dev);
 
if (!dev->tx_queue_len)
-   printk(KERN_WARNING "%s uses DEPRECATED zero tx_queue_len - convert driver to use IFF_NO_QUEUE instead.\n", name);
+   dev->priv_flags |= IFF_NO_QUEUE;
 
dev->num_tx_queues = txqs;
dev->real_num_tx_queues = txqs;
-- 
2.1.2



Re: [PATCH net v4] sctp: asconf's process should verify address parameter is in the beginning

2015-08-27 Thread David Miller
From: Xin Long lucien@gmail.com
Date: Thu, 27 Aug 2015 16:26:34 +0800

 in sctp_process_asconf(), we get address parameter from the beginning of
 the addip params. but we never check if it's really there. if the addr
 param is not there, it still can pass sctp_verify_asconf(), then to be
 handled by sctp_process_asconf(), it will not be safe.
 
 so add a code in sctp_verify_asconf() to check the address parameter is in
 the beginning, or return false to send abort.
 
 note that this can also detect multiple address parameters, and reject it.
 
 Signed-off-by: Xin Long lucien@gmail.com
 Signed-off-by: Marcelo Ricardo Leitner mleit...@redhat.com

Applied, thanks.
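
For anyone reading along, the check described in the commit message can be sketched in userspace roughly as follows. This is not the kernel's sctp_verify_asconf(): asconf_params_ok() is a hypothetical helper, it looks only at the parameter area (type/length pairs in network byte order, padded to 4 bytes), and the only values taken from the SCTP specification are the IPv4 and IPv6 address parameter types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PARAM_IPV4_ADDR 5	/* IPv4 Address parameter type (RFC 4960) */
#define PARAM_IPV6_ADDR 6	/* IPv6 Address parameter type (RFC 4960) */

/*
 * Hypothetical check: walk the TLV parameters and accept the buffer only
 * if the first parameter is an address parameter and no later one is.
 */
static bool asconf_params_ok(const uint8_t *buf, size_t len)
{
	size_t off = 0;
	bool first = true;

	assert(buf != NULL || len == 0);
	while (off + 4 <= len) {
		uint16_t type = (uint16_t)((buf[off] << 8) | buf[off + 1]);
		uint16_t plen = (uint16_t)((buf[off + 2] << 8) | buf[off + 3]);
		bool is_addr = type == PARAM_IPV4_ADDR || type == PARAM_IPV6_ADDR;

		if (plen < 4 || plen > len - off)
			return false;	/* malformed or truncated parameter */
		if (first != is_addr)
			return false;	/* address missing up front, or repeated later */
		first = false;
		off += ((size_t)plen + 3) & ~(size_t)3;	/* skip padding too */
	}
	return !first && off >= len;
}
```

The single comparison first != is_addr covers both failure modes mentioned in the commit message: a missing leading address parameter and a second address parameter further down.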


Re: [PATCH net-next] route: fix breakage after moving lwtunnel state

2015-08-27 Thread Thomas Graf
On 08/27/15 at 12:47pm, Tom Herbert wrote:
 On Wed, Aug 26, 2015 at 3:13 PM, Thomas Graf tg...@suug.ch wrote:
  On 08/26/15 at 06:19pm, Jiri Benc wrote:
  might be a noise. However, there's definitely room for performance
  improvement here, the lwtunnel vxlan throughput is at about ~40% of the
  non-vxlan throughput. I did not spend too much time on analyzing this, yet,
  but it's clear the dst_entry layout is not our biggest concern here.
 
  I'm currently working on reducing the overhead for VXLAN and Gre and
  effectively Geneve once Pravin's work is in. The main disadvantage
  of lwt based flow tunneling is the additional fib_lookup() performed
  for each packet. It seems tempting to cache the tunnel endpoint dst in
  the lwt state of the overlay route. It will usually point to the same
  dst for every packet. The cache behaviour is dependent on there being no
  fib rules and on the route being a single nexthop route.
 
 Or set nexthop appropriately. This is what we do for ILA. Works great
 without any other dst references, but might put too much weight on the
 administrator to configure nexthop per encapsulating destination.

I assume you mean something like this, right?

ip route [...] encap vxlan dst 10.1.1.1 dev eth0

The IP metadata encap at FIB level is currently encap agnostic
and requires an intermediate encap device which then defines the
actual encap protocol:

ip route overlay/prefix encap ip dst 10.1.1.1 dev vxlan0
ip route 10.1.1.1/prefix dev eth0

I like it because we don't have to embed all the options as metadata
and can still set them through the device. An option would also be
to allow for both and add the following alternative:

ip route overlay/prefix encap ip type vxlan dst 10.1.1.1 dev eth0


Re: [PATCH net-next] route: fix breakage after moving lwtunnel state

2015-08-27 Thread Tom Herbert
On Thu, Aug 27, 2015 at 2:00 PM, Thomas Graf tg...@suug.ch wrote:
 On 08/27/15 at 12:47pm, Tom Herbert wrote:
 On Wed, Aug 26, 2015 at 3:13 PM, Thomas Graf tg...@suug.ch wrote:
  On 08/26/15 at 06:19pm, Jiri Benc wrote:
  might be a noise. However, there's definitely room for performance
  improvement here, the lwtunnel vxlan throughput is at about ~40% of the
  non-vxlan throughput. I did not spend too much time on analyzing this, 
  yet,
  but it's clear the dst_entry layout is not our biggest concern here.
 
  I'm currently working on reducing the overhead for VXLAN and Gre and
  effectively Geneve once Pravin's work is in. The main disadvantage
  of lwt based flow tunneling is the additional fib_lookup() performed
  for each packet. It seems tempting to cache the tunnel endpoint dst in
  the lwt state of the overlay route. It will usually point to the same
  dst for every packet. The cache behaviour is dependent on there being no
  fib rules and on the route being a single nexthop route.
 
 Or set nexthop appropriately. This is what we do for ILA. Works great
 without any other dst references, but might put too much weight on the
 administrator to configure nexthop per encapsulating destination.

 I assume you mean something like this, right?

 ip route [...] encap vxlan dst 10.1.1.1 dev eth0

I'm doing:

ip route add :0:0:1::0:2:0/128 encap ila 2001:0:0:2 via
2401:db00:20:911a:face:0:27:0

so that 2401:db00:20:911a:face:0:27:0 is the next hop route for
destination 2001:0:0:2::0:2:0. The dst_output for lwt just calls
the original dest_output after transforming the packet without the use
of any additional routes. So in this way ILA LWT is just acting as a
pass-through packet transformation mechanism. Such a model might
have additional utility: LWT occurs before iptables so that iptables
sees the translated or encapsulated packet (davem mentioned this is
probably what we want), we may want to defer translation until IP
fragmentation (Roopa mentioned she needs this for MPLS).

 The IP metadata encap at FIB level is currently encap agnostic
 and requires an intermediate encap device which then defines the
 actual encap protocol:

 ip route overlay/prefix encap ip dst 10.1.1.1 dev vxlan0
 ip route 10.1.1.1/prefix dev eth0

But then you're outputting through another device, multiple routes are
involved, performance drops :-( Why not just set the route through
VXLAN in that case?

 I like it because we don't have to embed all the options as metadata
 and can still set them through the device. An option would also be
 to allow for both and add the following alternative:

 ip route overlay/prefix encap ip type vxlan dst 10.1.1.1 dev eth0

Better, we should be able to send encapsulated packets without needing a device.

Tom


Re: [PATCH net-next] net: sched: consolidate tc_classify{,_compat}

2015-08-27 Thread David Miller
From: Daniel Borkmann dan...@iogearbox.net
Date: Wed, 26 Aug 2015 23:00:06 +0200

 For classifiers getting invoked via tc_classify(), we always need an
 extra function call into tc_classify_compat(), as both are being
 exported as symbols and tc_classify() itself doesn't do much except
 handling of reclassifications when tp-classify() returned with
 TC_ACT_RECLASSIFY.
 
 CBQ and ATM are the only qdiscs that directly call into tc_classify_compat(),
 all others use tc_classify(). When tc actions are being configured
 out in the kernel, tc_classify() effectively does nothing besides
 delegating.
 
 We could spare this layer and consolidate both functions. pktgen on
 single CPU constantly pushing skbs directly into the netif_receive_skb()
 path with a dummy classifier on ingress qdisc attached, improves
 slightly from 22.3Mpps to 23.1Mpps.
 
 Signed-off-by: Daniel Borkmann dan...@iogearbox.net

Applied, thanks Daniel.


Re: [RFC PATCH v6 net-next 3/4] tcp: add in_flight to tcp_skb_cb

2015-08-27 Thread Yuchung Cheng
On Tue, Aug 25, 2015 at 4:33 PM, Lawrence Brakmo bra...@fb.com wrote:
 Add in_flight (bytes in flight when packet was sent) field
 to tx component of tcp_skb_cb and make it available to
 congestion modules' pkts_acked() function through the
 ack_sample function argument.

 Signed-off-by: Lawrence Brakmo bra...@fb.com
 ---
  include/net/tcp.h | 2 ++
  net/ipv4/tcp_input.c  | 5 -
  net/ipv4/tcp_output.c | 4 +++-
  3 files changed, 9 insertions(+), 2 deletions(-)

 diff --git a/include/net/tcp.h b/include/net/tcp.h
 index a086a98..cdd93e5 100644
 --- a/include/net/tcp.h
 +++ b/include/net/tcp.h
 @@ -757,6 +757,7 @@ struct tcp_skb_cb {
 union {
 struct {
 /* There is space for up to 20 bytes */
 +   __u32 in_flight;/* Bytes in flight when packet sent */
 } tx;   /* only used for outgoing skbs */
 union {
 struct inet_skb_parm h4;
 @@ -842,6 +843,7 @@ union tcp_cc_info;
  struct ack_sample {
 u32 pkts_acked;
 s32 rtt_us;
 +   u32 in_flight;
  };

  struct tcp_congestion_ops {
 diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
 index f506a0a..338e6bb 100644
 --- a/net/ipv4/tcp_input.c
 +++ b/net/ipv4/tcp_input.c
 @@ -3069,6 +3069,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int 
 prior_fackets,
 long ca_rtt_us = -1L;
 struct sk_buff *skb;
 u32 pkts_acked = 0;
 +   u32 last_in_flight = 0;
 bool rtt_update;
 int flag = 0;

 @@ -3108,6 +3109,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int 
 prior_fackets,
 if (!first_ackt.v64)
 first_ackt = last_ackt;

 +   last_in_flight = TCP_SKB_CB(skb)->tx.in_flight;
 reord = min(pkts_acked, reord);
 if (!after(scb->end_seq, tp->high_seq))
 flag |= FLAG_ORIG_SACK_ACKED;
 @@ -3197,7 +3199,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int 
 prior_fackets,
 }

 if (icsk->icsk_ca_ops->pkts_acked) {
 -   struct ack_sample sample = {pkts_acked, ca_rtt_us};
 +   struct ack_sample sample = {pkts_acked, ca_rtt_us,
 +   last_in_flight};

 icsk->icsk_ca_ops->pkts_acked(sk, &sample);
 }
 diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
 index 444ab5b..244d201 100644
 --- a/net/ipv4/tcp_output.c
 +++ b/net/ipv4/tcp_output.c
 @@ -920,9 +920,12 @@ static int tcp_transmit_skb(struct sock *sk, struct 
 sk_buff *skb, int clone_it,
 int err;

 BUG_ON(!skb || !tcp_skb_pcount(skb));
 +   tp = tcp_sk(sk);

 if (clone_it) {
 skb_mstamp_get(&skb->skb_mstamp);
 +   TCP_SKB_CB(skb)->tx.in_flight = TCP_SKB_CB(skb)->end_seq
 +   - tp->snd_una;
What if skb is a retransmitted packet? E.g. the first retransmission
in fast recovery would always record an in-flight value of 1 packet?
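
The concern can be made concrete with a small user-space sketch. It is not kernel code: in_flight_at_send() and recorded_in_flight() are hypothetical helpers that only reproduce the arithmetic of the expression in the hunk above (end_seq minus snd_una), using plain 32-bit unsigned subtraction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the expression in the patch:
 * TCP_SKB_CB(skb)->end_seq - tp->snd_una.
 * Unsigned subtraction wraps modulo 2^32, like TCP sequence space. */
static uint32_t in_flight_at_send(uint32_t end_seq, uint32_t snd_una)
{
	return end_seq - snd_una;
}

/* Value recorded for a segment of seg_len bytes that starts
 * skip bytes past snd_una. */
static uint32_t recorded_in_flight(uint32_t snd_una, uint32_t skip,
				   uint32_t seg_len)
{
	uint32_t end_seq = snd_una + skip + seg_len;

	assert(seg_len > 0);
	return in_flight_at_send(end_seq, snd_una);
}
```

With nine 1000-byte segments already outstanding, a new segment (skip = 9000) records 10000 bytes. A retransmission of the segment that starts at snd_una (skip = 0) records only its own 1000 bytes, no matter how much data is actually outstanding, which is the case the question above is about.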


 if (unlikely(skb_cloned(skb)))
 skb = pskb_copy(skb, gfp_mask);
 @@ -933,7 +936,6 @@ static int tcp_transmit_skb(struct sock *sk, struct 
 sk_buff *skb, int clone_it,
 }

 inet = inet_sk(sk);
 -   tp = tcp_sk(sk);
 tcb = TCP_SKB_CB(skb);
 memset(&opts, 0, sizeof(opts));

 --
 1.8.1



Re: [TRIVIAL PATCH V2] smsc9194: Remove uncompilable #if 0'd use of pr_dbg

2015-08-27 Thread David Miller
From: Joe Perches j...@perches.com
Date: Wed, 26 Aug 2015 11:49:35 -0700

 No pr_dbg method exists.
 
 While this code is #if 0'd, it'd be nicer to
 use the generic hex_dump, so use it instead.
 
 Signed-off-by: Joe Perches j...@perches.com

Applied to net-next, thanks Joe.

I don't know what to do with really old drivers.  People still use
them, some via qemu or whatever.

But yeah this particular case is likely unused by anyone.


Re: [PATCH net-next v2] bridge: vlan: allow to suppress local mac install for all vlans

2015-08-27 Thread Nikolay Aleksandrov

 On Aug 26, 2015, at 9:57 PM, roopa ro...@cumulusnetworks.com wrote:
 
 On 8/26/15, 4:33 AM, Nikolay Aleksandrov wrote:
 On Aug 25, 2015, at 11:06 PM, David Miller da...@davemloft.net wrote:
 
 From: Nikolay Aleksandrov niko...@cumulusnetworks.com
 Date: Tue, 25 Aug 2015 22:28:16 -0700
 
 Certainly, that should be done and I will look into it, but the
 essence of this patch is a bit different. The problem here is not
 the size of the fdb entries, it’s more the number of them - having
 96000 entries (even if they were 1 byte ones) is just way too much
 especially when the fdb hash size is small and static. We could work
 on making it dynamic though, but still these type of local entries
 per vlan per port can easily be avoided with this option.
 96000 bits can be stored in 12k.  Get where I'm going with this?
 
 Look at the problem sideways.
 Oh okay, I misunderstood your previous comment. I’ll look into that.
 
 I just wanted to add the other problems we have had with keeping these macs 
 (mostly from userspace POV):
 - add/del netlink notification storms
 - and large netlink dumps
 
 In addition to in-kernel optimizations, it would be nice to have a solution
 that reduces the burden on userspace. That will need a newer netlink dump
 format for fdbs. Considering all the changes needed, Nikolay's patch seems
 less intrusive.

Right, we need to take these into account as well. I’ll continue the
discussion on this (or restart it), because I looked into using a bitmap
for the local entries only, and while it fixes the scalability issue, it
introduces a few new ones. These mostly stem from the fact that the
entries now exist only without a vlan: if a new mac comes along which
matches one of them but is in a vlan, the entry will get created in
br_fdb_update() unless we add a second lookup, and that will slow down
the learning path.
Also, this change requires an update of every fdb function that uses the
vid as a key (every fdb function?!), because now we can have the mac in
two places instead of one. That is a pretty big churn with lots of
conditionals all over the place, and I don’t like it. Adding this
complexity for the local addresses only seems like overkill, so I think
I’ll drop this issue for now.
This patch (that works around the initial problem) also has these issues.
Note that a more straightforward way to take care of this would be to
give each entry some sort of bitmap (like Vlad tried earlier); then we
can combine the paths so that most of these issues disappear, but that
will not be easy, as was already commented earlier. I’ve looked briefly
into doing this with rhashtable so we can keep the memory footprint of
each entry relatively small, but it still affects the performance and we
can have thousands of resizes happening.
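
For reference, the "96000 bits can be stored in 12k" arithmetic quoted above can be sketched in user space as follows. This is only an illustration of the sizing, not the bridge code: struct local_bitmap, bitmap_set() and bitmap_test() are hypothetical names, and how an index maps to a (port, vlan) pair is left open.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define N_ENTRIES	96000u
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_WORDS	((N_ENTRIES + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* One bit per local entry instead of one fdb entry per local mac. */
struct local_bitmap {
	unsigned long words[BITMAP_WORDS];
};

static void bitmap_set(struct local_bitmap *bm, size_t idx)
{
	assert(idx < N_ENTRIES);
	bm->words[idx / BITS_PER_WORD] |= 1UL << (idx % BITS_PER_WORD);
}

static int bitmap_test(const struct local_bitmap *bm, size_t idx)
{
	assert(idx < N_ENTRIES);
	return (bm->words[idx / BITS_PER_WORD] >> (idx % BITS_PER_WORD)) & 1UL;
}

/* Total storage in bytes: 96000 / 8 = 12000 for 32- or 64-bit longs. */
static size_t bitmap_bytes(void)
{
	return sizeof(struct local_bitmap);
}
```

The sketch covers only the storage side; the lookup and update complications described above (a second lookup in the learning path, vid-keyed functions) are not addressed by it.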

On the notification side, if we can fix that, we can actually delete the
96000 entries without creating a huge notification storm and do a
user-land workaround of the original issue, so I’ll look into that next.

Any comments or ideas are very welcome.

Thank you,
 Nik



Re: [PATCH v2 0/2] drivers: net: xgene: Add TSO support

2015-08-27 Thread David Miller
From: Iyappan Subramanian isubraman...@apm.com
Date: Wed, 26 Aug 2015 11:48:04 -0700

 Adding TSO support for 10GbE
 
 iperf Tx data rate without TSO: 3.42 Gbps
   with TSO: 9.41 Gbps
 
 v2: Address review comments from v1
 - skb_linearize() if headers don't fit in 3 hardware buffers
 
 v1:
 * Initial version
 
 Signed-off-by: Iyappan Subramanian isubraman...@apm.com

Series applied, thanks.


Re: [PATCH net v3] sctp: donot reset the overall_error_count in SHUTDOWN_RECEIVE state

2015-08-27 Thread lucien xin
Hi Vlad, please help to ACK this one.


Re: [net-next v2 00/14][pull request] Intel Wired LAN Driver Updates 2015-08-26

2015-08-27 Thread David Miller
From: Jeff Kirsher jeffrey.t.kirs...@intel.com
Date: Wed, 26 Aug 2015 15:49:19 -0700

 This series contains updates to i40e and i40evf only.

Pulled, thanks Jeff.


[PATCH net-next] bridge: fdb: rearrange net_bridge_fdb_entry

2015-08-27 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov niko...@cumulusnetworks.com

While looking into fixing the local entries scalability issue I noticed
that the structure is badly arranged: vlan_id falls in the second cache
line, while rcu, which is used only when deleting, sits in the first.
Re-arrange the structure and push rcu to the end so we can get 16 bytes
which can be used for other fields (by pushing rcu fully into the second
64-byte chunk). With this change all the core information needed for
fdb lookups is available in a single cache line.

pahole before (note vlan_id):
struct net_bridge_fdb_entry {
        struct hlist_node          hlist;                     /*     0    16 */
        struct net_bridge_port *   dst;                       /*    16     8 */
        struct callback_head       rcu;                       /*    24    16 */
        long unsigned int          updated;                   /*    40     8 */
        long unsigned int          used;                      /*    48     8 */
        mac_addr                   addr;                      /*    56     6 */
        unsigned char              is_local:1;                /*    62: 7  1 */
        unsigned char              is_static:1;               /*    62: 6  1 */
        unsigned char              added_by_user:1;           /*    62: 5  1 */
        unsigned char              added_by_external_learn:1; /*    62: 4  1 */

        /* XXX 4 bits hole, try to pack */
        /* XXX 1 byte hole, try to pack */

        /* --- cacheline 1 boundary (64 bytes) --- */
        __u16                      vlan_id;                   /*    64     2 */

        /* size: 72, cachelines: 2, members: 11 */
        /* sum members: 65, holes: 1, sum holes: 1 */
        /* bit holes: 1, sum bit holes: 4 bits */
        /* padding: 6 */
        /* last cacheline: 8 bytes */
}

pahole after (note vlan_id):
struct net_bridge_fdb_entry {
        struct hlist_node          hlist;                     /*     0    16 */
        struct net_bridge_port *   dst;                       /*    16     8 */
        long unsigned int          updated;                   /*    24     8 */
        long unsigned int          used;                      /*    32     8 */
        mac_addr                   addr;                      /*    40     6 */
        __u16                      vlan_id;                   /*    46     2 */
        unsigned char              is_local:1;                /*    48: 7  1 */
        unsigned char              is_static:1;               /*    48: 6  1 */
        unsigned char              added_by_user:1;           /*    48: 5  1 */
        unsigned char              added_by_external_learn:1; /*    48: 4  1 */

        /* XXX 4 bits hole, try to pack */
        /* XXX 7 bytes hole, try to pack */

        struct callback_head       rcu;                       /*    56    16 */
        /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */

        /* size: 72, cachelines: 2, members: 11 */
        /* sum members: 65, holes: 1, sum holes: 7 */
        /* bit holes: 1, sum bit holes: 4 bits */
        /* last cacheline: 8 bytes */
}
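
The offsets reported by pahole can be cross-checked without kernel headers. The following is a rough user-space sketch, assuming an LP64 ABI (8-byte pointers and longs) and using stub types whose shapes are assumptions standing in for hlist_node, callback_head and mac_addr; struct fdb_before and struct fdb_after are hypothetical names for the two layouts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64

/* Stub types: assumed shapes, not the kernel definitions. */
struct hlist_node_stub { void *next; void **pprev; };
struct callback_head_stub { void *next; void (*func)(void *); };
typedef unsigned char mac_addr_stub[6];

struct fdb_before {
	struct hlist_node_stub hlist;
	void *dst;
	struct callback_head_stub rcu;
	unsigned long updated;
	unsigned long used;
	mac_addr_stub addr;
	unsigned char is_local:1, is_static:1, added_by_user:1,
		      added_by_external_learn:1;
	uint16_t vlan_id;
};

struct fdb_after {
	struct hlist_node_stub hlist;
	void *dst;
	unsigned long updated;
	unsigned long used;
	mac_addr_stub addr;
	uint16_t vlan_id;
	unsigned char is_local:1, is_static:1, added_by_user:1,
		      added_by_external_learn:1;
	struct callback_head_stub rcu;
};

/* True if a member at offset off with the given size lies entirely
 * within the first cache line. */
static int in_first_cacheline(size_t off, size_t size)
{
	assert(size > 0);
	return off + size <= CACHELINE;
}
```

On an LP64 build, offsetof(struct fdb_before, vlan_id) comes out past the first 64 bytes, while in struct fdb_after it sits next to addr, so in_first_cacheline() holds for the lookup keys only in the second layout.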

Signed-off-by: Nikolay Aleksandrov niko...@cumulusnetworks.com
---
 net/bridge/br_private.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 3d95647039d0..8c97a22b1790 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -95,15 +95,15 @@ struct net_bridge_fdb_entry
struct hlist_node   hlist;
struct net_bridge_port  *dst;
 
-   struct rcu_head rcu;
unsigned long   updated;
unsigned long   used;
mac_addr addr;
+   __u16   vlan_id;
unsigned char   is_local:1,
is_static:1,
added_by_user:1,
added_by_external_learn:1;
-   __u16   vlan_id;
+   struct rcu_head rcu;
 };
 
 struct net_bridge_port_group {
-- 
2.4.3



Re: [RFC PATCH v6 net-next 1/4] tcp: replace cnt rtt with struct in pkts_acked()

2015-08-27 Thread Yuchung Cheng
On Tue, Aug 25, 2015 at 4:33 PM, Lawrence Brakmo bra...@fb.com wrote:
 Replace 2 arguments (cnt and rtt) in the congestion control modules'
 pkts_acked() function with a struct. This will allow adding more
 information without having to modify existing congestion control
 modules (tcp_nv in particular needs bytes in flight when packet
 was sent).

 As proposed by Neal Cardwell in his comments to the tcp_nv patch.

 Signed-off-by: Lawrence Brakmo bra...@fb.com
Acked-by: Yuchung Cheng ych...@google.com
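
As a rough illustration of the interface change, here is a user-space sketch of a pkts_acked()-style hook that takes the sample struct. The types are cut-down stand-ins (struct bic_state is hypothetical), and ACK_RATIO_SHIFT is assumed to be 4, which matches the "/ 16" in the tcp_bic.c comment quoted in the diff below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ACK_RATIO_SHIFT 4	/* assumed value; ratio is kept scaled by 16 */

/* Same two fields the patch introduces; later patches can append
 * fields without touching the hook signature. */
struct ack_sample {
	uint32_t pkts_acked;
	int32_t rtt_us;
};

/* Hypothetical stand-in for the per-connection bictcp state. */
struct bic_state {
	uint32_t delayed_ack;
};

/* ratio = (15*ratio + sample) / 16, on the scaled value */
static void bic_acked(struct bic_state *ca, const struct ack_sample *sample)
{
	assert(ca != NULL && sample != NULL);
	ca->delayed_ack += sample->pkts_acked -
			   (ca->delayed_ack >> ACK_RATIO_SHIFT);
}
```

A module that ignores a newly added field, as this one would ignore in_flight, needs no change when the struct grows, which is the point of passing a struct instead of separate arguments.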

 ---
  include/net/tcp.h   |  7 ++-
  net/ipv4/tcp_bic.c  |  6 +++---
  net/ipv4/tcp_cdg.c  | 14 +++---
  net/ipv4/tcp_cubic.c|  6 +++---
  net/ipv4/tcp_htcp.c | 10 +-
  net/ipv4/tcp_illinois.c | 20 ++--
  net/ipv4/tcp_input.c|  7 +--
  net/ipv4/tcp_lp.c   |  6 +++---
  net/ipv4/tcp_vegas.c|  6 +++---
  net/ipv4/tcp_vegas.h|  2 +-
  net/ipv4/tcp_veno.c |  7 ---
  net/ipv4/tcp_westwood.c |  7 ---
  net/ipv4/tcp_yeah.c |  7 ---
  13 files changed, 58 insertions(+), 47 deletions(-)

 diff --git a/include/net/tcp.h b/include/net/tcp.h
 index 364426a..0121529 100644
 --- a/include/net/tcp.h
 +++ b/include/net/tcp.h
 @@ -834,6 +834,11 @@ enum tcp_ca_ack_event_flags {

  union tcp_cc_info;

 +struct ack_sample {
 +   u32 pkts_acked;
 +   s32 rtt_us;
 +};
 +
  struct tcp_congestion_ops {
 struct list_head list;
 u32 key;
 @@ -857,7 +862,7 @@ struct tcp_congestion_ops {
 /* new value of cwnd after loss (optional) */
 u32  (*undo_cwnd)(struct sock *sk);
 /* hook for packet ack accounting (optional) */
 -   void (*pkts_acked)(struct sock *sk, u32 num_acked, s32 rtt_us);
 +   void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);
 /* get info for inet_diag (optional) */
 size_t (*get_info)(struct sock *sk, u32 ext, int *attr,
union tcp_cc_info *info);
 diff --git a/net/ipv4/tcp_bic.c b/net/ipv4/tcp_bic.c
 index fd1405d..f469f1b 100644
 --- a/net/ipv4/tcp_bic.c
 +++ b/net/ipv4/tcp_bic.c
 @@ -197,15 +197,15 @@ static void bictcp_state(struct sock *sk, u8 new_state)
  /* Track delayed acknowledgment ratio using sliding window
   * ratio = (15*ratio + sample) / 16
   */
 -static void bictcp_acked(struct sock *sk, u32 cnt, s32 rtt)
 +static void bictcp_acked(struct sock *sk, const struct ack_sample *sample)
  {
 const struct inet_connection_sock *icsk = inet_csk(sk);

 if (icsk->icsk_ca_state == TCP_CA_Open) {
 struct bictcp *ca = inet_csk_ca(sk);

 -   cnt -= ca->delayed_ack >> ACK_RATIO_SHIFT;
 -   ca->delayed_ack += cnt;
 +   ca->delayed_ack += sample->pkts_acked -
 +   (ca->delayed_ack >> ACK_RATIO_SHIFT);
 }
  }

 diff --git a/net/ipv4/tcp_cdg.c b/net/ipv4/tcp_cdg.c
 index 167b6a3..b4e5af7 100644
 --- a/net/ipv4/tcp_cdg.c
 +++ b/net/ipv4/tcp_cdg.c
 @@ -294,12 +294,12 @@ static void tcp_cdg_cong_avoid(struct sock *sk, u32 ack, u32 acked)
 ca->shadow_wnd = max(ca->shadow_wnd, ca->shadow_wnd + incr);
  }

 -static void tcp_cdg_acked(struct sock *sk, u32 num_acked, s32 rtt_us)
 +static void tcp_cdg_acked(struct sock *sk, const struct ack_sample *sample)
  {
 struct cdg *ca = inet_csk_ca(sk);
 struct tcp_sock *tp = tcp_sk(sk);

 -   if (rtt_us <= 0)
 +   if (sample->rtt_us <= 0)
 return;

 /* A heuristic for filtering delayed ACKs, adapted from:
 @@ -307,20 +307,20 @@ static void tcp_cdg_acked(struct sock *sk, u32 num_acked, s32 rtt_us)
  * delay and rate based TCP mechanisms. TR 100219A. CAIA, 2010.
  */
 if (tp->sacked_out == 0) {
 -   if (num_acked == 1 && ca->delack) {
 +   if (sample->pkts_acked == 1 && ca->delack) {
 /* A delayed ACK is only used for the minimum if it is
  * provenly lower than an existing non-zero minimum.
  */
 -   ca->rtt.min = min(ca->rtt.min, rtt_us);
 +   ca->rtt.min = min(ca->rtt.min, sample->rtt_us);
 ca->delack--;
 return;
 -   } else if (num_acked > 1 && ca->delack < 5) {
 +   } else if (sample->pkts_acked > 1 && ca->delack < 5) {
 ca->delack++;
 }
 }

 -   ca->rtt.min = min_not_zero(ca->rtt.min, rtt_us);
 -   ca->rtt.max = max(ca->rtt.max, rtt_us);
 +   ca->rtt.min = min_not_zero(ca->rtt.min, sample->rtt_us);
 +   ca->rtt.max = max(ca->rtt.max, sample->rtt_us);
  }

  static u32 tcp_cdg_ssthresh(struct sock *sk)
 diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
 index 28011fb..c5d0ba5 100644
 --- a/net/ipv4/tcp_cubic.c
 +++ b/net/ipv4/tcp_cubic.c
 @@ -416,21 +416,21 @@ static void hystart_update(struct sock *sk, u32 delay)
  /* Track delayed acknowledgment ratio using sliding 

Re: [RFC PATCH v6 net-next 2/4] tcp: refactor struct tcp_skb_cb

2015-08-27 Thread Yuchung Cheng
On Tue, Aug 25, 2015 at 4:33 PM, Lawrence Brakmo bra...@fb.com wrote:

 Refactor tcp_skb_cb to create two overlapping areas to store
 state for incoming or outgoing skbs based on comments by
 Neal Cardwell to tcp_nv patch:

AFAICT this patch would not require an increase in the size of
sk_buff cb[] if it were to take advantage of the fact that the
tcp_skb_cb header.h4 and header.h6 fields are only used in the packet
reception code path, and this in_flight field is only used on the
transmit side.

 Signed-off-by: Lawrence Brakmo bra...@fb.com
Acked-by: Yuchung Cheng ych...@google.com
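
The overlay idea can be sketched in user space as below. It assumes a C11 compiler (anonymous unions); struct rx_parm_stub is a hypothetical 20-byte placeholder for the receive-side parm structs, and struct cb_old / struct cb_new are cut-down stand-ins, not the real tcp_skb_cb.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical placeholder for inet_skb_parm / inet6_skb_parm. */
struct rx_parm_stub {
	uint32_t words[5];	/* 20 bytes, size chosen for illustration */
};

/* Before: the tail of the cb holds receive-side state only. */
struct cb_old {
	uint32_t ack_seq;
	union {
		struct rx_parm_stub h4;
	} header;
};

/* After: tx state shares the same bytes as the receive-side union. */
struct cb_new {
	uint32_t ack_seq;
	union {
		struct {
			uint32_t in_flight;
		} tx;				/* outgoing skbs only */
		union {
			struct rx_parm_stub h4;
		} header;			/* incoming skbs only */
	};
};

/* How many bytes the overlay adds to the control block. */
static size_t cb_growth(void)
{
	return sizeof(struct cb_new) - sizeof(struct cb_old);
}

static void set_in_flight(struct cb_new *cb, uint32_t bytes)
{
	assert(cb != NULL);
	cb->tx.in_flight = bytes;
}
```

Because tx and header start at the same offset, cb_growth() is zero as long as the tx struct stays within the size of the receive-side union, which is what the "space for up to 20 bytes" comment in the hunk below refers to.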

 ---
  include/net/tcp.h | 11 ---
  1 file changed, 8 insertions(+), 3 deletions(-)

 diff --git a/include/net/tcp.h b/include/net/tcp.h
 index 0121529..a086a98 100644
 --- a/include/net/tcp.h
 +++ b/include/net/tcp.h
 @@ -755,11 +755,16 @@ struct tcp_skb_cb {
 /* 1 byte hole */
 __u32   ack_seq;/* Sequence number ACK'd*/
 union {
 -   struct inet_skb_parm h4;
 +   struct {
 +   /* There is space for up to 20 bytes */
 +   } tx;   /* only used for outgoing skbs */
 +   union {
 +   struct inet_skb_parm h4;
  #if IS_ENABLED(CONFIG_IPV6)
 -   struct inet6_skb_parm   h6;
 +   struct inet6_skb_parm   h6;
  #endif
 -   } header;   /* For incoming frames  */
 +   } header;   /* For incoming skbs */
 +   };
  };

  #define TCP_SKB_CB(__skb)  ((struct tcp_skb_cb *)&((__skb)->cb[0]))
 --
 1.8.1



[PATCH net-next 1/5] net: Introduce ipv4_addr_hash and use it for tcp metrics

2015-08-27 Thread David Ahern
Refactor a common line into a helper function.

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/net/ip.h   |  5 +
 net/ipv4/tcp_metrics.c | 12 ++--
 2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index bee5f3582e38..7b9e1c782aa3 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -458,6 +458,11 @@ static __inline__ void inet_reset_saddr(struct sock *sk)
 
 #endif
 
+static inline unsigned int ipv4_addr_hash(__be32 ip)
+{
+   return (__force unsigned int) ip;
+}
+
 bool ip_call_ra_chain(struct sk_buff *skb);
 
 /*
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index b3d64f61d922..3a4289268f97 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -249,7 +249,7 @@ static struct tcp_metrics_block *__tcp_get_metrics_req(struct request_sock *req,
case AF_INET:
saddr.addr.a4 = inet_rsk(req)->ir_loc_addr;
daddr.addr.a4 = inet_rsk(req)->ir_rmt_addr;
-   hash = (__force unsigned int) daddr.addr.a4;
+   hash = ipv4_addr_hash(inet_rsk(req)->ir_rmt_addr);
break;
 #if IS_ENABLED(CONFIG_IPV6)
case AF_INET6:
@@ -289,7 +289,7 @@ static struct tcp_metrics_block *__tcp_get_metrics_tw(struct inet_timewait_sock
saddr.addr.a4 = tw->tw_rcv_saddr;
daddr.family = AF_INET;
daddr.addr.a4 = tw->tw_daddr;
-   hash = (__force unsigned int) daddr.addr.a4;
+   hash = ipv4_addr_hash(tw->tw_daddr);
}
 #if IS_ENABLED(CONFIG_IPV6)
else if (tw->tw_family == AF_INET6) {
@@ -298,7 +298,7 @@ static struct tcp_metrics_block *__tcp_get_metrics_tw(struct inet_timewait_sock
saddr.addr.a4 = tw->tw_rcv_saddr;
daddr.family = AF_INET;
daddr.addr.a4 = tw->tw_daddr;
-   hash = (__force unsigned int) daddr.addr.a4;
+   hash = ipv4_addr_hash(tw->tw_daddr);
} else {
saddr.family = AF_INET6;
saddr.addr.in6 = tw->tw_v6_rcv_saddr;
@@ -339,7 +339,7 @@ static struct tcp_metrics_block *tcp_get_metrics(struct sock *sk,
saddr.addr.a4 = inet_sk(sk)->inet_saddr;
daddr.family = AF_INET;
daddr.addr.a4 = inet_sk(sk)->inet_daddr;
-   hash = (__force unsigned int) daddr.addr.a4;
+   hash = ipv4_addr_hash(inet_sk(sk)->inet_daddr);
}
 #if IS_ENABLED(CONFIG_IPV6)
else if (sk->sk_family == AF_INET6) {
@@ -348,7 +348,7 @@ static struct tcp_metrics_block *tcp_get_metrics(struct sock *sk,
saddr.addr.a4 = inet_sk(sk)->inet_saddr;
daddr.family = AF_INET;
daddr.addr.a4 = inet_sk(sk)->inet_daddr;
-   hash = (__force unsigned int) daddr.addr.a4;
+   hash = ipv4_addr_hash(inet_sk(sk)->inet_daddr);
} else {
saddr.family = AF_INET6;
saddr.addr.in6 = sk->sk_v6_rcv_saddr;
@@ -959,7 +959,7 @@ static int __parse_nl_addr(struct genl_info *info, struct inetpeer_addr *addr,
addr->family = AF_INET;
addr->addr.a4 = nla_get_in_addr(a);
if (hash)
-   *hash = (__force unsigned int) addr->addr.a4;
+   *hash = ipv4_addr_hash(addr->addr.a4);
return 0;
}
a = info->attrs[v6];
-- 
1.9.1



[PATCH net-next 2/5] net: Add set,get helpers for inetpeer addresses

2015-08-27 Thread David Ahern
Use inetpeer set,get helpers in tcp_metrics rather than peeking into
the inetpeer_addr struct.

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/net/inetpeer.h | 23 ++
 net/ipv4/tcp_metrics.c | 65 +-
 2 files changed, 50 insertions(+), 38 deletions(-)

diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h
index 002f0bd27001..f75b9e7036a2 100644
--- a/include/net/inetpeer.h
+++ b/include/net/inetpeer.h
@@ -71,6 +71,29 @@ void inet_initpeers(void) __init;
 
 #define INETPEER_METRICS_NEW   (~(u32) 0)
 
+static inline void inetpeer_set_addr_v4(struct inetpeer_addr *iaddr, __be32 ip)
+{
+   iaddr->addr.a4 = ip;
+   iaddr->family = AF_INET;
+}
+
+static inline __be32 inetpeer_get_addr_v4(struct inetpeer_addr *iaddr)
+{
+   return iaddr->addr.a4;
+}
+
+static inline void inetpeer_set_addr_v6(struct inetpeer_addr *iaddr,
+   struct in6_addr *in6)
+{
+   iaddr->addr.in6 = *in6;
+   iaddr->family = AF_INET6;
+}
+
+static inline struct in6_addr *inetpeer_get_addr_v6(struct inetpeer_addr *iaddr)
+{
+   return &iaddr->addr.in6;
+}
+
 /* can be called with or without local BH being disabled */
 struct inet_peer *inet_getpeer(struct inet_peer_base *base,
   const struct inetpeer_addr *daddr,
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 3a4289268f97..4ef4dd4bf38c 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -247,14 +247,14 @@ static struct tcp_metrics_block *__tcp_get_metrics_req(struct request_sock *req,
daddr.family = req->rsk_ops->family;
switch (daddr.family) {
case AF_INET:
-   saddr.addr.a4 = inet_rsk(req)->ir_loc_addr;
-   daddr.addr.a4 = inet_rsk(req)->ir_rmt_addr;
+   inetpeer_set_addr_v4(&saddr, inet_rsk(req)->ir_loc_addr);
+   inetpeer_set_addr_v4(&daddr, inet_rsk(req)->ir_rmt_addr);
hash = ipv4_addr_hash(inet_rsk(req)->ir_rmt_addr);
break;
 #if IS_ENABLED(CONFIG_IPV6)
case AF_INET6:
-   saddr.addr.in6 = inet_rsk(req)->ir_v6_loc_addr;
-   daddr.addr.in6 = inet_rsk(req)->ir_v6_rmt_addr;
+   inetpeer_set_addr_v6(&saddr, &inet_rsk(req)->ir_v6_loc_addr);
+   inetpeer_set_addr_v6(&daddr, &inet_rsk(req)->ir_v6_rmt_addr);
hash = ipv6_addr_hash(&inet_rsk(req)->ir_v6_rmt_addr);
break;
 #endif
@@ -285,25 +285,19 @@ static struct tcp_metrics_block *__tcp_get_metrics_tw(struct inet_timewait_sock
struct net *net;
 
if (tw->tw_family == AF_INET) {
-   saddr.family = AF_INET;
-   saddr.addr.a4 = tw->tw_rcv_saddr;
-   daddr.family = AF_INET;
-   daddr.addr.a4 = tw->tw_daddr;
+   inetpeer_set_addr_v4(&saddr, tw->tw_rcv_saddr);
+   inetpeer_set_addr_v4(&daddr, tw->tw_daddr);
hash = ipv4_addr_hash(tw->tw_daddr);
}
 #if IS_ENABLED(CONFIG_IPV6)
else if (tw->tw_family == AF_INET6) {
if (ipv6_addr_v4mapped(&tw->tw_v6_daddr)) {
-   saddr.family = AF_INET;
-   saddr.addr.a4 = tw->tw_rcv_saddr;
-   daddr.family = AF_INET;
-   daddr.addr.a4 = tw->tw_daddr;
+   inetpeer_set_addr_v4(&saddr, tw->tw_rcv_saddr);
+   inetpeer_set_addr_v4(&daddr, tw->tw_daddr);
hash = ipv4_addr_hash(tw->tw_daddr);
} else {
-   saddr.family = AF_INET6;
-   saddr.addr.in6 = tw->tw_v6_rcv_saddr;
-   daddr.family = AF_INET6;
-   daddr.addr.in6 = tw->tw_v6_daddr;
+   inetpeer_set_addr_v6(&saddr, &tw->tw_v6_rcv_saddr);
+   inetpeer_set_addr_v6(&daddr, &tw->tw_v6_daddr);
hash = ipv6_addr_hash(&tw->tw_v6_daddr);
}
}
@@ -335,25 +329,19 @@ static struct tcp_metrics_block *tcp_get_metrics(struct sock *sk,
struct net *net;
 
if (sk->sk_family == AF_INET) {
-   saddr.family = AF_INET;
-   saddr.addr.a4 = inet_sk(sk)->inet_saddr;
-   daddr.family = AF_INET;
-   daddr.addr.a4 = inet_sk(sk)->inet_daddr;
+   inetpeer_set_addr_v4(&saddr, inet_sk(sk)->inet_saddr);
+   inetpeer_set_addr_v4(&daddr, inet_sk(sk)->inet_daddr);
hash = ipv4_addr_hash(inet_sk(sk)->inet_daddr);
}
 #if IS_ENABLED(CONFIG_IPV6)
else if (sk->sk_family == AF_INET6) {
if (ipv6_addr_v4mapped(&sk->sk_v6_daddr)) {
-   saddr.family = AF_INET;
-   saddr.addr.a4 = inet_sk(sk)->inet_saddr;
-   daddr.family = AF_INET;
-   daddr.addr.a4 = inet_sk(sk)->inet_daddr;
+   

Re: [PATCH net v3] sctp: donot reset the overall_error_count in SHUTDOWN_RECEIVE state

2015-08-27 Thread Marcelo Ricardo Leitner
On Thu, Aug 27, 2015 at 04:52:20AM +0800, Xin Long wrote:
 Commit f8d960524328 (sctp: Enforce retransmission limit during shutdown)
 fixed a problem with excessive retransmissions in the SHUTDOWN_PENDING by not
 resetting the association overall_error_count.  This allowed the association
 to better enforce assoc.max_retrans limit.
 
 However, the same issue still exists when the association is in
 SHUTDOWN_RECEIVED state.  In this state, HB-ACKs will continue to reset
 the overall_error_count for the association, which would extend the
 lifetime of the association unnecessarily.
 
 This patch solves this by resetting the overall_error_count whenever
 the current state is smaller than SCTP_STATE_SHUTDOWN_PENDING.  As a
 small side-effect, we end up also handling SCTP_STATE_SHUTDOWN_ACK_SENT
 and SCTP_STATE_SHUTDOWN_SENT states, but they are not really impacted
 because we disable Heartbeats in those states.
 
 Fixes: Commit f8d960524328 (sctp: Enforce retransmission limit during 
 shutdown)
 Signed-off-by: Xin Long lucien@gmail.com

Acked-by: Marcelo Ricardo Leitner marcelo.leit...@gmail.com
thx
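
The effect of switching from != to < can be shown with a small user-space sketch. The enum below is a local copy whose ordering is assumed to mirror the SCTP state enum in include/net/sctp/constants.h; the ST_* names and both helper functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Local copy; ordering assumed to match the kernel's SCTP state enum. */
enum sctp_state_stub {
	ST_CLOSED = 0,
	ST_COOKIE_WAIT,
	ST_COOKIE_ECHOED,
	ST_ESTABLISHED,
	ST_SHUTDOWN_PENDING,
	ST_SHUTDOWN_SENT,
	ST_SHUTDOWN_RECEIVED,
	ST_SHUTDOWN_ACK_SENT,
};

/* Old condition: reset unless the state is exactly SHUTDOWN_PENDING. */
static bool resets_error_count_old(enum sctp_state_stub s)
{
	assert(s <= ST_SHUTDOWN_ACK_SENT);
	return s != ST_SHUTDOWN_PENDING;
}

/* New condition: reset only in states before SHUTDOWN_PENDING. */
static bool resets_error_count_new(enum sctp_state_stub s)
{
	assert(s <= ST_SHUTDOWN_ACK_SENT);
	return s < ST_SHUTDOWN_PENDING;
}
```

Under that ordering, the old test still reset the counter in SHUTDOWN_RECEIVED, while the new one stops resetting in every shutdown state and behaves the same as before up to ESTABLISHED.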

 ---
  net/sctp/sm_sideeffect.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
 index fef2acd..85e6f03 100644
 --- a/net/sctp/sm_sideeffect.c
 +++ b/net/sctp/sm_sideeffect.c
 @@ -702,7 +702,7 @@ static void sctp_cmd_transport_on(sctp_cmd_seq_t *cmds,
* outstanding data and rely on the retransmission limit be reached
* to shutdown the association.
*/
 - if (t->asoc->state != SCTP_STATE_SHUTDOWN_PENDING)
 + if (t->asoc->state < SCTP_STATE_SHUTDOWN_PENDING)
   t->asoc->overall_error_count = 0;
  
   /* Clear the hb_sent flag to signal that we had a good
 -- 
 2.1.0
 


[PATCH net-next 5/5] net: Add support for VRFs to inetpeer cache

2015-08-27 Thread David Ahern
inetpeer caches based on address only, so duplicate IP addresses within
a namespace return the same cached entry. Enhance the ipv4 address key
to contain both the IPv4 address and VRF device index.
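
A minimal user-space sketch of the keying idea follows. struct ipv4_key_stub mirrors the two fields of the ipv4_addr_key struct added in the diff below, but uses host-order uint32_t instead of __be32; ipv4_key_equal() is a hypothetical helper, not a function from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct ipv4_addr_key: address plus VRF device index. */
struct ipv4_key_stub {
	uint32_t addr;
	int vif;
};

/* Two keys match only if both the address and the VRF index match. */
static bool ipv4_key_equal(const struct ipv4_key_stub *a,
			   const struct ipv4_key_stub *b)
{
	assert(a != NULL && b != NULL);
	return a->addr == b->addr && a->vif == b->vif;
}
```

With an address-only key, the same IP address seen in two VRFs would share one cached entry; once vif is part of the key the two lookups resolve to distinct entries.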

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/net/inetpeer.h | 17 -
 net/ipv4/icmp.c|  3 ++-
 net/ipv4/ip_fragment.c |  3 ++-
 net/ipv4/route.c   |  7 +--
 4 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h
index e96bada9d19a..b372004dca3e 100644
--- a/include/net/inetpeer.h
+++ b/include/net/inetpeer.h
@@ -15,11 +15,17 @@
 #include <net/ipv6.h>
 #include <linux/atomic.h>
 
+/* IPv4 address key for cache lookups */
+struct ipv4_addr_key {
+   __be32 addr;
+   int vif;
+};
+
 #define INETPEER_MAXKEYSZ   (sizeof(struct in6_addr) / sizeof(u32))
 
 struct inetpeer_addr {
union {
-   __be32  a4;
+   struct ipv4_addr_key a4;
struct in6_addr a6;
u32 key[INETPEER_MAXKEYSZ];
};
@@ -71,13 +77,13 @@ void inet_initpeers(void) __init;
 
 static inline void inetpeer_set_addr_v4(struct inetpeer_addr *iaddr, __be32 ip)
 {
-   iaddr->a4 = ip;
+   iaddr->a4.addr = ip;
iaddr->family = AF_INET;
 }
 
 static inline __be32 inetpeer_get_addr_v4(struct inetpeer_addr *iaddr)
 {
-   return iaddr->a4;
+   return iaddr->a4.addr;
 }
 
 static inline void inetpeer_set_addr_v6(struct inetpeer_addr *iaddr,
@@ -99,11 +105,12 @@ struct inet_peer *inet_getpeer(struct inet_peer_base *base,
 
 static inline struct inet_peer *inet_getpeer_v4(struct inet_peer_base *base,
__be32 v4daddr,
-   int create)
+   int vif, int create)
 {
struct inetpeer_addr daddr;
 
-   daddr.a4 = v4daddr;
+   daddr.a4.addr = v4daddr;
+   daddr.a4.vif = vif;
daddr.family = AF_INET;
return inet_getpeer(base, &daddr, create);
 }
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index f16488efa1c8..79fe05befcae 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -309,9 +309,10 @@ static bool icmpv4_xrlim_allow(struct net *net, struct rtable *rt,
 
rc = false;
if (icmp_global_allow()) {
+   int vif = vrf_master_ifindex(dst->dev);
struct inet_peer *peer;
 
-   peer = inet_getpeer_v4(net->ipv4.peers, fl4->daddr, 1);
+   peer = inet_getpeer_v4(net->ipv4.peers, fl4->daddr, vif, 1);
rc = inet_peer_xrlim_allow(peer,
   net->ipv4.sysctl_icmp_ratelimit);
if (peer)
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 15762e758861..fa7f15305f9a 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -151,7 +151,8 @@ static void ip4_frag_init(struct inet_frag_queue *q, const void *a)
qp->vif = arg->vif;
qp->user = arg->user;
qp->peer = sysctl_ipfrag_max_dist ?
-   inet_getpeer_v4(net->ipv4.peers, arg->iph->saddr, 1) : NULL;
+   inet_getpeer_v4(net->ipv4.peers, arg->iph->saddr, arg->vif, 1) :
+   NULL;
 }
 
 static void ip4_frag_free(struct inet_frag_queue *q)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 0b8a6531ef03..e1a60f6c1aad 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -838,6 +838,7 @@ void ip_rt_send_redirect(struct sk_buff *skb)
struct inet_peer *peer;
struct net *net;
int log_martians;
+   int vif;
 
rcu_read_lock();
in_dev = __in_dev_get_rcu(rt->dst.dev);
@@ -846,10 +847,11 @@ void ip_rt_send_redirect(struct sk_buff *skb)
return;
}
log_martians = IN_DEV_LOG_MARTIANS(in_dev);
+   vif = vrf_master_ifindex_rcu(rt->dst.dev);
rcu_read_unlock();
 
net = dev_net(rt->dst.dev);
-   peer = inet_getpeer_v4(net->ipv4.peers, ip_hdr(skb)->saddr, 1);
+   peer = inet_getpeer_v4(net->ipv4.peers, ip_hdr(skb)->saddr, vif, 1);
if (!peer) {
icmp_send(skb, ICMP_REDIRECT, ICMP_REDIR_HOST,
  rt_nexthop(rt, ip_hdr(skb)->daddr));
@@ -938,7 +940,8 @@ static int ip_error(struct sk_buff *skb)
break;
}
 
-   peer = inet_getpeer_v4(net->ipv4.peers, ip_hdr(skb)->saddr, 1);
+   peer = inet_getpeer_v4(net->ipv4.peers, ip_hdr(skb)->saddr,
+  vrf_master_ifindex(skb->dev), 1);
 
send = true;
if (peer) {
-- 
1.9.1



[PATCH net-next 4/5] net: Refactor inetpeer address struct

2015-08-27 Thread David Ahern
Move the inetpeer_addr_base union to inetpeer_addr and drop
inetpeer_addr_base.

The a6 and in6 overlays are not both needed; drop the __be32 version
and rename in6 to a6 for consistency with ipv4. Add a new u32 array to
the union, which removes the need for the typecast in the compare
function and provides a consistent arg for both ipv4 and ipv6 addresses,
making the compare function more readable.
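
To illustrate the compare over the shared u32 array, here is a user-space sketch. struct peer_addr, the FAM_* tags and peer_addr_cmp() are hypothetical stand-ins (the real code uses AF_INET/AF_INET6 and struct in6_addr), and the sketch assumes a C11 compiler for the anonymous union. It counts n in 32-bit words, dividing the member size by sizeof(uint32_t), so that n indexes elements of key[].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { FAM_V4 = 4, FAM_V6 = 6 };	/* hypothetical family tags */
#define KEY_WORDS 4			/* 128-bit address as four u32 words */

struct peer_addr {
	union {
		uint32_t a4;
		uint32_t a6[KEY_WORDS];	/* stand-in for struct in6_addr */
		uint32_t key[KEY_WORDS];
	};
	uint16_t family;
};

/* Word-wise compare: one word for v4, four for v6. */
static int peer_addr_cmp(const struct peer_addr *a, const struct peer_addr *b)
{
	size_t i, n;

	assert(a != NULL && b != NULL);
	n = (a->family == FAM_V4 ? sizeof(a->a4) : sizeof(a->a6)) /
	    sizeof(uint32_t);

	for (i = 0; i < n; i++) {
		if (a->key[i] == b->key[i])
			continue;
		return a->key[i] < b->key[i] ? -1 : 1;
	}
	return 0;
}
```

Since a4, a6 and key overlay the same storage, the loop needs no casts, and for a v4 address only key[0] is examined, so stale data in the upper words does not affect the result.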

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/net/inetpeer.h | 30 ++
 1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h
index 9d9b3446731d..e96bada9d19a 100644
--- a/include/net/inetpeer.h
+++ b/include/net/inetpeer.h
@@ -15,16 +15,14 @@
 #include <net/ipv6.h>
 #include <linux/atomic.h>
 
-struct inetpeer_addr_base {
+#define INETPEER_MAXKEYSZ   (sizeof(struct in6_addr) / sizeof(u32))
+
+struct inetpeer_addr {
union {
__be32  a4;
-   __be32  a6[4];
-   struct in6_addr in6;
+   struct in6_addr a6;
+   u32 key[INETPEER_MAXKEYSZ];
};
-};
-
-struct inetpeer_addr {
-   struct inetpeer_addr_base   addr;
__u16   family;
 };
 
@@ -73,25 +71,25 @@ void inet_initpeers(void) __init;
 
 static inline void inetpeer_set_addr_v4(struct inetpeer_addr *iaddr, __be32 ip)
 {
-   iaddr->addr.a4 = ip;
+   iaddr->a4 = ip;
iaddr->family = AF_INET;
 }
 
 static inline __be32 inetpeer_get_addr_v4(struct inetpeer_addr *iaddr)
 {
-   return iaddr->addr.a4;
+   return iaddr->a4;
 }
 
 static inline void inetpeer_set_addr_v6(struct inetpeer_addr *iaddr,
struct in6_addr *in6)
 {
-   iaddr->addr.in6 = *in6;
+   iaddr->a6 = *in6;
iaddr->family = AF_INET6;
 }
 
 static inline struct in6_addr *inetpeer_get_addr_v6(struct inetpeer_addr *iaddr)
 {
-   return &iaddr->addr.in6;
+   return &iaddr->a6;
 }
 
 /* can be called with or without local BH being disabled */
@@ -105,7 +103,7 @@ static inline struct inet_peer *inet_getpeer_v4(struct inet_peer_base *base,
 {
struct inetpeer_addr daddr;
 
-   daddr.addr.a4 = v4daddr;
+   daddr.a4 = v4daddr;
daddr.family = AF_INET;
return inet_getpeer(base, &daddr, create);
 }
@@ -116,7 +114,7 @@ static inline struct inet_peer *inet_getpeer_v6(struct inet_peer_base *base,
 {
struct inetpeer_addr daddr;
 
-   daddr.addr.in6 = *v6daddr;
+   daddr.a6 = *v6daddr;
daddr.family = AF_INET6;
return inet_getpeer(base, &daddr, create);
 }
@@ -124,12 +122,12 @@ static inline struct inet_peer *inet_getpeer_v6(struct inet_peer_base *base,
 static inline int inetpeer_addr_cmp(const struct inetpeer_addr *a,
const struct inetpeer_addr *b)
 {
-   int i, n = (a->family == AF_INET ? 1 : 4);
+   int i, n = (a->family == AF_INET ? sizeof(a->a4) : sizeof(a->a6));
 
for (i = 0; i < n; i++) {
-   if (a->addr.a6[i] == b->addr.a6[i])
+   if (a->key[i] == b->key[i])
continue;
-   if ((__force u32)a->addr.a6[i] < (__force u32)b->addr.a6[i])
+   if (a->key[i] < b->key[i])
return -1;
return 1;
}
-- 
1.9.1
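
For readers following the refactor, here is a minimal user-space sketch of
the reworked address key and the comparison it enables. It is not the kernel
code: peer_addr, peer_addr_cmp, the peer_addr_v4/v6 helpers and the FAM_*
constants are stand-in names, and a plain 16-byte array stands in for
struct in6_addr. One assumption worth flagging: the hunk above takes n from
sizeof(a->a4) / sizeof(a->a6), which are byte counts, while the loop indexes
u32 words; the sketch divides by the word size so that n counts key[] entries.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PEER_MAXKEYSZ (16 / sizeof(uint32_t))   /* 4 words: room for an IPv6 address */

enum { FAM_INET = 2, FAM_INET6 = 10 };          /* stand-ins for AF_INET / AF_INET6 */

struct peer_addr {
    union {
        uint32_t a4;                     /* IPv4 address */
        uint8_t  a6[16];                 /* stand-in for struct in6_addr */
        uint32_t key[PEER_MAXKEYSZ];     /* word view, used only for comparison */
    };
    uint16_t family;
};

/* Compare two addresses word by word; returns -1, 0 or 1. */
static int peer_addr_cmp(const struct peer_addr *a, const struct peer_addr *b)
{
    size_t i, n;

    /* Number of u32 words that are meaningful for this family. */
    if (a->family == FAM_INET)
        n = sizeof(a->a4) / sizeof(uint32_t);
    else
        n = sizeof(a->a6) / sizeof(uint32_t);
    assert(n <= PEER_MAXKEYSZ);

    for (i = 0; i < n; i++) {
        if (a->key[i] == b->key[i])
            continue;
        return a->key[i] < b->key[i] ? -1 : 1;
    }
    return 0;
}

/* Hypothetical constructors, mirroring the set_addr helpers in spirit. */
static struct peer_addr peer_addr_v4(uint32_t ip)
{
    struct peer_addr p;

    memset(&p, 0, sizeof(p));
    p.a4 = ip;
    p.family = FAM_INET;
    return p;
}

static struct peer_addr peer_addr_v6(const uint8_t bytes[16])
{
    struct peer_addr p;

    memset(&p, 0, sizeof(p));
    memcpy(p.a6, bytes, sizeof(p.a6));
    p.family = FAM_INET6;
    return p;
}
```

With a single key[] view, the same loop serves both families and no
(__force u32) cast is needed at the call site.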



[PATCH net-next 0/5] net: Refactor inetpeer cache and add support for VRFs

2015-08-27 Thread David Ahern
Per Dave's comment on the version 1 patch, this series adds VRF support to
the inetpeer cache by explicitly making the address + index a key. The
inetpeer code is refactored in the process.

David Ahern (5):
  net: Introduce ipv4_addr_hash and use it for tcp metrics
  net: Add set,get helpers for inetpeer addresses
  net: Add helper function to compare inetpeer addresses
  net: Refactor inetpeer address struct
  net: Add support for VRFs to inetpeer cache

 include/net/inetpeer.h | 64 ---
 include/net/ip.h   |  5 
 net/ipv4/icmp.c|  3 +-
 net/ipv4/inetpeer.c| 20 ++---
 net/ipv4/ip_fragment.c |  3 +-
 net/ipv4/route.c   |  7 +++--
 net/ipv4/tcp_metrics.c | 81 --
 7 files changed, 103 insertions(+), 80 deletions(-)

-- 
1.9.1



[PATCH net-next 3/5] net: Add helper function to compare inetpeer addresses

2015-08-27 Thread David Ahern
tcp_metrics and inetpeer both have functions to compare inetpeer
addresses. Consolidate into 1 version.

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/net/inetpeer.h | 16 
 net/ipv4/inetpeer.c| 20 ++--
 net/ipv4/tcp_metrics.c |  6 +-
 3 files changed, 19 insertions(+), 23 deletions(-)

diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h
index f75b9e7036a2..9d9b3446731d 100644
--- a/include/net/inetpeer.h
+++ b/include/net/inetpeer.h
@@ -121,6 +121,22 @@ static inline struct inet_peer *inet_getpeer_v6(struct inet_peer_base *base,
return inet_getpeer(base, &daddr, create);
 }
 
+static inline int inetpeer_addr_cmp(const struct inetpeer_addr *a,
+   const struct inetpeer_addr *b)
+{
+   int i, n = (a->family == AF_INET ? 1 : 4);
+
+   for (i = 0; i < n; i++) {
+   if (a->addr.a6[i] == b->addr.a6[i])
+   continue;
+   if ((__force u32)a->addr.a6[i] < (__force u32)b->addr.a6[i])
+   return -1;
+   return 1;
+   }
+
+   return 0;
+}
+
 /* can be called from BH context or outside */
 void inet_putpeer(struct inet_peer *p);
 bool inet_peer_xrlim_allow(struct inet_peer *peer, int timeout);
diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index 241afd743d2c..86fa45809540 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -157,22 +157,6 @@ void __init inet_initpeers(void)
INIT_DEFERRABLE_WORK(&gc_work, inetpeer_gc_worker);
 }
 
-static int addr_compare(const struct inetpeer_addr *a,
-   const struct inetpeer_addr *b)
-{
-   int i, n = (a->family == AF_INET ? 1 : 4);
-
-   for (i = 0; i < n; i++) {
-   if (a->addr.a6[i] == b->addr.a6[i])
-   continue;
-   if ((__force u32)a->addr.a6[i] < (__force u32)b->addr.a6[i])
-   return -1;
-   return 1;
-   }
-
-   return 0;
-}
-
 #define rcu_deref_locked(X, BASE)  \
rcu_dereference_protected(X, lockdep_is_held(&(BASE)->lock.lock))
 
@@ -188,7 +172,7 @@ static int addr_compare(const struct inetpeer_addr *a,
*stackptr++ = &_base->root; \
for (u = rcu_deref_locked(_base->root, _base);  \
 u != peer_avl_empty;) {\
-   int cmp = addr_compare(_daddr, &u->daddr);  \
+   int cmp = inetpeer_addr_cmp(_daddr, &u->daddr); \
if (cmp == 0)   \
break;  \
if (cmp == -1)  \
@@ -215,7 +199,7 @@ static struct inet_peer *lookup_rcu(const struct inetpeer_addr *daddr,
int count = 0;
 
while (u != peer_avl_empty) {
-   int cmp = addr_compare(daddr, &u->daddr);
+   int cmp = inetpeer_addr_cmp(daddr, &u->daddr);
if (cmp == 0) {
/* Before taking a reference, check if this entry was
 * deleted (refcnt=-1)
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 4ef4dd4bf38c..c8cbc2b4b792 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -81,11 +81,7 @@ static void tcp_metric_set(struct tcp_metrics_block *tm,
 static bool addr_same(const struct inetpeer_addr *a,
  const struct inetpeer_addr *b)
 {
-   if (a->family != b->family)
-   return false;
-   if (a->family == AF_INET)
-   return a->addr.a4 == b->addr.a4;
-   return ipv6_addr_equal(&a->addr.in6, &b->addr.in6);
+   return inetpeer_addr_cmp(a, b) == 0;
 }
 
 struct tcpm_hash_bucket {
-- 
1.9.1



Re: [PATCH -next v3 2/2] smsc911x: Ignore error return from device_get_phy_mode()

2015-08-27 Thread David Miller
From: Guenter Roeck li...@roeck-us.net
Date: Wed, 26 Aug 2015 20:27:05 -0700

 Commit 62ee783bf1f8 (smsc911x: Fix crash seen if neither ACPI nor OF is
 configured or used) introduces an error check for the return value from
 device_get_phy_mode() and bails out if there is an error. Unfortunately,
 there are configurations where no phy is configured. Those configurations
 now fail.
 
 To fix the problem, accept error returns from device_get_phy_mode(),
 and use the return value from device_property_read_u32() to determine
 if there is a suitable firmware interface to read the configuration.
 
 Fixes: 62ee783bf1f8 (smsc911x: Fix crash seen if neither ACPI nor OF is 
 configured or used)
 Tested-by: Tony Lindgren t...@atomide.com
 Signed-off-by: Guenter Roeck li...@roeck-us.net

Applied.


Re: [PATCH -next v3 1/2] device property: Return -ENXIO if there is no suitable FW interface

2015-08-27 Thread David Miller
From: Guenter Roeck li...@roeck-us.net
Date: Wed, 26 Aug 2015 20:27:04 -0700

 Return -ENXIO if device property array access functions don't find
 a suitable firmware interface.
 
 This lets drivers decide if they should use available platform data
 instead.
 
 Cc: Rafael J. Wysocki rafael.j.wyso...@intel.com
 Signed-off-by: Guenter Roeck li...@roeck-us.net

Applied.


Re: [PATCH v5 net-next 5/8] geneve: Add support to collect tunnel metadata.

2015-08-27 Thread Pravin Shelar
On Thu, Aug 27, 2015 at 2:18 AM, Thomas Graf tg...@suug.ch wrote:
 On 08/26/15 at 11:46pm, Pravin B Shelar wrote:
 + if (ip_tunnel_collect_metadata() || geneve->collect_md) {
 + __be16 flags;
 + void *opts;
 +
 + flags = TUNNEL_KEY | TUNNEL_GENEVE_OPT |
 + (gnvh->oam ? TUNNEL_OAM : 0) |
 + (gnvh->critical ? TUNNEL_CRIT_OPT : 0);
 +
 + tun_dst = udp_tun_rx_dst(skb, AF_INET, flags,
 +  vni_to_tunnel_id(gnvh->vni),
 +  gnvh->opt_len * 4);
 + if (!tun_dst)
 + goto drop;
 +
 + /* Update tunnel dst according to Geneve options. */
 + opts = ip_tunnel_info_opts(&tun_dst->u.tun_info,
 +gnvh->opt_len * 4);
 + memcpy(opts, gnvh->options, gnvh->opt_len * 4);
 + } else {
 + /* Drop packets w/ critical options,
 +  * since we don't support any...
 +  */
 + if (gnvh->critical)
 + goto drop;
 + }

   skb_reset_mac_header(skb);
   skb_scrub_packet(skb, !net_eq(geneve->net, dev_net(geneve->dev)));
   skb->protocol = eth_type_trans(skb, geneve->dev);
   skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);

 + if (tun_dst)
 + skb_dst_set(skb, &tun_dst->dst);

 It is slightly non obvious that introducing an error condition above
 this and before udp_tun_rx_dst() would introduce a memory leak. Other
 than this looks great now.

I cannot move this into the if block, since skb_scrub_packet() drops
the skb dst entry.

 Acked-by: Thomas Graf tg...@suug.ch


Re: [PATCH v2 net-next] bridge: Add netlink support for vlan_protocol attribute

2015-08-27 Thread David Miller
From: Toshiaki Makita makita.toshi...@lab.ntt.co.jp
Date: Thu, 27 Aug 2015 15:32:26 +0900

 This enables bridge vlan_protocol to be configured through netlink.
 
 When CONFIG_BRIDGE_VLAN_FILTERING is disabled, the kernel behaves the
 same way as if this feature were not implemented.
 
 Signed-off-by: Toshiaki Makita makita.toshi...@lab.ntt.co.jp
 ---
 v2: Fix u16 to __be16

Applied, thank you.


Re: [PATCH net-next] bpf: add support for %s specifier to bpf_trace_printk()

2015-08-27 Thread David Miller
From: Alexei Starovoitov a...@plumgrid.com
Date: Thu, 27 Aug 2015 16:06:14 -0700

 Fair or you still think it should be per byte copy?

I'm terribly surprised we don't have an equivalent of strncpy()
for unsafe kernel pointers.

You probably won't be the last person to want this, and it's silly
to optimize it in one place and then wait for cutpaste into the
next guy.


Re: [patch net-next v2 0/6] rocker: make master change handling nicer

2015-08-27 Thread David Miller
From: Jiri Pirko j...@resnulli.us
Date: Thu, 27 Aug 2015 09:31:17 +0200

 From: Jiri Pirko j...@mellanox.com
 
 Jiri Pirko (6):
   net: introduce change upper device notifier change info
   net: add netif_is_bridge_master helper
   net: add netif_is_ovs_master helper with IFF_OPENVSWITCH private flag
   net: kill long time unused bonding private flags
   rocker: use new helper to figure out master kind
   rocker: use change upper info

Series applied, thanks Jiri.


Re: [RFC v3] netlink: add NETLINK_CAP_ACK socket option

2015-08-27 Thread David Miller
From: Christophe Ricard christophe.ric...@gmail.com
Date: Thu, 27 Aug 2015 21:31:31 +0200

 Since commit c05cdb1b864f (netlink: allow large data transfers from
 user-space), the kernel may fail to allocate the necessary room for the
 acknowledgment message back to userspace. This patch introduces a new
 socket option that trims off the payload of the original netlink message.
 
 The netlink message header is still included, so the user can guess from
 the sequence number what is the message that has triggered the
 acknowledgment.
 
 Cc: sta...@vger.kernel.org

Please do not CC: stable for networking changes, that is not how we handle
-stable submissions.

Instead, please just explicitly ask me to queue it up for -stable when
you make a bonafide non-RFC submission of a change.

Thanks.
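
To make the allocation problem in the quoted description concrete, the
following is a rough sketch of the size arithmetic behind a netlink ACK, with
and without the proposed capping. It is a model, not the kernel function:
nl_hdr, nl_err, NL_ALIGN and ack_size are local stand-in names that mirror the
uapi message layouts, and it assumes (as I understand the ACK path) that the
original request is echoed back only when an error is reported.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirrors of the netlink header and error-message layouts, so the
 * sketch builds without Linux headers. */
struct nl_hdr { uint32_t len; uint16_t type; uint16_t flags; uint32_t seq; uint32_t pid; };
struct nl_err { int32_t error; struct nl_hdr msg; };

#define NL_ALIGN(n)  (((size_t)(n) + 3) & ~(size_t)3)
#define NL_HDRLEN    NL_ALIGN(sizeof(struct nl_hdr))

/*
 * Bytes needed for the ACK to a request of orig_len bytes (header included).
 * err     : non-zero when the ACK reports an error
 * cap_ack : non-zero when the socket asked for the payload to be trimmed
 */
static size_t ack_size(size_t orig_len, int err, int cap_ack)
{
    size_t payload = sizeof(struct nl_err);   /* error code + original header */

    assert(orig_len >= NL_HDRLEN);
    if (err && !cap_ack)
        payload += orig_len - NL_HDRLEN;      /* echo the original payload too */
    return NL_HDRLEN + NL_ALIGN(payload);
}
```

Without capping, the ACK grows with the request, so a very large request
needs an equally large allocation for its error reply. With capping, the ACK
stays at a fixed 36 bytes in this model, and the sequence number in the echoed
header still identifies the request.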


[PATCH net-next 0/2] OPENVSWITCH !NETFILTER build fix.

2015-08-27 Thread Joe Stringer
Fix issues reported by kbuild test robot:

All error/warnings (new ones prefixed by >>):

   net/openvswitch/actions.c: In function 'ovs_fragment':
>> net/openvswitch/actions.c:705:16: error: implicit declaration of
function 'nf_get_ipv6_ops' [-Werror=implicit-function-declaration]
  const struct nf_ipv6_ops *v6ops = nf_get_ipv6_ops();
   ^
>> net/openvswitch/actions.c:705:37: warning: initialization makes
pointer from integer without a cast
  const struct nf_ipv6_ops *v6ops = nf_get_ipv6_ops();
^
>> net/openvswitch/actions.c:707:19: error: storage size of 'ovs_rt'
isn't known
  struct rt6_info ovs_rt;
  ^
>> net/openvswitch/actions.c:724:8: error: dereferencing pointer to
incomplete type
  v6ops->fragment(skb->sk, skb, ovs_vport_output);
   ^
>> net/openvswitch/actions.c:707:19: warning: unused variable 'ovs_rt'
[-Wunused-variable]
  struct rt6_info ovs_rt;
  ^
   cc1: some warnings being treated as errors

Joe Stringer (2):
  netfilter: Define v6ops in !CONFIG_NETFILTER case.
  openvswitch: Include ip6_fib.h.

 include/linux/netfilter_ipv6.h | 18 +-
 net/openvswitch/actions.c  |  1 +
 2 files changed, 10 insertions(+), 9 deletions(-)

-- 
2.1.4



[PATCH net-next 2/2] openvswitch: Include ip6_fib.h.

2015-08-27 Thread Joe Stringer
kbuild test robot reports that certain configurations will not
automatically pick up on the struct rt6_info definition, so explicitly
include the header for this structure.

Fixes: 7f8a436 openvswitch: Add conntrack action
Signed-off-by: Joe Stringer joestrin...@nicira.com
---
 net/openvswitch/actions.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 736a113..4487543 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -33,6 +33,7 @@
 #include <net/dst.h>
 #include <net/ip.h>
 #include <net/ipv6.h>
+#include <net/ip6_fib.h>
 #include <net/checksum.h>
 #include <net/dsfield.h>
 #include <net/mpls.h>
-- 
2.1.4



[PATCH net-next 1/2] netfilter: Define v6ops in !CONFIG_NETFILTER case.

2015-08-27 Thread Joe Stringer
When CONFIG_OPENVSWITCH is set, and CONFIG_NETFILTER is not set, the
openvswitch IPv6 fragmentation handling cannot refer to ipv6_ops because
it isn't defined. Add a dummy version to avoid #ifdefs in source files.

Fixes: 7f8a436 openvswitch: Add conntrack action
Signed-off-by: Joe Stringer joestrin...@nicira.com
---
 include/linux/netfilter_ipv6.h | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/netfilter_ipv6.h b/include/linux/netfilter_ipv6.h
index 8b7d28f..7715746 100644
--- a/include/linux/netfilter_ipv6.h
+++ b/include/linux/netfilter_ipv6.h
@@ -9,15 +9,6 @@
 
 #include <uapi/linux/netfilter_ipv6.h>
 
-
-#ifdef CONFIG_NETFILTER
-int ip6_route_me_harder(struct sk_buff *skb);
-__sum16 nf_ip6_checksum(struct sk_buff *skb, unsigned int hook,
-   unsigned int dataoff, u_int8_t protocol);
-
-int ipv6_netfilter_init(void);
-void ipv6_netfilter_fini(void);
-
 /*
  * Hook functions for ipv6 to allow xt_* modules to be built-in even
  * if IPv6 is a module.
@@ -30,6 +21,14 @@ struct nf_ipv6_ops {
int (*output)(struct sock *, struct sk_buff *));
 };
 
+#ifdef CONFIG_NETFILTER
+int ip6_route_me_harder(struct sk_buff *skb);
+__sum16 nf_ip6_checksum(struct sk_buff *skb, unsigned int hook,
+   unsigned int dataoff, u_int8_t protocol);
+
+int ipv6_netfilter_init(void);
+void ipv6_netfilter_fini(void);
+
 extern const struct nf_ipv6_ops __rcu *nf_ipv6_ops;
 static inline const struct nf_ipv6_ops *nf_get_ipv6_ops(void)
 {
@@ -39,6 +38,7 @@ static inline const struct nf_ipv6_ops *nf_get_ipv6_ops(void)
 #else /* CONFIG_NETFILTER */
 static inline int ipv6_netfilter_init(void) { return 0; }
 static inline void ipv6_netfilter_fini(void) { return; }
+static inline const struct nf_ipv6_ops *nf_get_ipv6_ops(void) { return NULL; }
 #endif /* CONFIG_NETFILTER */
 
 #endif /*__LINUX_IP6_NETFILTER_H*/
-- 
2.1.4



Re: [PATCH net-next] bnx2x: Add new device ids under the Qlogic vendor

2015-08-27 Thread David Miller
From: Yuval Mintz yuval.mi...@qlogic.com
Date: Thu, 27 Aug 2015 08:03:08 +0300

 This adds support for 3 new PCI device combinations -
 1077:16a1, 1077:16a4 and 1077:16ad.
 
 Signed-off-by: Yuval Mintz yuval.mi...@qlogic.com

Applied, thanks.


Re: [Patch net-next 4/5] net_sched: forbid setting default qdisc to inappropriate ones

2015-08-27 Thread Cong Wang
On Thu, Aug 27, 2015 at 3:42 PM, David Miller da...@davemloft.net wrote:

 Long term it's the wrong fix, trust me.

So do we have a plan to convert some non-defaultable qdiscs to defaultable
ones? I don't see a reason here.


 If you fix it properly, by making every qdisc capable of being ->init()'d
 without explicit parameters, it will be the best behavior overall.

The problem is ->init() is not even called when setting it as default,
since setting a default qdisc doesn't need to create a qdisc. This is
why the flag has to be in ops->flags rather than qdisc->flags.


Re: [PATCH net-next] virtio-net: avoid unnecessary sg initialzation

2015-08-27 Thread David Miller
From: Jason Wang jasow...@redhat.com
Date: Thu, 27 Aug 2015 14:53:06 +0800

 Usually an skb does not have up to MAX_SKB_FRAGS frags. So no need to
 initialize the unused part of sg. This patch initializes the sg based on
 the real number it will use:
 
 - during xmit, it could be inferred from nr_frags and can_push.
 - for small receive buffer, it will also be 2.
 
 Cc: Michael S. Tsirkin m...@redhat.com
 Signed-off-by: Jason Wang jasow...@redhat.com

This looks fine, thanks Jason.
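
As a rough illustration of the counting in the quoted description, the sketch
below computes how many scatterlist entries a transmit would need. It assumes
one entry per fragment, one for the linear data, and one extra for the virtio
header when it cannot be pushed into the linear area; that reading of
can_push is my assumption, and xmit_sg_entries, worst_case_sg_entries and the
MAX_FRAGS value are hypothetical names and numbers, not taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_FRAGS 17u   /* illustrative worst case, a stand-in for MAX_SKB_FRAGS */

/* Entries actually needed for one transmit. */
static unsigned int xmit_sg_entries(unsigned int nr_frags, bool can_push)
{
    assert(nr_frags <= MAX_FRAGS);
    return nr_frags + (can_push ? 1u : 2u);
}

/* Entries touched when the whole table is initialised every time. */
static unsigned int worst_case_sg_entries(void)
{
    return MAX_FRAGS + 2u;
}
```

For a linear skb whose header can be pushed, this model initialises a single
entry instead of the full table; the small receive buffer case in the
description is the fixed two-entry variant (header plus data).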


Re: [RFC PATCH v6 net-next 3/4] tcp: add in_flight to tcp_skb_cb

2015-08-27 Thread Yuchung Cheng
On Thu, Aug 27, 2015 at 3:44 PM, Lawrence Brakmo bra...@fb.com wrote:
 Yuchung, thank you for reviewing these patches. Response inline below.

 On 8/27/15, 3:00 PM, Yuchung Cheng ych...@google.com wrote:

On Tue, Aug 25, 2015 at 4:33 PM, Lawrence Brakmo bra...@fb.com wrote:
 Add in_flight (bytes in flight when packet was sent) field
 to tx component of tcp_skb_cb and make it available to
 congestion modules' pkts_acked() function through the
 ack_sample function argument.

 Signed-off-by: Lawrence Brakmo bra...@fb.com
Acked-by: Yuchung Cheng ych...@google.com

 ---
  include/net/tcp.h | 2 ++
  net/ipv4/tcp_input.c  | 5 -
  net/ipv4/tcp_output.c | 4 +++-
  3 files changed, 9 insertions(+), 2 deletions(-)

 diff --git a/include/net/tcp.h b/include/net/tcp.h
 index a086a98..cdd93e5 100644
 --- a/include/net/tcp.h
 +++ b/include/net/tcp.h
 @@ -757,6 +757,7 @@ struct tcp_skb_cb {
 union {
 struct {
 /* There is space for up to 20 bytes */
 +   __u32 in_flight;/* Bytes in flight when packet sent */
 } tx;   /* only used for outgoing skbs */
 union {
 struct inet_skb_parm    h4;
 @@ -842,6 +843,7 @@ union tcp_cc_info;
  struct ack_sample {
 u32 pkts_acked;
 s32 rtt_us;
 +   u32 in_flight;
  };

  struct tcp_congestion_ops {
 diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
 index f506a0a..338e6bb 100644
 --- a/net/ipv4/tcp_input.c
 +++ b/net/ipv4/tcp_input.c
 @@ -3069,6 +3069,7 @@ static int tcp_clean_rtx_queue(struct sock *sk,
int prior_fackets,
 long ca_rtt_us = -1L;
 struct sk_buff *skb;
 u32 pkts_acked = 0;
 +   u32 last_in_flight = 0;
 bool rtt_update;
 int flag = 0;

 @@ -3108,6 +3109,7 @@ static int tcp_clean_rtx_queue(struct sock *sk,
int prior_fackets,
 if (!first_ackt.v64)
 first_ackt = last_ackt;

 +   last_in_flight = TCP_SKB_CB(skb)->tx.in_flight;
 reord = min(pkts_acked, reord);
 if (!after(scb->end_seq, tp->high_seq))
 flag |= FLAG_ORIG_SACK_ACKED;
 @@ -3197,7 +3199,8 @@ static int tcp_clean_rtx_queue(struct sock *sk,
int prior_fackets,
 }

 if (icsk->icsk_ca_ops->pkts_acked) {
 -   struct ack_sample sample = {pkts_acked, ca_rtt_us};
 +   struct ack_sample sample = {pkts_acked, ca_rtt_us,
 +   last_in_flight};

 icsk->icsk_ca_ops->pkts_acked(sk, &sample);
 }
 diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
 index 444ab5b..244d201 100644
 --- a/net/ipv4/tcp_output.c
 +++ b/net/ipv4/tcp_output.c
 @@ -920,9 +920,12 @@ static int tcp_transmit_skb(struct sock *sk,
struct sk_buff *skb, int clone_it,
 int err;

 BUG_ON(!skb || !tcp_skb_pcount(skb));
 +   tp = tcp_sk(sk);

 if (clone_it) {
 skb_mstamp_get(&skb->skb_mstamp);
 +   TCP_SKB_CB(skb)->tx.in_flight = TCP_SKB_CB(skb)->end_seq
 +   - tp->snd_una;
what if skb is a retransmitted packet? e.g. the first retransmission
in fast recovery would always record an inflight of 1 packet?

 Yes.
 This does not affect NV for 2 reasons: 1) NV does not use ACKs when
 ca_state is not Open or Disorder to determine congestion state, 2) even if
 we used it, the small inflight means that the computed throughput will be
 small so it will not cause a non-congestion signal, but will not cause a
 congestion signal either because NV needs many (~60) measurements before
 determining there is congestion.

 However, other consumers may prefer a different value. From a congestion
 avoidance perspective, it is unclear we will be able to compute an
 accurate throughput when retransmitting, so we may as well give a lower
 bound.
I see. Then this is OK for now since only NV uses it. We can enhance
and track tput even during other CA states later. Would that be a
useful feature for NV as well?


 What do you think?



 if (unlikely(skb_cloned(skb)))
 skb = pskb_copy(skb, gfp_mask);
 @@ -933,7 +936,6 @@ static int tcp_transmit_skb(struct sock *sk, struct
sk_buff *skb, int clone_it,
 }

 inet = inet_sk(sk);
 -   tp = tcp_sk(sk);
 tcb = TCP_SKB_CB(skb);
 memset(&opts, 0, sizeof(opts));

 --
 1.8.1




Re: [RFC PATCH v6 net-next 3/4] tcp: add in_flight to tcp_skb_cb

2015-08-27 Thread Yuchung Cheng
On Thu, Aug 27, 2015 at 3:54 PM, Yuchung Cheng ych...@google.com wrote:
 On Thu, Aug 27, 2015 at 3:44 PM, Lawrence Brakmo bra...@fb.com wrote:
 Yuchung, thank you for reviewing these patches. Response inline below.

 On 8/27/15, 3:00 PM, Yuchung Cheng ych...@google.com wrote:

On Tue, Aug 25, 2015 at 4:33 PM, Lawrence Brakmo bra...@fb.com wrote:
 Add in_flight (bytes in flight when packet was sent) field
 to tx component of tcp_skb_cb and make it available to
 congestion modules' pkts_acked() function through the
 ack_sample function argument.

 Signed-off-by: Lawrence Brakmo bra...@fb.com
 Acked-by: Yuchung Cheng ych...@google.com

 ---
  include/net/tcp.h | 2 ++
  net/ipv4/tcp_input.c  | 5 -
  net/ipv4/tcp_output.c | 4 +++-
  3 files changed, 9 insertions(+), 2 deletions(-)

 diff --git a/include/net/tcp.h b/include/net/tcp.h
 index a086a98..cdd93e5 100644
 --- a/include/net/tcp.h
 +++ b/include/net/tcp.h
 @@ -757,6 +757,7 @@ struct tcp_skb_cb {
 union {
 struct {
 /* There is space for up to 20 bytes */
 +   __u32 in_flight;/* Bytes in flight when packet sent */
 } tx;   /* only used for outgoing skbs */
 union {
 struct inet_skb_parm    h4;
 @@ -842,6 +843,7 @@ union tcp_cc_info;
  struct ack_sample {
 u32 pkts_acked;
 s32 rtt_us;
 +   u32 in_flight;
  };

  struct tcp_congestion_ops {
 diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
 index f506a0a..338e6bb 100644
 --- a/net/ipv4/tcp_input.c
 +++ b/net/ipv4/tcp_input.c
 @@ -3069,6 +3069,7 @@ static int tcp_clean_rtx_queue(struct sock *sk,
int prior_fackets,
 long ca_rtt_us = -1L;
 struct sk_buff *skb;
 u32 pkts_acked = 0;
 +   u32 last_in_flight = 0;
 bool rtt_update;
 int flag = 0;

 @@ -3108,6 +3109,7 @@ static int tcp_clean_rtx_queue(struct sock *sk,
int prior_fackets,
 if (!first_ackt.v64)
 first_ackt = last_ackt;

 +   last_in_flight = TCP_SKB_CB(skb)->tx.in_flight;
 reord = min(pkts_acked, reord);
 if (!after(scb->end_seq, tp->high_seq))
 flag |= FLAG_ORIG_SACK_ACKED;
 @@ -3197,7 +3199,8 @@ static int tcp_clean_rtx_queue(struct sock *sk,
int prior_fackets,
 }

 if (icsk->icsk_ca_ops->pkts_acked) {
 -   struct ack_sample sample = {pkts_acked, ca_rtt_us};
 +   struct ack_sample sample = {pkts_acked, ca_rtt_us,
 +   last_in_flight};

 icsk->icsk_ca_ops->pkts_acked(sk, &sample);
 }
 diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
 index 444ab5b..244d201 100644
 --- a/net/ipv4/tcp_output.c
 +++ b/net/ipv4/tcp_output.c
 @@ -920,9 +920,12 @@ static int tcp_transmit_skb(struct sock *sk,
struct sk_buff *skb, int clone_it,
 int err;

 BUG_ON(!skb || !tcp_skb_pcount(skb));
 +   tp = tcp_sk(sk);

 if (clone_it) {
 skb_mstamp_get(&skb->skb_mstamp);
 +   TCP_SKB_CB(skb)->tx.in_flight = TCP_SKB_CB(skb)->end_seq
 +   - tp->snd_una;
what if skb is a retransmitted packet? e.g. the first retransmission
in fast recovery would always record an inflight of 1 packet?

 Yes.
 This does not affect NV for 2 reasons: 1) NV does not use ACKs when
 ca_state is not Open or Disorder to determine congestion state, 2) even if
 we used it, the small inflight means that the computed throughput will be
 small so it will not cause a non-congestion signal, but will not cause a
 congestion signal either because NV needs many (~60) measurements before
 determining there is congestion.

 However, other consumers may prefer a different value. From a congestion
 avoidance perspective, it is unclear we will be able to compute an
 accurate throughput when retransmitting, so we may as well give a lower
 bound.
 I see. Then this is OK for now since only NV uses it. We can enhance
 and track tput even during other CA states later. Would that be a
 useful feature for NV as well?
For example, we (at Google servers) have seen some flows staying in
very long CA_Recovery due to rate limiter or CA_Disorder state due to
high path reordering. It'd be beneficial to have CC continue to
operate in these circumstances in the future.



 What do you think?



 if (unlikely(skb_cloned(skb)))
 skb = pskb_copy(skb, gfp_mask);
 @@ -933,7 +936,6 @@ static int tcp_transmit_skb(struct sock *sk, struct
sk_buff *skb, int clone_it,
 }

 inet = inet_sk(sk);
 -   tp = tcp_sk(sk);
 tcb = TCP_SKB_CB(skb);
 memset(&opts, 0, sizeof(opts));

 --
 1.8.1
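
The retransmission question raised in this thread is easy to see with
numbers. Below is a small stand-alone sketch of the recorded value, assuming
nothing beyond the expression in the patch (end_seq minus snd_una);
in_flight_at_send is a hypothetical helper name, and the sequence numbers in
the note that follows are made up.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes between the oldest unacknowledged byte and the end of the segment
 * being sent. Unsigned 32-bit subtraction keeps the result correct when the
 * sequence space wraps around.
 */
static uint32_t in_flight_at_send(uint32_t end_seq, uint32_t snd_una)
{
    uint32_t bytes = end_seq - snd_una;

    /* A segment being (re)sent must end beyond snd_una in sequence space. */
    assert((int32_t)bytes > 0);
    return bytes;
}
```

With snd_una at 1000 and 1000-byte segments, the tenth new segment ends at
11000 and records 10000 bytes. Retransmitting the first unacknowledged
segment (ending at 2000) records only 1000 bytes, which is the "inflight of
1 packet" case discussed above, regardless of how much later data is still
outstanding.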



Re: [PATCH net-next] udp_offload: Allow device GRO without checksum-complete

2015-08-27 Thread Ramu Ramamurthy

On 2015-08-24 12:34, Tom Herbert wrote:

This patch adds a sysctl which allows GRO for a UDP offload protocol
to be performed in the device NAPI. This potentially is a performance
improvement if the savings of doing GRO in device NAPI outweighs the
cost of performing the checksum. Note that performing the
checksum in device NAPI may negatively impact latency or throughput
of unrelated flows.

Performance results for VXLAN are below. Allowing GRO in device
NAPI does show performance improvement over doing GRO at the VXLAN
interface, however this performance is still less than what we see
with UDP checksums enabled (or getting checksum complete from the
device).

Test results: Running one netperf TCP_STREAM over VXLAN.

No UDP checksum, enable sysctl to allow GRO at device (this patch)
  TX CPU: 1.71
  RX CPU: 1.14
  6174 Mbps

UDP checksums and remote checksum offload enabled
  TX CPU: 1.97%
  RX CPU: 1.55%
  7527 Mbps

UDP checksums enabled
  TX CPU: 1.22%
  RX CPU: 1.86%
  6539 Mbps

No UDP checksums, GRO enabled on VXLAN interface
  TX CPU: 0.95%
  RX CPU: 1.78%
  4393 Mbps

No UDP checksum, GRO disabled VXLAN interface
  TX CPU: 1.31%
  RX CPU: 2.38%
  3613 Mbps

Signed-off-by: Tom Herbert t...@herbertland.com
---
 Documentation/networking/ip-sysctl.txt | 7 +++
 include/net/udp.h  | 1 +
 net/ipv4/sysctl_net_ipv4.c | 7 +++
 net/ipv4/udp.c | 3 +++
 net/ipv4/udp_offload.c | 7 ---
 5 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt
b/Documentation/networking/ip-sysctl.txt
index 46e88ed..d8563c08 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -711,6 +711,13 @@ udp_wmem_min - INTEGER
total pages of UDP sockets exceed udp_mem pressure. The unit is byte.
Default: 1 page

+udp_gro_nocsum_ok - BOOLEAN
+   If set, allow Generic Receive Offload (GRO) to be performed for UDP
+   offload protocols in the case that packets are being received
+   without an offloaded checksum. This implies that packets checksums
+   may be performed in the device NAPI routines which could negatively
+   impact unrelated flows.
+
 CIPSOv4 Variables:

 cipso_cache_enable - BOOLEAN
diff --git a/include/net/udp.h b/include/net/udp.h
index 6d4ed18..48eb6ae 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -103,6 +103,7 @@ extern atomic_long_t udp_memory_allocated;
 extern long sysctl_udp_mem[3];
 extern int sysctl_udp_rmem_min;
 extern int sysctl_udp_wmem_min;
+extern int sysctl_udp_gro_nocsum_ok;

 struct sk_buff;

diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 0330ab2..65fea78 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -766,6 +766,13 @@ static struct ctl_table ipv4_table[] = {
.proc_handler   = proc_dointvec_minmax,
.extra1 = &one
},
+   {
+   .procname   = "udp_gro_nocsum_ok",
+   .data   = &sysctl_udp_gro_nocsum_ok,
+   .maxlen = sizeof(sysctl_udp_gro_nocsum_ok),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_minmax,
+   },
{ }
 };

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c0a15e7..1d91227 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -130,6 +130,9 @@ EXPORT_SYMBOL(sysctl_udp_wmem_min);
 atomic_long_t udp_memory_allocated;
 EXPORT_SYMBOL(udp_memory_allocated);

+int sysctl_udp_gro_nocsum_ok;
+EXPORT_SYMBOL(sysctl_udp_gro_nocsum_ok);
+
 #define MAX_UDP_PORTS 65536
 #define PORTS_PER_CHAIN (MAX_UDP_PORTS / UDP_HTABLE_SIZE_MIN)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index f938616..1666f44 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -300,9 +300,10 @@ struct sk_buff **udp_gro_receive(struct sk_buff **head, struct sk_buff *skb,
int flush = 1;

if (NAPI_GRO_CB(skb)->udp_mark ||
-   (skb->ip_summed != CHECKSUM_PARTIAL &&
-NAPI_GRO_CB(skb)->csum_cnt == 0 &&
-!NAPI_GRO_CB(skb)->csum_valid))
+   ((skb->ip_summed != CHECKSUM_PARTIAL &&
+ NAPI_GRO_CB(skb)->csum_cnt == 0 &&
+ !NAPI_GRO_CB(skb)->csum_valid) &&
+ !sysctl_udp_gro_nocsum_ok))
goto out;

/* mark that this skb passed once through the udp gro layer */


Thanks for making this configurable. It would help with 10G adapters
(Intel 82599ES, Intel br kx4 dual-port).
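
For readers following the hunk above, the new gating condition can be
sketched as a standalone predicate. This is only an illustration: struct
gro_state and udp_gro_should_skip() are hypothetical stand-ins for the
skb and NAPI_GRO_CB() fields named in the patch, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Checksum states, numbered as in the kernel's skbuff.h. */
enum { CHECKSUM_NONE, CHECKSUM_UNNECESSARY, CHECKSUM_COMPLETE, CHECKSUM_PARTIAL };

/* Hypothetical stand-in for the per-packet state the hunk inspects. */
struct gro_state {
	bool udp_mark;   /* packet already passed through UDP GRO once */
	int  ip_summed;  /* checksum status reported for the packet */
	int  csum_cnt;   /* validated checksums still available */
	bool csum_valid; /* a checksum has already been verified */
};

/* Returns true when UDP GRO should be skipped for this packet. */
static bool udp_gro_should_skip(const struct gro_state *s, bool nocsum_ok)
{
	bool no_offloaded_csum = s->ip_summed != CHECKSUM_PARTIAL &&
				 s->csum_cnt == 0 &&
				 !s->csum_valid;

	/* The sysctl only relaxes the checksum test, never the udp_mark test. */
	return s->udp_mark || (no_offloaded_csum && !nocsum_ok);
}
```

With the sysctl left at 0 the predicate matches the old behaviour; setting
it to 1 lets packets without an offloaded checksum through to GRO.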


--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next v2] net: sched: consolidate tc_classify{,_compat}

2015-08-27 Thread David Miller
From: Daniel Borkmann dan...@iogearbox.net
Date: Thu, 27 Aug 2015 10:11:37 +0200

 For classifiers getting invoked via tc_classify(), we always need an
 extra function call into tc_classify_compat(), as both are being
 exported as symbols and tc_classify() itself doesn't do much except
 handling of reclassifications when tp->classify() returned with
 TC_ACT_RECLASSIFY.
 
 CBQ and ATM are the only qdiscs that directly call into tc_classify_compat(),
 all others use tc_classify(). When tc actions are being configured
 out in the kernel, tc_classify() effectively does nothing besides
 delegating.
 
 We could spare this layer and consolidate both functions. Artificial
 pktgen micro benchmark on single CPU constantly pushing skbs directly
 into the netif_receive_skb() path with a dummy classifier on ingress
 qdisc attached, improves slightly from 22.3Mpps to 23.1Mpps.
 
 Signed-off-by: Daniel Borkmann dan...@iogearbox.net
 Acked-by: Alexei Starovoitov a...@plumgrid.com
 ---
  v1 -> v2:
   - Addressed minor style nits found by Alexei.

Sorry, I applied v1 before seeing this :-/

If you could post a relative patch fixing the style issues, I'd
appreciate it.

Thanks.
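
As a rough picture of what the consolidation in the quoted changelog
amounts to, here is a user-space sketch of a single classify loop that
handles reclassification itself instead of delegating to a second
function. The TC_ACT_* values mirror linux/pkt_cls.h; struct classifier,
classify_chain(), the demo_* callbacks and the loop bound are
hypothetical and only approximate the kernel's structure.

```c
#include <assert.h>
#include <stddef.h>

#define TC_ACT_UNSPEC      (-1)
#define TC_ACT_OK            0
#define TC_ACT_RECLASSIFY    1
#define TC_ACT_SHOT          2
#define DEMO_MAX_RECLASSIFY  4 /* assumed bound on restarts */

/* Hypothetical classifier chain entry. */
struct classifier {
	struct classifier *next;
	int (*classify)(int pkt); /* returns a TC_ACT_* verdict, < 0 for no match */
};

/* Walk the chain once; restart from the head on TC_ACT_RECLASSIFY. */
static int classify_chain(const struct classifier *head, int pkt)
{
	const struct classifier *tp;
	int restarts = 0;

restart:
	for (tp = head; tp; tp = tp->next) {
		int verdict = tp->classify(pkt);

		if (verdict == TC_ACT_RECLASSIFY) {
			/* Drop the packet rather than loop forever. */
			if (++restarts >= DEMO_MAX_RECLASSIFY)
				return TC_ACT_SHOT;
			goto restart;
		}
		if (verdict >= 0)
			return verdict;
	}
	return TC_ACT_UNSPEC;
}

/* Demo callbacks used to exercise the loop. */
static int demo_no_match(int pkt)   { (void)pkt; return TC_ACT_UNSPEC; }
static int demo_accept(int pkt)     { (void)pkt; return TC_ACT_OK; }
static int demo_reclassify(int pkt) { (void)pkt; return TC_ACT_RECLASSIFY; }
```

The point of the sketch is that the reclassify handling lives in the same
loop as the chain walk, so callers pay for one function call, not two.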


Re: [Patch net-next 4/5] net_sched: forbid setting default qdisc to inappropriate ones

2015-08-27 Thread David Miller
From: Cong Wang xiyou.wangc...@gmail.com
Date: Thu, 27 Aug 2015 15:47:55 -0700

 On Thu, Aug 27, 2015 at 3:42 PM, David Miller da...@davemloft.net wrote:
 If you fix it properly, by making every qdisc capable of being ->init()'d
 without explicit parameters, it will be the best behavior overall.
 
 The problem is ->init() is not even called when setting it as default,
 since setting a default qdisc doesn't need to create a qdisc. This is
 why the flag has to be in ops->flags rather than qdisc->flags.

Just sounds like another shortcoming of how default qdiscs are handled,
rather than a reason to not fix things properly.


Re: [PATCH net-next 1/1] lan78xx: Change default internal PHY ID to 1

2015-08-27 Thread David Miller
From: woojung@microchip.com
Date: Thu, 27 Aug 2015 18:01:17 +

 Change default internal PHY ID to 1.
 
 Signed-off-by: Woojung Huh woojung@microchip.com

This doesn't describe the change in enough detail.

Why is this being changed now?  How did the driver work properly with
the previous value?  If it worked previously, what negative things
could possibly happen using the new value?

You are providing zero context, and reasoning, for your change.  That
makes it impossible to review.


Re: [RFC PATCH net-next 0/2] Add new switchdev device class

2015-08-27 Thread David Miller
From: sfel...@gmail.com
Date: Thu, 27 Aug 2015 00:16:44 -0700

 Comments?

No fundamental objections from me.

I just want to reiterate one thing I think Jiri said.

There are other kinds of devices which make up this kind of hierarchy.
I can think of two examples involving bona fide ethernet ports.

1) A top-level parent device provides the resources for all of the RX
   and TX queues, which are allocated and divided by the driver down into
   the ethernet ports below.  Example: niu

2) Cards are going to need more than one PCI-E slot to get all of the
   PCI-E lanes necessary to saturate the link.  Two PCI devices
   show up and need to get probed in this scenario and it would be
   nice to have some object to represent the logical glueing
   together of those two devices.


Re: [patch net-next v2 0/3] mlxsw: small driver update

2015-08-27 Thread David Miller
From: Jiri Pirko j...@resnulli.us
Date: Thu, 27 Aug 2015 17:59:54 +0200

 Ido Schimmel (2):
   mlxsw: Remove duplicate included header
   mlxsw: Make mailboxes 4KB aligned
 
 Jiri Pirko (1):
   mlxsw: adjust transmit fail log message level in __mlxsw_emad_transmit

Series applied, thanks Jiri.


Re: [PATCH net-next 0/2] OPENVSWITCH !NETFILTER build fix.

2015-08-27 Thread David Miller
From: Joe Stringer joestrin...@nicira.com
Date: Thu, 27 Aug 2015 15:25:44 -0700

 Fix issues reported by kbuild test robot:
 
 All error/warnings (new ones prefixed by >>):

Series applied, thanks.


Re: [PATCH net-next] bridge: fdb: rearrange net_bridge_fdb_entry

2015-08-27 Thread David Miller
From: Nikolay Aleksandrov ra...@blackwall.org
Date: Thu, 27 Aug 2015 14:19:20 -0700

 From: Nikolay Aleksandrov niko...@cumulusnetworks.com
 
 While looking into fixing the local entries scalability issue I noticed
 that the structure is badly arranged because vlan_id would fall in a
 second cache line while keeping rcu which is used only when deleting
 in the first, so re-arrange the structure and push rcu to the end so we
 can get 16 bytes which can be used for other fields (by pushing rcu
 fully in the second 64 byte chunk). With this change all the core
 necessary information when doing fdb lookups will be available in a
 single cache line.
 ...
 Signed-off-by: Nikolay Aleksandrov niko...@cumulusnetworks.com

This looks fine, applied, thanks.
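
The cache-line argument in the quoted changelog can be checked with
offsetof(). The sketch below is an assumption-laden model, not the real
struct: field sizes are fixed with stdint types to mimic a 64-bit build,
and the "after" ordering is only one plausible arrangement consistent
with the description (lookup fields first, rcu pushed into the second
64-byte chunk).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64

/* Model of the layout described as "badly arranged". */
struct fdb_before {
	uint64_t hlist[2]; /* stands in for struct hlist_node */
	uint64_t dst;      /* stands in for the port pointer */
	uint64_t rcu[2];   /* stands in for struct rcu_head */
	uint64_t updated;
	uint64_t used;
	uint8_t  addr[6];  /* MAC address */
	uint8_t  flags;
	uint16_t vlan_id;  /* lands at offset 64: second cache line */
};

/* Hypothetical rearrangement: rcu moved to the end, in the next line. */
struct fdb_after {
	uint64_t hlist[2];
	uint64_t dst;
	uint64_t updated;
	uint64_t used;
	uint8_t  addr[6];
	uint16_t vlan_id;  /* now at offset 46: first cache line */
	uint8_t  flags;
	_Alignas(CACHE_LINE) uint64_t rcu[2];
};

/* True when a field of the given size lies entirely in the first line. */
static int in_first_line(size_t offset, size_t size)
{
	return offset + size <= CACHE_LINE;
}
```

In this model the bytes between flags and rcu are free for further
lookup-path fields without touching the second cache line.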


Re: [Patch net-next 4/5] net_sched: forbid setting default qdisc to inappropriate ones

2015-08-27 Thread David Miller
From: Cong Wang xiyou.wangc...@gmail.com
Date: Wed, 26 Aug 2015 15:41:26 -0700

 Currently there is no check for whether a qdisc is appropriate
 to be used as the default qdisc. As a result, we get no
 error even when we set the default qdisc to an inappropriate one;
 an error only shows up later. This is not good.
 
 Also, for qdiscs like HTB, the kernel will just crash when
 we use it as the default qdisc, because some data structures are
 not even initialized yet before checking opt == NULL, so the cleanup
 doing ->reset() or ->destroy() on them will just crash.
 
 Let's fail as early as we can.
 
 Cc: Jamal Hadi Salim j...@mojatatu.com
 Cc: Stephen Hemminger step...@networkplumber.org
 Signed-off-by: Cong Wang xiyou.wangc...@gmail.com

I don't like this.

The situation is that some sophisticated qdiscs can function without
explicit parameters, some cannot.

That is the problem you need to solve.  For example, if opts is NULL
HTB should use a reasonable set of defaults instead of failing.

Furthermore, you can improve the behavior when this happens.

When qdisc_create_dflt() returns NULL because ops->init() fails, do
something reasonable.

I'm not applying this patch series, it papers over the issue rather
than actually addressing it properly.
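
The alternative suggested above could look roughly like the following
user-space sketch. Everything here is hypothetical (demo_opts,
demo_qdisc, the default values, and create_default()); it only
illustrates "use defaults when opt is NULL" and "fall back when init
fails", not the actual sch_htb or qdisc_create_dflt() code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical qdisc parameters and instance. */
struct demo_opts { unsigned int rate2quantum; unsigned int defcls; };
struct demo_qdisc { struct demo_opts cfg; int initialized; };

/* Assumed defaults, chosen only for illustration. */
static const struct demo_opts demo_defaults = { 10, 0 };

typedef int (*init_fn)(struct demo_qdisc *q, const struct demo_opts *opt);

/* Behaviour being criticised: refuse to initialize without parameters. */
static int strict_init(struct demo_qdisc *q, const struct demo_opts *opt)
{
	if (!opt)
		return -EINVAL;
	q->cfg = *opt;
	q->initialized = 1;
	return 0;
}

/* Suggested behaviour: a NULL opt selects a reasonable set of defaults. */
static int lenient_init(struct demo_qdisc *q, const struct demo_opts *opt)
{
	q->cfg = opt ? *opt : demo_defaults;
	q->initialized = 1;
	return 0;
}

/*
 * Create a default qdisc: returns 0 if the preferred init worked, 1 if
 * the fallback had to be used, or a negative errno if both failed.
 */
static int create_default(struct demo_qdisc *q, init_fn preferred,
			  init_fn fallback)
{
	if (preferred(q, NULL) == 0)
		return 0;
	if (fallback(q, NULL) == 0)
		return 1;
	return -EINVAL;
}
```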


Re: [PATCH net-next] bpf: add support for %s specifier to bpf_trace_printk()

2015-08-27 Thread David Miller
From: Alexei Starovoitov a...@plumgrid.com
Date: Wed, 26 Aug 2015 23:26:59 -0700

 +/* similar to strncpy_from_user() but with extra checks */
 +static void probe_read_string(char *buf, int size, long unsafe_ptr)
 +{
 + char dst[4];
 + int i = 0;
 +
 + size--;
 + for (;;) {
 + if (probe_kernel_read(dst, (void *) unsafe_ptr, 4))
 + break;

I don't think this does the right thing when the string length is not a
multiple of 4 and the string ends at the last byte of a page that ends a
valid region of kernel memory.

Seeing this kind of error makes me skeptical of the overall value of
optimizing this :-/
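
The boundary problem can be reproduced in user space. In the sketch
below, struct region and fake_probe_read() are hypothetical stand-ins for
a valid memory range and probe_kernel_read(): a read fails as a whole if
any byte of it falls outside the region. copy_string() then copies a
string using a configurable chunk size (at most 4).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* A valid memory range; anything past base + len faults. */
struct region { const char *base; size_t len; };

/* Fails with -EFAULT unless all n bytes at off lie inside the region. */
static int fake_probe_read(char *dst, const struct region *r,
			   size_t off, size_t n)
{
	if (off > r->len || n > r->len - off)
		return -EFAULT;
	memcpy(dst, r->base + off, n);
	return 0;
}

/*
 * Copy the NUL-terminated string at off into buf, probing `chunk` bytes
 * at a time. Returns the number of characters copied; buf is always
 * NUL-terminated when size >= 1.
 */
static size_t copy_string(char *buf, size_t size, const struct region *r,
			  size_t off, size_t chunk)
{
	char tmp[4];
	size_t n = 0;

	if (size == 0)
		return 0;
	if (chunk == 0 || chunk > sizeof(tmp))
		goto done;

	while (n < size - 1) {
		if (fake_probe_read(tmp, r, off, chunk))
			break; /* the whole chunk is lost, even its valid bytes */
		for (size_t i = 0; i < chunk && n < size - 1; i++) {
			if (tmp[i] == '\0')
				goto done;
			buf[n++] = tmp[i];
		}
		off += chunk;
	}
done:
	buf[n] = '\0';
	return n;
}
```

With a 6-byte region holding "hello" and its terminator, a chunk size of
4 yields "hell" because the second probe crosses the end of the region,
while a chunk size of 1 recovers the full string.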


Re: [Patch net-next 4/5] net_sched: forbid setting default qdisc to inappropriate ones

2015-08-27 Thread Cong Wang
On Thu, Aug 27, 2015 at 3:30 PM, David Miller da...@davemloft.net wrote:
 I don't like this.

 The situation is that some sophisticated qdiscs can function without
 explicit parameters, some cannot.

This is exactly what this patch tries to solve... I already mark those
with a DEFAULTABLE flag.


 That is the problem you need to solve.  For example, if opts is NULL
 HTB should use a reasonable set of defaults instead of failing.

 Furthermore, you can improve the behavior when this happens.

 When qdisc_create_dflt() returns NULL because ops->init() fails, do
 something reasonable.

 I'm not applying this patch series, it papers over the issue rather
 than actually addressing it properly.

I wish I had never mentioned that crash, which led you to think I am trying
to fix a crash rather than a more important issue, usability. See below.

Forget about the crash, consider the current behavior:

# echo htb > default_qdisc
# succeed without any error
(then add a root qdisc and remove it)
# failure shown here in dmesg


And compare it with the behavior after my patch:

# echo htb > default_qdisc
Invalid arguments

I think this is clearly an improvement.
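
For illustration only, the fail-early check described in this message
might be modelled as below. The ops table, the DEMO_F_DEFAULTABLE flag
value, the choice of which qdiscs carry it, and demo_set_default() are
all hypothetical; the sketch just shows the flag living in the ops
structure so it can be tested before any qdisc instance exists.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define DEMO_F_DEFAULTABLE 0x1 /* hypothetical flag bit in ops->flags */

struct demo_qdisc_ops { const char *id; unsigned int flags; };

/* Assumed registration table; the flag assignments are illustrative. */
static const struct demo_qdisc_ops demo_ops[] = {
	{ "pfifo_fast", DEMO_F_DEFAULTABLE },
	{ "fq_codel",   DEMO_F_DEFAULTABLE },
	{ "htb",        0 },
};

static const struct demo_qdisc_ops *demo_default = &demo_ops[0];

/* Reject an unsuitable default at write time, not at first use. */
static int demo_set_default(const char *name)
{
	for (size_t i = 0; i < sizeof(demo_ops) / sizeof(demo_ops[0]); i++) {
		if (strcmp(demo_ops[i].id, name) != 0)
			continue;
		if (!(demo_ops[i].flags & DEMO_F_DEFAULTABLE))
			return -EINVAL; /* reported as "Invalid arguments" */
		demo_default = &demo_ops[i];
		return 0;
	}
	return -ENOENT;
}
```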

Thanks.


Re: [Patch net-next 4/5] net_sched: forbid setting default qdisc to inappropriate ones

2015-08-27 Thread David Miller
From: Cong Wang xiyou.wangc...@gmail.com
Date: Thu, 27 Aug 2015 15:39:12 -0700

 On Thu, Aug 27, 2015 at 3:30 PM, David Miller da...@davemloft.net wrote:
 I don't like this.

 The situation is that some sophisticated qdiscs can function without
 explicit parameters, some cannot.
 
 This is exactly what this patch tries to solve... I already mark those
 with a DEFAULTABLE flag.

It is not solving it. If you were solving it, you would make all qdiscs
capable of being the default instead of giving them what is essentially
a "this is broken" flag.

 I wish I had never mentioned that crash, which led you to think I am trying
 to fix a crash rather than a more important issue, usability. See below.
 
 Forget about the crash, consider the current behavior:
 
 # echo htb > default_qdisc
 # succeed without any error
 (then add a root qdisc and remove it)
 # failure shown here in dmesg
 
 And compare it with the behavior after my patch:
 
 # echo htb > default_qdisc
 Invalid arguments
 
 I think this is clearly an improvement.

Long term it's the wrong fix, trust me.

If you fix it properly, by making every qdisc capable of being ->init()'d
without explicit parameters, it will be the best behavior overall.


Re: [PATCH v4 net-next 0/8] Geneve: Add support for tunnel metadata mode

2015-08-27 Thread David Miller
From: Pravin B Shelar pshe...@nicira.com
Date: Wed, 26 Aug 2015 14:54:31 -0700

 Following patches adds support for Geneve tunnel metadata
 mode. OVS can make use of Geneve net-device with tunnel
 metadata API from kernel.
 
 This also allows us to consolidate Geneve implementation
 from two kernel modules geneve_core and geneve to single
 geneve module. geneve_core module was targeted to share
 Geneve encap and decap code between Geneve netdevice and
 OVS Geneve tunnel implementation, Since OVS no longer
 needs these API, Geneve code can be consolidated into
 single geneve module.
 
 v3-v4:
 - Drop NETIF_F_NETNS_LOCAL feature.
 - Fix geneve device newlink check
 
 v2-v3:
 - make tunnel metadata device and regular device mutually exclusive.
 - Fix Kconfig dependency for Geneve.
 - Fix dst-port netlink encoding.
 - drop changelink patch.
 
 v1-v2:
 - Replaced per hash table tunnel pointer (metadata enabled) with flag.
 - Added support for changelink.
 - Improve geneve device route lookup with more parameters.

Series applied, but I kind of expect some sort of fallout from this
from the 0-day robot or similar :-)

We'll see.


Re: [RFC PATCH v6 net-next 3/4] tcp: add in_flight to tcp_skb_cb

2015-08-27 Thread Lawrence Brakmo
Yuchung, thank you for reviewing these patches. Response inline below.

On 8/27/15, 3:00 PM, Yuchung Cheng ych...@google.com wrote:

On Tue, Aug 25, 2015 at 4:33 PM, Lawrence Brakmo bra...@fb.com wrote:
 Add in_flight (bytes in flight when packet was sent) field
 to tx component of tcp_skb_cb and make it available to
 congestion modules' pkts_acked() function through the
 ack_sample function argument.

 Signed-off-by: Lawrence Brakmo bra...@fb.com
 ---
  include/net/tcp.h | 2 ++
  net/ipv4/tcp_input.c  | 5 -
  net/ipv4/tcp_output.c | 4 +++-
  3 files changed, 9 insertions(+), 2 deletions(-)

 diff --git a/include/net/tcp.h b/include/net/tcp.h
 index a086a98..cdd93e5 100644
 --- a/include/net/tcp.h
 +++ b/include/net/tcp.h
 @@ -757,6 +757,7 @@ struct tcp_skb_cb {
 union {
 struct {
 /* There is space for up to 20 bytes */
 +   __u32 in_flight;    /* Bytes in flight when packet sent */
 } tx;   /* only used for outgoing skbs */
 union {
 struct inet_skb_parm    h4;
 @@ -842,6 +843,7 @@ union tcp_cc_info;
  struct ack_sample {
 u32 pkts_acked;
 s32 rtt_us;
 +   u32 in_flight;
  };

  struct tcp_congestion_ops {
 diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
 index f506a0a..338e6bb 100644
 --- a/net/ipv4/tcp_input.c
 +++ b/net/ipv4/tcp_input.c
 @@ -3069,6 +3069,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 long ca_rtt_us = -1L;
 struct sk_buff *skb;
 u32 pkts_acked = 0;
 +   u32 last_in_flight = 0;
 bool rtt_update;
 int flag = 0;

 @@ -3108,6 +3109,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 if (!first_ackt.v64)
 first_ackt = last_ackt;

 +   last_in_flight = TCP_SKB_CB(skb)->tx.in_flight;
 reord = min(pkts_acked, reord);
 if (!after(scb->end_seq, tp->high_seq))
 flag |= FLAG_ORIG_SACK_ACKED;
 @@ -3197,7 +3199,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 }

 if (icsk->icsk_ca_ops->pkts_acked) {
 -   struct ack_sample sample = {pkts_acked, ca_rtt_us};
 +   struct ack_sample sample = {pkts_acked, ca_rtt_us,
 +   last_in_flight};

 icsk->icsk_ca_ops->pkts_acked(sk, &sample);
 }
 diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
 index 444ab5b..244d201 100644
 --- a/net/ipv4/tcp_output.c
 +++ b/net/ipv4/tcp_output.c
 @@ -920,9 +920,12 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 int err;

 BUG_ON(!skb || !tcp_skb_pcount(skb));
 +   tp = tcp_sk(sk);

 if (clone_it) {
 skb_mstamp_get(&skb->skb_mstamp);
 +   TCP_SKB_CB(skb)->tx.in_flight = TCP_SKB_CB(skb)->end_seq
 +   - tp->snd_una;
what if skb is a retransmitted packet? e.g. the first retransmission
in fast recovery would always record an inflight of 1 packet?

Yes. 
This does not affect NV for 2 reasons: 1) NV does not use ACKs when
ca_state is not Open or Disorder to determine congestion state, 2) even if
we used it, the small inflight means that the computed throughput will be
small so it will not cause a non-congestion signal, but will not cause a
congestion signal either because NV needs many (~60) measurements before
determining there is congestion.

However, other consumers may prefer a different value. From a congestion
avoidance perspective, it is unclear we will be able to compute an
accurate throughput when retransmitting, so we may as well give a lower
bound.

What do you think?
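
To make the retransmission case concrete, the in_flight computation from
the patch (end_seq minus snd_una) can be modelled with plain unsigned
32-bit arithmetic. bytes_in_flight() is a hypothetical helper for this
note, not a kernel function; the sequence numbers in the usage note are
made up.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes between the oldest unacknowledged byte (snd_una) and the end of
 * the segment being sent (end_seq). Unsigned subtraction keeps the
 * result correct across 32-bit sequence number wraparound.
 */
static uint32_t bytes_in_flight(uint32_t end_seq, uint32_t snd_una)
{
	return end_seq - snd_una;
}
```

With snd_una at 1000 and ten 1448-byte segments outstanding, sending the
tenth segment records 14480 bytes, while retransmitting the first one
records only 1448, which is the lower bound discussed above.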
 


 if (unlikely(skb_cloned(skb)))
 skb = pskb_copy(skb, gfp_mask);
 @@ -933,7 +936,6 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 }

 inet = inet_sk(sk);
 -   tp = tcp_sk(sk);
 tcb = TCP_SKB_CB(skb);
 memset(&opts, 0, sizeof(opts));

 --
 1.8.1




[PATCH net-next 0/5 v2] net: Refactor inetpeer cache and add support for VRFs

2015-08-27 Thread David Ahern
Per Dave's comment on the version 1 patch, this series adds VRF support to
the inetpeer cache by explicitly making the address + index a key. The
inetpeer code is refactored in the process; this mostly impacts its use by
tcp_metrics.

David Ahern (5):
  net: Introduce ipv4_addr_hash and use it for tcp metrics
  net: Add set,get helpers for inetpeer addresses
  net: Add helper function to compare inetpeer addresses
  net: Refactor inetpeer address struct
  net: Add support for VRFs to inetpeer cache

 include/net/inetpeer.h | 69 +++---
 include/net/ip.h   |  5 
 net/ipv4/icmp.c|  3 +-
 net/ipv4/inetpeer.c| 20 ++---
 net/ipv4/ip_fragment.c |  3 +-
 net/ipv4/route.c   |  7 +++--
 net/ipv4/tcp_metrics.c | 81 --
 7 files changed, 108 insertions(+), 80 deletions(-)

-- 
1.9.1



[PATCH net-next 5/5] net: Add support for VRFs to inetpeer cache

2015-08-27 Thread David Ahern
inetpeer caches based on address only, so duplicate IP addresses within
a namespace return the same cached entry. Enhance the ipv4 address key
to contain both the IPv4 address and VRF device index.

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/net/inetpeer.h | 17 -
 net/ipv4/icmp.c|  3 ++-
 net/ipv4/ip_fragment.c |  3 ++-
 net/ipv4/route.c   |  7 +--
 4 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h
index e34f98aa93b1..5ec3827b514d 100644
--- a/include/net/inetpeer.h
+++ b/include/net/inetpeer.h
@@ -15,11 +15,17 @@
 #include <net/ipv6.h>
 #include <linux/atomic.h>
 
+/* IPv4 address key for cache lookups */
+struct ipv4_addr_key {
+   __be32  addr;
+   int vif;
+};
+
 #define INETPEER_MAXKEYSZ   (sizeof(struct in6_addr) / sizeof(u32))
 
 struct inetpeer_addr {
union {
-   __be32  a4;
+   struct ipv4_addr_key    a4;
struct in6_addr a6;
u32 key[INETPEER_MAXKEYSZ];
};
@@ -71,13 +77,13 @@ void inet_initpeers(void) __init;
 
 static inline void inetpeer_set_addr_v4(struct inetpeer_addr *iaddr, __be32 ip)
 {
-   iaddr->a4 = ip;
+   iaddr->a4.addr = ip;
iaddr->family = AF_INET;
 }
 
 static inline __be32 inetpeer_get_addr_v4(struct inetpeer_addr *iaddr)
 {
-   return iaddr->a4;
+   return iaddr->a4.addr;
 }
 
 static inline void inetpeer_set_addr_v6(struct inetpeer_addr *iaddr,
@@ -99,11 +105,12 @@ struct inet_peer *inet_getpeer(struct inet_peer_base *base,
 
 static inline struct inet_peer *inet_getpeer_v4(struct inet_peer_base *base,
__be32 v4daddr,
-   int create)
+   int vif, int create)
 {
struct inetpeer_addr daddr;
 
-   daddr.a4 = v4daddr;
+   daddr.a4.addr = v4daddr;
+   daddr.a4.vif = vif;
daddr.family = AF_INET;
return inet_getpeer(base, &daddr, create);
 }
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index f16488efa1c8..79fe05befcae 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -309,9 +309,10 @@ static bool icmpv4_xrlim_allow(struct net *net, struct rtable *rt,
 
rc = false;
if (icmp_global_allow()) {
+   int vif = vrf_master_ifindex(dst->dev);
struct inet_peer *peer;
 
-   peer = inet_getpeer_v4(net->ipv4.peers, fl4->daddr, 1);
+   peer = inet_getpeer_v4(net->ipv4.peers, fl4->daddr, vif, 1);
rc = inet_peer_xrlim_allow(peer,
   net->ipv4.sysctl_icmp_ratelimit);
if (peer)
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 15762e758861..fa7f15305f9a 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -151,7 +151,8 @@ static void ip4_frag_init(struct inet_frag_queue *q, const void *a)
qp->vif = arg->vif;
qp->user = arg->user;
qp->peer = sysctl_ipfrag_max_dist ?
-   inet_getpeer_v4(net->ipv4.peers, arg->iph->saddr, 1) : NULL;
+   inet_getpeer_v4(net->ipv4.peers, arg->iph->saddr, arg->vif, 1) :
+   NULL;
 }
 
 static void ip4_frag_free(struct inet_frag_queue *q)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index f3087aaa6dd8..6b91879e9cbe 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -838,6 +838,7 @@ void ip_rt_send_redirect(struct sk_buff *skb)
struct inet_peer *peer;
struct net *net;
int log_martians;
+   int vif;
 
rcu_read_lock();
in_dev = __in_dev_get_rcu(rt->dst.dev);
@@ -846,10 +847,11 @@ void ip_rt_send_redirect(struct sk_buff *skb)
return;
}
log_martians = IN_DEV_LOG_MARTIANS(in_dev);
+   vif = vrf_master_ifindex_rcu(rt->dst.dev);
rcu_read_unlock();
 
net = dev_net(rt->dst.dev);
-   peer = inet_getpeer_v4(net->ipv4.peers, ip_hdr(skb)->saddr, 1);
+   peer = inet_getpeer_v4(net->ipv4.peers, ip_hdr(skb)->saddr, vif, 1);
if (!peer) {
icmp_send(skb, ICMP_REDIRECT, ICMP_REDIR_HOST,
  rt_nexthop(rt, ip_hdr(skb)->daddr));
@@ -938,7 +940,8 @@ static int ip_error(struct sk_buff *skb)
break;
}
 
-   peer = inet_getpeer_v4(net->ipv4.peers, ip_hdr(skb)->saddr, 1);
+   peer = inet_getpeer_v4(net->ipv4.peers, ip_hdr(skb)->saddr,
+  vrf_master_ifindex(skb->dev), 1);
 
send = true;
if (peer) {
-- 
1.9.1
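
As a small illustration of the keying change in this patch, the sketch
below compares two lookups that share an IPv4 address but arrive via
different VRF devices. struct demo_v4_key mirrors the shape of the
patch's struct ipv4_addr_key, but demo_key_cmp() is a hypothetical
comparison written for this note; the real cache compares keys through
its own helper.

```c
#include <assert.h>
#include <stdint.h>

/* Same shape as the patch's ipv4_addr_key: address plus VRF ifindex. */
struct demo_v4_key {
	uint32_t addr; /* IPv4 address (network byte order in the kernel) */
	int      vif;  /* VRF master device index, 0 when not in a VRF */
};

/* Orders keys by address first, then by VRF index; 0 means equal. */
static int demo_key_cmp(const struct demo_v4_key *a,
			const struct demo_v4_key *b)
{
	if (a->addr != b->addr)
		return a->addr < b->addr ? -1 : 1;
	if (a->vif != b->vif)
		return a->vif < b->vif ? -1 : 1;
	return 0;
}
```

Two keys with the same address and different vif values no longer compare
equal, so duplicate addresses in separate VRFs get separate cache entries.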


