[PATCH iproute2 resend] Fix changing tunnel remote and local address to any

2015-06-04 Thread Thadeu Lima de Souza Cascardo
If a tunnel is created with a local address, you can't change it to any.

 # ip tunnel add tunl1 mode ipip remote 10.16.42.37 local 10.16.42.214 ttl 64
 # ip tunnel show tunl1
 tunl1: ip/ip  remote 10.16.42.37  local 10.16.42.214  ttl 64
 # ip tunnel change tunl1 local any
 # echo $?
 0
 # ip tunnel show tunl1
 tunl1: ip/ip  remote 10.16.42.37  local 10.16.42.214  ttl 64

It happens that parse_args zeroes ip_tunnel_parm, and when creating the
tunnel, it is OK to leave it as is if the address is any. However, when
changing the tunnel, the current parameters will be read from
ip_tunnel_parm, and local and remote address won't be zeroes anymore, so
it needs to be explicitly set to any.

Signed-off-by: Thadeu Lima de Souza Cascardo casca...@redhat.com
---

Resending because it was probably lost in patchwork.

---
 ip/iptunnel.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/ip/iptunnel.c b/ip/iptunnel.c
index be84b83..78fa988 100644
--- a/ip/iptunnel.c
+++ b/ip/iptunnel.c
@@ -167,10 +167,14 @@ static int parse_args(int argc, char **argv, int cmd, 
struct ip_tunnel_parm *p)
NEXT_ARG();
if (strcmp(*argv, any))
p-iph.daddr = get_addr32(*argv);
+   else
+   p-iph.daddr = htonl(INADDR_ANY);
} else if (strcmp(*argv, local) == 0) {
NEXT_ARG();
if (strcmp(*argv, any))
p-iph.saddr = get_addr32(*argv);
+   else
+   p-iph.saddr = htonl(INADDR_ANY);
} else if (strcmp(*argv, dev) == 0) {
NEXT_ARG();
strncpy(medium, *argv, IFNAMSIZ-1);
-- 
2.4.1
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH iproute2] Fix changing tunnel remote and local address to any

2015-05-28 Thread Thadeu Lima de Souza Cascardo
If a tunnel is created with a local address, you can't change it to any.

 # ip tunnel add tunl1 mode ipip remote 10.16.42.37 local 10.16.42.214 ttl 64
 # ip tunnel show tunl1
 tunl1: ip/ip  remote 10.16.42.37  local 10.16.42.214  ttl 64
 # ip tunnel change tunl1 local any
 # echo $?
 0
 # ip tunnel show tunl1
 tunl1: ip/ip  remote 10.16.42.37  local 10.16.42.214  ttl 64

It happens that parse_args zeroes ip_tunnel_parm, and when creating the
tunnel, it is OK to leave it as is if the address is any. However, when
changing the tunnel, the current parameters will be read from
ip_tunnel_parm, and local and remote address won't be zeroes anymore, so
it needs to be explicitly set to any.

Signed-off-by: Thadeu Lima de Souza Cascardo casca...@redhat.com
---
 ip/iptunnel.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/ip/iptunnel.c b/ip/iptunnel.c
index be84b83..78fa988 100644
--- a/ip/iptunnel.c
+++ b/ip/iptunnel.c
@@ -167,10 +167,14 @@ static int parse_args(int argc, char **argv, int cmd, 
struct ip_tunnel_parm *p)
NEXT_ARG();
if (strcmp(*argv, any))
p-iph.daddr = get_addr32(*argv);
+   else
+   p-iph.daddr = htonl(INADDR_ANY);
} else if (strcmp(*argv, local) == 0) {
NEXT_ARG();
if (strcmp(*argv, any))
p-iph.saddr = get_addr32(*argv);
+   else
+   p-iph.saddr = htonl(INADDR_ANY);
} else if (strcmp(*argv, dev) == 0) {
NEXT_ARG();
strncpy(medium, *argv, IFNAMSIZ-1);
-- 
2.4.1

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net] bridge: fix br_multicast_query_expired() bug

2015-05-28 Thread Thadeu Lima de Souza Cascardo
On Thu, May 28, 2015 at 04:42:54AM -0700, Eric Dumazet wrote:
 From: Eric Dumazet eduma...@google.com
 
 br_multicast_query_expired() querier argument is a pointer to
 a struct bridge_mcast_querier :
 
 struct bridge_mcast_querier {
 struct br_ip addr;
 struct net_bridge_port __rcu*port;
 };
 
 Intent of the code was to clear port field, not the pointer to querier.
 
 Fixes: 2cd4143192e8 (bridge: memorize and export selected IGMP/MLD querier 
 port)
 Signed-off-by: Eric Dumazet eduma...@google.com
 Cc: Linus Lüssing linus.luess...@web.de
 Cc: Steinar H. Gunderson se...@samfundet.no
 ---
  net/bridge/br_multicast.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
 index a3abe6ed111e..22fd0419b314 100644
 --- a/net/bridge/br_multicast.c
 +++ b/net/bridge/br_multicast.c
 @@ -1822,7 +1822,7 @@ static void br_multicast_query_expired(struct 
 net_bridge *br,
   if (query-startup_sent  br-multicast_startup_query_count)
   query-startup_sent++;
  
 - RCU_INIT_POINTER(querier, NULL);
 + RCU_INIT_POINTER(querier-port, NULL);
   br_multicast_send_query(br, NULL, query);
   spin_unlock(br-multicast_lock);
  }
 
 
 
 --

Acked-by: Thadeu Lima de Souza Cascardo casca...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] bridge: fix parsing of MLDv2 reports

2015-05-22 Thread Thadeu Lima de Souza Cascardo
When more than a multicast address is present in a MLDv2 report, all but
the first address is ignored, because the code breaks out of the loop if
there has not been an error adding that address.

This has caused failures when two guests connected through the bridge
tried to communicate using IPv6. Neighbor discoveries would not be
transmitted to the other guest when both used a link-local address and a
static address.

This only happens when there is a MLDv2 querier in the network.

The fix will only break out of the loop when there is a failure adding a
multicast address.

The mdb before the patch:

dev ovirtmgmt port vnet0 grp ff02::1:ff7d:6603 temp
dev ovirtmgmt port vnet1 grp ff02::1:ff7d:6604 temp
dev ovirtmgmt port bond0.86 grp ff02::2 temp

After the patch:

dev ovirtmgmt port vnet0 grp ff02::1:ff7d:6603 temp
dev ovirtmgmt port vnet1 grp ff02::1:ff7d:6604 temp
dev ovirtmgmt port bond0.86 grp ff02::fb temp
dev ovirtmgmt port bond0.86 grp ff02::2 temp
dev ovirtmgmt port bond0.86 grp ff02::d temp
dev ovirtmgmt port vnet0 grp ff02::1:ff00:76 temp
dev ovirtmgmt port bond0.86 grp ff02::16 temp
dev ovirtmgmt port vnet1 grp ff02::1:ff00:77 temp
dev ovirtmgmt port bond0.86 grp ff02::1:ff00:def temp
dev ovirtmgmt port bond0.86 grp ff02::1:ffa1:40bf temp

Reported-by: Rik Theys rik.th...@esat.kuleuven.be
Signed-off-by: Thadeu Lima de Souza Cascardo casca...@redhat.com
Tested-by: Rik Theys rik.th...@esat.kuleuven.be
---
 net/bridge/br_multicast.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 4b6722f..a3abe6e 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -1072,7 +1072,7 @@ static int br_ip6_multicast_mld2_report(struct net_bridge 
*br,
 
err = br_ip6_multicast_add_group(br, port, grec-grec_mca,
 vid);
-   if (!err)
+   if (err)
break;
}
 
-- 
2.4.1

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: veths often slow to come up

2015-08-05 Thread Thadeu Lima de Souza Cascardo
On Tue, Aug 04, 2015 at 08:26:28PM -0700, Cong Wang wrote:
 (Cc'ing netdev for network issues)
 
 On Tue, Aug 4, 2015 at 6:42 AM, Shaun Crampton
 shaun.cramp...@metaswitch.com wrote:
  Please CC me on any responses, thanks.
 
  Setting both ends of a veth to be oper UP completes very quickly but I
  find that pings only start flowing over the veth after about a second.
  This seems to correlate with the NO-CARRIER flag being set or the
  interface being in state UNKNOWN or state DOWN² for about a second
  (demo script below).
 
  If I run the script repeatedly then sometimes it completes very quickly on
  subsequent runs as if there¹s a hot cache somewhere.
 
  Could this be a bug or is there a configuration to speed this up?  Seems
  odd that it¹s almost exactly 1s on the first run.
 
  Seen on these kernels:
  * 3.13.0-57-generic #95-Ubuntu SMP Fri Jun 19 09:28:15 UTC 2015 x86_64
  x86_64 x86_64 GNU/Linux
  * 4.0.9-coreos #2 SMP Thu Jul 30 01:07:55 UTC 2015 x86_64 Intel(R) Xeon(R)
  CPU @ 2.50GHz GenuineIntel GNU/Linux
 
  Regards,
 
  -Shaun
 

Take a look at linkwatch_urgent_event at net/core/link_watch.c, and all of
link_watch.c in general. That's where the 1s delay comes from. It's designed to
prevent link message storms.

In particular, look at commit 294cc44b7e48a6e7732499eebcf409b231460d8e, which
added the urgent event.

I suspect this was designed to workaround buggy drivers/hardware, not to help
userspace handle thousands of virtual devices being created and destroyed all
the time.

Maybe virtual devices should be whitelisted here? Maybe the patch below is
stupid, because drivers may abuse it, and drivers are buggy, otherwise linkwatch
would not be needed in the first place.

Regards.
Cascardo.

 
  Running my test script below (Assumes veth0/1 do not already exist):
 
  $ sudo ./veth-test.sh
  Time to create veth:
 
  real0m0.019s
  user0m0.002s
  sys 0m0.010s
 
  Time to wait for carrier:
 
  real0m1.005s
  user0m0.007s
  sys 0m0.123s
 
 
 
  # veth-test.sh
 
  #!/bin/bash
  function create_veth {
ip link add type veth
ip link set veth0 up
ip link set veth1 up
  }
  function wait_for_carrier {
while ! ip link show | grep -qE 'veth[01]';
do
  sleep 0.05
done
while ip link show | grep -E 'veth[01]¹ | \
  grep -Eq 'NO-CARRIER|state DOWN|state UNKNOWN';
do
  sleep 0.05
done
  }
  echo Time to create veth:
  time create_veth
  echo
  echo Time to wait for carrier:
  time wait_for_carrier
  ip link del veth0
---
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 343592c..91123a8 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -306,6 +306,7 @@ static void veth_setup(struct net_device *dev)
 
dev-priv_flags = ~IFF_TX_SKB_SHARING;
dev-priv_flags |= IFF_LIVE_ADDR_CHANGE;
+   dev-priv_flags |= IFF_LINKWATCH_URGENT;
 
dev-netdev_ops = veth_netdev_ops;
dev-ethtool_ops = veth_ethtool_ops;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 607b5f4..138f5e9 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1262,6 +1262,7 @@ struct net_device_ops {
  * @IFF_LIVE_ADDR_CHANGE: device supports hardware address
  * change when it's running
  * @IFF_MACVLAN: Macvlan device
+ * @IFF_LINKWATCH_URGENT: device does not flood with link updates
  */
 enum netdev_priv_flags {
IFF_802_1Q_VLAN = 10,
@@ -1289,6 +1290,7 @@ enum netdev_priv_flags {
IFF_XMIT_DST_RELEASE_PERM   = 122,
IFF_IPVLAN_MASTER   = 123,
IFF_IPVLAN_SLAVE= 124,
+   IFF_LINKWATCH_URGENT= 125,
 };
 
 #define IFF_802_1Q_VLANIFF_802_1Q_VLAN
diff --git a/net/core/link_watch.c b/net/core/link_watch.c
index 9828616..e2957a0 100644
--- a/net/core/link_watch.c
+++ b/net/core/link_watch.c
@@ -95,6 +95,9 @@ static bool linkwatch_urgent_event(struct net_device *dev)
if (dev-priv_flags  IFF_TEAM_PORT)
return true;
 
+   if (dev-priv_flags  IFF_LINKWATCH_URGENT)
+   return true;
+
return netif_carrier_ok(dev)  qdisc_tx_changing(dev);
 }
--- 
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IGMP: Inhibit reports for local multicast groups

2015-08-13 Thread Thadeu Lima de Souza Cascardo
On Thu, Aug 13, 2015 at 07:01:37PM +0200, Andrew Lunn wrote:
 On Thu, Aug 13, 2015 at 04:52:32PM +, Philip Downey wrote:
  Hi Andrew
  IGMP snooping is designed to prevent hosts on a local network from 
  receiving traffic for a multicast group they have not explicitly joined.   
  Link-Local multicast traffic should not have an IGMP client since it is 
  reserved for routing protocols.  One would expect that IGMP snooping needs 
  to ignore local multicast traffic in the reserved range intended for 
  routers since there should be no IGMP client to make join requests.
 
 The point of this patch is that Linux is sending out group membership
 for these addresses, it is acting as a client. What happens with a
 switch which is applying IGMP snooping to link-local multicast groups?
 You turn on this feature, and you no longer get your routing protocol
 messages.
 
 I had a quick look at RFC 3376. The only mention i spotted for not
 sending IGMP messages is:
 
The all-systems multicast address, 224.0.0.1, is handled as a special
case.  On all systems -- that is all hosts and routers, including
multicast routers -- reception of packets destined to the all-systems
multicast address, from all sources, is permanently enabled on all
interfaces on which multicast reception is supported.  No IGMP
messages are ever sent regarding the all-systems multicast address.
 
 IGMP v2 has something similar:
 
The all-systems group (address 224.0.0.1) is handled as a special
case.  The host starts in Idle Member state for that group on every
interface, never transitions to another state, and never sends a
report for that group.
 
 But i did not find anything which says all other link-local addresses
 don't need member reports. Did i miss something?
 
   Andrew

From RFC 4541 (Considerations for Internet Group Management Protocol (IGMP) and
Multicast Listener Discovery (MLD) Snooping Switches):

 2) Packets with a destination IP (DIP) address in the 224.0.0.X range
  which are not IGMP must be forwarded on all ports.

  This recommendation is based on the fact that many host systems do
  not send Join IP multicast addresses in this range before sending
  or listening to IP multicast packets.  Furthermore, since the
  224.0.0.X address range is defined as link-local (not to be
  routed), it seems unnecessary to keep the state for each address
  in this range.  Additionally, some routers operate in the
  224.0.0.X address range without issuing IGMP Joins, and these
  applications would break if the switch were to prune them due to
  not having seen a Join Group message from the router.

So, it looks like some hosts and routers out there in the field do not send
joins for those local addresses. In fact, IPv4 local multicast addresses are
ignored when Linux bridge multicast snooping adds a new group.

static int br_ip4_multicast_add_group(struct net_bridge *br,
...
if (ipv4_is_local_multicast(group))
return 0;

Cascardo.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next] cxgb4: continue in debug mode, if probe fails

2015-08-27 Thread Thadeu Lima de Souza Cascardo
On Wed, Aug 26, 2015 at 10:30:35PM +0530, Hariprasad Shenai wrote:
 If adapter is flashed with incorrect firmware, probe can fail.
 If probe fails, continue in debug mode, so one can also use the debug
 interface to update the firmware via ethtool.
 
 Signed-off-by: Hariprasad Shenai haripra...@chelsio.com

What do you mean by incorrect firmware? I know the driver can cope with older
firmware if force_old_init is used, for example. Isn't it possible to detect
those old firmware versions and do the same as force_old_init does?

Cascardo.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] net-sysfs: get_netdev_queue_index() cleanup

2015-09-11 Thread Thadeu Lima de Souza Cascardo
Redo commit ed1acc8cd8c22efa919da8d300bab646e01c2dce.

Commit 822b3b2ebfff8e9b3d006086c527738a7ca00cd0 ("net: Add max rate tx queue
attribute") moved get_netdev_queue_index around, but kept the old version.
Probably because of a reuse of the original patch from before Eric's change to
that function.

Remove one inline keyword, and no need for a loop to find
an index into a table.

Signed-off-by: Thadeu Lima de Souza Cascardo <casca...@redhat.com>
Fixes: 822b3b2ebfff8e9b3d006086c527738a7ca00cd0
Cc: Or Gerlitz <ogerl...@mellanox.com>
Cc: John Fastabend <john.r.fastab...@intel.com>
Cc: Eric Dumazet <eduma...@google.com>
---

Not sure what is the best way to credit Eric Dumazet here. I assume he will add
any appropriate tags.

---
 net/core/net-sysfs.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index b279077..49b5990 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1004,15 +1004,12 @@ static ssize_t show_trans_timeout(struct netdev_queue 
*queue,
 }
 
 #ifdef CONFIG_XPS
-static inline unsigned int get_netdev_queue_index(struct netdev_queue *queue)
+static unsigned int get_netdev_queue_index(struct netdev_queue *queue)
 {
struct net_device *dev = queue->dev;
-   int i;
-
-   for (i = 0; i < dev->num_tx_queues; i++)
-   if (queue == >_tx[i])
-   break;
+   unsigned int i;
 
+   i = queue - dev->_tx;
BUG_ON(i >= dev->num_tx_queues);
 
return i;
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] net-sysfs: get_netdev_queue_index() cleanup

2015-09-15 Thread Thadeu Lima de Souza Cascardo
Redo commit ed1acc8cd8c22efa919da8d300bab646e01c2dce.

Commit 822b3b2ebfff8e9b3d006086c527738a7ca00cd0 ("net: Add max rate tx queue
attribute") moved get_netdev_queue_index around, but kept the old version.
Probably because of a reuse of the original patch from before Eric's change to
that function.

Remove one inline keyword, and no need for a loop to find
an index into a table.

Signed-off-by: Thadeu Lima de Souza Cascardo <casca...@redhat.com>
Fixes: 822b3b2ebfff ("net: Add max rate tx queue attribute")
Acked-by:  Or Gerlitz <ogerl...@mellanox.com>
Acked-by: John Fastabend <john.r.fastab...@intel.com>
Cc: Eric Dumazet <eduma...@google.com>
---
 net/core/net-sysfs.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index b279077..49b5990 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1004,15 +1004,12 @@ static ssize_t show_trans_timeout(struct netdev_queue 
*queue,
 }
 
 #ifdef CONFIG_XPS
-static inline unsigned int get_netdev_queue_index(struct netdev_queue *queue)
+static unsigned int get_netdev_queue_index(struct netdev_queue *queue)
 {
struct net_device *dev = queue->dev;
-   int i;
-
-   for (i = 0; i < dev->num_tx_queues; i++)
-   if (queue == >_tx[i])
-   break;
+   unsigned int i;
 
+   i = queue - dev->_tx;
BUG_ON(i >= dev->num_tx_queues);
 
return i;
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net] lwtunnel: remove source and destination UDP port config option

2015-09-24 Thread Thadeu Lima de Souza Cascardo
On Thu, Sep 24, 2015 at 10:43:07AM +0200, Jiri Benc wrote:
> The UDP tunnel config is asymmetric wrt. to the ports used. The source and
> destination ports from one direction of the tunnel are not related to the
> ports of the other direction. We need to be able to respond to ARP requests
> using the correct ports without involving routing.
> 
> As the consequence, UDP ports need to be fixed property of the tunnel
> interface and cannot be set per route. Remove the ability to set ports per
> route. This is still okay to do, as no kernel has been released with these
> attributes yet.
> 
> Note that the ability to specify source and destination ports is preserved
> for other users of the lwtunnel API which don't use routes for tunnel key
> specification (like openvswitch).
> 
> If in the future we rework ARP handling to allow port specification, the
> attributes can be added back.
> 
> Signed-off-by: Jiri Benc <jb...@redhat.com>
> Acked-by: Thomas Graf <tg...@suug.ch>
> ---
> Resending what was formerly patch 2/2 of the "lwtunnel: make it really work,
> for IPv4" series, without change. This patch can be applied standalone and
> is rather important before the uAPI is set in stone.
> ---

Acked-by: Thadeu Lima de Souza Cascardo <casca...@redhat.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-01-06 Thread Thadeu Lima de Souza Cascardo
On Wed, Jan 06, 2016 at 11:11:41PM +0300, Konstantin Khlebnikov wrote:
> On Wed, Jan 6, 2016 at 10:59 PM, Cong Wang  wrote:
> > On Wed, Jan 6, 2016 at 11:15 AM, Konstantin Khlebnikov  
> > wrote:
> >> Looks like this happens because ip_options_fragment() relies on
> >> correct ip options length in ip control block in skb. But in
> >> ip_finish_output_gso() control block in segments is reused by
> >> skb_gso_segment(). following ip_fragment() sees some garbage.
> >>
> >> In my case there was no ip options but length becomes non-zero and
> >> ip_options_fragment() picked some bytes from payload and decides to
> >> fill huge range with IPOPT_NOOP (1). One of that ones flipped nr_frags
> >> in skb_shared_info at the end of data =)
> >>
> >
> > Hmm, it looks like SKB_GSO_CB should be cleared after skb_gso_segment()
> > since all the gso information should be saved in shared_info after it 
> > finishes.
> >
> > Does a memset(0) on SKB_GSO_CB after skb_gso_segment() work as well?
> 
> This will break present logic around ip_options_fragment() - it clears
> options from
> second and following fragments. With zeroed cb it will do nothing.
> 
> ip_options_fragment() can get required information directly from ip header but
> it also resets fields in IPCB -- probably it should stay valid here
> and somebody else will use it later.
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

I have hit this as well, this fixes it for me on an older kernel. Can you try it
on latest kernel?

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index d8a1745..f44bc91 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -216,6 +216,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
netdev_features_t features;
struct sk_buff *segs;
int ret = 0;
+   struct inet_skb_parm ipcb;
 
if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
return ip_finish_output2(skb);
@@ -227,6 +228,10 @@ static int ip_finish_output_gso(struct sk_buff *skb)
 * 2) skb arrived via virtio-net, we thus get TSO/GSO skbs directly
 * from host network stack.
 */
+   /* We need to save IPCB here because skb_gso_segment will use
+* SKB_GSO_CB.
+*/
+   ipcb = *IPCB(skb);
features = netif_skb_features(skb);
segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
if (IS_ERR_OR_NULL(segs)) {
@@ -241,6 +246,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
int err;
 
segs->next = NULL;
+   *IPCB(segs) = ipcb;
err = ip_fragment(segs, ip_finish_output2);
 
if (err && ret == 0)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next v2] ip_tunnel: Move stats update to iptunnel_xmit()

2015-12-24 Thread Thadeu Lima de Souza Cascardo
On Thu, Dec 24, 2015 at 10:21:27AM +0100, Nicolas Dichtel wrote:
> Le 24/12/2015 00:52, Pravin B Shelar a écrit :
> [snip]
> >diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c
> >index 6af78c6..d63a911 100644
> >--- a/net/tipc/udp_media.c
> >+++ b/net/tipc/udp_media.c
> >@@ -182,15 +182,9 @@ static int tipc_udp_send_msg(struct net *net, struct 
> >sk_buff *skb,
> > goto tx_error;
> > }
> > ttl = ip4_dst_hoplimit(>dst);
> >-err = udp_tunnel_xmit_skb(rt, ub->ubsock->sk, skb,
> >-  src->ipv4.s_addr,
> >-  dst->ipv4.s_addr, 0, ttl, 0,
> >-  src->udp_port, dst->udp_port,
> >-  false, true);
> >-if (err < 0) {
> >-ip_rt_put(rt);
> >-goto tx_error;
> >-}
> >+udp_tunnel_xmit_skb(rt, ub->ubsock->sk, skb, src->ipv4.s_addr,
> >+dst->ipv4.s_addr, 0, ttl, 0, src->udp_port,
> >+dst->udp_port, false, true);
> I don't know how tipc works, but this change is clearly suspect. What make the
> error path not needed anymore after your patch?

I looked into it as well. As far as I see, err could only be positive or 0, so
if there is a tipc bug here, Pravin's patch introduces no regression. Or did I
fail to see how udp_tunnel_xmit_skb could return a negative value?

Cascardo.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2 1/7] ppc bpf/jit: Disable classic BPF JIT on ppc64le

2016-06-22 Thread Thadeu Lima de Souza Cascardo
On Wed, Jun 22, 2016 at 09:55:01PM +0530, Naveen N. Rao wrote:
> Classic BPF JIT was never ported completely to work on little endian
> powerpc. However, it can be enabled and will crash the system when used.
> As such, disable use of BPF JIT on ppc64le.

Thanks, Naveen.

Acked-by: Thadeu Lima de Souza Cascardo <casca...@redhat.com>

> 
> Cc: sta...@vger.kernel.org
> Cc: Matt Evans <m...@ozlabs.org>
> Cc: Denis Kirjanov <k...@linux-powerpc.org>
> Cc: Michael Ellerman <m...@ellerman.id.au>
> Cc: Paul Mackerras <pau...@samba.org>
> Cc: Alexei Starovoitov <a...@fb.com>
> Cc: Daniel Borkmann <dan...@iogearbox.net>
> Cc: "David S. Miller" <da...@davemloft.net>
> Cc: Ananth N Mavinakayanahalli <ana...@in.ibm.com>
> Cc: Thadeu Lima de Souza Cascardo <casca...@redhat.com>
> Reported-by: Thadeu Lima de Souza Cascardo <casca...@redhat.com>
> Signed-off-by: Naveen N. Rao <naveen.n@linux.vnet.ibm.com>
> ---
>  arch/powerpc/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 01f7464..0a9d439 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -128,7 +128,7 @@ config PPC
>   select IRQ_FORCED_THREADING
>   select HAVE_RCU_TABLE_FREE if SMP
>   select HAVE_SYSCALL_TRACEPOINTS
> - select HAVE_CBPF_JIT
> + select HAVE_CBPF_JIT if CPU_BIG_ENDIAN
>   select HAVE_ARCH_JUMP_LABEL
>   select ARCH_HAVE_NMI_SAFE_CMPXCHG
>   select ARCH_HAS_GCOV_PROFILE_ALL
> -- 
> 2.8.2
> 


[PATCH] ppc: Fix BPF JIT for ABIv2

2016-06-17 Thread Thadeu Lima de Souza Cascardo
On Fri, Jun 17, 2016 at 10:53:21PM +1000, Michael Ellerman wrote:
> On Tue, 2016-07-06 at 13:32:23 UTC, "Naveen N. Rao" wrote:
> > diff --git a/arch/powerpc/net/bpf_jit_comp64.c 
> > b/arch/powerpc/net/bpf_jit_comp64.c
> > new file mode 100644
> > index 000..954ff53
> > --- /dev/null
> > +++ b/arch/powerpc/net/bpf_jit_comp64.c
> > @@ -0,0 +1,956 @@
> ...
> > +
> > +static void bpf_jit_fill_ill_insns(void *area, unsigned int size)
> > +{
> > +   int *p = area;
> > +
> > +   /* Fill whole space with trap instructions */
> > +   while (p < (int *)((char *)area + size))
> > +   *p++ = BREAKPOINT_INSTRUCTION;
> > +}
> 
> This breaks the build for some configs, presumably you're missing a header:
> 
>   arch/powerpc/net/bpf_jit_comp64.c:30:10: error: 'BREAKPOINT_INSTRUCTION' 
> undeclared (first use in this function)
> 
> http://kisskb.ellerman.id.au/kisskb/buildresult/12720611/
> 
> cheers

Hi, Michael and Naveen.

I noticed independently that there is a problem with BPF JIT and ABIv2, and
worked out the patch below before I noticed Naveen's patchset and the latest
changes in ppc tree for a better way to check for ABI versions.

However, since the issue described below affect mainline and stable kernels,
would you consider applying it before merging your two patchsets, so that we can
more easily backport the fix?

Thanks.
Cascardo.

---
>From a984dc02b6317a1d3a3c2302385adba5227be5bd Mon Sep 17 00:00:00 2001
From: Thadeu Lima de Souza Cascardo <casca...@redhat.com>
Date: Wed, 15 Jun 2016 13:22:12 -0300
Subject: [PATCH] ppc: Fix BPF JIT for ABIv2

ABIv2 used for ppc64le does not use function descriptors. Without this patch,
whenever BPF JIT is enabled, we get a crash as below.

[root@ibm-p8-kvm-05-guest-02 ~]# echo 2 > /proc/sys/net/core/bpf_jit_enable
[root@ibm-p8-kvm-05-guest-02 ~]# tcpdump -n -i eth0 tcp port 22
device eth0 entered promiscuous mode
Pass 1: shrink = 0, seen = 0x0
Pass 2: shrink = 0, seen = 0x0
flen=1 proglen=8 pass=3 image=d5bb9018 from=tcpdump pid=11387
JIT code: : 00 00 60 38 20 00 80 4e
Pass 1: shrink = 0, seen = 0x3
Pass 2: shrink = 0, seen = 0x3
flen=20 proglen=524 pass=3 image=d5bbd018 from=tcpdump pid=11387
JIT code: : a6 02 08 7c 10 00 01 f8 70 ff c1 f9 78 ff e1 f9
JIT code: 0010: e1 fe 21 f8 7c 00 e3 80 78 00 e3 81 50 78 e7 7d
JIT code: 0020: c8 00 c3 e9 00 00 a0 38 00 c0 e0 3c c6 07 e7 78
JIT code: 0030: 08 00 e7 64 54 1b e7 60 a6 03 e8 7c 0c 00 c0 38
JIT code: 0040: 21 00 80 4e b0 01 80 41 00 00 00 60 dd 86 e0 38
JIT code: 0050: 01 00 e7 3c 40 38 04 7c 9c 00 82 40 00 00 00 60
JIT code: 0060: 00 c0 e0 3c c6 07 e7 78 08 00 e7 64 70 1b e7 60
JIT code: 0070: a6 03 e8 7c 14 00 c0 38 21 00 80 4e 78 01 80 41
JIT code: 0080: 00 00 00 60 06 00 04 28 68 01 82 40 00 00 00 60
JIT code: 0090: 00 c0 e0 3c c6 07 e7 78 08 00 e7 64 54 1b e7 60
JIT code: 00a0: a6 03 e8 7c 36 00 c0 38 21 00 80 4e 48 01 80 41
JIT code: 00b0: 00 00 00 60 16 00 04 28 2c 01 82 41 00 00 00 60
JIT code: 00c0: 00 c0 e0 3c c6 07 e7 78 08 00 e7 64 54 1b e7 60
JIT code: 00d0: a6 03 e8 7c 38 00 c0 38 21 00 80 4e 18 01 80 41
JIT code: 00e0: 00 00 00 60 16 00 04 28 fc 00 82 41 00 00 00 60
JIT code: 00f0: 00 01 00 48 00 08 04 28 f8 00 82 40 00 00 00 60
JIT code: 0100: 00 c0 e0 3c c6 07 e7 78 08 00 e7 64 70 1b e7 60
JIT code: 0110: a6 03 e8 7c 17 00 c0 38 21 00 80 4e d8 00 80 41
JIT code: 0120: 00 00 00 60 06 00 04 28 c8 00 82 40 00 00 00 60
JIT code: 0130: 00 c0 e0 3c c6 07 e7 78 08 00 e7 64 54 1b e7 60
JIT code: 0140: a6 03 e8 7c 14 00 c0 38 21 00 80 4e a8 00 80 41
JIT code: 0150: 00 00 00 60 ff 1f 87 70 98 00 82 40 00 00 00 60
JIT code: 0160: 00 c0 e0 3c c6 07 e7 78 08 00 e7 64 88 1b e7 60
JIT code: 0170: a6 03 e8 7c 0e 00 c0 38 21 00 80 4e 78 00 80 41
JIT code: 0180: 00 00 00 60 00 c0 e0 3c c6 07 e7 78 08 00 e7 64
JIT code: 0190: 4c 1b e7 60 a6 03 e8 7c 0e 00 c5 38 21 00 80 4e
JIT code: 01a0: 54 00 80 41 00 00 00 60 16 00 04 28 38 00 82 41
JIT code: 01b0: 00 00 00 60 00 c0 e0 3c c6 07 e7 78 08 00 e7 64
JIT code: 01c0: 4c 1b e7 60 a6 03 e8 7c 10 00 c5 38 21 00 80 4e
JIT code: 01d0: 24 00 80 41 00 00 00 60 16 00 04 28 14 00 82 40
JIT code: 01e0: 00 00 00 60 ff ff 60 38 01 00 63 3c 08 00 00 48
JIT code: 01f0: 00 00 60 38 20 01 21 38 10 00 01 e8 a6 03 08 7c
JIT code: 0200: 70 ff c1 e9 78 ff e1 e9 20 00 80 4e
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
Oops: Exception in kernel mode, sig: 4 [#1]
SMP NR_CPUS=32 NUMA pSeries
Modules linked in: virtio_balloon nfsd ip_tables x_tables autofs4 xfs libcrc32c 
virtio_console virtio_net virtio_pci virtio_ring virtio
CPU: 1 PID: 0 Comm: swapper/1 No

Re: [PATCH] ppc: Fix BPF JIT for ABIv2

2016-06-21 Thread Thadeu Lima de Souza Cascardo
On Tue, Jun 21, 2016 at 09:15:48PM +1000, Michael Ellerman wrote:
> On Tue, 2016-06-21 at 14:28 +0530, Naveen N. Rao wrote:
> > On 2016/06/20 03:56PM, Thadeu Lima de Souza Cascardo wrote:
> > > On Sun, Jun 19, 2016 at 11:19:14PM +0530, Naveen N. Rao wrote:
> > > > On 2016/06/17 10:00AM, Thadeu Lima de Souza Cascardo wrote:
> > > > > 
> > > > > Hi, Michael and Naveen.
> > > > > 
> > > > > I noticed independently that there is a problem with BPF JIT and 
> > > > > ABIv2, and
> > > > > worked out the patch below before I noticed Naveen's patchset and the 
> > > > > latest
> > > > > changes in ppc tree for a better way to check for ABI versions.
> > > > > 
> > > > > However, since the issue described below affect mainline and stable 
> > > > > kernels,
> > > > > would you consider applying it before merging your two patchsets, so 
> > > > > that we can
> > > > > more easily backport the fix?
> > > > 
> > > > Hi Cascardo,
> > > > Given that this has been broken on ABIv2 since forever, I didn't bother 
> > > > fixing it. But, I can see why this would be a good thing to have for 
> > > > -stable and existing distros. However, while your patch below may fix 
> > > > the crash you're seeing on ppc64le, it is not sufficient -- you'll need 
> > > > changes in bpf_jit_asm.S as well.
> > > 
> > > Hi, Naveen.
> > > 
> > > Any tips on how to exercise possible issues there? Or what changes you 
> > > think
> > > would be sufficient?
> > 
> > The calling convention is different with ABIv2 and so we'll need changes 
> > in bpf_slow_path_common() and sk_negative_common().
> 
> How big would those changes be? Do we know?
> 
> How come no one reported this was broken previously? This is the first I've
> heard of it being broken.
> 

I just heard of it less than two weeks ago, and only could investigate it last
week, when I realized mainline was also affected.

It looks like the little-endian support for classic JIT were done before the
conversion to ABIv2. And as JIT is disabled by default, no one seems to have
exercised it.

> > However, rather than enabling classic JIT for ppc64le, are we better off 
> > just disabling it?
> > 
> > --- a/arch/powerpc/Kconfig
> > +++ b/arch/powerpc/Kconfig
> > @@ -128,7 +128,7 @@ config PPC
> > select IRQ_FORCED_THREADING
> > select HAVE_RCU_TABLE_FREE if SMP
> > select HAVE_SYSCALL_TRACEPOINTS
> > -   select HAVE_CBPF_JIT
> > +   select HAVE_CBPF_JIT if CPU_BIG_ENDIAN
> > select HAVE_ARCH_JUMP_LABEL
> > select ARCH_HAVE_NMI_SAFE_CMPXCHG
> > select ARCH_HAS_GCOV_PROFILE_ALL
> > 
> > 
> > Michael,
> > Let me know your thoughts on whether you intend to take this patch or 
> > Cascardo's patch for -stable before the eBPF patches. I can redo my 
> > patches accordingly.
> 
> This patch sounds like the best option at the moment for something we can
> backport. Unless the changes to fix it are minimal.
> 
> cheers
> 

With my patch only, I can run a minimal tcpdump tcp port 22 with success. It
correctly filter packets. But as pointed out, slow paths may not be taken.

I don't have strong opinions on what to apply to stable, just that it would be
nice to have something for the crash before applying all the nice changes by
Naveen.

Cascardo.


Re: [PATCH] ppc: Fix BPF JIT for ABIv2

2016-06-20 Thread Thadeu Lima de Souza Cascardo
On Sun, Jun 19, 2016 at 11:19:14PM +0530, Naveen N. Rao wrote:
> On 2016/06/17 10:00AM, Thadeu Lima de Souza Cascardo wrote:
> > On Fri, Jun 17, 2016 at 10:53:21PM +1000, Michael Ellerman wrote:
> > > On Tue, 2016-07-06 at 13:32:23 UTC, "Naveen N. Rao" wrote:
> > > > diff --git a/arch/powerpc/net/bpf_jit_comp64.c 
> > > > b/arch/powerpc/net/bpf_jit_comp64.c
> > > > new file mode 100644
> > > > index 000..954ff53
> > > > --- /dev/null
> > > > +++ b/arch/powerpc/net/bpf_jit_comp64.c
> > > > @@ -0,0 +1,956 @@
> > > ...
> > > > +
> > > > +static void bpf_jit_fill_ill_insns(void *area, unsigned int size)
> > > > +{
> > > > +   int *p = area;
> > > > +
> > > > +   /* Fill whole space with trap instructions */
> > > > +   while (p < (int *)((char *)area + size))
> > > > +   *p++ = BREAKPOINT_INSTRUCTION;
> > > > +}
> > > 
> > > This breaks the build for some configs, presumably you're missing a 
> > > header:
> > > 
> > >   arch/powerpc/net/bpf_jit_comp64.c:30:10: error: 
> > > 'BREAKPOINT_INSTRUCTION' undeclared (first use in this function)
> > > 
> > > http://kisskb.ellerman.id.au/kisskb/buildresult/12720611/
> > > 
> > > cheers
> > 
> > Hi, Michael and Naveen.
> > 
> > I noticed independently that there is a problem with BPF JIT and ABIv2, and
> > worked out the patch below before I noticed Naveen's patchset and the latest
> > changes in ppc tree for a better way to check for ABI versions.
> > 
> > However, since the issue described below affect mainline and stable kernels,
> > would you consider applying it before merging your two patchsets, so that 
> > we can
> > more easily backport the fix?
> 
> Hi Cascardo,
> Given that this has been broken on ABIv2 since forever, I didn't bother 
> fixing it. But, I can see why this would be a good thing to have for 
> -stable and existing distros. However, while your patch below may fix 
> the crash you're seeing on ppc64le, it is not sufficient -- you'll need 
> changes in bpf_jit_asm.S as well.

Hi, Naveen.

Any tips on how to exercise possible issues there? Or what changes you think
would be sufficient?

I will see what I can find by myself, but would appreciate any help.

Regards.
Cascardo.

> 
> Regards,
> Naveen
> 


[PATCH net] sit: set rtnl_link_ops before calling register_netdevice

2016-01-25 Thread Thadeu Lima de Souza Cascardo
When creating a SIT tunnel with ip tunnel, rtnl_link_ops is not set before
ipip6_tunnel_create is called. When register_netdevice is called, there is
no linkinfo attribute in the NEWLINK message because of that.

Setting rtnl_link_ops before calling register_netdevice fixes that.

Signed-off-by: Thadeu Lima de Souza Cascardo <casca...@redhat.com>
---
 net/ipv6/sit.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index e794ef6..2066d1c 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -201,14 +201,14 @@ static int ipip6_tunnel_create(struct net_device *dev)
if ((__force u16)t->parms.i_flags & SIT_ISATAP)
dev->priv_flags |= IFF_ISATAP;
 
+   dev->rtnl_link_ops = _link_ops;
+
err = register_netdevice(dev);
if (err < 0)
goto out;
 
ipip6_tunnel_clone_6rd(dev, sitn);
 
-   dev->rtnl_link_ops = _link_ops;
-
dev_hold(dev);
 
ipip6_tunnel_link(sitn, t);
-- 
2.5.0



Re: [PATCH net-next] vxlan: tun_id is 64bit, not 32bit

2016-02-18 Thread Thadeu Lima de Souza Cascardo
On Thu, Feb 18, 2016 at 07:19:29PM +0100, Jiri Benc wrote:
> The tun_id field in struct ip_tunnel_key is __be64, not __be32. We need to
> convert the vni to tun_id correctly.
> 
> Fixes: 54bfd872bf16 ("vxlan: keep flags and vni in network byte order")
> Reported-by: Paolo Abeni <pab...@redhat.com>
> Tested-by: Paolo Abeni <pab...@redhat.com>
> Signed-off-by: Jiri Benc <jb...@redhat.com>

Acked-by: Thadeu Lima de Souza Cascardo <casca...@redhat.com>


Re: [PATCH] ehea: Drop owner assignment from platform_driver

2016-02-19 Thread Thadeu Lima de Souza Cascardo
On Fri, Feb 19, 2016 at 04:52:19PM +0530, Amitoj Kaur Chawla wrote:
> platform_driver does not need to set an owner, it will be populated by
> the driver core.
> 
> Generated-by: scripts/coccinelle/api/platform_no_drv_owner.cocci
> 
> Signed-off-by: Amitoj Kaur Chawla 
> ---
>  drivers/net/ethernet/ibm/ehea/ehea_main.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/ibm/ehea/ehea_main.c 
> b/drivers/net/ethernet/ibm/ehea/ehea_main.c
> index 2a0dc12..d4b022f 100644
> --- a/drivers/net/ethernet/ibm/ehea/ehea_main.c
> +++ b/drivers/net/ethernet/ibm/ehea/ehea_main.c
> @@ -127,7 +127,6 @@ static const struct of_device_id ehea_device_table[] = {
>  static struct platform_driver ehea_driver = {
>   .driver = {
>   .name = "ehea",
> - .owner = THIS_MODULE,
>   .of_match_table = ehea_device_table,
>   },
>   .probe = ehea_probe_adapter,
> -- 
> 1.9.1
> 

NACK.

ehea does not use platform_driver_register, it uses
ibmebus_register_driver, which does not set owner.

Cascardo.


Re: [PATCH] ehea: Drop owner assignment from platform_driver

2016-02-19 Thread Thadeu Lima de Souza Cascardo
On Fri, Feb 19, 2016 at 07:06:46AM -0500, Julia Lawall wrote:
> On Fri, 19 Feb 2016, Thadeu Lima de Souza Cascardo wrote:
> > On Fri, Feb 19, 2016 at 04:52:19PM +0530, Amitoj Kaur Chawla wrote:
> > > platform_driver does not need to set an owner, it will be populated by
> > > the driver core.
> > >
> > > Generated-by: scripts/coccinelle/api/platform_no_drv_owner.cocci
> > >
> > > Signed-off-by: Amitoj Kaur Chawla <amitoj1...@gmail.com>
> > > ---
> > >  drivers/net/ethernet/ibm/ehea/ehea_main.c | 1 -
> > >  1 file changed, 1 deletion(-)
> > >
> > > diff --git a/drivers/net/ethernet/ibm/ehea/ehea_main.c 
> > > b/drivers/net/ethernet/ibm/ehea/ehea_main.c
> > > index 2a0dc12..d4b022f 100644
> > > --- a/drivers/net/ethernet/ibm/ehea/ehea_main.c
> > > +++ b/drivers/net/ethernet/ibm/ehea/ehea_main.c
> > > @@ -127,7 +127,6 @@ static const struct of_device_id ehea_device_table[] 
> > > = {
> > >  static struct platform_driver ehea_driver = {
> > >   .driver = {
> > >   .name = "ehea",
> > > - .owner = THIS_MODULE,
> > >   .of_match_table = ehea_device_table,
> > >   },
> > >   .probe = ehea_probe_adapter,
> > > --
> > > 1.9.1
> > >
> >
> > NACK.
> >
> > ehea does not use platform_driver_register, it uses
> > ibmebus_register_driver, which does not set owner.
> 
> Thanks for the information.  I will try to update the saemantic patch.
> 
> julia

Thanks, Julia. I appreciate your work on that.

Cascardo.


Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Thadeu Lima de Souza Cascardo
On Mon, Apr 11, 2016 at 12:20:47PM -0400, Matthew Wilcox wrote:
> On Mon, Apr 11, 2016 at 02:08:27PM +0100, Mel Gorman wrote:
> > On Mon, Apr 11, 2016 at 02:26:39PM +0200, Jesper Dangaard Brouer wrote:
> > > On arch's like PowerPC, the DMA API is the bottleneck.  To workaround
> > > the cost of DMA calls, NIC driver alloc large order (compound) pages.
> > > (dma_map compound page, handout page-fragments for RX ring, and later
> > > dma_unmap when last RX page-fragments is seen).
> > 
> > So, IMO only holding onto the DMA pages is all that is justified but not a
> > recycle of order-0 pages built on top of the core allocator. For DMA pages,
> > it would take a bit of legwork but the per-cpu allocator could be split
> > and converted to hold arbitrary sized pages with a constructer/destructor
> > to do the DMA coherency step when pages are taken from or handed back to
> > the core allocator. I'm not volunteering to do that unfortunately but I
> > estimate it'd be a few days work unless it needs to be per-CPU and NUMA
> > aware in which case the memory footprint will be high.
> 
> Have "we" tried to accelerate the DMA calls in PowerPC?  For example, it
> could hold onto a cache of recently used mappings and recycle them if that
> still works.  It trades off a bit of security (a device can continue to DMA
> after the memory should no longer be accessible to it) for speed, but then
> so does the per-driver hack of keeping pages around still mapped.
> 

There are two problems on the DMA calls on Power servers. One is scalability. A
new allocation method for the address space would be necessary to fix it.

The other one is the latency or the cost of updating the TCE tables. The only
number I have is that I could push around 1M updates per second. So, we could
guess 1us per operation, which is pretty much a no-no for Jesper use case.

Your solution could address both. But I am concerned about the security problem.
Here is why I think this problem should be ignored if we go this way. IOMMU can
be used for three problems: virtualization, paranoia security and debuggability.

For virtualization, there is a solution already, and it's in place for Power and
x86. Power servers have the ability to enlarge the DMA window, allowing the
entire VM memory to be mapped during PCI driver probe time. After that, dma_map
is a simple sum and dma_unmap is a nop. x86 KVM maps the entire VM memory even
before booting the guest. Unless we want to fix this for old Power servers, I
see no point in fixing it.

Now, if you are using IOMMU on the host with no passthrough or linear system
memory mapping, you are paranoid. It's not just a matter of security, in fact.
It's also a matter of stability. Hardware, firmware and drivers can be buggy,
and they are. When I worked with drivers on Power servers, I found and fixed a
lot of driver bugs that caused the device to write to memory it was not supposed
to. Good thing is that IOMMU prevented that memory write to happen and the
driver would be reset by EEH. If we can make this scenario faster, and if we
want it to be the default we need to, then your solution might not be desired.
Otherwise, just turn your IOMMU off or put it into passthrough.

Now, the driver keeps pages mapped, but those pages belong to the driver. They
are not pages we decide to give to a userspace process because it's no longer in
use by the driver. So, I don't quite agree this would be a good tradeoff.
Certainly not if we can do it in a way that does not require this.

So, Jesper, please take into consideration that this pool design would rather be
per device. Otherwise, we allow some device to write into another's
device/driver memory.

Cascardo.


[PATCH] ip6_tunnel: set rtnl_link_ops before calling register_netdevice

2016-04-01 Thread Thadeu Lima de Souza Cascardo
When creating an ip6tnl tunnel with ip tunnel, rtnl_link_ops is not set
before ip6_tnl_create2 is called. When register_netdevice is called, there
is no linkinfo attribute in the NEWLINK message because of that.

Setting rtnl_link_ops before calling register_netdevice fixes that.

Signed-off-by: Thadeu Lima de Souza Cascardo <casca...@redhat.com>
---
 net/ipv6/ip6_tunnel.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index eb2ac4b..1f20345 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -252,12 +252,12 @@ static int ip6_tnl_create2(struct net_device *dev)
 
t = netdev_priv(dev);
 
+   dev->rtnl_link_ops = _link_ops;
err = register_netdevice(dev);
if (err < 0)
goto out;
 
strcpy(t->parms.name, dev->name);
-   dev->rtnl_link_ops = _link_ops;
 
dev_hold(dev);
ip6_tnl_link(ip6n, t);
-- 
2.5.0



Re: [PATCH 0043/1529] Fix typo

2016-05-21 Thread Thadeu Lima de Souza Cascardo
On Sat, May 21, 2016 at 01:41:21PM +0200, Andrea Gelmini wrote:
> Signed-off-by: Andrea Gelmini 
> ---
>  Documentation/isdn/HiSax.cert | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/Documentation/isdn/HiSax.cert b/Documentation/isdn/HiSax.cert
> index f2a6fcb..1483cd9d 100644
> --- a/Documentation/isdn/HiSax.cert
> +++ b/Documentation/isdn/HiSax.cert
> @@ -32,7 +32,7 @@ special hardware used will be regarded as approved if at 
> least one
>  solution has been tested including those electrical tests. So if cards 
>  or tas have been completely approved for any other os, the approval
>  for those electrical tests is valid for linux, too.
> -Please send any questions regarding this drivers or approval abouts to 
> +Please send any questions regarding this drivers or approval about to 

Should be "these drivers".

>  wer...@isdn-development.de 
>  Additional information and the type approval documents will be found
>  shortly on the Colognechip website www.colognechip.com 
> -- 
> 2.8.2.534.g1f66975
> 


Re: [PATCH 1/1] net: ehea: avoid null pointer dereference

2016-05-17 Thread Thadeu Lima de Souza Cascardo
On Tue, May 17, 2016 at 10:28:54PM +0200, Heinrich Schuchardt wrote:
> ehea_get_port may return NULL. Do not dereference NULL value.
> 
> Fixes: 8c4877a4128e ("ehea: Use the standard logging functions")
> Signed-off-by: Heinrich Schuchardt <xypron.g...@gmx.de>

Acked-by: Thadeu Lima de Souza Cascardo <casca...@debian.org>

> ---
>  drivers/net/ethernet/ibm/ehea/ehea_main.c | 9 -
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/ethernet/ibm/ehea/ehea_main.c 
> b/drivers/net/ethernet/ibm/ehea/ehea_main.c
> index 2a0dc12..54efa9a 100644
> --- a/drivers/net/ethernet/ibm/ehea/ehea_main.c
> +++ b/drivers/net/ethernet/ibm/ehea/ehea_main.c
> @@ -1169,16 +1169,15 @@ static void ehea_parse_eqe(struct ehea_adapter 
> *adapter, u64 eqe)
>   ec = EHEA_BMASK_GET(NEQE_EVENT_CODE, eqe);
>   portnum = EHEA_BMASK_GET(NEQE_PORTNUM, eqe);
>   port = ehea_get_port(adapter, portnum);
> + if (!port) {
> + netdev_err(NULL, "unknown portnum %x\n", portnum);
> + return;
> + }
>   dev = port->netdev;
>  
>   switch (ec) {
>   case EHEA_EC_PORTSTATE_CHG: /* port state change */
>  
> - if (!port) {
> - netdev_err(dev, "unknown portnum %x\n", portnum);
> - break;
> - }
> -
>   if (EHEA_BMASK_GET(NEQE_PORT_UP, eqe)) {
>   if (!netif_carrier_ok(dev)) {
>   ret = ehea_sense_port_attr(port);
> -- 
> 2.1.4
> 


[PATCH net-next] vxlan: if_arp: introduce ARPHRD_VXLANGPE

2016-05-05 Thread Thadeu Lima de Souza Cascardo
Use ARPHRD_VXLANGPE to identify VxLAN GPE interfaces. This is going to be used
to allow GPE interfaces to be added as openvswitch ports.

Signed-off-by: Thadeu Lima de Souza Cascardo <casca...@redhat.com>
Cc: Jiri Benc <jb...@redhat.com>
Cc: Simon Horman <simon.hor...@netronome.com>
---
 drivers/net/vxlan.c | 2 +-
 include/uapi/linux/if_arp.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 2668e52..9da962f 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2603,7 +2603,7 @@ static void vxlan_ether_setup(struct net_device *dev)
 static void vxlan_raw_setup(struct net_device *dev)
 {
dev->header_ops = NULL;
-   dev->type = ARPHRD_NONE;
+   dev->type = ARPHRD_VXLANGPE;
dev->hard_header_len = 0;
dev->addr_len = 0;
dev->flags = IFF_POINTOPOINT | IFF_NOARP | IFF_MULTICAST;
diff --git a/include/uapi/linux/if_arp.h b/include/uapi/linux/if_arp.h
index 4d024d7..52a8175 100644
--- a/include/uapi/linux/if_arp.h
+++ b/include/uapi/linux/if_arp.h
@@ -95,6 +95,7 @@
 #define ARPHRD_IP6GRE  823 /* GRE over IPv6*/
 #define ARPHRD_NETLINK 824 /* Netlink header   */
 #define ARPHRD_6LOWPAN 825 /* IPv6 over LoWPAN */
+#define ARPHRD_VXLANGPE 826/* VxLAN GPE */
 
 #define ARPHRD_VOID  0x/* Void type, nothing is known */
 #define ARPHRD_NONE  0xFFFE/* zero header length */
-- 
2.5.5



Re: [PATCH net-next] vxlan: if_arp: introduce ARPHRD_VXLANGPE

2016-05-05 Thread Thadeu Lima de Souza Cascardo
On Thu, May 05, 2016 at 09:31:41PM +0200, Jiri Benc wrote:
> On Thu,  5 May 2016 13:36:44 -0300, Thadeu Lima de Souza Cascardo wrote:
> > Use ARPHRD_VXLANGPE to identify VxLAN GPE interfaces. This is going to be 
> > used
> > to allow GPE interfaces to be added as openvswitch ports.
> 
> What's wrong with ARPHRD_NONE? I don't think we need a separate type
> for VXLAN-GPE. Just use ARPHRD_NONE in ovs and things should work, for
> all ARPHRD_NONE interfaces as a bonus.
> 

That's fine for me. I looked quickly at the few devices using ARPHRD_NONE in
upstream kernel, not sure if there are broken out-of-tree drivers out there. And
should we care?

> > Signed-off-by: Thadeu Lima de Souza Cascardo <casca...@redhat.com>
> > Cc: Jiri Benc <jb...@redhat.com>
> > Cc: Simon Horman <simon.hor...@netronome.com>
> 
> You did not CC me nor Simon :-)
> 

I am using sendemail.suppresscc=all to prevent some accidents. I am adding
confirm=always so I can double check before really sending.

Thanks.
Cascardo.

>  Jiri


Re: [PATCH 1/1] Update maintainer for EHEA driver.

2016-07-18 Thread Thadeu Lima de Souza Cascardo
On Mon, Jul 18, 2016 at 12:28:45PM -0500, Douglas Miller wrote:
> Since Thadeu left IBM, EHEA has gone mostly unmaintained, since his email
> address doesn't work anymore.  I'm stepping up to help maintain this
> driver upstream.
> 
> I'm adding Thadeu's personal e-mail address in Cc, hoping that we can
> get his ack.
> 
> CC: Thadeu Lima de Souza Cascardo <casca...@cascardo.eti.br>

Acked-by: Thadeu Lima de Souza Cascardo <casca...@cascardo.eti.br>

> Signed-off-by: Douglas Miller <dougm...@linux.vnet.ibm.com>
> ---
>  MAINTAINERS |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 1d74837..e9cefe2 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -4477,7 +4477,7 @@ S:  Orphan
>  F:   fs/efs/
>  
>  EHEA (IBM pSeries eHEA 10Gb ethernet adapter) DRIVER
> -M:   Thadeu Lima de Souza Cascardo <casca...@linux.vnet.ibm.com>
> +M:   Douglas Miller <dougm...@linux.vnet.ibm.com>
>  L:   netdev@vger.kernel.org
>  S:   Maintained
>  F:   drivers/net/ethernet/ibm/ehea/
> -- 
> 1.7.1
> 


signature.asc
Description: PGP signature


[RFC PATCH] openvswitch: use percpu flow stats

2016-08-19 Thread Thadeu Lima de Souza Cascardo
Instead of using flow stats per NUMA node, use it per CPU. When using
megaflows, the stats lock can be a bottleneck in scalability.

On a E5-2690 12-core system, usual throughput went from ~4Mpps to ~15Mpps
when forwarding between two 40GbE ports with a single flow configured on
the datapath.
---
 net/openvswitch/flow.c   | 39 +--
 net/openvswitch/flow.h   |  4 ++--
 net/openvswitch/flow_table.c | 21 +++--
 3 files changed, 34 insertions(+), 30 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 0ea128e..90ae248 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -72,32 +73,33 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 
tcp_flags,
 {
struct flow_stats *stats;
int node = numa_node_id();
+   int cpu = get_cpu();
int len = skb->len + (skb_vlan_tag_present(skb) ? VLAN_HLEN : 0);
 
-   stats = rcu_dereference(flow->stats[node]);
+   stats = rcu_dereference(flow->stats[cpu]);
 
-   /* Check if already have node-specific stats. */
+   /* Check if already have CPU-specific stats. */
if (likely(stats)) {
spin_lock(>lock);
/* Mark if we write on the pre-allocated stats. */
-   if (node == 0 && unlikely(flow->stats_last_writer != node))
-   flow->stats_last_writer = node;
+   if (cpu == 0 && unlikely(flow->stats_last_writer != cpu))
+   flow->stats_last_writer = cpu;
} else {
stats = rcu_dereference(flow->stats[0]); /* Pre-allocated. */
spin_lock(>lock);
 
-   /* If the current NUMA-node is the only writer on the
+   /* If the current CPU is the only writer on the
 * pre-allocated stats keep using them.
 */
-   if (unlikely(flow->stats_last_writer != node)) {
+   if (unlikely(flow->stats_last_writer != cpu)) {
/* A previous locker may have already allocated the
-* stats, so we need to check again.  If node-specific
+* stats, so we need to check again.  If CPU-specific
 * stats were already allocated, we update the pre-
 * allocated stats as we have already locked them.
 */
-   if (likely(flow->stats_last_writer != NUMA_NO_NODE)
-   && likely(!rcu_access_pointer(flow->stats[node]))) {
-   /* Try to allocate node-specific stats. */
+   if (likely(flow->stats_last_writer != -1)
+   && likely(!rcu_access_pointer(flow->stats[cpu]))) {
+   /* Try to allocate CPU-specific stats. */
struct flow_stats *new_stats;
 
new_stats =
@@ -114,12 +116,12 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 
tcp_flags,
new_stats->tcp_flags = tcp_flags;
spin_lock_init(_stats->lock);
 
-   rcu_assign_pointer(flow->stats[node],
+   rcu_assign_pointer(flow->stats[cpu],
   new_stats);
goto unlock;
}
}
-   flow->stats_last_writer = node;
+   flow->stats_last_writer = cpu;
}
}
 
@@ -129,6 +131,7 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 
tcp_flags,
stats->tcp_flags |= tcp_flags;
 unlock:
spin_unlock(>lock);
+   put_cpu();
 }
 
 /* Must be called with rcu_read_lock or ovs_mutex. */
@@ -136,14 +139,14 @@ void ovs_flow_stats_get(const struct sw_flow *flow,
struct ovs_flow_stats *ovs_stats,
unsigned long *used, __be16 *tcp_flags)
 {
-   int node;
+   int cpu;
 
*used = 0;
*tcp_flags = 0;
memset(ovs_stats, 0, sizeof(*ovs_stats));
 
-   for_each_node(node) {
-   struct flow_stats *stats = 
rcu_dereference_ovsl(flow->stats[node]);
+   for_each_possible_cpu(cpu) {
+   struct flow_stats *stats = 
rcu_dereference_ovsl(flow->stats[cpu]);
 
if (stats) {
/* Local CPU may write on non-local stats, so we must
@@ -163,10 +166,10 @@ void ovs_flow_stats_get(const struct sw_flow *flow,
 /* Called with ovs_mutex. */
 void ovs_flow_stats_clear(struct sw_flow *flow)
 {
-   int node;
+   int cpu;
 
-   for_each_node(node) {
-   struct flow_stats *stats = 

Re: [PATCH net] bridge: re-introduce 'fix parsing of MLDv2 reports'

2016-08-31 Thread Thadeu Lima de Souza Cascardo
On Wed, Aug 31, 2016 at 02:16:44PM +0200, Davide Caratti wrote:
> commit bc8c20acaea1 ("bridge: multicast: treat igmpv3 report with
> INCLUDE and no sources as a leave") seems to have accidentally reverted
> commit 47cc84ce0c2f ("bridge: fix parsing of MLDv2 reports"). This
> commit brings back a change to br_ip6_multicast_mld2_report() where
> parsing of MLDv2 reports stops when the first group is successfully
> added to the MDB cache.
> 
> Fixes: bc8c20acaea1 ("bridge: multicast: treat igmpv3 report with INCLUDE and 
> no sources as a leave")
> Signed-off-by: Davide Caratti <dcara...@redhat.com>
> ---
>  net/bridge/br_multicast.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
> index a5423a1..c5fea93 100644
> --- a/net/bridge/br_multicast.c
> +++ b/net/bridge/br_multicast.c
> @@ -1138,7 +1138,7 @@ static int br_ip6_multicast_mld2_report(struct 
> net_bridge *br,
>   } else {
>   err = br_ip6_multicast_add_group(br, port,
>>grec_mca, vid);
> - if (!err)
> +         if (err)
>   break;
>   }
>   }
> -- 
> 2.5.5
> 

Acked-by: Thadeu Lima de Souza Cascardo <casca...@redhat.com>


[PATCH v2 2/2] openvswitch: use percpu flow stats

2016-09-15 Thread Thadeu Lima de Souza Cascardo
Instead of using flow stats per NUMA node, use it per CPU. When using
megaflows, the stats lock can be a bottleneck in scalability.

On a E5-2690 12-core system, usual throughput went from ~4Mpps to
~15Mpps when forwarding between two 40GbE ports with a single flow
configured on the datapath.

This has been tested on a system with possible CPUs 0-7,16-23. After
module removal, there were no corruption on the slab cache.

Signed-off-by: Thadeu Lima de Souza Cascardo <casca...@redhat.com>
Cc: pravin shelar <pshe...@ovn.org>
---

v2:
* use smp_processor_id as ovs_flow_stats_update is always called from BH
context
* use kmem_cache_zalloc to allocate flow

---
 net/openvswitch/flow.c   | 42 ++
 net/openvswitch/flow.h   |  4 ++--
 net/openvswitch/flow_table.c | 26 +-
 3 files changed, 33 insertions(+), 39 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 5b80612..0fa45439 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -72,32 +73,33 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 
tcp_flags,
 {
struct flow_stats *stats;
int node = numa_node_id();
+   int cpu = smp_processor_id();
int len = skb->len + (skb_vlan_tag_present(skb) ? VLAN_HLEN : 0);
 
-   stats = rcu_dereference(flow->stats[node]);
+   stats = rcu_dereference(flow->stats[cpu]);
 
-   /* Check if already have node-specific stats. */
+   /* Check if already have CPU-specific stats. */
if (likely(stats)) {
spin_lock(>lock);
/* Mark if we write on the pre-allocated stats. */
-   if (node == 0 && unlikely(flow->stats_last_writer != node))
-   flow->stats_last_writer = node;
+   if (cpu == 0 && unlikely(flow->stats_last_writer != cpu))
+   flow->stats_last_writer = cpu;
} else {
stats = rcu_dereference(flow->stats[0]); /* Pre-allocated. */
spin_lock(>lock);
 
-   /* If the current NUMA-node is the only writer on the
+   /* If the current CPU is the only writer on the
 * pre-allocated stats keep using them.
 */
-   if (unlikely(flow->stats_last_writer != node)) {
+   if (unlikely(flow->stats_last_writer != cpu)) {
/* A previous locker may have already allocated the
-* stats, so we need to check again.  If node-specific
+* stats, so we need to check again.  If CPU-specific
 * stats were already allocated, we update the pre-
 * allocated stats as we have already locked them.
 */
-   if (likely(flow->stats_last_writer != NUMA_NO_NODE)
-   && likely(!rcu_access_pointer(flow->stats[node]))) {
-   /* Try to allocate node-specific stats. */
+   if (likely(flow->stats_last_writer != -1) &&
+   likely(!rcu_access_pointer(flow->stats[cpu]))) {
+   /* Try to allocate CPU-specific stats. */
struct flow_stats *new_stats;
 
new_stats =
@@ -114,12 +116,12 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 
tcp_flags,
new_stats->tcp_flags = tcp_flags;
spin_lock_init(_stats->lock);
 
-   rcu_assign_pointer(flow->stats[node],
+   rcu_assign_pointer(flow->stats[cpu],
   new_stats);
goto unlock;
}
}
-   flow->stats_last_writer = node;
+   flow->stats_last_writer = cpu;
}
}
 
@@ -136,15 +138,15 @@ void ovs_flow_stats_get(const struct sw_flow *flow,
struct ovs_flow_stats *ovs_stats,
unsigned long *used, __be16 *tcp_flags)
 {
-   int node;
+   int cpu;
 
*used = 0;
*tcp_flags = 0;
memset(ovs_stats, 0, sizeof(*ovs_stats));
 
-   /* We open code this to make sure node 0 is always considered */
-   for (node = 0; node < MAX_NUMNODES; node = next_node(node, 
node_possible_map)) {
-   struct flow_stats *stats = 
rcu_dereference_ovsl(flow->stats[node]);
+   /* We open code this to make sure cpu 0 is always considered */
+   for (cpu = 0; cpu < nr_cpu_ids; cpu 

[PATCH v2 1/2] openvswitch: fix flow stats accounting when node 0 is not possible

2016-09-15 Thread Thadeu Lima de Souza Cascardo
On a system with only node 1 as possible, all statistics is going to be
accounted on node 0 as it will have a single writer.

However, when getting and clearing the statistics, node 0 is not going
to be considered, as it's not a possible node.

Tested that statistics are not zero on a system with only node 1
possible. Also compile-tested with CONFIG_NUMA off.

Signed-off-by: Thadeu Lima de Souza Cascardo <casca...@redhat.com>
---
 net/openvswitch/flow.c   | 6 --
 net/openvswitch/flow_table.c | 5 +++--
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 1240ae3..5b80612 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -142,7 +142,8 @@ void ovs_flow_stats_get(const struct sw_flow *flow,
*tcp_flags = 0;
memset(ovs_stats, 0, sizeof(*ovs_stats));
 
-   for_each_node(node) {
+   /* We open code this to make sure node 0 is always considered */
+   for (node = 0; node < MAX_NUMNODES; node = next_node(node, 
node_possible_map)) {
struct flow_stats *stats = 
rcu_dereference_ovsl(flow->stats[node]);
 
if (stats) {
@@ -165,7 +166,8 @@ void ovs_flow_stats_clear(struct sw_flow *flow)
 {
int node;
 
-   for_each_node(node) {
+   /* We open code this to make sure node 0 is always considered */
+   for (node = 0; node < MAX_NUMNODES; node = next_node(node, 
node_possible_map)) {
struct flow_stats *stats = ovsl_dereference(flow->stats[node]);
 
if (stats) {
diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
index d073fff..957a3c3 100644
--- a/net/openvswitch/flow_table.c
+++ b/net/openvswitch/flow_table.c
@@ -148,8 +148,9 @@ static void flow_free(struct sw_flow *flow)
kfree(flow->id.unmasked_key);
if (flow->sf_acts)
ovs_nla_free_flow_actions((struct sw_flow_actions __force 
*)flow->sf_acts);
-   for_each_node(node)
-   if (flow->stats[node])
+   /* We open code this to make sure node 0 is always considered */
+   for (node = 0; node < MAX_NUMNODES; node = next_node(node, 
node_possible_map))
+   if (node != 0 && flow->stats[node])
kmem_cache_free(flow_stats_cache,
(struct flow_stats __force 
*)flow->stats[node]);
kmem_cache_free(flow_cache, flow);
-- 
2.7.4



Re: [PATCH v2 2/2] openvswitch: use percpu flow stats

2016-09-15 Thread Thadeu Lima de Souza Cascardo
On Thu, Sep 15, 2016 at 04:09:26PM -0700, Eric Dumazet wrote:
> On Thu, 2016-09-15 at 19:11 -0300, Thadeu Lima de Souza Cascardo wrote:
> > Instead of using flow stats per NUMA node, use it per CPU. When using
> > megaflows, the stats lock can be a bottleneck in scalability.
> > 
> > On a E5-2690 12-core system, usual throughput went from ~4Mpps to
> > ~15Mpps when forwarding between two 40GbE ports with a single flow
> > configured on the datapath.
> > 
> > This has been tested on a system with possible CPUs 0-7,16-23. After
> > module removal, there were no corruption on the slab cache.
> > 
> > Signed-off-by: Thadeu Lima de Souza Cascardo <casca...@redhat.com>
> > Cc: pravin shelar <pshe...@ovn.org>
> > ---
> 
> > +   /* We open code this to make sure cpu 0 is always considered */
> > +   for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next(cpu, 
> > cpu_possible_mask))
> > +   if (flow->stats[cpu])
> > kmem_cache_free(flow_stats_cache,
> > -   (struct flow_stats __force 
> > *)flow->stats[node]);
> > +   (struct flow_stats __force 
> > *)flow->stats[cpu]);
> > kmem_cache_free(flow_cache, flow);
> >  }
> >  
> > @@ -757,7 +749,7 @@ int ovs_flow_init(void)
> > BUILD_BUG_ON(sizeof(struct sw_flow_key) % sizeof(long));
> >  
> > flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow)
> > -  + (nr_node_ids
> > +  + (nr_cpu_ids
> >   * sizeof(struct flow_stats *)),
> >0, 0, NULL);
> > if (flow_cache == NULL)
> 
> Well, if you switch to percpu stats, better use normal
> alloc_percpu(struct flow_stats)
> 
> The code was dealing with per node allocation so could not use existing
> helper.
> 
> No need to keep this forever.

The problem is that the alloc_percpu uses a global spinlock and that affects
some workloads on OVS that creates lots of flows, as described in commit
9ac56358dec1a5aa7f4275a42971f55fad1f7f35 ("datapath: Per NUMA node flow
stats.").

This problem would not happen on this version as the flow allocation does not
suffer from the same scalability problem as when using alloc_percpu.

Cascardo.


[PATCH 1/2] openvswitch: fix flow stats accounting when node 0 is not possible

2016-09-09 Thread Thadeu Lima de Souza Cascardo
On a system with only node 1 as possible, all statistics is going to be
accounted on node 0 as it will have a single writer.

However, when getting and clearing the statistics, node 0 is not going
to be considered, as it's not a possible node.

Tested that statistics are not zero on a system with only node 1
possible. Also compile-tested with CONFIG_NUMA off.

Signed-off-by: Thadeu Lima de Souza Cascardo <casca...@redhat.com>
---

I am providing this intermediate patch, that will be thrown out by the next one,
in case there is any need to backport this fix.

---
 net/openvswitch/flow.c   | 6 --
 net/openvswitch/flow_table.c | 5 +++--
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 0ea128e..3609f37 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -142,7 +142,8 @@ void ovs_flow_stats_get(const struct sw_flow *flow,
*tcp_flags = 0;
memset(ovs_stats, 0, sizeof(*ovs_stats));
 
-   for_each_node(node) {
+   /* We open code this to make sure node 0 is always considered */
+   for (node = 0; node < MAX_NUMNODES; node = next_node(node, 
node_possible_map)) {
struct flow_stats *stats = 
rcu_dereference_ovsl(flow->stats[node]);
 
if (stats) {
@@ -165,7 +166,8 @@ void ovs_flow_stats_clear(struct sw_flow *flow)
 {
int node;
 
-   for_each_node(node) {
+   /* We open code this to make sure node 0 is always considered */
+   for (node = 0; node < MAX_NUMNODES; node = next_node(node, 
node_possible_map)) {
struct flow_stats *stats = ovsl_dereference(flow->stats[node]);
 
if (stats) {
diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
index d073fff..957a3c3 100644
--- a/net/openvswitch/flow_table.c
+++ b/net/openvswitch/flow_table.c
@@ -148,8 +148,9 @@ static void flow_free(struct sw_flow *flow)
kfree(flow->id.unmasked_key);
if (flow->sf_acts)
ovs_nla_free_flow_actions((struct sw_flow_actions __force 
*)flow->sf_acts);
-   for_each_node(node)
-   if (flow->stats[node])
+   /* We open code this to make sure node 0 is always considered */
+   for (node = 0; node < MAX_NUMNODES; node = next_node(node, 
node_possible_map))
+   if (node != 0 && flow->stats[node])
kmem_cache_free(flow_stats_cache,
(struct flow_stats __force 
*)flow->stats[node]);
kmem_cache_free(flow_cache, flow);
-- 
2.7.4



[PATCH 2/2] openvswitch: use percpu flow stats

2016-09-09 Thread Thadeu Lima de Souza Cascardo
Instead of using flow stats per NUMA node, use it per CPU. When using
megaflows, the stats lock can be a bottleneck in scalability.

On a E5-2690 12-core system, usual throughput went from ~4Mpps to
~15Mpps when forwarding between two 40GbE ports with a single flow
configured on the datapath.

This has been tested on a system with possible CPUs 0-7,16-23. After
module removal, there were no corruption on the slab cache.

Signed-off-by: Thadeu Lima de Souza Cascardo <casca...@redhat.com>
---
 net/openvswitch/flow.c   | 43 +++
 net/openvswitch/flow.h   |  4 ++--
 net/openvswitch/flow_table.c | 23 ---
 3 files changed, 37 insertions(+), 33 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 3609f37..2970a9f 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -72,32 +73,33 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 
tcp_flags,
 {
struct flow_stats *stats;
int node = numa_node_id();
+   int cpu = get_cpu();
int len = skb->len + (skb_vlan_tag_present(skb) ? VLAN_HLEN : 0);
 
-   stats = rcu_dereference(flow->stats[node]);
+   stats = rcu_dereference(flow->stats[cpu]);
 
-   /* Check if already have node-specific stats. */
+   /* Check if already have CPU-specific stats. */
if (likely(stats)) {
spin_lock(>lock);
/* Mark if we write on the pre-allocated stats. */
-   if (node == 0 && unlikely(flow->stats_last_writer != node))
-   flow->stats_last_writer = node;
+   if (cpu == 0 && unlikely(flow->stats_last_writer != cpu))
+   flow->stats_last_writer = cpu;
} else {
stats = rcu_dereference(flow->stats[0]); /* Pre-allocated. */
spin_lock(>lock);
 
-   /* If the current NUMA-node is the only writer on the
+   /* If the current CPU is the only writer on the
 * pre-allocated stats keep using them.
 */
-   if (unlikely(flow->stats_last_writer != node)) {
+   if (unlikely(flow->stats_last_writer != cpu)) {
/* A previous locker may have already allocated the
-* stats, so we need to check again.  If node-specific
+* stats, so we need to check again.  If CPU-specific
 * stats were already allocated, we update the pre-
 * allocated stats as we have already locked them.
 */
-   if (likely(flow->stats_last_writer != NUMA_NO_NODE)
-   && likely(!rcu_access_pointer(flow->stats[node]))) {
-   /* Try to allocate node-specific stats. */
+   if (likely(flow->stats_last_writer != -1) &&
+   likely(!rcu_access_pointer(flow->stats[cpu]))) {
+   /* Try to allocate CPU-specific stats. */
struct flow_stats *new_stats;
 
new_stats =
@@ -114,12 +116,12 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 
tcp_flags,
new_stats->tcp_flags = tcp_flags;
spin_lock_init(_stats->lock);
 
-   rcu_assign_pointer(flow->stats[node],
+   rcu_assign_pointer(flow->stats[cpu],
   new_stats);
goto unlock;
}
}
-   flow->stats_last_writer = node;
+   flow->stats_last_writer = cpu;
}
}
 
@@ -129,6 +131,7 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 
tcp_flags,
stats->tcp_flags |= tcp_flags;
 unlock:
spin_unlock(>lock);
+   put_cpu();
 }
 
 /* Must be called with rcu_read_lock or ovs_mutex. */
@@ -136,15 +139,15 @@ void ovs_flow_stats_get(const struct sw_flow *flow,
struct ovs_flow_stats *ovs_stats,
unsigned long *used, __be16 *tcp_flags)
 {
-   int node;
+   int cpu;
 
*used = 0;
*tcp_flags = 0;
memset(ovs_stats, 0, sizeof(*ovs_stats));
 
-   /* We open code this to make sure node 0 is always considered */
-   for (node = 0; node < MAX_NUMNODES; node = next_node(node, 
node_possible_map)) {
-   struct flow_stats *stats = 
rcu_dereference_ovsl(flow->stats[node]);
+   /* We open code this to make s

[PATCH] openvswitch: use alias for genetlink family names

2016-09-09 Thread Thadeu Lima de Souza Cascardo
When userspace tries to create datapaths and the module is not loaded,
it will simply fail. With this patch, the module will be automatically
loaded.

Signed-off-by: Thadeu Lima de Souza Cascardo <casca...@redhat.com>
---
 net/openvswitch/datapath.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 524c0fd..0536ab3 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -2437,3 +2437,7 @@ module_exit(dp_cleanup);
 
 MODULE_DESCRIPTION("Open vSwitch switching datapath");
 MODULE_LICENSE("GPL");
+MODULE_ALIAS_GENL_FAMILY(OVS_DATAPATH_FAMILY);
+MODULE_ALIAS_GENL_FAMILY(OVS_VPORT_FAMILY);
+MODULE_ALIAS_GENL_FAMILY(OVS_FLOW_FAMILY);
+MODULE_ALIAS_GENL_FAMILY(OVS_PACKET_FAMILY);
-- 
2.7.4



Re: [ovs-dev] [PATCH net-next v13 0/8] openvswitch: support for layer 3 encapsulated packets

2016-11-15 Thread Thadeu Lima de Souza Cascardo
On November 15, 2016 11:57:21 AM GMT-02:00, "Yang, Yi Y"  
wrote:
>Hi, Jiri
>
>I'm very glad to see you're continuing this work :-), I asked Simon
>about this twice, but nobody replies. I also remember Cascardo has a
>patch set to collaborate with this patch set, I asked Cascardo, but
>nobody responds, will you continue to do Cascardo's " create tunnel
>devices using rtnetlink interface" patch set? I test the old one v3,
>that can work with vxlan module in kernel, but if I build ovs with
>option " --with-linux=/lib/modules/`uname -r`/build", ovs vxlan module
>is built in vport_vxlan module, when I create vxlan-gpe port, kernel
>will automatically load vxlan module in the kernel instead of using the
>APIs in vport_vxlan module. 
>
>Cascardo, are you still working on this?
>
>-Original Message-
>From: netdev-ow...@vger.kernel.org
>[mailto:netdev-ow...@vger.kernel.org] On Behalf Of Jiri Benc
>Sent: Thursday, November 10, 2016 11:28 PM
>To: netdev@vger.kernel.org
>Cc: d...@openvswitch.org; Pravin Shelar ; Lorand Jakab
>; Simon Horman 
>Subject: [PATCH net-next v13 0/8] openvswitch: support for layer 3
>encapsulated packets
>
>At the core of this patch set is removing the assumption in Open
>vSwitch datapath that all packets have Ethernet header.
>
>The implementation relies on the presence of pop_eth and push_eth
>actions in datapath flows to facilitate adding and removing Ethernet
>headers as appropriate. The construction of such flows is left up to
>user-space.
>
>This series is based on work by Simon Horman, Lorand Jakab, Thomas
>Morin and others. I kept Lorand's and Simon's s-o-b in the patches that
>are derived from v11 to record their authorship of parts of the code.
>
>Changes from v12 to v13:
>
>* Addressed Pravin's feedback.
>* Removed the GRE vport conversion patch; L3 GRE ports should be
>created by
>  rtnetlink instead.
>
>Main changes from v11 to v12:
>
>* The patches were restructured and split differently for easier
>review.
>* They were rebased and adjusted to the current net-next. Especially
>MPLS
>handling is different (and easier) thanks to the recent MPLS GSO
>rework.
>* Several bugs were discovered and fixed. The most notable is fragment
>handling: header adjustment for ARPHRD_NONE devices on tx needs to be
>done
>after refragmentation, not before it. This required significant changes
>in
>the patchset. Another one is stricter checking of attributes (match on
>L2
>  vs. L3 packet) at the kernel level.
>* Instead of is_layer3 bool, a mac_proto field is used.
>
>Jiri Benc (8):
>  openvswitch: use hard_header_len instead of hardcoded ETH_HLEN
>  openvswitch: add mac_proto field to the flow key
>  openvswitch: pass mac_proto to ovs_vport_send
>  openvswitch: support MPLS push and pop for L3 packets
>  openvswitch: add processing of L3 packets
>  openvswitch: netlink: support L3 packets
>  openvswitch: add Ethernet push and pop actions
>  openvswitch: allow L3 netdev ports
>
> include/uapi/linux/openvswitch.h |  15 
> net/openvswitch/actions.c| 111 +---
> net/openvswitch/datapath.c   |  13 +--
> net/openvswitch/flow.c   | 105 +--
> net/openvswitch/flow.h   |  22 +
>net/openvswitch/flow_netlink.c   | 179
>++-
> net/openvswitch/vport-netdev.c   |   9 +-
> net/openvswitch/vport.c  |  31 +--
> net/openvswitch/vport.h  |   2 +-
> 9 files changed, 353 insertions(+), 134 deletions(-)
>
>--
>1.8.3.1
>
>___
>dev mailing list
>d...@openvswitch.org
>https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Hi.

I am still working on this. Just see my recent discussion with Pravin about the 
way to support out of tree drivers. If you have any opinion on that, please 
share on that thread, preferably with a patch. It's likely that Eric Garver 
will take this task as he was already working with me.

Cascardo.


Re: [PATCH 5/6] platform/x86: make device_attribute const

2017-08-21 Thread Thadeu Lima de Souza Cascardo
For classmate-laptop.c

Acked-by: Thadeu Lima de Souza Cascardo <casca...@holoscopio.com>



[PATCH] caif: use non-archaic spelling of failes

2018-05-31 Thread Thadeu Lima de Souza Cascardo
Signed-off-by: Thadeu Lima de Souza Cascardo 
---
 include/net/caif/caif_layer.h | 2 +-
 net/caif/cfrfml.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/net/caif/caif_layer.h b/include/net/caif/caif_layer.h
index 94e5ed64dc6d..8a114e57bcb6 100644
--- a/include/net/caif/caif_layer.h
+++ b/include/net/caif/caif_layer.h
@@ -22,7 +22,7 @@ struct caif_packet_funcs;
  * @assert: expression to evaluate.
  *
  * This function will print a error message and a do WARN_ON if the
- * assertion failes. Normally this will do a stack up at the current location.
+ * assertion fails. Normally this will do a stack up at the current location.
  */
 #define caif_assert(assert)\
 do {   \
diff --git a/net/caif/cfrfml.c b/net/caif/cfrfml.c
index b82440e1fcb4..3f2c63c78004 100644
--- a/net/caif/cfrfml.c
+++ b/net/caif/cfrfml.c
@@ -85,7 +85,7 @@ static struct cfpkt *rfm_append(struct cfrfml *rfml, char 
*seghead,
tmppkt = cfpkt_append(rfml->incomplete_frm, pkt,
rfml->pdu_size + RFM_HEAD_SIZE);
 
-   /* If cfpkt_append failes input pkts are not freed */
+   /* If cfpkt_append fails input pkts are not freed */
*err = -ENOMEM;
if (tmppkt == NULL)
return NULL;
-- 
2.17.0



[PATCH] vlan: use non-archaic spelling of failes

2018-05-31 Thread Thadeu Lima de Souza Cascardo
Signed-off-by: Thadeu Lima de Souza Cascardo 
---
 include/linux/if_vlan.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
index 78a5a90b4267..83ea4df6ab81 100644
--- a/include/linux/if_vlan.h
+++ b/include/linux/if_vlan.h
@@ -331,7 +331,7 @@ static inline bool 
vlan_hw_offload_capable(netdev_features_t features,
  * @mac_len: MAC header length including outer vlan headers
  *
  * Inserts the VLAN tag into @skb as part of the payload at offset mac_len
- * Returns error if skb_cow_head failes.
+ * Returns error if skb_cow_head fails.
  *
  * Does not change skb->protocol so this function can be used during receive.
  */
@@ -379,7 +379,7 @@ static inline int __vlan_insert_inner_tag(struct sk_buff 
*skb,
  * @vlan_tci: VLAN TCI to insert
  *
  * Inserts the VLAN tag into @skb as part of the payload
- * Returns error if skb_cow_head failes.
+ * Returns error if skb_cow_head fails.
  *
  * Does not change skb->protocol so this function can be used during receive.
  */
-- 
2.17.0



[PATCH v2] xfrm6: call kfree_skb when skb is toobig

2018-08-31 Thread Thadeu Lima de Souza Cascardo
After commit d6990976af7c5d8f55903bfb4289b6fb030bf754 ("vti6: fix PMTU caching
and reporting on xmit"), some too big skbs might be potentially passed down to
__xfrm6_output, causing it to fail to transmit but not free the skb, causing a
leak of skb, and consequentially a leak of dst references.

After running pmtu.sh, that shows as failure to unregister devices in a 
namespace:

[  311.397671] unregister_netdevice: waiting for veth_b to become free. Usage 
count = 1

The fix is to call kfree_skb in case of transmit failures.

Signed-off-by: Thadeu Lima de Souza Cascardo 
Reviewed-by: Sabrina Dubroca 
Fixes: dd767856a36e ("xfrm6: Don't call icmpv6_send on local error")
---
 net/ipv6/xfrm6_output.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/xfrm6_output.c b/net/ipv6/xfrm6_output.c
index 5959ce9620eb..6a74080005cf 100644
--- a/net/ipv6/xfrm6_output.c
+++ b/net/ipv6/xfrm6_output.c
@@ -170,9 +170,11 @@ static int __xfrm6_output(struct net *net, struct sock 
*sk, struct sk_buff *skb)
 
if (toobig && xfrm6_local_dontfrag(skb)) {
xfrm6_local_rxpmtu(skb, mtu);
+   kfree_skb(skb);
return -EMSGSIZE;
} else if (!skb->ignore_df && toobig && skb->sk) {
xfrm_local_error(skb, mtu);
+   kfree_skb(skb);
return -EMSGSIZE;
}
 
-- 
2.17.1



Re: [PATCH net-next v2 2/2] cxgb4: collect hardware dump in second kernel

2018-03-24 Thread Thadeu Lima de Souza Cascardo
On Sat, Mar 24, 2018 at 04:26:34PM +0530, Rahul Lakkireddy wrote:
> Register callback to collect hardware/firmware dumps in second kernel
> before hardware/firmware is initialized.  The dumps for each device
> will be available under /sys/kernel/crashdd/cxgb4/ directory in second
> kernel.
> 
> Signed-off-by: Rahul Lakkireddy 
> Signed-off-by: Ganesh Goudar 
> ---
> v2:
> - No Changes.
> 
> Changes since rfc v2:
> - Update comments and commit message for sysfs change.
> 
> rfc v2:
> - Updated dump registration to the new API in patch 1.
[...]
> diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
> b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
> index e880be8e3c45..265cb026f868 100644
> --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
> +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
> @@ -5527,6 +5527,18 @@ static int init_one(struct pci_dev *pdev, const struct 
> pci_device_id *ent)
>   if (err)
>   goto out_free_adapter;
>  
> + if (is_kdump_kernel()) {
> + /* Collect hardware state and append to
> +  * /sys/kernel/crashdd/cxgb4/ directory
> +  */
> + err = cxgb4_cudbg_crashdd_add_dump(adapter);
> + if (err) {
> + dev_warn(adapter->pdev_dev,
> +  "Fail collecting crash device dump, err: %d. 
> Continuing\n",
> +  err);
> + err = 0;
> + }
> + }
>  

The problem I see with this approach is that you require that the driver
is built into the kdump kernel (or present as a module in the kdump
initramfs), and that you will probe the device during the collection of
the dumps.

IMHO, if you are going to require the device to be probed by the same
driver during kdump, you might just as well use the device object itself
to present the crash data. I think that's what Stephen Hemminger meant
when he said to use sysfs. No need at all for any special crashdd. Just
add an attribute or attribute group to the device object.

Otherwise, as Eric Biederman pointed out, you should just add that data
into the vmcore before you kexec, so you don't even need to look at a
different file, and the driver does not even need to be present in the
kdump kernel.

Cascardo.

>   if (!is_t4(adapter->params.chip)) {
>   s_qpp = (QUEUESPERPAGEPF0_S +
> -- 
> 2.14.1


[PATCH] test_bpf: Fix testing with CONFIG_BPF_JIT_ALWAYS_ON=y on other arches

2018-03-20 Thread Thadeu Lima de Souza Cascardo
Function bpf_fill_maxinsns11 is designed to not be able to be JITed on
x86_64. So, it fails when CONFIG_BPF_JIT_ALWAYS_ON=y, and
commit 09584b406742 ("bpf: fix selftests/bpf test_kmod.sh failure when
CONFIG_BPF_JIT_ALWAYS_ON=y") makes sure that failure is detected on that
case.

However, it does not fail on other architectures, which have a different
JIT compiler design. So, test_bpf has started to fail to load on those.

After this fix, test_bpf loads fine on both x86_64 and ppc64el.

Fixes: 09584b406742 ("bpf: fix selftests/bpf test_kmod.sh failure when 
CONFIG_BPF_JIT_ALWAYS_ON=y")
Signed-off-by: Thadeu Lima de Souza Cascardo <casca...@canonical.com>
---
 lib/test_bpf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index 2efb213716faa..3e9335493fe49 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -5467,7 +5467,7 @@ static struct bpf_test tests[] = {
{
"BPF_MAXINSNS: Jump, gap, jump, ...",
{ },
-#ifdef CONFIG_BPF_JIT_ALWAYS_ON
+#if defined(CONFIG_BPF_JIT_ALWAYS_ON) && defined(CONFIG_X86)
CLASSIC | FLAG_NO_DATA | FLAG_EXPECTED_FAIL,
 #else
CLASSIC | FLAG_NO_DATA,
-- 
2.15.1



Re: [PATCH] test_bpf: Fix testing with CONFIG_BPF_JIT_ALWAYS_ON=y on other arches

2018-03-20 Thread Thadeu Lima de Souza Cascardo
On Tue, Mar 20, 2018 at 09:05:15AM -0700, Yonghong Song wrote:
> 
> 
> On 3/20/18 5:58 AM, Thadeu Lima de Souza Cascardo wrote:
> > Function bpf_fill_maxinsns11 is designed to not be able to be JITed on
> > x86_64. So, it fails when CONFIG_BPF_JIT_ALWAYS_ON=y, and
> > commit 09584b406742 ("bpf: fix selftests/bpf test_kmod.sh failure when
> > CONFIG_BPF_JIT_ALWAYS_ON=y") makes sure that failure is detected on that
> > case.
> > 
> > However, it does not fail on other architectures, which have a different
> > JIT compiler design. So, test_bpf has started to fail to load on those.
> 
> Here, you mentioned that it did not fail on other architectures. Have you
> verified all of them or just looked through the algorithm.

>From our testing, I know at least I get an UNEXPECTED_PASS on arm64, arm, s390x
and ppc64le. i386 doesn't have JIT, so it doesn't have
CONFIG_BPF_JIT_ALWAYS_ON=y.

> 
> Could you give a little bit details about other architectures are okay while
> x86 is not? Maybe, x86 JIT can be improved some how?

As the comment on that functions says:

/* Hits 70 passes on x86_64, so cannot get JITed there. */

And looking at x86_64 JIT compiler, you will notice it's looping trying to
minimize the size of the code, limited to 10 passes. If it does not converge,
it goes back to the non-JIT code.

That's not the case on powerpc or arm, that do not do multiple passes. sparc
seem to do 3 passes, but does not seem to go back to non-JIT code.

Cascardo.

> 
> Thanks!
> 
> > 
> > After this fix, test_bpf loads fine on both x86_64 and ppc64el.
> > 
> > Fixes: 09584b406742 ("bpf: fix selftests/bpf test_kmod.sh failure when 
> > CONFIG_BPF_JIT_ALWAYS_ON=y")
> > Signed-off-by: Thadeu Lima de Souza Cascardo <casca...@canonical.com>
> > ---
> >   lib/test_bpf.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/lib/test_bpf.c b/lib/test_bpf.c
> > index 2efb213716faa..3e9335493fe49 100644
> > --- a/lib/test_bpf.c
> > +++ b/lib/test_bpf.c
> > @@ -5467,7 +5467,7 @@ static struct bpf_test tests[] = {
> > {
> > "BPF_MAXINSNS: Jump, gap, jump, ...",
> > { },
> > -#ifdef CONFIG_BPF_JIT_ALWAYS_ON
> > +#if defined(CONFIG_BPF_JIT_ALWAYS_ON) && defined(CONFIG_X86)
> > CLASSIC | FLAG_NO_DATA | FLAG_EXPECTED_FAIL,
> >   #else
> > CLASSIC | FLAG_NO_DATA,
> > 


[PATCH 2/2] vti6: do not check for ignore_df in order to update pmtu

2018-08-30 Thread Thadeu Lima de Souza Cascardo
Before commit d6990976af7c5d8f55903bfb4289b6fb030bf754 ("vti6: fix PMTU caching
and reporting on xmit"), skb was scrubbed before checking for ignore_df. The
scrubbing meant ignore_df was false, making the check irrelevant. Now that the
scrubbing happens after that, some packets might fail the checking and dst
will not have its pmtu updated.

Not only that, but too big skb will be potentially passed down to
__xfrm6_output, causing it to fail to transmit but not free the skb, causing a
leak of skb, and consequentially a leak of dst references.

After running pmtu.sh, that shows as failure to unregister devices in a 
namespace:

[  311.397671] unregister_netdevice: waiting for veth_b to become free. Usage 
count = 1

Signed-off-by: Thadeu Lima de Souza Cascardo 
---
 net/ipv6/ip6_vti.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index c72ae3a4fe09..fbd3752ea587 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -481,7 +481,7 @@ vti6_xmit(struct sk_buff *skb, struct net_device *dev, 
struct flowi *fl)
}
 
mtu = dst_mtu(dst);
-   if (!skb->ignore_df && skb->len > mtu) {
+   if (skb->len > mtu) {
skb_dst_update_pmtu(skb, mtu);
 
if (skb->protocol == htons(ETH_P_IPV6)) {
-- 
2.17.1



[PATCH 1/2] xfrm6: call kfree_skb when skb is toobig

2018-08-30 Thread Thadeu Lima de Souza Cascardo
After commit d6990976af7c5d8f55903bfb4289b6fb030bf754 ("vti6: fix PMTU caching
and reporting on xmit"), some too big skbs might be potentially passed down to
__xfrm6_output, causing it to fail to transmit but not free the skb, causing a
leak of skb, and consequentially a leak of dst references.

After running pmtu.sh, that shows as failure to unregister devices in a 
namespace:

[  311.397671] unregister_netdevice: waiting for veth_b to become free. Usage 
count = 1

The fix is to call kfree_skb in case of transmit failures.

Signed-off-by: Thadeu Lima de Souza Cascardo 
---
 net/ipv6/xfrm6_output.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/xfrm6_output.c b/net/ipv6/xfrm6_output.c
index 5959ce9620eb..6a74080005cf 100644
--- a/net/ipv6/xfrm6_output.c
+++ b/net/ipv6/xfrm6_output.c
@@ -170,9 +170,11 @@ static int __xfrm6_output(struct net *net, struct sock 
*sk, struct sk_buff *skb)
 
if (toobig && xfrm6_local_dontfrag(skb)) {
xfrm6_local_rxpmtu(skb, mtu);
+   kfree_skb(skb);
return -EMSGSIZE;
} else if (!skb->ignore_df && toobig && skb->sk) {
xfrm_local_error(skb, mtu);
+   kfree_skb(skb);
return -EMSGSIZE;
}
 
-- 
2.17.1