from:"William Tu"

Re: [PATCH v2 0/5] Introducing ixgbe AF_XDP ZC support

2018-10-02 Thread William Tu

On Tue, Oct 2, 2018 at 11:39 AM Björn Töpel  wrote:
>
> On 2018-10-02 20:23, William Tu wrote:
> > On Tue, Oct 2, 2018 at 1:01 AM Björn Töpel  wrote:
> >>
> >> From: Björn Töpel 
> >>
> >> Jeff: Please remove the v1 patches from your dev-queue!
> >>
> >> This patch set introduces zero-copy AF_XDP support for Intel's ixgbe
> >> driver.
> >>
> >> The ixgbe zero-copy code is located in its own file ixgbe_xsk.[ch],
> >> analogous to the i40e ZC support. Again, as in i40e, code paths have
> >> been copied from the XDP path to the zero-copy path. Going forward we
> >> will try to generalize more code between the AF_XDP ZC drivers, and
> >> also reduce the heavy C
> >>
> >> We have run some benchmarks on a dual socket system with two Broadwell
> >> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> >> cores which gives a total of 28, but only two cores are used in these
> >> experiments. One for TX/RX and one for the user space application. The
> >> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> >> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> >> memory. The compiler used is GCC 7.3.0. The NIC is Intel
> >> 82599ES/X520-2 10Gbit/s using the ixgbe driver.
> >>
> >> Below are the results in Mpps of the 82599ES/X520-2 NIC benchmark runs
> >> for 64B and 1500B packets, generated by a commercial packet generator
> >> HW blasting packets at full 10Gbit/s line rate. The results are with
> >> retpoline and all other spectre and meltdown fixes.
> >>
> >> AF_XDP performance 64B packets:
> >> Benchmark   XDP_DRV with zerocopy
> >> rxdrop      14.7
> >> txpush      14.6
> >> l2fwd       11.1
> >>
> >> AF_XDP performance 1500B packets:
> >> Benchmark   XDP_DRV with zerocopy
> >> rxdrop      0.8
> >> l2fwd       0.8
> >>
> >> XDP performance on our system as a base line.
> >>
> >> 64B packets:
> >> XDP stats   CPU Mpps   issue-pps
> >> XDP-RX CPU  16  14.7   0
> >>
> >> 1500B packets:
> >> XDP stats   CPU Mpps   issue-pps
> >> XDP-RX CPU  16  0.8    0
> >>
> >> The structure of the patch set is as follows:
> >>
> >> Patch 1: Introduce Rx/Tx ring enable/disable functionality
> >> Patch 2: Preparatory patch to ixgbe driver code for RX
> >> Patch 3: ixgbe zero-copy support for RX
> >> Patch 4: Preparatory patch to ixgbe driver code for TX
> >> Patch 5: ixgbe zero-copy support for TX
> >>
> >> Changes since v1:
> >>
> >> * Removed redundant AF_XDP precondition checks, pointed out by
> >>   Jakub. Now, the preconditions are only checked at XDP enable time.
> >> * Fixed a crash in the egress path, due to incorrect usage of
> >>   ixgbe_ring queue_index member. In v2 a ring_idx back reference is
> >>   introduced, and used in favor of queue_index. William reported the
> >>   crash, and helped me smoke out the issue. Kudos!
> >
> > Thanks! I tested this series and no more crash.
>
> Thank you for spending time on this!
>
> > The number is pretty good (*without* spectre and meltdown fixes)
> > model name : Intel(R) Xeon(R) CPU E5-2440 v2 @ 1.90GHz, total 16 cores/
> >
> > AF_XDP performance 64B packets:
> > Benchmark   XDP_DRV with zerocopy
> > rxdrop      20
> > txpush      18
> > l2fwd       20
Sorry, please ignore these numbers!
It's actually 2 Mpps from xdpsock, but that's because my sender only sends 2 Mpps.
>
> What is 20 here? Given that 14.8Mpps is maximum for 64B@10Gbit/s for
> one queue, is this multiple queues? Is this xdpsock or OvS with AF_XDP?

I'm redoing the experiments with a higher traffic rate and will report later.
William
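
For reference, the 14.8 Mpps ceiling Björn mentions follows from the per-frame
wire overhead on Ethernet. The sketch below is only an illustration and not
part of the patch set; line_rate_pps() and ETH_WIRE_OVERHEAD_BYTES are
hypothetical names, and it assumes "64B" and "1500B" mean the full frame
length including the FCS.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes on the wire that are not part of the frame itself (assumption:
 * standard Ethernet framing): 7-byte preamble + 1-byte start-of-frame
 * delimiter + 12-byte inter-frame gap.
 */
#define ETH_WIRE_OVERHEAD_BYTES 20u

/* Upper bound on packets per second for a link speed and frame size.
 * frame_bytes is the full frame including the 4-byte FCS.
 */
static uint64_t line_rate_pps(uint64_t link_bps, uint32_t frame_bytes)
{
	assert(frame_bytes > 0);
	return link_bps /
	       ((uint64_t)(frame_bytes + ETH_WIRE_OVERHEAD_BYTES) * 8u);
}
```

With a 10 Gbit/s link this gives roughly 14.88 Mpps for 64-byte frames and
roughly 0.82 Mpps for 1500-byte frames, which is in line with the 14.7 and 0.8
figures in the quoted tables.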

Re: [PATCH v2 4/5] ixgbe: move common Tx functions to ixgbe_txrx_common.h

2018-10-02 Thread William Tu

On Tue, Oct 2, 2018 at 1:01 AM Björn Töpel  wrote:
>
> From: Björn Töpel 
>
> This patch prepares for the upcoming zero-copy Tx functionality by
> moving common functions used both by the regular path and zero-copy
> path.
>
> Signed-off-by: Björn Töpel 
> ---
Thanks!
Tested-by: William Tu

Re: [PATCH v2 2/5] ixgbe: move common Rx functions to ixgbe_txrx_common.h

2018-10-02 Thread William Tu

On Tue, Oct 2, 2018 at 1:01 AM Björn Töpel  wrote:
>
> From: Björn Töpel 
>
> This patch prepares for the upcoming zero-copy Rx functionality, by
> moving/changing linkage of common functions, used both by the regular
> path and zero-copy path.
>
> Signed-off-by: Björn Töpel 
> ---

Thanks!
Tested-by: William Tu

Re: [PATCH v2 5/5] ixgbe: add AF_XDP zero-copy Tx support

2018-10-02 Thread William Tu

On Tue, Oct 2, 2018 at 1:01 AM Björn Töpel  wrote:
>
> From: Björn Töpel 
>
> This patch adds zero-copy Tx support for AF_XDP sockets. It implements
> the ndo_xsk_async_xmit netdev ndo and performs all the Tx logic from a
> NAPI context. This means pulling egress packets from the Tx ring,
> placing the frames on the NIC HW descriptor ring and completing sent
> frames back to the application via the completion ring.
>
> The regular XDP Tx ring is used for AF_XDP as well. The rationale for
> this is as follows: XDP_REDIRECT guarantees mutual exclusion between
> different NAPI contexts based on CPU id. In other words, a netdev can
> XDP_REDIRECT to another netdev with a different NAPI context, since
> the operation is bound to a specific core and each core has its own
> hardware ring.
>
> As the AF_XDP Tx action is running in the same NAPI context and using
> the same ring, it will also be protected from XDP_REDIRECT actions
> with the exact same mechanism.
>
> As with AF_XDP Rx, all AF_XDP Tx specific functions are added to
> ixgbe_xsk.c.
>
> Signed-off-by: Björn Töpel 
> ---

Thanks!
Tested-by: William Tu
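
The Tx flow described in the commit message (pull from the socket Tx ring,
place on the HW descriptor ring, complete back through the completion ring)
can be modelled in user space roughly as below. This is a toy model only: the
struct ring type and the tx_xmit()/tx_complete() helpers are hypothetical and
do not correspond to the ixgbe or xsk kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8u	/* hypothetical; must be a power of two */

struct ring {
	uint64_t slot[RING_SIZE];	/* frame addresses */
	uint32_t prod;			/* free-running producer index */
	uint32_t cons;			/* free-running consumer index */
};

static uint32_t ring_used(const struct ring *r)
{
	return r->prod - r->cons;
}

static bool ring_push(struct ring *r, uint64_t addr)
{
	if (ring_used(r) == RING_SIZE)
		return false;
	r->slot[r->prod++ & (RING_SIZE - 1)] = addr;
	return true;
}

static bool ring_pop(struct ring *r, uint64_t *addr)
{
	if (ring_used(r) == 0)
		return false;
	*addr = r->slot[r->cons++ & (RING_SIZE - 1)];
	return true;
}

/* Move up to budget frames from the socket Tx ring to the HW ring. */
static uint32_t tx_xmit(struct ring *tx, struct ring *hw, uint32_t budget)
{
	uint32_t sent = 0;
	uint64_t addr;

	/* Only pop from tx once we know the HW ring has room. */
	while (sent < budget && ring_used(hw) < RING_SIZE &&
	       ring_pop(tx, &addr)) {
		ring_push(hw, addr);
		sent++;
	}
	return sent;
}

/* Hand up to done transmitted frames back via the completion ring. */
static uint32_t tx_complete(struct ring *hw, struct ring *cq, uint32_t done)
{
	uint32_t n = 0;
	uint64_t addr;

	while (n < done && ring_used(cq) < RING_SIZE && ring_pop(hw, &addr)) {
		ring_push(cq, addr);
		n++;
	}
	return n;
}
```

In this model both helpers would run from the same polling context, which
mirrors the point in the commit message that all Tx logic is done from NAPI.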

Re: [PATCH v2 3/5] ixgbe: add AF_XDP zero-copy Rx support

2018-10-02 Thread William Tu

On Tue, Oct 2, 2018 at 1:01 AM Björn Töpel  wrote:
>
> From: Björn Töpel 
>
> This patch adds zero-copy Rx support for AF_XDP sockets. Instead of
> allocating buffers of type MEM_TYPE_PAGE_SHARED, the Rx frames are
> allocated as MEM_TYPE_ZERO_COPY when AF_XDP is enabled for a certain
> queue.
>
> All AF_XDP specific functions are added to a new file, ixgbe_xsk.c.
>
> Note that when AF_XDP zero-copy is enabled, the XDP action XDP_PASS
> will allocate a new buffer and copy the zero-copy frame prior to passing
> it to the kernel stack.
>
> Signed-off-by: Björn Töpel 
> ---

Thanks!
Tested-by: William Tu 

>  drivers/net/ethernet/intel/ixgbe/Makefile |   3 +-
>  drivers/net/ethernet/intel/ixgbe/ixgbe.h  |  27 +-
>  drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c  |  17 +-
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  78 ++-
>  .../ethernet/intel/ixgbe/ixgbe_txrx_common.h  |  15 +
>  drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  | 628 ++
>  6 files changed, 747 insertions(+), 21 deletions(-)
>  create mode 100644 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
>

Re: [PATCH v2 1/5] ixgbe: added Rx/Tx ring disable/enable functions

2018-10-02 Thread William Tu

On Tue, Oct 2, 2018 at 1:01 AM Björn Töpel  wrote:
>
> From: Björn Töpel 
>
> Add functions for Rx/Tx ring enable/disable. Instead of resetting the
> whole device, only the affected ring is disabled or enabled.
>
> This plumbing is used in later commits, when zero-copy AF_XDP support
> is introduced.
>
> Signed-off-by: Björn Töpel 

Thanks!
Tested-by: William Tu

Re: [PATCH v2 0/5] Introducing ixgbe AF_XDP ZC support

2018-10-02 Thread William Tu

On Tue, Oct 2, 2018 at 1:01 AM Björn Töpel  wrote:
>
> From: Björn Töpel 
>
> Jeff: Please remove the v1 patches from your dev-queue!
>
> This patch set introduces zero-copy AF_XDP support for Intel's ixgbe
> driver.
>
> The ixgbe zero-copy code is located in its own file ixgbe_xsk.[ch],
> analogous to the i40e ZC support. Again, as in i40e, code paths have
> been copied from the XDP path to the zero-copy path. Going forward we
> will try to generalize more code between the AF_XDP ZC drivers, and
> also reduce the heavy C
>
> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments. One for TX/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is GCC 7.3.0. The NIC is Intel
> 82599ES/X520-2 10Gbit/s using the ixgbe driver.
>
> Below are the results in Mpps of the 82599ES/X520-2 NIC benchmark runs
> for 64B and 1500B packets, generated by a commercial packet generator
> HW blasting packets at full 10Gbit/s line rate. The results are with
> retpoline and all other spectre and meltdown fixes.
>
> AF_XDP performance 64B packets:
> Benchmark   XDP_DRV with zerocopy
> rxdrop      14.7
> txpush      14.6
> l2fwd       11.1
>
> AF_XDP performance 1500B packets:
> Benchmark   XDP_DRV with zerocopy
> rxdrop      0.8
> l2fwd       0.8
>
> XDP performance on our system as a base line.
>
> 64B packets:
> XDP stats   CPU Mpps   issue-pps
> XDP-RX CPU  16  14.7   0
>
> 1500B packets:
> XDP stats   CPU Mpps   issue-pps
> XDP-RX CPU  16  0.8    0
>
> The structure of the patch set is as follows:
>
> Patch 1: Introduce Rx/Tx ring enable/disable functionality
> Patch 2: Preparatory patch to ixgbe driver code for RX
> Patch 3: ixgbe zero-copy support for RX
> Patch 4: Preparatory patch to ixgbe driver code for TX
> Patch 5: ixgbe zero-copy support for TX
>
> Changes since v1:
>
> * Removed redundant AF_XDP precondition checks, pointed out by
>   Jakub. Now, the preconditions are only checked at XDP enable time.
> * Fixed a crash in the egress path, due to incorrect usage of
>   ixgbe_ring queue_index member. In v2 a ring_idx back reference is
>   introduced, and used in favor of queue_index. William reported the
>   crash, and helped me smoke out the issue. Kudos!

Thanks! I tested this series and no more crash.
The number is pretty good (*without* spectre and meltdown fixes)
model name : Intel(R) Xeon(R) CPU E5-2440 v2 @ 1.90GHz, total 16 cores/

AF_XDP performance 64B packets:
Benchmark   XDP_DRV with zerocopy
rxdrop      20
txpush      18
l2fwd       20

Regards,
William

Re: [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e

2018-08-29 Thread William Tu

> Thanks for working on this, LGTM! Are you also planning to get ixgbe
> out after that?
>

I currently don't have i40e nic to test, so
I'm also looking forward to the ixgbe patch!

Thank you
William

Re: [PATCH net-next] openvswitch: Derive IP protocol number for IPv6 later frags

2018-08-13 Thread William Tu

On Sun, Aug 12, 2018 at 6:09 PM Pravin Shelar  wrote:
>
> On Fri, Aug 10, 2018 at 10:19 AM, Yi-Hung Wei  wrote:
> > Currently, OVS only parses the IP protocol number for the first
> > IPv6 fragment, but sets the IP protocol number for the later fragments
> > > to be NEXTHDR_FRAGMENT.  This patch tries to derive the IP protocol
> > number for the IPV6 later frags so that we can match that.
> >
> > Signed-off-by: Yi-Hung Wei 
> > ---
> >  net/openvswitch/flow.c | 8 +++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
> > index 56b8e7167790..3d654c4f71be 100644
> > --- a/net/openvswitch/flow.c
> > +++ b/net/openvswitch/flow.c
> > @@ -297,7 +297,13 @@ static int parse_ipv6hdr(struct sk_buff *skb, struct 
> > sw_flow_key *key)
> >
> > nh_len = payload_ofs - nh_ofs;
> > skb_set_transport_header(skb, nh_ofs + nh_len);
> > -   key->ip.proto = nexthdr;
> > +   if (key->ip.frag == OVS_FRAG_TYPE_LATER) {
> > +   unsigned int offset = 0;

How about we start the 2nd time parsing from
unsigned int offset = payload_ofs;

> > +
> > > +   key->ip.proto = ipv6_find_hdr(skb, &offset, -1, NULL, NULL);

Then we only need to find the last header starting from the previously parsed offset.

William

> > +   } else {
> > +   key->ip.proto = nexthdr;
> > +   }
> When parsing IPv6, ipv6_skip_exthdr() is called to find the fragment hdr and
> then this patch calls ipv6_find_hdr() to find the next protocol. I think
> we could call ipv6_find_hdr() to get both the fragment type and the next hdr;
> that would save parsing the same packet twice in some cases.
>
> Other option would be calling ipv6_find_hdr() after setting 
> OVS_FRAG_TYPE_LATER.
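
To make the discussion above concrete, here is a rough user-space sketch of
walking the IPv6 extension header chain to find the upper-layer protocol,
including the later-fragment case. It is not the kernel's ipv6_find_hdr();
find_upper_proto() and the HDR_* constants are hypothetical names, and buf is
assumed to start at the first header following the fixed IPv6 header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* IANA protocol numbers of the extension headers handled here. */
enum {
	HDR_HOPOPTS  = 0,
	HDR_ROUTING  = 43,
	HDR_FRAGMENT = 44,
	HDR_AUTH     = 51,
	HDR_DEST     = 60,
};

static int is_ext_hdr(uint8_t nh)
{
	return nh == HDR_HOPOPTS || nh == HDR_ROUTING || nh == HDR_FRAGMENT ||
	       nh == HDR_AUTH || nh == HDR_DEST;
}

/* Returns the upper-layer protocol number, or -1 if the buffer is
 * truncated or the protocol cannot be determined from this packet.
 */
static int find_upper_proto(const uint8_t *buf, size_t len, uint8_t nexthdr)
{
	size_t off = 0;

	while (is_ext_hdr(nexthdr)) {
		size_t hdrlen;

		if (len - off < 8)
			return -1;

		if (nexthdr == HDR_FRAGMENT) {
			/* Upper 13 bits of bytes 2..3: fragment offset. */
			uint16_t frag_off =
				((buf[off + 2] << 8) | buf[off + 3]) & 0xfff8;

			hdrlen = 8;
			if (frag_off != 0) {
				/* Later fragment: nothing after the fragment
				 * header can be parsed in this packet, so the
				 * answer has to come from its Next Header
				 * field.
				 */
				return is_ext_hdr(buf[off]) ? -1 : buf[off];
			}
		} else if (nexthdr == HDR_AUTH) {
			hdrlen = ((size_t)buf[off + 1] + 2) * 4;
		} else {
			hdrlen = ((size_t)buf[off + 1] + 1) * 8;
		}

		if (hdrlen > len - off)
			return -1;

		nexthdr = buf[off];
		off += hdrlen;
	}
	return nexthdr;
}
```

William's suggestion in the thread corresponds to starting such a walk at the
offset the first pass already reached instead of at 0, so that the earlier
headers are not parsed a second time.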

Re: [PATCH net-next] net: ip6_gre: get ipv6hdr after skb_cow_head()

2018-07-13 Thread William Tu

On Thu, Jul 12, 2018 at 10:40 PM, Prashant Bhole
 wrote:
> A KASAN:use-after-free bug was found related to ip6-erspan
> while running selftests/net/ip6_gre_headroom.sh
>
> It happens because of following sequence:
> - ipv6hdr pointer is obtained from skb
> - skb_cow_head() is called, skb->head memory is reallocated
> - old data is accessed using ipv6hdr pointer
>
> skb_cow_head() call was added in e41c7c68ea77 ("ip6erspan: make sure
> enough headroom at xmit."), but looking at the history there was a
> chance of similar bug because gre_handle_offloads() and pskb_trim()
> can also reallocate skb->head memory. Fixes tag points to commit
> which introduced possibility of this bug.
>
> This patch moves ipv6hdr pointer assignment after skb_cow_head() call.
>
> Fixes: 5a963eb61b7c ("ip6_gre: Add ERSPAN native tunnel support")
> Signed-off-by: Prashant Bhole 
> ---

Thanks for the fix.
Acked-by: William Tu
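
The bug pattern fixed here (a header pointer taken before a call that may
reallocate the buffer) can be reproduced in plain C. The following is a
user-space analogy only: struct buf and buf_grow() are hypothetical stand-ins
for the skb and skb_cow_head(), not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct buf {
	unsigned char *head;
	size_t len;
};

/* Hypothetical stand-in for skb_cow_head(): may move b->head. */
static int buf_grow(struct buf *b, size_t extra)
{
	unsigned char *n = realloc(b->head, b->len + extra);

	if (!n)
		return -1;
	b->head = n;
	b->len += extra;
	return 0;
}

/* Reads the IP version nibble from the first byte of the buffer. */
static int read_version_after_grow(struct buf *b, size_t extra)
{
	const unsigned char *hdr;

	assert(b->len > 0);
	if (buf_grow(b, extra))
		return -1;

	/* Derive the header pointer only after the buffer may have moved.
	 * Taking it before buf_grow() would leave it dangling.
	 */
	hdr = b->head;
	return hdr[0] >> 4;
}
```

The fix in the quoted patch has the same shape: the ipv6hdr pointer assignment
is moved to after the skb_cow_head() call.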

Re: [PATCH bpf-net] selftests/bpf: delete xfrm tunnel when test exits.

2018-06-15 Thread William Tu

On Thu, Jun 14, 2018 at 10:24 PM, Eyal Birger  wrote:
>
>
>> On 14 Jun 2018, at 15:01, William Tu  wrote:
>>
>> Make the printing of bpf xfrm tunnel better and
>> clean up xfrm state and policy when xfrm test finishes.
>
> Yeah the ‘tee’ was useful when developing the test - I could see what’s going 
> on :)
>
> Now that it’s in ‘selftests’ it’s definitely better without it.
>
> Thanks for the cleanup!
> Eyal.

Hi Eyal,
Thanks for double-checking!

Hi Daniel and Martin,
Sorry for the confusing "bpf-net". It should be "net".

Thanks
William

[PATCH bpf-net] selftests/bpf: delete xfrm tunnel when test exits.

2018-06-14 Thread William Tu

Make the printing of bpf xfrm tunnel better and
clean up xfrm state and policy when xfrm test finishes.

Signed-off-by: William Tu 
---
 tools/testing/selftests/bpf/test_tunnel.sh | 24 +---
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_tunnel.sh 
b/tools/testing/selftests/bpf/test_tunnel.sh
index aeb2901f21f4..7b1946b340be 100755
--- a/tools/testing/selftests/bpf/test_tunnel.sh
+++ b/tools/testing/selftests/bpf/test_tunnel.sh
@@ -608,28 +608,26 @@ setup_xfrm_tunnel()
 test_xfrm_tunnel()
 {
config_device
-#tcpdump -nei veth1 ip &
-   output=$(mktemp)
-   cat /sys/kernel/debug/tracing/trace_pipe | tee $output &
-setup_xfrm_tunnel
+   > /sys/kernel/debug/tracing/trace
+   setup_xfrm_tunnel
tc qdisc add dev veth1 clsact
tc filter add dev veth1 proto ip ingress bpf da obj test_tunnel_kern.o \
sec xfrm_get_state
ip netns exec at_ns0 ping $PING_ARG 10.1.1.200
sleep 1
-   grep "reqid 1" $output
+   grep "reqid 1" /sys/kernel/debug/tracing/trace
check_err $?
-   grep "spi 0x1" $output
+   grep "spi 0x1" /sys/kernel/debug/tracing/trace
check_err $?
-   grep "remote ip 0xac100164" $output
+   grep "remote ip 0xac100164" /sys/kernel/debug/tracing/trace
check_err $?
cleanup
 
if [ $ret -ne 0 ]; then
-echo -e ${RED}"FAIL: xfrm tunnel"${NC}
-return 1
-fi
-echo -e ${GREEN}"PASS: xfrm tunnel"${NC}
+   echo -e ${RED}"FAIL: xfrm tunnel"${NC}
+   return 1
+   fi
+   echo -e ${GREEN}"PASS: xfrm tunnel"${NC}
 }
 
 attach_bpf()
@@ -657,6 +655,10 @@ cleanup()
ip link del ip6geneve11 2> /dev/null
ip link del erspan11 2> /dev/null
ip link del ip6erspan11 2> /dev/null
+   ip xfrm policy delete dir out src 10.1.1.200/32 dst 10.1.1.100/32 2> 
/dev/null
+   ip xfrm policy delete dir in src 10.1.1.100/32 dst 10.1.1.200/32 2> 
/dev/null
+   ip xfrm state delete src 172.16.1.100 dst 172.16.1.200 proto esp spi 
0x1 2> /dev/null
+   ip xfrm state delete src 172.16.1.200 dst 172.16.1.100 proto esp spi 
0x2 2> /dev/null
 }
 
 cleanup_exit()
-- 
2.7.4

Re: [PATCH net-next] selftests: net: Test headroom handling of ip6_gre devices

2018-05-24 Thread William Tu

Hi Petr,

I tried to test this patch on the latest net-next but encountered a couple of issues.

On Wed, May 23, 2018 at 9:41 AM, Petr Machata  wrote:
> Commit 5691484df961 ("net: ip6_gre: Fix headroom request in
> ip6erspan_tunnel_xmit()") and commit 01b8d064d58b ("net: ip6_gre:
> Request headroom in __gre6_xmit()") fix problems in reserving headroom
> in the packets tunneled through ip6gre/tap and ip6erspan netdevices.
>
> These two patches included snippets that reproduced the issues. This
> patch elevates the snippets to a full-fledged test case.
>
> Suggested-by: David Miller 
> Signed-off-by: Petr Machata 
> ---
>  tools/testing/selftests/net/ip6_gre_headroom.sh | 59 
> +
>  1 file changed, 59 insertions(+)
>  create mode 100755 tools/testing/selftests/net/ip6_gre_headroom.sh
>
> diff --git a/tools/testing/selftests/net/ip6_gre_headroom.sh 
> b/tools/testing/selftests/net/ip6_gre_headroom.sh
> new file mode 100755
> index 000..9aaf63fd
> --- /dev/null
> +++ b/tools/testing/selftests/net/ip6_gre_headroom.sh
> @@ -0,0 +1,59 @@
> +#!/bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# Test that enough headroom is reserved for the first packet passing through 
> an
> +# IPv6 GRE-like netdevice.
> +
> +setup_prepare()
> +{
> +   ip link add h1 type veth peer name swp1
> +   ip link add h3 type veth peer name swp3
> +
> +   ip link set dev h1 up
> +   ip address add 192.0.2.1/28 dev h1
> +
> +   ip link add dev vh3 type vrf table 20
> +   ip link set dev h3 master vh3
> +   ip link set dev vh3 up
> +   ip link set dev h3 up
> +
> +   ip link set dev swp3 up
> +   ip address add dev swp3 2001:db8:2::1/64
> +
> +   ip link set dev swp1 up
> +   tc qdisc add dev swp1 clsact
> +}
> +
> +cleanup()
> +{
> +   ip link del dev swp1
> +   ip link del dev swp3
> +   ip link del dev vh3
I think we also need to do:
ip link del dev gt6

> +}
> +
> +test_headroom()
> +{
> +   ip link add name gt6 "$@"
> +   ip link set dev gt6 up
> +
> +   sleep 1
> +
> +   tc filter add dev swp1 ingress pref 1000 matchall skip_hw \
> +   action mirred egress mirror dev gt6
> +   ping -I h1 192.0.2.2 -c 1 -w 2 &> /dev/null

I increased the ping count from 1 to 1000,
and after a while the program hangs when I try to ctrl+c
+ cleanup
+ ip link del dev swp1
dmesg shows:

[ 1256.002453] unregister_netdevice: waiting for swp1 to become free.
Usage count = 9
[ 1266.082571] unregister_netdevice: waiting for swp1 to become free.
Usage count = 9
[ 1276.163011] unregister_netdevice: waiting for swp1 to become free.
Usage count = 9

Thanks
William

[PATCHv2 net-next] erspan: set bso bit based on mirrored packet's len

2018-05-18 Thread William Tu

Before the patch, the erspan BSO bit (Bad/Short/Oversized) is not
handled.  BSO has 4 possible values:
  00 --> Good frame with no error, or unknown integrity
  11 --> Payload is a Bad Frame with CRC or Alignment Error
  01 --> Payload is a Short Frame
  10 --> Payload is an Oversized Frame

Based on the short/oversized definitions in RFC1757, the patch sets
the bso bit based on the mirrored packet's size.

Reported-by: Xiaoyan Jin <xiaoy...@vmware.com>
Signed-off-by: William Tu <u9012...@gmail.com>
---
v1->v2
  Improve code comments, make enum erspan_bso clearer
---
 include/net/erspan.h | 28 
 1 file changed, 28 insertions(+)

diff --git a/include/net/erspan.h b/include/net/erspan.h
index d044aa60cc76..b39643ef4c95 100644
--- a/include/net/erspan.h
+++ b/include/net/erspan.h
@@ -219,6 +219,33 @@ static inline __be32 erspan_get_timestamp(void)
return htonl((u32)h_usecs);
 }
 
+/* ERSPAN BSO (Bad/Short/Oversized), see RFC1757
+ *   00b --> Good frame with no error, or unknown integrity
+ *   01b --> Payload is a Short Frame
+ *   10b --> Payload is an Oversized Frame
+ *   11b --> Payload is a Bad Frame with CRC or Alignment Error
+ */
+enum erspan_bso {
+   BSO_NOERROR = 0x0,
+   BSO_SHORT = 0x1,
+   BSO_OVERSIZED = 0x2,
+   BSO_BAD = 0x3,
+};
+
+static inline u8 erspan_detect_bso(struct sk_buff *skb)
+{
+   /* BSO_BAD is not handled because the frame CRC
+* or alignment error information is in FCS.
+*/
+   if (skb->len < ETH_ZLEN)
+   return BSO_SHORT;
+
+   if (skb->len > ETH_FRAME_LEN)
+   return BSO_OVERSIZED;
+
+   return BSO_NOERROR;
+}
+
 static inline void erspan_build_header_v2(struct sk_buff *skb,
  u32 id, u8 direction, u16 hwid,
  bool truncate, bool is_ipv4)
@@ -248,6 +275,7 @@ static inline void erspan_build_header_v2(struct sk_buff 
*skb,
vlan_tci = ntohs(qp->tci);
}
 
+   bso = erspan_detect_bso(skb);
skb_push(skb, sizeof(*ershdr) + ERSPAN_V2_MDSIZE);
ershdr = (struct erspan_base_hdr *)skb->data;
memset(ershdr, 0, sizeof(*ershdr) + ERSPAN_V2_MDSIZE);
-- 
2.7.4
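
The size classification in erspan_detect_bso() can be tried outside the kernel
with a sketch like the one below. It takes a plain length instead of an skb,
and the ETH_ZLEN/ETH_FRAME_LEN values are copied in by hand on the assumption
that they match linux/if_ether.h (60 and 1514); detect_bso() is a hypothetical
name.

```c
#include <assert.h>
#include <stddef.h>

#define ETH_ZLEN	60	/* min frame length without FCS (assumed) */
#define ETH_FRAME_LEN	1514	/* max frame length without FCS (assumed) */

enum erspan_bso {
	BSO_NOERROR   = 0x0,
	BSO_SHORT     = 0x1,
	BSO_OVERSIZED = 0x2,
	BSO_BAD       = 0x3,	/* never returned: needs FCS information */
};

/* Classify a mirrored frame purely by its length. */
static enum erspan_bso detect_bso(size_t frame_len)
{
	if (frame_len < ETH_ZLEN)
		return BSO_SHORT;

	if (frame_len > ETH_FRAME_LEN)
		return BSO_OVERSIZED;

	return BSO_NOERROR;
}
```

Both boundaries are inclusive on the "good" side: a frame of exactly 60 or
exactly 1514 bytes is reported as BSO_NOERROR.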

[PATCH net] net: ip6_gre: fix tunnel metadata device sharing.

2018-05-18 Thread William Tu

Currently ip6gre and ip6erspan share single metadata mode device,
using 'collect_md_tun'.  Thus, when doing:
  ip link add dev ip6gre11 type ip6gretap external
  ip link add dev ip6erspan12 type ip6erspan external
  RTNETLINK answers: File exists
the second command simply fails because it tries to create the same collect_md_tun.

The patch fixes it by adding a separate collect md tunnel device
for the ip6erspan, 'collect_md_tun_erspan'.  As a result, a couple
of places need to be refactored/split up in order to distinguish ip6gre
and ip6erspan.

First, move the collect_md check out of ip6gre_tunnel_{unlink,link} and
create separate functions {ip6gre,ip6erspan}_tunnel_{link_md,unlink_md}.
Then before link/unlink, make sure the link_md/unlink_md is called.
Finally, a separate ndo_uninit is created for ip6erspan.  Tested it
using the samples/bpf/test_tunnel_bpf.sh.

Fixes: ef7baf5e083c ("ip6_gre: add ip6 erspan collect_md mode")
Signed-off-by: William Tu <u9012...@gmail.com>
---
 net/ipv6/ip6_gre.c | 101 +
 1 file changed, 79 insertions(+), 22 deletions(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 5162ecc45c20..458de353f5d9 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -71,6 +71,7 @@ struct ip6gre_net {
struct ip6_tnl __rcu *tunnels[4][IP6_GRE_HASH_SIZE];
 
struct ip6_tnl __rcu *collect_md_tun;
+   struct ip6_tnl __rcu *collect_md_tun_erspan;
struct net_device *fb_tunnel_dev;
 };
 
@@ -233,7 +234,12 @@ static struct ip6_tnl *ip6gre_tunnel_lookup(struct 
net_device *dev,
if (cand)
return cand;
 
-   t = rcu_dereference(ign->collect_md_tun);
+   if (gre_proto == htons(ETH_P_ERSPAN) ||
+   gre_proto == htons(ETH_P_ERSPAN2))
+   t = rcu_dereference(ign->collect_md_tun_erspan);
+   else
+   t = rcu_dereference(ign->collect_md_tun);
+
if (t && t->dev->flags & IFF_UP)
return t;
 
@@ -262,6 +268,31 @@ static struct ip6_tnl __rcu **__ip6gre_bucket(struct 
ip6gre_net *ign,
return &ign->tunnels[prio][h];
 }
 
+static void ip6gre_tunnel_link_md(struct ip6gre_net *ign, struct ip6_tnl *t)
+{
+   if (t->parms.collect_md)
+   rcu_assign_pointer(ign->collect_md_tun, t);
+}
+
+static void ip6erspan_tunnel_link_md(struct ip6gre_net *ign, struct ip6_tnl *t)
+{
+   if (t->parms.collect_md)
+   rcu_assign_pointer(ign->collect_md_tun_erspan, t);
+}
+
+static void ip6gre_tunnel_unlink_md(struct ip6gre_net *ign, struct ip6_tnl *t)
+{
+   if (t->parms.collect_md)
+   rcu_assign_pointer(ign->collect_md_tun, NULL);
+}
+
+static void ip6erspan_tunnel_unlink_md(struct ip6gre_net *ign,
+  struct ip6_tnl *t)
+{
+   if (t->parms.collect_md)
+   rcu_assign_pointer(ign->collect_md_tun_erspan, NULL);
+}
+
 static inline struct ip6_tnl __rcu **ip6gre_bucket(struct ip6gre_net *ign,
const struct ip6_tnl *t)
 {
@@ -272,9 +303,6 @@ static void ip6gre_tunnel_link(struct ip6gre_net *ign, 
struct ip6_tnl *t)
 {
struct ip6_tnl __rcu **tp = ip6gre_bucket(ign, t);
 
-   if (t->parms.collect_md)
-   rcu_assign_pointer(ign->collect_md_tun, t);
-
rcu_assign_pointer(t->next, rtnl_dereference(*tp));
rcu_assign_pointer(*tp, t);
 }
@@ -284,9 +312,6 @@ static void ip6gre_tunnel_unlink(struct ip6gre_net *ign, 
struct ip6_tnl *t)
struct ip6_tnl __rcu **tp;
struct ip6_tnl *iter;
 
-   if (t->parms.collect_md)
-   rcu_assign_pointer(ign->collect_md_tun, NULL);
-
for (tp = ip6gre_bucket(ign, t);
 (iter = rtnl_dereference(*tp)) != NULL;
 tp = &iter->next) {
@@ -375,11 +400,23 @@ static struct ip6_tnl *ip6gre_tunnel_locate(struct net 
*net,
return NULL;
 }
 
+static void ip6erspan_tunnel_uninit(struct net_device *dev)
+{
+   struct ip6_tnl *t = netdev_priv(dev);
+   struct ip6gre_net *ign = net_generic(t->net, ip6gre_net_id);
+
+   ip6erspan_tunnel_unlink_md(ign, t);
+   ip6gre_tunnel_unlink(ign, t);
+   dst_cache_reset(&t->dst_cache);
+   dev_put(dev);
+}
+
 static void ip6gre_tunnel_uninit(struct net_device *dev)
 {
struct ip6_tnl *t = netdev_priv(dev);
struct ip6gre_net *ign = net_generic(t->net, ip6gre_net_id);
 
+   ip6gre_tunnel_unlink_md(ign, t);
ip6gre_tunnel_unlink(ign, t);
dst_cache_reset(&t->dst_cache);
dev_put(dev);
@@ -1806,7 +1843,7 @@ static int ip6erspan_tap_init(struct net_device *dev)
 
 static const struct net_device_ops ip6erspan_netdev_ops = {
.ndo_init = ip6erspan_tap_init,
-   .ndo_uninit =   ip6gre_tunnel_uninit,
+   .ndo_uninit =   ip6erspan_tunnel_uninit,
.ndo_start_xmit =   ip6erspan_tunnel_xmit,
.ndo_set_mac_address
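
The idea of the fix, one metadata-mode slot per tunnel family selected by GRE
protocol, can be illustrated with the stand-alone sketch below. It is a
simplified model: struct tunnel, struct gre_net, md_slot() and link_md() are
hypothetical, there is no RCU, protocol values are compared in host byte order,
and the EtherType values are written out on the assumption that they match
linux/if_ether.h.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_P_TEB	0x6558	/* transparent Ethernet bridging (gretap) */
#define ETH_P_ERSPAN	0x88BE	/* ERSPAN type II */
#define ETH_P_ERSPAN2	0x22EB	/* ERSPAN version 2 (type III) */

struct tunnel {
	const char *name;
};

struct gre_net {
	struct tunnel *collect_md_tun;		/* ip6gre/ip6gretap */
	struct tunnel *collect_md_tun_erspan;	/* ip6erspan */
};

/* Pick the metadata-mode slot that belongs to this GRE protocol. */
static struct tunnel **md_slot(struct gre_net *ign, uint16_t gre_proto)
{
	if (gre_proto == ETH_P_ERSPAN || gre_proto == ETH_P_ERSPAN2)
		return &ign->collect_md_tun_erspan;
	return &ign->collect_md_tun;
}

/* Register a metadata-mode tunnel; only one is allowed per slot. */
static int link_md(struct gre_net *ign, uint16_t gre_proto, struct tunnel *t)
{
	struct tunnel **slot = md_slot(ign, gre_proto);

	if (*slot)
		return -EEXIST;
	*slot = t;
	return 0;
}
```

With a single shared slot the second link_md() call would fail with -EEXIST,
which is the "File exists" error shown in the commit message. With two slots,
one ip6gretap and one ip6erspan metadata device can coexist.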

Re: [PATCH net 7/7] net: ip6_gre: Fix ip6erspan hlen calculation

2018-05-17 Thread William Tu

On Thu, May 17, 2018 at 7:36 AM, Petr Machata <pe...@mellanox.com> wrote:
> Even though ip6erspan_tap_init() sets up hlen and tun_hlen according to
> what ERSPAN needs, it goes ahead to call ip6gre_tnl_link_config() which
> overwrites these settings with GRE-specific ones.
>
> Similarly for changelink callbacks, which are handled by
> ip6gre_changelink() calls ip6gre_tnl_change() calls
> ip6gre_tnl_link_config() as well.
>
> The difference ends up being 12 vs. 20 bytes, and this is generally not
> a problem, because a 12-byte request likely ends up allocating more and
> the extra 8 bytes are thus available. However correct it is not.
>
> So replace the newlink and changelink callbacks with an ERSPAN-specific
> ones, reusing the newly-introduced _common() functions.
>
> Signed-off-by: Petr Machata <pe...@mellanox.com>
> ---

Acked-by: William Tu <u9012...@gmail.com>

Thanks, using ERSPAN-specific newlink and changelink is also on
my todo list.  I'm hitting another issue related to the shared collect_md_tun
between erspan and gre, I will make my patch based on this series.
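
As a worked example of the arithmetic in ip6erspan_calc_hlen() from the quoted
patch, the sketch below recomputes hlen in user space. The header sizes are
assumptions written out by hand (4-byte ERSPAN base header, 4 or 8 bytes of
version-specific metadata, 40-byte IPv6 header), and the function names are
hypothetical.

```c
#include <assert.h>

#define GRE_TUN_HLEN		8	/* tun_hlen set by the patch */
#define ERSPAN_BASE_HDR_LEN	4	/* assumed size of the base header */
#define ERSPAN_V1_MDSIZE	4	/* assumed */
#define ERSPAN_V2_MDSIZE	8	/* assumed */
#define IPV6_HDR_LEN		40

static int erspan_hdr_len_sketch(int version)
{
	assert(version == 1 || version == 2);
	return ERSPAN_BASE_HDR_LEN +
	       (version == 1 ? ERSPAN_V1_MDSIZE : ERSPAN_V2_MDSIZE);
}

/* hlen: everything between the IPv6 header and the mirrored frame. */
static int ip6erspan_hlen_sketch(int encap_hlen, int version)
{
	return GRE_TUN_HLEN + encap_hlen + erspan_hdr_len_sketch(version);
}

/* t_hlen: hlen plus the outer IPv6 header. */
static int ip6erspan_t_hlen_sketch(int encap_hlen, int version)
{
	return ip6erspan_hlen_sketch(encap_hlen, version) + IPV6_HDR_LEN;
}
```

Under these assumptions an ERSPAN v2 tunnel without extra encapsulation gets
hlen = 8 + 0 + 12 = 20, the larger of the two figures in Petr's description.
The 12-byte figure is the GRE-specific value that overwrote it.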


>  net/ipv6/ip6_gre.c | 74 
> +++---
>  1 file changed, 65 insertions(+), 9 deletions(-)
>
> diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
> index c17e38b..36b1669 100644
> --- a/net/ipv6/ip6_gre.c
> +++ b/net/ipv6/ip6_gre.c
> @@ -81,6 +81,7 @@ static int ip6gre_tunnel_init(struct net_device *dev);
>  static void ip6gre_tunnel_setup(struct net_device *dev);
>  static void ip6gre_tunnel_link(struct ip6gre_net *ign, struct ip6_tnl *t);
>  static void ip6gre_tnl_link_config(struct ip6_tnl *t, int set_mtu);
> +static void ip6erspan_tnl_link_config(struct ip6_tnl *t, int set_mtu);
>
>  /* Tunnel hash table */
>
> @@ -1751,6 +1752,19 @@ static const struct net_device_ops 
> ip6gre_tap_netdev_ops = {
> .ndo_get_iflink = ip6_tnl_get_iflink,
>  };
>
> +static int ip6erspan_calc_hlen(struct ip6_tnl *tunnel)
> +{
> +   int t_hlen;
> +
> +   tunnel->tun_hlen = 8;
> +   tunnel->hlen = tunnel->tun_hlen + tunnel->encap_hlen +
> +  erspan_hdr_len(tunnel->parms.erspan_ver);
> +
> +   t_hlen = tunnel->hlen + sizeof(struct ipv6hdr);
> +   tunnel->dev->hard_header_len = LL_MAX_HEADER + t_hlen;
> +   return t_hlen;
> +}
> +
>  static int ip6erspan_tap_init(struct net_device *dev)
>  {
> struct ip6_tnl *tunnel;
> @@ -1774,12 +1788,7 @@ static int ip6erspan_tap_init(struct net_device *dev)
> return ret;
> }
>
> -   tunnel->tun_hlen = 8;
> -   tunnel->hlen = tunnel->tun_hlen + tunnel->encap_hlen +
> -  erspan_hdr_len(tunnel->parms.erspan_ver);
> -   t_hlen = tunnel->hlen + sizeof(struct ipv6hdr);
> -
> -   dev->hard_header_len = LL_MAX_HEADER + t_hlen;
> +   t_hlen = ip6erspan_calc_hlen(tunnel);
> dev->mtu = ETH_DATA_LEN - t_hlen;
> if (dev->type == ARPHRD_ETHER)
> dev->mtu -= ETH_HLEN;
> @@ -1787,7 +1796,7 @@ static int ip6erspan_tap_init(struct net_device *dev)
> dev->mtu -= 8;
>
> dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
> -   ip6gre_tnl_link_config(tunnel, 1);
> +   ip6erspan_tnl_link_config(tunnel, 1);
>
> return 0;
>  }
> @@ -2118,6 +2127,53 @@ static void ip6erspan_tap_setup(struct net_device *dev)
> netif_keep_dst(dev);
>  }
>
> +static int ip6erspan_newlink(struct net *src_net, struct net_device *dev,
> +struct nlattr *tb[], struct nlattr *data[],
> +struct netlink_ext_ack *extack)
> +{
> +   int err = ip6gre_newlink_common(src_net, dev, tb, data, extack);
> +   struct ip6_tnl *nt = netdev_priv(dev);
> +   struct net *net = dev_net(dev);
> +
> +   if (!err) {
> +   ip6erspan_tnl_link_config(nt, !tb[IFLA_MTU]);
> +   ip6gre_tunnel_link(net_generic(net, ip6gre_net_id), nt);
> +   }
> +   return err;
> +}
> +
> +static void ip6erspan_tnl_link_config(struct ip6_tnl *t, int set_mtu)
> +{
> +   ip6gre_tnl_link_config_common(t);
> +   ip6gre_tnl_link_config_route(t, set_mtu, ip6erspan_calc_hlen(t));
> +}
> +
> +static int ip6erspan_tnl_change(struct ip6_tnl *t,
> +   const struct __ip6_tnl_parm *p, int set_mtu)
> +{
> +   ip6gre_tnl_copy_tnl_parm(t, p);
> +   ip6erspan_tnl_link_config(t, set_mtu);
> +   return 0;
> +}
> +
> +static int ip6erspan_changelink(struct net_device *dev, struct nlattr *tb[],
> +

Re: [PATCH net 5/7] net: ip6_gre: Split up ip6gre_newlink()

2018-05-17 Thread William Tu

On Thu, May 17, 2018 at 7:36 AM, Petr Machata <pe...@mellanox.com> wrote:
> Extract from ip6gre_newlink() a reusable function
> ip6gre_newlink_common(). The ip6gre_tnl_link_config() call needs to be
> made customizable for ERSPAN, thus reorder it with calls to
> ip6_tnl_change_mtu() and dev_hold(), and extract the whole tail to the
> caller, ip6gre_newlink(). Thus enable an ERSPAN-specific _newlink()
> function without a lot of duplicity.
>
> Signed-off-by: Petr Machata <pe...@mellanox.com>
> ---

LGTM.
Acked-by: William Tu <u9012...@gmail.com>

>  net/ipv6/ip6_gre.c | 24 ++--
>  1 file changed, 18 insertions(+), 6 deletions(-)
>
> diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
> index 307ac6d..4dfa21d 100644
> --- a/net/ipv6/ip6_gre.c
> +++ b/net/ipv6/ip6_gre.c
> @@ -1858,9 +1858,9 @@ static bool ip6gre_netlink_encap_parms(struct nlattr 
> *data[],
> return ret;
>  }
>
> -static int ip6gre_newlink(struct net *src_net, struct net_device *dev,
> - struct nlattr *tb[], struct nlattr *data[],
> - struct netlink_ext_ack *extack)
> +static int ip6gre_newlink_common(struct net *src_net, struct net_device *dev,
> +struct nlattr *tb[], struct nlattr *data[],
> +struct netlink_ext_ack *extack)
>  {
> struct ip6_tnl *nt;
> struct net *net = dev_net(dev);
> @@ -1897,18 +1897,30 @@ static int ip6gre_newlink(struct net *src_net, struct 
> net_device *dev,
> if (err)
> goto out;
>
> -   ip6gre_tnl_link_config(nt, !tb[IFLA_MTU]);
> -
> if (tb[IFLA_MTU])
> ip6_tnl_change_mtu(dev, nla_get_u32(tb[IFLA_MTU]));
>
> dev_hold(dev);
> -   ip6gre_tunnel_link(ign, nt);
>
>  out:
> return err;
>  }
>
> +static int ip6gre_newlink(struct net *src_net, struct net_device *dev,
> + struct nlattr *tb[], struct nlattr *data[],
> + struct netlink_ext_ack *extack)
> +{
> +   int err = ip6gre_newlink_common(src_net, dev, tb, data, extack);
> +   struct ip6_tnl *nt = netdev_priv(dev);
> +   struct net *net = dev_net(dev);
> +
> +   if (!err) {
> +   ip6gre_tnl_link_config(nt, !tb[IFLA_MTU]);
> +   ip6gre_tunnel_link(net_generic(net, ip6gre_net_id), nt);
> +   }
> +   return err;
> +}
> +
>  static int ip6gre_changelink(struct net_device *dev, struct nlattr *tb[],
>  struct nlattr *data[],
>  struct netlink_ext_ack *extack)
> --
> 2.4.11
>

Re: [PATCH net 6/7] net: ip6_gre: Split up ip6gre_changelink()

2018-05-17 Thread William Tu

On Thu, May 17, 2018 at 7:36 AM, Petr Machata <pe...@mellanox.com> wrote:
> Extract from ip6gre_changelink() a reusable function
> ip6gre_changelink_common(). This will allow introduction of
> ERSPAN-specific _changelink() function with not a lot of code
> duplication.
>
> Signed-off-by: Petr Machata <pe...@mellanox.com>
> ---

LGTM.
Acked-by: William Tu <u9012...@gmail.com>


>  net/ipv6/ip6_gre.c | 33 -
>  1 file changed, 24 insertions(+), 9 deletions(-)
>
> diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
> index 4dfa21d..c17e38b 100644
> --- a/net/ipv6/ip6_gre.c
> +++ b/net/ipv6/ip6_gre.c
> @@ -1921,37 +1921,52 @@ static int ip6gre_newlink(struct net *src_net, struct 
> net_device *dev,
> return err;
>  }
>
> -static int ip6gre_changelink(struct net_device *dev, struct nlattr *tb[],
> -struct nlattr *data[],
> -struct netlink_ext_ack *extack)
> +static struct ip6_tnl *
> +ip6gre_changelink_common(struct net_device *dev, struct nlattr *tb[],
> +struct nlattr *data[], struct __ip6_tnl_parm *p_p,
> +struct netlink_ext_ack *extack)
>  {
> struct ip6_tnl *t, *nt = netdev_priv(dev);
> struct net *net = nt->net;
> struct ip6gre_net *ign = net_generic(net, ip6gre_net_id);
> -   struct __ip6_tnl_parm p;
> struct ip_tunnel_encap ipencap;
>
> if (dev == ign->fb_tunnel_dev)
> -   return -EINVAL;
> +   return ERR_PTR(-EINVAL);
>
> if (ip6gre_netlink_encap_parms(data, &ipencap)) {
> int err = ip6_tnl_encap_setup(nt, &ipencap);
>
> if (err < 0)
> -   return err;
> +   return ERR_PTR(err);
> }
>
> -   ip6gre_netlink_parms(data, &p);
> +   ip6gre_netlink_parms(data, p_p);
>
> -   t = ip6gre_tunnel_locate(net, &p, 0);
> +   t = ip6gre_tunnel_locate(net, p_p, 0);
>
> if (t) {
> if (t->dev != dev)
> -   return -EEXIST;
> +   return ERR_PTR(-EEXIST);
> } else {
> t = nt;
> }
>
> +   return t;
> +}
> +
> +static int ip6gre_changelink(struct net_device *dev, struct nlattr *tb[],
> +struct nlattr *data[],
> +struct netlink_ext_ack *extack)
> +{
> +   struct ip6gre_net *ign = net_generic(dev_net(dev), ip6gre_net_id);
> +   struct __ip6_tnl_parm p;
> +   struct ip6_tnl *t;
> +
> +   t = ip6gre_changelink_common(dev, tb, data, &p, extack);
> +   if (IS_ERR(t))
> +   return PTR_ERR(t);
> +
> ip6gre_tunnel_unlink(ign, t);
> ip6gre_tnl_change(t, &p, !tb[IFLA_MTU]);
> ip6gre_tunnel_link(ign, t);
> --
> 2.4.11
>
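
For readers unfamiliar with the error-pointer convention that ip6gre_changelink_common() adopts above, here is a minimal userspace sketch. The helpers err_ptr(), ptr_err() and is_err() are simplified stand-ins for the kernel's ERR_PTR(), PTR_ERR() and IS_ERR(); struct tunnel, tunnel_lookup() and tunnel_change() are hypothetical names used only for illustration and are not part of the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: simplified userspace stand-ins for the kernel helpers in
 * include/linux/err.h. Small negative errno values are encoded in the
 * top 4095 addresses, which never hold a valid object. */
#define MAX_ERRNO 4095

static inline void *err_ptr(long error) { return (void *)error; }
static inline long ptr_err(const void *ptr) { return (long)ptr; }
static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical tunnel object for this sketch. */
struct tunnel { int id; };

/* Returns the matching tunnel, or an errno encoded as a pointer. */
static struct tunnel *tunnel_lookup(struct tunnel *table, size_t n, int id)
{
	if (id < 0)
		return err_ptr(-EINVAL);
	for (size_t i = 0; i < n; i++)
		if (table[i].id == id)
			return &table[i];
	return err_ptr(-ENOENT);
}

/* Caller in the style of the new ip6gre_changelink(): decode the error
 * if there is one, otherwise carry on with the returned object. */
static int tunnel_change(struct tunnel *table, size_t n, int id)
{
	struct tunnel *t = tunnel_lookup(table, n, id);

	if (is_err(t))
		return (int)ptr_err(t);
	return 0;
}
```

The point of the convention is that one return value carries either a usable pointer or an errno, so the common helper does not need a separate output parameter for the tunnel.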

Re: [PATCH net 3/7] net: ip6_gre: Split up ip6gre_tnl_link_config()

2018-05-17 Thread William Tu

On Thu, May 17, 2018 at 7:36 AM, Petr Machata <pe...@mellanox.com> wrote:
> The function ip6gre_tnl_link_config() is used for setting up
> configuration of both ip6gretap and ip6erspan tunnels. Split the
> function into the common part and the route-lookup part. The latter then
> takes the calculated header length as an argument. This split will allow
> the patches down the line to sneak in a custom header length computation
> for the ERSPAN tunnel.
>
> Signed-off-by: Petr Machata <pe...@mellanox.com>
> ---

LGTM.
Acked-by: William Tu <u9012...@gmail.com>

>  net/ipv6/ip6_gre.c | 38 ++
>  1 file changed, 26 insertions(+), 12 deletions(-)
>
> diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
> index 53b1531..78ba6b9 100644
> --- a/net/ipv6/ip6_gre.c
> +++ b/net/ipv6/ip6_gre.c
> @@ -1022,12 +1022,11 @@ static netdev_tx_t ip6erspan_tunnel_xmit(struct 
> sk_buff *skb,
> return NETDEV_TX_OK;
>  }
>
> -static void ip6gre_tnl_link_config(struct ip6_tnl *t, int set_mtu)
> +static void ip6gre_tnl_link_config_common(struct ip6_tnl *t)
>  {
> struct net_device *dev = t->dev;
> struct __ip6_tnl_parm *p = &t->parms;
> struct flowi6 *fl6 = &t->fl.u.ip6;
> -   int t_hlen;
>
> if (dev->type != ARPHRD_ETHER) {
> memcpy(dev->dev_addr, &p->laddr, sizeof(struct in6_addr));
> @@ -1054,12 +1053,13 @@ static void ip6gre_tnl_link_config(struct ip6_tnl *t, 
> int set_mtu)
> dev->flags |= IFF_POINTOPOINT;
> else
> dev->flags &= ~IFF_POINTOPOINT;
> +}
>
> -   t->tun_hlen = gre_calc_hlen(t->parms.o_flags);
> -
> -   t->hlen = t->encap_hlen + t->tun_hlen;
> -
> -   t_hlen = t->hlen + sizeof(struct ipv6hdr);
> +static void ip6gre_tnl_link_config_route(struct ip6_tnl *t, int set_mtu,
> +int t_hlen)
> +{
> +   const struct __ip6_tnl_parm *p = &t->parms;
> +   struct net_device *dev = t->dev;
>
> if (p->flags & IP6_TNL_F_CAP_XMIT) {
> int strict = (ipv6_addr_type(&p->raddr) &
> @@ -1091,6 +1091,24 @@ static void ip6gre_tnl_link_config(struct ip6_tnl *t, 
> int set_mtu)
> }
>  }
>
> +static int ip6gre_calc_hlen(struct ip6_tnl *tunnel)
> +{
> +   int t_hlen;
> +
> +   tunnel->tun_hlen = gre_calc_hlen(tunnel->parms.o_flags);
> +   tunnel->hlen = tunnel->tun_hlen + tunnel->encap_hlen;
> +
> +   t_hlen = tunnel->hlen + sizeof(struct ipv6hdr);
> +   tunnel->dev->hard_header_len = LL_MAX_HEADER + t_hlen;
> +   return t_hlen;
> +}
> +
> +static void ip6gre_tnl_link_config(struct ip6_tnl *t, int set_mtu)
> +{
> +   ip6gre_tnl_link_config_common(t);
> +   ip6gre_tnl_link_config_route(t, set_mtu, ip6gre_calc_hlen(t));
> +}
> +
>  static int ip6gre_tnl_change(struct ip6_tnl *t,
> const struct __ip6_tnl_parm *p, int set_mtu)
>  {
> @@ -1384,11 +1402,7 @@ static int ip6gre_tunnel_init_common(struct net_device 
> *dev)
> return ret;
> }
>
> -   tunnel->tun_hlen = gre_calc_hlen(tunnel->parms.o_flags);
> -   tunnel->hlen = tunnel->tun_hlen + tunnel->encap_hlen;
> -   t_hlen = tunnel->hlen + sizeof(struct ipv6hdr);
> -
> -   dev->hard_header_len = LL_MAX_HEADER + t_hlen;
> +   t_hlen = ip6gre_calc_hlen(tunnel);
> dev->mtu = ETH_DATA_LEN - t_hlen;
> if (dev->type == ARPHRD_ETHER)
> dev->mtu -= ETH_HLEN;
> --
> 2.4.11
>
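
As a rough illustration of the arithmetic that ip6gre_calc_hlen() centralizes, the sketch below computes the tunnel header length and the resulting MTU in plain C. The SK_* flag bits and the sketch_* function names are hypothetical; the kernel uses TUNNEL_CSUM, TUNNEL_KEY and TUNNEL_SEQ with its own values. The sketch assumes a 4-byte GRE base header with 4 bytes per optional field, a 40-byte IPv6 header, ETH_DATA_LEN of 1500 and ETH_HLEN of 14.

```c
#include <stddef.h>

/* Hypothetical flag bits for this sketch only. */
#define SK_CSUM 0x1
#define SK_KEY  0x2
#define SK_SEQ  0x4

#define SK_IPV6_HDR_LEN 40   /* fixed IPv6 header size */
#define SK_ETH_DATA_LEN 1500 /* assumed value of ETH_DATA_LEN */
#define SK_ETH_HLEN     14   /* assumed value of ETH_HLEN */

/* GRE header: 4 bytes base, plus 4 for each optional field present. */
static int sketch_gre_hlen(unsigned int flags)
{
	int len = 4;

	if (flags & SK_CSUM)
		len += 4;
	if (flags & SK_KEY)
		len += 4;
	if (flags & SK_SEQ)
		len += 4;
	return len;
}

/* Mirrors t_hlen = tun_hlen + encap_hlen + sizeof(struct ipv6hdr). */
static int sketch_tunnel_hlen(unsigned int flags, int encap_hlen)
{
	return sketch_gre_hlen(flags) + encap_hlen + SK_IPV6_HDR_LEN;
}

/* Mirrors the MTU derivation in ip6gre_tunnel_init_common(). */
static int sketch_mtu(int t_hlen, int is_ether)
{
	int mtu = SK_ETH_DATA_LEN - t_hlen;

	if (is_ether)
		mtu -= SK_ETH_HLEN;
	return mtu;
}
```

Keeping the header-length computation in one function is what lets a later patch substitute an ERSPAN-specific length without touching the route lookup.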

Re: [PATCH net 4/7] net: ip6_gre: Split up ip6gre_tnl_change()

2018-05-17 Thread William Tu

On Thu, May 17, 2018 at 7:36 AM, Petr Machata <pe...@mellanox.com> wrote:
> Split a reusable function ip6gre_tnl_copy_tnl_parm() from
> ip6gre_tnl_change(). This will allow ERSPAN-specific code to
> reuse the common parts while customizing the behavior for ERSPAN.
>
> Signed-off-by: Petr Machata <pe...@mellanox.com>
> ---

LGTM.
Acked-by: William Tu <u9012...@gmail.com>


>  net/ipv6/ip6_gre.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
> index 78ba6b9..307ac6d 100644
> --- a/net/ipv6/ip6_gre.c
> +++ b/net/ipv6/ip6_gre.c
> @@ -1109,8 +1109,8 @@ static void ip6gre_tnl_link_config(struct ip6_tnl *t, 
> int set_mtu)
> ip6gre_tnl_link_config_route(t, set_mtu, ip6gre_calc_hlen(t));
>  }
>
> -static int ip6gre_tnl_change(struct ip6_tnl *t,
> -   const struct __ip6_tnl_parm *p, int set_mtu)
> +static void ip6gre_tnl_copy_tnl_parm(struct ip6_tnl *t,
> +const struct __ip6_tnl_parm *p)
>  {
> t->parms.laddr = p->laddr;
> t->parms.raddr = p->raddr;
> @@ -1126,6 +1126,12 @@ static int ip6gre_tnl_change(struct ip6_tnl *t,
> t->parms.o_flags = p->o_flags;
> t->parms.fwmark = p->fwmark;
> dst_cache_reset(&t->dst_cache);
> +}
> +
> +static int ip6gre_tnl_change(struct ip6_tnl *t, const struct __ip6_tnl_parm 
> *p,
> +int set_mtu)
> +{
> +   ip6gre_tnl_copy_tnl_parm(t, p);
> ip6gre_tnl_link_config(t, set_mtu);
> return 0;
>  }
> --
> 2.4.11
>

Re: [PATCH net 2/7] net: ip6_gre: Fix headroom request in ip6erspan_tunnel_xmit()

2018-05-17 Thread William Tu

6b/0x1d0
> [  191.066922]  ? print_irqtrace_events+0x120/0x120
> [  191.071593]  ? __lock_is_held+0xa0/0x160
> [  191.075566]  __do_softirq+0x1d4/0x9d2
> [  191.079282]  ? ip6_finish_output2+0x524/0x1460
> [  191.083771]  do_softirq_own_stack+0x2a/0x40
> [  191.087994]  
> [  191.090130]  do_softirq.part.13+0x38/0x40
> [  191.094178]  __local_bh_enable_ip+0x135/0x190
> [  191.098591]  ip6_finish_output2+0x54d/0x1460
> [  191.102916]  ? ip6_forward_finish+0x2f0/0x2f0
> [  191.107314]  ? ip6_mtu+0x3c/0x2c0
> [  191.110674]  ? ip6_finish_output+0x2f8/0x650
> [  191.114992]  ? ip6_output+0x12a/0x500
> [  191.118696]  ip6_output+0x12a/0x500
> [  191.13]  ? ip6_route_dev_notify+0x5b0/0x5b0
> [  191.126807]  ? ip6_finish_output+0x650/0x650
> [  191.131120]  ? ip6_fragment+0x1a60/0x1a60
> [  191.135182]  ? icmp6_dst_alloc+0x26e/0x470
> [  191.139317]  mld_sendpack+0x672/0x830
> [  191.143021]  ? igmp6_mcf_seq_next+0x2f0/0x2f0
> [  191.147429]  ? __local_bh_enable_ip+0x77/0x190
> [  191.151913]  ipv6_mc_dad_complete+0x47/0x90
> [  191.156144]  addrconf_dad_completed+0x561/0x720
> [  191.160731]  ? addrconf_rs_timer+0x3a0/0x3a0
> [  191.165036]  ? mark_held_locks+0xc9/0x140
> [  191.169095]  ? __local_bh_enable_ip+0x77/0x190
> [  191.173570]  ? addrconf_dad_work+0x50d/0xa20
> [  191.177886]  ? addrconf_dad_work+0x529/0xa20
> [  191.182194]  addrconf_dad_work+0x529/0xa20
> [  191.186342]  ? addrconf_dad_completed+0x720/0x720
> [  191.191088]  ? __lock_is_held+0xa0/0x160
> [  191.195059]  ? process_one_work+0x45d/0xe20
> [  191.199302]  ? process_one_work+0x51e/0xe20
> [  191.203531]  ? rcu_read_lock_sched_held+0x93/0xa0
> [  191.208279]  process_one_work+0x51e/0xe20
> [  191.212340]  ? pwq_dec_nr_in_flight+0x200/0x200
> [  191.216912]  ? get_lock_stats+0x4b/0xf0
> [  191.220788]  ? preempt_count_sub+0xf/0xd0
> [  191.224844]  ? worker_thread+0x219/0x860
> [  191.228823]  ? do_raw_spin_trylock+0x6d/0xa0
> [  191.233142]  worker_thread+0xeb/0x860
> [  191.236848]  ? process_one_work+0xe20/0xe20
> [  191.241095]  kthread+0x206/0x300
> [  191.244352]  ? process_one_work+0xe20/0xe20
> [  191.248587]  ? kthread_stop+0x570/0x570
> [  191.252459]  ret_from_fork+0x3a/0x50
> [  191.256082] Code: 14 3e ff 8b 4b 78 55 4d 89 f9 41 56 41 55 48 c7 c7 a0 cf 
> db 82 41 54 44 8b 44 24 2c 48 8b 54 24 30 48 8b 74 24 20 e8 16 94 13 ff <0f> 
> 0b 48 c7 c7 60 8e 1f 85 48 83 c4 20 e8 55 ef a6 ff 89 74 24
> [  191.275327] RIP: skb_panic+0xc3/0x100 RSP: 8801d54072f0
> [  191.281024] ---[ end trace 7ea51094e099e006 ]---
> [  191.285724] Kernel panic - not syncing: Fatal exception in interrupt
> [  191.292168] Kernel Offset: disabled
> [  191.295697] ---[ end Kernel panic - not syncing: Fatal exception in 
> interrupt ]---
>
> Reproducer:
>
> ip link add h1 type veth peer name swp1
> ip link add h3 type veth peer name swp3
>
> ip link set dev h1 up
> ip address add 192.0.2.1/28 dev h1
>
> ip link add dev vh3 type vrf table 20
> ip link set dev h3 master vh3
> ip link set dev vh3 up
> ip link set dev h3 up
>
> ip link set dev swp3 up
>     ip address add dev swp3 2001:db8:2::1/64
>
> ip link set dev swp1 up
> tc qdisc add dev swp1 clsact
>
> ip link add name gt6 type ip6erspan \
> local 2001:db8:2::1 remote 2001:db8:2::2 oseq okey 123
> ip link set dev gt6 up
>
> sleep 1
>
> tc filter add dev swp1 ingress pref 1000 matchall skip_hw \
> action mirred egress mirror dev gt6
> ping -I h1 192.0.2.2
>
> Signed-off-by: Petr Machata <pe...@mellanox.com>
> ---

Acked-by: William Tu <u9012...@gmail.com>

>  net/ipv6/ip6_gre.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
> index 5d93c7c..53b1531 100644
> --- a/net/ipv6/ip6_gre.c
> +++ b/net/ipv6/ip6_gre.c
> @@ -911,7 +911,7 @@ static netdev_tx_t ip6erspan_tunnel_xmit(struct sk_buff 
> *skb,
> truncate = true;
> }
>
> -   if (skb_cow_head(skb, dev->needed_headroom))
> +   if (skb_cow_head(skb, dev->needed_headroom ?: t->hlen))
> goto tx_err;
>
> t->parms.o_flags &= ~TUNNEL_KEY;
> --
> 2.4.11
>

Re: [PATCH net 1/7] net: ip6_gre: Request headroom in __gre6_xmit()

2018-05-17 Thread William Tu

58.941842]  ? preempt_count_sub+0xf/0xd0
> [  158.945940]  ? schedule+0x5b/0x140
> [  158.949412]  kthread+0x206/0x300
> [  158.952689]  ? sort_range+0x20/0x20
> [  158.956249]  ? kthread_stop+0x570/0x570
> [  158.960164]  ret_from_fork+0x3a/0x50
> [  158.963823] Code: 14 3e ff 8b 4b 78 55 4d 89 f9 41 56 41 55 48 c7 c7 a0 cf 
> db 82 41 54 44 8b 44 24 2c 48 8b 54 24 30 48 8b 74 24 20 e8 16 94 13 ff <0f> 
> 0b 48 c7 c7 60 8e 1f 85 48 83 c4 20 e8 55 ef a6 ff 89 74 24
> [  158.983235] RIP: skb_panic+0xc3/0x100 RSP: 8801d3f27110
> [  158.988935] ---[ end trace 5af56ee845aa6cc8 ]---
> [  158.993641] Kernel panic - not syncing: Fatal exception in interrupt
> [  159.000176] Kernel Offset: disabled
> [  159.003767] ---[ end Kernel panic - not syncing: Fatal exception in 
> interrupt ]---
>
> Reproducer:
>
> ip link add h1 type veth peer name swp1
> ip link add h3 type veth peer name swp3
>
> ip link set dev h1 up
> ip address add 192.0.2.1/28 dev h1
>
> ip link add dev vh3 type vrf table 20
> ip link set dev h3 master vh3
> ip link set dev vh3 up
> ip link set dev h3 up
>
> ip link set dev swp3 up
> ip address add dev swp3 2001:db8:2::1/64
>
> ip link set dev swp1 up
> tc qdisc add dev swp1 clsact
>
> ip link add name gt6 type ip6gretap \
> local 2001:db8:2::1 remote 2001:db8:2::2
> ip link set dev gt6 up
>
> sleep 1
>
> tc filter add dev swp1 ingress pref 1000 matchall skip_hw \
> action mirred egress mirror dev gt6
> ping -I h1 192.0.2.2
>
> Signed-off-by: Petr Machata <pe...@mellanox.com>
> ---

Thanks for the fix.
Acked-by: William Tu <u9012...@gmail.com>

>  net/ipv6/ip6_gre.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
> index 69727bc..5d93c7c 100644
> --- a/net/ipv6/ip6_gre.c
> +++ b/net/ipv6/ip6_gre.c
> @@ -698,6 +698,9 @@ static netdev_tx_t __gre6_xmit(struct sk_buff *skb,
> else
> fl6->daddr = tunnel->parms.raddr;
>
> +   if (skb_cow_head(skb, dev->needed_headroom ?: tunnel->hlen))
> +   return -ENOMEM;
> +
> /* Push GRE header. */
> protocol = (dev->type == ARPHRD_ETHER) ? htons(ETH_P_TEB) : proto;
>
> --
> 2.4.11
>
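
Both headroom fixes rely on the expression `dev->needed_headroom ?: hlen`, a GNU C shorthand for "use needed_headroom if it is non-zero, otherwise fall back to the tunnel header length". A minimal sketch of the same selection in portable C follows; headroom_request() is a hypothetical name, not a kernel function.

```c
/* Portable equivalent of the GNU C expression `needed_headroom ?: hlen`.
 * The GNU form evaluates its first operand only once; for a plain
 * variable, as here, the two forms behave the same. */
static unsigned int headroom_request(unsigned int needed_headroom,
				     unsigned int hlen)
{
	return needed_headroom ? needed_headroom : hlen;
}
```

With needed_headroom left at zero, the request falls back to the tunnel header length, so skb_cow_head() is asked for enough room to push the GRE header.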

Re: [PATCH net-next] erspan: set bso bit based on mirrored packet's len

2018-05-17 Thread William Tu

On Wed, May 16, 2018 at 3:24 PM, Tobin C. Harding <to...@apporbit.com> wrote:
> On Wed, May 16, 2018 at 07:05:34AM -0700, William Tu wrote:
>> On Mon, May 14, 2018 at 10:33 PM, Tobin C. Harding <to...@apporbit.com> 
>> wrote:
>> > On Mon, May 14, 2018 at 04:54:36PM -0700, William Tu wrote:
>> >> Before the patch, the erspan BSO bit (Bad/Short/Oversized) is not
>> >> handled.  BSO has 4 possible values:
>> >>   00 --> Good frame with no error, or unknown integrity
>> >>   11 --> Payload is a Bad Frame with CRC or Alignment Error
>> >>   01 --> Payload is a Short Frame
>> >>   10 --> Payload is an Oversized Frame
>> >>
>> >> Based on the short/oversized definitions in RFC1757, the patch sets
>> >> the bso bit based on the mirrored packet's size.
>> >>
>> >> Reported-by: Xiaoyan Jin <xiaoy...@vmware.com>
>> >> Signed-off-by: William Tu <u9012...@gmail.com>
>> >> ---
>> >>  include/net/erspan.h | 25 +
>> >>  1 file changed, 25 insertions(+)
>> >>
>> >> diff --git a/include/net/erspan.h b/include/net/erspan.h
>> >> index d044aa60cc76..5eb95f78ad45 100644
>> >> --- a/include/net/erspan.h
>> >> +++ b/include/net/erspan.h
>> >> @@ -219,6 +219,30 @@ static inline __be32 erspan_get_timestamp(void)
>> >>   return htonl((u32)h_usecs);
>> >>  }
>> >>
>> >> +/* ERSPAN BSO (Bad/Short/Oversized)
>> >> + *   00b --> Good frame with no error, or unknown integrity
>> >> + *   01b --> Payload is a Short Frame
>> >> + *   10b --> Payload is an Oversized Frame
>> >> + *   11b --> Payload is a Bad Frame with CRC or Alignment Error
>> >> + */
>> >> +enum erspan_bso {
>> >> + BSO_NOERROR,
>> >> + BSO_SHORT,
>> >> + BSO_OVERSIZED,
>> >> + BSO_BAD,
>> >> +};
>> >
>> > If we are relying on the values perhaps this would be clearer
>> >
>> > BSO_NOERROR = 0x00,
>> > BSO_SHORT   = 0x01,
>> > BSO_OVERSIZED   = 0x02,
>> > BSO_BAD = 0x03,
>> >
>>
>> Yes, thanks. I will change in v2.
>>
>> >> +
>> >> +static inline u8 erspan_detect_bso(struct sk_buff *skb)
>> >> +{
>> >> + if (skb->len < ETH_ZLEN)
>> >> + return BSO_SHORT;
>> >> +
>> >> + if (skb->len > ETH_FRAME_LEN)
>> >> + return BSO_OVERSIZED;
>> >> +
>> >> + return BSO_NOERROR;
>> >> +}
>> >
>> > Without having much contextual knowledge around this patch; should we be
>> > doing some check on CRC or alignment (at some stage)?  Having BSO_BAD
>> > seems to imply so?
>> >
>>
>> The definition of BSO_BAD:
>> etherStatsCRCAlignErrors OBJECT-TYPE
>>   SYNTAX Counter
>>   ACCESS read-only
>>   STATUS mandatory
>>   DESCRIPTION
>>   "The total number of packets received that
>>   had a length (excluding framing bits, but
>>   including FCS octets) of between 64 and 1518
>>   octets, inclusive, but had either a bad
>>   Frame Check Sequence (FCS) with an integral
>>   number of octets (FCS Error) or a bad FCS with
>>   a non-integral number of octets (Alignment Error)."
>>
>> But I don't know how to check CRC error at this code point.
>> Isn't it done by the NIC hardware?
>
> I'll just start with; I don't know anything about ERSPAN
>
> "ERSPAN is a Cisco proprietary feature and is available only to
> Catalyst 6500, 7600, Nexus, and ASR 1000 platforms to date. The
> ASR 1000 supports ERSPAN source (monitoring) only on Fast
> Ethernet, Gigabit Ethernet, and port-channel interfaces."
>
> https://supportforums.cisco.com/t5/network-infrastructure-documents/understanding-span-rspan-and-erspan/ta-p/3144951
>
> I dug around a bit and none of the files that currently import erspan.h
> actually use the 'bso' field
>
> $ grep bso $(git grep -l 'erspan\.h')
> include/net/erspan.h:   u8 bso = 0; /* Bad/Short/Oversized */
> include/net/erspan.h:   ershdr->en = bso;
> net/ipv4/ip_gre.c: ICMP in the real Internet is absolutely infeasible.
> net/ipv4/ip_gre.c:   * ICMP in the real Internet is absolutely infeasible.
>
Yes, that's expected.

>
> Normally, AFAICT, the FCS does not get passed to the operating system
> since its a link layer mechanism.  If ERSPAN is passing the FCS when it
> mirrors frames (does it mirror frames or packets, I don't know?) then
> surely ERSPAN should provide a function to return the BSO value.

It mirrors the layer 2 Ethernet frame, so no FCS is passed.

>
> So IMHO this patch seems like a just pretense and not really doing
> anything.
>
The purpose is to set the BSO bit according to the spec, so that
an ERSPAN monitor can interpret the mirrored traffic.

Thanks,
William
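
To make the length-based classification discussed in this thread concrete, here is a small standalone sketch. It uses the explicit enum values suggested in the review and assumes ETH_ZLEN is 60 and ETH_FRAME_LEN is 1514 (both exclude the FCS); detect_bso() is a hypothetical userspace counterpart of erspan_detect_bso() that takes a frame length instead of an skb.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_ETH_ZLEN      60   /* assumed minimum frame length, no FCS */
#define SK_ETH_FRAME_LEN 1514 /* assumed maximum frame length, no FCS */

/* The 2-bit BSO field, with values spelled out as suggested in review. */
enum erspan_bso {
	BSO_NOERROR   = 0x00, /* good frame, or unknown integrity */
	BSO_SHORT     = 0x01, /* payload is a short frame */
	BSO_OVERSIZED = 0x02, /* payload is an oversized frame */
	BSO_BAD       = 0x03, /* CRC or alignment error */
};

/* Classify by length only. BSO_BAD is never returned because the FCS
 * is not available at this point, as noted in the thread. */
static uint8_t detect_bso(size_t frame_len)
{
	if (frame_len < SK_ETH_ZLEN)
		return BSO_SHORT;
	if (frame_len > SK_ETH_FRAME_LEN)
		return BSO_OVERSIZED;
	return BSO_NOERROR;
}
```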

[PATCH net] erspan: fix invalid erspan version.

2018-05-16 Thread William Tu

ERSPAN only supports versions 1 and 2.  When packets are sent to an
erspan device which does not have a proper version number set,
drop the packet.  In a real case, we observed multicast packets
sent to the erspan pernet device, erspan0, which does not have
an erspan version configured.

Reported-by: Greg Rose <gvrose8...@gmail.com>
Signed-off-by: William Tu <u9012...@gmail.com>
---
 net/ipv4/ip_gre.c  | 4 +++-
 net/ipv6/ip6_gre.c | 5 -
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 2409e648454d..2d8efeecf619 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -734,10 +734,12 @@ static netdev_tx_t erspan_xmit(struct sk_buff *skb,
erspan_build_header(skb, ntohl(tunnel->parms.o_key),
tunnel->index,
truncate, true);
-   else
+   else if (tunnel->erspan_ver == 2)
erspan_build_header_v2(skb, ntohl(tunnel->parms.o_key),
   tunnel->dir, tunnel->hwid,
   truncate, true);
+   else
+   goto free_skb;
 
tunnel->parms.o_flags &= ~TUNNEL_KEY;
__gre_xmit(skb, dev, &tunnel->parms.iph, htons(ETH_P_ERSPAN));
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index bede77f24784..d20072fc38cb 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -991,11 +991,14 @@ static netdev_tx_t ip6erspan_tunnel_xmit(struct sk_buff 
*skb,
erspan_build_header(skb, ntohl(t->parms.o_key),
t->parms.index,
truncate, false);
-   else
+   else if (t->parms.erspan_ver == 2)
erspan_build_header_v2(skb, ntohl(t->parms.o_key),
   t->parms.dir,
   t->parms.hwid,
   truncate, false);
+   else
+   goto tx_err;
+
fl6.daddr = t->parms.raddr;
}
 
-- 
2.7.4
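
The control flow this patch introduces can be summarized by the sketch below: versions 1 and 2 each select a header builder, and anything else is dropped rather than falling through to the v2 builder. The enum and erspan_pick_header() are hypothetical names used only to show the dispatch; they do not exist in the kernel.

```c
/* Hypothetical outcome of the version check in the xmit path. */
enum xmit_action {
	XMIT_BUILD_V1, /* call the version 1 header builder */
	XMIT_BUILD_V2, /* call the version 2 header builder */
	XMIT_DROP,     /* unconfigured or unknown version: drop */
};

static enum xmit_action erspan_pick_header(int erspan_ver)
{
	if (erspan_ver == 1)
		return XMIT_BUILD_V1;
	else if (erspan_ver == 2)
		return XMIT_BUILD_V2;
	/* Before the patch, this case was treated as version 2. */
	return XMIT_DROP;
}
```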

Re: [PATCH net-next] erspan: set bso bit based on mirrored packet's len

2018-05-16 Thread William Tu

On Mon, May 14, 2018 at 10:33 PM, Tobin C. Harding <to...@apporbit.com> wrote:
> On Mon, May 14, 2018 at 04:54:36PM -0700, William Tu wrote:
>> Before the patch, the erspan BSO bit (Bad/Short/Oversized) is not
>> handled.  BSO has 4 possible values:
>>   00 --> Good frame with no error, or unknown integrity
>>   11 --> Payload is a Bad Frame with CRC or Alignment Error
>>   01 --> Payload is a Short Frame
>>   10 --> Payload is an Oversized Frame
>>
>> Based on the short/oversized definitions in RFC1757, the patch sets
>> the bso bit based on the mirrored packet's size.
>>
>> Reported-by: Xiaoyan Jin <xiaoy...@vmware.com>
>> Signed-off-by: William Tu <u9012...@gmail.com>
>> ---
>>  include/net/erspan.h | 25 +
>>  1 file changed, 25 insertions(+)
>>
>> diff --git a/include/net/erspan.h b/include/net/erspan.h
>> index d044aa60cc76..5eb95f78ad45 100644
>> --- a/include/net/erspan.h
>> +++ b/include/net/erspan.h
>> @@ -219,6 +219,30 @@ static inline __be32 erspan_get_timestamp(void)
>>   return htonl((u32)h_usecs);
>>  }
>>
>> +/* ERSPAN BSO (Bad/Short/Oversized)
>> + *   00b --> Good frame with no error, or unknown integrity
>> + *   01b --> Payload is a Short Frame
>> + *   10b --> Payload is an Oversized Frame
>> + *   11b --> Payload is a Bad Frame with CRC or Alignment Error
>> + */
>> +enum erspan_bso {
>> + BSO_NOERROR,
>> + BSO_SHORT,
>> + BSO_OVERSIZED,
>> + BSO_BAD,
>> +};
>
> If we are relying on the values perhaps this would be clearer
>
> BSO_NOERROR = 0x00,
> BSO_SHORT   = 0x01,
> BSO_OVERSIZED   = 0x02,
> BSO_BAD = 0x03,
>

Yes, thanks. I will change in v2.

>> +
>> +static inline u8 erspan_detect_bso(struct sk_buff *skb)
>> +{
>> + if (skb->len < ETH_ZLEN)
>> + return BSO_SHORT;
>> +
>> + if (skb->len > ETH_FRAME_LEN)
>> + return BSO_OVERSIZED;
>> +
>> + return BSO_NOERROR;
>> +}
>
> Without having much contextual knowledge around this patch; should we be
> doing some check on CRC or alignment (at some stage)?  Having BSO_BAD
> seems to imply so?
>

The definition of BSO_BAD:
etherStatsCRCAlignErrors OBJECT-TYPE
  SYNTAX Counter
  ACCESS read-only
  STATUS mandatory
  DESCRIPTION
  "The total number of packets received that
  had a length (excluding framing bits, but
  including FCS octets) of between 64 and 1518
  octets, inclusive, but had either a bad
  Frame Check Sequence (FCS) with an integral
  number of octets (FCS Error) or a bad FCS with
  a non-integral number of octets (Alignment Error)."

But I don't know how to check CRC error at this code point.
Isn't it done by the NIC hardware?

Thanks for your review!
William

[PATCH net-next] erspan: set bso bit based on mirrored packet's len

2018-05-14 Thread William Tu

Before the patch, the erspan BSO bit (Bad/Short/Oversized) is not
handled.  BSO has 4 possible values:
  00 --> Good frame with no error, or unknown integrity
  11 --> Payload is a Bad Frame with CRC or Alignment Error
  01 --> Payload is a Short Frame
  10 --> Payload is an Oversized Frame

Based on the short/oversized definitions in RFC1757, the patch sets
the bso bit based on the mirrored packet's size.

Reported-by: Xiaoyan Jin <xiaoy...@vmware.com>
Signed-off-by: William Tu <u9012...@gmail.com>
---
 include/net/erspan.h | 25 +
 1 file changed, 25 insertions(+)

diff --git a/include/net/erspan.h b/include/net/erspan.h
index d044aa60cc76..5eb95f78ad45 100644
--- a/include/net/erspan.h
+++ b/include/net/erspan.h
@@ -219,6 +219,30 @@ static inline __be32 erspan_get_timestamp(void)
return htonl((u32)h_usecs);
 }
 
+/* ERSPAN BSO (Bad/Short/Oversized)
+ *   00b --> Good frame with no error, or unknown integrity
+ *   01b --> Payload is a Short Frame
+ *   10b --> Payload is an Oversized Frame
+ *   11b --> Payload is a Bad Frame with CRC or Alignment Error
+ */
+enum erspan_bso {
+   BSO_NOERROR,
+   BSO_SHORT,
+   BSO_OVERSIZED,
+   BSO_BAD,
+};
+
+static inline u8 erspan_detect_bso(struct sk_buff *skb)
+{
+   if (skb->len < ETH_ZLEN)
+   return BSO_SHORT;
+
+   if (skb->len > ETH_FRAME_LEN)
+   return BSO_OVERSIZED;
+
+   return BSO_NOERROR;
+}
+
 static inline void erspan_build_header_v2(struct sk_buff *skb,
  u32 id, u8 direction, u16 hwid,
  bool truncate, bool is_ipv4)
@@ -248,6 +272,7 @@ static inline void erspan_build_header_v2(struct sk_buff 
*skb,
vlan_tci = ntohs(qp->tci);
}
 
+   bso = erspan_detect_bso(skb);
skb_push(skb, sizeof(*ershdr) + ERSPAN_V2_MDSIZE);
ershdr = (struct erspan_base_hdr *)skb->data;
memset(ershdr, 0, sizeof(*ershdr) + ERSPAN_V2_MDSIZE);
-- 
2.7.4

[PATCH net-next] erspan: auto detect truncated ipv6 packets.

2018-05-11 Thread William Tu

Currently the truncated bit is set in only two cases: 1) the mirrored packet
is larger than the mtu, or 2) the ipv4 packet tot_len is larger than
the actual skb->len.  This patch adds another case, detecting
whether an ipv6 packet is truncated by comparing the ipv6 header
payload_len with the skb->len.

Reported-by: Xiaoyan Jin <xiaoy...@vmware.com>
Signed-off-by: William Tu <u9012...@gmail.com>
---
 net/ipv4/ip_gre.c  | 6 ++
 net/ipv6/ip6_gre.c | 6 ++
 2 files changed, 12 insertions(+)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index dfe5b22f6ed4..2409e648454d 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -579,6 +579,7 @@ static void erspan_fb_xmit(struct sk_buff *skb, struct 
net_device *dev,
int version;
__be16 df;
int nhoff;
+   int thoff;
 
tun_info = skb_tunnel_info(skb);
if (unlikely(!tun_info || !(tun_info->mode & IP_TUNNEL_INFO_TX) ||
@@ -611,6 +612,11 @@ static void erspan_fb_xmit(struct sk_buff *skb, struct 
net_device *dev,
(ntohs(ip_hdr(skb)->tot_len) > skb->len - nhoff))
truncate = true;
 
+   thoff = skb_transport_header(skb) - skb_mac_header(skb);
+   if (skb->protocol == htons(ETH_P_IPV6) &&
+   (ntohs(ipv6_hdr(skb)->payload_len) > skb->len - thoff))
+   truncate = true;
+
if (version == 1) {
erspan_build_header(skb, ntohl(tunnel_id_to_key32(key->tun_id)),
ntohl(md->u.index), truncate, true);
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index b511818b268c..bede77f24784 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -897,6 +897,7 @@ static netdev_tx_t ip6erspan_tunnel_xmit(struct sk_buff 
*skb,
int err = -EINVAL;
__u32 mtu;
int nhoff;
+   int thoff;
 
if (!ip6_tnl_xmit_ctl(t, &t->parms.laddr, &t->parms.raddr))
goto tx_err;
@@ -914,6 +915,11 @@ static netdev_tx_t ip6erspan_tunnel_xmit(struct sk_buff 
*skb,
(ntohs(ip_hdr(skb)->tot_len) > skb->len - nhoff))
truncate = true;
 
+   thoff = skb_transport_header(skb) - skb_mac_header(skb);
+   if (skb->protocol == htons(ETH_P_IPV6) &&
+   (ntohs(ipv6_hdr(skb)->payload_len) > skb->len - thoff))
+   truncate = true;
+
if (skb_cow_head(skb, dev->needed_headroom))
goto tx_err;
 
-- 
2.7.4
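
The new check compares the payload length claimed by the IPv6 header with the bytes actually present after the transport header offset. A standalone sketch of that comparison follows; ipv6_is_truncated() is a hypothetical helper, payload_len is assumed to be already converted to host byte order, and the guard for an offset beyond the frame is an addition of this sketch rather than something in the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* frame_len:   bytes available in the mirrored frame (skb->len)
 * thoff:       offset of the transport header from the MAC header
 * payload_len: IPv6 payload_len field, host byte order */
static bool ipv6_is_truncated(uint16_t payload_len, size_t frame_len,
			      size_t thoff)
{
	/* Sketch-only guard: avoid unsigned underflow below. */
	if (thoff > frame_len)
		return true;

	/* Truncated if the header promises more than what is left. */
	return payload_len > frame_len - thoff;
}
```

For example, with a 14-byte Ethernet header and a 40-byte IPv6 header the offset is 54; a packet whose header claims 100 bytes of payload is complete at 154 bytes and truncated at 128.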

Re: [PATCH bpf-next] bpf/verifier: enable ctx + const + 0.

2018-05-02 Thread William Tu

On Wed, May 2, 2018 at 1:29 AM, Daniel Borkmann <dan...@iogearbox.net> wrote:
> On 05/02/2018 06:52 AM, Alexei Starovoitov wrote:
>> On Tue, May 01, 2018 at 09:35:29PM -0700, William Tu wrote:
>>>
>>>> How did you test this patch?
>>>>
>>> Without the patch, the test case will fail.
>>> With the patch, the test case passes.
>>
>> Please test it with real program and you'll see crashes and garbage returned.
>
> +1, *convert_ctx_access() use bpf_insn's off to determine what to rewrite,
> so this is definitely buggy, and wasn't properly tested as it should have
> been. The test case is also way too simple, just the LDX and then doing a
> return 0 will get you past verifier, but won't give you anything in terms
> of runtime testing that test_verifier is doing. A single test case for a
> non trivial verifier change like this is also _completely insufficient_,
> this really needs to test all sort of weird corner cases (involving out of
> bounds accesses, overflows, etc).

Thanks, now I understand.
It's much more complicated than I thought.

William

Re: [PATCH bpf-next] bpf/verifier: enable ctx + const + 0.

2018-05-01 Thread William Tu

On Tue, May 1, 2018 at 4:16 PM, Alexei Starovoitov
<alexei.starovoi...@gmail.com> wrote:
> On Mon, Apr 30, 2018 at 10:15:05AM -0700, William Tu wrote:
>> The existing verifier does not allow 'ctx + const + const'.  However, due to
>> compiler optimization, there is a case where the BPF compiler generates
>> 'ctx + const + 0', as shown below:
>>
>>   599: (1d) if r2 == r4 goto pc+2
>>R0=inv(id=0) R1=ctx(id=0,off=40,imm=0)
>>R2=inv(id=0,umax_value=4294967295,var_off=(0x0; 0x))
>>R3=inv(id=0,umax_value=65535,var_off=(0x0; 0x)) R4=inv0
>>R6=ctx(id=0,off=0,imm=0) R7=inv2
>>   600: (bf) r1 = r6   // r1 is ctx
>>   601: (07) r1 += 36  // r1 has offset 36
>>   602: (61) r4 = *(u32 *)(r1 +0)  // r1 + 0
>>   dereference of modified ctx ptr R1 off=36+0, ctx+const is allowed,
>>   ctx+const+const is not
>>
>> The reason the BPF backend generates this code is an optimization
>> like this, as explained by Yonghong:
>> if (...)
>> *(ctx + 60)
>> else
>> *(ctx + 56)
>>
>> The compiler translates it to
>> if (...)
>>ptr = ctx + 60
>> else
>>ptr = ctx + 56
>> *(ptr + 0)
>>
>> So the load through ptr becomes an example of 'ctx + const + 0'.  This patch
>> enables support for this case.
>>
>> Fixes: f8ddadc4db6c7 ("Merge 
>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
>> Cc: Yonghong Song <y...@fb.com>
>> Signed-off-by: Yifeng Sun <pkusunyif...@gmail.com>
>> Signed-off-by: William Tu <u9012...@gmail.com>
>> ---
>>  kernel/bpf/verifier.c   |  2 +-
>>  tools/testing/selftests/bpf/test_verifier.c | 13 +
>>  2 files changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index 712d8655e916..c9a791b9cf2a 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -1638,7 +1638,7 @@ static int check_mem_access(struct bpf_verifier_env 
>> *env, int insn_idx, u32 regn
>>   /* ctx accesses must be at a fixed offset, so that we can
>>* determine what type of data were returned.
>>*/
>> - if (reg->off) {
>> + if (reg->off && off != reg->off) {
>>   verbose(env,
>>   "dereference of modified ctx ptr R%d 
>> off=%d+%d, ctx+const is allowed, ctx+const+const is not\n",
>>   regno, reg->off, off - reg->off);
>> diff --git a/tools/testing/selftests/bpf/test_verifier.c 
>> b/tools/testing/selftests/bpf/test_verifier.c
>> index 1acafe26498b..95ad5d5723ae 100644
>> --- a/tools/testing/selftests/bpf/test_verifier.c
>> +++ b/tools/testing/selftests/bpf/test_verifier.c
>> @@ -8452,6 +8452,19 @@ static struct bpf_test tests[] = {
>>   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
>>   },
>>   {
>> + "arithmetic ops make PTR_TO_CTX + const + 0 valid",
>> + .insns = {
>> + BPF_ALU64_IMM(BPF_ADD, BPF_REG_1,
>> +   offsetof(struct __sk_buff, data) -
>> +   offsetof(struct __sk_buff, mark)),

This is:
   r1 += N // r1 has offset N (the offset between data and mark)

>> + BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, 0),

This is:
   r0 = *(u32 *)(r1 +0)  // r1 + 0

So the above two lines create a case similar to the one I hit:
  601: (07) r1 += 36   // r1 has offset 36
  602: (61) r4 = *(u32 *)(r1 +0)  // r1 + 0

>
> How rewritten code looks here?
>
> The patch is allowing check_ctx_access() to proceed with sort-of
> correct 'off' and remember ctx_field_size,
> but in convert_ctx_accesses() it's using insn->off to do conversion.
> Which is zero in this case, so it will convert
> struct __sk_buff {
> __u32 len; // offset 0
>
> into access of 'struct sk_buff'->len
> and then will add __sk_buff's  -  delta to in-kernel len field.
> Which will point to some random field further down in struct sk_buff.
> Doesn't look correct at all.

Why?
So it points to ctx + "offsetof(struct __sk_buff, data) -
offsetof(struct __sk_buff, mark)",
which is ctx + const.
Then I tested that 'ctx + const + 0' passes the verifier.

> How did you test this patch?
>
Without the patch, the test case will fail.
With the patch, the test case passes.

William
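
The objection raised in this exchange can be illustrated with a toy model. The sketch below is not verifier code: struct sk_insn and both functions are hypothetical, and it only assumes what the review states, namely that the access check sees the register offset plus the instruction offset while the later rewrite looks at the instruction offset alone.

```c
/* Hypothetical, heavily simplified instruction: only the offset field. */
struct sk_insn {
	int off;
};

/* Offset seen by the access check: register offset plus insn offset. */
static int checked_ctx_off(int reg_off, const struct sk_insn *insn)
{
	return reg_off + insn->off;
}

/* Offset used by a rewrite step that is keyed on insn->off only. */
static int rewritten_ctx_off(const struct sk_insn *insn)
{
	return insn->off;
}
```

For `r1 += 36; r4 = *(u32 *)(r1 +0)`, the check sees offset 36 while the rewrite sees offset 0, so in this model the load would be converted as if it read the field at offset 0 of struct __sk_buff.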

[PATCH bpf-next] bpf/verifier: enable ctx + const + 0.

2018-04-30 Thread William Tu

The existing verifier does not allow 'ctx + const + const'.  However, due to
compiler optimization, there is a case where the BPF compiler generates
'ctx + const + 0', as shown below:

  599: (1d) if r2 == r4 goto pc+2
   R0=inv(id=0) R1=ctx(id=0,off=40,imm=0)
   R2=inv(id=0,umax_value=4294967295,var_off=(0x0; 0x))
   R3=inv(id=0,umax_value=65535,var_off=(0x0; 0x)) R4=inv0
   R6=ctx(id=0,off=0,imm=0) R7=inv2
  600: (bf) r1 = r6 // r1 is ctx
  601: (07) r1 += 36// r1 has offset 36
  602: (61) r4 = *(u32 *)(r1 +0)// r1 + 0
  dereference of modified ctx ptr R1 off=36+0, ctx+const is allowed,
  ctx+const+const is not

The reason the BPF backend generates this code is an optimization
like this, as explained by Yonghong:
if (...)
*(ctx + 60)
else
*(ctx + 56)

The compiler translates it to
if (...)
   ptr = ctx + 60
else
   ptr = ctx + 56
*(ptr + 0)

So the load through ptr becomes an example of 'ctx + const + 0'.  This patch
enables support for this case.

Fixes: f8ddadc4db6c7 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Cc: Yonghong Song <y...@fb.com>
Signed-off-by: Yifeng Sun <pkusunyif...@gmail.com>
Signed-off-by: William Tu <u9012...@gmail.com>
---
 kernel/bpf/verifier.c   |  2 +-
 tools/testing/selftests/bpf/test_verifier.c | 13 +
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 712d8655e916..c9a791b9cf2a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1638,7 +1638,7 @@ static int check_mem_access(struct bpf_verifier_env *env, 
int insn_idx, u32 regn
/* ctx accesses must be at a fixed offset, so that we can
 * determine what type of data were returned.
 */
-   if (reg->off) {
+   if (reg->off && off != reg->off) {
verbose(env,
"dereference of modified ctx ptr R%d off=%d+%d, 
ctx+const is allowed, ctx+const+const is not\n",
regno, reg->off, off - reg->off);
diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 1acafe26498b..95ad5d5723ae 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -8452,6 +8452,19 @@ static struct bpf_test tests[] = {
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
},
{
+   "arithmetic ops make PTR_TO_CTX + const + 0 valid",
+   .insns = {
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_1,
+ offsetof(struct __sk_buff, data) -
+ offsetof(struct __sk_buff, mark)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   },
+   {
"pkt_end - pkt_start is allowed",
.insns = {
BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
-- 
2.7.4
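
The transformation described in the commit message can be sketched in plain
C.  This is only an illustration under stated assumptions: 'struct fake_ctx'
and the helper names are hypothetical, and the offsets 56 and 60 are taken
from the example in the message.  Both functions read the same field; the
second has the shape the optimizer produces, with a single load at ptr + 0.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 64-byte context; only offsets 56 and 60 matter here. */
struct fake_ctx {
	uint8_t bytes[64];
};

static uint32_t load_u32(const uint8_t *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));	/* alignment-safe 32-bit load */
	return v;
}

/* As written: two loads, each at a fixed offset (ctx + const). */
static uint32_t read_as_written(const struct fake_ctx *ctx, int cond)
{
	if (cond)
		return load_u32(ctx->bytes + 60);
	else
		return load_u32(ctx->bytes + 56);
}

/* As optimized: select the pointer first, then one load at ptr + 0. */
static uint32_t read_as_optimized(const struct fake_ctx *ctx, int cond)
{
	const uint8_t *ptr = cond ? ctx->bytes + 60 : ctx->bytes + 56;

	return load_u32(ptr + 0);
}

/* Both forms must read the same field for either branch. */
static void check_equivalence(const struct fake_ctx *ctx)
{
	assert(read_as_written(ctx, 0) == read_as_optimized(ctx, 0));
	assert(read_as_written(ctx, 1) == read_as_optimized(ctx, 1));
}
```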

Re: [PATCH bpf-next] selftests/bpf: bpf tunnel test.

2018-04-30 Thread William Tu

On Mon, Apr 30, 2018 at 2:05 AM, Daniel Borkmann  wrote:
> On 04/30/2018 09:02 AM, Y Song wrote:
>> Hi, William,
>>
>> When compiled the selftests/bpf in my centos 7 based system, I have
>> the following failures,
>>
>> clang -I. -I./include/uapi -I../../../include/uapi
>> -Wno-compare-distinct-pointer-types \
>>  -O2 -target bpf -emit-llvm -c test_tunnel_kern.c -o - |  \
>> llc -march=bpf -mcpu=generic  -filetype=obj -o
>> /data/users/yhs/work/net-next/tools/testing/selftests/bpf/test_tunnel_kern.o
>> test_tunnel_kern.c:21:10: fatal error: 'linux/erspan.h' file not found
>> #include <linux/erspan.h>
>>  ^~~~
>> 1 error generated.
>>
>> Maybe I missed some packages to install?
>
It works for me because I run 'make headers_install', which installs
erspan.h in my /usr/include/linux/.
I will submit a patch to add erspan.h to tools/include/uapi/linux.
Thanks,
William

[PATCH bpf-next] tools include uapi: Grab a copy of linux/erspan.h

2018-04-30 Thread William Tu

Bring the erspan uapi header file so BPF tunnel helpers can use it.

Fixes: 933a741e3b82 ("selftests/bpf: bpf tunnel test.")
Reported-by: Yonghong Song <y...@fb.com>
Signed-off-by: William Tu <u9012...@gmail.com>
---
 tools/include/uapi/linux/erspan.h | 52 +++
 1 file changed, 52 insertions(+)
 create mode 100644 tools/include/uapi/linux/erspan.h

diff --git a/tools/include/uapi/linux/erspan.h 
b/tools/include/uapi/linux/erspan.h
new file mode 100644
index 000000000000..841573019ae1
--- /dev/null
+++ b/tools/include/uapi/linux/erspan.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * ERSPAN Tunnel Metadata
+ *
+ * Copyright (c) 2018 VMware
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ *
+ * Userspace API for metadata mode ERSPAN tunnel
+ */
+#ifndef _UAPI_ERSPAN_H
+#define _UAPI_ERSPAN_H
+
+#include <linux/types.h>	/* For __beXX in userspace */
+#include <asm/byteorder.h>
+
+/* ERSPAN version 2 metadata header */
+struct erspan_md2 {
+   __be32 timestamp;
+   __be16 sgt; /* security group tag */
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+   __u8    hwid_upper:2,
+           ft:5,
+           p:1;
+   __u8    o:1,
+           gra:2,
+           dir:1,
+           hwid:4;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+   __u8    p:1,
+           ft:5,
+           hwid_upper:2;
+   __u8    hwid:4,
+           dir:1,
+           gra:2,
+           o:1;
+#else
+#error "Please fix <asm/byteorder.h>"
+#endif
+};
+
+struct erspan_metadata {
+   int version;
+   union {
+   __be32 index;   /* Version 1 (type II)*/
+   struct erspan_md2 md2;  /* Version 2 (type III) */
+   } u;
+};
+
+#endif /* _UAPI_ERSPAN_H */
-- 
2.7.4
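
struct erspan_md2 above stores the hardware ID in two pieces, hwid_upper:2
and hwid:4.  As a rough illustration of how a 6-bit ID maps onto those two
fields, here is a sketch that uses plain bytes instead of bitfields so it
does not depend on bitfield endianness.  The struct and helper names are
hypothetical and are not part of the header in this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two hwid bitfields of struct erspan_md2. */
struct hwid_parts {
	uint8_t hwid_upper;	/* upper 2 bits of the hardware ID */
	uint8_t hwid;		/* lower 4 bits of the hardware ID */
};

/* Split a 6-bit hardware ID into its 2-bit and 4-bit pieces. */
static struct hwid_parts hwid_split(uint8_t id)
{
	struct hwid_parts p;

	assert(id < 64);	/* 2 + 4 bits hold at most a 6-bit value */
	p.hwid_upper = (id >> 4) & 0x3;
	p.hwid = id & 0xf;
	return p;
}

/* Recombine the two pieces into the original 6-bit value. */
static uint8_t hwid_join(struct hwid_parts p)
{
	return (uint8_t)((p.hwid_upper << 4) | p.hwid);
}
```

For example, an ID of 0x2b splits into hwid_upper = 0x2 and hwid = 0xb.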

[PATCH net-next] erspan: auto detect truncated packets.

2018-04-27 Thread William Tu

Currently the truncated bit is set only when the mirrored packet
is larger than mtu.  For certain cases, the packet might already
have been truncated before being sent to the erspan tunnel.  In this
case, the patch detects whether the IP header's total length is larger
than the actual skb->len.  If true, this indicates that the
mirrored packet is truncated, and the patch sets the erspan truncate bit.

I tested the patch using the bpf_skb_change_tail helper function to
shrink the packet size and send it to the erspan tunnel.

Reported-by: Xiaoyan Jin <xiaoy...@vmware.com>
Signed-off-by: William Tu <u9012...@gmail.com>
---
 net/ipv4/ip_gre.c  | 6 ++
 net/ipv6/ip6_gre.c | 6 ++
 2 files changed, 12 insertions(+)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 9c169bb2444d..dfe5b22f6ed4 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -578,6 +578,7 @@ static void erspan_fb_xmit(struct sk_buff *skb, struct 
net_device *dev,
int tunnel_hlen;
int version;
__be16 df;
+   int nhoff;
 
tun_info = skb_tunnel_info(skb);
if (unlikely(!tun_info || !(tun_info->mode & IP_TUNNEL_INFO_TX) ||
@@ -605,6 +606,11 @@ static void erspan_fb_xmit(struct sk_buff *skb, struct 
net_device *dev,
truncate = true;
}
 
+   nhoff = skb_network_header(skb) - skb_mac_header(skb);
+   if (skb->protocol == htons(ETH_P_IP) &&
+   (ntohs(ip_hdr(skb)->tot_len) > skb->len - nhoff))
+   truncate = true;
+
if (version == 1) {
erspan_build_header(skb, ntohl(tunnel_id_to_key32(key->tun_id)),
ntohl(md->u.index), truncate, true);
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 69727bc168cb..ac7ce85df667 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -896,6 +896,7 @@ static netdev_tx_t ip6erspan_tunnel_xmit(struct sk_buff 
*skb,
struct flowi6 fl6;
int err = -EINVAL;
__u32 mtu;
+   int nhoff;
 
if (!ip6_tnl_xmit_ctl(t, &t->parms.laddr, &t->parms.raddr))
goto tx_err;
@@ -908,6 +909,11 @@ static netdev_tx_t ip6erspan_tunnel_xmit(struct sk_buff 
*skb,
truncate = true;
}
 
+   nhoff = skb_network_header(skb) - skb_mac_header(skb);
+   if (skb->protocol == htons(ETH_P_IP) &&
+   (ntohs(ip_hdr(skb)->tot_len) > skb->len - nhoff))
+   truncate = true;
+
if (skb_cow_head(skb, dev->needed_headroom))
goto tx_err;
 
-- 
2.7.4
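
The comparison added by this patch can be tried in isolation.  The sketch
below is a user-space model under stated assumptions: the helper name and
its arguments are hypothetical, with frame_len standing in for skb->len and
nhoff for the offset of the network header from the MAC header.

```c
#include <arpa/inet.h>	/* htons(), ntohs() */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper mirroring the comparison added by the patch.
 * tot_len_be: IPv4 total length as stored in the header (network order)
 * frame_len:  bytes actually present, counted from the MAC header
 * nhoff:      offset of the network header from the MAC header
 */
static bool ipv4_looks_truncated(uint16_t tot_len_be, unsigned int frame_len,
				 unsigned int nhoff)
{
	assert(nhoff <= frame_len);
	return ntohs(tot_len_be) > frame_len - nhoff;
}
```

With a 14-byte Ethernet header and 114 bytes present, a header claiming a
total length of 1500 is flagged as truncated, while one claiming 100 is not.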

[PATCHv2 bpf-next 0/2] BPF tunnel testsuite

2018-04-26 Thread William Tu

This patch series provides an end-to-end eBPF tunnel test suite.  The common
topology below is created for all types of tunnels:

Topology:
---------
    root namespace   |     at_ns0 namespace
                     |
     -----------     |     -----------
     | tnl dev |     |     | tnl dev |  (overlay network)
     -----------     |     -----------
     metadata-mode   |     native-mode
      with bpf       |
                     |
     ----------      |     ----------
     |  veth1  | --------- |  veth0  |  (underlay network)
     ----------    peer    ----------

Device Configuration
--------------------
 Root namespace with metadata-mode tunnel + BPF
 Device names and addresses:
   veth1 IP: 172.16.1.200, IPv6: 00::22 (underlay)
   tunnel dev <type>11, ex: gre11, IPv4: 10.1.1.200 (overlay)

 Namespace at_ns0 with native tunnel
 Device names and addresses:
   veth0 IPv4: 172.16.1.100, IPv6: 00::11 (underlay)
   tunnel dev <type>00, ex: gre00, IPv4: 10.1.1.100 (overlay)


End-to-end ping packet flow
---------------------------
 Most of the tests start with namespace creation and device configuration,
 then ping the underlay and overlay networks.  When doing 'ping 10.1.1.100'
 from the root namespace, the following operations happen:
 1) Route lookup shows 10.1.1.100/24 belongs to the tnl dev; forward to it.
 2) The tnl device's egress BPF program is triggered and sets the tunnel
    metadata, with remote_ip=172.16.1.100 and others.
 3) The outer tunnel header is prepended and the packet is routed to veth1's
    egress.
 4) veth0's ingress queue receives the tunneled packet in namespace at_ns0.
 5) The tunnel protocol handler, ex: vxlan_rcv, decaps the packet.
 6) The packet is forwarded to the overlay tnl dev.

Test Cases
-----------------------------
 Tunnel Type |  BPF Programs
-----------------------------
 GRE:          gre_set_tunnel, gre_get_tunnel
 IP6GRE:       ip6gretap_set_tunnel, ip6gretap_get_tunnel
 ERSPAN:       erspan_set_tunnel, erspan_get_tunnel
 IP6ERSPAN:    ip4ip6erspan_set_tunnel, ip4ip6erspan_get_tunnel
 VXLAN:        vxlan_set_tunnel, vxlan_get_tunnel
 IP6VXLAN:     ip6vxlan_set_tunnel, ip6vxlan_get_tunnel
 GENEVE:       geneve_set_tunnel, geneve_get_tunnel
 IP6GENEVE:    ip6geneve_set_tunnel, ip6geneve_get_tunnel
 IPIP:         ipip_set_tunnel, ipip_get_tunnel
 IP6IP:        ipip6_set_tunnel, ipip6_get_tunnel,
               ip6ip6_set_tunnel, ip6ip6_get_tunnel
 XFRM:         xfrm_get_state

William Tu (2):
  selftests/bpf: bpf tunnel test.
  samples/bpf: remove the bpf tunnel testsuite.

 samples/bpf/Makefile   |   1 -
 samples/bpf/tcbpf2_kern.c  | 612 -
 samples/bpf/test_tunnel_bpf.sh | 390 -
 tools/testing/selftests/bpf/Makefile   |   5 +-
 tools/testing/selftests/bpf/test_tunnel.sh | 729 +
 tools/testing/selftests/bpf/test_tunnel_kern.c | 713 
 6 files changed, 1445 insertions(+), 1005 deletions(-)
 delete mode 100644 samples/bpf/tcbpf2_kern.c
 delete mode 100755 samples/bpf/test_tunnel_bpf.sh
 create mode 100755 tools/testing/selftests/bpf/test_tunnel.sh
 create mode 100644 tools/testing/selftests/bpf/test_tunnel_kern.c

-- 
2.7.4

[PATCHv2 bpf-next 2/2] samples/bpf: remove the bpf tunnel testsuite.

2018-04-26 Thread William Tu

Move the testsuite to
selftests/bpf/{test_tunnel_kern.c, test_tunnel.sh}

Signed-off-by: William Tu <u9012...@gmail.com>
---
 samples/bpf/Makefile   |   1 -
 samples/bpf/tcbpf2_kern.c  | 612 -
 samples/bpf/test_tunnel_bpf.sh | 390 --
 3 files changed, 1003 deletions(-)
 delete mode 100644 samples/bpf/tcbpf2_kern.c
 delete mode 100755 samples/bpf/test_tunnel_bpf.sh

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index aa8c392e2e52..b853581592fd 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -114,7 +114,6 @@ always += sock_flags_kern.o
 always += test_probe_write_user_kern.o
 always += trace_output_kern.o
 always += tcbpf1_kern.o
-always += tcbpf2_kern.o
 always += tc_l2_redirect_kern.o
 always += lathist_kern.o
 always += offwaketime_kern.o
diff --git a/samples/bpf/tcbpf2_kern.c b/samples/bpf/tcbpf2_kern.c
deleted file mode 100644
index fa260c750fb1..000000000000
--- a/samples/bpf/tcbpf2_kern.c
+++ /dev/null
@@ -1,612 +0,0 @@
-/* Copyright (c) 2016 VMware
- * Copyright (c) 2016 Facebook
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of version 2 of the GNU General Public
- * License as published by the Free Software Foundation.
- */
-#define KBUILD_MODNAME "foo"
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include "bpf_helpers.h"
-#include "bpf_endian.h"
-
-#define _htonl __builtin_bswap32
-#define ERROR(ret) do {\
-   char fmt[] = "ERROR line:%d ret:%d\n";\
-   bpf_trace_printk(fmt, sizeof(fmt), __LINE__, ret); \
-   } while(0)
-
-struct geneve_opt {
-   __be16  opt_class;
-   u8  type;
-   u8  length:5;
-   u8  r3:1;
-   u8  r2:1;
-   u8  r1:1;
-   u8  opt_data[8]; /* hard-coded to 8 byte */
-};
-
-struct vxlan_metadata {
-   u32 gbp;
-};
-
-SEC("gre_set_tunnel")
-int _gre_set_tunnel(struct __sk_buff *skb)
-{
-   int ret;
-   struct bpf_tunnel_key key;
-
-   __builtin_memset(&key, 0x0, sizeof(key));
-   key.remote_ipv4 = 0xac100164; /* 172.16.1.100 */
-   key.tunnel_id = 2;
-   key.tunnel_tos = 0;
-   key.tunnel_ttl = 64;
-
-   ret = bpf_skb_set_tunnel_key(skb, &key, sizeof(key),
-BPF_F_ZERO_CSUM_TX | BPF_F_SEQ_NUMBER);
-   if (ret < 0) {
-   ERROR(ret);
-   return TC_ACT_SHOT;
-   }
-
-   return TC_ACT_OK;
-}
-
-SEC("gre_get_tunnel")
-int _gre_get_tunnel(struct __sk_buff *skb)
-{
-   int ret;
-   struct bpf_tunnel_key key;
-   char fmt[] = "key %d remote ip 0x%x\n";
-
-   ret = bpf_skb_get_tunnel_key(skb, &key, sizeof(key), 0);
-   if (ret < 0) {
-   ERROR(ret);
-   return TC_ACT_SHOT;
-   }
-
-   bpf_trace_printk(fmt, sizeof(fmt), key.tunnel_id, key.remote_ipv4);
-   return TC_ACT_OK;
-}
-
-SEC("ip6gretap_set_tunnel")
-int _ip6gretap_set_tunnel(struct __sk_buff *skb)
-{
-   struct bpf_tunnel_key key;
-   int ret;
-
-   __builtin_memset(&key, 0x0, sizeof(key));
-   key.remote_ipv6[3] = _htonl(0x11); /* ::11 */
-   key.tunnel_id = 2;
-   key.tunnel_tos = 0;
-   key.tunnel_ttl = 64;
-   key.tunnel_label = 0xabcde;
-
-   ret = bpf_skb_set_tunnel_key(skb, &key, sizeof(key),
-BPF_F_TUNINFO_IPV6 | BPF_F_ZERO_CSUM_TX |
-BPF_F_SEQ_NUMBER);
-   if (ret < 0) {
-   ERROR(ret);
-   return TC_ACT_SHOT;
-   }
-
-   return TC_ACT_OK;
-}
-
-SEC("ip6gretap_get_tunnel")
-int _ip6gretap_get_tunnel(struct __sk_buff *skb)
-{
-   char fmt[] = "key %d remote ip6 ::%x label %x\n";
-   struct bpf_tunnel_key key;
-   int ret;
-
-   ret = bpf_skb_get_tunnel_key(skb, &key, sizeof(key),
-BPF_F_TUNINFO_IPV6);
-   if (ret < 0) {
-   ERROR(ret);
-   return TC_ACT_SHOT;
-   }
-
-   bpf_trace_printk(fmt, sizeof(fmt),
-key.tunnel_id, key.remote_ipv6[3], key.tunnel_label);
-
-   return TC_ACT_OK;
-}
-
-SEC("erspan_set_tunnel")
-int _erspan_set_tunnel(struct __sk_buff *skb)
-{
-   struct bpf_tunnel_key key;
-   struct erspan_metadata md;
-   int ret;
-
-   __builtin_memset(&key, 0x0, sizeof(key));
-   key.remote_ipv4 = 0xac100164; /* 172.16.1.100 */
-   key.tunnel_id = 2;
-   key.tunnel_tos = 0;
-   key.tunnel_ttl = 64;
-
-   ret = bpf_skb_set_tunnel_key(skb, &key, sizeof(key), BPF_F_ZERO_CSUM_TX);
-   if (ret < 0) {
-   ERROR(ret);
-   return TC_ACT_SHOT;
-   }
-
-   __builtin_memset(&md, 0, sizeof(md));
-#ifdef ERSPAN_V1
-

[PATCHv2 bpf-next 1/2] selftests/bpf: bpf tunnel test.

2018-04-26 Thread William Tu

The patch migrates the original tests at samples/bpf/tcbpf2_kern.c
and samples/bpf/test_tunnel_bpf.sh to selftests.  There are a couple
of changes from the original:
1) add ipv6 vxlan, ipv6 geneve, ipv6 ipip tests
2) simplify the original ipip tests (remove iperf tests)
3) improve documentation
4) use bpf_ntoh* and bpf_hton* api

In summary, 'test_tunnel_kern.o' contains the following bpf programs:
  GRE: gre_set_tunnel, gre_get_tunnel
  IP6GRE: ip6gretap_set_tunnel, ip6gretap_get_tunnel
  ERSPAN: erspan_set_tunnel, erspan_get_tunnel
  IP6ERSPAN: ip4ip6erspan_set_tunnel, ip4ip6erspan_get_tunnel
  VXLAN: vxlan_set_tunnel, vxlan_get_tunnel
  IP6VXLAN: ip6vxlan_set_tunnel, ip6vxlan_get_tunnel
  GENEVE: geneve_set_tunnel, geneve_get_tunnel
  IP6GENEVE: ip6geneve_set_tunnel, ip6geneve_get_tunnel
  IPIP: ipip_set_tunnel, ipip_get_tunnel
  IP6IP: ipip6_set_tunnel, ipip6_get_tunnel,
 ip6ip6_set_tunnel, ip6ip6_get_tunnel
  XFRM: xfrm_get_state

Signed-off-by: William Tu <u9012...@gmail.com>
---
 tools/testing/selftests/bpf/Makefile   |   5 +-
 tools/testing/selftests/bpf/test_tunnel.sh | 729 +
 tools/testing/selftests/bpf/test_tunnel_kern.c | 713 
 3 files changed, 1445 insertions(+), 2 deletions(-)
 create mode 100755 tools/testing/selftests/bpf/test_tunnel.sh
 create mode 100644 tools/testing/selftests/bpf/test_tunnel_kern.c

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 0c19d5e08f08..b64a7a39cbc8 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -32,7 +32,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o 
test_tcp_estats.o test
test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o 
test_adjust_tail.o \
-   test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o
+   test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
@@ -40,7 +40,8 @@ TEST_PROGS := test_kmod.sh \
test_xdp_redirect.sh \
test_xdp_meta.sh \
test_offload.py \
-   test_sock_addr.sh
+   test_sock_addr.sh \
+   test_tunnel.sh
 
 # Compile but not part of 'make run_tests'
 TEST_GEN_PROGS_EXTENDED = test_libbpf_open test_sock_addr
diff --git a/tools/testing/selftests/bpf/test_tunnel.sh 
b/tools/testing/selftests/bpf/test_tunnel.sh
new file mode 100755
index 000000000000..aeb2901f21f4
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_tunnel.sh
@@ -0,0 +1,729 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# End-to-end eBPF tunnel test suite
+#   The script tests BPF network tunnel implementation.
+#
+# Topology:
+# ---------
+#     root namespace   |     at_ns0 namespace
+#                      |
+#      -----------     |     -----------
+#      | tnl dev |     |     | tnl dev |  (overlay network)
+#      -----------     |     -----------
+#      metadata-mode   |     native-mode
+#       with bpf       |
+#                      |
+#      ----------      |     ----------
+#      |  veth1  | --------- |  veth0  |  (underlay network)
+#      ----------    peer    ----------
+#
+#
+# Device Configuration
+# --------------------
+# Root namespace with metadata-mode tunnel + BPF
+# Device names and addresses:
+#  veth1 IP: 172.16.1.200, IPv6: 00::22 (underlay)
+#  tunnel dev <type>11, ex: gre11, IPv4: 10.1.1.200 (overlay)
+#
+# Namespace at_ns0 with native tunnel
+# Device names and addresses:
+#  veth0 IPv4: 172.16.1.100, IPv6: 00::11 (underlay)
+#  tunnel dev <type>00, ex: gre00, IPv4: 10.1.1.100 (overlay)
+#
+#
+# End-to-end ping packet flow
+# ---------------------------
+# Most of the tests start by namespace creation, device configuration,
+# then ping the underlay and overlay network.  When doing 'ping 10.1.1.100'
+# from root namespace, the following operations happen:
+# 1) Route lookup shows 10.1.1.100/24 belongs to tnl dev, fwd to tnl dev.
+# 2) Tnl device's egress BPF program is triggered and set the tunnel metadata,
+#    with remote_ip=172.16.1.200 and others.
+# 3) Outer tunnel header is prepended and route the packet to veth1's egress
+# 4) veth0's ingress queue receive the tunneled packet at namespace at_ns0
+# 5) Tunnel protocol handler, ex: vxlan_rcv, decap the packet
+# 6) Forward the packet to the overlay tnl dev
+
+PING_ARG="-c 3 -w 10 -q"
+ret=0
+GREEN='\033[0;92m'
+RED='\033[0;31m'
+NC='\033[0m' # No Color
+
+config_device()
+{
+   ip netns add at_ns0
+   ip link add veth0 type veth peer name veth1
+   ip link set veth0 netns at_ns0
+   ip netns exec at_ns0 ip addr add 172.16.1.100/24 dev veth0
+   ip netns exec at_ns0 ip link set dev veth0 up
+   ip link set dev veth1 up mtu 1500
+   ip addr add dev ve

Re: [PATCH bpf-next] selftests/bpf: bpf tunnel test.

2018-04-26 Thread William Tu

On Wed, Apr 25, 2018 at 8:01 AM, William Tu <u9012...@gmail.com> wrote:
> The patch migrates the original tests at samples/bpf/tcbpf2_kern.c
> and samples/bpf/test_tunnel_bpf.sh to selftests.  There are a couple
> changes from the original:
> 1) add ipv6 vxlan, ipv6 geneve, ipv6 ipip tests
> 2) simplify the original ipip tests (remove iperf tests)
> 3) improve documentation
> 4) use bpf_ntoh* and bpf_hton* api
>
> In summary, 'test_tunnel_kern.o' contains the following bpf program:
>   GRE: gre_set_tunnel, gre_get_tunnel
>   IP6GRE: ip6gretap_set_tunnel, ip6gretap_get_tunnel
>   ERSPAN: erspan_set_tunnel, erspan_get_tunnel
>   IP6ERSPAN: ip4ip6erspan_set_tunnel, ip4ip6erspan_get_tunnel
>   VXLAN: vxlan_set_tunnel, vxlan_get_tunnel
>   IP6VXLAN: ip6vxlan_set_tunnel, ip6vxlan_get_tunnel
>   GENEVE: geneve_set_tunnel, geneve_get_tunnel
>   IP6GENEVE: ip6geneve_set_tunnel, ip6geneve_get_tunnel
>   IPIP: ipip_set_tunnel, ipip_get_tunnel
>   IP6IP: ipip6_set_tunnel, ipip6_get_tunnel,
>      ip6ip6_set_tunnel, ip6ip6_get_tunnel
>
> Signed-off-by: William Tu <u9012...@gmail.com>
> ---

I made a mistake by removing the recent XFRM helper test cases.
I will send v2.

William

Re: [PATCH bpf-next] bpf: clear the ip_tunnel_info.

2018-04-25 Thread William Tu

On Wed, Apr 25, 2018 at 12:54 AM, Daniel Borkmann <dan...@iogearbox.net> wrote:
> On 04/25/2018 08:46 AM, William Tu wrote:
>> The percpu metadata_dst might carry the stale ip_tunnel_info
>> and cause incorrect behavior.  When mixing tests using ipv4/ipv6
>> bpf vxlan and geneve tunnel, the ipv6 tunnel info incorrectly uses
>> ipv4's src ip addr as its ipv6 src address, because the previous
>> tunnel info does not clean up.  The patch zeros the fields in
>> ip_tunnel_info.
>>
>> Signed-off-by: William Tu <u9012...@gmail.com>
>> Reported-by: Yifeng Sun <pkusunyif...@gmail.com>
>
> Since this is a fix, I've applied this to bpf, thanks William!

Thanks.
Just to add some context about this issue:
it happens when doing the following in sequence:
1) start ipv4 vxlan bpf tunnel
2) delete all related devices
3) start ipv6 vxlan bpf tunnel

The first ipv4 vxlan tunnel sets the ipv4 src ip in the ip_tunnel_key
and does not clear it.  So in step 3) the ipv6 vxlan bpf tunnel uses the
ipv4 address as its ipv6 address.  As a result, the vxlan driver reports
[81227.576732] ip6vxlan00: add 7a:2c:d7:fe:a9:43 ->
::ac10:0164::::
[81237.614330] ip6vxlan00: no route to ::ac10:0164::::
where "ac10:0164" is 172.16.1.100.

A similar issue occurs when testing ipv4 geneve followed by ipv6 geneve.
Regards,
William
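
The leftover bytes in the log can be decoded by hand.  The sketch below
shows one way to do it; the struct and function names are hypothetical
helpers for illustration only.  It treats the two 16-bit groups "ac10:0164"
as one 32-bit IPv4 value, which is the same constant the IPv4 test programs
assign to key.remote_ipv4 (0xac100164).

```c
#include <assert.h>
#include <stdint.h>

/* Four octets of an IPv4 address, most significant first. */
struct ipv4_octets {
	uint8_t o[4];
};

/* Split a host-order 32-bit IPv4 value into its four octets. */
static struct ipv4_octets ipv4_split(uint32_t addr)
{
	struct ipv4_octets r = {{
		(uint8_t)(addr >> 24), (uint8_t)(addr >> 16),
		(uint8_t)(addr >> 8), (uint8_t)addr,
	}};

	return r;
}

/* Combine two 16-bit groups, as printed in an IPv6 address, into 32 bits. */
static uint32_t groups_to_ipv4(uint16_t hi, uint16_t lo)
{
	return ((uint32_t)hi << 16) | lo;
}

/* "ac10:0164" decodes to 172.16.1.100. */
static void check_log_address(void)
{
	struct ipv4_octets a = ipv4_split(groups_to_ipv4(0xac10, 0x0164));

	assert(a.o[0] == 172 && a.o[1] == 16 && a.o[2] == 1 && a.o[3] == 100);
}
```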

[PATCH bpf-next] selftests/bpf: bpf tunnel test.

2018-04-25 Thread William Tu

The patch migrates the original tests at samples/bpf/tcbpf2_kern.c
and samples/bpf/test_tunnel_bpf.sh to selftests.  There are a couple
of changes from the original:
1) add ipv6 vxlan, ipv6 geneve, ipv6 ipip tests
2) simplify the original ipip tests (remove iperf tests)
3) improve documentation
4) use bpf_ntoh* and bpf_hton* api

In summary, 'test_tunnel_kern.o' contains the following bpf programs:
  GRE: gre_set_tunnel, gre_get_tunnel
  IP6GRE: ip6gretap_set_tunnel, ip6gretap_get_tunnel
  ERSPAN: erspan_set_tunnel, erspan_get_tunnel
  IP6ERSPAN: ip4ip6erspan_set_tunnel, ip4ip6erspan_get_tunnel
  VXLAN: vxlan_set_tunnel, vxlan_get_tunnel
  IP6VXLAN: ip6vxlan_set_tunnel, ip6vxlan_get_tunnel
  GENEVE: geneve_set_tunnel, geneve_get_tunnel
  IP6GENEVE: ip6geneve_set_tunnel, ip6geneve_get_tunnel
  IPIP: ipip_set_tunnel, ipip_get_tunnel
  IP6IP: ipip6_set_tunnel, ipip6_get_tunnel,
 ip6ip6_set_tunnel, ip6ip6_get_tunnel

Signed-off-by: William Tu <u9012...@gmail.com>
---
 samples/bpf/Makefile   |   1 -
 samples/bpf/tcbpf2_kern.c  | 612 --
 samples/bpf/test_tunnel_bpf.sh | 390 --
 tools/testing/selftests/bpf/Makefile   |   5 +-
 tools/testing/selftests/bpf/test_tunnel.sh | 651 +++
 tools/testing/selftests/bpf/test_tunnel_kern.c | 691 +
 6 files changed, 1345 insertions(+), 1005 deletions(-)
 delete mode 100644 samples/bpf/tcbpf2_kern.c
 delete mode 100755 samples/bpf/test_tunnel_bpf.sh
 create mode 100755 tools/testing/selftests/bpf/test_tunnel.sh
 create mode 100644 tools/testing/selftests/bpf/test_tunnel_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index aa8c392e2e52..b853581592fd 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -114,7 +114,6 @@ always += sock_flags_kern.o
 always += test_probe_write_user_kern.o
 always += trace_output_kern.o
 always += tcbpf1_kern.o
-always += tcbpf2_kern.o
 always += tc_l2_redirect_kern.o
 always += lathist_kern.o
 always += offwaketime_kern.o
diff --git a/samples/bpf/tcbpf2_kern.c b/samples/bpf/tcbpf2_kern.c
deleted file mode 100644
index fa260c750fb1..000000000000
--- a/samples/bpf/tcbpf2_kern.c
+++ /dev/null
@@ -1,612 +0,0 @@
-/* Copyright (c) 2016 VMware
- * Copyright (c) 2016 Facebook
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of version 2 of the GNU General Public
- * License as published by the Free Software Foundation.
- */
-#define KBUILD_MODNAME "foo"
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include "bpf_helpers.h"
-#include "bpf_endian.h"
-
-#define _htonl __builtin_bswap32
-#define ERROR(ret) do {\
-   char fmt[] = "ERROR line:%d ret:%d\n";\
-   bpf_trace_printk(fmt, sizeof(fmt), __LINE__, ret); \
-   } while(0)
-
-struct geneve_opt {
-   __be16  opt_class;
-   u8  type;
-   u8  length:5;
-   u8  r3:1;
-   u8  r2:1;
-   u8  r1:1;
-   u8  opt_data[8]; /* hard-coded to 8 byte */
-};
-
-struct vxlan_metadata {
-   u32 gbp;
-};
-
-SEC("gre_set_tunnel")
-int _gre_set_tunnel(struct __sk_buff *skb)
-{
-   int ret;
-   struct bpf_tunnel_key key;
-
-   __builtin_memset(&key, 0x0, sizeof(key));
-   key.remote_ipv4 = 0xac100164; /* 172.16.1.100 */
-   key.tunnel_id = 2;
-   key.tunnel_tos = 0;
-   key.tunnel_ttl = 64;
-
-   ret = bpf_skb_set_tunnel_key(skb, &key, sizeof(key),
-BPF_F_ZERO_CSUM_TX | BPF_F_SEQ_NUMBER);
-   if (ret < 0) {
-   ERROR(ret);
-   return TC_ACT_SHOT;
-   }
-
-   return TC_ACT_OK;
-}
-
-SEC("gre_get_tunnel")
-int _gre_get_tunnel(struct __sk_buff *skb)
-{
-   int ret;
-   struct bpf_tunnel_key key;
-   char fmt[] = "key %d remote ip 0x%x\n";
-
-   ret = bpf_skb_get_tunnel_key(skb, &key, sizeof(key), 0);
-   if (ret < 0) {
-   ERROR(ret);
-   return TC_ACT_SHOT;
-   }
-
-   bpf_trace_printk(fmt, sizeof(fmt), key.tunnel_id, key.remote_ipv4);
-   return TC_ACT_OK;
-}
-
-SEC("ip6gretap_set_tunnel")
-int _ip6gretap_set_tunnel(struct __sk_buff *skb)
-{
-   struct bpf_tunnel_key key;
-   int ret;
-
-   __builtin_memset(&key, 0x0, sizeof(key));
-   key.remote_ipv6[3] = _htonl(0x11); /* ::11 */
-   key.tunnel_id = 2;
-   key.tunnel_tos = 0;
-   key.tunnel_ttl = 64;
-   key.tunnel_label = 0xabcde;
-
-   ret = bpf_skb_set_tunnel_key(skb, &key, sizeof(key),
-BPF_F_TUNINFO_IPV6 | BPF_F_ZERO_CSUM_TX |
-BPF_F_SEQ_NUMBER);
-   if (ret < 0) {
-   ERROR(ret);
-   return TC_ACT_SHOT;
-

[PATCH bpf-next] bpf: clear the ip_tunnel_info.

2018-04-25 Thread William Tu

The percpu metadata_dst might carry a stale ip_tunnel_info
and cause incorrect behavior.  When mixing tests using ipv4/ipv6
bpf vxlan and geneve tunnels, the ipv6 tunnel info incorrectly uses
ipv4's src ip addr as its ipv6 src address, because the previous
tunnel info is not cleaned up.  The patch zeros the fields in
ip_tunnel_info.

Signed-off-by: William Tu <u9012...@gmail.com>
Reported-by: Yifeng Sun <pkusunyif...@gmail.com>
---
 net/core/filter.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 8e45c6c7ab08..d3781daa26ab 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3281,6 +3281,7 @@ BPF_CALL_4(bpf_skb_set_tunnel_key, struct sk_buff *, skb,
skb_dst_set(skb, (struct dst_entry *) md);
 
info = &md->u.tun_info;
+   memset(info, 0, sizeof(*info));
info->mode = IP_TUNNEL_INFO_TX;
 
info->key.tun_flags = TUNNEL_KEY | TUNNEL_CSUM | TUNNEL_NOCACHE;
-- 
2.7.4
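
The effect of the added memset can be reproduced with a small user-space
model.  Everything in it is an assumption made for illustration: 'struct
key_model' is a hypothetical, much-reduced stand-in for the tunnel key, with
the IPv4 and IPv6 addresses sharing storage, and 'scratch' stands in for the
reused per-CPU object.  The IPv6 setter writes only the destination, as a
tunnel program that sets only the remote address would.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reduced key: IPv4 and IPv6 addresses share storage. */
struct key_model {
	union {
		struct { uint32_t src, dst; } ipv4;
		struct { uint32_t src[4], dst[4]; } ipv6;
	} u;
};

/* One long-lived object, standing in for the reused per-CPU metadata. */
static struct key_model scratch;

static struct key_model *set_ipv4(uint32_t dst, int clear_first)
{
	if (clear_first)
		memset(&scratch, 0, sizeof(scratch));	/* the one-line fix */
	scratch.u.ipv4.dst = dst;
	return &scratch;
}

static struct key_model *set_ipv6(uint32_t dst_last_word, int clear_first)
{
	if (clear_first)
		memset(&scratch, 0, sizeof(scratch));	/* the one-line fix */
	scratch.u.ipv6.dst[3] = dst_last_word;	/* source is never written */
	return &scratch;
}

/* Nonzero if an IPv6 setup after an IPv4 one still sees IPv4 bytes. */
static int ipv6_src_is_stale(int clear_first)
{
	/* The model relies on ipv4.dst aliasing part of ipv6.src. */
	assert((void *)&scratch.u.ipv4.dst == (void *)&scratch.u.ipv6.src[1]);

	set_ipv4(0xac100164u, clear_first);
	return set_ipv6(0x11, clear_first)->u.ipv6.src[1] != 0;
}
```

In this model, ipv6_src_is_stale(0) reports leftover bytes and
ipv6_src_is_stale(1) does not.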

Re: [PATCH iproute2] iplink_geneve: correct size of message to avoid spurious errors

2018-04-20 Thread William Tu

On Wed, Apr 18, 2018 at 11:06 AM, Jakub Kicinski
<jakub.kicin...@netronome.com> wrote:
> Commit 6c4b672738ac ("iplink_geneve: Get rid of inet_get_addr()")
> inadvertently changed the parameter to addattr_l() resulting in:
>
> addattr_l ERROR: message exceeded bound of 4
>
> when remote is specified.
>
> Fixes: 6c4b672738ac ("iplink_geneve: Get rid of inet_get_addr()")
> Signed-off-by: Jakub Kicinski <jakub.kicin...@netronome.com>
> Reviewed-by: Quentin Monnet <quentin.mon...@netronome.com>
> ---

Thanks. We also hit this issue when creating a geneve tunnel.

Acked-by: William Tu <u9012...@gmail.com>

Re: [RFC PATCH v2 00/14] Introducing AF_XDP support

2018-04-10 Thread William Tu

On Mon, Apr 9, 2018 at 11:47 PM, Björn Töpel <bjorn.to...@gmail.com> wrote:
> 2018-04-09 23:51 GMT+02:00 William Tu <u9012...@gmail.com>:
>> On Tue, Mar 27, 2018 at 9:59 AM, Björn Töpel <bjorn.to...@gmail.com> wrote:
>>> From: Björn Töpel <bjorn.to...@intel.com>
>>>
>>> This RFC introduces a new address family called AF_XDP that is
>>> optimized for high performance packet processing and, in upcoming
>>> patch sets, zero-copy semantics. In this v2 version, we have removed
>>> all zero-copy related code in order to make it smaller, simpler and
>>> hopefully more review friendly. This RFC only supports copy-mode for
>>> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX
>>> using the XDP_DRV path. Zero-copy support requires XDP and driver
>>> changes that Jesper Dangaard Brouer is working on. Some of his work is
>>> already on the mailing list for review. We will publish our zero-copy
>>> support for RX and TX on top of his patch sets at a later point in
>>> time.
>>>
>>> An AF_XDP socket (XSK) is created with the normal socket()
>>> syscall. Associated with each XSK are two queues: the RX queue and the
>>> TX queue. A socket can receive packets on the RX queue and it can send
>>> packets on the TX queue. These queues are registered and sized with
>>> the setsockopts XDP_RX_QUEUE and XDP_TX_QUEUE, respectively. It is
>>> mandatory to have at least one of these queues for each socket. In
>>> contrast to AF_PACKET V2/V3 these descriptor queues are separated from
>>> packet buffers. An RX or TX descriptor points to a data buffer in a
>>> memory area called a UMEM. RX and TX can share the same UMEM so that a
>>> packet does not have to be copied between RX and TX. Moreover, if a
>>> packet needs to be kept for a while due to a possible retransmit, the
>>> descriptor that points to that packet can be changed to point to
>>> another and reused right away. This again avoids copying data.
>>>
>>> This new dedicated packet buffer area is called a UMEM. It consists of
>>> a number of equally size frames and each frame has a unique frame
>>> id. A descriptor in one of the queues references a frame by
>>> referencing its frame id. The user space allocates memory for this
>>> UMEM using whatever means it feels is most appropriate (malloc, mmap,
>>> huge pages, etc). This memory area is then registered with the kernel
>>> using the new setsockopt XDP_UMEM_REG. The UMEM also has two queues:
>>> the FILL queue and the COMPLETION queue. The fill queue is used by the
>>> application to send down frame ids for the kernel to fill in with RX
>>> packet data. References to these frames will then appear in the RX
>>> queue of the XSK once they have been received. The completion queue,
>>> on the other hand, contains frame ids that the kernel has transmitted
>>> completely and can now be used again by user space, for either TX or
>>> RX. Thus, the frame ids appearing in the completion queue are ids that
>>> were previously transmitted using the TX queue. In summary, the RX and
>>> FILL queues are used for the RX path and the TX and COMPLETION queues
>>> are used for the TX path.
>>>
>> Can we register a UMEM to multiple device's queue?
>>
>
> No, one UMEM, one netdev queue in this RFC. That being said, there's
> nothing stopping a user from creating an additional UMEM, say UMEM',
> pointing to the same memory as UMEM, but bound to another
> netdev/queue. Note that the user space application has to make sure
> that the buffer handling is sane (user/kernel frame ownership).
>
> We used to allow to share UMEM between unrelated sockets, but after
> the introduction of the UMEM queues (fill/completion) that's no the
> case any more. For the zero-copy scenario, having to manage multiple
> DMA mappings per UMEM was a bit of a mess, so we went for the simpler
> (current) solution with one UMEM per netdev/queue.
>
>> So far the l2fwd sample code is sending/receiving from the same
>> queue. I'm thinking about forwarding packets from one device to another.
>> Now I'm copying packets from one device's RX desc to another device's TX
>> completion queue. But this introduces one extra copy.
>>
>
> So you've setup two identical UMEMs? Then you can just forward the
> incoming Rx descriptor to the other netdev's Tx queue. Note, that you
> only need to copy the descriptor, not the actual frame data.
>

Thanks!
I will give it a try. I guess you're saying I can do the following:

int sfd1; // for devi
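
The idea of forwarding only the descriptor can be modeled in user space as
follows.  This is a sketch under stated assumptions, not the AF_XDP API: the
descriptor and ring types, RING_SIZE, and all function names are
hypothetical, and a real ring shared with the kernel would also need memory
barriers.  It assumes both devices see the same frames, so a frame id taken
from one device's RX ring is valid on the other device's TX ring.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8	/* hypothetical; must be a power of two */

/* Hypothetical descriptor: names a frame in the UMEM, not the data. */
struct desc_model {
	uint32_t frame_id;
	uint32_t len;
};

struct ring_model {
	struct desc_model slots[RING_SIZE];
	uint32_t head;	/* next slot to consume */
	uint32_t tail;	/* next slot to produce */
};

static int ring_push(struct ring_model *r, struct desc_model d)
{
	if (r->tail - r->head == RING_SIZE)
		return -1;	/* full */
	r->slots[r->tail++ & (RING_SIZE - 1)] = d;
	return 0;
}

static int ring_pop(struct ring_model *r, struct desc_model *d)
{
	if (r->head == r->tail)
		return -1;	/* empty */
	*d = r->slots[r->head++ & (RING_SIZE - 1)];
	return 0;
}

/* Move up to 'budget' descriptors from one device's RX ring to the other
 * device's TX ring.  Only the small descriptors are copied, never frames.
 */
static unsigned int forward(struct ring_model *rx, struct ring_model *tx,
			    unsigned int budget)
{
	unsigned int n = 0;
	struct desc_model d;

	assert(rx != tx);
	/* Check for TX room before popping so no descriptor is dropped. */
	while (n < budget && tx->tail - tx->head < RING_SIZE &&
	       ring_pop(rx, &d) == 0) {
		ring_push(tx, d);
		n++;
	}
	return n;
}
```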

Re: [PATCH net] ip_gre: clear feature flags when incompatible o_flags are set

2018-04-10 Thread William Tu

On Tue, Apr 10, 2018 at 6:10 AM, Xin Long <lucien@gmail.com> wrote:
> On Tue, Apr 10, 2018 at 6:57 PM, Sabrina Dubroca <s...@queasysnail.net> wrote:
>> Commit dd9d598c6657 ("ip_gre: add the support for i/o_flags update via
>> netlink") added the ability to change o_flags, but missed that the
>> GSO/LLTX features are disabled by default, and only enabled when some gre
>> features are unused. Thus we also need to disable the GSO/LLTX features
>> on the device when the TUNNEL_SEQ or TUNNEL_CSUM flags are set.
>>
>> These two examples should result in the same features being set:
>>
>> ip link add gre_none type gre local 192.168.0.10 remote 192.168.0.20 ttl 255 key 0
>> ip link set gre_none type gre seq
>>
>> ip link add gre_seq type gre local 192.168.0.10 remote 192.168.0.20 ttl 255 key 1 seq
>>
>> Fixes: dd9d598c6657 ("ip_gre: add the support for i/o_flags update via netlink")
>> Signed-off-by: Sabrina Dubroca <s...@queasysnail.net>
>> ---

Looks good to me.
Acked-by: William Tu <u9012...@gmail.com>
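
A rough user-space model of the feature update this patch describes,
in case it helps review. The bit values, the enum and the function name
are made up for illustration, and the outer conditions are my reading of
the commit message rather than a copy of ipgre_link_update().

```c
#include <assert.h>

/* Illustrative bit values only, not the kernel's definitions. */
#define DEMO_TUNNEL_CSUM	0x01u
#define DEMO_TUNNEL_SEQ		0x08u
#define DEMO_F_GSO_SOFTWARE	0x1u
#define DEMO_F_LLTX		0x2u

static_assert((DEMO_F_GSO_SOFTWARE & DEMO_F_LLTX) == 0,
	      "feature bits must not overlap");

enum demo_encap { DEMO_ENCAP_NONE, DEMO_ENCAP_OTHER };

/*
 * Recompute the feature bits after o_flags changed. Unlike the pre-patch
 * code, bits are cleared again when the flags no longer allow them.
 */
unsigned int demo_gre_features(unsigned int features, unsigned int o_flags,
			       enum demo_encap encap)
{
	if (!(o_flags & DEMO_TUNNEL_SEQ)) {
		if (!(o_flags & DEMO_TUNNEL_CSUM) || encap == DEMO_ENCAP_NONE)
			features |= DEMO_F_GSO_SOFTWARE;
		else
			features &= ~DEMO_F_GSO_SOFTWARE;
		features |= DEMO_F_LLTX;
	} else {
		features &= ~(DEMO_F_LLTX | DEMO_F_GSO_SOFTWARE);
	}
	return features;
}
```

With this model, creating a tunnel without seq and then setting seq ends
up with the same bits as creating it with seq in the first place, which
is what the two examples in the commit message ask for.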


>>  net/ipv4/ip_gre.c | 6 ++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
>> index a8772a978224..9c169bb2444d 100644
>> --- a/net/ipv4/ip_gre.c
>> +++ b/net/ipv4/ip_gre.c
>> @@ -781,8 +781,14 @@ static void ipgre_link_update(struct net_device *dev, bool set_mtu)
>> tunnel->encap.type == TUNNEL_ENCAP_NONE) {
>> dev->features |= NETIF_F_GSO_SOFTWARE;
>> dev->hw_features |= NETIF_F_GSO_SOFTWARE;
>> +   } else {
>> +   dev->features &= ~NETIF_F_GSO_SOFTWARE;
>> +   dev->hw_features &= ~NETIF_F_GSO_SOFTWARE;
>> }
>> dev->features |= NETIF_F_LLTX;
>> +   } else {
>> +   dev->hw_features &= ~NETIF_F_GSO_SOFTWARE;
>> +   dev->features &= ~(NETIF_F_LLTX | NETIF_F_GSO_SOFTWARE);
>> }
>>  }
>>
>> --
>> 2.16.2
>>
> Reviewed-by: Xin Long <lucien@gmail.com>

Re: [RFC PATCH v2 00/14] Introducing AF_XDP support

2018-04-09 Thread William Tu

On Tue, Mar 27, 2018 at 9:59 AM, Björn Töpel  wrote:
> From: Björn Töpel 
>
> This RFC introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and, in upcoming
> patch sets, zero-copy semantics. In this v2 version, we have removed
> all zero-copy related code in order to make it smaller, simpler and
> hopefully more review friendly. This RFC only supports copy-mode for
> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX
> using the XDP_DRV path. Zero-copy support requires XDP and driver
> changes that Jesper Dangaard Brouer is working on. Some of his work is
> already on the mailing list for review. We will publish our zero-copy
> support for RX and TX on top of his patch sets at a later point in
> time.
>
> An AF_XDP socket (XSK) is created with the normal socket()
> syscall. Associated with each XSK are two queues: the RX queue and the
> TX queue. A socket can receive packets on the RX queue and it can send
> packets on the TX queue. These queues are registered and sized with
> the setsockopts XDP_RX_QUEUE and XDP_TX_QUEUE, respectively. It is
> mandatory to have at least one of these queues for each socket. In
> contrast to AF_PACKET V2/V3 these descriptor queues are separated from
> packet buffers. An RX or TX descriptor points to a data buffer in a
> memory area called a UMEM. RX and TX can share the same UMEM so that a
> packet does not have to be copied between RX and TX. Moreover, if a
> packet needs to be kept for a while due to a possible retransmit, the
> descriptor that points to that packet can be changed to point to
> another and reused right away. This again avoids copying data.
>
> This new dedicated packet buffer area is called a UMEM. It consists of
> a number of equally sized frames and each frame has a unique frame
> id. A descriptor in one of the queues references a frame by
> referencing its frame id. The user space allocates memory for this
> UMEM using whatever means it feels is most appropriate (malloc, mmap,
> huge pages, etc). This memory area is then registered with the kernel
> using the new setsockopt XDP_UMEM_REG. The UMEM also has two queues:
> the FILL queue and the COMPLETION queue. The fill queue is used by the
> application to send down frame ids for the kernel to fill in with RX
> packet data. References to these frames will then appear in the RX
> queue of the XSK once they have been received. The completion queue,
> on the other hand, contains frame ids that the kernel has transmitted
> completely and can now be used again by user space, for either TX or
> RX. Thus, the frame ids appearing in the completion queue are ids that
> were previously transmitted using the TX queue. In summary, the RX and
> FILL queues are used for the RX path and the TX and COMPLETION queues
> are used for the TX path.
>
Can we register a UMEM to multiple devices' queues?

So far the l2fwd sample code is sending/receiving from the same
queue. I'm thinking about forwarding packets from one device to another.
Now I'm copying packets from one device's RX desc to another device's TX
completion queue. But this introduces one extra copy.

One option is to call the bpf_redirect helper function, but sometimes
I still need to process the packet in user space.

I like this work!
Thanks a lot.
William
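
To check my own understanding of the four queues described in the quoted
text, here is a small state model of who owns a frame id at each step.
The state and event names are invented for this sketch and do not come
from the patch set.

```c
#include <assert.h>
#include <stddef.h>

enum demo_frame_state {
	FRAME_USER,		/* owned by the application */
	FRAME_FILL,		/* posted on the FILL queue */
	FRAME_RX,		/* filled by the kernel, visible on the RX queue */
	FRAME_TX,		/* submitted on the TX queue */
	FRAME_COMPLETION,	/* transmitted, id is on the COMPLETION queue */
};

enum demo_frame_event {
	EV_POST_FILL,		/* user hands the frame to the kernel for RX */
	EV_KERNEL_RX,		/* kernel wrote a packet into the frame */
	EV_CONSUME_RX,		/* user read the RX descriptor */
	EV_SUBMIT_TX,		/* user queued the frame for transmit */
	EV_KERNEL_TX_DONE,	/* kernel finished transmitting */
	EV_RECLAIM,		/* user read the id from the COMPLETION queue */
};

/* Apply one event; returns 0 if it is legal in the current state, else -1. */
int demo_frame_step(enum demo_frame_state *st, enum demo_frame_event ev)
{
	static const struct {
		enum demo_frame_state from, to;
	} rule[] = {
		[EV_POST_FILL]		= { FRAME_USER, FRAME_FILL },
		[EV_KERNEL_RX]		= { FRAME_FILL, FRAME_RX },
		[EV_CONSUME_RX]		= { FRAME_RX, FRAME_USER },
		[EV_SUBMIT_TX]		= { FRAME_USER, FRAME_TX },
		[EV_KERNEL_TX_DONE]	= { FRAME_TX, FRAME_COMPLETION },
		[EV_RECLAIM]		= { FRAME_COMPLETION, FRAME_USER },
	};

	assert(st != NULL);
	if ((size_t)ev >= sizeof(rule) / sizeof(rule[0]))
		return -1;
	if (*st != rule[ev].from)
		return -1;
	*st = rule[ev].to;
	return 0;
}
```

A frame that goes FILL, RX, TX, COMPLETION and back to the user never has
its payload copied in this model, only its id changes hands.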

Re: [RFC PATCH 00/24] Introducing AF_XDP support

2018-03-28 Thread William Tu

Hi Jesper,
Thanks for the comments.

>> I assume this xdpsock code is small and should all fit into the icache.
>> However, doing another perf stat on xdpsock l2fwd shows
>>
>> 13,720,109,581  stalled-cycles-frontend  # 60.01% frontend cycles idle  (23.82%)
>>                 stalled-cycles-backend
>>      7,994,837  branch-misses            # 0.16% of all branches  (23.80%)
>>    996,874,424  bus-cycles               # 99.679 M/sec  (23.80%)
>> 18,942,220,445  ref-cycles               # 1894.067 M/sec  (28.56%)
>>    100,983,226  LLC-loads                # 10.097 M/sec  (23.80%)
>>      4,897,089  LLC-load-misses          # 4.85% of all LL-cache hits  (23.80%)
>>     66,659,889  LLC-stores               # 6.665 M/sec  (9.52%)
>>          8,373  LLC-store-misses         # 0.837 K/sec  (9.52%)
>>    158,178,410  LLC-prefetches           # 15.817 M/sec  (9.52%)
>>      3,011,180  LLC-prefetch-misses      # 0.301 M/sec  (9.52%)
>>  8,190,383,109  dTLB-loads               # 818.971 M/sec  (9.52%)
>>     20,432,204  dTLB-load-misses         # 0.25% of all dTLB cache hits  (9.52%)
>>  3,729,504,674  dTLB-stores              # 372.920 M/sec  (9.52%)
>>        992,231  dTLB-store-misses        # 0.099 M/sec  (9.52%)
>>                 dTLB-prefetches
>>                 dTLB-prefetch-misses
>>         11,619  iTLB-loads               # 0.001 M/sec  (9.52%)
>>      1,874,756  iTLB-load-misses         # 16135.26% of all iTLB cache hits  (14.28%)
>
> What was the sample period for this perf stat?
>
10 seconds.
root@ovs-smartnic:~/net-next/tools/perf# ./perf stat -C 6 sleep 10

>> I have super high iTLB-load-misses. This is probably the cause of high
>> frontend stalled.
>
> It looks very strange that your iTLB-loads are 11,619, while the
> iTLB-load-misses are much much higher 1,874,756.
>
Does that mean the CPU tries to load the code, fails, then tries again and
fails again, so that the number of iTLB misses ends up larger than the
number of loads?
Maybe it's related to the high NMI rate, where the NMI handler clears my iTLB?
Let me try to remove the NMI interference first.
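
For what it's worth, the odd-looking percentage is just misses divided by
loads, which goes above 100% as soon as the miss counter is larger than
the load counter. A tiny sketch, with a helper name I made up, that
reproduces the numbers printed by perf:

```c
#include <assert.h>

/* Misses as a percentage of loads, the way the perf annotation reads. */
double demo_miss_percent(unsigned long long misses, unsigned long long loads)
{
	assert(loads != 0);
	return 100.0 * (double)misses / (double)loads;
}
```

demo_miss_percent(1874756, 11619) comes out at roughly 16135.26, and
demo_miss_percent(20432204, 8190383109) at roughly 0.25, matching the
iTLB and dTLB lines in the quoted output.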

>> Do you know any way to improve iTLB hit rate?
>
> The xdpsock code should be small enough to fit in the iCache, but it
> might be laid out in memory in an unfortunate way.  You could play with
> rearranging the C-code (look at the objdump alignments).
>
> If you want to know the details about code alignment issues, and how to
> troubleshoot them, you should read this VERY excellent blog post by
> Denis Bakhvalov:
> https://dendibakh.github.io/blog/2018/01/18/Code_alignment_issues

Thanks for the link.
William

Re: [RFC PATCH 00/24] Introducing AF_XDP support

2018-03-27 Thread William Tu

> Indeed. Intel iommu has least effect on RX because of premap/recycle.
> But TX dma map and unmap is really expensive!
>
>>
>> Basically the IOMMU can make creating/destroying a DMA mapping really
>> expensive. The easiest way to work around it in the case of the Intel
>> IOMMU is to boot with "iommu=pt" which will create an identity mapping
>> for the host. The downside is though that you then have the entire
>> system accessible to the device unless a new mapping is created for it
>> by assigning it to a new IOMMU domain.
>
>
> Yeah, that's what I would say: if you really want to use the Intel IOMMU
> and don't want to take the performance hit, use 'iommu=pt'.
>
> Good to have confirmation from you Alex. Thanks.
>

Thanks for the suggestion! Here are my updated performance numbers:

without iommu=pt (posted before)
Benchmark   XDP_SKB
rxdrop      2.3 Mpps
txpush      1.05 Mpps
l2fwd       0.90 Mpps

with iommu=pt (new)
Benchmark   XDP_SKB
rxdrop      2.24 Mpps
txpush      1.54 Mpps
l2fwd       1.23 Mpps

TX indeed shows a better rate, while RX is essentially unchanged.
William
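
To put the difference between the two tables in relative terms, a small
helper (the name is my own) that computes the percentage change:

```c
#include <assert.h>

/* Relative change from 'before' to 'after', in percent. */
double demo_pct_change(double before, double after)
{
	assert(before != 0.0);
	return 100.0 * (after - before) / before;
}
```

With the numbers above this gives about +46.7% for txpush (1.05 to 1.54),
about +36.7% for l2fwd (0.90 to 1.23) and about -2.6% for rxdrop (2.3 to
2.24), which is probably within run-to-run noise.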

Re: [RFC PATCH 00/24] Introducing AF_XDP support

2018-03-27 Thread William Tu

On Tue, Mar 27, 2018 at 2:37 AM, Jesper Dangaard Brouer
<bro...@redhat.com> wrote:
> On Mon, 26 Mar 2018 14:58:02 -0700
> William Tu <u9012...@gmail.com> wrote:
>
>> > Again high count for NMI ?!?
>> >
>> > Maybe you just forgot to tell perf that you want it to decode the
>> > bpf_prog correctly?
>> >
>> > https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#perf-tool-symbols
>> >
>> > Enable via:
>> >  $ sysctl net/core/bpf_jit_kallsyms=1
>> >
>> > And use perf report (while BPF is STILL LOADED):
>> >
>> >  $ perf report --kallsyms=/proc/kallsyms
>> >
>> > E.g. for emailing this you can use this command:
>> >
>> >  $ perf report --sort cpu,comm,dso,symbol --kallsyms=/proc/kallsyms --no-children --stdio -g none | head -n 40
>> >
>>
>> Thanks, I followed the steps, the result of l2fwd
>> # Total Lost Samples: 119
>> #
>> # Samples: 2K of event 'cycles:ppp'
>> # Event count (approx.): 25675705627
>> #
>> # Overhead  CPU  Command  Shared Object     Symbol
>> # ........  ...  .......  ................  ......
>> #
>>   10.48%  013  xdpsock  xdpsock           [.] main
>>    9.77%  013  xdpsock  [kernel.vmlinux]  [k] clflush_cache_range
>>    8.45%  013  xdpsock  [kernel.vmlinux]  [k] nmi
>>    8.07%  013  xdpsock  [kernel.vmlinux]  [k] xsk_sendmsg
>>    7.81%  013  xdpsock  [kernel.vmlinux]  [k] __domain_mapping
>>    4.95%  013  xdpsock  [kernel.vmlinux]  [k] ixgbe_xmit_frame_ring
>>    4.66%  013  xdpsock  [kernel.vmlinux]  [k] skb_store_bits
>>    4.39%  013  xdpsock  [kernel.vmlinux]  [k] syscall_return_via_sysret
>>    3.93%  013  xdpsock  [kernel.vmlinux]  [k] pfn_to_dma_pte
>>    2.62%  013  xdpsock  [kernel.vmlinux]  [k] __intel_map_single
>>    2.53%  013  xdpsock  [kernel.vmlinux]  [k] __alloc_skb
>>    2.36%  013  xdpsock  [kernel.vmlinux]  [k] iommu_no_mapping
>>    2.21%  013  xdpsock  [kernel.vmlinux]  [k] alloc_skb_with_frags
>>    2.07%  013  xdpsock  [kernel.vmlinux]  [k] skb_set_owner_w
>>    1.98%  013  xdpsock  [kernel.vmlinux]  [k] __kmalloc_node_track_caller
>>    1.94%  013  xdpsock  [kernel.vmlinux]  [k] ksize
>>    1.84%  013  xdpsock  [kernel.vmlinux]  [k] validate_xmit_skb_list
>>    1.62%  013  xdpsock  [kernel.vmlinux]  [k] kmem_cache_alloc_node
>>    1.48%  013  xdpsock  [kernel.vmlinux]  [k] __kmalloc_reserve.isra.37
>>    1.21%  013  xdpsock  xdpsock           [.] xq_enq
>>    1.08%  013  xdpsock  [kernel.vmlinux]  [k] intel_alloc_iova
>>
>
> You did use net/core/bpf_jit_kallsyms=1 and the correct perf commands for
> decoding the bpf_prog, so the perf top#3 'nmi' is likely a real NMI
> call... which looks wrong.
>
Thanks, you're right. Let me dig more on this NMI behavior.

>
>> And l2fwd under "perf stat" looks OK to me. There are few context
>> switches, the CPU is fully utilized, and 1.17 insn per cycle seems OK.
>>
>> Performance counter stats for 'CPU(s) 6':
>>        1.787420  cpu-clock (msec)         #    1.000 CPUs utilized
>>              24  context-switches         #    0.002 K/sec
>>               0  cpu-migrations           #    0.000 K/sec
>>               0  page-faults              #    0.000 K/sec
>>  22,361,333,647  cycles                   #    2.236 GHz
>>  13,458,442,838  stalled-cycles-frontend  #   60.19% frontend cycles idle
>>  26,251,003,067  instructions             #    1.17  insn per cycle
>>                                           #    0.51  stalled cycles per insn
>>   4,938,921,868  branches                 #  493.853 M/sec
>>       7,591,739  branch-misses            #    0.15% of all branches
>>
>>   10.000835769 seconds time elapsed
>
> This perf stat also indicates that something is wrong.
>
> The 1.17 insn per cycle is NOT okay, it is too low (compared to what
> I usually see, e.g. 2.36 insn per cycle).
>
> It clearly says you have 'stalled-cycles-frontend' and '60.19% frontend
> cycles idle'.   This means your CPU has issues/bottlenecks fetching
> instructions. Explained by Andi Kleen here [1]
>
> [1] https://github.com/andikleen/pmu-tools/wiki/toplev-manual
>
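
As a sanity check on the derived columns in the quoted perf stat output,
they can be recomputed from the raw counters. The helper names below are
mine; this is only arithmetic, not how perf itself is implemented.

```c
#include <assert.h>

/* Plain ratio of two event counts. */
double demo_ratio(unsigned long long num, unsigned long long den)
{
	assert(den != 0);
	return (double)num / (double)den;
}

/* Instructions retired per cycle. */
double demo_ipc(unsigned long long instructions, unsigned long long cycles)
{
	return demo_ratio(instructions, cycles);
}

/* Share of cycles in which the frontend delivered nothing, in percent. */
double demo_frontend_idle_pct(unsigned long long stalled,
			      unsigned long long cycles)
{
	return 100.0 * demo_ratio(stalled, cycles);
}
```

Feeding in 26,251,003,067 instructions, 22,361,333,647 cycles and
13,458,442,838 stalled frontend cycles gives roughly 1.17 insn per
cycle, 60.19% frontend cycles idle and 0.51 stalled cycles per insn, so
the annotations are consistent with the counters.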
Thanks for the link!
It's definitely weird that my frontend (fetch and decode) stalled-cycle
count is so high.
I assume this xdpsock code is small and should all fit into the icache.
However, doing another perf stat on xdpsock l2fwd shows

13,720,109,581  stalled-cy

Re: [RFC PATCH 00/24] Introducing AF_XDP support

2018-03-26 Thread William Tu

Hi Jesper,

Thanks a lot for your prompt reply.

>> Hi,
>> I also did an evaluation of AF_XDP, however the performance isn't as
>> good as above.
>> I'd like to share the result and see if there are some tuning suggestions.
>>
>> System:
>> 16 core, Intel(R) Xeon(R) CPU E5-2440 v2 @ 1.90GHz
>> Intel 10G X540-AT2 ---> so I can only run XDP_SKB mode
>
> Hmmm, why is X540-AT2 not able to use XDP natively?

Because I'm only able to use the ixgbe driver for this NIC,
and the AF_XDP patch set only has i40e support?

>
>> AF_XDP performance:
>> Benchmark   XDP_SKB
>> rxdrop      1.27 Mpps
>> txpush      0.99 Mpps
>> l2fwd       0.85 Mpps
>
> Definitely too low...
>
I did another run, and rxdrop seems better.
Benchmark   XDP_SKB
rxdrop      2.3 Mpps
txpush      1.05 Mpps
l2fwd       0.90 Mpps

> What is the performance if you drop packets via iptables?
>
> Command:
>  $ iptables -t raw -I PREROUTING -p udp --dport 9 --j DROP
>
I did
# iptables -t raw -I PREROUTING -p udp -i enp10s0f0 -j DROP
# iptables -nvL -t raw; sleep 10; iptables -nvL -t raw

and I got 2.9Mpps.

>> NIC configuration:
>> the command
>> "ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 action 16"
>> doesn't work on my ixgbe driver, so I use ntuple:
>>
>> ethtool -K enp10s0f0 ntuple on
>> ethtool -U enp10s0f0 flow-type udp4 src-ip 10.1.1.100 action 1
>> then
>> echo 1 > /proc/sys/net/core/bpf_jit_enable
>> ./xdpsock -i enp10s0f0 -r -S --queue=1
>>
>> I also take a look at perf result:
>> For rxdrop:
>> 86.56%  xdpsock xdpsock   [.] main
>>   9.22%  xdpsock  [kernel.vmlinux]  [k] nmi
>>   4.23%  xdpsock  xdpsock [.] xq_enq
>
> It looks very strange that you see non-maskable interrupt's (NMI) being
> this high...
>
Yes, that's weird. Looking at the perf annotate output for nmi,
it shows 100% spent on a nop instruction.

>
>> For l2fwd:
>>  20.81%  xdpsock xdpsock [.] main
>>  10.64%  xdpsock [kernel.vmlinux][k] clflush_cache_range
>
> Oh, clflush_cache_range is being called!

I thought clflush_cache_range was high because we have many smp_rmb, smp_wmb
calls in the xdpsock queue/ring management user space code.
(perf shows that 75% of this 10.64% is spent on the mfence instruction.)

> Does your system use an IOMMU?
>
Yes.
With CONFIG_INTEL_IOMMU=y,
and I saw some related functions being called (e.g. intel_alloc_iova).

>>   8.46%  xdpsock  [kernel.vmlinux][k] xsk_sendmsg
>>   6.72%  xdpsock  [kernel.vmlinux][k] skb_set_owner_w
>>   5.89%  xdpsock  [kernel.vmlinux][k] __domain_mapping
>>   5.74%  xdpsock  [kernel.vmlinux][k] alloc_skb_with_frags
>>   4.62%  xdpsock  [kernel.vmlinux][k] netif_skb_features
>>   3.96%  xdpsock  [kernel.vmlinux][k] ___slab_alloc
>>   3.18%  xdpsock  [kernel.vmlinux][k] nmi
>
> Again high count for NMI ?!?
>
> Maybe you just forgot to tell perf that you want it to decode the
> bpf_prog correctly?
>
> https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#perf-tool-symbols
>
> Enable via:
>  $ sysctl net/core/bpf_jit_kallsyms=1
>
> And use perf report (while BPF is STILL LOADED):
>
>  $ perf report --kallsyms=/proc/kallsyms
>
> E.g. for emailing this you can use this command:
>
>  $ perf report --sort cpu,comm,dso,symbol --kallsyms=/proc/kallsyms --no-children --stdio -g none | head -n 40
>

Thanks, I followed the steps, the result of l2fwd
# Total Lost Samples: 119
#
# Samples: 2K of event 'cycles:ppp'
# Event count (approx.): 25675705627
#
# Overhead  CPU  Command  Shared Object     Symbol
# ........  ...  .......  ................  ......
#
  10.48%  013  xdpsock  xdpsock           [.] main
   9.77%  013  xdpsock  [kernel.vmlinux]  [k] clflush_cache_range
   8.45%  013  xdpsock  [kernel.vmlinux]  [k] nmi
   8.07%  013  xdpsock  [kernel.vmlinux]  [k] xsk_sendmsg
   7.81%  013  xdpsock  [kernel.vmlinux]  [k] __domain_mapping
   4.95%  013  xdpsock  [kernel.vmlinux]  [k] ixgbe_xmit_frame_ring
   4.66%  013  xdpsock  [kernel.vmlinux]  [k] skb_store_bits
   4.39%  013  xdpsock  [kernel.vmlinux]  [k] syscall_return_via_sysret
   3.93%  013  xdpsock  [kernel.vmlinux]  [k] pfn_to_dma_pte
   2.62%  013  xdpsock  [kernel.vmlinux]  [k] __intel_map_single
   2.53%  013  xdpsock  [kernel.vmlinux]  [k] __alloc_skb
   2.36%  013  xdpsock  [kernel.vmlinux]  [k] iommu_no_mapping
   2.21%  013  xdpsock  [kernel.vmlinux]  [k] alloc_skb_with_frags
   2.07%  013  xdpsock  [kernel.vmlinux]  [k] skb_set_owner_w
   1.98%  013  xdpsock  [kernel.vmlinux]  [k] __kmalloc_node_track_caller
   1.94%  013  xdpsock  [kernel.vmlinux]  [k] ksize
   1.84%  013  xdpsock  [kernel.vmlinux]  [k] validate_xmit_skb_list
   1.62%  013  xdpsock  [kernel.vmlinux]  [k] kmem_cache_alloc_node
   1.48%  013  xdpsock  [kernel.vmlinux]  [k] __kmalloc_reserve.isra.37
   1.21%  013  xdpsock  xdpsock           [.] xq_enq
   1.08%  013  xdpsock  [kernel.vmlinux]  [k] intel_alloc_iova

And l2fwd under "perf

Re: [RFC PATCH 00/24] Introducing AF_XDP support

2018-03-26 Thread William Tu

On Wed, Jan 31, 2018 at 5:53 AM, Björn Töpel  wrote:
> From: Björn Töpel 
>
> This RFC introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and zero-copy
> semantics. Throughput improvements can be up to 20x compared to V2 and
> V3 for the micro benchmarks included. Would be great to get your
> feedback on it. Note that this is the follow up RFC to AF_PACKET V4
> from November last year. The feedback from that RFC submission and the
> presentation at NetdevConf in Seoul was to create a new address family
> instead of building on top of AF_PACKET. AF_XDP is this new address
> family.
>
> The main difference between AF_XDP and AF_PACKET V2/V3 on a descriptor
> level is that TX and RX descriptors are separated from packet
> buffers. An RX or TX descriptor points to a data buffer in a packet
> buffer area. RX and TX can share the same packet buffer so that a
> packet does not have to be copied between RX and TX. Moreover, if a
> packet needs to be kept for a while due to a possible retransmit, then
> the descriptor that points to that packet buffer can be changed to
> point to another buffer and reused right away. This again avoids
> copying data.
>
> The RX and TX descriptor rings are registered with the setsockopts
> XDP_RX_RING and XDP_TX_RING, similar to AF_PACKET. The packet buffer
> area is allocated by user space and registered with the kernel using
> the new XDP_MEM_REG setsockopt. All these three areas are shared
> between user space and kernel space. The socket is then bound with a
> bind() call to a device and a specific queue id on that device, and it
> is not until bind is completed that traffic starts to flow.
>
> An XDP program can be loaded to direct part of the traffic on that
> device and queue id to user space through a new redirect action in an
> XDP program called bpf_xdpsk_redirect that redirects a packet up to
> the socket in user space. All the other XDP actions work just as
> before. Note that the current RFC requires the user to load an XDP
> program to get any traffic to user space (for example all traffic to
> user space with the one-liner program "return
> bpf_xdpsk_redirect();"). We plan on introducing a patch that removes
> this requirement and sends all traffic from a queue to user space if
> an AF_XDP socket is bound to it.
>
> AF_XDP can operate in three different modes: XDP_SKB, XDP_DRV, and
> XDP_DRV_ZC (shorthand for XDP_DRV with a zero-copy allocator as there
> is no specific mode called XDP_DRV_ZC). If the driver does not have
> support for XDP, or XDP_SKB is explicitly chosen when loading the XDP
> program, XDP_SKB mode is employed that uses SKBs together with the
> generic XDP support and copies out the data to user space. This is a
> fallback mode that works for any network device. On the other hand, if the
> driver has support for XDP (all three NDOs: ndo_bpf, ndo_xdp_xmit and
> ndo_xdp_flush), these NDOs, without any modifications, will be used by
> the AF_XDP code to provide better performance, but there is still a
> copy of the data into user space. The last mode, XDP_DRV_ZC, is XDP
> driver support with the zero-copy user space allocator that provides
> even better performance. In this mode, the networking HW (or SW driver
> if it is a virtual driver like veth) DMAs/puts packets straight into
> the packet buffer that is shared between user space and kernel
> space. The RX and TX descriptor queues of the networking HW are NOT
> shared to user space. Only the kernel can read and write these and it
> is the kernel driver's responsibility to translate these HW specific
> descriptors to the HW agnostic ones in the virtual descriptor rings
> that user space sees. This way, a malicious user space program cannot
> mess with the networking HW. This mode though requires some extensions
> to XDP.
>
> To get the XDP_DRV_ZC mode to work for RX, we chose to introduce a
> buffer pool concept so that the same XDP driver code can be used for
> buffers allocated using the page allocator (XDP_DRV), the user-space
> zero-copy allocator (XDP_DRV_ZC), or some internal driver specific
> allocator/cache/recycling mechanism. The ndo_bpf call has also been
> extended with two commands for registering and unregistering an XSK
> socket and is in the RX case mainly used to communicate some
> information about the user-space buffer pool to the driver.
>
> For the TX path, our plan was to use ndo_xdp_xmit and ndo_xdp_flush,
> but we ran into problems with this (further discussion in the
> challenges section) and had to introduce a new NDO called
> ndo_xdp_xmit_xsk (xsk = XDP socket). It takes a pointer to a netdevice
> and an explicit queue id that packets should be sent out on. In
> contrast to ndo_xdp_xmit, it is asynchronous and pulls packets to be
> sent from the xdp socket (associated with the dev and queue
> combination that was provided with the NDO call) using a callback
> (get_tx_packet), and

Re: [PATCH][next] gre: fix TUNNEL_SEQ bit check on sequence numbering

2018-03-21 Thread William Tu

On Wed, Mar 21, 2018 at 12:34 PM, Colin King <colin.k...@canonical.com> wrote:
> From: Colin Ian King <colin.k...@canonical.com>
>
> The current logic of flags | TUNNEL_SEQ is always non-zero and hence
> sequence numbers are always incremented no matter the setting of the
> TUNNEL_SEQ bit.  Fix this by using & instead of |.
>
> Detected by CoverityScan, CID#1466039 ("Operands don't affect result")
>
> Fixes: 77a5196a804e ("gre: add sequence number for collect md mode.")
> Signed-off-by: Colin Ian King <colin.k...@canonical.com>

Thanks for the fix!
btw, how can I access the CoverityScan result with this CID?

Acked-by: William Tu <u9012...@gmail.com>
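
For anyone skimming the thread, the difference between the two operators
is easy to show in isolation. The flag values below are illustrative
placeholders, not the kernel's TUNNEL_* definitions.

```c
#include <assert.h>

/* Illustrative flag bits only. */
#define DEMO_TUNNEL_CSUM	0x01u
#define DEMO_TUNNEL_KEY		0x04u
#define DEMO_TUNNEL_SEQ		0x08u

/* OR-ing in a non-zero constant can never produce zero. */
static_assert((0u | DEMO_TUNNEL_SEQ) != 0, "OR with a set bit is non-zero");

/* Pre-patch test: true for every value of flags. */
int demo_seq_check_buggy(unsigned int flags)
{
	return (flags | DEMO_TUNNEL_SEQ) != 0;
}

/* Post-patch test: true only when the SEQ bit is actually set. */
int demo_seq_check_fixed(unsigned int flags)
{
	return (flags & DEMO_TUNNEL_SEQ) != 0;
}
```

So with the old expression the sequence number was incremented even for
flags == 0, which is exactly what the commit message describes.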


> ---
>  net/ipv4/ip_gre.c  | 2 +-
>  net/ipv6/ip6_gre.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
> index 2fa2ef2e2af9..9ab1aa2f7660 100644
> --- a/net/ipv4/ip_gre.c
> +++ b/net/ipv4/ip_gre.c
> @@ -550,7 +550,7 @@ static void gre_fb_xmit(struct sk_buff *skb, struct net_device *dev,
> (TUNNEL_CSUM | TUNNEL_KEY | TUNNEL_SEQ);
> gre_build_header(skb, tunnel_hlen, flags, proto,
>  tunnel_id_to_key32(tun_info->key.tun_id),
> -(flags | TUNNEL_SEQ) ? htonl(tunnel->o_seqno++) : 0);
> +(flags & TUNNEL_SEQ) ? htonl(tunnel->o_seqno++) : 0);
>
> df = key->tun_flags & TUNNEL_DONT_FRAGMENT ?  htons(IP_DF) : 0;
>
> diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
> index 0bcefc480aeb..3a98c694da5f 100644
> --- a/net/ipv6/ip6_gre.c
> +++ b/net/ipv6/ip6_gre.c
> @@ -725,7 +725,7 @@ static netdev_tx_t __gre6_xmit(struct sk_buff *skb,
> gre_build_header(skb, tunnel->tun_hlen,
>  flags, protocol,
>  tunnel_id_to_key32(tun_info->key.tun_id),
> -(flags | TUNNEL_SEQ) ? htonl(tunnel->o_seqno++)
> +(flags & TUNNEL_SEQ) ? htonl(tunnel->o_seqno++)
>   : 0);
>
> } else {
> --
> 2.15.1
>

[PATCH net 3/3] ip6erspan: make sure enough headroom at xmit.

2018-03-09 Thread William Tu

The patch adds skb_cow_head() to ensure enough headroom
at ip6erspan_tunnel_xmit before pushing the erspan header
to the skb.

Signed-off-by: William Tu <u9012...@gmail.com>
---
 net/ipv6/ip6_gre.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 4ab476d3a46e..9a759bbbd8a6 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -906,6 +906,9 @@ static netdev_tx_t ip6erspan_tunnel_xmit(struct sk_buff *skb,
truncate = true;
}
 
+   if (skb_cow_head(skb, dev->needed_headroom))
+   goto tx_err;
+
t->parms.o_flags &= ~TUNNEL_KEY;
IPCB(skb)->flags = 0;
 
-- 
2.7.4

[PATCH net 2/3] ip6erspan: improve error handling for erspan version number.

2018-03-09 Thread William Tu

When users fill in an incorrect erspan version number through
the struct erspan_metadata uapi, the current code skips pushing
the erspan header but continues pushing the gre header, which
is incorrect.  The patch fixes it by returning an error.

Signed-off-by: William Tu <u9012...@gmail.com>
---
 net/ipv6/ip6_gre.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index a056c2bb4b9a..4ab476d3a46e 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -948,6 +948,8 @@ static netdev_tx_t ip6erspan_tunnel_xmit(struct sk_buff *skb,
   md->u.md2.dir,
   get_hwid(&md->u.md2),
   truncate, false);
+   } else {
+   goto tx_err;
}
} else {
switch (skb->protocol) {
-- 
2.7.4

[PATCH net 0/3] a couple of erspan fixes

2018-03-09 Thread William Tu

The series fixes a couple of erspan issues.
The first patch adds the erspan v2 proto type to the ip6 tunnel lookup.
The second patch improves the error handling when users supply an invalid
version number in the metadata.  The final patch makes sure the skb has
enough headroom for pushing the erspan header at xmit.

William Tu (3):
  ip6gre: add erspan v2 to tunnel lookup
  ip6erspan: improve error handling for erspan version number.
  ip6erspan: make sure enough headroom at xmit.

 net/ipv6/ip6_gre.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

-- 
2.7.4

[PATCH net 1/3] ip6gre: add erspan v2 to tunnel lookup

2018-03-09 Thread William Tu

The patch adds the erspan v2 proto in ip6gre_tunnel_lookup
so the erspan v2 tunnel can be found correctly.

Signed-off-by: William Tu <u9012...@gmail.com>
---
 net/ipv6/ip6_gre.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 18a3dfbd0300..a056c2bb4b9a 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -126,7 +126,8 @@ static struct ip6_tnl *ip6gre_tunnel_lookup(struct net_device *dev,
struct ip6_tnl *t, *cand = NULL;
struct ip6gre_net *ign = net_generic(net, ip6gre_net_id);
int dev_type = (gre_proto == htons(ETH_P_TEB) ||
-   gre_proto == htons(ETH_P_ERSPAN)) ?
+   gre_proto == htons(ETH_P_ERSPAN) ||
+   gre_proto == htons(ETH_P_ERSPAN2)) ?
   ARPHRD_ETHER : ARPHRD_IP6GRE;
int score, cand_score = 4;
 
-- 
2.7.4
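
The effect of the lookup change can be sketched outside the kernel as
below. The DEMO_* constants are local to this example (the numeric values
are what I believe the uapi headers use, please double check them), and
the function takes the protocol in host byte order to keep the example
free of htons().

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the kernel constants used by the lookup. */
#define DEMO_ETH_P_TEB		0x6558
#define DEMO_ETH_P_ERSPAN	0x88BE
#define DEMO_ETH_P_ERSPAN2	0x22EB
#define DEMO_ARPHRD_ETHER	1
#define DEMO_ARPHRD_IP6GRE	823

static_assert(DEMO_ARPHRD_ETHER != DEMO_ARPHRD_IP6GRE,
	      "the two device types must be distinguishable");

/* Device type the tunnel lookup matches against for a given GRE proto. */
int demo_lookup_dev_type(uint16_t gre_proto)
{
	return (gre_proto == DEMO_ETH_P_TEB ||
		gre_proto == DEMO_ETH_P_ERSPAN ||
		gre_proto == DEMO_ETH_P_ERSPAN2) ?
	       DEMO_ARPHRD_ETHER : DEMO_ARPHRD_IP6GRE;
}
```

Without the third comparison an erspan v2 packet would be matched against
ARPHRD_IP6GRE devices and the Ethernet-type tunnel would not be found.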

[PATCHv2 net-next] openvswitch: fix vport packet length check.

2018-03-07 Thread William Tu

When sending a packet to a tunnel device, the dev's hard_header_len
could be larger than the skb->len in function packet_length().
In the case of ip6gretap/erspan, hard_header_len = LL_MAX_HEADER + t_hlen,
which is around 180, and an ARP packet sent to this tunnel has
skb->len = 42.  The subtraction result is negative, so the 'unsigned int
length' wraps around to a very large value, causing the later ovs_vport_send
to drop the packet for exceeding the MTU.  The patch fixes it by clamping
the length to 0.

Signed-off-by: William Tu <u9012...@gmail.com>
---
v1->v2:
  replace the return type from unsigned int to int
---
 net/openvswitch/vport.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index b6c8524032a0..f81c1d0ddff4 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -464,10 +464,10 @@ int ovs_vport_receive(struct vport *vport, struct sk_buff *skb,
return 0;
 }
 
-static unsigned int packet_length(const struct sk_buff *skb,
- struct net_device *dev)
+static int packet_length(const struct sk_buff *skb,
+struct net_device *dev)
 {
-   unsigned int length = skb->len - dev->hard_header_len;
+   int length = skb->len - dev->hard_header_len;
 
if (!skb_vlan_tag_present(skb) &&
eth_type_vlan(skb->protocol))
@@ -478,7 +478,7 @@ static unsigned int packet_length(const struct sk_buff *skb,
 * account for 802.1ad. e.g. is_skb_forwardable().
 */
 
-   return length;
+   return length > 0 ? length : 0;
 }
 
 void ovs_vport_send(struct vport *vport, struct sk_buff *skb, u8 mac_proto)
-- 
2.7.4

Re: [PATCH net-next] openvswitch: fix vport packet length check.

2018-03-07 Thread William Tu

On Wed, Mar 7, 2018 at 1:18 PM, Pravin Shelar <pshe...@ovn.org> wrote:
> On Tue, Mar 6, 2018 at 5:56 PM, William Tu <u9012...@gmail.com> wrote:
>> When sending a packet to a tunnel device, the dev's hard_header_len
>> could be larger than the skb->len in function packet_length().
>> In the case of ip6gretap/erspan, hard_header_len = LL_MAX_HEADER + t_hlen,
>> which is around 180, and an ARP packet sent to this tunnel has
>> skb->len = 42.  The subtraction result is negative, so the 'unsigned int
>> length' wraps around to a very large value, causing the later ovs_vport_send
>> to drop the packet for exceeding the MTU.  The patch fixes it by clamping
>> the length to 0.
>>
>> Signed-off-by: William Tu <u9012...@gmail.com>
>> ---
>>  net/openvswitch/vport.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
>> index b6c8524032a0..7718d5b4cf8a 100644
>> --- a/net/openvswitch/vport.c
>> +++ b/net/openvswitch/vport.c
>> @@ -467,7 +467,7 @@ int ovs_vport_receive(struct vport *vport, struct sk_buff *skb,
>>  static unsigned int packet_length(const struct sk_buff *skb,
>>   struct net_device *dev)
> Can you also change return type of this function?
>

OK, I will change to int. Thanks
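
For the record, the wraparound is easy to reproduce in user space. This
is a simplified stand-in for packet_length() that leaves out the VLAN
adjustment and takes plain integers instead of the skb and netdev; the
function names are mine.

```c
#include <assert.h>
#include <limits.h>

/* Pre-patch behaviour: the subtraction is done in unsigned arithmetic. */
unsigned int demo_packet_length_old(unsigned int skb_len,
				    unsigned int hard_header_len)
{
	return skb_len - hard_header_len;
}

/* Post-patch behaviour: signed subtraction, clamped at zero. */
int demo_packet_length_new(unsigned int skb_len, unsigned int hard_header_len)
{
	int length;

	assert(skb_len <= INT_MAX && hard_header_len <= INT_MAX);
	length = (int)skb_len - (int)hard_header_len;
	return length > 0 ? length : 0;
}
```

For the ARP case from the commit message (skb->len = 42, hard_header_len
around 180) the old version returns a value close to UINT_MAX, which then
fails the MTU comparison, while the new version returns 0.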

[PATCH net-next] openvswitch: fix vport packet length check.

2018-03-06 Thread William Tu

When sending a packet to a tunnel device, the dev's hard_header_len
could be larger than the skb->len in function packet_length().
In the case of ip6gretap/erspan, hard_header_len = LL_MAX_HEADER + t_hlen,
which is around 180, and an ARP packet sent to this tunnel has
skb->len = 42.  The subtraction result is negative, so the 'unsigned int
length' wraps around to a very large value, causing the later ovs_vport_send
to drop the packet for exceeding the MTU.  The patch fixes it by clamping
the length to 0.

Signed-off-by: William Tu <u9012...@gmail.com>
---
 net/openvswitch/vport.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index b6c8524032a0..7718d5b4cf8a 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -467,7 +467,7 @@ int ovs_vport_receive(struct vport *vport, struct sk_buff *skb,
 static unsigned int packet_length(const struct sk_buff *skb,
  struct net_device *dev)
 {
-   unsigned int length = skb->len - dev->hard_header_len;
+   int length = skb->len - dev->hard_header_len;
 
if (!skb_vlan_tag_present(skb) &&
eth_type_vlan(skb->protocol))
@@ -478,7 +478,7 @@ static unsigned int packet_length(const struct sk_buff *skb,
 * account for 802.1ad. e.g. is_skb_forwardable().
 */
 
-   return length;
+   return length > 0 ? length : 0;
 }
 
 void ovs_vport_send(struct vport *vport, struct sk_buff *skb, u8 mac_proto)
-- 
2.7.4
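
The wraparound described in the commit message can be reproduced in
user space.  What follows is a minimal sketch, not the kernel code: the
two function names are hypothetical, and plain integers stand in for
skb->len and dev->hard_header_len.

```c
/* Hypothetical stand-ins: skb_len for skb->len, hard_header_len for
 * dev->hard_header_len. */

/* Old behaviour: the subtraction is stored in an unsigned int, so a
 * header longer than the packet wraps around to a huge value. */
static unsigned int packet_length_old(unsigned int skb_len,
				      unsigned int hard_header_len)
{
	unsigned int length = skb_len - hard_header_len;

	return length;
}

/* Patched behaviour: compute the difference as a signed int and clamp
 * negative results to zero. */
static unsigned int packet_length_new(unsigned int skb_len,
				      unsigned int hard_header_len)
{
	int length = (int)skb_len - (int)hard_header_len;

	return length > 0 ? (unsigned int)length : 0;
}
```

With skb_len = 42 and hard_header_len = 180, the old variant returns a
value far above any MTU, while the new variant returns 0.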

Re: [PATCH net-next] selftests: rtnetlink: remove testns on test fail

2018-03-02 Thread William Tu

On Thu, Mar 1, 2018 at 6:22 PM, Prashant Bhole
<bhole_prashant...@lab.ntt.co.jp> wrote:
> This patch removes testns after test failure so that next test can
> continue with clean ns
>
> Signed-off-by: Prashant Bhole <bhole_prashant...@lab.ntt.co.jp>

Thanks for the fix.

Acked-by: William Tu <u9012...@gmail.com>

Re: help on iproute2 hangs

2018-03-01 Thread William Tu

>
> I still can not reproduce the hang, but try this and see if it fixes
> your problem (whitespace damaged on paste):
>
> diff --git a/lib/libnetlink.c b/lib/libnetlink.c
> index 7ca47b22581a..9d692afbc740 100644
> --- a/lib/libnetlink.c
> +++ b/lib/libnetlink.c
> @@ -670,8 +672,9 @@ static int __rtnl_talk_iov(struct rtnl_handle *rtnl,
> struct iovec *iov,
> free(buf);
> if (h->nlmsg_seq == seq)
> return 0;
> -   else
> +   else if (i < iovlen)
> goto next;
> +   return 0;
> }
>
> if (rtnl->proto != NETLINK_SOCK_DIAG &&

Yes, this solves our problem.
Thanks a lot!
William

[PATCHv2 net-next 0/2] gre: add sequence number for collect md mode.

2018-03-01 Thread William Tu

Currently GRE sequence number can only be used in native tunnel mode.
The first patch adds sequence number support for gre collect
metadata mode, and the second patch tests it using BPF.

RFC2890 defines GRE sequence number to be specific to the traffic
flow identified by the key.  However, this patch does not implement
per-key seqno.  The sequence number is shared within the same tunnel
device. That is, different tunnel keys using the same collect_md
tunnel share a single sequence number.

A new BPF uapi tunnel flag 'BPF_F_SEQ_NUMBER' is added.
--
v1->v2:
  rename BPF_F_GRE_SEQ to BPF_F_SEQ_NUMBER, as suggested by Daniel
--
William Tu (2):
  gre: add sequence number for collect md mode.
  samples/bpf: add gre sequence number test.

 include/uapi/linux/bpf.h   |  1 +
 net/core/filter.c  |  4 +++-
 net/ipv4/ip_gre.c  |  7 +--
 net/ipv6/ip6_gre.c | 13 -
 samples/bpf/tcbpf2_kern.c  |  6 --
 samples/bpf/test_tunnel_bpf.sh |  5 +++--
 6 files changed, 24 insertions(+), 12 deletions(-)

-- 
2.7.4
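
To illustrate the sharing described in this cover letter, here is a
small user-space sketch.  It is not the kernel implementation: struct
fake_tunnel, FAKE_TUNNEL_SEQ and next_seqno() are hypothetical names,
byte-order conversion is left out, and the key is ignored on purpose to
show that every key draws from the one per-device counter.

```c
#include <stdint.h>

#define FAKE_TUNNEL_SEQ 0x08	/* hypothetical flag bit, not the kernel value */

struct fake_tunnel {
	uint32_t o_seqno;	/* one output counter per tunnel device */
};

/* Return the sequence number to put in the header, or 0 when the flag
 * is not set.  The key does not select a counter, so all keys on the
 * same device share o_seqno. */
static uint32_t next_seqno(struct fake_tunnel *t, uint32_t key,
			   unsigned int flags)
{
	(void)key;	/* deliberately unused: no per-key state */

	return (flags & FAKE_TUNNEL_SEQ) ? t->o_seqno++ : 0;
}
```

Two packets sent with different keys over the same device therefore
get consecutive sequence numbers, and a packet sent without the flag
does not advance the counter.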

[PATCHv2 net-next 2/2] samples/bpf: add gre sequence number test.

2018-03-01 Thread William Tu

The patch adds tests for GRE sequence number
support for metadata mode tunnel.

Signed-off-by: William Tu <u9012...@gmail.com>
---
 samples/bpf/tcbpf2_kern.c  | 6 --
 samples/bpf/test_tunnel_bpf.sh | 5 +++--
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/samples/bpf/tcbpf2_kern.c b/samples/bpf/tcbpf2_kern.c
index efdc16d195ff..9a8db7bd6db4 100644
--- a/samples/bpf/tcbpf2_kern.c
+++ b/samples/bpf/tcbpf2_kern.c
@@ -52,7 +52,8 @@ int _gre_set_tunnel(struct __sk_buff *skb)
key.tunnel_tos = 0;
key.tunnel_ttl = 64;
 
-   ret = bpf_skb_set_tunnel_key(skb, , sizeof(key), 
BPF_F_ZERO_CSUM_TX);
+   ret = bpf_skb_set_tunnel_key(skb, , sizeof(key),
+BPF_F_ZERO_CSUM_TX | BPF_F_SEQ_NUMBER);
if (ret < 0) {
ERROR(ret);
return TC_ACT_SHOT;
@@ -92,7 +93,8 @@ int _ip6gretap_set_tunnel(struct __sk_buff *skb)
key.tunnel_label = 0xabcde;
 
ret = bpf_skb_set_tunnel_key(skb, , sizeof(key),
-BPF_F_TUNINFO_IPV6 | BPF_F_ZERO_CSUM_TX);
+BPF_F_TUNINFO_IPV6 | BPF_F_ZERO_CSUM_TX |
+BPF_F_SEQ_NUMBER);
if (ret < 0) {
ERROR(ret);
return TC_ACT_SHOT;
diff --git a/samples/bpf/test_tunnel_bpf.sh b/samples/bpf/test_tunnel_bpf.sh
index 43ce049996ee..c265863ccdf9 100755
--- a/samples/bpf/test_tunnel_bpf.sh
+++ b/samples/bpf/test_tunnel_bpf.sh
@@ -23,7 +23,8 @@ function config_device {
 function add_gre_tunnel {
# in namespace
ip netns exec at_ns0 \
-   ip link add dev $DEV_NS type $TYPE key 2 local 172.16.1.100 
remote 172.16.1.200
+ip link add dev $DEV_NS type $TYPE seq key 2 \
+   local 172.16.1.100 remote 172.16.1.200
ip netns exec at_ns0 ip link set dev $DEV_NS up
ip netns exec at_ns0 ip addr add dev $DEV_NS 10.1.1.100/24
 
@@ -43,7 +44,7 @@ function add_ip6gretap_tunnel {
 
# in namespace
ip netns exec at_ns0 \
-   ip link add dev $DEV_NS type $TYPE flowlabel 0xbcdef key 2 \
+   ip link add dev $DEV_NS type $TYPE seq flowlabel 0xbcdef key 2 \
local ::11 remote ::22
 
ip netns exec at_ns0 ip addr add dev $DEV_NS 10.1.1.100/24
-- 
2.7.4

[PATCHv2 net-next 1/2] gre: add sequence number for collect md mode.

2018-03-01 Thread William Tu

Currently GRE sequence number can only be used in native
tunnel mode.  This patch adds sequence number support for
gre collect metadata mode.  RFC2890 defines GRE sequence
number to be specific to the traffic flow identified by the
key.  However, this patch does not implement per-key seqno.
The sequence number is shared within the same tunnel device.
That is, different tunnel keys using the same collect_md
tunnel share a single sequence number.

Signed-off-by: William Tu <u9012...@gmail.com>
---
 include/uapi/linux/bpf.h |  1 +
 net/core/filter.c|  4 +++-
 net/ipv4/ip_gre.c|  7 +--
 net/ipv6/ip6_gre.c   | 13 -
 4 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index db6bdc375126..2a66769e5875 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -800,6 +800,7 @@ enum bpf_func_id {
 /* BPF_FUNC_skb_set_tunnel_key flags. */
 #define BPF_F_ZERO_CSUM_TX (1ULL << 1)
 #define BPF_F_DONT_FRAGMENT(1ULL << 2)
+#define BPF_F_SEQ_NUMBER   (1ULL << 3)
 
 /* BPF_FUNC_perf_event_output, BPF_FUNC_perf_event_read and
  * BPF_FUNC_perf_event_read_value flags.
diff --git a/net/core/filter.c b/net/core/filter.c
index 0c121adbdbaa..33edfa8372fd 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2991,7 +2991,7 @@ BPF_CALL_4(bpf_skb_set_tunnel_key, struct sk_buff *, skb,
struct ip_tunnel_info *info;
 
if (unlikely(flags & ~(BPF_F_TUNINFO_IPV6 | BPF_F_ZERO_CSUM_TX |
-  BPF_F_DONT_FRAGMENT)))
+  BPF_F_DONT_FRAGMENT | BPF_F_SEQ_NUMBER)))
return -EINVAL;
if (unlikely(size != sizeof(struct bpf_tunnel_key))) {
switch (size) {
@@ -3025,6 +3025,8 @@ BPF_CALL_4(bpf_skb_set_tunnel_key, struct sk_buff *, skb,
info->key.tun_flags |= TUNNEL_DONT_FRAGMENT;
if (flags & BPF_F_ZERO_CSUM_TX)
info->key.tun_flags &= ~TUNNEL_CSUM;
+   if (flags & BPF_F_SEQ_NUMBER)
+   info->key.tun_flags |= TUNNEL_SEQ;
 
info->key.tun_id = cpu_to_be64(from->tunnel_id);
info->key.tos = from->tunnel_tos;
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 0fe1d69b5df4..95fd225f402e 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -522,6 +522,7 @@ static struct rtable *prepare_fb_xmit(struct sk_buff *skb,
 static void gre_fb_xmit(struct sk_buff *skb, struct net_device *dev,
__be16 proto)
 {
+   struct ip_tunnel *tunnel = netdev_priv(dev);
struct ip_tunnel_info *tun_info;
const struct ip_tunnel_key *key;
struct rtable *rt = NULL;
@@ -545,9 +546,11 @@ static void gre_fb_xmit(struct sk_buff *skb, struct 
net_device *dev,
if (gre_handle_offloads(skb, !!(tun_info->key.tun_flags & TUNNEL_CSUM)))
goto err_free_rt;
 
-   flags = tun_info->key.tun_flags & (TUNNEL_CSUM | TUNNEL_KEY);
+   flags = tun_info->key.tun_flags &
+   (TUNNEL_CSUM | TUNNEL_KEY | TUNNEL_SEQ);
gre_build_header(skb, tunnel_hlen, flags, proto,
-tunnel_id_to_key32(tun_info->key.tun_id), 0);
+tunnel_id_to_key32(tun_info->key.tun_id),
+(flags | TUNNEL_SEQ) ? htonl(tunnel->o_seqno++) : 0);
 
df = key->tun_flags & TUNNEL_DONT_FRAGMENT ?  htons(IP_DF) : 0;
 
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 4f150a394387..16c5dfcbd195 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -695,9 +695,6 @@ static netdev_tx_t __gre6_xmit(struct sk_buff *skb,
else
fl6->daddr = tunnel->parms.raddr;
 
-   if (tunnel->parms.o_flags & TUNNEL_SEQ)
-   tunnel->o_seqno++;
-
/* Push GRE header. */
protocol = (dev->type == ARPHRD_ETHER) ? htons(ETH_P_TEB) : proto;
 
@@ -720,14 +717,20 @@ static netdev_tx_t __gre6_xmit(struct sk_buff *skb,
fl6->flowi6_uid = sock_net_uid(dev_net(dev), NULL);
 
dsfield = key->tos;
-   flags = key->tun_flags & (TUNNEL_CSUM | TUNNEL_KEY);
+   flags = key->tun_flags &
+   (TUNNEL_CSUM | TUNNEL_KEY | TUNNEL_SEQ);
tunnel->tun_hlen = gre_calc_hlen(flags);
 
gre_build_header(skb, tunnel->tun_hlen,
 flags, protocol,
-tunnel_id_to_key32(tun_info->key.tun_id), 0);
+tunnel_id_to_key32(tun_info->key.tun_id),
+(flags | TUNNEL_SEQ) ? htonl(tunnel->o_seqno++)
+ : 0);
 
} else {
+   if (tunnel->parms.o_flags & TUNNEL_SEQ)
+   tun

Re: help on iproute2 hangs

2018-03-01 Thread William Tu

On Thu, Mar 1, 2018 at 10:36 AM, David Ahern <dsah...@gmail.com> wrote:
> On 3/1/18 10:29 AM, William Tu wrote:
>> Hi,
>>
>> We're running commands below on kernel 4.15.0:
>> 1) ip netns add at_ns0
>> 2) ip link add p0 type veth peer name ovs-p0
>> 3) ip link set p0 netns at_ns0
>> 4) ip link set dev ovs-p0 up
>
> # uname -a
> Linux kenny-jessie3 4.16.0-rc2+ #162 SMP Thu Mar 1 08:48:58 PST 2018
> x86_64 GNU/Linux
>
> # bash -x /tmp/2
> + ip netns add at_ns0
> + ip link add p0 type veth peer name ovs-p0
> + ip link set p0 netns at_ns0
> + ip link set dev ovs-p0 up
>
> Works fine for me on top of tree.
>
> What is the output of 'cat /proc//stack' when it hangs?
>
root@osb:~/iproute2# ps aux | grep ip
root   3652  0.0  0.0  11532   884 pts/24   S+   10:43   0:00 ip
link add p0 type veth peer name ovs-p0

root@osb:~/iproute2# cat /proc/3652/stack
[<0>] __skb_wait_for_more_packets+0x103/0x160
[<0>] __skb_recv_datagram+0x69/0xc0
[<0>] skb_recv_datagram+0x3f/0x60
[<0>] netlink_recvmsg+0x59/0x420
[<0>] ___sys_recvmsg+0xee/0x230
[<0>] __sys_recvmsg+0x4e/0x90
[<0>] entry_SYSCALL_64_fastpath+0x24/0x87
[<0>] 0x

If I run strace on "ip link add p0 type veth peer name ovs-p0":
open("/usr/lib/ip/link_veth.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No
such file or directory)
sendmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
groups=},
msg_iov(1)=[{"X\0\0\0\20\0\5\6\315J\230Z\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
88}], msg_controllen=0, msg_flags=0}, 0) = 88
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
groups=}, msg_iov(1)=[{NULL, 0}], msg_controllen=0,
msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 36
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
groups=},
msg_iov(1)=[{"$\0\0\0\2\0\0\1\315J\230Z1\24\0\0\0\0\0\0X\0\0\0\20\0\5\6\315J\230Z"...,
36}], msg_controllen=0, msg_flags=0}, 0) = 36

Thanks a lot
William

help on iproute2 hangs

2018-03-01 Thread William Tu

Hi,

We're running the commands below on kernel 4.15.0:
1) ip netns add at_ns0
2) ip link add p0 type veth peer name ovs-p0
3) ip link set p0 netns at_ns0
4) ip link set dev ovs-p0 up

However, it always hangs when creating the veth peer, command (2), when
we run the above commands in GNU autotest. Running the same commands in
a bash script works fine.

Running bisect on iproute2, the first bad commit is
commit 72a2ff3916e59d2132a7d613d9e8f5eb372ba43e
Author: Chris Mi 
Date:   Fri Jan 12 14:13:15 2018 +0900

   lib/libnetlink: Add a new function rtnl_talk_iov

   rtnl_talk can only send a single message to kernel. Add a new function
   rtnl_talk_iov that can send multiple messages to kernel.
   rtnl_talk_iov takes struct iovec * and iovlen as arguments.

GDB shows it hangs on
#0  0x7f9c92e6b330 in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x0044eac6 in __rtnl_recvmsg ()
#2  0x0044eb95 in rtnl_recvmsg ()
#3  0x0044ed46 in __rtnl_talk_iov ()
#4  0x0044fbb8 in rtnl_talk ()
#5  0x00422554 in iplink_modify ()

Any suggestions on how to debug this?
Thanks
William

Re: [PATCH net-next 2/2] samples/bpf: add gre sequence number test.

2018-03-01 Thread William Tu

On Thu, Mar 1, 2018 at 2:30 AM, Daniel Borkmann <dan...@iogearbox.net> wrote:
> On 03/01/2018 01:11 AM, William Tu wrote:
>> The patch adds tests for GRE sequence number
>> support for metadata mode tunnel.
>>
>> Signed-off-by: William Tu <u9012...@gmail.com>
>> ---
>>  samples/bpf/tcbpf2_kern.c  | 6 --
>>  samples/bpf/test_tunnel_bpf.sh | 4 ++--
>>  2 files changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/samples/bpf/tcbpf2_kern.c b/samples/bpf/tcbpf2_kern.c
>> index efdc16d195ff..f9d0db2be21b 100644
>> --- a/samples/bpf/tcbpf2_kern.c
>> +++ b/samples/bpf/tcbpf2_kern.c
>> @@ -52,7 +52,8 @@ int _gre_set_tunnel(struct __sk_buff *skb)
>>   key.tunnel_tos = 0;
>>   key.tunnel_ttl = 64;
>>
>> - ret = bpf_skb_set_tunnel_key(skb, , sizeof(key), 
>> BPF_F_ZERO_CSUM_TX);
>> + ret = bpf_skb_set_tunnel_key(skb, , sizeof(key),
>> +  BPF_F_ZERO_CSUM_TX | BPF_F_GRE_SEQ);
>>   if (ret < 0) {
>>   ERROR(ret);
>>   return TC_ACT_SHOT;
>> @@ -92,7 +93,8 @@ int _ip6gretap_set_tunnel(struct __sk_buff *skb)
>>   key.tunnel_label = 0xabcde;
>>
>>   ret = bpf_skb_set_tunnel_key(skb, , sizeof(key),
>> -  BPF_F_TUNINFO_IPV6 | BPF_F_ZERO_CSUM_TX);
>> +  BPF_F_TUNINFO_IPV6 | BPF_F_ZERO_CSUM_TX |
>> +  BPF_F_GRE_SEQ);
>>   if (ret < 0) {
>>   ERROR(ret);
>>   return TC_ACT_SHOT;
>> diff --git a/samples/bpf/test_tunnel_bpf.sh b/samples/bpf/test_tunnel_bpf.sh
>> index 43ce049996ee..01a07fb9efa9 100755
>> --- a/samples/bpf/test_tunnel_bpf.sh
>> +++ b/samples/bpf/test_tunnel_bpf.sh
>
> Can be as follow-up, but if you have a chance of moving this into BPF 
> kselftests,
> this would be really great. Otherwise this will get little actual test 
> coverage.
>
> Thanks,
> Daniel
>

Yes, this tunnel test is getting bigger and bigger, and it's better to
move it to kselftests.
I will work on it this month.
Thanks!
William

Re: [PATCH net-next 1/2] gre: add sequence number for collect md mode.

2018-03-01 Thread William Tu

On Thu, Mar 1, 2018 at 2:18 AM, Daniel Borkmann <dan...@iogearbox.net> wrote:
> On 03/01/2018 01:11 AM, William Tu wrote:
>> Currently GRE sequence number can only be used in native
>> tunnel mode.  This patch adds sequence number support for
>> gre collect metadata mode.  RFC2890 defines GRE sequence
>> number to be specific to the traffic flow identified by the
>> key.  However, this patch does not implement per-key seqno.
>> The sequence number is shared within the same tunnel device.
>> That is, different tunnel keys using the same collect_md
>> tunnel share a single sequence number.
>>
>> Signed-off-by: William Tu <u9012...@gmail.com>
>> ---
>>  include/uapi/linux/bpf.h |  1 +
>>  net/core/filter.c|  4 +++-
>>  net/ipv4/ip_gre.c|  7 +--
>>  net/ipv6/ip6_gre.c   | 13 -
>>  4 files changed, 17 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index db6bdc375126..2c6dd942953d 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -800,6 +800,7 @@ enum bpf_func_id {
>>  /* BPF_FUNC_skb_set_tunnel_key flags. */
>>  #define BPF_F_ZERO_CSUM_TX   (1ULL << 1)
>>  #define BPF_F_DONT_FRAGMENT  (1ULL << 2)
>> +#define BPF_F_GRE_SEQ(1ULL << 3)
>>
>>  /* BPF_FUNC_perf_event_output, BPF_FUNC_perf_event_read and
>>   * BPF_FUNC_perf_event_read_value flags.
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 0c121adbdbaa..010305e0791a 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -2991,7 +2991,7 @@ BPF_CALL_4(bpf_skb_set_tunnel_key, struct sk_buff *, 
>> skb,
>>   struct ip_tunnel_info *info;
>>
>>   if (unlikely(flags & ~(BPF_F_TUNINFO_IPV6 | BPF_F_ZERO_CSUM_TX |
>> -BPF_F_DONT_FRAGMENT)))
>> +BPF_F_DONT_FRAGMENT | BPF_F_GRE_SEQ)))
>>   return -EINVAL;
>>   if (unlikely(size != sizeof(struct bpf_tunnel_key))) {
>>   switch (size) {
>> @@ -3025,6 +3025,8 @@ BPF_CALL_4(bpf_skb_set_tunnel_key, struct sk_buff *, 
>> skb,
>>   info->key.tun_flags |= TUNNEL_DONT_FRAGMENT;
>>   if (flags & BPF_F_ZERO_CSUM_TX)
>>   info->key.tun_flags &= ~TUNNEL_CSUM;
>> + if (flags & BPF_F_GRE_SEQ)
>> + info->key.tun_flags |= TUNNEL_SEQ;
>
> Ok, looks fine. My only minor request would be to rename BPF_F_GRE_SEQ
> into e.g. BPF_F_SEQ_NUMBER to at least not have something GRE specific
> in the name in case we could later on reuse it elsewhere as well, and
> the bpf_skb_set_tunnel_key() is unaware of the underlying encap anyway.

OK, makes sense. Thanks!
I will rename it in the next version.
William
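
For reference, the flag check quoted above rejects any bit outside the
accepted set.  A user-space sketch of that pattern follows.  The FAKE_
names and check_flags() are hypothetical; bits 1 to 3 mirror the
positions in the quoted bpf.h hunk, and bit 0 for the IPv6 flag is an
assumption made for this example.

```c
#include <errno.h>
#include <stdint.h>

#define FAKE_F_TUNINFO_IPV6	(1ULL << 0)	/* assumed bit position */
#define FAKE_F_ZERO_CSUM_TX	(1ULL << 1)
#define FAKE_F_DONT_FRAGMENT	(1ULL << 2)
#define FAKE_F_SEQ_NUMBER	(1ULL << 3)

/* Return 0 when only known flag bits are set, -EINVAL otherwise. */
static int check_flags(uint64_t flags)
{
	const uint64_t allowed = FAKE_F_TUNINFO_IPV6 | FAKE_F_ZERO_CSUM_TX |
				 FAKE_F_DONT_FRAGMENT | FAKE_F_SEQ_NUMBER;

	if (flags & ~allowed)
		return -EINVAL;

	return 0;
}
```

Adding a new flag means extending the allowed mask; without that, a
caller passing the new bit gets -EINVAL.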

[PATCH net-next 1/2] gre: add sequence number for collect md mode.

2018-02-28 Thread William Tu

Currently GRE sequence number can only be used in native
tunnel mode.  This patch adds sequence number support for
gre collect metadata mode.  RFC2890 defines GRE sequence
number to be specific to the traffic flow identified by the
key.  However, this patch does not implement per-key seqno.
The sequence number is shared within the same tunnel device.
That is, different tunnel keys using the same collect_md
tunnel share a single sequence number.

Signed-off-by: William Tu <u9012...@gmail.com>
---
 include/uapi/linux/bpf.h |  1 +
 net/core/filter.c|  4 +++-
 net/ipv4/ip_gre.c|  7 +--
 net/ipv6/ip6_gre.c   | 13 -
 4 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index db6bdc375126..2c6dd942953d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -800,6 +800,7 @@ enum bpf_func_id {
 /* BPF_FUNC_skb_set_tunnel_key flags. */
 #define BPF_F_ZERO_CSUM_TX (1ULL << 1)
 #define BPF_F_DONT_FRAGMENT(1ULL << 2)
+#define BPF_F_GRE_SEQ  (1ULL << 3)
 
 /* BPF_FUNC_perf_event_output, BPF_FUNC_perf_event_read and
  * BPF_FUNC_perf_event_read_value flags.
diff --git a/net/core/filter.c b/net/core/filter.c
index 0c121adbdbaa..010305e0791a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2991,7 +2991,7 @@ BPF_CALL_4(bpf_skb_set_tunnel_key, struct sk_buff *, skb,
struct ip_tunnel_info *info;
 
if (unlikely(flags & ~(BPF_F_TUNINFO_IPV6 | BPF_F_ZERO_CSUM_TX |
-  BPF_F_DONT_FRAGMENT)))
+  BPF_F_DONT_FRAGMENT | BPF_F_GRE_SEQ)))
return -EINVAL;
if (unlikely(size != sizeof(struct bpf_tunnel_key))) {
switch (size) {
@@ -3025,6 +3025,8 @@ BPF_CALL_4(bpf_skb_set_tunnel_key, struct sk_buff *, skb,
info->key.tun_flags |= TUNNEL_DONT_FRAGMENT;
if (flags & BPF_F_ZERO_CSUM_TX)
info->key.tun_flags &= ~TUNNEL_CSUM;
+   if (flags & BPF_F_GRE_SEQ)
+   info->key.tun_flags |= TUNNEL_SEQ;
 
info->key.tun_id = cpu_to_be64(from->tunnel_id);
info->key.tos = from->tunnel_tos;
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 0fe1d69b5df4..95fd225f402e 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -522,6 +522,7 @@ static struct rtable *prepare_fb_xmit(struct sk_buff *skb,
 static void gre_fb_xmit(struct sk_buff *skb, struct net_device *dev,
__be16 proto)
 {
+   struct ip_tunnel *tunnel = netdev_priv(dev);
struct ip_tunnel_info *tun_info;
const struct ip_tunnel_key *key;
struct rtable *rt = NULL;
@@ -545,9 +546,11 @@ static void gre_fb_xmit(struct sk_buff *skb, struct 
net_device *dev,
if (gre_handle_offloads(skb, !!(tun_info->key.tun_flags & TUNNEL_CSUM)))
goto err_free_rt;
 
-   flags = tun_info->key.tun_flags & (TUNNEL_CSUM | TUNNEL_KEY);
+   flags = tun_info->key.tun_flags &
+   (TUNNEL_CSUM | TUNNEL_KEY | TUNNEL_SEQ);
gre_build_header(skb, tunnel_hlen, flags, proto,
-tunnel_id_to_key32(tun_info->key.tun_id), 0);
+tunnel_id_to_key32(tun_info->key.tun_id),
+(flags | TUNNEL_SEQ) ? htonl(tunnel->o_seqno++) : 0);
 
df = key->tun_flags & TUNNEL_DONT_FRAGMENT ?  htons(IP_DF) : 0;
 
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 4f150a394387..16c5dfcbd195 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -695,9 +695,6 @@ static netdev_tx_t __gre6_xmit(struct sk_buff *skb,
else
fl6->daddr = tunnel->parms.raddr;
 
-   if (tunnel->parms.o_flags & TUNNEL_SEQ)
-   tunnel->o_seqno++;
-
/* Push GRE header. */
protocol = (dev->type == ARPHRD_ETHER) ? htons(ETH_P_TEB) : proto;
 
@@ -720,14 +717,20 @@ static netdev_tx_t __gre6_xmit(struct sk_buff *skb,
fl6->flowi6_uid = sock_net_uid(dev_net(dev), NULL);
 
dsfield = key->tos;
-   flags = key->tun_flags & (TUNNEL_CSUM | TUNNEL_KEY);
+   flags = key->tun_flags &
+   (TUNNEL_CSUM | TUNNEL_KEY | TUNNEL_SEQ);
tunnel->tun_hlen = gre_calc_hlen(flags);
 
gre_build_header(skb, tunnel->tun_hlen,
 flags, protocol,
-tunnel_id_to_key32(tun_info->key.tun_id), 0);
+tunnel_id_to_key32(tun_info->key.tun_id),
+(flags | TUNNEL_SEQ) ? htonl(tunnel->o_seqno++)
+ : 0);
 
} else {
+   if (tunnel->parms.o_flags & TUNNEL_SEQ)
+   tun

[PATCH net-next 0/2] gre: add sequence number for collect md mode.

2018-02-28 Thread William Tu

Currently GRE sequence number can only be used in native tunnel mode.
The first patch adds sequence number support for gre collect
metadata mode, and the second patch tests it using BPF.

RFC2890 defines GRE sequence number to be specific to the traffic
flow identified by the key.  However, this patch does not implement
per-key seqno.  The sequence number is shared within the same tunnel
device. That is, different tunnel keys using the same collect_md
tunnel share a single sequence number.

A new BPF uapi tunnel flag 'BPF_F_GRE_SEQ' is added.  I named it this
way since GRE is the only tunnel type that has a sequence number.

William Tu (2):
  gre: add sequence number for collect md mode.
  samples/bpf: add gre sequence number test.

 include/uapi/linux/bpf.h   |  1 +
 net/core/filter.c  |  4 +++-
 net/ipv4/ip_gre.c  |  7 +--
 net/ipv6/ip6_gre.c | 13 -
 samples/bpf/tcbpf2_kern.c  |  6 --
 samples/bpf/test_tunnel_bpf.sh |  4 ++--
 6 files changed, 23 insertions(+), 12 deletions(-)

-- 
2.7.4

[PATCH net-next 2/2] samples/bpf: add gre sequence number test.

2018-02-28 Thread William Tu

The patch adds tests for GRE sequence number
support for metadata mode tunnel.

Signed-off-by: William Tu <u9012...@gmail.com>
---
 samples/bpf/tcbpf2_kern.c  | 6 --
 samples/bpf/test_tunnel_bpf.sh | 4 ++--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/samples/bpf/tcbpf2_kern.c b/samples/bpf/tcbpf2_kern.c
index efdc16d195ff..f9d0db2be21b 100644
--- a/samples/bpf/tcbpf2_kern.c
+++ b/samples/bpf/tcbpf2_kern.c
@@ -52,7 +52,8 @@ int _gre_set_tunnel(struct __sk_buff *skb)
key.tunnel_tos = 0;
key.tunnel_ttl = 64;
 
-   ret = bpf_skb_set_tunnel_key(skb, , sizeof(key), 
BPF_F_ZERO_CSUM_TX);
+   ret = bpf_skb_set_tunnel_key(skb, , sizeof(key),
+BPF_F_ZERO_CSUM_TX | BPF_F_GRE_SEQ);
if (ret < 0) {
ERROR(ret);
return TC_ACT_SHOT;
@@ -92,7 +93,8 @@ int _ip6gretap_set_tunnel(struct __sk_buff *skb)
key.tunnel_label = 0xabcde;
 
ret = bpf_skb_set_tunnel_key(skb, , sizeof(key),
-BPF_F_TUNINFO_IPV6 | BPF_F_ZERO_CSUM_TX);
+BPF_F_TUNINFO_IPV6 | BPF_F_ZERO_CSUM_TX |
+BPF_F_GRE_SEQ);
if (ret < 0) {
ERROR(ret);
return TC_ACT_SHOT;
diff --git a/samples/bpf/test_tunnel_bpf.sh b/samples/bpf/test_tunnel_bpf.sh
index 43ce049996ee..01a07fb9efa9 100755
--- a/samples/bpf/test_tunnel_bpf.sh
+++ b/samples/bpf/test_tunnel_bpf.sh
@@ -23,7 +23,7 @@ function config_device {
 function add_gre_tunnel {
# in namespace
ip netns exec at_ns0 \
-   ip link add dev $DEV_NS type $TYPE key 2 local 172.16.1.100 
remote 172.16.1.200
+   ip link add dev $DEV_NS type $TYPE seq key 2 local 172.16.1.100 
remote 172.16.1.200
ip netns exec at_ns0 ip link set dev $DEV_NS up
ip netns exec at_ns0 ip addr add dev $DEV_NS 10.1.1.100/24
 
@@ -43,7 +43,7 @@ function add_ip6gretap_tunnel {
 
# in namespace
ip netns exec at_ns0 \
-   ip link add dev $DEV_NS type $TYPE flowlabel 0xbcdef key 2 \
+   ip link add dev $DEV_NS type $TYPE seq flowlabel 0xbcdef key 2 \
local ::11 remote ::22
 
ip netns exec at_ns0 ip addr add dev $DEV_NS 10.1.1.100/24
-- 
2.7.4

[PATCH net 2/3] net: erspan: fix erspan config overwrite

2018-02-05 Thread William Tu

When an erspan tunnel device receives an erspan packet with different
tunnel metadata (e.g. version, index, hwid, direction), the existing code
overwrites the tunnel device's erspan configuration with the received
packet's metadata.  The patch fixes it.

Fixes: 1a66a836da63 ("gre: add collect_md mode to ERSPAN tunnel")
Fixes: f551c91de262 ("net: erspan: introduce erspan v2 for ip_gre")
Fixes: ef7baf5e083c ("ip6_gre: add ip6 erspan collect_md mode")
Fixes: 94d7d8f29287 ("ip6_gre: add erspan v2 support")
Signed-off-by: William Tu <u9012...@gmail.com>
---
 net/ipv4/ip_gre.c  | 9 -
 net/ipv6/ip6_gre.c | 9 -
 2 files changed, 18 deletions(-)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 9b50eddd1882..45d97e9b2759 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -322,15 +322,6 @@ static int erspan_rcv(struct sk_buff *skb, struct 
tnl_ptk_info *tpi,
info = _dst->u.tun_info;
info->key.tun_flags |= TUNNEL_ERSPAN_OPT;
info->options_len = sizeof(*md);
-   } else {
-   tunnel->erspan_ver = ver;
-   if (ver == 1) {
-   tunnel->index = ntohl(pkt_md->u.index);
-   } else {
-   tunnel->dir = pkt_md->u.md2.dir;
-   tunnel->hwid = get_hwid(_md->u.md2);
-   }
-
}
 
skb_reset_mac_header(skb);
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 50913dbd0612..3c353125546d 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -562,15 +562,6 @@ static int ip6erspan_rcv(struct sk_buff *skb, int 
gre_hdr_len,
ip6_tnl_rcv(tunnel, skb, tpi, tun_dst, log_ecn_error);
 
} else {
-   tunnel->parms.erspan_ver = ver;
-
-   if (ver == 1) {
-   tunnel->parms.index = ntohl(pkt_md->u.index);
-   } else {
-   tunnel->parms.dir = pkt_md->u.md2.dir;
-   tunnel->parms.hwid = get_hwid(_md->u.md2);
-   }
-
ip6_tnl_rcv(tunnel, skb, tpi, NULL, log_ecn_error);
}
 
-- 
2.7.4

[PATCH net 3/3] sample/bpf: fix erspan metadata

2018-02-05 Thread William Tu

Commit c69de58ba84f ("net: erspan: use bitfield instead of
mask and offset") changes the erspan header to use bitfields, and
commit d350a823020e ("net: erspan: create erspan metadata uapi header")
creates a uapi header file.  These two commits break the current
erspan test.  This patch fixes the test by adapting it to both changes.

Fixes: ac80c2a165af ("samples/bpf: add erspan v2 sample code")
Fixes: ef88f89c830f ("samples/bpf: extend test_tunnel_bpf.sh with ERSPAN")
Signed-off-by: William Tu <u9012...@gmail.com>
---
 samples/bpf/tcbpf2_kern.c  | 41 -
 samples/bpf/test_tunnel_bpf.sh |  4 ++--
 2 files changed, 18 insertions(+), 27 deletions(-)

diff --git a/samples/bpf/tcbpf2_kern.c b/samples/bpf/tcbpf2_kern.c
index f6bbf8f50da3..efdc16d195ff 100644
--- a/samples/bpf/tcbpf2_kern.c
+++ b/samples/bpf/tcbpf2_kern.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "bpf_helpers.h"
 #include "bpf_endian.h"
@@ -35,24 +36,10 @@ struct geneve_opt {
u8  opt_data[8]; /* hard-coded to 8 byte */
 };
 
-struct erspan_md2 {
-   __be32 timestamp;
-   __be16 sgt;
-   __be16 flags;
-};
-
 struct vxlan_metadata {
u32 gbp;
 };
 
-struct erspan_metadata {
-   union {
-   __be32 index;
-   struct erspan_md2 md2;
-   } u;
-   int version;
-};
-
 SEC("gre_set_tunnel")
 int _gre_set_tunnel(struct __sk_buff *skb)
 {
@@ -156,13 +143,15 @@ int _erspan_set_tunnel(struct __sk_buff *skb)
__builtin_memset(, 0, sizeof(md));
 #ifdef ERSPAN_V1
md.version = 1;
-   md.u.index = htonl(123);
+   md.u.index = bpf_htonl(123);
 #else
u8 direction = 1;
-   u16 hwid = 7;
+   u8 hwid = 7;
 
md.version = 2;
-   md.u.md2.flags = htons((direction << 3) | (hwid << 4));
+   md.u.md2.dir = direction;
+   md.u.md2.hwid = hwid & 0xf;
+   md.u.md2.hwid_upper = (hwid >> 4) & 0x3;
 #endif
 
ret = bpf_skb_set_tunnel_opt(skb, , sizeof(md));
@@ -207,9 +196,9 @@ int _erspan_get_tunnel(struct __sk_buff *skb)
char fmt2[] = "\tdirection %d hwid %x timestamp %u\n";
 
bpf_trace_printk(fmt2, sizeof(fmt2),
-   (ntohs(md.u.md2.flags) >> 3) & 0x1,
-   (ntohs(md.u.md2.flags) >> 4) & 0x3f,
-   bpf_ntohl(md.u.md2.timestamp));
+md.u.md2.dir,
+(md.u.md2.hwid_upper << 4) + md.u.md2.hwid,
+bpf_ntohl(md.u.md2.timestamp));
 #endif
 
return TC_ACT_OK;
@@ -242,10 +231,12 @@ int _ip4ip6erspan_set_tunnel(struct __sk_buff *skb)
md.version = 1;
 #else
u8 direction = 0;
-   u16 hwid = 17;
+   u8 hwid = 17;
 
md.version = 2;
-   md.u.md2.flags = htons((direction << 3) | (hwid << 4));
+   md.u.md2.dir = direction;
+   md.u.md2.hwid = hwid & 0xf;
+   md.u.md2.hwid_upper = (hwid >> 4) & 0x3;
 #endif
 
ret = bpf_skb_set_tunnel_opt(skb, , sizeof(md));
@@ -290,9 +281,9 @@ int _ip4ip6erspan_get_tunnel(struct __sk_buff *skb)
char fmt2[] = "\tdirection %d hwid %x timestamp %u\n";
 
bpf_trace_printk(fmt2, sizeof(fmt2),
-   (ntohs(md.u.md2.flags) >> 3) & 0x1,
-   (ntohs(md.u.md2.flags) >> 4) & 0x3f,
-   bpf_ntohl(md.u.md2.timestamp));
+md.u.md2.dir,
+(md.u.md2.hwid_upper << 4) + md.u.md2.hwid,
+bpf_ntohl(md.u.md2.timestamp));
 #endif
 
return TC_ACT_OK;
diff --git a/samples/bpf/test_tunnel_bpf.sh b/samples/bpf/test_tunnel_bpf.sh
index ae7f7c38309b..43ce049996ee 100755
--- a/samples/bpf/test_tunnel_bpf.sh
+++ b/samples/bpf/test_tunnel_bpf.sh
@@ -68,7 +68,7 @@ function add_erspan_tunnel {
ip netns exec at_ns0 \
ip link add dev $DEV_NS type $TYPE seq key 2 \
local 172.16.1.100 remote 172.16.1.200 \
-   erspan_ver 2 erspan_dir 1 erspan_hwid 3
+   erspan_ver 2 erspan_dir egress erspan_hwid 3
fi
ip netns exec at_ns0 ip link set dev $DEV_NS up
ip netns exec at_ns0 ip addr add dev $DEV_NS 10.1.1.100/24
@@ -97,7 +97,7 @@ function add_ip6erspan_tunnel {
ip netns exec at_ns0 \
ip link add dev $DEV_NS type $TYPE seq key 2 \
local ::11 remote ::22 \
-   erspan_ver 2 erspan_dir 1 erspan_hwid 7
+   erspan_ver 2 erspan_dir egress erspan_hwid 7
fi
ip netns exec at_ns0 ip addr add dev $DEV_NS 10.1.1.100/24
ip netns exec at_ns0 ip link set dev $DEV_NS up
-- 
2.7.4
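
The hwid handling in the test above splits a 6-bit value into a 4-bit
low part and a 2-bit high part and later joins them again.  Below is a
stand-alone sketch of that arithmetic.  struct hwid_parts, split_hwid()
and join_hwid() are hypothetical names; plain bytes stand in for the
bitfields of the real metadata header.

```c
#include <stdint.h>

struct hwid_parts {
	uint8_t hwid;		/* low 4 bits of the hardware ID */
	uint8_t hwid_upper;	/* high 2 bits of the hardware ID */
};

/* Split a 6-bit hardware ID the same way the set_tunnel code does. */
static struct hwid_parts split_hwid(uint8_t hwid)
{
	struct hwid_parts p;

	p.hwid = hwid & 0xf;
	p.hwid_upper = (hwid >> 4) & 0x3;

	return p;
}

/* Rebuild the 6-bit hardware ID the same way the get_tunnel code does. */
static uint8_t join_hwid(struct hwid_parts p)
{
	return (uint8_t)((p.hwid_upper << 4) + p.hwid);
}
```

For hwid = 17 (0x11) the split gives hwid = 1 and hwid_upper = 1, and
joining the two parts gives 17 again.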

[PATCH net 1/3] net: erspan: fix metadata extraction

2018-02-05 Thread William Tu

Commit d350a823020e ("net: erspan: create erspan metadata uapi header")
moves the erspan 'version' in front of the 'struct erspan_md2' for
later extensibility.  This breaks the existing erspan metadata
extraction code because erspan_md2 then sits at a 4-byte offset inside
erspan_metadata, while in the packet the metadata directly follows
erspan_base_hdr.  This patch fixes it.

Fixes: 1a66a836da63 ("gre: add collect_md mode to ERSPAN tunnel")
Fixes: ef7baf5e083c ("ip6_gre: add ip6 erspan collect_md mode")
Fixes: 1d7e2ed22f8d ("net: erspan: refactor existing erspan code")
Signed-off-by: William Tu <u9012...@gmail.com>
---
 include/net/erspan.h | 26 +-
 net/ipv4/ip_gre.c|  5 -
 net/ipv6/ip6_gre.c   |  6 --
 3 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/include/net/erspan.h b/include/net/erspan.h
index 5daa4866412b..d044aa60cc76 100644
--- a/include/net/erspan.h
+++ b/include/net/erspan.h
@@ -159,13 +159,13 @@ static inline void erspan_build_header(struct sk_buff 
*skb,
struct ethhdr *eth = (struct ethhdr *)skb->data;
enum erspan_encap_type enc_type;
struct erspan_base_hdr *ershdr;
-   struct erspan_metadata *ersmd;
struct qtag_prefix {
__be16 eth_type;
__be16 tci;
} *qp;
u16 vlan_tci = 0;
u8 tos;
+   __be32 *idx;
 
tos = is_ipv4 ? ip_hdr(skb)->tos :
(ipv6_hdr(skb)->priority << 4) +
@@ -195,8 +195,8 @@ static inline void erspan_build_header(struct sk_buff *skb,
set_session_id(ershdr, id);
 
/* Build metadata */
-   ersmd = (struct erspan_metadata *)(ershdr + 1);
-   ersmd->u.index = htonl(index & INDEX_MASK);
+   idx = (__be32 *)(ershdr + 1);
+   *idx = htonl(index & INDEX_MASK);
 }
 
 /* ERSPAN GRA: timestamp granularity
@@ -225,7 +225,7 @@ static inline void erspan_build_header_v2(struct sk_buff 
*skb,
 {
struct ethhdr *eth = (struct ethhdr *)skb->data;
struct erspan_base_hdr *ershdr;
-   struct erspan_metadata *md;
+   struct erspan_md2 *md2;
struct qtag_prefix {
__be16 eth_type;
__be16 tci;
@@ -261,15 +261,15 @@ static inline void erspan_build_header_v2(struct sk_buff 
*skb,
set_session_id(ershdr, id);
 
/* Build metadata */
-   md = (struct erspan_metadata *)(ershdr + 1);
-   md->u.md2.timestamp = erspan_get_timestamp();
-   md->u.md2.sgt = htons(sgt);
-   md->u.md2.p = 1;
-   md->u.md2.ft = 0;
-   md->u.md2.dir = direction;
-   md->u.md2.gra = gra;
-   md->u.md2.o = 0;
-   set_hwid(>u.md2, hwid);
+   md2 = (struct erspan_md2 *)(ershdr + 1);
+   md2->timestamp = erspan_get_timestamp();
+   md2->sgt = htons(sgt);
+   md2->p = 1;
+   md2->ft = 0;
+   md2->dir = direction;
+   md2->gra = gra;
+   md2->o = 0;
+   set_hwid(md2, hwid);
 }
 
 #endif
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 6ec670fbbbdd..9b50eddd1882 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -261,6 +261,7 @@ static int erspan_rcv(struct sk_buff *skb, struct 
tnl_ptk_info *tpi,
struct ip_tunnel_net *itn;
struct ip_tunnel *tunnel;
const struct iphdr *iph;
+   struct erspan_md2 *md2;
int ver;
int len;
 
@@ -313,8 +314,10 @@ static int erspan_rcv(struct sk_buff *skb, struct 
tnl_ptk_info *tpi,
return PACKET_REJECT;
 
md = ip_tunnel_info_opts(_dst->u.tun_info);
-   memcpy(md, pkt_md, sizeof(*md));
md->version = ver;
+   md2 = &md->u.md2;
+   memcpy(md2, pkt_md, ver == 1 ? ERSPAN_V1_MDSIZE :
+  ERSPAN_V2_MDSIZE);
 
info = _dst->u.tun_info;
info->key.tun_flags |= TUNNEL_ERSPAN_OPT;
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 05f070e123e4..50913dbd0612 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -505,6 +505,7 @@ static int ip6erspan_rcv(struct sk_buff *skb, int 
gre_hdr_len,
struct erspan_base_hdr *ershdr;
struct erspan_metadata *pkt_md;
const struct ipv6hdr *ipv6h;
+   struct erspan_md2 *md2;
struct ip6_tnl *tunnel;
u8 ver;
 
@@ -551,9 +552,10 @@ static int ip6erspan_rcv(struct sk_buff *skb, int 
gre_hdr_len,
 
info = _dst->u.tun_info;
md = ip_tunnel_info_opts(info);
-
-   memcpy(md, pkt_md, sizeof(*md));
md->version = ver;
+   md2 = &md->u.md2;
+   memcpy(md2, pkt_md, ver == 1 ? ERSPAN_V1_MDSIZE :
+  ERSPAN_V2_MDSIZE);
info->key.tun_flags |= TUNNEL_ERSPAN_OPT;
info->options_len = sizeof(*md);
 
-- 
2.7.4

[PATCH net 0/3] net: erspan fixes

2018-02-05 Thread William Tu

The first patch fixes erspan metadata extraction from the packet
header, which was broken by commit d350a823020e ("net: erspan: create
erspan metadata uapi header").  That commit moved the erspan 'version' in
'struct erspan_metadata' in front of 'struct erspan_md2' for later
extensibility, but the extra 4-byte 'version' broke the existing
metadata extraction code.  The second patch fixes the case where a
tunnel device receives an erspan packet with different tunnel metadata
(e.g. version, index, hwid, direction) and the existing code overwrites
the tunnel device's erspan configuration.  The third patch fixes the bpf
tests affected by the above patches.
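
To illustrate the first fix: once 'version' sits in front of the union,
a copy that starts at the beginning of the struct puts the packet bytes
on top of 'version' instead of the metadata.  Below is a minimal
userspace sketch, assuming a simplified stand-in for the structure.  The
toy_ names and toy_extract_md() are hypothetical and not kernel API, and
rejecting unknown versions is a choice made for the sketch only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_V1_MDSIZE 4 /* mirrors ERSPAN_V1_MDSIZE */
#define TOY_V2_MDSIZE 8 /* mirrors ERSPAN_V2_MDSIZE */

/* Simplified stand-in for struct erspan_metadata after the uapi change:
 * 'version' comes first, the on-wire metadata lives in the union.
 */
struct toy_erspan_metadata {
	int version;
	union {
		uint32_t index;             /* version 1: 4 bytes on the wire */
		uint8_t md2[TOY_V2_MDSIZE]; /* version 2: 8 bytes on the wire */
	} u;
};

/* The union no longer starts at offset 0, so copying to the struct
 * start would clobber 'version'.
 */
static_assert(offsetof(struct toy_erspan_metadata, u) >= sizeof(int),
	      "metadata union must follow 'version'");

/* Copy the on-wire metadata into the union (not the struct start) and
 * copy only as many bytes as the given version carries.
 * Returns 0 on success, -1 for an unknown version.
 */
static int toy_extract_md(struct toy_erspan_metadata *md,
			  const uint8_t *pkt_md, int ver)
{
	size_t len;

	if (ver == 1)
		len = TOY_V1_MDSIZE;
	else if (ver == 2)
		len = TOY_V2_MDSIZE;
	else
		return -1;

	memset(md, 0, sizeof(*md));
	md->version = ver;
	memcpy(&md->u, pkt_md, len);
	return 0;
}
```

The length depends on the version for the same reason as in the patch: a
v1 packet carries only 4 bytes of metadata, so copying sizeof(*md) would
read past it.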

William Tu (3):
  net: erspan: fix metadata extraction
  net: erspan: fix erspan config overwrite
  sample/bpf: fix erspan metadata

 include/net/erspan.h   | 26 +-
 net/ipv4/ip_gre.c  | 14 --
 net/ipv6/ip6_gre.c | 15 ---
 samples/bpf/tcbpf2_kern.c  | 41 -
 samples/bpf/test_tunnel_bpf.sh |  4 ++--
 5 files changed, 39 insertions(+), 61 deletions(-)

-- 
2.7.4

[PATCHv6 net-next 2/3] net: erspan: create erspan metadata uapi header

2018-01-25 Thread William Tu

The patch adds a new uapi header file, erspan.h, and moves
the 'struct erspan_metadata' from internal erspan.h to it.

Signed-off-by: William Tu <u9012...@gmail.com>
---
 include/net/erspan.h| 32 ++--
 include/uapi/linux/erspan.h | 52 +
 2 files changed, 54 insertions(+), 30 deletions(-)
 create mode 100644 include/uapi/linux/erspan.h

diff --git a/include/net/erspan.h b/include/net/erspan.h
index 6d30fe898286..5daa4866412b 100644
--- a/include/net/erspan.h
+++ b/include/net/erspan.h
@@ -46,6 +46,8 @@
  * GRE proto ERSPAN type II = 0x88BE, type III = 0x22EB
  */
 
+#include 
+
 #define ERSPAN_VERSION 0x1 /* ERSPAN type II */
 #define VER_MASK   0xf000
 #define VLAN_MASK  0x0fff
@@ -68,29 +70,6 @@
 #define HWID_OFFSET4
 #define DIR_OFFSET 3
 
-/* ERSPAN version 2 metadata header */
-struct erspan_md2 {
-   __be32 timestamp;
-   __be16 sgt; /* security group tag */
-#if defined(__LITTLE_ENDIAN_BITFIELD)
-   __u8hwid_upper:2,
-   ft:5,
-   p:1;
-   __u8o:1,
-   gra:2,
-   dir:1,
-   hwid:4;
-#elif defined(__BIG_ENDIAN_BITFIELD)
-   __u8p:1,
-   ft:5,
-   hwid_upper:2;
-   __u8hwid:4,
-   dir:1,
-   gra:2,
-   o:1;
-#endif
-};
-
 enum erspan_encap_type {
ERSPAN_ENCAP_NOVLAN = 0x0,  /* originally without VLAN tag */
ERSPAN_ENCAP_ISL = 0x1, /* originally ISL encapsulated */
@@ -100,13 +79,6 @@ enum erspan_encap_type {
 
 #define ERSPAN_V1_MDSIZE   4
 #define ERSPAN_V2_MDSIZE   8
-struct erspan_metadata {
-   union {
-   __be32 index;   /* Version 1 (type II)*/
-   struct erspan_md2 md2;  /* Version 2 (type III) */
-   } u;
-   int version;
-};
 
 struct erspan_base_hdr {
 #if defined(__LITTLE_ENDIAN_BITFIELD)
diff --git a/include/uapi/linux/erspan.h b/include/uapi/linux/erspan.h
new file mode 100644
index ..841573019ae1
--- /dev/null
+++ b/include/uapi/linux/erspan.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * ERSPAN Tunnel Metadata
+ *
+ * Copyright (c) 2018 VMware
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ *
+ * Userspace API for metadata mode ERSPAN tunnel
+ */
+#ifndef _UAPI_ERSPAN_H
+#define _UAPI_ERSPAN_H
+
+#include/* For __beXX in userspace */
+#include 
+
+/* ERSPAN version 2 metadata header */
+struct erspan_md2 {
+   __be32 timestamp;
+   __be16 sgt; /* security group tag */
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+   __u8hwid_upper:2,
+   ft:5,
+   p:1;
+   __u8o:1,
+   gra:2,
+   dir:1,
+   hwid:4;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+   __u8p:1,
+   ft:5,
+   hwid_upper:2;
+   __u8hwid:4,
+   dir:1,
+   gra:2,
+   o:1;
+#else
+#error "Please fix "
+#endif
+};
+
+struct erspan_metadata {
+   int version;
+   union {
+   __be32 index;   /* Version 1 (type II)*/
+   struct erspan_md2 md2;  /* Version 2 (type III) */
+   } u;
+};
+
+#endif /* _UAPI_ERSPAN_H */
-- 
2.7.4

[PATCHv6 net-next 3/3] openvswitch: add erspan version I and II support

2018-01-25 Thread William Tu

The patch adds support for openvswitch to configure erspan
v1 and v2.  The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr is added
to uapi as a binary blob to support all ERSPAN v1 and v2
fields.  Note that the previous commit "openvswitch: Add erspan tunnel
support." was reverted because it was not designed properly.
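
As a rough illustration of the binary-blob approach, the sketch below
validates the length of an opaque option blob and stores it in a
fixed-size option area.  Everything here is hypothetical: struct
toy_flow_key, TOY_TUN_OPTS_MAX and toy_put_tun_opts() are stand-ins, and
placing the blob at the tail of the area is an assumption about what
TUN_METADATA_OFFSET() computes, not something this patch states.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_TUN_OPTS_MAX 255 /* hypothetical size of the option area */

/* Hypothetical, reduced stand-in for the flow key's tunnel option area. */
struct toy_flow_key {
	uint8_t tun_opts[TOY_TUN_OPTS_MAX];
	uint8_t tun_opts_len;
};

static_assert(TOY_TUN_OPTS_MAX <= UINT8_MAX,
	      "option length must fit in tun_opts_len");

/* Reject blobs larger than the option area, otherwise store the blob
 * right-aligned (assumed layout) and record its length.
 */
static int toy_put_tun_opts(struct toy_flow_key *key,
			    const void *blob, size_t len)
{
	if (len > sizeof(key->tun_opts))
		return -EINVAL;

	memcpy(key->tun_opts + sizeof(key->tun_opts) - len, blob, len);
	key->tun_opts_len = (uint8_t)len;
	return 0;
}
```

Because the kernel treats the attribute as opaque bytes, new ERSPAN
fields would not need a new netlink attribute, only a larger struct.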

Signed-off-by: William Tu <u9012...@gmail.com>
---
 include/uapi/linux/openvswitch.h |  1 +
 net/openvswitch/flow_netlink.c   | 52 +++-
 2 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index dcfab5e3b55c..713e56ce681f 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -363,6 +363,7 @@ enum ovs_tunnel_key_attr {
OVS_TUNNEL_KEY_ATTR_IPV6_SRC,   /* struct in6_addr src IPv6 
address. */
OVS_TUNNEL_KEY_ATTR_IPV6_DST,   /* struct in6_addr dst IPv6 
address. */
OVS_TUNNEL_KEY_ATTR_PAD,
+   OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,/* struct erspan_metadata */
__OVS_TUNNEL_KEY_ATTR_MAX
 };
 
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index eb55f1b3d047..7322aa1e382e 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -49,6 +49,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "flow_netlink.h"
 
@@ -329,7 +330,8 @@ size_t ovs_tun_key_attr_size(void)
+ nla_total_size(0)/* OVS_TUNNEL_KEY_ATTR_CSUM */
+ nla_total_size(0)/* OVS_TUNNEL_KEY_ATTR_OAM */
+ nla_total_size(256)  /* OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS */
-   /* OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS is mutually exclusive with
+   /* OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS and
+* OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS is mutually exclusive with
 * OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS and covered by it.
 */
+ nla_total_size(2)/* OVS_TUNNEL_KEY_ATTR_TP_SRC */
@@ -400,6 +402,7 @@ static const struct ovs_len_tbl 
ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1]
.next = ovs_vxlan_ext_key_lens 
},
[OVS_TUNNEL_KEY_ATTR_IPV6_SRC]  = { .len = sizeof(struct in6_addr) 
},
[OVS_TUNNEL_KEY_ATTR_IPV6_DST]  = { .len = sizeof(struct in6_addr) 
},
+   [OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS]   = { .len = OVS_ATTR_VARIABLE },
 };
 
 static const struct ovs_len_tbl
@@ -631,6 +634,33 @@ static int vxlan_tun_opt_from_nlattr(const struct nlattr 
*attr,
return 0;
 }
 
+static int erspan_tun_opt_from_nlattr(const struct nlattr *a,
+ struct sw_flow_match *match, bool is_mask,
+ bool log)
+{
+   unsigned long opt_key_offset;
+
+   BUILD_BUG_ON(sizeof(struct erspan_metadata) >
+sizeof(match->key->tun_opts));
+
+   if (nla_len(a) > sizeof(match->key->tun_opts)) {
+   OVS_NLERR(log, "ERSPAN option length err (len %d, max %zu).",
+ nla_len(a), sizeof(match->key->tun_opts));
+   return -EINVAL;
+   }
+
+   if (!is_mask)
+   SW_FLOW_KEY_PUT(match, tun_opts_len,
+   sizeof(struct erspan_metadata), false);
+   else
+   SW_FLOW_KEY_PUT(match, tun_opts_len, 0xff, true);
+
+   opt_key_offset = TUN_METADATA_OFFSET(nla_len(a));
+   SW_FLOW_KEY_MEMCPY_OFFSET(match, opt_key_offset, nla_data(a),
+ nla_len(a), is_mask);
+   return 0;
+}
+
 static int ip_tun_from_nlattr(const struct nlattr *attr,
  struct sw_flow_match *match, bool is_mask,
  bool log)
@@ -738,6 +768,20 @@ static int ip_tun_from_nlattr(const struct nlattr *attr,
break;
case OVS_TUNNEL_KEY_ATTR_PAD:
break;
+   case OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS:
+   if (opts_type) {
+   OVS_NLERR(log, "Multiple metadata blocks 
provided");
+   return -EINVAL;
+   }
+
+   err = erspan_tun_opt_from_nlattr(a, match, is_mask,
+log);
+   if (err)
+   return err;
+
+   tun_flags |= TUNNEL_ERSPAN_OPT;
+   opts_type = type;
+   break;
default:
OVS_NLERR(log, "Unknown IP tunnel attribute %d",
  type);
@@ -862,6 +906,10 @@ static int __ip_tun_to_nlattr(struct sk_buff *skb,
else if (output->tun_flags & TUNNEL_VXLAN_OPT &&
 v

[PATCHv6 net-next 0/3] net: erspan: add support for openvswitch

2018-01-25 Thread William Tu

The first patch refactors the erspan header definitions.
Originally, the erspan fields were packed into a __be16 field
and accessed with masks and offsets.  This is error-prone and more costly,
since it requires calling ntohs/htons.  The first patch changes it to use
bitfields.  The second patch creates erspan.h in UAPI and moves the definition
of 'struct erspan_metadata' to it for openvswitch to use later.  The final patch
introduces the new OVS tunnel key attribute, OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,
to program both v1 and v2 erspan tunnels for openvswitch.
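
The bitfield approach from the first patch can be tried in userspace
with a sketch like the one below.  It covers only the second half of
the base header and is not the kernel definition: struct
toy_base_hdr_tail and the toy_ helpers are hypothetical names, and the
layout is selected with the compiler's __BYTE_ORDER__ macro (a GCC/Clang
assumption) where the kernel uses __LITTLE_ENDIAN_BITFIELD.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace copy of the second 16 bits of the ERSPAN base
 * header: cos(3) en(2) t(1) session_id(10).  The 10-bit session ID is
 * split into a 2-bit upper part and an 8-bit lower byte.
 */
struct toy_base_hdr_tail {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
	uint8_t session_id_upper:2,
		t:1,
		en:2,
		cos:3;
#else
	uint8_t cos:3,
		en:2,
		t:1,
		session_id_upper:2;
#endif
	uint8_t session_id;
};

static_assert(sizeof(struct toy_base_hdr_tail) == 2,
	      "the two bytes must map 1:1 onto the wire format");

/* Store a 10-bit session ID; bits above bit 9 are dropped. */
static void toy_set_session_id(struct toy_base_hdr_tail *h, uint16_t id)
{
	h->session_id = id & 0xff;
	h->session_id_upper = (id >> 8) & 0x3;
}

/* Reassemble the 10-bit session ID from its two parts. */
static uint16_t toy_get_session_id(const struct toy_base_hdr_tail *h)
{
	return (uint16_t)((h->session_id_upper << 8) | h->session_id);
}
```

Neither helper needs ntohs/htons, because every access touches a single
byte; the cost moves into keeping the two bitfield layouts in sync.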

William Tu (3):
  net: erspan: use bitfield instead of mask and offset
  net: erspan: create erspan metadata uapi header
  openvswitch: add erspan version I and II support

 include/net/erspan.h | 123 +--
 include/uapi/linux/erspan.h  |  52 +
 include/uapi/linux/openvswitch.h |   1 +
 net/ipv4/ip_gre.c|  38 +---
 net/ipv6/ip6_gre.c   |  36 +---
 net/openvswitch/flow_netlink.c   |  52 -
 6 files changed, 209 insertions(+), 93 deletions(-)
 create mode 100644 include/uapi/linux/erspan.h
---
v5->v6
  move field 'version' to the beginning of the struct for easy expansion later.
  remove redundant erspan validation function
  create erspan.h in uapi

v4->v5
  rather than passing individual members of erspan_metadata,
  just pass the whole binary structure between kernel and userspace,
  suggested by Pravin.

v3->v4
  change from be32 to u32 for OVS_ERSPAN_OPT_IDX, suggested by Jiri Benc.

v2->v3
  revert the "openvswitch: Add erspan tunnel support." commit ceaa001a170e.
  redesign the OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS as nested attribute

v1->v2
  Fix compatibility issue suggested by Pravin.


-- 
2.7.4

[PATCHv6 net-next 1/3] net: erspan: use bitfield instead of mask and offset

2018-01-25 Thread William Tu

Originally the erspan fields were packed into a __be16 field
and accessed with masks and offsets.  This is more costly because it
requires calling ntohs/htons.  The patch changes it to use bitfields.

Signed-off-by: William Tu <u9012...@gmail.com>
---
 include/net/erspan.h | 127 ++-
 net/ipv4/ip_gre.c|  38 ++-
 net/ipv6/ip6_gre.c   |  36 ++-
 3 files changed, 121 insertions(+), 80 deletions(-)

diff --git a/include/net/erspan.h b/include/net/erspan.h
index 712ea1b1f4db..6d30fe898286 100644
--- a/include/net/erspan.h
+++ b/include/net/erspan.h
@@ -65,16 +65,30 @@
 #define GRA_MASK   0x0006
 #define O_MASK 0x0001
 
+#define HWID_OFFSET4
+#define DIR_OFFSET 3
+
 /* ERSPAN version 2 metadata header */
 struct erspan_md2 {
__be32 timestamp;
__be16 sgt; /* security group tag */
-   __be16 flags;
-#define P_OFFSET   15
-#define FT_OFFSET  10
-#define HWID_OFFSET4
-#define DIR_OFFSET 3
-#define GRA_OFFSET 1
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+   __u8hwid_upper:2,
+   ft:5,
+   p:1;
+   __u8o:1,
+   gra:2,
+   dir:1,
+   hwid:4;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+   __u8p:1,
+   ft:5,
+   hwid_upper:2;
+   __u8hwid:4,
+   dir:1,
+   gra:2,
+   o:1;
+#endif
 };
 
 enum erspan_encap_type {
@@ -95,15 +109,62 @@ struct erspan_metadata {
 };
 
 struct erspan_base_hdr {
-   __be16 ver_vlan;
-#define VER_OFFSET  12
-   __be16 session_id;
-#define COS_OFFSET  13
-#define EN_OFFSET   11
-#define BSO_OFFSET  EN_OFFSET
-#define T_OFFSET10
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+   __u8vlan_upper:4,
+   ver:4;
+   __u8vlan:8;
+   __u8session_id_upper:2,
+   t:1,
+   en:2,
+   cos:3;
+   __u8session_id:8;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+   __u8ver: 4,
+   vlan_upper:4;
+   __u8vlan:8;
+   __u8cos:3,
+   en:2,
+   t:1,
+   session_id_upper:2;
+   __u8session_id:8;
+#else
+#error "Please fix "
+#endif
 };
 
+static inline void set_session_id(struct erspan_base_hdr *ershdr, u16 id)
+{
+   ershdr->session_id = id & 0xff;
+   ershdr->session_id_upper = (id >> 8) & 0x3;
+}
+
+static inline u16 get_session_id(const struct erspan_base_hdr *ershdr)
+{
+   return (ershdr->session_id_upper << 8) + ershdr->session_id;
+}
+
+static inline void set_vlan(struct erspan_base_hdr *ershdr, u16 vlan)
+{
+   ershdr->vlan = vlan & 0xff;
+   ershdr->vlan_upper = (vlan >> 8) & 0xf;
+}
+
+static inline u16 get_vlan(const struct erspan_base_hdr *ershdr)
+{
+   return (ershdr->vlan_upper << 8) + ershdr->vlan;
+}
+
+static inline void set_hwid(struct erspan_md2 *md2, u8 hwid)
+{
+   md2->hwid = hwid & 0xf;
+   md2->hwid_upper = (hwid >> 4) & 0x3;
+}
+
+static inline u8 get_hwid(const struct erspan_md2 *md2)
+{
+   return (md2->hwid_upper << 4) + md2->hwid;
+}
+
 static inline int erspan_hdr_len(int version)
 {
return sizeof(struct erspan_base_hdr) +
@@ -120,7 +181,7 @@ static inline u8 tos_to_cos(u8 tos)
 }
 
 static inline void erspan_build_header(struct sk_buff *skb,
-   __be32 id, u32 index,
+   u32 id, u32 index,
bool truncate, bool is_ipv4)
 {
struct ethhdr *eth = (struct ethhdr *)skb->data;
@@ -154,12 +215,12 @@ static inline void erspan_build_header(struct sk_buff 
*skb,
memset(ershdr, 0, sizeof(*ershdr) + ERSPAN_V1_MDSIZE);
 
/* Build base header */
-   ershdr->ver_vlan = htons((vlan_tci & VLAN_MASK) |
-(ERSPAN_VERSION << VER_OFFSET));
-   ershdr->session_id = htons((u16)(ntohl(id) & ID_MASK) |
-  ((tos_to_cos(tos) << COS_OFFSET) & COS_MASK) |
-  (enc_type << EN_OFFSET & EN_MASK) |
-  ((truncate << T_OFFSET) & T_MASK));
+   ershdr->ver = ERSPAN_VERSION;
+   ershdr->cos = tos_to_cos(tos);
+   ershdr->en = enc_type;
+   ershdr->t = truncate;
+   set_vlan(ershdr, vlan_tci);
+   set_session_id(ershdr, id);
 
/* Build metadata */
ersmd = (struct erspan_metadata *)(ershdr + 1);
@@ -187,7 +248,7 @@ static inline __be32 erspan_get_timestamp(void)
 }
 
 static inline void erspan_build_header_v2(struct sk_buff *skb,
- __be32 id, u8 direction, u16 hwid,
+ u32 id, u8 direction, u16 hwid,

Re: [PATCHv5 net-next 2/2] openvswitch: add erspan version I and II support

2018-01-25 Thread William Tu

Thanks for the review.

On Thu, Jan 25, 2018 at 9:32 AM, Pravin Shelar <pshe...@ovn.org> wrote:
> On Wed, Jan 24, 2018 at 11:06 AM, William Tu <u9012...@gmail.com> wrote:
>> The patch adds support for openvswitch to configure erspan
>> v1 and v2.  The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr is added
>> to uapi as a binary blob to support all ERSPAN v1 and v2's
>> fields.  Note that Previous commit "openvswitch: Add erspan tunnel
>> support." was reverted since it does not design properly.
>>
>> Signed-off-by: William Tu <u9012...@gmail.com>
>> ---
>>  include/uapi/linux/openvswitch.h |  2 +-
>>  net/openvswitch/flow_netlink.c   | 90 
>> +++-
>>  2 files changed, 90 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/uapi/linux/openvswitch.h 
>> b/include/uapi/linux/openvswitch.h
>> index dcfab5e3b55c..158c2e45c0a5 100644
>> --- a/include/uapi/linux/openvswitch.h
>> +++ b/include/uapi/linux/openvswitch.h
>> @@ -273,7 +273,6 @@ enum {
>>
>>  #define OVS_VXLAN_EXT_MAX (__OVS_VXLAN_EXT_MAX - 1)
>>
>> -
>>  /* OVS_VPORT_ATTR_OPTIONS attributes for tunnels.
>>   */
>>  enum {
>> @@ -363,6 +362,7 @@ enum ovs_tunnel_key_attr {
>> OVS_TUNNEL_KEY_ATTR_IPV6_SRC,   /* struct in6_addr src IPv6 
>> address. */
>> OVS_TUNNEL_KEY_ATTR_IPV6_DST,   /* struct in6_addr dst IPv6 
>> address. */
>> OVS_TUNNEL_KEY_ATTR_PAD,
>> +   OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,/* struct erspan_metadata */
>> __OVS_TUNNEL_KEY_ATTR_MAX
>>  };
>>
> Since this is uapi, we need to define the struct erspan_metadata in a
> UAPI header file.

Should I define "struct erspan_metadata" in include/uapi/linux/openvswitch.h?

Or I could create a new uapi file, "include/uapi/linux/erspan.h",
define "struct erspan_metadata" there, and remove its duplicate from
include/net/erspan.h.

>
> Also lets move field 'version' to the begining of the struct for easy
> expansion later.
> struct erspan_metadata {
> int version;
> union {
> __be32 index;   /* Version 1 (type II)*/
> struct erspan_md2 md2;  /* Version 2 (type III) */
> } u;
> };
>
Sure, will do it.

>> diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
>> index f143908b651d..9d00c24b2836 100644
>> --- a/net/openvswitch/flow_netlink.c
>> +++ b/net/openvswitch/flow_netlink.c
>> @@ -49,6 +49,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>
>>  #include "flow_netlink.h"
>>
>> @@ -329,7 +330,8 @@ size_t ovs_tun_key_attr_size(void)
>> + nla_total_size(0)/* OVS_TUNNEL_KEY_ATTR_CSUM */
>> + nla_total_size(0)/* OVS_TUNNEL_KEY_ATTR_OAM */
>> + nla_total_size(256)  /* OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS */
>> -   /* OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS is mutually exclusive with
>> +   /* OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS and
>> +* OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS is mutually exclusive with
>>  * OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS and covered by it.
>>  */
>> + nla_total_size(2)/* OVS_TUNNEL_KEY_ATTR_TP_SRC */
>> @@ -400,6 +402,7 @@ static const struct ovs_len_tbl 
>> ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1]
>> .next = 
>> ovs_vxlan_ext_key_lens },
>> [OVS_TUNNEL_KEY_ATTR_IPV6_SRC]  = { .len = sizeof(struct 
>> in6_addr) },
>> [OVS_TUNNEL_KEY_ATTR_IPV6_DST]  = { .len = sizeof(struct 
>> in6_addr) },
>> +   [OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS]   = { .len = OVS_ATTR_VARIABLE },
>>  };
>>
>>  static const struct ovs_len_tbl
>> @@ -631,6 +634,33 @@ static int vxlan_tun_opt_from_nlattr(const struct 
>> nlattr *attr,
>> return 0;
>>  }
>>
>> +static int erspan_tun_opt_from_nlattr(const struct nlattr *a,
>> + struct sw_flow_match *match, bool 
>> is_mask,
>> + bool log)
>> +{
>> +   unsigned long opt_key_offset;
>> +
>> +   BUILD_BUG_ON(sizeof(struct erspan_metadata) >
>> +sizeof(match->key->tun_opts));
>> +
>> +   if (nla_len(a) > sizeof(match->key->tun_opts)) {
>> +   OVS_NLERR(log, "ERSPAN option length err (len %d, max %zu).",
>&g

[PATCHv5 net-next 1/2] net: erspan: use bitfield instead of mask and offset

2018-01-24 Thread William Tu

Originally the erspan fields were packed into a __be16 field
and accessed with masks and offsets.  This is more costly because it
requires calling ntohs/htons.  The patch changes it to use bitfields.

Signed-off-by: William Tu <u9012...@gmail.com>
---
 include/net/erspan.h | 127 ++-
 net/ipv4/ip_gre.c|  38 ++-
 net/ipv6/ip6_gre.c   |  36 ++-
 3 files changed, 121 insertions(+), 80 deletions(-)

diff --git a/include/net/erspan.h b/include/net/erspan.h
index acdf6843095d..2b75821e2ebe 100644
--- a/include/net/erspan.h
+++ b/include/net/erspan.h
@@ -65,16 +65,30 @@
 #define GRA_MASK   0x0006
 #define O_MASK 0x0001
 
+#define HWID_OFFSET4
+#define DIR_OFFSET 3
+
 /* ERSPAN version 2 metadata header */
 struct erspan_md2 {
__be32 timestamp;
__be16 sgt; /* security group tag */
-   __be16 flags;
-#define P_OFFSET   15
-#define FT_OFFSET  10
-#define HWID_OFFSET4
-#define DIR_OFFSET 3
-#define GRA_OFFSET 1
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+   __u8hwid_upper:2,
+   ft:5,
+   p:1;
+   __u8o:1,
+   gra:2,
+   dir:1,
+   hwid:4;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+   __u8p:1,
+   ft:5,
+   hwid_upper:2;
+   __u8hwid:4,
+   dir:1,
+   gra:2,
+   o:1;
+#endif
 };
 
 enum erspan_encap_type {
@@ -95,15 +109,62 @@ struct erspan_metadata {
 };
 
 struct erspan_base_hdr {
-   __be16 ver_vlan;
-#define VER_OFFSET  12
-   __be16 session_id;
-#define COS_OFFSET  13
-#define EN_OFFSET   11
-#define BSO_OFFSET  EN_OFFSET
-#define T_OFFSET10
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+   __u8vlan_upper:4,
+   ver:4;
+   __u8vlan:8;
+   __u8session_id_upper:2,
+   t:1,
+   en:2,
+   cos:3;
+   __u8session_id:8;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+   __u8ver: 4,
+   vlan_upper:4;
+   __u8vlan:8;
+   __u8cos:3,
+   en:2,
+   t:1,
+   session_id_upper:2;
+   __u8session_id:8;
+#else
+#error "Please fix "
+#endif
 };
 
+static inline void set_session_id(struct erspan_base_hdr *ershdr, u16 id)
+{
+   ershdr->session_id = id & 0xff;
+   ershdr->session_id_upper = (id >> 8) & 0x3;
+}
+
+static inline u16 get_session_id(const struct erspan_base_hdr *ershdr)
+{
+   return (ershdr->session_id_upper << 8) + ershdr->session_id;
+}
+
+static inline void set_vlan(struct erspan_base_hdr *ershdr, u16 vlan)
+{
+   ershdr->vlan = vlan & 0xff;
+   ershdr->vlan_upper = (vlan >> 8) & 0xf;
+}
+
+static inline u16 get_vlan(const struct erspan_base_hdr *ershdr)
+{
+   return (ershdr->vlan_upper << 8) + ershdr->vlan;
+}
+
+static inline void set_hwid(struct erspan_md2 *md2, u8 hwid)
+{
+   md2->hwid = hwid & 0xf;
+   md2->hwid_upper = (hwid >> 4) & 0x3;
+}
+
+static inline u8 get_hwid(const struct erspan_md2 *md2)
+{
+   return (md2->hwid_upper << 4) + md2->hwid;
+}
+
 static inline int erspan_hdr_len(int version)
 {
return sizeof(struct erspan_base_hdr) +
@@ -120,7 +181,7 @@ static inline u8 tos_to_cos(u8 tos)
 }
 
 static inline void erspan_build_header(struct sk_buff *skb,
-   __be32 id, u32 index,
+   u32 id, u32 index,
bool truncate, bool is_ipv4)
 {
struct ethhdr *eth = eth_hdr(skb);
@@ -154,12 +215,12 @@ static inline void erspan_build_header(struct sk_buff 
*skb,
memset(ershdr, 0, sizeof(*ershdr) + ERSPAN_V1_MDSIZE);
 
/* Build base header */
-   ershdr->ver_vlan = htons((vlan_tci & VLAN_MASK) |
-(ERSPAN_VERSION << VER_OFFSET));
-   ershdr->session_id = htons((u16)(ntohl(id) & ID_MASK) |
-  ((tos_to_cos(tos) << COS_OFFSET) & COS_MASK) |
-  (enc_type << EN_OFFSET & EN_MASK) |
-  ((truncate << T_OFFSET) & T_MASK));
+   ershdr->ver = ERSPAN_VERSION;
+   ershdr->cos = tos_to_cos(tos);
+   ershdr->en = enc_type;
+   ershdr->t = truncate;
+   set_vlan(ershdr, vlan_tci);
+   set_session_id(ershdr, id);
 
/* Build metadata */
ersmd = (struct erspan_metadata *)(ershdr + 1);
@@ -187,7 +248,7 @@ static inline __be32 erspan_get_timestamp(void)
 }
 
 static inline void erspan_build_header_v2(struct sk_buff *skb,
- __be32 id, u8 direction, u16 hwid,
+ u32 id, u8 direction, u16 hwid,
  bool trunca

[PATCHv5 net-next 0/2] net: erspan: add support for openvswitch

2018-01-24 Thread William Tu

The first patch refactors the erspan header definitions.
Originally, the erspan fields were packed into a __be16 field
and accessed with masks and offsets.  This is error-prone and more costly,
since it requires calling ntohs/htons.  The first patch changes it to use
bitfields.  The second patch introduces the new OVS tunnel key attribute
to program both v1 and v2 erspan tunnels for openvswitch.

William Tu (2):
  net: erspan: use bitfield instead of mask and offset
  openvswitch: add erspan version I and II support

 include/net/erspan.h | 127 +--
 include/uapi/linux/openvswitch.h |   2 +-
 net/ipv4/ip_gre.c|  38 +---
 net/ipv6/ip6_gre.c   |  36 ---
 net/openvswitch/flow_netlink.c   |  90 ++-
 5 files changed, 211 insertions(+), 82 deletions(-)

-- 
v4->v5
  rather than passing individual members of erspan_metadata,
  just pass the whole binary structure between kernel and userspace,
  suggested by Pravin.

v3->v4
  change from be32 to u32 for OVS_ERSPAN_OPT_IDX, suggested by Jiri Benc.

v2->v3
  revert the "openvswitch: Add erspan tunnel support." commit ceaa001a170e.
  redesign the OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS as nested attribute

v1->v2
  Fix compatibility issue suggested by Pravin.

2.7.4

[PATCHv5 net-next 2/2] openvswitch: add erspan version I and II support

2018-01-24 Thread William Tu

The patch adds support for openvswitch to configure erspan
v1 and v2.  The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr is added
to uapi as a binary blob to support all ERSPAN v1 and v2
fields.  Note that the previous commit "openvswitch: Add erspan tunnel
support." was reverted because it was not designed properly.

Signed-off-by: William Tu <u9012...@gmail.com>
---
 include/uapi/linux/openvswitch.h |  2 +-
 net/openvswitch/flow_netlink.c   | 90 +++-
 2 files changed, 90 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index dcfab5e3b55c..158c2e45c0a5 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -273,7 +273,6 @@ enum {
 
 #define OVS_VXLAN_EXT_MAX (__OVS_VXLAN_EXT_MAX - 1)
 
-
 /* OVS_VPORT_ATTR_OPTIONS attributes for tunnels.
  */
 enum {
@@ -363,6 +362,7 @@ enum ovs_tunnel_key_attr {
OVS_TUNNEL_KEY_ATTR_IPV6_SRC,   /* struct in6_addr src IPv6 
address. */
OVS_TUNNEL_KEY_ATTR_IPV6_DST,   /* struct in6_addr dst IPv6 
address. */
OVS_TUNNEL_KEY_ATTR_PAD,
+   OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,/* struct erspan_metadata */
__OVS_TUNNEL_KEY_ATTR_MAX
 };
 
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index f143908b651d..9d00c24b2836 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -49,6 +49,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "flow_netlink.h"
 
@@ -329,7 +330,8 @@ size_t ovs_tun_key_attr_size(void)
+ nla_total_size(0)/* OVS_TUNNEL_KEY_ATTR_CSUM */
+ nla_total_size(0)/* OVS_TUNNEL_KEY_ATTR_OAM */
+ nla_total_size(256)  /* OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS */
-   /* OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS is mutually exclusive with
+   /* OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS and
+* OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS is mutually exclusive with
 * OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS and covered by it.
 */
+ nla_total_size(2)/* OVS_TUNNEL_KEY_ATTR_TP_SRC */
@@ -400,6 +402,7 @@ static const struct ovs_len_tbl 
ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1]
.next = ovs_vxlan_ext_key_lens 
},
[OVS_TUNNEL_KEY_ATTR_IPV6_SRC]  = { .len = sizeof(struct in6_addr) 
},
[OVS_TUNNEL_KEY_ATTR_IPV6_DST]  = { .len = sizeof(struct in6_addr) 
},
+   [OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS]   = { .len = OVS_ATTR_VARIABLE },
 };
 
 static const struct ovs_len_tbl
@@ -631,6 +634,33 @@ static int vxlan_tun_opt_from_nlattr(const struct nlattr 
*attr,
return 0;
 }
 
+static int erspan_tun_opt_from_nlattr(const struct nlattr *a,
+ struct sw_flow_match *match, bool is_mask,
+ bool log)
+{
+   unsigned long opt_key_offset;
+
+   BUILD_BUG_ON(sizeof(struct erspan_metadata) >
+sizeof(match->key->tun_opts));
+
+   if (nla_len(a) > sizeof(match->key->tun_opts)) {
+   OVS_NLERR(log, "ERSPAN option length err (len %d, max %zu).",
+ nla_len(a), sizeof(match->key->tun_opts));
+   return -EINVAL;
+   }
+
+   if (!is_mask)
+   SW_FLOW_KEY_PUT(match, tun_opts_len,
+   sizeof(struct erspan_metadata), false);
+   else
+   SW_FLOW_KEY_PUT(match, tun_opts_len, 0xff, true);
+
+   opt_key_offset = TUN_METADATA_OFFSET(nla_len(a));
+   SW_FLOW_KEY_MEMCPY_OFFSET(match, opt_key_offset, nla_data(a),
+ nla_len(a), is_mask);
+   return 0;
+}
+
 static int ip_tun_from_nlattr(const struct nlattr *attr,
  struct sw_flow_match *match, bool is_mask,
  bool log)
@@ -738,6 +768,20 @@ static int ip_tun_from_nlattr(const struct nlattr *attr,
break;
case OVS_TUNNEL_KEY_ATTR_PAD:
break;
+   case OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS:
+   if (opts_type) {
+   OVS_NLERR(log, "Multiple metadata blocks 
provided");
+   return -EINVAL;
+   }
+
+   err = erspan_tun_opt_from_nlattr(a, match, is_mask,
+log);
+   if (err)
+   return err;
+
+   tun_flags |= TUNNEL_ERSPAN_OPT;
+   opts_type = type;
+   break;
default:
OVS_NLERR(log, "Unknown IP tunnel attribute %d",
  type);
@@ -862,6 +906,10 @@ static int __

[PATCH net] net: erspan: fix use-after-free

2018-01-23 Thread William Tu

When building the erspan header for either v1 or v2, eth_hdr()
does not point to the inner packet's Ethernet header,
causing KASAN to report use-after-free and slab-out-of-bounds reads.
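
The difference between the two ways of locating the inner Ethernet
header can be shown with a toy model.  This is a sketch under
assumptions, not kernel code: struct toy_skb is a heavily reduced,
hypothetical stand-in for struct sk_buff, and the helpers only mirror
the idea behind eth_hdr() and skb->data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily reduced stand-in for struct sk_buff. */
struct toy_skb {
	uint8_t *head;       /* start of the buffer */
	uint8_t *data;       /* start of the current payload */
	uint16_t mac_header; /* offset from head, set by whoever parsed L2 */
};

/* Mirrors the idea of eth_hdr(): trust the recorded offset, which may
 * be stale or unset on some transmit paths.
 */
static uint8_t *toy_eth_hdr(const struct toy_skb *skb)
{
	assert(skb && skb->head);
	return skb->head + skb->mac_header;
}

/* Mirrors the fix: on transmit the inner frame starts where the
 * payload starts, whatever the recorded offset says.
 */
static uint8_t *toy_inner_eth(const struct toy_skb *skb)
{
	assert(skb && skb->data);
	return skb->data;
}
```

When the recorded offset does not match the payload start, the first
helper returns a pointer into unrelated memory, which is the kind of
access KASAN flags in the reports below.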

The patch fixes the following syzkaller issues:
[1] BUG: KASAN: slab-out-of-bounds in erspan_xmit+0x22d4/0x2430 
net/ipv4/ip_gre.c:735
[2] BUG: KASAN: slab-out-of-bounds in erspan_build_header+0x3bf/0x3d0 
net/ipv4/ip_gre.c:698
[3] BUG: KASAN: use-after-free in erspan_xmit+0x22d4/0x2430 
net/ipv4/ip_gre.c:735
[4] BUG: KASAN: use-after-free in erspan_build_header+0x3bf/0x3d0 
net/ipv4/ip_gre.c:698

[2] CPU: 0 PID: 3654 Comm: syzkaller377964 Not tainted 4.15.0-rc9+ #185
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:53
 print_address_description+0x73/0x250 mm/kasan/report.c:252
 kasan_report_error mm/kasan/report.c:351 [inline]
 kasan_report+0x25b/0x340 mm/kasan/report.c:409
 __asan_report_load_n_noabort+0xf/0x20 mm/kasan/report.c:440
 erspan_build_header+0x3bf/0x3d0 net/ipv4/ip_gre.c:698
 erspan_xmit+0x3b8/0x13b0 net/ipv4/ip_gre.c:740
 __netdev_start_xmit include/linux/netdevice.h:4042 [inline]
 netdev_start_xmit include/linux/netdevice.h:4051 [inline]
 packet_direct_xmit+0x315/0x6b0 net/packet/af_packet.c:266
 packet_snd net/packet/af_packet.c:2943 [inline]
 packet_sendmsg+0x3aed/0x60b0 net/packet/af_packet.c:2968
 sock_sendmsg_nosec net/socket.c:638 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:648
 SYSC_sendto+0x361/0x5c0 net/socket.c:1729
 SyS_sendto+0x40/0x50 net/socket.c:1697
 do_syscall_32_irqs_on arch/x86/entry/common.c:327 [inline]
 do_fast_syscall_32+0x3ee/0xf9d arch/x86/entry/common.c:389
 entry_SYSENTER_compat+0x54/0x63 arch/x86/entry/entry_64_compat.S:129
RIP: 0023:0xf7fcfc79
RSP: 002b:ffc6976c EFLAGS: 0286 ORIG_RAX: 0171
RAX: ffda RBX: 0004 RCX: 20011000
RDX:  RSI:  RDI: 20008000
RBP: 001c R08:  R09: 
R10:  R11:  R12: 
R13:  R14:  R15: 

Fixes: f551c91de262 ("net: erspan: introduce erspan v2 for ip_gre")
Fixes: 84e54fe0a5ea ("gre: introduce native tunnel support for ERSPAN")
Reported-by: syzbot+9723f2d288e49b492...@syzkaller.appspotmail.com
Reported-by: syzbot+f0ddeb2b032a8e1d9...@syzkaller.appspotmail.com
Reported-by: syzbot+f14b3703cd8d76702...@syzkaller.appspotmail.com
Reported-by: syzbot+eefa384efad8d7997...@syzkaller.appspotmail.com
Signed-off-by: William Tu <u9012...@gmail.com>
---
 include/net/erspan.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/net/erspan.h b/include/net/erspan.h
index acdf6843095d..712ea1b1f4db 100644
--- a/include/net/erspan.h
+++ b/include/net/erspan.h
@@ -123,7 +123,7 @@ static inline void erspan_build_header(struct sk_buff *skb,
__be32 id, u32 index,
bool truncate, bool is_ipv4)
 {
-   struct ethhdr *eth = eth_hdr(skb);
+   struct ethhdr *eth = (struct ethhdr *)skb->data;
enum erspan_encap_type enc_type;
struct erspan_base_hdr *ershdr;
struct erspan_metadata *ersmd;
@@ -190,7 +190,7 @@ static inline void erspan_build_header_v2(struct sk_buff 
*skb,
  __be32 id, u8 direction, u16 hwid,
  bool truncate, bool is_ipv4)
 {
-   struct ethhdr *eth = eth_hdr(skb);
+   struct ethhdr *eth = (struct ethhdr *)skb->data;
struct erspan_base_hdr *ershdr;
struct erspan_metadata *md;
struct qtag_prefix {
-- 
2.7.4

Re: KASAN: slab-out-of-bounds Read in erspan_xmit

2018-01-23 Thread William Tu

On Tue, Jan 23, 2018 at 11:45 AM, Dmitry Vyukov <dvyu...@google.com> wrote:
> On Tue, Jan 23, 2018 at 8:17 PM, William Tu <u9012...@gmail.com> wrote:
>> Thanks for the reply.
>>
>> On Tue, Jan 23, 2018 at 11:03 AM, Dmitry Vyukov <dvyu...@google.com> wrote:
>>> On Tue, Jan 23, 2018 at 7:58 PM, David Ahern <dsah...@gmail.com> wrote:
>>>> On 1/23/18 11:50 AM, William Tu wrote:
>>>>> Hi,
>>>>>
>>>>> I'm new to kasan and trying to follow this instruction to reproduce the 
>>>>> issue:
>>>>> https://github.com/google/syzkaller/blob/master/docs/executing_syzkaller_programs.md
>>>>>
>>>>> After re-compile my kernel with KASAN related config enable, I run
>>>>> $ ./syz-execprog -cover=0 -repeat=0 -procs=16 program
>>>>>
>>>>> I wonder does the "program" mean the repro.c.txt? or I should compile
>>>>> it to binary?
>>>>> # gcc -o program repro.c.txt
>>>>> # ./syz-execprog myprogram
>>>>> 2018/01/23 10:45:19 parsed 0 programs
>>>>>
>>>>> And how to use the "repro.syz.txt"?
>>>>> It seems to have some command like "syz_emit_ethernet" to generate packet.
>>>>> but I have no clue where to run it. Maybe I'm still missing something?
>>>>>
>>>>
>>>> In the past I have only compiled a kernel with KASAN, compiled the
>>>> reproducer program and run it in a VM. No need for the syzbot overhead.
>>>
>>> Yes, if the C program reproduces the crash then it's easier to use.
>>> repro.c.txt is the C program, you need to rename it to repro.c,
>>> compile with gcc and run just as ./a.out.
>>> But make sure that you have a gcc that supports KASAN (kernel build
>>> does not in the beginning on compiler not supporting KASAN). I think
>>> it's at least gcc 5+, but gcc 7+ would be better.
>>
>> I was using gcc 5+ and "gcc repro.c".
>> Running ./a.out does not show any issue on dmesg. Let me switch to gcc 7+.
>>
>>>
>>> You can also run the syzkaller reproducer as:
>>> ./syz-execprog -cover=0 -repeat=0 -procs=16 repro.syz.txt
>>
>> When using repro.syz.txt, which binary or what tests does it execute?
>
> It interprets the program in syzkaller notation in repro.syz.txt file.
> It should be more or less equivalent to the repro.c.txt C program in
> behavior.
>
Thanks! Now I can reproduce the issue.

Re: KASAN: slab-out-of-bounds Read in erspan_xmit

2018-01-23 Thread William Tu

Thanks for the reply.

On Tue, Jan 23, 2018 at 11:03 AM, Dmitry Vyukov <dvyu...@google.com> wrote:
> On Tue, Jan 23, 2018 at 7:58 PM, David Ahern <dsah...@gmail.com> wrote:
>> On 1/23/18 11:50 AM, William Tu wrote:
>>> Hi,
>>>
>>> I'm new to kasan and trying to follow this instruction to reproduce the 
>>> issue:
>>> https://github.com/google/syzkaller/blob/master/docs/executing_syzkaller_programs.md
>>>
>>> After re-compile my kernel with KASAN related config enable, I run
>>> $ ./syz-execprog -cover=0 -repeat=0 -procs=16 program
>>>
>>> I wonder does the "program" mean the repro.c.txt? or I should compile
>>> it to binary?
>>> # gcc -o program repro.c.txt
>>> # ./syz-execprog myprogram
>>> 2018/01/23 10:45:19 parsed 0 programs
>>>
>>> And how to use the "repro.syz.txt"?
>>> It seems to have some command like "syz_emit_ethernet" to generate packet.
>>> but I have no clue where to run it. Maybe I'm still missing something?
>>>
>>
>> In the past I have only compiled a kernel with KASAN, compiled the
>> reproducer program and run it in a VM. No need for the syzbot overhead.
>
> Yes, if the C program reproduces the crash then it's easier to use.
> repro.c.txt is the C program, you need to rename it to repro.c,
> compile with gcc and run just as ./a.out.
> But make sure that you have a gcc that supports KASAN (kernel build
> does not in the beginning on compiler not supporting KASAN). I think
> it's at least gcc 5+, but gcc 7+ would be better.

I was using gcc 5+ and built with "gcc repro.c".
Running ./a.out does not show any issue in dmesg. Let me switch to gcc 7+.

>
> You can also run the syzkaller reproducer as:
> ./syz-execprog -cover=0 -repeat=0 -procs=16 repro.syz.txt

When using repro.syz.txt, which binary or tests does it execute?
I didn't see it use or compile repro.c.txt,
but it seems to run something:
~/net-next# ./syz-execprog -cover=0 -repeat=0 -procs=2 repro.syz.txt
2018/01/23 11:15:24 parsed 1 programs
2018/01/23 11:15:24 executed programs: 0
2018/01/23 11:15:29 executed programs: 210
2018/01/23 11:15:34 executed programs: 422
..

Thanks
William

Re: KASAN: slab-out-of-bounds Read in erspan_xmit

2018-01-23 Thread William Tu

Hi,

I'm new to KASAN and am trying to follow these instructions to reproduce the issue:
https://github.com/google/syzkaller/blob/master/docs/executing_syzkaller_programs.md

After recompiling my kernel with the KASAN-related config options enabled, I ran
$ ./syz-execprog -cover=0 -repeat=0 -procs=16 program

Does "program" mean repro.c.txt, or should I compile
it to a binary first?
# gcc -o program repro.c.txt
# ./syz-execprog myprogram
2018/01/23 10:45:19 parsed 0 programs

And how do I use "repro.syz.txt"?
It seems to contain commands like "syz_emit_ethernet" to generate packets,
but I have no clue where to run them. Maybe I'm still missing something?

Thanks a lot
William

On Mon, Jan 22, 2018 at 2:57 PM, William Tu <u9012...@gmail.com> wrote:
> On Mon, Jan 22, 2018 at 2:45 PM, David Ahern <dsah...@gmail.com> wrote:
>> [ cc William Tu ]
>>
>> On 1/22/18 12:58 PM, syzbot wrote:
>>> Hello,
>>>
>>> syzbot hit the following crash on net-next commit
>>> 9d6474e458b13a94a0d5b141f2b8f38adf1991ae (Mon Jan 22 02:55:38 2018 +)
>>> tun: add missing rcu annotation
>>>
>>> So far this crash happened 5 times on net-next.
>>> C reproducer is attached.
>>> syzkaller reproducer is attached.
>>> Raw console output is attached.
>>> compiler: gcc (GCC) 7.1.1 20170620
>>> .config is attached.
>>>
>>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>>> Reported-by: syzbot+9723f2d288e49b492...@syzkaller.appspotmail.com
>>> It will help syzbot understand when the bug is fixed. See footer for
>>> details.
>>> If you forward the report, please keep this part and the footer.
>>>
>>> ==
>>> BUG: KASAN: slab-out-of-bounds in erspan_xmit+0x22d4/0x2430
>>> net/ipv4/ip_gre.c:735
>>> Read of size 2 at addr 8801c50bb08b by task syzkaller525754/3647
>>>
>>> CPU: 0 PID: 3647 Comm: syzkaller525754 Not tainted 4.15.0-rc8+ #203
>>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>>> Google 01/01/2011
>>> Call Trace:
>>>  __dump_stack lib/dump_stack.c:17 [inline]
>>>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>>>  print_address_description+0x73/0x250 mm/kasan/report.c:252
>>>  kasan_report_error mm/kasan/report.c:351 [inline]
>>>  kasan_report+0x25b/0x340 mm/kasan/report.c:409
>>>  __asan_report_load_n_noabort+0xf/0x20 mm/kasan/report.c:440
>>>  erspan_xmit+0x22d4/0x2430 net/ipv4/ip_gre.c:735
>>>  __netdev_start_xmit include/linux/netdevice.h:4053 [inline]
>>>  netdev_start_xmit include/linux/netdevice.h:4062 [inline]
>>>  packet_direct_xmit+0x3ad/0x790 net/packet/af_packet.c:267
>>>  packet_snd net/packet/af_packet.c:2944 [inline]
>>>  packet_sendmsg+0x3aed/0x60b0 net/packet/af_packet.c:2969
>>>  sock_sendmsg_nosec net/socket.c:630 [inline]
>>>  sock_sendmsg+0xca/0x110 net/socket.c:640
>>>  SYSC_sendto+0x361/0x5c0 net/socket.c:1721
>>>  SyS_sendto+0x40/0x50 net/socket.c:1689
>>>  entry_SYSCALL_64_fastpath+0x29/0xa0
>>> RIP: 0033:0x445649
>>> RSP: 002b:7ffe82dde5b8 EFLAGS: 0217 ORIG_RAX: 002c
>>> RAX: ffda RBX:  RCX: 00445649
>>> RDX:  RSI: 20003fd9 RDI: 0004
>>> RBP: 004a78c5 R08: 20008000 R09: 001c
>>> R10:  R11: 0217 R12: 00402720
>>> R13: 004027b0 R14:  R15: 
>>>
>>> Allocated by task 3221:
>>>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>>>  set_track mm/kasan/kasan.c:459 [inline]
>>>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
>>>  kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
>>>  kmem_cache_alloc+0x12e/0x760 mm/slab.c:3544
>>>  getname_flags+0xcb/0x580 fs/namei.c:138
>>>  getname+0x19/0x20 fs/namei.c:209
>>>  do_sys_open+0x2e7/0x6d0 fs/open.c:1053
>>>  SYSC_open fs/open.c:1077 [inline]
>>>  SyS_open+0x2d/0x40 fs/open.c:1072
>>>  entry_SYSCALL_64_fastpath+0x29/0xa0
>>>
>>> Freed by task 3221:
>>>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>>>  set_track mm/kasan/kasan.c:459 [inline]
>>>  kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
>>>  __cache_free mm/slab.c:3488 [inline]
>>>  kmem_cache_free+0x83/0x2a0 mm/slab.c:3746
>>>  putname+0xee/0x130 fs/namei.c:258
>>>  do_sys_open+0x31b/0x6d0 fs/open.c:1068

Re: KASAN: slab-out-of-bounds Read in erspan_xmit

2018-01-22 Thread William Tu

On Mon, Jan 22, 2018 at 2:45 PM, David Ahern <dsah...@gmail.com> wrote:
> [ cc William Tu ]
>
> On 1/22/18 12:58 PM, syzbot wrote:
>> Hello,
>>
>> syzbot hit the following crash on net-next commit
>> 9d6474e458b13a94a0d5b141f2b8f38adf1991ae (Mon Jan 22 02:55:38 2018 +)
>> tun: add missing rcu annotation
>>
>> So far this crash happened 5 times on net-next.
>> C reproducer is attached.
>> syzkaller reproducer is attached.
>> Raw console output is attached.
>> compiler: gcc (GCC) 7.1.1 20170620
>> .config is attached.
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+9723f2d288e49b492...@syzkaller.appspotmail.com
>> It will help syzbot understand when the bug is fixed. See footer for
>> details.
>> If you forward the report, please keep this part and the footer.
>>
>> ==
>> BUG: KASAN: slab-out-of-bounds in erspan_xmit+0x22d4/0x2430
>> net/ipv4/ip_gre.c:735
>> Read of size 2 at addr 8801c50bb08b by task syzkaller525754/3647
>>
>> CPU: 0 PID: 3647 Comm: syzkaller525754 Not tainted 4.15.0-rc8+ #203
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> Google 01/01/2011
>> Call Trace:
>>  __dump_stack lib/dump_stack.c:17 [inline]
>>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>>  print_address_description+0x73/0x250 mm/kasan/report.c:252
>>  kasan_report_error mm/kasan/report.c:351 [inline]
>>  kasan_report+0x25b/0x340 mm/kasan/report.c:409
>>  __asan_report_load_n_noabort+0xf/0x20 mm/kasan/report.c:440
>>  erspan_xmit+0x22d4/0x2430 net/ipv4/ip_gre.c:735
>>  __netdev_start_xmit include/linux/netdevice.h:4053 [inline]
>>  netdev_start_xmit include/linux/netdevice.h:4062 [inline]
>>  packet_direct_xmit+0x3ad/0x790 net/packet/af_packet.c:267
>>  packet_snd net/packet/af_packet.c:2944 [inline]
>>  packet_sendmsg+0x3aed/0x60b0 net/packet/af_packet.c:2969
>>  sock_sendmsg_nosec net/socket.c:630 [inline]
>>  sock_sendmsg+0xca/0x110 net/socket.c:640
>>  SYSC_sendto+0x361/0x5c0 net/socket.c:1721
>>  SyS_sendto+0x40/0x50 net/socket.c:1689
>>  entry_SYSCALL_64_fastpath+0x29/0xa0
>> RIP: 0033:0x445649
>> RSP: 002b:7ffe82dde5b8 EFLAGS: 0217 ORIG_RAX: 002c
>> RAX: ffda RBX:  RCX: 00445649
>> RDX:  RSI: 20003fd9 RDI: 0004
>> RBP: 004a78c5 R08: 20008000 R09: 001c
>> R10:  R11: 0217 R12: 00402720
>> R13: 004027b0 R14:  R15: 
>>
>> Allocated by task 3221:
>>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>>  set_track mm/kasan/kasan.c:459 [inline]
>>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
>>  kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
>>  kmem_cache_alloc+0x12e/0x760 mm/slab.c:3544
>>  getname_flags+0xcb/0x580 fs/namei.c:138
>>  getname+0x19/0x20 fs/namei.c:209
>>  do_sys_open+0x2e7/0x6d0 fs/open.c:1053
>>  SYSC_open fs/open.c:1077 [inline]
>>  SyS_open+0x2d/0x40 fs/open.c:1072
>>  entry_SYSCALL_64_fastpath+0x29/0xa0
>>
>> Freed by task 3221:
>>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>>  set_track mm/kasan/kasan.c:459 [inline]
>>  kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
>>  __cache_free mm/slab.c:3488 [inline]
>>  kmem_cache_free+0x83/0x2a0 mm/slab.c:3746
>>  putname+0xee/0x130 fs/namei.c:258
>>  do_sys_open+0x31b/0x6d0 fs/open.c:1068
>>  SYSC_open fs/open.c:1077 [inline]
>>  SyS_open+0x2d/0x40 fs/open.c:1072
>>  entry_SYSCALL_64_fastpath+0x29/0xa0
>>
>> The buggy address belongs to the object at 8801c50ba000
>>  which belongs to the cache names_cache of size 4096
>> The buggy address is located 139 bytes to the right of
>>  4096-byte region [8801c50ba000, 8801c50bb000)
>> The buggy address belongs to the page:
>> page:ea0007142e80 count:1 mapcount:0 mapping:8801c50ba000
>> index:0x0 compound_mapcount: 0
>> flags: 0x2fffc008100(slab|head)
>> raw: 02fffc008100 8801c50ba000  00010001
>> raw: ea0007145320 ea00071433a0 8801dae2c600 
>> page dumped because: kasan: bad access detected
>>
>> Memory state around the buggy address:
>>  8801c50baf80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>  8801c50bb000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>>> 8801c50bb080: fc fc fc fc fc fc fc fc fc fc fc

Re: [PATCHv4 net-next 2/2] openvswitch: add erspan version I and II support

2018-01-22 Thread William Tu

On Sat, Jan 20, 2018 at 9:52 PM, Pravin Shelar <pshe...@ovn.org> wrote:
> On Thu, Jan 18, 2018 at 2:04 PM, William Tu <u9012...@gmail.com> wrote:
>> The patch adds support for openvswitch to configure erspan
>> v1 and v2.  The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr is added
>> to uapi as a nested attribute to support all ERSPAN v1 and v2's
>> fields.  Note that the previous commit "openvswitch: Add erspan tunnel
>> support." was reverted because it was not designed properly.
>>
>> Signed-off-by: William Tu <u9012...@gmail.com>
>> ---
>>  include/uapi/linux/openvswitch.h |  11 +++
>>  net/openvswitch/flow_netlink.c   | 154 
>> ++-
>>  2 files changed, 164 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/uapi/linux/openvswitch.h 
>> b/include/uapi/linux/openvswitch.h
>> index dcfab5e3b55c..f1f98fd703fe 100644
>> --- a/include/uapi/linux/openvswitch.h
>> +++ b/include/uapi/linux/openvswitch.h
>> @@ -273,6 +273,16 @@ enum {
>>
>>  #define OVS_VXLAN_EXT_MAX (__OVS_VXLAN_EXT_MAX - 1)
>>
>> +enum {
>> +   OVS_ERSPAN_OPT_UNSPEC,
>> +   OVS_ERSPAN_OPT_IDX, /* u32 index */
>> +   OVS_ERSPAN_OPT_VER, /* u8 version number */
>> +   OVS_ERSPAN_OPT_DIR, /* u8 direction */
>> +   OVS_ERSPAN_OPT_HWID,/* u8 hardware ID */
>> +   __OVS_ERSPAN_OPT_MAX,
>> +};
>> +
>> +#define OVS_ERSPAN_OPT_MAX (__OVS_ERSPAN_OPT_MAX - 1)
>>
>>  /* OVS_VPORT_ATTR_OPTIONS attributes for tunnels.
>>   */
>> @@ -363,6 +373,7 @@ enum ovs_tunnel_key_attr {
>> OVS_TUNNEL_KEY_ATTR_IPV6_SRC,   /* struct in6_addr src IPv6 
>> address. */
>> OVS_TUNNEL_KEY_ATTR_IPV6_DST,   /* struct in6_addr dst IPv6 
>> address. */
>> OVS_TUNNEL_KEY_ATTR_PAD,
>> +   OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,/* Nested OVS_ERSPAN_OPT_* */
>> __OVS_TUNNEL_KEY_ATTR_MAX
>>  };
>>
> Rather than passing individual members of erspan_metadata, can you
> just pass whole structure between kernel and userspace. So that ovs
> kernel module can just handle all erspan options as one binary blob
> and does not need to interpret it.

That's a good idea. I will work on it for v5.
Thanks!
William

[PATCHv4 net-next 2/2] openvswitch: add erspan version I and II support

2018-01-18 Thread William Tu

The patch adds support for openvswitch to configure erspan
v1 and v2.  The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr is added
to uapi as a nested attribute to support all ERSPAN v1 and v2's
fields.  Note that the previous commit "openvswitch: Add erspan tunnel
support." was reverted because it was not designed properly.

Signed-off-by: William Tu <u9012...@gmail.com>
---
 include/uapi/linux/openvswitch.h |  11 +++
 net/openvswitch/flow_netlink.c   | 154 ++-
 2 files changed, 164 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index dcfab5e3b55c..f1f98fd703fe 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -273,6 +273,16 @@ enum {
 
 #define OVS_VXLAN_EXT_MAX (__OVS_VXLAN_EXT_MAX - 1)
 
+enum {
+   OVS_ERSPAN_OPT_UNSPEC,
+   OVS_ERSPAN_OPT_IDX, /* u32 index */
+   OVS_ERSPAN_OPT_VER, /* u8 version number */
+   OVS_ERSPAN_OPT_DIR, /* u8 direction */
+   OVS_ERSPAN_OPT_HWID,/* u8 hardware ID */
+   __OVS_ERSPAN_OPT_MAX,
+};
+
+#define OVS_ERSPAN_OPT_MAX (__OVS_ERSPAN_OPT_MAX - 1)
 
 /* OVS_VPORT_ATTR_OPTIONS attributes for tunnels.
  */
@@ -363,6 +373,7 @@ enum ovs_tunnel_key_attr {
OVS_TUNNEL_KEY_ATTR_IPV6_SRC,   /* struct in6_addr src IPv6 
address. */
OVS_TUNNEL_KEY_ATTR_IPV6_DST,   /* struct in6_addr dst IPv6 
address. */
OVS_TUNNEL_KEY_ATTR_PAD,
+   OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,/* Nested OVS_ERSPAN_OPT_* */
__OVS_TUNNEL_KEY_ATTR_MAX
 };
 
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index f143908b651d..c57b96b595b5 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -49,6 +49,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "flow_netlink.h"
 
@@ -329,7 +330,8 @@ size_t ovs_tun_key_attr_size(void)
+ nla_total_size(0)/* OVS_TUNNEL_KEY_ATTR_CSUM */
+ nla_total_size(0)/* OVS_TUNNEL_KEY_ATTR_OAM */
+ nla_total_size(256)  /* OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS */
-   /* OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS is mutually exclusive with
+   /* OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS and
+* OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS is mutually exclusive with
 * OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS and covered by it.
 */
+ nla_total_size(2)/* OVS_TUNNEL_KEY_ATTR_TP_SRC */
@@ -384,6 +386,13 @@ static const struct ovs_len_tbl 
ovs_vxlan_ext_key_lens[OVS_VXLAN_EXT_MAX + 1] =
[OVS_VXLAN_EXT_GBP] = { .len = sizeof(u32) },
 };
 
+static const struct ovs_len_tbl ovs_erspan_opt_lens[OVS_ERSPAN_OPT_MAX + 1] = {
+   [OVS_ERSPAN_OPT_IDX]= { .len = sizeof(u32) },
+   [OVS_ERSPAN_OPT_VER]= { .len = sizeof(u8) },
+   [OVS_ERSPAN_OPT_DIR]= { .len = sizeof(u8) },
+   [OVS_ERSPAN_OPT_HWID]   = { .len = sizeof(u8) },
+};
+
 static const struct ovs_len_tbl ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 
1] = {
[OVS_TUNNEL_KEY_ATTR_ID]= { .len = sizeof(u64) },
[OVS_TUNNEL_KEY_ATTR_IPV4_SRC]  = { .len = sizeof(u32) },
@@ -400,6 +409,8 @@ static const struct ovs_len_tbl 
ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1]
.next = ovs_vxlan_ext_key_lens 
},
[OVS_TUNNEL_KEY_ATTR_IPV6_SRC]  = { .len = sizeof(struct in6_addr) 
},
[OVS_TUNNEL_KEY_ATTR_IPV6_DST]  = { .len = sizeof(struct in6_addr) 
},
+   [OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS]   = { .len = OVS_ATTR_NESTED,
+   .next = ovs_erspan_opt_lens },
 };
 
 static const struct ovs_len_tbl
@@ -631,6 +642,94 @@ static int vxlan_tun_opt_from_nlattr(const struct nlattr 
*attr,
return 0;
 }
 
+static int erspan_tun_opt_from_nlattr(const struct nlattr *attr,
+ struct sw_flow_match *match, bool is_mask,
+ bool log)
+{
+   unsigned long opt_key_offset;
+   struct erspan_metadata opts;
+   struct nlattr *a;
+   u16 hwid, dir;
+   int rem;
+
+   BUILD_BUG_ON(sizeof(opts) > sizeof(match->key->tun_opts));
+
+   memset(&opts, 0, sizeof(opts));
+   nla_for_each_nested(a, attr, rem) {
+   int type = nla_type(a);
+
+   if (type > OVS_ERSPAN_OPT_MAX) {
+   OVS_NLERR(log, "ERSPAN option %d out of range max %d",
+ type, OVS_ERSPAN_OPT_MAX);
+   return -EINVAL;
+   }
+
+   if (!check_attr_len(nla_len(a),
+   ovs_erspan_opt_lens[type].len)) {
+   OVS_NLERR(log, "ERSPAN option %d has unexpected len %d 
expected %d",
+ type, nla_len(a),
+

[PATCHv4 net-next 1/2] net: erspan: use bitfield instead of mask and offset

2018-01-18 Thread William Tu

Originally the erspan fields were grouped into a single __be16 field
and accessed using masks and offsets.  This is more costly because of the
ntohs/htons calls.  The patch changes the definitions to use bitfields.

Signed-off-by: William Tu <u9012...@gmail.com>
---
 include/net/erspan.h | 127 ++-
 net/ipv4/ip_gre.c|  38 ++-
 net/ipv6/ip6_gre.c   |  36 ++-
 3 files changed, 121 insertions(+), 80 deletions(-)

diff --git a/include/net/erspan.h b/include/net/erspan.h
index acdf6843095d..2b75821e2ebe 100644
--- a/include/net/erspan.h
+++ b/include/net/erspan.h
@@ -65,16 +65,30 @@
 #define GRA_MASK   0x0006
 #define O_MASK 0x0001
 
+#define HWID_OFFSET4
+#define DIR_OFFSET 3
+
 /* ERSPAN version 2 metadata header */
 struct erspan_md2 {
__be32 timestamp;
__be16 sgt; /* security group tag */
-   __be16 flags;
-#define P_OFFSET   15
-#define FT_OFFSET  10
-#define HWID_OFFSET4
-#define DIR_OFFSET 3
-#define GRA_OFFSET 1
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+   __u8hwid_upper:2,
+   ft:5,
+   p:1;
+   __u8o:1,
+   gra:2,
+   dir:1,
+   hwid:4;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+   __u8p:1,
+   ft:5,
+   hwid_upper:2;
+   __u8hwid:4,
+   dir:1,
+   gra:2,
+   o:1;
+#endif
 };
 
 enum erspan_encap_type {
@@ -95,15 +109,62 @@ struct erspan_metadata {
 };
 
 struct erspan_base_hdr {
-   __be16 ver_vlan;
-#define VER_OFFSET  12
-   __be16 session_id;
-#define COS_OFFSET  13
-#define EN_OFFSET   11
-#define BSO_OFFSET  EN_OFFSET
-#define T_OFFSET10
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+   __u8vlan_upper:4,
+   ver:4;
+   __u8vlan:8;
+   __u8session_id_upper:2,
+   t:1,
+   en:2,
+   cos:3;
+   __u8session_id:8;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+   __u8ver: 4,
+   vlan_upper:4;
+   __u8vlan:8;
+   __u8cos:3,
+   en:2,
+   t:1,
+   session_id_upper:2;
+   __u8session_id:8;
+#else
+#error "Please fix "
+#endif
 };
 
+static inline void set_session_id(struct erspan_base_hdr *ershdr, u16 id)
+{
+   ershdr->session_id = id & 0xff;
+   ershdr->session_id_upper = (id >> 8) & 0x3;
+}
+
+static inline u16 get_session_id(const struct erspan_base_hdr *ershdr)
+{
+   return (ershdr->session_id_upper << 8) + ershdr->session_id;
+}
+
+static inline void set_vlan(struct erspan_base_hdr *ershdr, u16 vlan)
+{
+   ershdr->vlan = vlan & 0xff;
+   ershdr->vlan_upper = (vlan >> 8) & 0xf;
+}
+
+static inline u16 get_vlan(const struct erspan_base_hdr *ershdr)
+{
+   return (ershdr->vlan_upper << 8) + ershdr->vlan;
+}
+
+static inline void set_hwid(struct erspan_md2 *md2, u8 hwid)
+{
+   md2->hwid = hwid & 0xf;
+   md2->hwid_upper = (hwid >> 4) & 0x3;
+}
+
+static inline u8 get_hwid(const struct erspan_md2 *md2)
+{
+   return (md2->hwid_upper << 4) + md2->hwid;
+}
+
 static inline int erspan_hdr_len(int version)
 {
return sizeof(struct erspan_base_hdr) +
@@ -120,7 +181,7 @@ static inline u8 tos_to_cos(u8 tos)
 }
 
 static inline void erspan_build_header(struct sk_buff *skb,
-   __be32 id, u32 index,
+   u32 id, u32 index,
bool truncate, bool is_ipv4)
 {
struct ethhdr *eth = eth_hdr(skb);
@@ -154,12 +215,12 @@ static inline void erspan_build_header(struct sk_buff 
*skb,
memset(ershdr, 0, sizeof(*ershdr) + ERSPAN_V1_MDSIZE);
 
/* Build base header */
-   ershdr->ver_vlan = htons((vlan_tci & VLAN_MASK) |
-(ERSPAN_VERSION << VER_OFFSET));
-   ershdr->session_id = htons((u16)(ntohl(id) & ID_MASK) |
-  ((tos_to_cos(tos) << COS_OFFSET) & COS_MASK) |
-  (enc_type << EN_OFFSET & EN_MASK) |
-  ((truncate << T_OFFSET) & T_MASK));
+   ershdr->ver = ERSPAN_VERSION;
+   ershdr->cos = tos_to_cos(tos);
+   ershdr->en = enc_type;
+   ershdr->t = truncate;
+   set_vlan(ershdr, vlan_tci);
+   set_session_id(ershdr, id);
 
/* Build metadata */
ersmd = (struct erspan_metadata *)(ershdr + 1);
@@ -187,7 +248,7 @@ static inline __be32 erspan_get_timestamp(void)
 }
 
 static inline void erspan_build_header_v2(struct sk_buff *skb,
- __be32 id, u8 direction, u16 hwid,
+ u32 id, u8 direction, u16 hwid,
  bool trunca

[PATCHv4 net-next 0/2] net: erspan: add support for openvswitch

2018-01-18 Thread William Tu

The first patch refactors the erspan header definitions.
Originally, the erspan fields were grouped into a single __be16 field
and accessed using masks and offsets.  This is more costly, because of the
ntohs/htons calls, and error-prone.  The first patch changes the definitions
to use bitfields.  The second patch introduces a new OVS tunnel key attribute
to program both v1 and v2 erspan tunnels for openvswitch.

William Tu (2):
  net: erspan: use bitfield instead of mask and offset
  openvswitch: add erspan version I and II support

 include/net/erspan.h | 127 +++-
 include/uapi/linux/openvswitch.h |  11 +++
 net/ipv4/ip_gre.c|  38 --
 net/ipv6/ip6_gre.c   |  36 -
 net/openvswitch/flow_netlink.c   | 154 ++-
 5 files changed, 285 insertions(+), 81 deletions(-)

---
v3->v4
  change from be32 to u32 for OVS_ERSPAN_OPT_IDX, suggested by Jiri Benc.

v2->v3
  revert the "openvswitch: Add erspan tunnel support." commit ceaa001a170e.
  redesign the OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS as nested attribute

v1->v2
  Fix compatibility issue suggested by Pravin.
-- 
2.7.4
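
For readers comparing the two styles described in the cover letter, here is a
minimal userspace sketch, not the kernel code itself. It packs a 10-bit session
ID and a 3-bit COS value once with mask and offset (needing htons/ntohs) and
once with single-byte bitfields, modelled on the set_session_id/get_session_id
helpers in patch 1. All demo_* names and DEMO_* macros are hypothetical, and
the bitfield ordering assumes GCC or Clang, since the C standard leaves it to
the implementation.

```c
#include <arpa/inet.h>   /* htons, ntohs */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mask-and-offset style: COS and the 10-bit session ID share one
 * big-endian 16-bit word, so every access goes through htons()/ntohs(). */
#define DEMO_ID_MASK    0x03ff
#define DEMO_COS_OFFSET 13

static inline uint16_t demo_pack(uint16_t id, uint8_t cos)
{
	return htons((uint16_t)((id & DEMO_ID_MASK) |
				((cos & 0x7) << DEMO_COS_OFFSET)));
}

static inline uint16_t demo_unpack_id(uint16_t wire)
{
	return ntohs(wire) & DEMO_ID_MASK;
}

/* Bitfield style: single-byte fields, no byte swapping.  Assumption:
 * GCC/Clang allocate bitfields from the least significant bit on
 * little-endian targets and from the most significant bit otherwise. */
struct demo_hdr {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
	uint8_t session_id_upper:2, t:1, en:2, cos:3;
#else
	uint8_t cos:3, en:2, t:1, session_id_upper:2;
#endif
	uint8_t session_id;
};

static inline void demo_set_session_id(struct demo_hdr *h, uint16_t id)
{
	h->session_id = id & 0xff;              /* low 8 bits */
	h->session_id_upper = (id >> 8) & 0x3;  /* high 2 bits */
}

static inline uint16_t demo_get_session_id(const struct demo_hdr *h)
{
	return (uint16_t)((h->session_id_upper << 8) + h->session_id);
}

/* Returns 1 when both styles produce the same two bytes in memory. */
static inline int demo_same_wire_bytes(uint16_t id, uint8_t cos)
{
	struct demo_hdr h;
	uint16_t wire = demo_pack(id, cos);

	memset(&h, 0, sizeof(h));
	h.cos = cos & 0x7;
	demo_set_session_id(&h, id);
	assert(sizeof(h) == sizeof(wire));
	return memcmp(&h, &wire, sizeof(wire)) == 0;
}
```

Calling demo_same_wire_bytes(0x2ab, 5) from a test program is one way to check
that the two layouts agree on a given compiler before relying on the bitfield
form.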

Re: [PATCHv3 net-next 2/2] openvswitch: add erspan version I and II support

2018-01-18 Thread William Tu

Hi Jiri,

On Thu, Jan 18, 2018 at 2:13 AM, Jiri Benc <jb...@redhat.com> wrote:
> Looks much better.
>
> On Wed, 17 Jan 2018 09:32:51 -0800, William Tu wrote:
>> + OVS_ERSPAN_OPT_IDX, /* be32 index */
>
> Why don't you convert this to u32 while passing to/from user space?
>
>> + [OVS_ERSPAN_OPT_IDX]= { .len = sizeof(u32) },
>
> sizeof(__be32) but see above.
>
> Thanks,
>
>  Jiri

Thanks. Your suggestion makes sense.
Initially I just wanted to avoid another ntoh/hton conversion.
Since the ERSPAN support in iproute2 also assumes u32 to/from userspace,
I will change it here to use u32.

William
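
To illustrate the point discussed above, here is a rough userspace sketch of
converting at the boundary: user space passes the index as a host-order u32,
and the value is stored in network byte order on the other side. The struct
and the demo_* helpers are hypothetical stand-ins written for this note; they
are not the functions used in the patch.

```c
#include <arpa/inet.h>   /* htonl, ntohl */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for kernel-side metadata: the index is kept in
 * network byte order, as a kernel __be32 would be. */
struct demo_erspan_md {
	uint32_t index_be;
};

/* Attribute payload arriving from user space: a host-order u32. */
static inline void demo_md_from_attr(struct demo_erspan_md *md,
				     const void *payload)
{
	uint32_t host;

	memcpy(&host, payload, sizeof(host));  /* payload may be unaligned */
	md->index_be = htonl(host);            /* the one extra conversion */
}

/* Reverse direction: report the index back to user space in host order. */
static inline uint32_t demo_md_to_attr(const struct demo_erspan_md *md)
{
	return ntohl(md->index_be);
}
```

The cost is one htonl/ntohl pair per attribute, in exchange for a user-space
interface that does not expose byte order.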

[PATCHv3 net-next 1/2] net: erspan: use bitfield instead of mask and offset

2018-01-17 Thread William Tu

Originally the erspan fields were grouped into a single __be16 field
and accessed using masks and offsets.  This is more costly because of the
ntohs/htons calls.  The patch changes the definitions to use bitfields.

Signed-off-by: William Tu <u9012...@gmail.com>
---
 include/net/erspan.h | 127 ++-
 net/ipv4/ip_gre.c|  38 ++-
 net/ipv6/ip6_gre.c   |  36 ++-
 3 files changed, 121 insertions(+), 80 deletions(-)

diff --git a/include/net/erspan.h b/include/net/erspan.h
index acdf6843095d..2b75821e2ebe 100644
--- a/include/net/erspan.h
+++ b/include/net/erspan.h
@@ -65,16 +65,30 @@
 #define GRA_MASK   0x0006
 #define O_MASK 0x0001
 
+#define HWID_OFFSET4
+#define DIR_OFFSET 3
+
 /* ERSPAN version 2 metadata header */
 struct erspan_md2 {
__be32 timestamp;
__be16 sgt; /* security group tag */
-   __be16 flags;
-#define P_OFFSET   15
-#define FT_OFFSET  10
-#define HWID_OFFSET4
-#define DIR_OFFSET 3
-#define GRA_OFFSET 1
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+   __u8hwid_upper:2,
+   ft:5,
+   p:1;
+   __u8o:1,
+   gra:2,
+   dir:1,
+   hwid:4;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+   __u8p:1,
+   ft:5,
+   hwid_upper:2;
+   __u8hwid:4,
+   dir:1,
+   gra:2,
+   o:1;
+#endif
 };
 
 enum erspan_encap_type {
@@ -95,15 +109,62 @@ struct erspan_metadata {
 };
 
 struct erspan_base_hdr {
-   __be16 ver_vlan;
-#define VER_OFFSET  12
-   __be16 session_id;
-#define COS_OFFSET  13
-#define EN_OFFSET   11
-#define BSO_OFFSET  EN_OFFSET
-#define T_OFFSET10
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+   __u8vlan_upper:4,
+   ver:4;
+   __u8vlan:8;
+   __u8session_id_upper:2,
+   t:1,
+   en:2,
+   cos:3;
+   __u8session_id:8;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+   __u8ver: 4,
+   vlan_upper:4;
+   __u8vlan:8;
+   __u8cos:3,
+   en:2,
+   t:1,
+   session_id_upper:2;
+   __u8session_id:8;
+#else
+#error "Please fix "
+#endif
 };
 
+static inline void set_session_id(struct erspan_base_hdr *ershdr, u16 id)
+{
+   ershdr->session_id = id & 0xff;
+   ershdr->session_id_upper = (id >> 8) & 0x3;
+}
+
+static inline u16 get_session_id(const struct erspan_base_hdr *ershdr)
+{
+   return (ershdr->session_id_upper << 8) + ershdr->session_id;
+}
+
+static inline void set_vlan(struct erspan_base_hdr *ershdr, u16 vlan)
+{
+   ershdr->vlan = vlan & 0xff;
+   ershdr->vlan_upper = (vlan >> 8) & 0xf;
+}
+
+static inline u16 get_vlan(const struct erspan_base_hdr *ershdr)
+{
+   return (ershdr->vlan_upper << 8) + ershdr->vlan;
+}
+
+static inline void set_hwid(struct erspan_md2 *md2, u8 hwid)
+{
+   md2->hwid = hwid & 0xf;
+   md2->hwid_upper = (hwid >> 4) & 0x3;
+}
+
+static inline u8 get_hwid(const struct erspan_md2 *md2)
+{
+   return (md2->hwid_upper << 4) + md2->hwid;
+}
+
 static inline int erspan_hdr_len(int version)
 {
return sizeof(struct erspan_base_hdr) +
@@ -120,7 +181,7 @@ static inline u8 tos_to_cos(u8 tos)
 }
 
 static inline void erspan_build_header(struct sk_buff *skb,
-   __be32 id, u32 index,
+   u32 id, u32 index,
bool truncate, bool is_ipv4)
 {
struct ethhdr *eth = eth_hdr(skb);
@@ -154,12 +215,12 @@ static inline void erspan_build_header(struct sk_buff 
*skb,
memset(ershdr, 0, sizeof(*ershdr) + ERSPAN_V1_MDSIZE);
 
/* Build base header */
-   ershdr->ver_vlan = htons((vlan_tci & VLAN_MASK) |
-(ERSPAN_VERSION << VER_OFFSET));
-   ershdr->session_id = htons((u16)(ntohl(id) & ID_MASK) |
-  ((tos_to_cos(tos) << COS_OFFSET) & COS_MASK) |
-  (enc_type << EN_OFFSET & EN_MASK) |
-  ((truncate << T_OFFSET) & T_MASK));
+   ershdr->ver = ERSPAN_VERSION;
+   ershdr->cos = tos_to_cos(tos);
+   ershdr->en = enc_type;
+   ershdr->t = truncate;
+   set_vlan(ershdr, vlan_tci);
+   set_session_id(ershdr, id);
 
/* Build metadata */
ersmd = (struct erspan_metadata *)(ershdr + 1);
@@ -187,7 +248,7 @@ static inline __be32 erspan_get_timestamp(void)
 }
 
 static inline void erspan_build_header_v2(struct sk_buff *skb,
- __be32 id, u8 direction, u16 hwid,
+ u32 id, u8 direction, u16 hwid,
  bool trunca

[PATCHv3 net-next 2/2] openvswitch: add erspan version I and II support

2018-01-17 Thread William Tu

The patch adds support for openvswitch to configure erspan
v1 and v2.  The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr is added
to uapi as a nested attribute to support all ERSPAN v1 and v2's
fields.  Note that the previous commit "openvswitch: Add erspan tunnel
support." was reverted because it was not designed properly.

Signed-off-by: William Tu <u9012...@gmail.com>
---
 include/uapi/linux/openvswitch.h |  11 +++
 net/openvswitch/flow_netlink.c   | 154 ++-
 2 files changed, 164 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index dcfab5e3b55c..3b1950c59a0c 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -273,6 +273,16 @@ enum {
 
 #define OVS_VXLAN_EXT_MAX (__OVS_VXLAN_EXT_MAX - 1)
 
+enum {
+   OVS_ERSPAN_OPT_UNSPEC,
+   OVS_ERSPAN_OPT_IDX, /* be32 index */
+   OVS_ERSPAN_OPT_VER, /* u8 version number */
+   OVS_ERSPAN_OPT_DIR, /* u8 direction */
+   OVS_ERSPAN_OPT_HWID,/* u8 hardware ID */
+   __OVS_ERSPAN_OPT_MAX,
+};
+
+#define OVS_ERSPAN_OPT_MAX (__OVS_ERSPAN_OPT_MAX - 1)
 
 /* OVS_VPORT_ATTR_OPTIONS attributes for tunnels.
  */
@@ -363,6 +373,7 @@ enum ovs_tunnel_key_attr {
OVS_TUNNEL_KEY_ATTR_IPV6_SRC,   /* struct in6_addr src IPv6 
address. */
OVS_TUNNEL_KEY_ATTR_IPV6_DST,   /* struct in6_addr dst IPv6 
address. */
OVS_TUNNEL_KEY_ATTR_PAD,
+   OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,/* Nested OVS_ERSPAN_OPT_* */
__OVS_TUNNEL_KEY_ATTR_MAX
 };
 
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index f143908b651d..a5234ffc9e49 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -49,6 +49,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "flow_netlink.h"
 
@@ -329,7 +330,8 @@ size_t ovs_tun_key_attr_size(void)
+ nla_total_size(0)/* OVS_TUNNEL_KEY_ATTR_CSUM */
+ nla_total_size(0)/* OVS_TUNNEL_KEY_ATTR_OAM */
+ nla_total_size(256)  /* OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS */
-   /* OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS is mutually exclusive with
+   /* OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS and
+* OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS is mutually exclusive with
 * OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS and covered by it.
 */
+ nla_total_size(2)/* OVS_TUNNEL_KEY_ATTR_TP_SRC */
@@ -384,6 +386,13 @@ static const struct ovs_len_tbl 
ovs_vxlan_ext_key_lens[OVS_VXLAN_EXT_MAX + 1] =
[OVS_VXLAN_EXT_GBP] = { .len = sizeof(u32) },
 };
 
+static const struct ovs_len_tbl ovs_erspan_opt_lens[OVS_ERSPAN_OPT_MAX + 1] = {
+   [OVS_ERSPAN_OPT_IDX]= { .len = sizeof(u32) },
+   [OVS_ERSPAN_OPT_VER]= { .len = sizeof(u8) },
+   [OVS_ERSPAN_OPT_DIR]= { .len = sizeof(u8) },
+   [OVS_ERSPAN_OPT_HWID]   = { .len = sizeof(u8) },
+};
+
 static const struct ovs_len_tbl ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 
1] = {
[OVS_TUNNEL_KEY_ATTR_ID]= { .len = sizeof(u64) },
[OVS_TUNNEL_KEY_ATTR_IPV4_SRC]  = { .len = sizeof(u32) },
@@ -400,6 +409,8 @@ static const struct ovs_len_tbl 
ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1]
.next = ovs_vxlan_ext_key_lens 
},
[OVS_TUNNEL_KEY_ATTR_IPV6_SRC]  = { .len = sizeof(struct in6_addr) 
},
[OVS_TUNNEL_KEY_ATTR_IPV6_DST]  = { .len = sizeof(struct in6_addr) 
},
+   [OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS]   = { .len = OVS_ATTR_NESTED,
+   .next = ovs_erspan_opt_lens },
 };
 
 static const struct ovs_len_tbl
@@ -631,6 +642,94 @@ static int vxlan_tun_opt_from_nlattr(const struct nlattr 
*attr,
return 0;
 }
 
+static int erspan_tun_opt_from_nlattr(const struct nlattr *attr,
+ struct sw_flow_match *match, bool is_mask,
+ bool log)
+{
+   unsigned long opt_key_offset;
+   struct erspan_metadata opts;
+   struct nlattr *a;
+   u16 hwid, dir;
+   int rem;
+
+   BUILD_BUG_ON(sizeof(opts) > sizeof(match->key->tun_opts));
+
+   memset(&opts, 0, sizeof(opts));
+   nla_for_each_nested(a, attr, rem) {
+   int type = nla_type(a);
+
+   if (type > OVS_ERSPAN_OPT_MAX) {
+   OVS_NLERR(log, "ERSPAN option %d out of range max %d",
+ type, OVS_ERSPAN_OPT_MAX);
+   return -EINVAL;
+   }
+
+   if (!check_attr_len(nla_len(a),
+   ovs_erspan_opt_lens[type].len)) {
+   OVS_NLERR(log, "ERSPAN option %d has unexpected len %d 
expected %d",
+ type, nla_len(a),
+

[PATCHv3 net-next 0/2] net: erspan: add support for openvswitch

2018-01-17 Thread William Tu

The first patch refactors the erspan header definitions.
Originally, the erspan fields were packed as a group into a __be16 field
and accessed with masks and offsets.  This is error-prone and more costly
because of the ntohs/htons calls.  The first patch changes the definitions
to use bitfields.  The second patch introduces the new OVS tunnel key
attribute to program both v1 and v2 erspan tunnels for openvswitch.

William Tu (2):
  net: erspan: use bitfield instead of mask and offset
  openvswitch: add erspan version I and II support

 include/net/erspan.h | 127 +++-
 include/uapi/linux/openvswitch.h |  11 +++
 net/ipv4/ip_gre.c|  38 --
 net/ipv6/ip6_gre.c   |  36 -
 net/openvswitch/flow_netlink.c   | 154 ++-
 5 files changed, 285 insertions(+), 81 deletions(-)

---
v2->v3
  revert the "openvswitch: Add erspan tunnel support." commit ceaa001a170e.
  redesign the OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS as nested attribute

v1->v2
  Fix compatibility issue suggested by Pravin.
--
2.7.4

[PATCH net] Revert "openvswitch: Add erspan tunnel support."

2018-01-12 Thread William Tu

This reverts commit ceaa001a170e43608854d5290a48064f57b565ed.

The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr should be designed
as a nested attribute to support all of the ERSPAN v1 and v2 fields.
The current attr is a be32 supporting only one field.  Thus, this
patch reverts it, and a later patch will redo it using a nested attr.
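
To illustrate why a nested attribute scales where a single be32 does not,
here is a minimal userspace sketch of a netlink-style TLV layout: an outer
attribute whose payload is a sequence of inner attributes, one per ERSPAN
field.  It is not the kernel code.  The helper names (put_attr,
build_erspan_v2_opts) and the OPT_* constants are hypothetical and only
mirror the ordering of the OVS_ERSPAN_OPT_* enum; the NLA_F_NESTED flag
and message headers are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ATTR_HDRLEN 4u                       /* u16 length + u16 type */
#define ATTR_ALIGN(len) (((len) + 3u) & ~3u) /* attributes are 4-byte aligned */

/* Hypothetical stand-ins mirroring the order of OVS_ERSPAN_OPT_*. */
enum { OPT_UNSPEC, OPT_IDX, OPT_VER, OPT_DIR, OPT_HWID };

/* Append one attribute (header, payload, zero padding) at buf + off.
 * Header fields are written in host byte order, as netlink does.
 * Returns the offset just past the padded attribute.
 */
static size_t put_attr(uint8_t *buf, size_t off, uint16_t type,
                       const void *data, uint16_t len)
{
    uint16_t nla_len;

    assert(len <= 0xffffu - ATTR_HDRLEN);
    nla_len = (uint16_t)(ATTR_HDRLEN + len);

    memcpy(buf + off, &nla_len, sizeof(nla_len));
    memcpy(buf + off + 2, &type, sizeof(type));
    memcpy(buf + off + ATTR_HDRLEN, data, len);
    memset(buf + off + nla_len, 0, ATTR_ALIGN(nla_len) - nla_len);
    return off + ATTR_ALIGN(nla_len);
}

/* Build an outer attribute of type outer_type that nests the ERSPAN v2
 * fields (version, direction, hardware ID).  Returns the bytes written.
 */
static size_t build_erspan_v2_opts(uint8_t *buf, uint16_t outer_type,
                                   uint8_t dir, uint8_t hwid)
{
    uint8_t ver = 2;
    size_t off = ATTR_HDRLEN;   /* reserve room for the outer header */
    uint16_t total;

    off = put_attr(buf, off, OPT_VER, &ver, sizeof(ver));
    off = put_attr(buf, off, OPT_DIR, &dir, sizeof(dir));
    off = put_attr(buf, off, OPT_HWID, &hwid, sizeof(hwid));

    /* Now that the nested length is known, fill in the outer header. */
    total = (uint16_t)off;
    memcpy(buf, &total, sizeof(total));
    memcpy(buf + 2, &outer_type, sizeof(outer_type));
    return off;
}
```

Adding another field in this layout means appending one more inner
attribute, whereas the be32 form has no room for anything beyond the index.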

Signed-off-by: William Tu <u9012...@gmail.com>
Cc: Jiri Benc <jb...@redhat.com>
Cc: Pravin Shelar <pshe...@ovn.org>
---
 include/uapi/linux/openvswitch.h |  1 -
 net/openvswitch/flow_netlink.c   | 51 +---
 2 files changed, 1 insertion(+), 51 deletions(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 4265d7f9e1f2..dcfab5e3b55c 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -363,7 +363,6 @@ enum ovs_tunnel_key_attr {
OVS_TUNNEL_KEY_ATTR_IPV6_SRC,   /* struct in6_addr src IPv6 
address. */
OVS_TUNNEL_KEY_ATTR_IPV6_DST,   /* struct in6_addr dst IPv6 
address. */
OVS_TUNNEL_KEY_ATTR_PAD,
-   OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,/* be32 ERSPAN index. */
__OVS_TUNNEL_KEY_ATTR_MAX
 };
 
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 624ea74353dd..f143908b651d 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -49,7 +49,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include "flow_netlink.h"
 
@@ -334,8 +333,7 @@ size_t ovs_tun_key_attr_size(void)
 * OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS and covered by it.
 */
+ nla_total_size(2)/* OVS_TUNNEL_KEY_ATTR_TP_SRC */
-   + nla_total_size(2)/* OVS_TUNNEL_KEY_ATTR_TP_DST */
-   + nla_total_size(4);   /* OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS */
+   + nla_total_size(2);   /* OVS_TUNNEL_KEY_ATTR_TP_DST */
 }
 
 static size_t ovs_nsh_key_attr_size(void)
@@ -402,7 +400,6 @@ static const struct ovs_len_tbl 
ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1]
.next = ovs_vxlan_ext_key_lens 
},
[OVS_TUNNEL_KEY_ATTR_IPV6_SRC]  = { .len = sizeof(struct in6_addr) 
},
[OVS_TUNNEL_KEY_ATTR_IPV6_DST]  = { .len = sizeof(struct in6_addr) 
},
-   [OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS]   = { .len = sizeof(u32) },
 };
 
 static const struct ovs_len_tbl
@@ -634,33 +631,6 @@ static int vxlan_tun_opt_from_nlattr(const struct nlattr 
*attr,
return 0;
 }
 
-static int erspan_tun_opt_from_nlattr(const struct nlattr *attr,
- struct sw_flow_match *match, bool is_mask,
- bool log)
-{
-   unsigned long opt_key_offset;
-   struct erspan_metadata opts;
-
-   BUILD_BUG_ON(sizeof(opts) > sizeof(match->key->tun_opts));
-
-   memset(&opts, 0, sizeof(opts));
-   opts.index = nla_get_be32(attr);
-
-   /* Index has only 20-bit */
-   if (ntohl(opts.index) & ~INDEX_MASK) {
-   OVS_NLERR(log, "ERSPAN index number %x too large.",
- ntohl(opts.index));
-   return -EINVAL;
-   }
-
-   SW_FLOW_KEY_PUT(match, tun_opts_len, sizeof(opts), is_mask);
-   opt_key_offset = TUN_METADATA_OFFSET(sizeof(opts));
-   SW_FLOW_KEY_MEMCPY_OFFSET(match, opt_key_offset, &opts, sizeof(opts),
- is_mask);
-
-   return 0;
-}
-
 static int ip_tun_from_nlattr(const struct nlattr *attr,
  struct sw_flow_match *match, bool is_mask,
  bool log)
@@ -768,19 +738,6 @@ static int ip_tun_from_nlattr(const struct nlattr *attr,
break;
case OVS_TUNNEL_KEY_ATTR_PAD:
break;
-   case OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS:
-   if (opts_type) {
-   OVS_NLERR(log, "Multiple metadata blocks 
provided");
-   return -EINVAL;
-   }
-
-   err = erspan_tun_opt_from_nlattr(a, match, is_mask, 
log);
-   if (err)
-   return err;
-
-   tun_flags |= TUNNEL_ERSPAN_OPT;
-   opts_type = type;
-   break;
default:
OVS_NLERR(log, "Unknown IP tunnel attribute %d",
  type);
@@ -905,10 +862,6 @@ static int __ip_tun_to_nlattr(struct sk_buff *skb,
else if (output->tun_flags & TUNNEL_VXLAN_OPT &&
 vxlan_opt_to_nlattr(skb, tun_opts, swkey_tun_opts_len))
return -EMSGSIZE;
-   else if (output->tun_flags & TUNNEL_ERSPAN_OPT &&
-nla_put_be32(skb, OVS_TUNNEL_KEY

Re: [PATCHv2 net-next 2/2] openvswitch: add erspan version II support

2018-01-12 Thread William Tu

On Fri, Jan 12, 2018 at 10:39 AM, Pravin Shelar <pshe...@ovn.org> wrote:
> On Fri, Jan 12, 2018 at 12:27 AM, Jiri Benc <jb...@redhat.com> wrote:
>> On Thu, 11 Jan 2018 08:34:14 -0800, William Tu wrote:
>>> I'd also prefer reverting ceaa001a170e since it's more clean but I
>>> also hope to have this feature in 4.15.
>>> How long does reverting take? Am I only able to submit the new patch
>>> after the reverting is merged? Or I can submit revert and this new
>>> patch in one series? I have little experience in reverting, can you
>>> suggest which way is better?
>>
>> Send the revert for net (subject will be "[PATCH net] revert:
>> openvswitch: Add erspan tunnel support."). Don't forget to explain why
>> you're proposing a revert.
>>
>> After it is accepted and applied to net.git, wait until the patch
>> appears in net-next.git. It may take a little while. After that, send
>> the new patch(es) for net-next.
>>
>
> I agree. Once we have the V2 interface, the current ERSPAN interface
> is unlikely to be used by anyone, so it would be nice to get rid of the
> old interface while we can.

Thanks Jiri and Pravin.
I will send out the revert patch.
William

Re: [PATCHv2 net-next 2/2] openvswitch: add erspan version II support

2018-01-11 Thread William Tu

Hi Jiri,
Thanks a lot for the comments.

On Wed, Jan 10, 2018 at 2:02 PM, Jiri Benc  wrote:
> On Wed, 10 Jan 2018 22:35:14 +0100, Jiri Benc wrote:
>> The existing field must continue to work in the same way as before. It must
>> be accepted and *returned* by the kernel. You may add an additional field
>> but the existing behavior must be 100% preserved, both uABI and uAPI wise.
>
> Another way around this is reverting ceaa001a170e in net.git and
> designing the uAPI properly in net-next. I think that should be the
> preferred way, as ceaa001a170e is clearly wrong since you need to redo
> it after 3 months.

Commit ceaa001a170e was designed to configure the ERSPAN v1 fields only,
without considering the future need for more fields in ERSPAN v2.
This patch uses a nested attr to handle both v1 and v2.

>
> Not sure when Linus intends to release 4.15 and how much time you have
> for this, though.
>
>  Jiri

I'd also prefer reverting ceaa001a170e since it's more clean but I
also hope to have this feature in 4.15.
How long does reverting take? Am I only able to submit the new patch
after the reverting is merged? Or I can submit revert and this new
patch in one series? I have little experience in reverting, can you
suggest which way is better?

Thanks
William

[PATCHv2 net-next 2/2] openvswitch: add erspan version II support

2018-01-09 Thread William Tu

The patch adds support for configuring the erspan V2 fields for
openvswitch.  For compatibility reasons, the previously added
attribute 'OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS' is renamed to
'OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTSV1' and deprecated, and the newly added
attribute 'OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS' handles both V1 and V2.
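
The compatibility argument can be sketched with two toy enums.  These are
not the real kernel enums or values; the OLD_/NEW_ names are hypothetical.
The point shown is that the renamed attribute keeps the numeric slot of
the old one, so binaries built against the old header keep sending a
number the kernel still understands, while the nested attribute is
appended after it.

```c
#include <assert.h>

/* Toy "before" enum: the be32 attribute sits right after PAD. */
enum old_tunnel_key_attr {
    OLD_ATTR_PAD,
    OLD_ATTR_ERSPAN_OPTS,       /* be32 ERSPAN index */
    OLD_ATTR_COUNT
};

/* Toy "after" enum: the old slot is renamed, the nested one is appended. */
enum new_tunnel_key_attr {
    NEW_ATTR_PAD,
    NEW_ATTR_ERSPAN_OPTSV1,     /* same number as OLD_ATTR_ERSPAN_OPTS */
    NEW_ATTR_ERSPAN_OPTS,       /* nested options, new number */
    NEW_ATTR_COUNT
};

/* Fails at compile time if the deprecated attribute ever moves. */
static_assert((int)NEW_ATTR_ERSPAN_OPTSV1 == (int)OLD_ATTR_ERSPAN_OPTS,
              "deprecated attribute must keep its original number");
```

The source-level name changes meaning under this scheme, which is part of
what the review discussion in this thread objects to.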

Signed-off-by: William Tu <u9012...@gmail.com>
Cc: Pravin B Shelar <pshe...@ovn.org>
---
 include/uapi/linux/openvswitch.h |  13 +++-
 net/openvswitch/flow_netlink.c   | 129 ---
 2 files changed, 132 insertions(+), 10 deletions(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 4265d7f9e1f2..77c3424cc4ef 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -273,6 +273,16 @@ enum {
 
 #define OVS_VXLAN_EXT_MAX (__OVS_VXLAN_EXT_MAX - 1)
 
+enum {
+   OVS_ERSPAN_OPT_UNSPEC,
+   OVS_ERSPAN_OPT_IDX, /* be32 index */
+   OVS_ERSPAN_OPT_VER, /* u8 version number */
+   OVS_ERSPAN_OPT_DIR, /* u8 direction */
+   OVS_ERSPAN_OPT_HWID,/* u8 hardware ID */
+   __OVS_ERSPAN_OPT_MAX,
+};
+
+#define OVS_ERSPAN_OPT_MAX (__OVS_ERSPAN_OPT_MAX - 1)
 
 /* OVS_VPORT_ATTR_OPTIONS attributes for tunnels.
  */
@@ -363,7 +373,8 @@ enum ovs_tunnel_key_attr {
OVS_TUNNEL_KEY_ATTR_IPV6_SRC,   /* struct in6_addr src IPv6 
address. */
OVS_TUNNEL_KEY_ATTR_IPV6_DST,   /* struct in6_addr dst IPv6 
address. */
OVS_TUNNEL_KEY_ATTR_PAD,
-   OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,/* be32 ERSPAN index. */
+   OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTSV1,  /* be32 ERSPAN v1 index 
(deprecated). */
+   OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,/* Nested OVS_ERSPAN_OPT_* */
__OVS_TUNNEL_KEY_ATTR_MAX
 };
 
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index bce1f78b0de5..9c6b210e7893 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -335,7 +335,10 @@ size_t ovs_tun_key_attr_size(void)
 */
+ nla_total_size(2)/* OVS_TUNNEL_KEY_ATTR_TP_SRC */
+ nla_total_size(2)/* OVS_TUNNEL_KEY_ATTR_TP_DST */
-   + nla_total_size(4);   /* OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS */
+   + nla_total_size(4);   /* OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTSV1 */
+   /* OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS is mutually exclusive with
+* OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS and covered by it.
+*/
 }
 
 static size_t ovs_nsh_key_attr_size(void)
@@ -386,6 +389,13 @@ static const struct ovs_len_tbl 
ovs_vxlan_ext_key_lens[OVS_VXLAN_EXT_MAX + 1] =
[OVS_VXLAN_EXT_GBP] = { .len = sizeof(u32) },
 };
 
+static const struct ovs_len_tbl ovs_erspan_opt_lens[OVS_ERSPAN_OPT_MAX + 1] = {
+   [OVS_ERSPAN_OPT_IDX]= { .len = sizeof(u32) },
+   [OVS_ERSPAN_OPT_VER]= { .len = sizeof(u8) },
+   [OVS_ERSPAN_OPT_DIR]= { .len = sizeof(u8) },
+   [OVS_ERSPAN_OPT_HWID]   = { .len = sizeof(u8) },
+};
+
 static const struct ovs_len_tbl ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 
1] = {
[OVS_TUNNEL_KEY_ATTR_ID]= { .len = sizeof(u64) },
[OVS_TUNNEL_KEY_ATTR_IPV4_SRC]  = { .len = sizeof(u32) },
@@ -402,7 +412,9 @@ static const struct ovs_len_tbl 
ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1]
.next = ovs_vxlan_ext_key_lens 
},
[OVS_TUNNEL_KEY_ATTR_IPV6_SRC]  = { .len = sizeof(struct in6_addr) 
},
[OVS_TUNNEL_KEY_ATTR_IPV6_DST]  = { .len = sizeof(struct in6_addr) 
},
-   [OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS]   = { .len = sizeof(u32) },
+   [OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTSV1] = { .len = sizeof(u32) },
+   [OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS]   = { .len = OVS_ATTR_NESTED,
+   .next = ovs_erspan_opt_lens },
 };
 
 static const struct ovs_len_tbl
@@ -640,16 +652,78 @@ static int erspan_tun_opt_from_nlattr(const struct nlattr 
*attr,
 {
unsigned long opt_key_offset;
struct erspan_metadata opts;
+   struct nlattr *a;
+   u16 hwid, dir;
+   int rem;
 
BUILD_BUG_ON(sizeof(opts) > sizeof(match->key->tun_opts));
 
memset(&opts, 0, sizeof(opts));
-   opts.u.index = nla_get_be32(attr);
+   nla_for_each_nested(a, attr, rem) {
+   int type = nla_type(a);
+
+   if (type > OVS_ERSPAN_OPT_MAX) {
+   OVS_NLERR(log, "ERSPAN option %d out of range max %d",
+ type, OVS_ERSPAN_OPT_MAX);
+   return -EINVAL;
+   }
+
+   if (!check_attr_len(nla_len(a),
+   ovs_erspan_opt_lens[type].len)) {
+   OVS_NLERR(log, "ERSPAN option %d has unexpected len %d 
expected %d",
+

[PATCHv2 net-next 0/2] net: erspan: add support for openvswitch

2018-01-09 Thread William Tu

The first patch refactors the original erspan header definitions.
Originally, the erspan fields were packed as a group into a __be16 field
and accessed with masks and offsets.  This is error-prone and more costly
because of the ntohs/htons calls.  The first patch changes the definitions
to use bitfields.  The second patch introduces the new OVS tunnel key
attribute to program both v1 and v2 erspan tunnels for openvswitch.

William Tu (2):
  net: erspan: use bitfield instead of mask and offset
  openvswitch: add erspan version II support

 include/net/erspan.h | 127 --
 include/uapi/linux/openvswitch.h |  13 +++-
 net/ipv4/ip_gre.c|  38 +---
 net/ipv6/ip6_gre.c   |  36 ---
 net/openvswitch/flow_netlink.c   | 129 ---
 5 files changed, 253 insertions(+), 90 deletions(-)

--
v1->v2
  Fix compatibility issue suggested by Pravin.
--
2.7.4

[PATCHv2 net-next 1/2] net: erspan: use bitfield instead of mask and offset

2018-01-09 Thread William Tu

Originally, the erspan fields were packed as a group into a __be16 field
and accessed with masks and offsets.  This is more costly because of the
ntohs/htons calls.  The patch changes the definitions to use bitfields.
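
As a rough userspace illustration of the two styles, the sketch below
encodes the hardware ID and direction of the ERSPAN v2 flags word both
ways and lets the caller compare the resulting bytes.  It assumes a
GCC/Clang-style compiler where bitfield order follows byte order and
uint8_t bitfields are accepted.  HWID_MASK and DIR_MASK are assumed
values chosen to match HWID_OFFSET and DIR_OFFSET; the struct and
function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>  /* htons */

#define HWID_OFFSET 4
#define DIR_OFFSET  3
#define HWID_MASK   0x03f0  /* assumed: 6-bit hardware ID at bits 9..4 */
#define DIR_MASK    0x0008  /* assumed: 1-bit direction at bit 3 */

/* Bitfield view of the 16-bit flags word, in wire order. */
struct md2_flags_bits {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    uint8_t hwid_upper:2, ft:5, p:1;
    uint8_t o:1, gra:2, dir:1, hwid:4;
#else
    uint8_t p:1, ft:5, hwid_upper:2;
    uint8_t hwid:4, dir:1, gra:2, o:1;
#endif
};

static_assert(sizeof(struct md2_flags_bits) == 2,
              "flags must occupy exactly two bytes");

/* Old style: build a host-order word with shifts and masks, then swap. */
static void encode_masked(uint8_t out[2], uint8_t hwid, uint8_t dir)
{
    uint16_t flags = htons((uint16_t)(((hwid << HWID_OFFSET) & HWID_MASK) |
                                      ((dir << DIR_OFFSET) & DIR_MASK)));

    memcpy(out, &flags, sizeof(flags));
}

/* New style: assign the fields directly; no byte swap is needed. */
static void encode_bits(uint8_t out[2], uint8_t hwid, uint8_t dir)
{
    struct md2_flags_bits f;

    memset(&f, 0, sizeof(f));
    f.hwid = hwid & 0xf;                /* low four bits */
    f.hwid_upper = (hwid >> 4) & 0x3;   /* high two bits */
    f.dir = dir & 0x1;
    memcpy(out, &f, sizeof(f));
}
```

Because the hardware ID straddles a byte boundary, the bitfield form needs
the hwid/hwid_upper split, which is what the set_hwid()/get_hwid() helpers
in the patch hide from callers.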

Signed-off-by: William Tu <u9012...@gmail.com>
---
 include/net/erspan.h | 127 ++-
 net/ipv4/ip_gre.c|  38 ++-
 net/ipv6/ip6_gre.c   |  36 ++-
 3 files changed, 121 insertions(+), 80 deletions(-)

diff --git a/include/net/erspan.h b/include/net/erspan.h
index acdf6843095d..2b75821e2ebe 100644
--- a/include/net/erspan.h
+++ b/include/net/erspan.h
@@ -65,16 +65,30 @@
 #define GRA_MASK   0x0006
 #define O_MASK 0x0001
 
+#define HWID_OFFSET4
+#define DIR_OFFSET 3
+
 /* ERSPAN version 2 metadata header */
 struct erspan_md2 {
__be32 timestamp;
__be16 sgt; /* security group tag */
-   __be16 flags;
-#define P_OFFSET   15
-#define FT_OFFSET  10
-#define HWID_OFFSET4
-#define DIR_OFFSET 3
-#define GRA_OFFSET 1
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+   __u8hwid_upper:2,
+   ft:5,
+   p:1;
+   __u8o:1,
+   gra:2,
+   dir:1,
+   hwid:4;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+   __u8p:1,
+   ft:5,
+   hwid_upper:2;
+   __u8hwid:4,
+   dir:1,
+   gra:2,
+   o:1;
+#endif
 };
 
 enum erspan_encap_type {
@@ -95,15 +109,62 @@ struct erspan_metadata {
 };
 
 struct erspan_base_hdr {
-   __be16 ver_vlan;
-#define VER_OFFSET  12
-   __be16 session_id;
-#define COS_OFFSET  13
-#define EN_OFFSET   11
-#define BSO_OFFSET  EN_OFFSET
-#define T_OFFSET10
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+   __u8vlan_upper:4,
+   ver:4;
+   __u8vlan:8;
+   __u8session_id_upper:2,
+   t:1,
+   en:2,
+   cos:3;
+   __u8session_id:8;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+   __u8ver: 4,
+   vlan_upper:4;
+   __u8vlan:8;
+   __u8cos:3,
+   en:2,
+   t:1,
+   session_id_upper:2;
+   __u8session_id:8;
+#else
+#error "Please fix "
+#endif
 };
 
+static inline void set_session_id(struct erspan_base_hdr *ershdr, u16 id)
+{
+   ershdr->session_id = id & 0xff;
+   ershdr->session_id_upper = (id >> 8) & 0x3;
+}
+
+static inline u16 get_session_id(const struct erspan_base_hdr *ershdr)
+{
+   return (ershdr->session_id_upper << 8) + ershdr->session_id;
+}
+
+static inline void set_vlan(struct erspan_base_hdr *ershdr, u16 vlan)
+{
+   ershdr->vlan = vlan & 0xff;
+   ershdr->vlan_upper = (vlan >> 8) & 0xf;
+}
+
+static inline u16 get_vlan(const struct erspan_base_hdr *ershdr)
+{
+   return (ershdr->vlan_upper << 8) + ershdr->vlan;
+}
+
+static inline void set_hwid(struct erspan_md2 *md2, u8 hwid)
+{
+   md2->hwid = hwid & 0xf;
+   md2->hwid_upper = (hwid >> 4) & 0x3;
+}
+
+static inline u8 get_hwid(const struct erspan_md2 *md2)
+{
+   return (md2->hwid_upper << 4) + md2->hwid;
+}
+
 static inline int erspan_hdr_len(int version)
 {
return sizeof(struct erspan_base_hdr) +
@@ -120,7 +181,7 @@ static inline u8 tos_to_cos(u8 tos)
 }
 
 static inline void erspan_build_header(struct sk_buff *skb,
-   __be32 id, u32 index,
+   u32 id, u32 index,
bool truncate, bool is_ipv4)
 {
struct ethhdr *eth = eth_hdr(skb);
@@ -154,12 +215,12 @@ static inline void erspan_build_header(struct sk_buff 
*skb,
memset(ershdr, 0, sizeof(*ershdr) + ERSPAN_V1_MDSIZE);
 
/* Build base header */
-   ershdr->ver_vlan = htons((vlan_tci & VLAN_MASK) |
-(ERSPAN_VERSION << VER_OFFSET));
-   ershdr->session_id = htons((u16)(ntohl(id) & ID_MASK) |
-  ((tos_to_cos(tos) << COS_OFFSET) & COS_MASK) |
-  (enc_type << EN_OFFSET & EN_MASK) |
-  ((truncate << T_OFFSET) & T_MASK));
+   ershdr->ver = ERSPAN_VERSION;
+   ershdr->cos = tos_to_cos(tos);
+   ershdr->en = enc_type;
+   ershdr->t = truncate;
+   set_vlan(ershdr, vlan_tci);
+   set_session_id(ershdr, id);
 
/* Build metadata */
ersmd = (struct erspan_metadata *)(ershdr + 1);
@@ -187,7 +248,7 @@ static inline __be32 erspan_get_timestamp(void)
 }
 
 static inline void erspan_build_header_v2(struct sk_buff *skb,
- __be32 id, u8 direction, u16 hwid,
+ u32 id, u8 direction, u16 hwid,
  bool trunca
