Re: networking/ip-sysctl.txt: SRR or SSRR

2016-05-11 Thread Christian Kujau
On Tue, 10 May 2016, David Miller wrote:
> Every single variable name and function name dealing with this IP
> option uses "srr", therefore I don't think we should adjust the
> documentation either.
> 
> "srr" covers both the loose source route option and the strict source
> route option.

Now that I know where the 2nd "r" comes from, the "srr" makes sense. So 
yeah, SRR it is. Maybe a hint to the correct RFC would help, but since 
nobody else has complained yet, I guess it's not really worth the commit.

Thanks,
Christian.
-- 
BOFH excuse #255:

Standing room only on the bus.


Re: [net-next PATCH] udp: Resolve NULL pointer dereference

2016-05-11 Thread Cong Wang
On Wed, May 11, 2016 at 7:24 PM, Alexander Duyck  wrote:
> While testing an OpenStack configuration using VXLANs I saw the following
> call trace:
...
>
> I believe the trace is pointing to the call to dev_net(dev) in
> udp4_lib_lookup_skb.  If I am not mistaken I believe it is possible for us
> to have skb_dst(skb)->dev be NULL.  So to resolve that I am adding a check
> for this case and skipping the assignment if such an event occurs.

Shouldn't skb_dst() be NULL in this tunneling case?


Re: [PATCH net-next 2/2] net: cls_u32: Add support for skip-sw flag to tc u32 classifier.

2016-05-11 Thread David Miller
From: "Samudrala, Sridhar" 
Date: Wed, 11 May 2016 16:44:55 -0700

> 
> On 5/11/2016 4:23 PM, David Miller wrote:
>> This is a core semantic issue, and we have to make sure all amongst us
>> that we are all comfortable with exporting the offloadability controls
>> in the way you are implementing them.
> 
> I tried to implement the semantics based on an earlier discussion
> about these flags in this
> email thread.
> http://thread.gmane.org/gmane.linux.network/401733

Most developers reading your patches will not know to go to that URL,
nor that we even had that discussion at all.


Re: [PATCH net 0/2] bnxt_en: Add workaround to detect bad opaque in rx completion.

2016-05-11 Thread David Miller
From: Michael Chan 
Date: Tue, 10 May 2016 19:17:58 -0400

> 2-part workaround for this hardware bug.

Series applied.


Re: [patch -mainline] qlcnic: potential NULL dereference in qlcnic_83xx_get_minidump_template()

2016-05-11 Thread David Miller
From: Dan Carpenter 
Date: Tue, 10 May 2016 22:20:04 +0300

> If qlcnic_fw_cmd_get_minidump_temp() fails then "fw_dump->tmpl_hdr" is
> NULL or possibly freed.  It can lead to an oops later.
> 
> Fixes: d01a6d3c8ae1 ('qlcnic: Add support to enable capability to extend 
> minidump for iSCSI')
> Signed-off-by: Dan Carpenter 

Applied.


Re: [GIT] [4.6] NFC update

2016-05-11 Thread David Miller
From: Samuel Ortiz 
Date: Wed, 11 May 2016 11:11:28 +0200

> This is the first NFC pull request for 4.7. With this one we
> mainly have:

Pulled, thanks.


Re: [net-next PATCH] udp: Resolve NULL pointer dereference

2016-05-11 Thread Eric Dumazet
On Wed, 2016-05-11 at 19:24 -0700, Alexander Duyck wrote:
> While testing an OpenStack configuration using VXLANs I saw the following
> call trace:

> 
> I believe the trace is pointing to the call to dev_net(dev) in
> udp4_lib_lookup_skb.  If I am not mistaken I believe it is possible for us
> to have skb_dst(skb)->dev be NULL.  So to resolve that I am adding a check
> for this case and skipping the assignment if such an event occurs.

skb_dst(skb)->dev can be NULL ???

Why only UDP ipv4 would need a fix, and not ipv6 ?

Looks like the bug is somewhere else, maybe?





[net-next PATCH] udp: Resolve NULL pointer dereference

2016-05-11 Thread Alexander Duyck
While testing an OpenStack configuration using VXLANs I saw the following
call trace:

 RIP: 0010:[] udp4_lib_lookup_skb+0x49/0x80
 RSP: 0018:88103867bc50  EFLAGS: 00010286
 RAX: 88103269bf00 RBX: 88103269bf00 RCX: 
 RDX: 4300 RSI:  RDI: 880f2932e780
 RBP: 88103867bc60 R08:  R09: 9001a8c0
 R10: 4400 R11: 81333a58 R12: 880f2932e794
 R13: 0014 R14: 0014 R15: e8efbfd89ca0
 FS:  () GS:88103fd8() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 0488 CR3: 01c06000 CR4: 001426e0
 Stack:
  81576515 815733c0 88103867bc98 815fcc17
  88103269bf00 e8efbfd89ca0 0014 0080
  e8efbfd89ca0 88103867bcc8 815fcf8b 880f2932e794
 Call Trace:
  [] ? skb_checksum+0x35/0x50
  [] ? skb_push+0x40/0x40
  [] udp_gro_receive+0x57/0x130
  [] udp4_gro_receive+0x10b/0x2c0
  [] inet_gro_receive+0x1d3/0x270
  [] dev_gro_receive+0x269/0x3b0
  [] napi_gro_receive+0x38/0x120
  [] gro_cell_poll+0x57/0x80 [vxlan]
  [] net_rx_action+0x160/0x380
  [] __do_softirq+0xd7/0x2c5
  [] run_ksoftirqd+0x29/0x50
  [] smpboot_thread_fn+0x10f/0x160
  [] ? sort_range+0x30/0x30
  [] kthread+0xd8/0xf0
  [] ret_from_fork+0x22/0x40
  [] ? kthread_park+0x60/0x60

I believe the trace is pointing to the call to dev_net(dev) in
udp4_lib_lookup_skb.  If I am not mistaken, it is possible for us
to have skb_dst(skb)->dev be NULL.  So to resolve that, I am adding a check
for this case and skipping the assignment if such an event occurs.

Signed-off-by: Alexander Duyck 
---
 net/ipv4/udp.c |8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index f67f52ba4809..ff8d9ff3048b 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -613,8 +613,12 @@ struct sock *udp4_lib_lookup_skb(struct sk_buff *skb,
 __be16 sport, __be16 dport)
 {
const struct iphdr *iph = ip_hdr(skb);
-   const struct net_device *dev =
-   skb_dst(skb) ? skb_dst(skb)->dev : skb->dev;
+   const struct net_device *dev;
+
+   if (skb_dst(skb) && skb_dst(skb)->dev)
+   dev = skb_dst(skb)->dev;
+   else
+   dev = skb->dev;
 
return __udp4_lib_lookup(dev_net(dev), iph->saddr, sport,
 iph->daddr, dport, inet_iif(skb),



Re: [patch V4 23/31] ethernet: use parity8 in sun/niu.c

2016-05-11 Thread David Miller
From: zengzhao...@163.com
Date: Wed, 11 May 2016 17:22:17 +0800

> From: Zhaoxiu Zeng 
> 
> Signed-off-by: Zhaoxiu Zeng 
> Acked-by: Michal Nazarewicz 

Acked-by: David S. Miller 


Re: [patch V4 29/31] ethernet: use parity8 in broadcom/tg3.c

2016-05-11 Thread David Miller
From: zengzhao...@163.com
Date: Wed, 11 May 2016 17:24:33 +0800

> From: Zhaoxiu Zeng 
> 
> Signed-off-by: Zhaoxiu Zeng 
> Acked-by: Siva Reddy Kallam 

Acked-by: David S. Miller 


Re: [PATCH] r8169: default to 64-bit DMA on systems without memory below 4 GB

2016-05-11 Thread David Miller
From: Ard Biesheuvel 
Date: Wed, 11 May 2016 09:47:49 +0200

> The current logic around the 'use_dac' module parameter prevents the
> r8169 driver from being loadable on 64-bit systems without any RAM
> below 4 GB when the parameter is left at its default value.
> So introduce a new default value -1 which indicates that 64-bit DMA
> should be enabled implicitly, but only if setting a 32-bit DMA mask
> has failed earlier. This should prevent any regressions like the ones
> caused by previous attempts to change this code.
> 
> Cc: Realtek linux nic maintainers  
> Signed-off-by: Ard Biesheuvel 

I think we should just seriously consider changing the default; the
reasoning behind the current default setting is really outdated.  Maybe
relevant a decade ago, but probably not now.

And if the card is completely dysfunctional in said configuration, the
default is definitely wrong.


linux-next: manual merge of the net-next tree with the net tree

2016-05-11 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  net/ipv4/ip_gre.c

between commit:

  e271c7b4420d ("gre: do not keep the GRE header around in collect medata mode")

from the net tree and commit:

  244a797bdcf1 ("gre: move iptunnel_pull_header down to ipgre_rcv")

from the net-next tree.

I fixed it up (hopefully - see below) and can carry the fix as
necessary. This is now fixed as far as linux-next is concerned, but any
non trivial conflicts should be mentioned to your upstream maintainer
when your tree is submitted for merging.  You may also want to consider
cooperating with the maintainer of the conflicting tree to minimise any
particularly complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc net/ipv4/ip_gre.c
index 4cc84212cce1,2b267e71ebf5..4d1030739efa
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@@ -398,10 -272,11 +272,14 @@@ static int __ipgre_rcv(struct sk_buff *
  iph->saddr, iph->daddr, tpi->key);
  
if (tunnel) {
+   if (__iptunnel_pull_header(skb, hdr_len, tpi->proto,
+  raw_proto, false) < 0)
+   goto drop;
+ 
 -  skb_pop_mac_header(skb);
 +  if (tunnel->dev->type != ARPHRD_NONE)
 +  skb_pop_mac_header(skb);
 +  else
 +  skb_reset_mac_header(skb);
if (tunnel->collect_md) {
__be16 flags;
__be64 tun_id;


Re: [PATCH -next 3/4] net: w5100: increase TX timeout period

2016-05-11 Thread David Miller
From: Akinobu Mita 
Date: Wed, 11 May 2016 15:30:26 +0900

> This increases the TX timeout period from one second to 5 seconds, which
> is the default value defined in net/sched/sch_generic.c.
> 
> The one second timeout is too short for the W5100 with SPI interface mode,
> which doesn't support burst READ/WRITE processing in the SPI transfer.
> If a packet is transmitted while RX packets are being received at a
> very high rate, the TX transmission work in the workqueue is delayed
> and the watchdog timer expires.
> 
> Signed-off-by: Akinobu Mita 

It would be cleaner to just remove the assignment completely, and
let said net/sched/sch_generic.c code set the default for you.


Re: [PATCH net-next 2/2] net: cls_u32: Add support for skip-sw flag to tc u32 classifier.

2016-05-11 Thread Samudrala, Sridhar


On 5/11/2016 4:23 PM, David Miller wrote:
> From: Sridhar Samudrala 
> Date: Mon,  9 May 2016 12:18:44 -0700
>
>> On devices that support TC U32 offloads, this flag enables a filter to be
>> added only to HW. skip-sw and skip-hw are mutually exclusive flags. By
>> default without any flags, the filter is added to both HW and SW, but no
>> error checks are done in case of failure to add to HW. With skip-sw,
>> failure to add to HW is treated as an error.
>
> I really want you to provide a "[PATCH net-next 0/2]" header posting
> explaining what this series is doing, and why.

Sure. Will submit a v2 with a header patch in a day or so after waiting
for any other comments.

> This is a core semantic issue, and we have to make sure all amongst us
> that we are all comfortable with exporting the offloadability controls
> in the way you are implementing them.

I tried to implement the semantics based on an earlier discussion about
these flags in this email thread:
http://thread.gmane.org/gmane.linux.network/401733

> Also:
>
>> @@ -871,10 +889,15 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
>> 			return err;
>> 		}
>>
>> +		err = u32_replace_hw_knode(tp, new, flags);
>> +		if (err) {
>> +			u32_destroy_key(tp, new, false);
>> +			return err;
>> +		}
>> +
>> 		u32_replace_knode(tp, tp_c, new);
>> 		tcf_unbind_filter(tp, &n->res);
>> 		call_rcu(&n->rcu, u32_delete_key_rcu);
>> -		u32_replace_hw_knode(tp, new, flags);
>> 		return 0;
>> 	}
>
> Are you sure this reordering is OK?

I think so. This reordering is required to support the skip-sw semantic of
returning an error in case of failure to add to hardware.
It doesn't break the default semantics of adding to both hw and sw, as
u32_replace_hw_knode() will not return an error if skip-sw is not set.


Thanks
Sridhar


Re: [PATCH net-next 0/3] Mellanox 100G mlx5 CQE compression

2016-05-11 Thread David Miller
From: Saeed Mahameed 
Date: Wed, 11 May 2016 00:29:13 +0300

> Introducing the ConnectX-4 CQE (Completion Queue Entry) compression
> feature for the mlx5 ethernet driver.

Series applied, thanks.


Re: [PATCH v1 net-next 0/7] More enabler patches for DSA probing

2016-05-11 Thread David Miller
From: Andrew Lunn 
Date: Tue, 10 May 2016 23:27:18 +0200

> The complete set of patches for the reworked DSA probing is too big to
> post at once. This subset contains some enablers which are easy to
> review.
> 
> Eventually, the Marvell driver will instantiate its own internal MDIO
> bus, rather than have the framework do it, thus allowing devices on the
> bus to be listed in the device tree. Initialize the main mutex as soon
> as it is created, to avoid lifetime issues with the mdio bus.
> 
> A previous patch renamed all the DSA probe functions to make room for
> a true device probe. However the recent merging of all the Marvell
> switch drivers resulted in mv88e6xxx going back to the old probe
> name. Rename it again, so we can have a driver probe function.
> 
> Add minimum support for the Marvell switch driver to probe as an MDIO
> device, as well as a DSA driver. Later patches will then register
> this device with the new DSA core framework.
> 
> Move the GPIO reset code out of the DSA code. Different drivers may
> need different reset mechanisms, e.g. via a reset controller for
> memory mapped devices. Don't clutter up the core with this. Let each
> driver implement what it needs.
> 
> master_dev is no longer needed in the switch drivers, since they have
> access to a device pointer from the probe function. Remove it.
> 
> Let the switch parse the eeprom length from its own device tree
> node. This is required with the new binding when the central DSA
> platform device no longer exists.

Series applied, thanks Andrew.


Re: [PATCH net-next 1/2] net: dsa: mv88e6xxx: abstract VTU/STU data access

2016-05-11 Thread David Miller
From: Vivien Didelot 
Date: Tue, 10 May 2016 15:44:28 -0400

> Both VTU and STU operations use the same routine to access their
> (common) data registers, with a different offset.
> 
> Add VTU and STU specific read and write functions to the data registers
> to abstract the required offset.
> 
> Signed-off-by: Vivien Didelot 

Applied.


Re: [PATCH net-next v4 0/2] net: vrf: Fixup PKTINFO to return enslaved device index

2016-05-11 Thread David Miller
From: David Ahern 
Date: Tue, 10 May 2016 11:19:49 -0700

> Applications such as OSPF and BFD need the original ingress device, not
> the VRF device; the latter can be derived from the former. To that end
> move the packet intercept from an rx handler that is invoked by
> __netif_receive_skb_core to the ipv4 and ipv6 receive processing.
> 
> IPv6 already saves the skb_iif to the control buffer in ipv6_rcv. Since
> the skb->dev has not been switched, the cb has the enslaved device. Make
> the same happen for IPv4 by adding the skb_iif to inet_skb_parm and set
> it in ipv4 code after clearing the skb control buffer similar to IPv6.
> From there the pktinfo can just pull it from cb with the PKTINFO_SKB_CB
> cast.

Series applied.


Re: [PATCH net-next 2/2] net: dsa: mv88e6xxx: add STU capability

2016-05-11 Thread David Miller
From: Vivien Didelot 
Date: Tue, 10 May 2016 15:44:29 -0400

> Some switch models have a STU (per VLAN port state database). Add a new
> capability flag to the switch info, instead of checking the family.
> 
> Also if the 6165 family has an STU, it must have a VTU, so add the
> MV88E6XXX_FLAG_VTU to its family flags.
> 
> Signed-off-by: Vivien Didelot 

Applied.


Re: [PATCH net] drivers: net: Don't print unpopulated net_device name

2016-05-11 Thread David Miller
From: Harvey Hunt 
Date: Tue, 10 May 2016 17:43:21 +0100

> @@ -1686,8 +1686,7 @@ dm9000_probe(struct platform_device *pdev)
>   }
>  
>   if (!is_valid_ether_addr(ndev->dev_addr)) {
> - dev_warn(db->dev, "%s: Invalid ethernet MAC address. Please "
> -  "set using ifconfig\n", ndev->name);
> + dev_warn(db->dev, "Invalid ethernet MAC address. Please set 
> using ifconfig\n");
>  
>   eth_hw_addr_random(ndev);
>   mac_src = "random";

If we don't print the netdev name, it's harder for the user to see which
adapter has the problem.

Therefore, it is better if you save some boolean state into a local variable
here, then print the warning right after register_netdev().

Likewise for the rest of your changes too.


Re: [PATCH net-next] ipv6: fix 4in6 tunnel receive path

2016-05-11 Thread David Miller
From: Nicolas Dichtel 
Date: Tue, 10 May 2016 16:08:17 +0200

> Protocol for 4in6 tunnel is IPPROTO_IPIP. This was wrongly changed by
> the last cleanup.
> 
> CC: Tom Herbert 
> Fixes: 0d3c703a9d17 ("ipv6: Cleanup IPv6 tunnel receive path")
> Signed-off-by: Nicolas Dichtel 

Applied, thanks.


Re: [PATCH net-next 2/2] net: cls_u32: Add support for skip-sw flag to tc u32 classifier.

2016-05-11 Thread David Miller
From: Sridhar Samudrala 
Date: Mon,  9 May 2016 12:18:44 -0700

> On devices that support TC U32 offloads, this flag enables a filter to be
> added only to HW. skip-sw and skip-hw are mutually exclusive flags. By
> default without any flags, the filter is added to both HW and SW, but no
> error checks are done in case of failure to add to HW. With skip-sw,
> failure to add to HW is treated as an error.

I really want you to provide a "[PATCH net-next 0/2]" header posting
explaining what this series is doing, and why.

This is a core semantic issue, and we have to make sure all amongst us
that we are all comfortable with exporting the offloadability controls
in the way you are implementing them.

Also:

> @@ -871,10 +889,15 @@ static int u32_change(struct net *net, struct sk_buff 
> *in_skb,
>   return err;
>   }
>  
> + err = u32_replace_hw_knode(tp, new, flags);
> + if (err) {
> + u32_destroy_key(tp, new, false);
> + return err;
> + }
> +
>   u32_replace_knode(tp, tp_c, new);
> 	tcf_unbind_filter(tp, &n->res);
> 	call_rcu(&n->rcu, u32_delete_key_rcu);
> - u32_replace_hw_knode(tp, new, flags);
>   return 0;
>   }
>  

Are you sure this reordering is OK?


[PATCH net-next 1/1] tipc: eliminate risk of double link_up events

2016-05-11 Thread Jon Maloy
When an ACTIVATE or data packet is received on a link in state
ESTABLISHING, the link does not immediately change state to
ESTABLISHED, but instead returns a LINK_UP event to the caller,
which will execute the state change in a different lock context.

This non-atomic approach incurs a low risk that we may have two
LINK_UP events pending simultaneously for the same link, resulting
in the final part of the setup procedure being executed twice. The
only potential harm caused by this is that we may see two LINK_UP
events issued to subscribers of the topology server, something that
may cause confusion.

This commit eliminates this risk by checking if the link is already
up before proceeding with the second half of the setup.

Signed-off-by: Jon Maloy 
---
 net/tipc/node.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tipc/node.c b/net/tipc/node.c
index d903f56..e01e2c71 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -542,7 +542,7 @@ static void __tipc_node_link_up(struct tipc_node *n, int 
bearer_id,
struct tipc_link *ol = node_active_link(n, 0);
struct tipc_link *nl = n->links[bearer_id].link;
 
-   if (!nl)
+   if (!nl || tipc_link_is_up(nl))
return;
 
tipc_link_fsm_evt(nl, LINK_ESTABLISH_EVT);
-- 
1.9.1



Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-11 Thread Hannes Frederic Sowa
On 11.05.2016 16:45, Eric Dumazet wrote:
> On Wed, May 11, 2016 at 7:38 AM, Paolo Abeni  wrote:
> 
>> Uh, we have likely the same issue in the net_rx_action() function, which
>> also execute with bh disabled and check for jiffies changes even on
>> single core hosts ?!?
> 
> That is why we have a loop break after netdev_budget=300 packets.
> And a sysctl to eventually tune this.
> 
> Same issue for softirq handler, look at commit
> 34376a50fb1fa095b9d0636fa41ed2e73125f214
> 
> Your questions about this central piece of networking code are worrying.
> 
>>
>> Aren't jiffies updated by the timer interrupt ? and thus even with
>> bh_disabled ?!?
> 
> Exactly my point : jiffies won't be updated in your code, since you block BH.

To be fair, jiffies get updated in hardirq and not softirq context. The
fact that cond_resched_softirq() does not check for pending softirqs is
indeed a problem.

Thanks,
Hannes



Re: [PATCH v9 net-next 4/7] openvswitch: add layer 3 flow/port support

2016-05-11 Thread Simon Horman
On Wed, May 11, 2016 at 04:09:28PM +0200, Jiri Benc wrote:
> On Wed, 11 May 2016 12:06:35 +0900, Simon Horman wrote:
> > Is this close to what you had in mind?
> 
> Yes but see below.
> 
> > @@ -739,17 +729,17 @@ int ovs_flow_key_extract(const struct ip_tunnel_info 
> > *tun_info,
> > key->phy.skb_mark = skb->mark;
> > ovs_ct_fill_key(skb, key);
> > key->ovs_flow_hash = 0;
> > -   key->phy.is_layer3 = is_layer3;
> > +   key->phy.is_layer3 = (tun_info && skb->mac_len == 0);
> 
> Do we have to depend on tun_info? It would be nice to support all
> ARPHRD_NONE interfaces, not just tunnels. The tun interface (from
> the tuntap driver) comes to mind, for example.

Yes, I think that should work. I was just being cautious.

Do you think it is safe to detect TEB based on skb->protocol regardless
of the presence of tun_info?

> > +++ b/net/openvswitch/vport-netdev.c
> > @@ -60,7 +60,21 @@ static void netdev_port_receive(struct sk_buff *skb)
> > if (vport->dev->type == ARPHRD_ETHER) {
> > skb_push(skb, ETH_HLEN);
> > skb_postpush_rcsum(skb, skb->data, ETH_HLEN);
> > +   } else if (vport->dev->type == ARPHRD_NONE) {
> > +   if (skb->protocol == htons(ETH_P_TEB)) {
> > +   struct ethhdr *eth = eth_hdr(skb);
> > +
> > +   if (unlikely(skb->len < ETH_HLEN))
> > +   goto error;
> > +
> > +   skb->mac_len = ETH_HLEN;
> > +   if (eth->h_proto == htons(ETH_P_8021Q))
> > +   skb->mac_len += VLAN_HLEN;
> > +   } else {
> > +   skb->mac_len = 0;
> > +   }
> 
> Without putting much thought into this, could this perhaps be left for
> parse_ethertype (called from key_extract) to do?

I think I am confused.

I believe that key_extract() does already do all of the above (and more).

The purpose of the above change was to do this work here rather than
leaving it to parse_ethertype. This is because I was under the impression
that this is what you were after, specifically as a mechanism to avoid relying
on vport->dev->type in ovs_flow_key_extract.

If we can live with a bogus skb->mac_len value that is sufficient for
ovs_flow_key_extract() and set correctly by key_extract() (which happens
anyway), we could do something like this:

} else if (vport->dev->type == ARPHRD_NONE) {
	if (skb->protocol == htons(ETH_P_TEB))
		/* Ignores the presence of a VLAN tag, but is
		 * sufficient for ovs_flow_key_extract(), which then
		 * calls key_extract(), which calculates
		 * skb->mac_len correctly. */
		skb->mac_len = ETH_HLEN;
	else
		skb->mac_len = 0;
}


But perhaps I have missed the point somehow.


Re: [REGRESSION] asix: Lots of asix_rx_fixup() errors and slow transmissions

2016-05-11 Thread Dean Jenkins

Hi John,

I have purchased a "uGreen" USB Ethernet Adaptor which was reported as 
showing the issue:


lsusb shows:
ID 0b95:772b ASIX Electronics Corp. AX88772B

dmesg shows:
[119591.413298] usb 2-1: new high-speed USB device number 12 using ci_hdrc
[119591.576970] usb 2-1: New USB device found, idVendor=0b95, idProduct=772b
[119591.576994] usb 2-1: New USB device strings: Mfr=1, Product=2, 
SerialNumber=3

[119591.577010] usb 2-1: Product: AX88772C
[119591.577025] usb 2-1: Manufacturer: ASIX Elec. Corp.

Strangely, the product string says "AX88772C" while lsusb shows "AX88772B".

I used our ARM (32-bit, 2 core) board running our highly customised 3.14 
kernel and ran a ping test that slowly increments the ping payload size, 
thereby forcing the Ethernet frames to slowly extend in length, eventually 
forcing IPv4 fragmentation to occur due to the MTU limit of 1500. In my 
test the ICMP ping payload lengths ranged from 1 to 5000.


During the test run I saw (only 3 errors):
[27455.113010] asix 2-1:1.0 eth0: asix_rx_fixup() Data Header 
synchronisation was lost, remaining 23
[27455.113037] asix 2-1:1.0 eth0: asix_rx_fixup() Bad Header Length 
0x77767574, offset 4
[27456.113269] asix 2-1:1.0 eth0: asix_rx_fixup() Data Header 
synchronisation was lost, remaining 27
[27456.113329] asix 2-1:1.0 eth0: asix_rx_fixup() Bad Header Length 
0x77767574, offset 4
[27457.113271] asix 2-1:1.0 eth0: asix_rx_fixup() Data Header 
synchronisation was lost, remaining 30
[27457.113328] asix 2-1:1.0 eth0: asix_rx_fixup() Bad Header Length 
0x77767574, offset 4


This meets my expectation of "sync lost" followed immediately by "Bad 
Header Length". A close look at the timestamps shows gaps of around 20us 
to 50us, which suggests the code is processing the same URB, i.e. "sync 
lost" and "Bad Header Length" are written from the same instance of 
asix_rx_fixup_internal().


My example suggests that the previous URB went missing, so data was lost, 
causing a discontinuity in the data stream. This was the intended 
purpose of the commit: to prevent bad Ethernet frames being sent up the 
IP stack when a URB went missing. A bad Ethernet frame would otherwise 
be created by having the start of an Ethernet frame appended with data 
from the current URB, causing a corrupted Ethernet frame to be generated 
and sent up the IP stack.


Also the failure seems to be independent of the ping payload length but 
longer test periods of specific payload lengths would be needed to allow 
the 32 bit header word to move around relative to the start of the URB 
buffer.


In my example, the "Bad Header Length 0x77767574" is reading the ping 
payload data of 0x74, 0x75, 0x76, 0x77, which is located at the start of 
the URB buffer. The remaining values are low, at 23 to 30, which suggests 
the end of the Ethernet frame was in the missing URB. The ICMP ping data 
of 0x74, 0x75, 0x76, 0x77 is from the next Ethernet frame, meaning the end 
of the current Ethernet frame is missing and the next frame has a 
missing start of Ethernet frame.


Note that due to IPv4 fragmentation "consecutive" Ethernet frames will 
contain payloads of 1500 (MTU size) octets typically followed by a short 
Ethernet frame. The payloads are fragmented IP packets.




So I've been trying to add some print messages here to better
understand whats going on.

Again, I'm a bit new to this code, so forgive my lack of
understanding. Since the remaining value seems to be key, I
tried to look around and figure out where it was being set. It seems
like it's only set in this function, is that right?  So this made me
guess something might be happening in a previous iteration that was
causing this to trigger.

I added some debug prints to every time we set the remaining value, or
modify it, as well as to print the value if we enter the fixup
function with a non-zero remaining value.

When we set the remaining value, it's usually to 1514, when the skblen is 1518.

1514 is the Ethernet header length + payload at the MTU size of 1500. An 
skblen of 1518 is the 32-bit header word + the maximum Ethernet frame 
length (for your network).


However, right before we catch the problem, I see this:
I am guessing where your debug is located in the code, so I may have 
misinterpreted your information.



[   84.844337] JDB set remaining to 1514 (skblen: 1518)
This suggests one maximum-length Ethernet frame of 1514 octets in the URB 
buffer.

[   84.844379] JDB set remaining to 1514 (skblen: 1518)
[   84.844429] JDB set remaining to 1514 (skblen: 1518)
[   84.844458] JDB set remaining to 1514 (skblen: 1518)
[   84.844483] JDB set remaining to 1514 (skblen: 1518)
[   84.844507] JDB set remaining to 1514 (skblen: 1518)
[   84.844559] JDB set remaining to 1514 (skblen: 2048)
This URB probably has two Ethernet frames: one complete frame plus the 
start of the next Ethernet frame.


I think 2048 could be the maximum URB transfer length for the USB bulk 
transfer. 2048 seems to be a low value, so it should be investigated.



[   84.844583] JDB set 

Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-11 Thread Eric Dumazet
On Wed, 2016-05-11 at 08:55 +0200, Peter Zijlstra wrote:
> On Tue, May 10, 2016 at 03:51:37PM -0700, Eric Dumazet wrote:
> > diff --git a/kernel/softirq.c b/kernel/softirq.c
> > index 17caf4b63342..22463217e3cf 100644
> > --- a/kernel/softirq.c
> > +++ b/kernel/softirq.c
> > @@ -56,6 +56,7 @@ EXPORT_SYMBOL(irq_stat);
> >  static struct softirq_action softirq_vec[NR_SOFTIRQS] 
> > __cacheline_aligned_in_smp;
> >  
> >  DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
> > +DEFINE_PER_CPU(bool, ksoftirqd_scheduled);
> >  
> >  const char * const softirq_to_name[NR_SOFTIRQS] = {
> > "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
> > @@ -73,8 +74,10 @@ static void wakeup_softirqd(void)
> > /* Interrupts are disabled: no need to stop preemption */
> > struct task_struct *tsk = __this_cpu_read(ksoftirqd);
> >  
> > -   if (tsk && tsk->state != TASK_RUNNING)
> > +   if (tsk && tsk->state != TASK_RUNNING) {
> > +   __this_cpu_write(ksoftirqd_scheduled, true);
> > wake_up_process(tsk);
> 
> Since we're already looking at tsk->state, and the wake_up_process()
> ensures the thing becomes TASK_RUNNING, you could add:
> 
> static inline bool ksoftirqd_running(void)
> {
>   return __this_cpu_read(ksoftirqd)->state == TASK_RUNNING;
> }

Indeed, and the patch looks quite simple now ;)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b63342d7839528f367b283a386413b0362..23c364485d03618773c385d943c0ef39f5931d09 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -57,6 +57,11 @@ static struct softirq_action softirq_vec[NR_SOFTIRQS] 
__cacheline_aligned_in_smp
 
 DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
 
+static inline bool ksoftirqd_running(void)
+{
+   return __this_cpu_read(ksoftirqd)->state == TASK_RUNNING;
+}
+
 const char * const softirq_to_name[NR_SOFTIRQS] = {
"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
"TASKLET", "SCHED", "HRTIMER", "RCU"
@@ -313,7 +318,7 @@ asmlinkage __visible void do_softirq(void)
 
pending = local_softirq_pending();
 
-   if (pending)
+   if (pending && !ksoftirqd_running())
do_softirq_own_stack();
 
local_irq_restore(flags);
@@ -340,6 +345,9 @@ void irq_enter(void)
 
 static inline void invoke_softirq(void)
 {
+   if (ksoftirqd_running())
+   return;
+
if (!force_irqthreads) {
 #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
/*




Re: [PATCH] r8169: default to 64-bit DMA on systems without memory below 4 GB

2016-05-11 Thread Francois Romieu
Ard Biesheuvel  :
> On 11 May 2016 at 22:31, Francois Romieu  wrote:
[...]
> It has little to do with f*cking legacy 32-bits-only-devices if DRAM
> simply starts at 0x80__. This is on an AMD arm64 chip.

The lack of an IOMMU surprises me.

[...]
> OK, if you prefer. Should I send a v2?

Don't bother unless someone comes with a more substantial request.

-- 
Ueimor


Re: [PATCH net-next] phy: micrel: Use MICREL_PHY_ID_MASK definition

2016-05-11 Thread Andrew Lunn
On Wed, May 11, 2016 at 05:02:05PM -0300, Fabio Estevam wrote:
> From: Fabio Estevam 
> 
> Replace the hardcoded mask 0x00fffff0 with MICREL_PHY_ID_MASK for
> better readability.
> 
> Suggested-by: Andrew Lunn 
> Signed-off-by: Fabio Estevam 

Reviewed-by: Andrew Lunn 

Thanks
Andrew




Re: [PATCH net-next] phy: micrel: Use MICREL_PHY_ID_MASK definition

2016-05-11 Thread Florian Fainelli
On 05/11/2016 01:02 PM, Fabio Estevam wrote:
> From: Fabio Estevam 
> 
> Replace the hardcoded mask 0x00fffff0 with MICREL_PHY_ID_MASK for
> better readability.
> 
> Suggested-by: Andrew Lunn 
> Signed-off-by: Fabio Estevam 

Acked-by: Florian Fainelli 
-- 
Florian


Re: [PATCH] r8169: default to 64-bit DMA on systems without memory below 4 GB

2016-05-11 Thread Ard Biesheuvel
On 11 May 2016 at 22:31, Francois Romieu  wrote:
> Ard Biesheuvel  :
>> The current logic around the 'use_dac' module parameter prevents the
>> r8169 driver from being loadable on 64-bit systems without any RAM
>> below 4 GB when the parameter is left at its default value.
>> So introduce a new default value -1 which indicates that 64-bit DMA
>> should be enabled implicitly, but only if setting a 32-bit DMA mask
>> has failed earlier. This should prevent any regressions like the ones
>> caused by previous attempts to change this code.
>
> I am not a huge fan but if you really need it...
>
> Which current kernel arches do exhibit the interesting
> f*ck-legacy-32-bits-only-devices property you just described ?
>

It has little to do with f*cking legacy 32-bits-only-devices if DRAM
simply starts at 0x80__. This is on an AMD arm64 chip.

> [...]
>> diff --git a/drivers/net/ethernet/realtek/r8169.c 
>> b/drivers/net/ethernet/realtek/r8169.c
>> index 94f08f1e841c..a49e8a58e539 100644
>> --- a/drivers/net/ethernet/realtek/r8169.c
>> +++ b/drivers/net/ethernet/realtek/r8169.c
> [...]
>> @@ -859,7 +859,8 @@ struct rtl8169_private {
>>  MODULE_AUTHOR("Realtek and the Linux r8169 crew ");
>>  MODULE_DESCRIPTION("RealTek RTL-8169 Gigabit Ethernet driver");
>>  module_param(use_dac, int, 0);
>> -MODULE_PARM_DESC(use_dac, "Enable PCI DAC. Unsafe on 32 bit PCI slot.");
>> +MODULE_PARM_DESC(use_dac,
>> + "Enable PCI DAC. Unsafe on 32 bit PCI slot (default -1: enable on 
>> 64-bit archs only if needed");
>
> Nit: the parameter is bizarre enough that you could leave the original
> description.
>

OK, if you prefer. Should I send a v2?


Re: [PATCH] r8169: default to 64-bit DMA on systems without memory below 4 GB

2016-05-11 Thread Francois Romieu
Ard Biesheuvel  :
> The current logic around the 'use_dac' module parameter prevents the
> r8169 driver from being loadable on 64-bit systems without any RAM
> below 4 GB when the parameter is left at its default value.
> So introduce a new default value -1 which indicates that 64-bit DMA
> should be enabled implicitly, but only if setting a 32-bit DMA mask
> has failed earlier. This should prevent any regressions like the ones
> caused by previous attempts to change this code.

I am not a huge fan but if you really need it...

Which current kernel arches do exhibit the interesting
f*ck-legacy-32-bits-only-devices property you just described ?

[...]
> diff --git a/drivers/net/ethernet/realtek/r8169.c 
> b/drivers/net/ethernet/realtek/r8169.c
> index 94f08f1e841c..a49e8a58e539 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
[...]
> @@ -859,7 +859,8 @@ struct rtl8169_private {
>  MODULE_AUTHOR("Realtek and the Linux r8169 crew ");
>  MODULE_DESCRIPTION("RealTek RTL-8169 Gigabit Ethernet driver");
>  module_param(use_dac, int, 0);
> -MODULE_PARM_DESC(use_dac, "Enable PCI DAC. Unsafe on 32 bit PCI slot.");
> +MODULE_PARM_DESC(use_dac,
> + "Enable PCI DAC. Unsafe on 32 bit PCI slot (default -1: enable on 
> 64-bit archs only if needed");

Nit: the parameter is bizarre enough that you could leave the original
description.

-- 
Ueimor


Re: [PATCH 1/2] [v4] net: emac: emac gigabit ethernet controller driver

2016-05-11 Thread Timur Tabi

Timur Tabi wrote:

I think the problem is that the current driver seems to be too eager to
start/stop the MAC.

Please take a look at emac_work_thread_link_check() at
https://lkml.org/lkml/2016/4/13/670.  Every time the PHY link goes up,
it does this:


Never mind, I figured out the problem.  I still have a lot of work ahead 
of me, but at least I'm not stuck any more.


--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation collaborative project.


[PATCH] net: mvneta: bm: fix dependencies again

2016-05-11 Thread Arnd Bergmann
I tried to fix this before, but my previous fix was incomplete
and we can still get the same link error in randconfig builds
because of the way that Kconfig treats the

default y if MVNETA=y && MVNETA_BM_ENABLE

line, which does not actually trigger when MVNETA_BM_ENABLE=m,
unlike what I intended.
Changing the line to use MVNETA_BM_ENABLE!=n however has
the desired effect and hopefully makes all configurations
work as expected.

Signed-off-by: Arnd Bergmann 
Fixes: 019ded3aa7c9 ("net: mvneta: bm: clarify dependencies")
---
 drivers/net/ethernet/marvell/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/Kconfig 
b/drivers/net/ethernet/marvell/Kconfig
index b5c6d42daa12..2664827ddecd 100644
--- a/drivers/net/ethernet/marvell/Kconfig
+++ b/drivers/net/ethernet/marvell/Kconfig
@@ -68,7 +68,7 @@ config MVNETA
 
 config MVNETA_BM
tristate
-   default y if MVNETA=y && MVNETA_BM_ENABLE
+   default y if MVNETA=y && MVNETA_BM_ENABLE!=n
default MVNETA_BM_ENABLE
select HWBM
help
-- 
2.7.0



[PATCH net-next] phy: micrel: Use MICREL_PHY_ID_MASK definition

2016-05-11 Thread Fabio Estevam
From: Fabio Estevam 

Replace the hardcoded mask 0x00f0 with MICREL_PHY_ID_MASK for
better readability.

Suggested-by: Andrew Lunn 
Signed-off-by: Fabio Estevam 
---
 drivers/net/phy/micrel.c | 34 +-
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/drivers/net/phy/micrel.c b/drivers/net/phy/micrel.c
index 4516c8a..5a8fefc 100644
--- a/drivers/net/phy/micrel.c
+++ b/drivers/net/phy/micrel.c
@@ -726,7 +726,7 @@ static int kszphy_probe(struct phy_device *phydev)
 static struct phy_driver ksphy_driver[] = {
 {
.phy_id = PHY_ID_KS8737,
-   .phy_id_mask= 0x00f0,
+   .phy_id_mask= MICREL_PHY_ID_MASK,
.name   = "Micrel KS8737",
.features   = (PHY_BASIC_FEATURES | SUPPORTED_Pause),
.flags  = PHY_HAS_MAGICANEG | PHY_HAS_INTERRUPT,
@@ -781,7 +781,7 @@ static struct phy_driver ksphy_driver[] = {
.resume = genphy_resume,
 }, {
.phy_id = PHY_ID_KSZ8041,
-   .phy_id_mask= 0x00f0,
+   .phy_id_mask= MICREL_PHY_ID_MASK,
.name   = "Micrel KSZ8041",
.features   = (PHY_BASIC_FEATURES | SUPPORTED_Pause
| SUPPORTED_Asym_Pause),
@@ -800,7 +800,7 @@ static struct phy_driver ksphy_driver[] = {
.resume = genphy_resume,
 }, {
.phy_id = PHY_ID_KSZ8041RNLI,
-   .phy_id_mask= 0x00f0,
+   .phy_id_mask= MICREL_PHY_ID_MASK,
.name   = "Micrel KSZ8041RNLI",
.features   = PHY_BASIC_FEATURES |
  SUPPORTED_Pause | SUPPORTED_Asym_Pause,
@@ -819,7 +819,7 @@ static struct phy_driver ksphy_driver[] = {
.resume = genphy_resume,
 }, {
.phy_id = PHY_ID_KSZ8051,
-   .phy_id_mask= 0x00f0,
+   .phy_id_mask= MICREL_PHY_ID_MASK,
.name   = "Micrel KSZ8051",
.features   = (PHY_BASIC_FEATURES | SUPPORTED_Pause
| SUPPORTED_Asym_Pause),
@@ -857,7 +857,7 @@ static struct phy_driver ksphy_driver[] = {
 }, {
.phy_id = PHY_ID_KSZ8081,
.name   = "Micrel KSZ8081 or KSZ8091",
-   .phy_id_mask= 0x00f0,
+   .phy_id_mask= MICREL_PHY_ID_MASK,
.features   = (PHY_BASIC_FEATURES | SUPPORTED_Pause),
.flags  = PHY_HAS_MAGICANEG | PHY_HAS_INTERRUPT,
.driver_data= _type,
@@ -875,7 +875,7 @@ static struct phy_driver ksphy_driver[] = {
 }, {
.phy_id = PHY_ID_KSZ8061,
.name   = "Micrel KSZ8061",
-   .phy_id_mask= 0x00f0,
+   .phy_id_mask= MICREL_PHY_ID_MASK,
.features   = (PHY_BASIC_FEATURES | SUPPORTED_Pause),
.flags  = PHY_HAS_MAGICANEG | PHY_HAS_INTERRUPT,
.config_init= kszphy_config_init,
@@ -909,7 +909,7 @@ static struct phy_driver ksphy_driver[] = {
.write_mmd_indirect = ksz9021_wr_mmd_phyreg,
 }, {
.phy_id = PHY_ID_KSZ9031,
-   .phy_id_mask= 0x00f0,
+   .phy_id_mask= MICREL_PHY_ID_MASK,
.name   = "Micrel KSZ9031 Gigabit PHY",
.features   = (PHY_GBIT_FEATURES | SUPPORTED_Pause),
.flags  = PHY_HAS_MAGICANEG | PHY_HAS_INTERRUPT,
@@ -926,7 +926,7 @@ static struct phy_driver ksphy_driver[] = {
.resume = genphy_resume,
 }, {
.phy_id = PHY_ID_KSZ8873MLL,
-   .phy_id_mask= 0x00f0,
+   .phy_id_mask= MICREL_PHY_ID_MASK,
.name   = "Micrel KSZ8873MLL Switch",
.features   = (SUPPORTED_Pause | SUPPORTED_Asym_Pause),
.flags  = PHY_HAS_MAGICANEG,
@@ -940,7 +940,7 @@ static struct phy_driver ksphy_driver[] = {
.resume = genphy_resume,
 }, {
.phy_id = PHY_ID_KSZ886X,
-   .phy_id_mask= 0x00f0,
+   .phy_id_mask= MICREL_PHY_ID_MASK,
.name   = "Micrel KSZ886X Switch",
.features   = (PHY_BASIC_FEATURES | SUPPORTED_Pause),
.flags  = PHY_HAS_MAGICANEG | PHY_HAS_INTERRUPT,
@@ -962,17 +962,17 @@ MODULE_LICENSE("GPL");
 
 static struct mdio_device_id __maybe_unused micrel_tbl[] = {
{ PHY_ID_KSZ9021, 0x000e },
-   { PHY_ID_KSZ9031, 0x00f0 },
+   { PHY_ID_KSZ9031, MICREL_PHY_ID_MASK },
{ PHY_ID_KSZ8001, 0x00ff },
-   { PHY_ID_KS8737, 0x00f0 },
+   { PHY_ID_KS8737, MICREL_PHY_ID_MASK },
{ PHY_ID_KSZ8021, 0x00ff },
{ PHY_ID_KSZ8031, 0x00ff },
-   { PHY_ID_KSZ8041, 0x00f0 },
-   { PHY_ID_KSZ8051, 0x00f0 },
-   { PHY_ID_KSZ8061, 0x00f0 },
-   { PHY_ID_KSZ8081, 0x00f0 },
-   { PHY_ID_KSZ8873MLL, 0x00f0 },
-   { PHY_ID_KSZ886X, 0x00f0 },
+   { PHY_ID_KSZ8041, MICREL_PHY_ID_MASK },
+   { PHY_ID_KSZ8051, 

Fw: [GIT] Networking

2016-05-11 Thread David Miller

Sorry, forgot to CC: the lists on the initial send.
--- Begin Message ---

Hopefully the last round of fixes this release, fingers crossed :)

1) Initialize static nf_conntrack_locks_all_lock properly, from
   Florian Westphal.

2) Need to cancel pending work when destroying IDLETIMER entries, from
   Liping Zhang.

3) Fix TX param usage when sending TSO over iwlwifi devices, from
   Emmanuel Grumbach.

4) NFACCT quota params not validated properly, from Phil Turnbull.

5) Resolve more glibc vs. kernel header conflicts, from Mikko Tapeli.

6) Missing IRQ free in ravb_close(), from Geert Uytterhoeven.

7) Fix infoleak in x25, from Kangjie Lu.

8) Similarly in thunderx driver, from Heinrich Schuchardt.

9) tc_ife.h uapi header not exported properly, from Jamal Hadi Salim.

10) Don't reenable PHY interrupts if device is in polling mode, from
Shaohui Xie.

11) Packet scheduler actions late binding was not being handled properly
at all, from Jamal Hadi Salim.

12) Fix binding of conntrack entries to helpers in openvswitch, from
Joe Stringer.

Please pull, thanks a lot!

The following changes since commit b507146bb6b9ac0c0197100ba3e299825a21fed3:

  Merge branch 'linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 (2016-05-09 
12:24:19 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 

for you to fetch changes up to e271c7b4420ddbb9fae82a2b31a5ab3edafcf4fe:

  gre: do not keep the GRE header around in collect metadata mode (2016-05-11 
15:16:32 -0400)


David S. Miller (4):
  Merge git://git.kernel.org/.../pablo/nf
  Merge tag 'wireless-drivers-for-davem-2016-05-09' of 
git://git.kernel.org/.../kvalo/wireless-drivers
  Merge branch 'nps_enet-fixes'
  Merge branch 'net-sched-fixes'

Elad Kanfi (2):
  net: nps_enet: Tx handler synchronization
  net: nps_enet: bug fix - handle lost tx interrupts

Emmanuel Grumbach (1):
  iwlwifi: mvm: don't override the rate with the AMSDU len

Eric Dumazet (1):
  tcp: refresh skb timestamp at retransmit time

Florian Westphal (1):
  netfilter: conntrack: init all_locks to avoid debug warning

Geert Uytterhoeven (1):
  ravb: Add missing free_irq() call to ravb_close()

Jamal Hadi Salim (7):
  export tc ife uapi header
  net sched: vlan action fix late binding
  net sched: ipt action fix late binding
  net sched: mirred action fix late binding
  net sched: simple action fix late binding
  net sched: skbedit action fix late binding
  net sched: ife action fix late binding

Jiri Benc (1):
  gre: do not keep the GRE header around in collect metadata mode

Joe Stringer (1):
  openvswitch: Fix cached ct with helper.

Kalle Valo (1):
  Merge tag 'iwlwifi-for-kalle-2016-05-04' of 
https://git.kernel.org/.../iwlwifi/iwlwifi-fixes

Kangjie Lu (1):
  net: fix a kernel infoleak in x25 module

Liping Zhang (1):
  netfilter: IDLETIMER: fix race condition when destroy the target

Mikko Rapeli (1):
  uapi glibc compat: fix compile errors when glibc net/if.h included before 
linux/if.h

Phil Turnbull (1):
  netfilter: nfnetlink_acct: validate NFACCT_QUOTA parameter

Shaohui Xie (1):
  net: phylib: fix interrupts re-enablement in phy_start

xypron.g...@gmx.de (1):
  net: thunderx: avoid exposing kernel stack

 drivers/net/ethernet/cavium/thunder/nicvf_queues.c |  4 
 drivers/net/ethernet/ezchip/nps_enet.c | 30 
--
 drivers/net/ethernet/ezchip/nps_enet.h |  2 --
 drivers/net/ethernet/renesas/ravb_main.c   |  2 ++
 drivers/net/phy/phy.c  |  8 +---
 drivers/net/wireless/intel/iwlwifi/mvm/tx.c| 83 
---
 include/uapi/linux/if.h| 28 

 include/uapi/linux/libc-compat.h   | 44 

 include/uapi/linux/tc_act/Kbuild   |  1 +
 net/ipv4/ip_gre.c  |  7 ++-
 net/ipv4/tcp_output.c  |  6 --
 net/netfilter/nf_conntrack_core.c  |  2 +-
 net/netfilter/nfnetlink_acct.c |  2 ++
 net/netfilter/xt_IDLETIMER.c   |  1 +
 net/openvswitch/conntrack.c| 13 +
 net/sched/act_ife.c| 14 ++
 net/sched/act_ipt.c| 19 ---
 net/sched/act_mirred.c | 19 +--
 net/sched/act_simple.c | 18 --
 net/sched/act_skbedit.c| 18 +++---
 net/sched/act_vlan.c   | 22 

Re: [PATCH v4 net-next] tcp: replace cnt & rtt with struct in pkts_acked()

2016-05-11 Thread David Miller
From: Lawrence Brakmo 
Date: Wed, 11 May 2016 10:02:13 -0700

> Replace 2 arguments (cnt and rtt) in the congestion control modules'
> pkts_acked() function with a struct. This will allow adding more
> information without having to modify existing congestion control
> modules (tcp_nv in particular needs bytes in flight when packet
> was sent).
> 
> As proposed by Neal Cardwell in his comments to the tcp_nv patch.
> 
> Signed-off-by: Lawrence Brakmo 
> Acked-by: Yuchung Cheng 

Looks a lot better, applied, thanks!


Re: [PATCH net] gre: do not keep the GRE header around in collect metadata mode

2016-05-11 Thread David Miller
From: Jiri Benc 
Date: Wed, 11 May 2016 15:53:57 +0200

> For ipgre interface in collect metadata mode, it doesn't make sense for the
> interface to be of ARPHRD_IPGRE type. The outer header of received packets
> is not needed, as all the information from it is present in metadata_dst. We
> already don't set ipgre_header_ops for collect metadata interfaces, which is
> the only consumer of mac_header pointing to the outer IP header.
> 
> Just set the interface type to ARPHRD_NONE in collect metadata mode for
> ipgre (not gretap, that still correctly stays ARPHRD_ETHER) and reset
> mac_header.
> 
> Fixes: a64b04d86d14 ("gre: do not assign header_ops in collect metadata mode")
> Fixes: 2e15ea390e6f4 ("ip_gre: Add support to collect tunnel metadata.")
> Signed-off-by: Jiri Benc 

Applied, thanks Jiri.


Re: [PATCHv2 net] openvswitch: Fix cached ct with helper.

2016-05-11 Thread David Miller
From: Joe Stringer 
Date: Wed, 11 May 2016 10:29:26 -0700

> When using conntrack helpers from OVS, a common configuration is to
> perform a lookup without specifying a helper, then go through a
> firewalling policy, only to decide to attach a helper afterwards.
> 
> In this case, the initial lookup will cause a ct entry to be attached to
> the skb, then the later commit with helper should attach the helper and
> confirm the connection. However, the helper attachment has been missing.
> If the user has enabled automatic helper attachment, then this issue
> will be masked as it will be applied in init_conntrack(). It is also
> masked if the action is executed from ovs_packet_cmd_execute() as that
> will construct a fresh skb.
> 
> This patch fixes the issue by making an explicit call to try to assign
> the helper if there is a discrepancy between the action's helper and the
> current skb->nfct.
> 
> Fixes: cae3a2627520 ("openvswitch: Allow attaching helpers to ct action")
> Signed-off-by: Joe Stringer 
> ---
> v2: Only apply to connections that we will commit.

Applied and queued up for -stable, thanks.


Re: [v2] rtlwifi: pci: use dev_kfree_skb_irq instead of kfree_skb inrtl_pci_reset_trx_ring

2016-05-11 Thread Kalle Valo
wang yanqing  wrote:
> We can't use kfree_skb in irq-disabled context. Because spin_lock_irqsave
> makes sure we are always in irq-disabled context, using dev_kfree_skb_irq
> instead of kfree_skb is better than dev_kfree_skb_any.
> 
> This patch fix below kernel warning:
> [ 7612.095528] [ cut here ]
> [ 7612.095546] WARNING: CPU: 3 PID: 4460 at kernel/softirq.c:150 
> __local_bh_enable_ip+0x58/0x80()
> [ 7612.095550] Modules linked in: rtl8723be x86_pkg_temp_thermal btcoexist 
> rtl_pci rtlwifi rtl8723_common
> [ 7612.095567] CPU: 3 PID: 4460 Comm: ifconfig Tainted: GW   
> 4.4.0+ #4
> [ 7612.095570] Hardware name: LENOVO 20DFA04FCD/20DFA04FCD, BIOS J5ET48WW 
> (1.19 ) 08/27/2015
> [ 7612.095574]    da37fc70 c12ce7c5  da37fca0 
> c104cc59 c19d4454
> [ 7612.095584]  0003 116c c19d4784 0096 c10508a8 c10508a8 
> 0200 c1b42400
> [ 7612.095594]  f29be780 da37fcb0 c104ccad 0009  da37fcbc 
> c10508a8 f21f08b8
> [ 7612.095604] Call Trace:
> [ 7612.095614]  [] dump_stack+0x41/0x5c
> [ 7612.095620]  [] warn_slowpath_common+0x89/0xc0
> [ 7612.095628]  [] ? __local_bh_enable_ip+0x58/0x80
> [ 7612.095634]  [] ? __local_bh_enable_ip+0x58/0x80
> [ 7612.095640]  [] warn_slowpath_null+0x1d/0x20
> [ 7612.095646]  [] __local_bh_enable_ip+0x58/0x80
> [ 7612.095653]  [] destroy_conntrack+0x64/0xa0
> [ 7612.095660]  [] nf_conntrack_destroy+0xf/0x20
> [ 7612.095665]  [] skb_release_head_state+0x55/0xa0
> [ 7612.095670]  [] skb_release_all+0xb/0x20
> [ 7612.095674]  [] __kfree_skb+0xb/0x60
> [ 7612.095679]  [] kfree_skb+0x30/0x70
> [ 7612.095686]  [] ? rtl_pci_reset_trx_ring+0x22d/0x370 [rtl_pci]
> [ 7612.095692]  [] rtl_pci_reset_trx_ring+0x22d/0x370 [rtl_pci]
> [ 7612.095698]  [] rtl_pci_start+0x19/0x190 [rtl_pci]
> [ 7612.095705]  [] rtl_op_start+0x56/0x90 [rtlwifi]
> [ 7612.095712]  [] drv_start+0x36/0xc0
> [ 7612.095717]  [] ieee80211_do_open+0x2d3/0x890
> [ 7612.095725]  [] ? call_netdevice_notifiers_info+0x2e/0x60
> [ 7612.095730]  [] ieee80211_open+0x4d/0x50
> [ 7612.095736]  [] __dev_open+0xa3/0x130
> [ 7612.095742]  [] ? _raw_spin_unlock_bh+0x13/0x20
> [ 7612.095748]  [] __dev_change_flags+0x89/0x140
> [ 7612.095753]  [] ? selinux_capable+0xd/0x10
> [ 7612.095759]  [] dev_change_flags+0x29/0x60
> [ 7612.095765]  [] devinet_ioctl+0x553/0x670
> [ 7612.095772]  [] ? _copy_to_user+0x28/0x40
> [ 7612.095777]  [] inet_ioctl+0x85/0xb0
> [ 7612.095783]  [] sock_ioctl+0x67/0x260
> [ 7612.095788]  [] ? sock_fasync+0x80/0x80
> [ 7612.095795]  [] do_vfs_ioctl+0x6b/0x550
> [ 7612.095800]  [] ? selinux_file_ioctl+0x102/0x1e0
> [ 7612.095807]  [] ? timekeeping_suspend+0x294/0x320
> [ 7612.095813]  [] ? __hrtimer_run_queues+0x14a/0x210
> [ 7612.095820]  [] ? security_file_ioctl+0x34/0x50
> [ 7612.095827]  [] SyS_ioctl+0x70/0x80
> [ 7612.095832]  [] do_fast_syscall_32+0x84/0x120
> [ 7612.095839]  [] sysenter_past_esp+0x36/0x55
> [ 7612.095844] ---[ end trace 97e9c637a20e8348 ]---
> 
> Signed-off-by: Wang YanQing 
> Cc: Stable 
> Acked-by: Larry Finger 

Thanks, 1 patch applied to wireless-drivers-next.git:

cf968937d277 rtlwifi: pci: use dev_kfree_skb_irq instead of kfree_skb in 
rtl_pci_reset_trx_ring

-- 
Sent by pwcli
https://patchwork.kernel.org/patch/9034801/



Re: [v2] rtlwifi: Remove double check for cnt_after_linked

2016-05-11 Thread Kalle Valo
wang yanqing  wrote:
> rtl_lps_enter does two successive checks for cnt_after_linked
> to make sure some time has elapsed after linking. The second
> check isn't necessary, because if cnt_after_linked is bigger
> than 5, it is of course also bigger than 2!
> 
> This patch removes the second check.
> 
> Signed-off-by: Wang YanQing 

Thanks, 1 patch applied to wireless-drivers-next.git:

976aff5fc94b rtlwifi: Remove double check for cnt_after_linked

-- 
Sent by pwcli
https://patchwork.kernel.org/patch/9025161/



Re: rtlwifi: rtl818x: silence uninitialized variable warning

2016-05-11 Thread Kalle Valo
Dan Carpenter  wrote:
> What about if "rtlphy->pwrgroup_cnt" is 2?  In that case we would use an
> uninitialized "chnlgroup" variable and probably crash.  Maybe that can't
> happen for some reason which is not obvious but in that case this patch
> is harmless.
> 
> Setting it to zero seems like a standard default in the surrounding code
> so it's probably fine here as well.
> 
> Signed-off-by: Dan Carpenter 

Thanks, 1 patch applied to wireless-drivers-next.git:

2f8514b8b036 rtlwifi: rtl818x: silence uninitialized variable warning

-- 
Sent by pwcli
https://patchwork.kernel.org/patch/9010761/



Re: [v2] rtlwifi: Fix logic error in enter/exit power-save mode

2016-05-11 Thread Kalle Valo
wang yanqing  wrote:
> In commit a269913c52ad ("rtlwifi: Rework rtl_lps_leave() and
> rtl_lps_enter() to use work queue"), the tests for enter/exit
> power-save mode were inverted. With this change applied, the
> wifi connection becomes much more stable.
> 
> Fixes: a269913c52ad ("rtlwifi: Rework rtl_lps_leave() and rtl_lps_enter() to 
> use work queue")
> Signed-off-by: Wang YanQing 
> CC: Stable  [3.10+]
> Acked-by: Larry Finger 

Thanks, 1 patch applied to wireless-drivers-next.git:

873ffe154ae0 rtlwifi: Fix logic error in enter/exit power-save mode

-- 
Sent by pwcli
https://patchwork.kernel.org/patch/8993841/



Re: rtlwifi: rtl818x: constify rtl_intf_ops structures

2016-05-11 Thread Kalle Valo
Julia Lawall  wrote:
> The rtl_intf_ops structures are never modified, so declare them as const.
> 
> Done with the help of Coccinelle.
> 
> Signed-off-by: Julia Lawall 

Thanks, 1 patch applied to wireless-drivers-next.git:

1bfcfdcca142 rtlwifi: rtl818x: constify rtl_intf_ops structures

-- 
Sent by pwcli
https://patchwork.kernel.org/patch/8989291/



Re: [PATCH (net.git) 2/3] Revert "stmmac: Fix 'eth0: No PHY found' regression"

2016-05-11 Thread Marc Haber
On Wed, Apr 13, 2016 at 05:44:25PM +0200, Marc Haber wrote:
> On Fri, Apr 01, 2016 at 09:07:15AM +0200, Giuseppe Cavallaro wrote:
> > This reverts commit 88f8b1bb41c6208f81b6a480244533ded7b59493.
> > due to problems on GeekBox and Banana Pi M1 board when
> > connected to a real transceiver instead of a switch via
> > fixed-link.
> 
> This reversal is still needed in Linux 4.5.1 on Banana Pi.
> 
> Please consider including it in Linux 4.5.2.

This reversal is still needed in Linux 4.5.4 on Banana Pi.

Please consider including it in Linux 4.5.5.

Greetings
Marc



> 
> > 
> > Signed-off-by: Giuseppe Cavallaro 
> > Cc: Gabriel Fernandez 
> > Cc: Andreas Färber 
> > Cc: Frank Schäfer 
> > Cc: Dinh Nguyen 
> > Cc: David S. Miller 
> > ---
> >  drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c  |   11 ++-
> >  .../net/ethernet/stmicro/stmmac/stmmac_platform.c  |9 +
> >  include/linux/stmmac.h |1 -
> >  3 files changed, 11 insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c 
> > b/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
> > index ea76129..af09ced 100644
> > --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
> > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
> > @@ -199,12 +199,21 @@ int stmmac_mdio_register(struct net_device *ndev)
> > struct stmmac_priv *priv = netdev_priv(ndev);
> > struct stmmac_mdio_bus_data *mdio_bus_data = priv->plat->mdio_bus_data;
> > int addr, found;
> > -   struct device_node *mdio_node = priv->plat->mdio_node;
> > +   struct device_node *mdio_node = NULL;
> > +   struct device_node *child_node = NULL;
> >  
> > if (!mdio_bus_data)
> > return 0;
> >  
> > if (IS_ENABLED(CONFIG_OF)) {
> > +   for_each_child_of_node(priv->device->of_node, child_node) {
> > +   if (of_device_is_compatible(child_node,
> > +   "snps,dwmac-mdio")) {
> > +   mdio_node = child_node;
> > +   break;
> > +   }
> > +   }
> > +
> > if (mdio_node) {
> > netdev_dbg(ndev, "FOUND MDIO subnode\n");
> > } else {
> > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c 
> > b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
> > index dcbd2a1..9cf181f 100644
> > --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
> > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
> > @@ -146,7 +146,6 @@ stmmac_probe_config_dt(struct platform_device *pdev, 
> > const char **mac)
> > struct device_node *np = pdev->dev.of_node;
> > struct plat_stmmacenet_data *plat;
> > struct stmmac_dma_cfg *dma_cfg;
> > -   struct device_node *child_node = NULL;
> >  
> > plat = devm_kzalloc(>dev, sizeof(*plat), GFP_KERNEL);
> > if (!plat)
> > @@ -177,19 +176,13 @@ stmmac_probe_config_dt(struct platform_device *pdev, 
> > const char **mac)
> > plat->phy_node = of_node_get(np);
> > }
> >  
> > -   for_each_child_of_node(np, child_node)
> > -   if (of_device_is_compatible(child_node, "snps,dwmac-mdio")) {
> > -   plat->mdio_node = child_node;
> > -   break;
> > -   }
> > -
> > /* "snps,phy-addr" is not a standard property. Mark it as deprecated
> >  * and warn of its use. Remove this when phy node support is added.
> >  */
> > if (of_property_read_u32(np, "snps,phy-addr", >phy_addr) == 0)
> > dev_warn(>dev, "snps,phy-addr property is deprecated\n");
> >  
> > -   if ((plat->phy_node && !of_phy_is_fixed_link(np)) || !plat->mdio_node)
> > +   if ((plat->phy_node && !of_phy_is_fixed_link(np)) || plat->phy_bus_name)
> > plat->mdio_bus_data = NULL;
> > else
> > plat->mdio_bus_data =
> > diff --git a/include/linux/stmmac.h b/include/linux/stmmac.h
> > index 4bcf5a6..6e53fa8 100644
> > --- a/include/linux/stmmac.h
> > +++ b/include/linux/stmmac.h
> > @@ -114,7 +114,6 @@ struct plat_stmmacenet_data {
> > int interface;
> > struct stmmac_mdio_bus_data *mdio_bus_data;
> > struct device_node *phy_node;
> > -   struct device_node *mdio_node;
> > struct stmmac_dma_cfg *dma_cfg;
> > int clk_csr;
> > int has_gmac;
> > -- 
> > 1.7.4.4
> > 
> 
> -- 
> -
> Marc Haber | "I don't trust Computers. They | Mailadresse im Header
> Leimen, Germany|  lose things."Winona Ryder | Fon: *49 6224 1600402
> Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421

-- 
-
Marc Haber | "I don't trust Computers. They | 

Re: [PATCH] cfg80211/nl80211: add wifi tx power mode switching support

2016-05-11 Thread Dan Williams
On Wed, 2016-05-11 at 13:03 +0800, Wei-Ning Huang wrote:
> On Fri, May 6, 2016 at 4:19 PM, Wei-Ning Huang 
> wrote:
> > 
> > On Fri, May 6, 2016 at 12:07 AM, Dan Williams 
> > wrote:
> > > 
> > > 
> > > On Thu, 2016-05-05 at 14:44 +0800, Wei-Ning Huang wrote:
> > > > 
> > > > Recent new hardware has the ability to switch between tablet
> > > > mode and
> > > > clamshell mode. To optimize WiFi performance, we want to be
> > > > able to
> > > > use
> > > > different power table between modes. This patch adds a new
> > > > netlink
> > > > message type and cfg80211_ops function to allow userspace to
> > > > trigger
> > > > a
> > > > power mode switch for a given wireless interface.
> > > > 
> > > > Signed-off-by: Wei-Ning Huang 
> > > > ---
> > > >  include/net/cfg80211.h   | 11 +++
> > > >  include/uapi/linux/nl80211.h | 21 +
> > > >  net/wireless/nl80211.c   | 16 
> > > >  net/wireless/rdev-ops.h  | 22 ++
> > > >  net/wireless/trace.h | 20 
> > > >  5 files changed, 90 insertions(+)
> > > > 
> > > > diff --git a/include/net/cfg80211.h b/include/net/cfg80211.h
> > > > index 9e1b24c..aa77fa0 100644
> > > > --- a/include/net/cfg80211.h
> > > > +++ b/include/net/cfg80211.h
> > > > @@ -2370,6 +2370,12 @@ struct cfg80211_qos_map {
> > > >   * @get_tx_power: store the current TX power into the dbm
> > > > variable;
> > > >   *   return 0 if successful
> > > >   *
> > > > + * @set_tx_power_mode: set the transmit power mode. Some
> > > > device have
> > > > the ability
> > > > + *   to transform between different mode such as clamshell and
> > > > tablet mode.
> > > > + *   set_tx_power_mode allows setting of different TX power
> > > > mode at runtime.
> > > > + * @get_tx_power_mode: store the current TX power mode into
> > > > the mode
> > > > variable;
> > > > + *   return 0 if successful
> > > > + *
> > > >   * @set_wds_peer: set the WDS peer for a WDS interface
> > > >   *
> > > >   * @rfkill_poll: polls the hw rfkill line, use cfg80211
> > > > reporting
> > > > @@ -2631,6 +2637,11 @@ struct cfg80211_ops {
> > > >   int (*get_tx_power)(struct wiphy *wiphy, struct
> > > > wireless_dev *wdev,
> > > >   int *dbm);
> > > > 
> > > > + int (*set_tx_power_mode)(struct wiphy *wiphy,
> > > > +  enum nl80211_tx_power_mode
> > > > mode);
> > > > + int (*get_tx_power_mode)(struct wiphy *wiphy,
> > > > +  enum nl80211_tx_power_mode
> > > > *mode);
> > > > +
> > > >   int (*set_wds_peer)(struct wiphy *wiphy, struct
> > > > net_device *dev,
> > > >   const u8 *addr);
> > > > 
> > > > diff --git a/include/uapi/linux/nl80211.h
> > > > b/include/uapi/linux/nl80211.h
> > > > index 5a30a75..9b1888a 100644
> > > > --- a/include/uapi/linux/nl80211.h
> > > > +++ b/include/uapi/linux/nl80211.h
> > > > @@ -1796,6 +1796,9 @@ enum nl80211_commands {
> > > >   *   connecting to a PCP, and in %NL80211_CMD_START_AP to
> > > > start
> > > >   *   a PCP instead of AP. Relevant for DMG networks only.
> > > >   *
> > > > + * @NL80211_ATTR_WIPHY_TX_POWER_MODE: Transmit power mode. See
> > > > + *   nl80211_tx_power_mode for possible values.
> > > > + *
> > > >   * @NUM_NL80211_ATTR: total number of nl80211_attrs available
> > > >   * @NL80211_ATTR_MAX: highest attribute number currently
> > > > defined
> > > >   * @__NL80211_ATTR_AFTER_LAST: internal use
> > > > @@ -2172,6 +2175,8 @@ enum nl80211_attrs {
> > > > 
> > > >   NL80211_ATTR_PBSS,
> > > > 
> > > > + NL80211_ATTR_WIPHY_TX_POWER_MODE,
> > > > +
> > > >   /* add attributes here, update the policy in nl80211.c */
> > > > 
> > > >   __NL80211_ATTR_AFTER_LAST,
> > > > @@ -3703,6 +3708,22 @@ enum nl80211_tx_power_setting {
> > > >  };
> > > > 
> > > >  /**
> > > > + * enum nl80211_tx_power_mode - TX power mode setting
> > > > + * @NL80211_TX_POWER_LOW: general low TX power mode
> > > > + * @NL80211_TX_POWER_MEDIUM: general medium TX power mode
> > > > + * @NL80211_TX_POWER_HIGH: general high TX power mode
> > > > + * @NL80211_TX_POWER_CLAMSHELL: clamshell mode TX power mode
> > > > + * @NL80211_TX_POWER_TABLET: tablet mode TX power mode
> > > > + */
> > > > +enum nl80211_tx_power_mode {
> > > > + NL80211_TX_POWER_LOW,
> > > > + NL80211_TX_POWER_MEDIUM,
> > > > + NL80211_TX_POWER_HIGH,
> > > > + NL80211_TX_POWER_CLAMSHELL,
> > > > + NL80211_TX_POWER_TABLET,
> > > 
> > > "clamshell" and "tablet" probably mean many different things to
> > > many
> > > different people with respect to whether or not they should do
> > > anything
> > > with power saving or wifi.  I feel like a more generic interface
> > > is
> > > needed here.
> > We could probably drop those two CLAMSHELL and TABLET constants, or
> > describe what they mean
> > in more detail?
> > 

Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-11 Thread Eric Dumazet
On Tue, 2016-05-10 at 14:53 -0700, Eric Dumazet wrote:
> On Tue, 2016-05-10 at 17:35 -0400, Rik van Riel wrote:
> 
> > You might need another one of these in invoke_softirq()
> > 
> 
> Excellent.
> 
> I gave it a quick try (without your suggestion), and host seems to
> survive a stress test.

Well, we instantly trigger rcu issues.

How to reproduce :

netserver &
for i in `seq 1 100`
do
  netperf -H 127.0.0.1 -t TCP_RR -l 1000 &
done
# local hack to enable the new behavior
# without having to add a new sysctl, but hacking an existing one
echo 1001 >/proc/sys/net/core/netdev_max_backlog





[  236.977511] INFO: rcu_sched self-detected stall on CPU
[  236.977512] INFO: rcu_sched self-detected stall on CPU
[  236.977515] INFO: rcu_sched self-detected stall on CPU
[  236.977518] INFO: rcu_sched self-detected stall on CPU
[  236.977519] INFO: rcu_sched self-detected stall on CPU
[  236.977521] INFO: rcu_sched self-detected stall on CPU
[  236.977522] INFO: rcu_sched self-detected stall on CPU
[  236.977523] INFO: rcu_sched self-detected stall on CPU
[  236.977525] INFO: rcu_sched self-detected stall on CPU
[  236.977526] INFO: rcu_sched self-detected stall on CPU
[  236.977527] INFO: rcu_sched self-detected stall on CPU
[  236.977529] INFO: rcu_sched self-detected stall on CPU
[  236.977530] INFO: rcu_sched self-detected stall on CPU
[  236.977532] INFO: rcu_sched self-detected stall on CPU
[  236.977532]  47-...: (1 GPs behind) idle=8d1/1/0 softirq=2500/2506 fqs=1 
[  236.977535] INFO: rcu_sched self-detected stall on CPU
[  236.977536] INFO: rcu_sched self-detected stall on CPU
[  236.977540]  36-...: (1 GPs behind) idle=d05/1/0 softirq=2637/2644 fqs=1 
[  236.977546]  
[  236.977546]  38-...: (1 GPs behind) idle=a5b/1/0 softirq=2612/2618 fqs=1 
[  236.977549]  0-...: (1 GPs behind) idle=c39/1/0 softirq=15315/15321 fqs=1 
[  236.977551]  24-...: (1 GPs behind) idle=ea3/1/0 softirq=2455/2461 fqs=1 
[  236.977554]  18-...: (20995 ticks this GP) idle=ef5/1/0 softirq=8530/8530 fqs=1 
[  236.977556]  39-...: (1 GPs behind) idle=f9d/1/0 softirq=2144/2150 fqs=1 
[  236.977558]  22-...: (1 GPs behind) idle=5a7/1/0 softirq=10238/10244 fqs=1 
[  236.977561]  7-...: (1 GPs behind) idle=323/1/0 softirq=5279/5285 fqs=1 
[  236.977563]  31-...: (1 GPs behind) idle=47d/1/0 softirq=2526/2532 fqs=1 
[  236.977565]  33-...: (1 GPs behind) idle=175/1/0 softirq=2060/2066 fqs=1 
[  236.977568]  10-...: (1 GPs behind) idle=c3d/1/0 softirq=4864/4870 fqs=1 
[  236.977570]  34-...: (20995 ticks this GP) idle=dd5/1/0 softirq=2243/2243 fqs=1 
[  236.977574]  37-...: (1 GPs behind) idle=aef/1/0 softirq=2660/2666 fqs=1 
[  236.977576]  13-...: (1 GPs behind) idle=a2b/1/0 softirq=9928/9934 fqs=1 
[  236.977587] rcu_sched kthread starved for 20997 jiffies! g33049 c33048 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1

[PATCHv2 net] openvswitch: Fix cached ct with helper.

2016-05-11 Thread Joe Stringer
When using conntrack helpers from OVS, a common configuration is to
perform a lookup without specifying a helper, then go through a
firewalling policy, only to decide to attach a helper afterwards.

In this case, the initial lookup will cause a ct entry to be attached to
the skb, then the later commit with helper should attach the helper and
confirm the connection. However, the helper attachment has been missing.
If the user has enabled automatic helper attachment, then this issue
will be masked as it will be applied in init_conntrack(). It is also
masked if the action is executed from ovs_packet_cmd_execute() as that
will construct a fresh skb.

This patch fixes the issue by making an explicit call to try to assign
the helper if there is a discrepancy between the action's helper and the
current skb->nfct.

Fixes: cae3a2627520 ("openvswitch: Allow attaching helpers to ct action")
Signed-off-by: Joe Stringer 
---
v2: Only apply to connections that we will commit.
---
 net/openvswitch/conntrack.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index b5fea1101faa..10c84d882881 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -776,6 +776,19 @@ static int __ovs_ct_lookup(struct net *net, struct 
sw_flow_key *key,
return -EINVAL;
}
 
+   /* Userspace may decide to perform a ct lookup without a helper
+* specified followed by a (recirculate and) commit with one.
+* Therefore, for unconfirmed connections which we will commit,
+* we need to attach the helper here.
+*/
+   if (!nf_ct_is_confirmed(ct) && info->commit &&
+   info->helper && !nfct_help(ct)) {
+   int err = __nf_ct_try_assign_helper(ct, info->ct,
+   GFP_ATOMIC);
+   if (err)
+   return err;
+   }
+
/* Call the helper only if:
 * - nf_conntrack_in() was executed above ("!cached") for a
 *   confirmed connection, or
-- 
2.1.4



[PATCH v4 net-next] tcp: replace cnt & rtt with struct in pkts_acked()

2016-05-11 Thread Lawrence Brakmo
Replace two arguments (cnt and rtt) in the congestion control modules'
pkts_acked() function with a struct. This allows adding more
information without having to modify every existing congestion control
module (tcp_nv in particular needs the bytes in flight at the time the
packet was sent).

As proposed by Neal Cardwell in his comments to the tcp_nv patch.

Signed-off-by: Lawrence Brakmo 
Acked-by: Yuchung Cheng 
---
 include/net/tcp.h   |  7 ++-
 net/ipv4/tcp_bic.c  |  6 +++---
 net/ipv4/tcp_cdg.c  | 14 +++---
 net/ipv4/tcp_cubic.c|  6 +++---
 net/ipv4/tcp_htcp.c | 10 +-
 net/ipv4/tcp_illinois.c | 21 +++--
 net/ipv4/tcp_input.c|  8 ++--
 net/ipv4/tcp_lp.c   |  6 +++---
 net/ipv4/tcp_vegas.c|  6 +++---
 net/ipv4/tcp_vegas.h|  2 +-
 net/ipv4/tcp_veno.c |  7 ---
 net/ipv4/tcp_westwood.c |  7 ---
 net/ipv4/tcp_yeah.c |  7 ---
 13 files changed, 60 insertions(+), 47 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 24ec804..dc588c3 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -849,6 +849,11 @@ enum tcp_ca_ack_event_flags {
 
 union tcp_cc_info;
 
+struct ack_sample {
+   u32 pkts_acked;
+   s32 rtt_us;
+};
+
 struct tcp_congestion_ops {
struct list_headlist;
u32 key;
@@ -872,7 +877,7 @@ struct tcp_congestion_ops {
/* new value of cwnd after loss (optional) */
u32  (*undo_cwnd)(struct sock *sk);
/* hook for packet ack accounting (optional) */
-   void (*pkts_acked)(struct sock *sk, u32 num_acked, s32 rtt_us);
+   void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);
/* get info for inet_diag (optional) */
size_t (*get_info)(struct sock *sk, u32 ext, int *attr,
   union tcp_cc_info *info);
diff --git a/net/ipv4/tcp_bic.c b/net/ipv4/tcp_bic.c
index fd1405d..36087bc 100644
--- a/net/ipv4/tcp_bic.c
+++ b/net/ipv4/tcp_bic.c
@@ -197,15 +197,15 @@ static void bictcp_state(struct sock *sk, u8 new_state)
 /* Track delayed acknowledgment ratio using sliding window
  * ratio = (15*ratio + sample) / 16
  */
-static void bictcp_acked(struct sock *sk, u32 cnt, s32 rtt)
+static void bictcp_acked(struct sock *sk, const struct ack_sample *sample)
 {
const struct inet_connection_sock *icsk = inet_csk(sk);
 
if (icsk->icsk_ca_state == TCP_CA_Open) {
struct bictcp *ca = inet_csk_ca(sk);
 
-   cnt -= ca->delayed_ack >> ACK_RATIO_SHIFT;
-   ca->delayed_ack += cnt;
+   ca->delayed_ack += sample->pkts_acked -
+   (ca->delayed_ack >> ACK_RATIO_SHIFT);
}
 }
 
diff --git a/net/ipv4/tcp_cdg.c b/net/ipv4/tcp_cdg.c
index ccce8a5..03725b2 100644
--- a/net/ipv4/tcp_cdg.c
+++ b/net/ipv4/tcp_cdg.c
@@ -294,12 +294,12 @@ static void tcp_cdg_cong_avoid(struct sock *sk, u32 ack, 
u32 acked)
ca->shadow_wnd = max(ca->shadow_wnd, ca->shadow_wnd + incr);
 }
 
-static void tcp_cdg_acked(struct sock *sk, u32 num_acked, s32 rtt_us)
+static void tcp_cdg_acked(struct sock *sk, const struct ack_sample *sample)
 {
struct cdg *ca = inet_csk_ca(sk);
struct tcp_sock *tp = tcp_sk(sk);
 
-   if (rtt_us <= 0)
+   if (sample->rtt_us <= 0)
return;
 
/* A heuristic for filtering delayed ACKs, adapted from:
@@ -307,20 +307,20 @@ static void tcp_cdg_acked(struct sock *sk, u32 num_acked, 
s32 rtt_us)
 * delay and rate based TCP mechanisms." TR 100219A. CAIA, 2010.
 */
if (tp->sacked_out == 0) {
-   if (num_acked == 1 && ca->delack) {
+   if (sample->pkts_acked == 1 && ca->delack) {
/* A delayed ACK is only used for the minimum if it is
 * provenly lower than an existing non-zero minimum.
 */
-   ca->rtt.min = min(ca->rtt.min, rtt_us);
+   ca->rtt.min = min(ca->rtt.min, sample->rtt_us);
ca->delack--;
return;
-   } else if (num_acked > 1 && ca->delack < 5) {
+   } else if (sample->pkts_acked > 1 && ca->delack < 5) {
ca->delack++;
}
}
 
-   ca->rtt.min = min_not_zero(ca->rtt.min, rtt_us);
-   ca->rtt.max = max(ca->rtt.max, rtt_us);
+   ca->rtt.min = min_not_zero(ca->rtt.min, sample->rtt_us);
+   ca->rtt.max = max(ca->rtt.max, sample->rtt_us);
 }
 
 static u32 tcp_cdg_ssthresh(struct sock *sk)
diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index 0ce946e..c99230e 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -437,21 +437,21 @@ static void hystart_update(struct sock *sk, u32 delay)
 /* Track delayed acknowledgment ratio using sliding window
  * ratio = (15*ratio + sample) / 16
  */
-static void bictcp_acked(struct sock *sk, u32 cnt, s32 rtt_us)

[PATCH net-next 11/13] ip6_tunnel: Add support for fou/gue encapsulation

2016-05-11 Thread Tom Herbert
Add netlink and setup for encapsulation

Signed-off-by: Tom Herbert 
---
 net/ipv6/ip6_tunnel.c | 72 +++
 1 file changed, 72 insertions(+)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 66e3a63..52792f9 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1743,13 +1743,55 @@ static void ip6_tnl_netlink_parms(struct nlattr *data[],
parms->proto = nla_get_u8(data[IFLA_IPTUN_PROTO]);
 }
 
+static bool ip6_tnl_netlink_encap_parms(struct nlattr *data[],
+   struct ip_tunnel_encap *ipencap)
+{
+   bool ret = false;
+
+   memset(ipencap, 0, sizeof(*ipencap));
+
+   if (!data)
+   return ret;
+
+   if (data[IFLA_IPTUN_ENCAP_TYPE]) {
+   ret = true;
+   ipencap->type = nla_get_u16(data[IFLA_IPTUN_ENCAP_TYPE]);
+   }
+
+   if (data[IFLA_IPTUN_ENCAP_FLAGS]) {
+   ret = true;
+   ipencap->flags = nla_get_u16(data[IFLA_IPTUN_ENCAP_FLAGS]);
+   }
+
+   if (data[IFLA_IPTUN_ENCAP_SPORT]) {
+   ret = true;
+   ipencap->sport = nla_get_be16(data[IFLA_IPTUN_ENCAP_SPORT]);
+   }
+
+   if (data[IFLA_IPTUN_ENCAP_DPORT]) {
+   ret = true;
+   ipencap->dport = nla_get_be16(data[IFLA_IPTUN_ENCAP_DPORT]);
+   }
+
+   return ret;
+}
+
 static int ip6_tnl_newlink(struct net *src_net, struct net_device *dev,
   struct nlattr *tb[], struct nlattr *data[])
 {
struct net *net = dev_net(dev);
struct ip6_tnl *nt, *t;
+   struct ip_tunnel_encap ipencap;
 
nt = netdev_priv(dev);
+
+   if (ip6_tnl_netlink_encap_parms(data, &ipencap)) {
+   int err = ip6_tnl_encap_setup(nt, &ipencap);
+
+   if (err < 0)
+   return err;
+   }
+
ip6_tnl_netlink_parms(data, &nt->parms);
 
t = ip6_tnl_locate(net, >parms, 0);
@@ -1766,10 +1808,17 @@ static int ip6_tnl_changelink(struct net_device *dev, 
struct nlattr *tb[],
struct __ip6_tnl_parm p;
struct net *net = t->net;
struct ip6_tnl_net *ip6n = net_generic(net, ip6_tnl_net_id);
+   struct ip_tunnel_encap ipencap;
 
if (dev == ip6n->fb_tnl_dev)
return -EINVAL;
 
+   if (ip6_tnl_netlink_encap_parms(data, &ipencap)) {
+   int err = ip6_tnl_encap_setup(t, &ipencap);
+
+   if (err < 0)
+   return err;
+   }
ip6_tnl_netlink_parms(data, &p);
 
t = ip6_tnl_locate(net, &p, 0);
@@ -1810,6 +1859,14 @@ static size_t ip6_tnl_get_size(const struct net_device 
*dev)
nla_total_size(4) +
/* IFLA_IPTUN_PROTO */
nla_total_size(1) +
+   /* IFLA_IPTUN_ENCAP_TYPE */
+   nla_total_size(2) +
+   /* IFLA_IPTUN_ENCAP_FLAGS */
+   nla_total_size(2) +
+   /* IFLA_IPTUN_ENCAP_SPORT */
+   nla_total_size(2) +
+   /* IFLA_IPTUN_ENCAP_DPORT */
+   nla_total_size(2) +
0;
 }
 
@@ -1827,6 +1884,17 @@ static int ip6_tnl_fill_info(struct sk_buff *skb, const 
struct net_device *dev)
nla_put_u32(skb, IFLA_IPTUN_FLAGS, parm->flags) ||
nla_put_u8(skb, IFLA_IPTUN_PROTO, parm->proto))
goto nla_put_failure;
+
+   if (nla_put_u16(skb, IFLA_IPTUN_ENCAP_TYPE,
+   tunnel->encap.type) ||
+   nla_put_be16(skb, IFLA_IPTUN_ENCAP_SPORT,
+tunnel->encap.sport) ||
+   nla_put_be16(skb, IFLA_IPTUN_ENCAP_DPORT,
+tunnel->encap.dport) ||
+   nla_put_u16(skb, IFLA_IPTUN_ENCAP_FLAGS,
+   tunnel->encap.flags))
+   goto nla_put_failure;
+
return 0;
 
 nla_put_failure:
@@ -1850,6 +1918,10 @@ static const struct nla_policy 
ip6_tnl_policy[IFLA_IPTUN_MAX + 1] = {
[IFLA_IPTUN_FLOWINFO]   = { .type = NLA_U32 },
[IFLA_IPTUN_FLAGS]  = { .type = NLA_U32 },
[IFLA_IPTUN_PROTO]  = { .type = NLA_U8 },
+   [IFLA_IPTUN_ENCAP_TYPE] = { .type = NLA_U16 },
+   [IFLA_IPTUN_ENCAP_FLAGS]= { .type = NLA_U16 },
+   [IFLA_IPTUN_ENCAP_SPORT]= { .type = NLA_U16 },
+   [IFLA_IPTUN_ENCAP_DPORT]= { .type = NLA_U16 },
 };
 
 static struct rtnl_link_ops ip6_link_ops __read_mostly = {
-- 
2.8.0.rc2



[PATCH net-next 04/13] fou: Split out {fou,gue}_build_header

2016-05-11 Thread Tom Herbert
Create __fou_build_header and __gue_build_header. These implement the
protocol-generic parts of building the fou and gue headers.
fou_build_header and gue_build_header implement the IPv4-specific
parts and call the __*_build_header functions.

Signed-off-by: Tom Herbert 
---
 include/net/fou.h |  8 
 net/ipv4/fou.c| 47 +--
 2 files changed, 41 insertions(+), 14 deletions(-)

diff --git a/include/net/fou.h b/include/net/fou.h
index 19b8a0c..7d2fda2 100644
--- a/include/net/fou.h
+++ b/include/net/fou.h
@@ -11,9 +11,9 @@
 size_t fou_encap_hlen(struct ip_tunnel_encap *e);
 static size_t gue_encap_hlen(struct ip_tunnel_encap *e);
 
-int fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
-u8 *protocol, struct flowi4 *fl4);
-int gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
-u8 *protocol, struct flowi4 *fl4);
+int __fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
+  u8 *protocol, __be16 *sport, int type);
+int __gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
+  u8 *protocol, __be16 *sport, int type);
 
 #endif
diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index 6cbc725..f4f2ddd 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -780,6 +780,22 @@ static void fou_build_udp(struct sk_buff *skb, struct 
ip_tunnel_encap *e,
*protocol = IPPROTO_UDP;
 }
 
+int __fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
+  u8 *protocol, __be16 *sport, int type)
+{
+   int err;
+
+   err = iptunnel_handle_offloads(skb, type);
+   if (err)
+   return err;
+
+   *sport = e->sport ? : udp_flow_src_port(dev_net(skb->dev),
+   skb, 0, 0, false);
+
+   return 0;
+}
+EXPORT_SYMBOL(__fou_build_header);
+
 int fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
 u8 *protocol, struct flowi4 *fl4)
 {
@@ -788,26 +804,21 @@ int fou_build_header(struct sk_buff *skb, struct 
ip_tunnel_encap *e,
__be16 sport;
int err;
 
-   err = iptunnel_handle_offloads(skb, type);
+   err = __fou_build_header(skb, e, protocol, &sport, type);
if (err)
return err;
 
-   sport = e->sport ? : udp_flow_src_port(dev_net(skb->dev),
-  skb, 0, 0, false);
fou_build_udp(skb, e, fl4, protocol, sport);
 
return 0;
 }
 EXPORT_SYMBOL(fou_build_header);
 
-int gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
-u8 *protocol, struct flowi4 *fl4)
+int __gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
+  u8 *protocol, __be16 *sport, int type)
 {
-   int type = e->flags & TUNNEL_ENCAP_FLAG_CSUM ? SKB_GSO_UDP_TUNNEL_CSUM :
-  SKB_GSO_UDP_TUNNEL;
struct guehdr *guehdr;
size_t hdrlen, optlen = 0;
-   __be16 sport;
void *data;
bool need_priv = false;
int err;
@@ -826,8 +837,8 @@ int gue_build_header(struct sk_buff *skb, struct 
ip_tunnel_encap *e,
return err;
 
/* Get source port (based on flow hash) before skb_push */
-   sport = e->sport ? : udp_flow_src_port(dev_net(skb->dev),
-  skb, 0, 0, false);
+   *sport = e->sport ? : udp_flow_src_port(dev_net(skb->dev),
+   skb, 0, 0, false);
 
hdrlen = sizeof(struct guehdr) + optlen;
 
@@ -872,6 +883,22 @@ int gue_build_header(struct sk_buff *skb, struct 
ip_tunnel_encap *e,
 
}
 
+   return 0;
+}
+EXPORT_SYMBOL(__gue_build_header);
+
+int gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
+u8 *protocol, struct flowi4 *fl4)
+{
+   int type = e->flags & TUNNEL_ENCAP_FLAG_CSUM ? SKB_GSO_UDP_TUNNEL_CSUM :
+  SKB_GSO_UDP_TUNNEL;
+   __be16 sport;
+   int err;
+
+   err = __gue_build_header(skb, e, protocol, &sport, type);
+   if (err)
+   return err;
+
fou_build_udp(skb, e, fl4, protocol, sport);
 
return 0;
-- 
2.8.0.rc2



[PATCH net-next 12/13] ip6ip6: Support for GSO/GRO

2016-05-11 Thread Tom Herbert
Signed-off-by: Tom Herbert 
---
 net/ipv6/ip6_offload.c | 24 +---
 net/ipv6/ip6_tunnel.c  |  3 +++
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 787e55f..332d6a0 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -253,9 +253,11 @@ out:
return pp;
 }
 
-static struct sk_buff **sit_gro_receive(struct sk_buff **head,
-   struct sk_buff *skb)
+static struct sk_buff **sit_ip6ip6_gro_receive(struct sk_buff **head,
+  struct sk_buff *skb)
 {
+   /* Common GRO receive for SIT and IP6IP6 */
+
if (NAPI_GRO_CB(skb)->encap_mark) {
NAPI_GRO_CB(skb)->flush = 1;
return NULL;
@@ -298,6 +300,13 @@ static int sit_gro_complete(struct sk_buff *skb, int nhoff)
return ipv6_gro_complete(skb, nhoff);
 }
 
+static int ip6ip6_gro_complete(struct sk_buff *skb, int nhoff)
+{
+   skb->encapsulation = 1;
+   skb_shinfo(skb)->gso_type |= SKB_GSO_IPXIP6;
+   return ipv6_gro_complete(skb, nhoff);
+}
+
 static struct packet_offload ipv6_packet_offload __read_mostly = {
.type = cpu_to_be16(ETH_P_IPV6),
.callbacks = {
@@ -310,11 +319,19 @@ static struct packet_offload ipv6_packet_offload 
__read_mostly = {
 static const struct net_offload sit_offload = {
.callbacks = {
.gso_segment= ipv6_gso_segment,
-   .gro_receive= sit_gro_receive,
+   .gro_receive= sit_ip6ip6_gro_receive,
.gro_complete   = sit_gro_complete,
},
 };
 
+static const struct net_offload ip6ip6_offload = {
+   .callbacks = {
+   .gso_segment= ipv6_gso_segment,
+   .gro_receive= sit_ip6ip6_gro_receive,
+   .gro_complete   = ip6ip6_gro_complete,
+   },
+};
+
 static int __init ipv6_offload_init(void)
 {
 
@@ -326,6 +343,7 @@ static int __init ipv6_offload_init(void)
dev_add_offload(&ipv6_packet_offload);
 
inet_add_offload(&sit_offload, IPPROTO_IPV6);
+   inet6_add_offload(&ip6ip6_offload, IPPROTO_IPV6);
 
return 0;
 }
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 52792f9..0fab341 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1238,6 +1238,9 @@ ip6ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
*dev)
if (t->parms.flags & IP6_TNL_F_USE_ORIG_FWMARK)
fl6.flowi6_mark = skb->mark;
 
+   if (iptunnel_handle_offloads(skb, SKB_GSO_IPXIP6))
+   return -1;
+
err = ip6_tnl_xmit(skb, dev, dsfield, &fl6, encap_limit, &mtu,
   IPPROTO_IPV6);
if (err != 0) {
-- 
2.8.0.rc2



[PATCH net-next 10/13] ip6_gre: Add support for fou/gue encapsulation

2016-05-11 Thread Tom Herbert
Add netlink and setup for encapsulation

Signed-off-by: Tom Herbert 
---
 net/ipv6/ip6_gre.c | 77 +++---
 1 file changed, 74 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index ee62ec4..4110189 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -1022,9 +1022,7 @@ static int ip6gre_tunnel_init_common(struct net_device 
*dev)
}
 
tunnel->tun_hlen = gre_calc_hlen(tunnel->parms.o_flags);
-
-   tunnel->hlen = tunnel->tun_hlen;
-
+   tunnel->hlen = tunnel->tun_hlen + tunnel->encap_hlen;
t_hlen = tunnel->hlen + sizeof(struct ipv6hdr);
 
dev->hard_header_len = LL_MAX_HEADER + t_hlen;
@@ -1290,15 +1288,57 @@ static void ip6gre_tap_setup(struct net_device *dev)
dev->priv_flags &= ~IFF_TX_SKB_SHARING;
 }
 
+static bool ip6gre_netlink_encap_parms(struct nlattr *data[],
+  struct ip_tunnel_encap *ipencap)
+{
+   bool ret = false;
+
+   memset(ipencap, 0, sizeof(*ipencap));
+
+   if (!data)
+   return ret;
+
+   if (data[IFLA_GRE_ENCAP_TYPE]) {
+   ret = true;
+   ipencap->type = nla_get_u16(data[IFLA_GRE_ENCAP_TYPE]);
+   }
+
+   if (data[IFLA_GRE_ENCAP_FLAGS]) {
+   ret = true;
+   ipencap->flags = nla_get_u16(data[IFLA_GRE_ENCAP_FLAGS]);
+   }
+
+   if (data[IFLA_GRE_ENCAP_SPORT]) {
+   ret = true;
+   ipencap->sport = nla_get_be16(data[IFLA_GRE_ENCAP_SPORT]);
+   }
+
+   if (data[IFLA_GRE_ENCAP_DPORT]) {
+   ret = true;
+   ipencap->dport = nla_get_be16(data[IFLA_GRE_ENCAP_DPORT]);
+   }
+
+   return ret;
+}
+
 static int ip6gre_newlink(struct net *src_net, struct net_device *dev,
struct nlattr *tb[], struct nlattr *data[])
 {
struct ip6_tnl *nt;
struct net *net = dev_net(dev);
struct ip6gre_net *ign = net_generic(net, ip6gre_net_id);
+   struct ip_tunnel_encap ipencap;
int err;
 
nt = netdev_priv(dev);
+
+   if (ip6gre_netlink_encap_parms(data, &ipencap)) {
+   int err = ip6_tnl_encap_setup(nt, &ipencap);
+
+   if (err < 0)
+   return err;
+   }
+
ip6gre_netlink_parms(data, &nt->parms);
 
if (ip6gre_tunnel_find(net, &nt->parms, dev->type))
@@ -1345,10 +1385,18 @@ static int ip6gre_changelink(struct net_device *dev, 
struct nlattr *tb[],
struct net *net = nt->net;
struct ip6gre_net *ign = net_generic(net, ip6gre_net_id);
struct __ip6_tnl_parm p;
+   struct ip_tunnel_encap ipencap;
 
if (dev == ign->fb_tunnel_dev)
return -EINVAL;
 
+   if (ip6gre_netlink_encap_parms(data, &ipencap)) {
+   int err = ip6_tnl_encap_setup(nt, &ipencap);
+
+   if (err < 0)
+   return err;
+   }
+
ip6gre_netlink_parms(data, &p);
 
t = ip6gre_tunnel_locate(net, &p, 0);
@@ -1402,6 +1450,14 @@ static size_t ip6gre_get_size(const struct net_device 
*dev)
nla_total_size(4) +
/* IFLA_GRE_FLAGS */
nla_total_size(4) +
+   /* IFLA_GRE_ENCAP_TYPE */
+   nla_total_size(2) +
+   /* IFLA_GRE_ENCAP_FLAGS */
+   nla_total_size(2) +
+   /* IFLA_GRE_ENCAP_SPORT */
+   nla_total_size(2) +
+   /* IFLA_GRE_ENCAP_DPORT */
+   nla_total_size(2) +
0;
 }
 
@@ -1425,6 +1481,17 @@ static int ip6gre_fill_info(struct sk_buff *skb, const 
struct net_device *dev)
nla_put_be32(skb, IFLA_GRE_FLOWINFO, p->flowinfo) ||
nla_put_u32(skb, IFLA_GRE_FLAGS, p->flags))
goto nla_put_failure;
+
+   if (nla_put_u16(skb, IFLA_GRE_ENCAP_TYPE,
+   t->encap.type) ||
+   nla_put_be16(skb, IFLA_GRE_ENCAP_SPORT,
+t->encap.sport) ||
+   nla_put_be16(skb, IFLA_GRE_ENCAP_DPORT,
+t->encap.dport) ||
+   nla_put_u16(skb, IFLA_GRE_ENCAP_FLAGS,
+   t->encap.flags))
+   goto nla_put_failure;
+
return 0;
 
 nla_put_failure:
@@ -1443,6 +1510,10 @@ static const struct nla_policy 
ip6gre_policy[IFLA_GRE_MAX + 1] = {
[IFLA_GRE_ENCAP_LIMIT] = { .type = NLA_U8 },
[IFLA_GRE_FLOWINFO]= { .type = NLA_U32 },
[IFLA_GRE_FLAGS]   = { .type = NLA_U32 },
+   [IFLA_GRE_ENCAP_TYPE]   = { .type = NLA_U16 },
+   [IFLA_GRE_ENCAP_FLAGS]  = { .type = NLA_U16 },
+   [IFLA_GRE_ENCAP_SPORT]  = { .type = NLA_U16 },
+   [IFLA_GRE_ENCAP_DPORT]  = { .type = NLA_U16 },
 };
 
 static struct rtnl_link_ops ip6gre_link_ops __read_mostly = {
-- 
2.8.0.rc2



[PATCH net-next 02/13] net: define gso types for IPx over IPv4 and IPv6

2016-05-11 Thread Tom Herbert
This patch defines two new GSO definitions, SKB_GSO_IPXIP4 and
SKB_GSO_IPXIP6, along with the corresponding NETIF_F_GSO_IPXIP4 and
NETIF_F_GSO_IPXIP6. These are used to describe IP-in-IP
tunnels and indicate what the outer protocol is. The inner protocol
can be deduced from other GSO types (e.g. SKB_GSO_TCPV4 and
SKB_GSO_TCPV6). The GSO types SKB_GSO_IPIP and SKB_GSO_SIT
are removed (both are instances of SKB_GSO_IPXIP4).
SKB_GSO_IPXIP6 will be used when support for GSO with IP
encapsulation over IPv6 is added.

Signed-off-by: Tom Herbert 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c  |  5 ++---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |  4 ++--
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  3 +--
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  3 +--
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |  3 +--
 drivers/net/ethernet/intel/i40evf/i40evf_main.c   |  3 +--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  3 +--
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |  3 +--
 include/linux/netdev_features.h   | 12 ++--
 include/linux/netdevice.h |  4 ++--
 include/linux/skbuff.h|  4 ++--
 net/core/ethtool.c|  4 ++--
 net/ipv4/af_inet.c|  2 +-
 net/ipv4/ipip.c   |  2 +-
 net/ipv6/ip6_offload.c|  4 ++--
 net/ipv6/sit.c|  4 ++--
 net/netfilter/ipvs/ip_vs_xmit.c   | 17 +++--
 17 files changed, 35 insertions(+), 45 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index d465bd7..0a5b770 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -13259,12 +13259,11 @@ static int bnx2x_init_dev(struct bnx2x *bp, struct 
pci_dev *pdev,
NETIF_F_RXHASH | NETIF_F_HW_VLAN_CTAG_TX;
if (!chip_is_e1x) {
dev->hw_features |= NETIF_F_GSO_GRE | NETIF_F_GSO_UDP_TUNNEL |
-   NETIF_F_GSO_IPIP | NETIF_F_GSO_SIT;
+   NETIF_F_GSO_IPXIP4;
dev->hw_enc_features =
NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM | NETIF_F_SG |
NETIF_F_TSO | NETIF_F_TSO_ECN | NETIF_F_TSO6 |
-   NETIF_F_GSO_IPIP |
-   NETIF_F_GSO_SIT |
+   NETIF_F_GSO_IPXIP4 |
NETIF_F_GSO_GRE | NETIF_F_GSO_UDP_TUNNEL;
}
 
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 6a5a717..85adcb0 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -6233,7 +6233,7 @@ static int bnxt_init_one(struct pci_dev *pdev, const 
struct pci_device_id *ent)
dev->hw_features = NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM | NETIF_F_SG |
   NETIF_F_TSO | NETIF_F_TSO6 |
   NETIF_F_GSO_UDP_TUNNEL | NETIF_F_GSO_GRE |
-  NETIF_F_GSO_IPIP | NETIF_F_GSO_SIT |
+  NETIF_F_GSO_IPXIP4 |
   NETIF_F_GSO_UDP_TUNNEL_CSUM | NETIF_F_GSO_GRE_CSUM |
   NETIF_F_GSO_PARTIAL | NETIF_F_RXHASH |
   NETIF_F_RXCSUM | NETIF_F_LRO | NETIF_F_GRO;
@@ -6243,7 +6243,7 @@ static int bnxt_init_one(struct pci_dev *pdev, const 
struct pci_device_id *ent)
NETIF_F_TSO | NETIF_F_TSO6 |
NETIF_F_GSO_UDP_TUNNEL | NETIF_F_GSO_GRE |
NETIF_F_GSO_UDP_TUNNEL_CSUM | NETIF_F_GSO_GRE_CSUM |
-   NETIF_F_GSO_IPIP | NETIF_F_GSO_SIT |
+   NETIF_F_GSO_IPXIP4 |
NETIF_F_GSO_PARTIAL;
dev->gso_partial_features = NETIF_F_GSO_UDP_TUNNEL_CSUM |
NETIF_F_GSO_GRE_CSUM;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 46a3a67..e4284b5 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -9082,8 +9082,7 @@ static int i40e_config_netdev(struct i40e_vsi *vsi)
   NETIF_F_TSO6 |
   NETIF_F_GSO_GRE  |
   NETIF_F_GSO_GRE_CSUM |
-  NETIF_F_GSO_IPIP |
-  NETIF_F_GSO_SIT  |
+  NETIF_F_GSO_IPXIP4   |
   NETIF_F_GSO_UDP_TUNNEL   |
   NETIF_F_GSO_UDP_TUNNEL_CSUM  |
 

[PATCH net-next 13/13] ip4ip6: Support for GSO/GRO

2016-05-11 Thread Tom Herbert
Signed-off-by: Tom Herbert 
---
 include/net/inet_common.h |  5 +
 net/ipv4/af_inet.c| 12 +++-
 net/ipv6/ip6_offload.c| 33 -
 net/ipv6/ip6_tunnel.c |  5 -
 4 files changed, 48 insertions(+), 7 deletions(-)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 109e3ee..5d68342 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -39,6 +39,11 @@ int inet_ctl_sock_create(struct sock **sk, unsigned short 
family,
 int inet_recv_error(struct sock *sk, struct msghdr *msg, int len,
int *addr_len);
 
+struct sk_buff **inet_gro_receive(struct sk_buff **head, struct sk_buff *skb);
+int inet_gro_complete(struct sk_buff *skb, int nhoff);
+struct sk_buff *inet_gso_segment(struct sk_buff *skb,
+netdev_features_t features);
+
 static inline void inet_ctl_sock_destroy(struct sock *sk)
 {
if (sk)
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 25040b1..377424e 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1192,8 +1192,8 @@ int inet_sk_rebuild_header(struct sock *sk)
 }
 EXPORT_SYMBOL(inet_sk_rebuild_header);
 
-static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
-   netdev_features_t features)
+struct sk_buff *inet_gso_segment(struct sk_buff *skb,
+netdev_features_t features)
 {
bool udpfrag = false, fixedid = false, encap;
struct sk_buff *segs = ERR_PTR(-EINVAL);
@@ -1280,9 +1280,9 @@ static struct sk_buff *inet_gso_segment(struct sk_buff 
*skb,
 out:
return segs;
 }
+EXPORT_SYMBOL(inet_gso_segment);
 
-static struct sk_buff **inet_gro_receive(struct sk_buff **head,
-struct sk_buff *skb)
+struct sk_buff **inet_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 {
const struct net_offload *ops;
struct sk_buff **pp = NULL;
@@ -1398,6 +1398,7 @@ out:
 
return pp;
 }
+EXPORT_SYMBOL(inet_gro_receive);
 
 static struct sk_buff **ipip_gro_receive(struct sk_buff **head,
 struct sk_buff *skb)
@@ -1449,7 +1450,7 @@ int inet_recv_error(struct sock *sk, struct msghdr *msg, 
int len, int *addr_len)
return -EINVAL;
 }
 
-static int inet_gro_complete(struct sk_buff *skb, int nhoff)
+int inet_gro_complete(struct sk_buff *skb, int nhoff)
 {
__be16 newlen = htons(skb->len - nhoff);
struct iphdr *iph = (struct iphdr *)(skb->data + nhoff);
@@ -1479,6 +1480,7 @@ out_unlock:
 
return err;
 }
+EXPORT_SYMBOL(inet_gro_complete);
 
 static int ipip_gro_complete(struct sk_buff *skb, int nhoff)
 {
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 332d6a0..22e90e5 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -16,6 +16,7 @@
 
 #include 
 #include 
+#include 
 
 #include "ip6_offload.h"
 
@@ -268,6 +269,21 @@ static struct sk_buff **sit_ip6ip6_gro_receive(struct 
sk_buff **head,
return ipv6_gro_receive(head, skb);
 }
 
+static struct sk_buff **ip4ip6_gro_receive(struct sk_buff **head,
+  struct sk_buff *skb)
+{
+   /* GRO receive for IPv4 encapsulated in IPv6 */
+
+   if (NAPI_GRO_CB(skb)->encap_mark) {
+   NAPI_GRO_CB(skb)->flush = 1;
+   return NULL;
+   }
+
+   NAPI_GRO_CB(skb)->encap_mark = 1;
+
+   return inet_gro_receive(head, skb);
+}
+
 static int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
 {
const struct net_offload *ops;
@@ -307,6 +323,13 @@ static int ip6ip6_gro_complete(struct sk_buff *skb, int 
nhoff)
return ipv6_gro_complete(skb, nhoff);
 }
 
+static int ip4ip6_gro_complete(struct sk_buff *skb, int nhoff)
+{
+   skb->encapsulation = 1;
+   skb_shinfo(skb)->gso_type |= SKB_GSO_IPXIP6;
+   return inet_gro_complete(skb, nhoff);
+}
+
 static struct packet_offload ipv6_packet_offload __read_mostly = {
.type = cpu_to_be16(ETH_P_IPV6),
.callbacks = {
@@ -324,6 +347,14 @@ static const struct net_offload sit_offload = {
},
 };
 
+static const struct net_offload ip4ip6_offload = {
+   .callbacks = {
+   .gso_segment= inet_gso_segment,
+   .gro_receive= ip4ip6_gro_receive,
+   .gro_complete   = ip4ip6_gro_complete,
+   },
+};
+
 static const struct net_offload ip6ip6_offload = {
.callbacks = {
.gso_segment= ipv6_gso_segment,
@@ -331,7 +362,6 @@ static const struct net_offload ip6ip6_offload = {
.gro_complete   = ip6ip6_gro_complete,
},
 };
-
 static int __init ipv6_offload_init(void)
 {
 
@@ -344,6 +374,7 @@ static int __init ipv6_offload_init(void)
 
inet_add_offload(&sit_offload, IPPROTO_IPV6);
inet6_add_offload(&ip6ip6_offload, IPPROTO_IPV6);
+   inet6_add_offload(&ip4ip6_offload, IPPROTO_IPIP);
 
return 

[PATCH net-next 08/13] fou: Support IPv6 in fou

2016-05-11 Thread Tom Herbert
This patch adds receive path support for IPv6 with fou.

- Add address family to fou structure for open sockets. This supports
  AF_INET and AF_INET6. Lookups for fou ports are performed on both the
  port number and family.
- In fou and gue receive, adjust tot_len in the IPv4 header or
  payload_len in the IPv6 header based on address family.
- Allow AF_INET6 in FOU_ATTR_AF netlink attribute.

Signed-off-by: Tom Herbert 
---
 net/ipv4/fou.c | 47 +++
 1 file changed, 35 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index f4f2ddd..5f9207c 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -21,6 +21,7 @@ struct fou {
u8 protocol;
u8 flags;
__be16 port;
+   u8 family;
u16 type;
struct list_head list;
struct rcu_head rcu;
@@ -47,14 +48,17 @@ static inline struct fou *fou_from_sock(struct sock *sk)
return sk->sk_user_data;
 }
 
-static int fou_recv_pull(struct sk_buff *skb, size_t len)
+static int fou_recv_pull(struct sk_buff *skb, struct fou *fou, size_t len)
 {
-   struct iphdr *iph = ip_hdr(skb);
-
/* Remove 'len' bytes from the packet (UDP header and
 * FOU header if present).
 */
-   iph->tot_len = htons(ntohs(iph->tot_len) - len);
+   if (fou->family == AF_INET)
+   ip_hdr(skb)->tot_len = htons(ntohs(ip_hdr(skb)->tot_len) - len);
+   else
+   ipv6_hdr(skb)->payload_len =
+   htons(ntohs(ipv6_hdr(skb)->payload_len) - len);
+
__skb_pull(skb, len);
skb_postpull_rcsum(skb, udp_hdr(skb), len);
skb_reset_transport_header(skb);
@@ -68,7 +72,7 @@ static int fou_udp_recv(struct sock *sk, struct sk_buff *skb)
if (!fou)
return 1;
 
-   if (fou_recv_pull(skb, sizeof(struct udphdr)))
+   if (fou_recv_pull(skb, fou, sizeof(struct udphdr)))
goto drop;
 
return -fou->protocol;
@@ -141,7 +145,11 @@ static int gue_udp_recv(struct sock *sk, struct sk_buff *skb)
 
hdrlen = sizeof(struct guehdr) + optlen;
 
-   ip_hdr(skb)->tot_len = htons(ntohs(ip_hdr(skb)->tot_len) - len);
+   if (fou->family == AF_INET)
+   ip_hdr(skb)->tot_len = htons(ntohs(ip_hdr(skb)->tot_len) - len);
+   else
+   ipv6_hdr(skb)->payload_len =
+   htons(ntohs(ipv6_hdr(skb)->payload_len) - len);
 
/* Pull csum through the guehdr now . This can be used if
 * there is a remote checksum offload.
@@ -426,7 +434,8 @@ static int fou_add_to_port_list(struct net *net, struct fou *fou)
 
	mutex_lock(&fn->fou_lock);
	list_for_each_entry(fout, &fn->fou_list, list) {
-   if (fou->port == fout->port) {
+   if (fou->port == fout->port &&
+   fou->family == fout->family) {
			mutex_unlock(&fn->fou_lock);
return -EALREADY;
}
@@ -471,8 +480,9 @@ static int fou_create(struct net *net, struct fou_cfg *cfg,
 
sk = sock->sk;
 
-   fou->flags = cfg->flags;
fou->port = cfg->udp_config.local_udp_port;
+   fou->family = cfg->udp_config.family;
+   fou->flags = cfg->flags;
fou->type = cfg->type;
fou->sock = sock;
 
@@ -524,12 +534,13 @@ static int fou_destroy(struct net *net, struct fou_cfg *cfg)
 {
struct fou_net *fn = net_generic(net, fou_net_id);
__be16 port = cfg->udp_config.local_udp_port;
+   u8 family = cfg->udp_config.family;
int err = -EINVAL;
struct fou *fou;
 
	mutex_lock(&fn->fou_lock);
	list_for_each_entry(fou, &fn->fou_list, list) {
-   if (fou->port == port) {
+   if (fou->port == port && fou->family == family) {
fou_release(fou);
err = 0;
break;
@@ -567,8 +578,15 @@ static int parse_nl_config(struct genl_info *info,
if (info->attrs[FOU_ATTR_AF]) {
u8 family = nla_get_u8(info->attrs[FOU_ATTR_AF]);
 
-   if (family != AF_INET)
-   return -EINVAL;
+   switch (family) {
+   case AF_INET:
+   break;
+   case AF_INET6:
+   cfg->udp_config.ipv6_v6only = 1;
+   break;
+   default:
+   return -EAFNOSUPPORT;
+   }
 
cfg->udp_config.family = family;
}
@@ -659,6 +677,7 @@ static int fou_nl_cmd_get_port(struct sk_buff *skb, struct genl_info *info)
struct fou_cfg cfg;
struct fou *fout;
__be16 port;
+   u8 family;
int ret;
 
	ret = parse_nl_config(info, &cfg);
@@ -668,6 +687,10 @@ static int fou_nl_cmd_get_port(struct sk_buff *skb, struct genl_info *info)
if (port == 0)
return -EINVAL;
 
+   family = cfg.udp_config.family;
+   if (family != 

[PATCH net-next 07/13] ipv6: Change "final" protocol processing for encapsulation

2016-05-11 Thread Tom Herbert
When performing foo-over-UDP, UDP packets are processed by the
encapsulation handler which returns another protocol to process.
This may result in processing two (or more) protocols in the
loop that are marked as INET6_PROTO_FINAL. The actions taken when
hitting a final protocol, in particular the skb_postpull_rcsum, can
only be performed once.

This patch adds a check of whether a final protocol has been seen. The
rules are:
  - If a final protocol has not been seen, any protocol is processed
    (final and non-final). In the case of a final protocol, the final
    actions are taken (like the skb_postpull_rcsum).
  - If a final protocol has been seen (e.g. an encapsulating UDP
    header) then no further non-final protocols are allowed
    (e.g. extension headers). For further final protocols, the
    final actions are not taken (e.g. skb_postpull_rcsum).

Signed-off-by: Tom Herbert 
---
 net/ipv6/ip6_input.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 2a0258a..7d98d01 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -216,6 +216,7 @@ static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
unsigned int nhoff;
int nexthdr;
bool raw;
+   bool have_final = false;
 
/*
 *  Parse extension headers
@@ -235,9 +236,21 @@ resubmit:
if (ipprot) {
int ret;
 
-   if (ipprot->flags & INET6_PROTO_FINAL) {
+   if (have_final) {
+   if (!(ipprot->flags & INET6_PROTO_FINAL)) {
+   /* Once we've seen a final protocol don't
+* allow encapsulation on any non-final
+* ones. This allows foo in UDP encapsulation
+* to work.
+*/
+   goto discard;
+   }
+   } else if (ipprot->flags & INET6_PROTO_FINAL) {
const struct ipv6hdr *hdr;
 
+   /* Only do this once for first final protocol */
+   have_final = true;
+
/* Free reference early: we don't need it any more,
   and it may hold ip_conntrack module loaded
   indefinitely. */
-- 
2.8.0.rc2



[PATCH net-next 06/13] ipv6: Fix nexthdr for reinjection

2016-05-11 Thread Tom Herbert
In ip6_input_finish, when the protocol handler returns a value greater
than zero, the packet needs to be resubmitted using the returned
protocol. Currently the returned protocol is ignored, and each time
through resubmit nexthdr is taken from an offset in the packet. This
patch fixes that so that nexthdr is taken from the return value of the
protocol handler.

Signed-off-by: Tom Herbert 
---
 net/ipv6/ip6_input.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 6ed5601..2a0258a 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -222,13 +222,14 @@ static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 */
 
rcu_read_lock();
-resubmit:
+
idev = ip6_dst_idev(skb_dst(skb));
if (!pskb_pull(skb, skb_transport_offset(skb)))
goto discard;
nhoff = IP6CB(skb)->nhoff;
nexthdr = skb_network_header(skb)[nhoff];
 
+resubmit:
raw = raw6_local_deliver(skb, nexthdr);
ipprot = rcu_dereference(inet6_protos[nexthdr]);
if (ipprot) {
@@ -256,10 +257,12 @@ resubmit:
goto discard;
 
ret = ipprot->handler(skb);
-   if (ret > 0)
+   if (ret > 0) {
+   nexthdr = ret;
goto resubmit;
-   else if (ret == 0)
+   } else if (ret == 0) {
__IP6_INC_STATS(net, idev, IPSTATS_MIB_INDELIVERS);
+   }
} else {
if (!raw) {
if (xfrm6_policy_check(NULL, XFRM_POLICY_IN, skb)) {
-- 
2.8.0.rc2



[PATCH net-next 03/13] fou: Call setup_udp_tunnel_sock

2016-05-11 Thread Tom Herbert
Use helper function to set up UDP tunnel related information for a fou
socket.

Signed-off-by: Tom Herbert 
---
 net/ipv4/fou.c | 50 --
 1 file changed, 16 insertions(+), 34 deletions(-)

diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index eeec7d6..6cbc725 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -448,31 +448,13 @@ static void fou_release(struct fou *fou)
kfree_rcu(fou, rcu);
 }
 
-static int fou_encap_init(struct sock *sk, struct fou *fou, struct fou_cfg *cfg)
-{
-   udp_sk(sk)->encap_rcv = fou_udp_recv;
-   udp_sk(sk)->gro_receive = fou_gro_receive;
-   udp_sk(sk)->gro_complete = fou_gro_complete;
-   fou_from_sock(sk)->protocol = cfg->protocol;
-
-   return 0;
-}
-
-static int gue_encap_init(struct sock *sk, struct fou *fou, struct fou_cfg *cfg)
-{
-   udp_sk(sk)->encap_rcv = gue_udp_recv;
-   udp_sk(sk)->gro_receive = gue_gro_receive;
-   udp_sk(sk)->gro_complete = gue_gro_complete;
-
-   return 0;
-}
-
 static int fou_create(struct net *net, struct fou_cfg *cfg,
  struct socket **sockp)
 {
struct socket *sock = NULL;
struct fou *fou = NULL;
struct sock *sk;
+   struct udp_tunnel_sock_cfg tunnel_cfg;
int err;
 
/* Open UDP socket */
@@ -491,33 +473,33 @@ static int fou_create(struct net *net, struct fou_cfg *cfg,
 
fou->flags = cfg->flags;
fou->port = cfg->udp_config.local_udp_port;
+   fou->type = cfg->type;
+   fou->sock = sock;
+
+   memset(&tunnel_cfg, 0, sizeof(tunnel_cfg));
+   tunnel_cfg.encap_type = 1;
+   tunnel_cfg.sk_user_data = fou;
+   tunnel_cfg.encap_destroy = NULL;
 
/* Initial for fou type */
switch (cfg->type) {
case FOU_ENCAP_DIRECT:
-   err = fou_encap_init(sk, fou, cfg);
-   if (err)
-   goto error;
+   tunnel_cfg.encap_rcv = fou_udp_recv;
+   tunnel_cfg.gro_receive = fou_gro_receive;
+   tunnel_cfg.gro_complete = fou_gro_complete;
+   fou->protocol = cfg->protocol;
break;
case FOU_ENCAP_GUE:
-   err = gue_encap_init(sk, fou, cfg);
-   if (err)
-   goto error;
+   tunnel_cfg.encap_rcv = gue_udp_recv;
+   tunnel_cfg.gro_receive = gue_gro_receive;
+   tunnel_cfg.gro_complete = gue_gro_complete;
break;
default:
err = -EINVAL;
goto error;
}
 
-   fou->type = cfg->type;
-
-   udp_sk(sk)->encap_type = 1;
-   udp_encap_enable();
-
-   sk->sk_user_data = fou;
-   fou->sock = sock;
-
-   inet_inc_convert_csum(sk);
+   setup_udp_tunnel_sock(net, sock, &tunnel_cfg);
 
sk->sk_allocation = GFP_ATOMIC;
 
-- 
2.8.0.rc2



[PATCH net-next 01/13] gso: Remove arbitrary checks for unsupported GSO

2016-05-11 Thread Tom Herbert
In several gso_segment functions there are checks of gso_type against
a seemingly arbitrary list of SKB_GSO_* flags. This seems like an
attempt to identify unsupported GSO types, but since the stack is
the one that set these GSO types in the first place this seems
unnecessary to do. If a combination isn't valid in the first
place, the stack should not allow setting it.

This is a code simplification, especially for adding new GSO types.

Signed-off-by: Tom Herbert 
---
 net/ipv4/af_inet.c | 18 --
 net/ipv4/gre_offload.c | 14 --
 net/ipv4/tcp_offload.c | 19 ---
 net/ipv4/udp_offload.c | 10 --
 net/ipv6/ip6_offload.c | 18 --
 net/ipv6/udp_offload.c | 13 -
 net/mpls/mpls_gso.c|  9 -
 7 files changed, 101 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 2e6e65f..7f08d45 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1205,24 +1205,6 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
int ihl;
int id;
 
-   if (unlikely(skb_shinfo(skb)->gso_type &
-~(SKB_GSO_TCPV4 |
-  SKB_GSO_UDP |
-  SKB_GSO_DODGY |
-  SKB_GSO_TCP_ECN |
-  SKB_GSO_GRE |
-  SKB_GSO_GRE_CSUM |
-  SKB_GSO_IPIP |
-  SKB_GSO_SIT |
-  SKB_GSO_TCPV6 |
-  SKB_GSO_UDP_TUNNEL |
-  SKB_GSO_UDP_TUNNEL_CSUM |
-  SKB_GSO_TCP_FIXEDID |
-  SKB_GSO_TUNNEL_REMCSUM |
-  SKB_GSO_PARTIAL |
-  0)))
-   goto out;
-
skb_reset_network_header(skb);
nhoff = skb_network_header(skb) - skb_mac_header(skb);
if (unlikely(!pskb_may_pull(skb, sizeof(*iph
diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
index e88190a..ecd1e09 100644
--- a/net/ipv4/gre_offload.c
+++ b/net/ipv4/gre_offload.c
@@ -26,20 +26,6 @@ static struct sk_buff *gre_gso_segment(struct sk_buff *skb,
int gre_offset, outer_hlen;
bool need_csum, ufo;
 
-   if (unlikely(skb_shinfo(skb)->gso_type &
-   ~(SKB_GSO_TCPV4 |
- SKB_GSO_TCPV6 |
- SKB_GSO_UDP |
- SKB_GSO_DODGY |
- SKB_GSO_TCP_ECN |
- SKB_GSO_TCP_FIXEDID |
- SKB_GSO_GRE |
- SKB_GSO_GRE_CSUM |
- SKB_GSO_IPIP |
- SKB_GSO_SIT |
- SKB_GSO_PARTIAL)))
-   goto out;
-
if (!skb->encapsulation)
goto out;
 
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 02737b6..5c59649 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -83,25 +83,6 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
 
if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
/* Packet is from an untrusted source, reset gso_segs. */
-   int type = skb_shinfo(skb)->gso_type;
-
-   if (unlikely(type &
-~(SKB_GSO_TCPV4 |
-  SKB_GSO_DODGY |
-  SKB_GSO_TCP_ECN |
-  SKB_GSO_TCP_FIXEDID |
-  SKB_GSO_TCPV6 |
-  SKB_GSO_GRE |
-  SKB_GSO_GRE_CSUM |
-  SKB_GSO_IPIP |
-  SKB_GSO_SIT |
-  SKB_GSO_UDP_TUNNEL |
-  SKB_GSO_UDP_TUNNEL_CSUM |
-  SKB_GSO_TUNNEL_REMCSUM |
-  0) ||
-!(type & (SKB_GSO_TCPV4 |
-  SKB_GSO_TCPV6
-   goto out;
 
skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss);
 
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 6b7459c..81f253b 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -209,16 +209,6 @@ static struct sk_buff *udp4_ufo_fragment(struct sk_buff *skb,
 
if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
/* Packet is from an untrusted source, reset gso_segs. */
-   int type = skb_shinfo(skb)->gso_type;
-
-   if (unlikely(type & ~(SKB_GSO_UDP | SKB_GSO_DODGY |
- SKB_GSO_UDP_TUNNEL |
- SKB_GSO_UDP_TUNNEL_CSUM |
- SKB_GSO_TUNNEL_REMCSUM |
- SKB_GSO_IPIP |
- 

[PATCH net-next 00/13] ipv6: Enable GUEoIPv6 and more fixes for v6 tunneling

2016-05-11 Thread Tom Herbert
This patch set:
  - Adds support for GSO and GRO for ip6ip6 and ip4ip6
  - Add support for FOU and GUE in IPv6
  - Support GRE, ip6ip6 and ip4ip6 over FOU/GUE
  - Fixes ip6_input to deal with UDP encapsulations
  - Some other minor fixes

v2:
  - Removed a check of GSO types in MPLS
  - Define GSO types SKB_GSO_IPXIP6 and SKB_GSO_IPXIP4 (based on input
    from Alexander)
  - Don't define GSO types specifically for IP6IP6 and IP4IP6; the above
    fix makes that unnecessary
  - Don't bother clearing the encapsulation flag in UDP tunnel segment
    (another item suggested by Alexander).

v3:
  - Address some minor comments from Alexander

v4:
  - Rebase on changes to fix IP TX tunnels
  - Fix MTU issues in ip4ip6, ip6ip6
  - Add test data for above

Tested:
   Tested a variety of cases, but not the full matrix (which is quite
   large now). Most of the obvious cases (e.g. GRE) work fine. There
   are probably still some issues with GSO/GRO being effective in all
   cases.

- IPv4/GRE/GUE/IPv6 with RCO
  1 TCP_STREAM
6616 Mbps
  200 TCP_RR
1244043 tps
141/243/446 90/95/99% latencies
86.61% CPU utilization

- IPv6/GRE/GUE/IPv6 with RCO
  1 TCP_STREAM
6940 Mbps
  200 TCP_RR
1270903 tps
138/236/440 90/95/99% latencies
87.51% CPU utilization

 - IP6IP6
  1 TCP_STREAM
2576 Mbps
  200 TCP_RR
498981 tps
388/498/631 90/95/99% latencies
19.75% CPU utilization (1 CPU saturated)

 - IP6IP6/GUE with RCO
  1 TCP_STREAM
2031 Mbps
  200 TCP_RR
1233818 tps
143/244/451 90/95/99% latencies
87.57% CPU utilization

 - IP4IP6
  1 TCP_STREAM
2371 Mbps
  200 TCP_RR
763774 tps
250/318/466 90/95/99% latencies
35.25% CPU utilization (1 CPU saturated)

 - IP4IP6/GUE with RCO
  1 TCP_STREAM
2054 Mbps
  200 TCP_RR
1196385 tps
148/251/460 90/95/99% latencies
87.56% CPU utilization

 - GRE with keyid
  200 TCP_RR
744173 tps
258/332/461 90/95/99% latencies
34.59% CPU utilization (1 CPU saturated)
  

Tom Herbert (13):
  gso: Remove arbitrary checks for unsupported GSO
  net: define gso types for IPx over IPv4 and IPv6
  fou: Call setup_udp_tunnel_sock
  fou: Split out {fou,gue}_build_header
  fou: Add encap ops for IPv6 tunnels
  ipv6: Fix nexthdr for reinjection
  ipv6: Change "final" protocol processing for encapsulation
  fou: Support IPv6 in fou
  ip6_tun: Add infrastructure for doing encapsulation
  ip6_gre: Add support for fou/gue encapsulation
  ip6_tunnel: Add support for fou/gue encapsulation
  ip6ip6: Support for GSO/GRO
  ip4ip6: Support for GSO/GRO

 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c  |   5 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |   4 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   3 +-
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |   3 +-
 drivers/net/ethernet/intel/i40evf/i40evf_main.c   |   3 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   3 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |   3 +-
 include/linux/netdev_features.h   |  12 +-
 include/linux/netdevice.h |   4 +-
 include/linux/skbuff.h|   4 +-
 include/net/fou.h |  10 +-
 include/net/inet_common.h |   5 +
 include/net/ip6_tunnel.h  |  22 +++-
 net/core/ethtool.c|   4 +-
 net/ipv4/af_inet.c|  32 ++---
 net/ipv4/fou.c| 144 +-
 net/ipv4/gre_offload.c|  14 ---
 net/ipv4/ipip.c   |   2 +-
 net/ipv4/tcp_offload.c|  19 ---
 net/ipv4/udp_offload.c|  10 --
 net/ipv6/Makefile |   4 +-
 net/ipv6/fou6.c   | 140 +
 net/ipv6/ip6_gre.c|  77 +++-
 net/ipv6/ip6_input.c  |  24 +++-
 net/ipv6/ip6_offload.c|  77 
 net/ipv6/ip6_tunnel.c | 116 +++--
 net/ipv6/ip6_tunnel_core.c| 108 
 net/ipv6/sit.c|   4 +-
 net/ipv6/udp_offload.c|  13 --
 net/mpls/mpls_gso.c   |   9 --
 net/netfilter/ipvs/ip_vs_xmit.c   |  17 ++-
 32 files changed, 662 insertions(+), 236 deletions(-)
 create mode 100644 net/ipv6/fou6.c
 create mode 100644 net/ipv6/ip6_tunnel_core.c

-- 
2.8.0.rc2



[PATCH net-next 05/13] fou: Add encap ops for IPv6 tunnels

2016-05-11 Thread Tom Herbert
This patch adds IP tunnel encapsulation operations for IPv6. This
includes the infrastructure to add and delete operations. IPv6 variants
of fou6_build_header and gue6_build_header are added in a new
fou6 module. These encapsulation operations for fou and gue are
automatically added when the fou6 module loads.

Signed-off-by: Tom Herbert 
---
 include/net/fou.h  |   2 +-
 include/net/ip6_tunnel.h   |  14 +
 net/ipv6/Makefile  |   4 +-
 net/ipv6/fou6.c| 140 +
 net/ipv6/ip6_tunnel_core.c |  44 ++
 5 files changed, 202 insertions(+), 2 deletions(-)
 create mode 100644 net/ipv6/fou6.c
 create mode 100644 net/ipv6/ip6_tunnel_core.c

diff --git a/include/net/fou.h b/include/net/fou.h
index 7d2fda2..f5cc691 100644
--- a/include/net/fou.h
+++ b/include/net/fou.h
@@ -9,7 +9,7 @@
 #include 
 
 size_t fou_encap_hlen(struct ip_tunnel_encap *e);
-static size_t gue_encap_hlen(struct ip_tunnel_encap *e);
+size_t gue_encap_hlen(struct ip_tunnel_encap *e);
 
 int __fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
   u8 *protocol, __be16 *sport, int type);
diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index fb9e015..1c14c27 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -34,6 +34,20 @@ struct __ip6_tnl_parm {
__be32  o_key;
 };
 
+struct ip6_tnl_encap_ops {
+   size_t (*encap_hlen)(struct ip_tunnel_encap *e);
+   int (*build_header)(struct sk_buff *skb, struct ip_tunnel_encap *e,
+   u8 *protocol, struct flowi6 *fl6);
+};
+
+extern const struct ip6_tnl_encap_ops __rcu *
+   ip6tun_encaps[MAX_IPTUN_ENCAP_OPS];
+
+int ip6_tnl_encap_add_ops(const struct ip6_tnl_encap_ops *op,
+ unsigned int num);
+int ip6_tnl_encap_del_ops(const struct ip6_tnl_encap_ops *op,
+ unsigned int num);
+
 /* IPv6 tunnel */
 struct ip6_tnl {
struct ip6_tnl __rcu *next; /* next tunnel in list */
diff --git a/net/ipv6/Makefile b/net/ipv6/Makefile
index 5e9d6bf..5cf4a1f 100644
--- a/net/ipv6/Makefile
+++ b/net/ipv6/Makefile
@@ -9,7 +9,7 @@ ipv6-objs := af_inet6.o anycast.o ip6_output.o ip6_input.o addrconf.o \
route.o ip6_fib.o ipv6_sockglue.o ndisc.o udp.o udplite.o \
raw.o icmp.o mcast.o reassembly.o tcp_ipv6.o ping.o \
exthdrs.o datagram.o ip6_flowlabel.o inet6_connection_sock.o \
-   udp_offload.o
+   udp_offload.o ip6_tunnel_core.o
 
 ipv6-offload :=ip6_offload.o tcpv6_offload.o exthdrs_offload.o
 
@@ -43,6 +43,8 @@ obj-$(CONFIG_IPV6_SIT) += sit.o
 obj-$(CONFIG_IPV6_TUNNEL) += ip6_tunnel.o
 obj-$(CONFIG_IPV6_GRE) += ip6_gre.o
 
+obj-$(CONFIG_NET_FOU) += fou6.o
+
 obj-y += addrconf_core.o exthdrs_core.o ip6_checksum.o ip6_icmp.o
 obj-$(CONFIG_INET) += output_core.o protocol.o $(ipv6-offload)
 
diff --git a/net/ipv6/fou6.c b/net/ipv6/fou6.c
new file mode 100644
index 000..c972d0b
--- /dev/null
+++ b/net/ipv6/fou6.c
@@ -0,0 +1,140 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static void fou6_build_udp(struct sk_buff *skb, struct ip_tunnel_encap *e,
+  struct flowi6 *fl6, u8 *protocol, __be16 sport)
+{
+   struct udphdr *uh;
+
+   skb_push(skb, sizeof(struct udphdr));
+   skb_reset_transport_header(skb);
+
+   uh = udp_hdr(skb);
+
+   uh->dest = e->dport;
+   uh->source = sport;
+   uh->len = htons(skb->len);
+   udp6_set_csum(!(e->flags & TUNNEL_ENCAP_FLAG_CSUM6), skb,
+ &fl6->saddr, &fl6->daddr, skb->len);
+
+   *protocol = IPPROTO_UDP;
+}
+
+int fou6_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
+ u8 *protocol, struct flowi6 *fl6)
+{
+   __be16 sport;
+   int err;
+   int type = e->flags & TUNNEL_ENCAP_FLAG_CSUM6 ?
+   SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
+
+   err = __fou_build_header(skb, e, protocol, &sport, type);
+   if (err)
+   return err;
+
+   fou6_build_udp(skb, e, fl6, protocol, sport);
+
+   return 0;
+}
+EXPORT_SYMBOL(fou6_build_header);
+
+int gue6_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
+ u8 *protocol, struct flowi6 *fl6)
+{
+   __be16 sport;
+   int err;
+   int type = e->flags & TUNNEL_ENCAP_FLAG_CSUM6 ?
+   SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
+
+   err = __gue_build_header(skb, e, protocol, &sport, type);
+   if (err)
+   return err;
+
+   fou6_build_udp(skb, e, fl6, protocol, sport);
+
+   return 0;
+}
+EXPORT_SYMBOL(gue6_build_header);
+
+#ifdef CONFIG_NET_FOU_IP_TUNNELS
+
+static const struct ip6_tnl_encap_ops fou_ip6tun_ops = {
+   .encap_hlen = 

[PATCH net-next 09/13] ip6_tun: Add infrastructure for doing encapsulation

2016-05-11 Thread Tom Herbert
Add encap_hlen and an ip_tunnel_encap structure to ip6_tnl. Add
functions for getting the encap hlen, setting up encap on a tunnel,
and performing the encapsulation operation.

Signed-off-by: Tom Herbert 
---
 include/net/ip6_tunnel.h   |  8 +-
 net/ipv6/ip6_tunnel.c  | 36 ++
 net/ipv6/ip6_tunnel_core.c | 64 ++
 3 files changed, 96 insertions(+), 12 deletions(-)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index 1c14c27..1b8db86 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -66,10 +66,16 @@ struct ip6_tnl {
__u32 o_seqno;  /* The last output seqno */
int hlen;   /* tun_hlen + encap_hlen */
int tun_hlen;   /* Precalculated header length */
+   int encap_hlen; /* Encap header length (FOU,GUE) */
+   struct ip_tunnel_encap encap;
int mlink;
-
 };
 
+int ip6_tnl_encap_setup(struct ip6_tnl *t,
+   struct ip_tunnel_encap *ipencap);
+int ip6_tnl_encap(struct sk_buff *skb, struct ip6_tnl *t,
+ u8 *protocol, struct flowi6 *fl6);
+
 /* Tunnel encapsulation limit destination sub-option */
 
 struct ipv6_tlv_tnl_enc_lim {
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 50af706..66e3a63 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1010,7 +1010,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev, __u8 dsfield,
struct dst_entry *dst = NULL, *ndst = NULL;
struct net_device *tdev;
int mtu;
-   unsigned int max_headroom = sizeof(struct ipv6hdr);
+   unsigned int max_headroom = sizeof(struct ipv6hdr) + t->hlen;
int err = -1;
 
/* NBMA tunnel */
@@ -1063,7 +1063,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev, __u8 dsfield,
 t->parms.name);
goto tx_err_dst_release;
}
-   mtu = dst_mtu(dst) - sizeof(*ipv6h);
+   mtu = dst_mtu(dst) - sizeof(*ipv6h) - t->hlen;
if (encap_limit >= 0) {
max_headroom += 8;
mtu -= 8;
@@ -1125,10 +1125,14 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev, __u8 dsfield,
}
 
max_headroom = LL_RESERVED_SPACE(dst->dev) + sizeof(struct ipv6hdr)
-   + dst->header_len;
+   + dst->header_len + t->hlen;
if (max_headroom > dev->needed_headroom)
dev->needed_headroom = max_headroom;
 
+   err = ip6_tnl_encap(skb, t, &proto, fl6);
+   if (err)
+   return err;
+
skb_push(skb, sizeof(struct ipv6hdr));
skb_reset_network_header(skb);
ipv6h = ipv6_hdr(skb);
@@ -1280,6 +1284,7 @@ static void ip6_tnl_link_config(struct ip6_tnl *t)
struct net_device *dev = t->dev;
	struct __ip6_tnl_parm *p = &t->parms;
	struct flowi6 *fl6 = &t->fl.u.ip6;
+   int t_hlen;
 
	memcpy(dev->dev_addr, &p->laddr, sizeof(struct in6_addr));
	memcpy(dev->broadcast, &p->raddr, sizeof(struct in6_addr));
@@ -1303,6 +1308,10 @@ static void ip6_tnl_link_config(struct ip6_tnl *t)
else
dev->flags &= ~IFF_POINTOPOINT;
 
+   t->tun_hlen = 0;
+   t->hlen = t->encap_hlen + t->tun_hlen;
+   t_hlen = t->hlen + sizeof(struct ipv6hdr);
+
if (p->flags & IP6_TNL_F_CAP_XMIT) {
		int strict = (ipv6_addr_type(&p->raddr) &
  (IPV6_ADDR_MULTICAST|IPV6_ADDR_LINKLOCAL));
@@ -1316,9 +1325,9 @@ static void ip6_tnl_link_config(struct ip6_tnl *t)
 
if (rt->dst.dev) {
dev->hard_header_len = rt->dst.dev->hard_header_len +
-   sizeof(struct ipv6hdr);
+   sizeof(struct ipv6hdr) + t->encap_hlen;
 
-   dev->mtu = rt->dst.dev->mtu - sizeof(struct ipv6hdr);
+   dev->mtu = rt->dst.dev->mtu - t_hlen;
if (!(t->parms.flags & IP6_TNL_F_IGN_ENCAP_LIMIT))
dev->mtu -= 8;
 
@@ -1590,14 +1599,11 @@ static void ip6_tnl_dev_setup(struct net_device *dev)
	dev->netdev_ops = &ip6_tnl_netdev_ops;
dev->destructor = ip6_dev_free;
 
-   dev->type = ARPHRD_TUNNEL6;
-   dev->hard_header_len = LL_MAX_HEADER + sizeof(struct ipv6hdr);
-   dev->mtu = ETH_DATA_LEN - sizeof(struct ipv6hdr);
t = netdev_priv(dev);
-   if (!(t->parms.flags & IP6_TNL_F_IGN_ENCAP_LIMIT))
-   dev->mtu -= 8;
+   dev->type = ARPHRD_TUNNEL6;
dev->flags |= IFF_NOARP;
dev->addr_len = sizeof(struct in6_addr);
+   dev->features |= NETIF_F_LLTX;
netif_keep_dst(dev);
/* This perm addr will be used as interface identifier by IPv6 */
dev->addr_assign_type = NET_ADDR_RANDOM;
@@ -1615,6 +1621,7 @@ ip6_tnl_dev_init_gen(struct net_device *dev)
 {
struct ip6_tnl *t = netdev_priv(dev);
 

[PATCH v10 net-next 1/1] hv_sock: introduce Hyper-V Sockets

2016-05-11 Thread Dexuan Cui
Hyper-V Sockets (hv_sock) supplies a byte-stream based communication
mechanism between the host and the guest. It's somewhat like TCP over
VMBus, but the transport layer (VMBus) is much simpler than IP.

With Hyper-V Sockets, applications between the host and the guest can talk
to each other directly by the traditional BSD-style socket APIs.

Hyper-V Sockets is only available on new Windows hosts, like Windows Server
2016. More info is in this article "Make your own integration services":
https://msdn.microsoft.com/en-us/virtualization/hyperv_on_windows/develop/make_mgmt_service

The patch implements the necessary support in the guest side by introducing
a new socket address family AF_HYPERV.

Signed-off-by: Dexuan Cui 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Vitaly Kuznetsov 
Cc: Cathy Avery 
---

You can also get the patch on this branch:
https://github.com/dcui/linux/commits/decui/hv_sock/net-next/20160512_v10

For the change log before v10, please see https://lkml.org/lkml/2016/5/4/532

In v10, the main changes consist of
1) minimize struct hvsock_sock by making the send/recv buffers pointers.
   the buffers are allocated by kmalloc() in __hvsock_create().

2) minimize the sizes of the send/recv buffers and the vmbus ringbuffers.

 MAINTAINERS |2 +
 include/linux/hyperv.h  |   14 +
 include/linux/socket.h  |4 +-
 include/net/af_hvsock.h |   78 +++
 include/uapi/linux/hyperv.h |   25 +
 net/Kconfig |1 +
 net/Makefile|1 +
 net/hv_sock/Kconfig |   10 +
 net/hv_sock/Makefile|3 +
 net/hv_sock/af_hvsock.c | 1484 +++
 10 files changed, 1621 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index b57df66..c9fe2c6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5271,7 +5271,9 @@ F:drivers/pci/host/pci-hyperv.c
 F: drivers/net/hyperv/
 F: drivers/scsi/storvsc_drv.c
 F: drivers/video/fbdev/hyperv_fb.c
+F: net/hv_sock/
 F: include/linux/hyperv.h
+F: include/net/af_hvsock.h
 F: tools/hv/
 F: Documentation/ABI/stable/sysfs-bus-vmbus
 
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index aa0fadc..7be7237 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1338,4 +1338,18 @@ extern __u32 vmbus_proto_version;
 
 int vmbus_send_tl_connect_request(const uuid_le *shv_guest_servie_id,
  const uuid_le *shv_host_servie_id);
+struct vmpipe_proto_header {
+   u32 pkt_type;
+   u32 data_size;
+};
+
+#define HVSOCK_HEADER_LEN  (sizeof(struct vmpacket_descriptor) + \
+sizeof(struct vmpipe_proto_header))
+
+/* See 'prev_indices' in hv_ringbuffer_read(), hv_ringbuffer_write() */
+#define PREV_INDICES_LEN   (sizeof(u64))
+
+#define HVSOCK_PKT_LEN(payload_len)(HVSOCK_HEADER_LEN + \
+   ALIGN((payload_len), 8) + \
+   PREV_INDICES_LEN)
 #endif /* _HYPERV_H */
diff --git a/include/linux/socket.h b/include/linux/socket.h
index b5cc5a6..0b68b58 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -202,8 +202,9 @@ struct ucred {
 #define AF_VSOCK   40  /* vSockets */
 #define AF_KCM 41  /* Kernel Connection Multiplexor*/
 #define AF_QIPCRTR 42  /* Qualcomm IPC Router  */
+#define AF_HYPERV  43  /* Hyper-V Sockets  */
 
-#define AF_MAX 43  /* For now.. */
+#define AF_MAX 44  /* For now.. */
 
 /* Protocol families, same as address families. */
 #define PF_UNSPEC  AF_UNSPEC
@@ -251,6 +252,7 @@ struct ucred {
 #define PF_VSOCK   AF_VSOCK
 #define PF_KCM AF_KCM
 #define PF_QIPCRTR AF_QIPCRTR
+#define PF_HYPERV  AF_HYPERV
 #define PF_MAX AF_MAX
 
 /* Maximum queue length specifiable by listen.  */
diff --git a/include/net/af_hvsock.h b/include/net/af_hvsock.h
new file mode 100644
index 000..e002397
--- /dev/null
+++ b/include/net/af_hvsock.h
@@ -0,0 +1,78 @@
+#ifndef __AF_HVSOCK_H__
+#define __AF_HVSOCK_H__
+
+#include 
+#include 
+#include 
+
+/* Note: 3-page is the minimal recv ringbuffer size:
+ *
+ * the 1st page is used as the shared read/write index etc, rather than data:
+ * see hv_ringbuffer_init();
+ *
+ * the payload length in the vmbus pipe message received from the host can
+ * be 4096 bytes, and considering the header of HVSOCK_HEADER_LEN bytes, we
+ * need at least 2 extra pages for ringbuffer data.
+ */
+#define HVSOCK_RCV_BUF_SZPAGE_SIZE
+#define VMBUS_RINGBUFFER_SIZE_HVSOCK_RCV (3 * PAGE_SIZE)
+
+/* As to send, here let's make sure the hvsock_send_buf struct can be held in 1
+ * page, and since we want to use 2 pages for the send ringbuffer size (this is
+ * the minimal size because 

[PATCH v10 net-next 0/1] introduce Hyper-V VM Sockets(hv_sock)

2016-05-11 Thread Dexuan Cui
Hyper-V Sockets (hv_sock) supplies a byte-stream based communication
mechanism between the host and the guest. It's somewhat like TCP over
VMBus, but the transport layer (VMBus) is much simpler than IP.

With Hyper-V Sockets, applications between the host and the guest can talk
to each other directly by the traditional BSD-style socket APIs.

Hyper-V Sockets is only available on new Windows hosts, like Windows Server
2016. More info is in this article "Make your own integration services":
https://msdn.microsoft.com/en-us/virtualization/hyperv_on_windows/develop/make_mgmt_service

The patch implements the necessary support in the guest side by
introducing a new socket address family AF_HYPERV.

You can also get the patch by:
https://github.com/dcui/linux/commits/decui/hv_sock/net-next/20160512_v10

Note: the VMBus driver side's supporting patches have been in the mainline
tree.

I know the kernel has already had a VM Sockets driver (AF_VSOCK) based
on VMware VMCI (net/vmw_vsock/, drivers/misc/vmw_vmci), and KVM is
proposing AF_VSOCK of virtio version:
http://marc.info/?l=linux-netdev=145952064004765=2

However, though Hyper-V Sockets may seem conceptually similar to
AF_VSOCK, there are differences in the transport layer, and IMO these
make direct code reuse impractical:

1. In AF_VSOCK, the endpoint type is: <u32 ContextID, u32 Port>, but in
AF_HYPERV, the endpoint type is: <GUID VM_ID, GUID ServiceID>. Here GUID
is 128-bit.

2. AF_VSOCK supports SOCK_DGRAM, while AF_HYPERV doesn't.

3. AF_VSOCK supports some special sock opts, like SO_VM_SOCKETS_BUFFER_SIZE,
SO_VM_SOCKETS_BUFFER_MIN/MAX_SIZE and SO_VM_SOCKETS_CONNECT_TIMEOUT.
These are meaningless to AF_HYPERV.

4. Some of AF_VSOCK's VMCI transport ops are meaningless to AF_HYPERV/VMBus,
like .notify_recv_init
.notify_recv_pre_block
.notify_recv_pre_dequeue
.notify_recv_post_dequeue
.notify_send_init
.notify_send_pre_block
.notify_send_pre_enqueue
.notify_send_post_enqueue
etc.

So I think we'd better introduce a new address family: AF_HYPERV.

Please review the patch.

Looking forward to your comments, especially comments from David. :-)

Changes since v1:
- updated "[PATCH 6/7] hvsock: introduce Hyper-V VM Sockets feature"
- added __init and __exit for the module init/exit functions
- net/hv_sock/Kconfig: "default m" -> "default m if HYPERV"
- MODULE_LICENSE: "Dual MIT/GPL" -> "Dual BSD/GPL"

Changes since v2:
- fixed various coding issues pointed out by David Miller
- fixed indentation issues
- removed pr_debug in net/hv_sock/af_hvsock.c
- used reverse-Christmas-tree style for local variables.
- EXPORT_SYMBOL -> EXPORT_SYMBOL_GPL

Changes since v3:
- fixed a few coding issues pointed out by Vitaly Kuznetsov and Dan Carpenter
- fixed the ret value in vmbus_recvpacket_hvsock on error
- fixed the style of multi-line comment: vmbus_get_hvsock_rw_status()

Changes since v4 (https://lkml.org/lkml/2015/7/28/404):
- addressed all the comments about V4.
- treat the hvsock offers/channels as special VMBus devices
- add a mechanism to pass hvsock events to the hvsock driver
- fixed some corner cases with proper locking when a connection is closed
- rebased to the latest Greg's tree

Changes since v5 (https://lkml.org/lkml/2015/12/24/103):
- addressed the coding style issues (Vitaly Kuznetsov & David Miller, thanks!)
- used a better coding for the per-channel rescind callback (Thank Vitaly!)
- avoided the introduction of new VMBUS driver APIs vmbus_sendpacket_hvsock()
and vmbus_recvpacket_hvsock() and used vmbus_sendpacket()/vmbus_recvpacket()
in the higher level (i.e., the vmsock driver). Thank Vitaly!

Changes since v6 (http://lkml.iu.edu/hypermail/linux/kernel/1601.3/01813.html)
- only a few minor changes of coding style and comments

Changes since v7
- a few minor changes of coding style: thanks, Joe Perches!
- added some lines of comments about GUID/UUID before the struct sockaddr_hv.

Changes since v8
- removed the unnecessary __packed for some definitions: thanks, David!
- hvsock_open_connection: use offer.u.pipe.user_def[0] to know the connection
direction, and reorganized the function
- reorganized the code according to suggestions from Cathy Avery: split big
functions into small ones, set .setsockopt and getsockopt to
sock_no_setsockopt/sock_no_getsockopt
- inline'd some small list helper functions

Changes since v9
- minimized struct hvsock_sock by making the send/recv buffers pointers.
   the buffers are allocated by kmalloc() in __hvsock_create() now.
- minimized the sizes of the send/recv buffers and the vmbus ringbuffers.


Dexuan Cui (1):
  hv_sock: introduce Hyper-V Sockets

 MAINTAINERS |2 +
 include/linux/hyperv.h  |   14 +
 include/linux/socket.h  |4 +-
 include/net/af_hvsock.h |   78 +++
 include/uapi/linux/hyperv.h |   25 +
 net/Kconfig |1 +
 net/Makefile|1 +
 net/hv_sock/Kconfig |   10 +
 net/hv_sock/Makefile|3 +
 net/hv_sock/af_hvsock.c | 1484 

Re: [PATCH net-next v3 1/4] xen-netback: add control ring boilerplate

2016-05-11 Thread Wei Liu
On Wed, May 11, 2016 at 04:33:34PM +0100, Paul Durrant wrote:
> My recent patch to include/xen/interface/io/netif.h defines a new shared
> ring (in addition to the rx and tx rings) for passing control messages
> from a VM frontend driver to a backend driver.
> 
> This patch adds the necessary code to xen-netback to map this new shared
> ring, should it be created by a frontend, but does not add implementations
> for any of the defined protocol messages. These are added in a subsequent
> patch for clarity.
> 
> Signed-off-by: Paul Durrant 

Acked-by: Wei Liu 


Re: [PATCH net-next v3 2/4] xen-netback: add control protocol implementation

2016-05-11 Thread Wei Liu
On Wed, May 11, 2016 at 04:33:35PM +0100, Paul Durrant wrote:
> My recent patch to include/xen/interface/io/netif.h defines a new shared
> ring (in addition to the rx and tx rings) for passing control messages
> from a VM frontend driver to a backend driver.
> 
> A previous patch added the necessary boilerplate for mapping the control
> ring from the frontend, should it be created. This patch adds
> implementations for each of the defined protocol messages.
> 
> Signed-off-by: Paul Durrant 

Acked-by: Wei Liu 


[PATCH net-next v3 4/4] xen-netback: use hash value from the frontend

2016-05-11 Thread Paul Durrant
My recent patch to include/xen/interface/io/netif.h defines a new extra
info type that can be used to pass hash values between backend and guest
frontend.

This patch adds code to xen-netback to use the value in a hash extra
info fragment passed from the guest frontend in a transmit-side
(i.e. netback receive side) packet to set the skb hash accordingly.

Signed-off-by: Paul Durrant 
Acked-by: Wei Liu 
---
 drivers/net/xen-netback/netback.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/drivers/net/xen-netback/netback.c 
b/drivers/net/xen-netback/netback.c
index 7c72510..a5b5aad 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -1509,6 +1509,33 @@ static void xenvif_tx_build_gops(struct xenvif_queue 
*queue,
}
}
 
+   if (extras[XEN_NETIF_EXTRA_TYPE_HASH - 1].type) {
+   struct xen_netif_extra_info *extra;
+   enum pkt_hash_types type = PKT_HASH_TYPE_NONE;
+
+   extra = &extras[XEN_NETIF_EXTRA_TYPE_HASH - 1];
+
+   switch (extra->u.hash.type) {
+   case _XEN_NETIF_CTRL_HASH_TYPE_IPV4:
+   case _XEN_NETIF_CTRL_HASH_TYPE_IPV6:
+   type = PKT_HASH_TYPE_L3;
+   break;
+
+   case _XEN_NETIF_CTRL_HASH_TYPE_IPV4_TCP:
+   case _XEN_NETIF_CTRL_HASH_TYPE_IPV6_TCP:
+   type = PKT_HASH_TYPE_L4;
+   break;
+
+   default:
+   break;
+   }
+
+   if (type != PKT_HASH_TYPE_NONE)
+   skb_set_hash(skb,
+*(u32 *)extra->u.hash.value,
+type);
+   }
+
XENVIF_TX_CB(skb)->pending_idx = pending_idx;
 
__skb_put(skb, data_len);
-- 
2.1.4



Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-11 Thread Eric Dumazet
On Wed, 2016-05-11 at 07:40 -0700, Eric Dumazet wrote:
> On Wed, May 11, 2016 at 6:13 AM, Hannes Frederic Sowa
>  wrote:
> 
> > This looks racy to me as the ksoftirqd could be in the progress to stop
> > and we would miss another softirq invocation.
> 
> Looking at smpboot_thread_fn(), it looks fine :
> 
> if (!ht->thread_should_run(td->cpu)) {
> preempt_enable_no_resched();
> schedule();
> } else {
> __set_current_state(TASK_RUNNING);
> preempt_enable();
> ht->thread_fn(td->cpu);
> }

BTW, I wonder why we pass td->cpu as argument to ht->thread_fn(td->cpu)

This always should be the current processor id.

Or do we have an issue because we ignore it in :

static int ksoftirqd_should_run(unsigned int cpu)
{
return local_softirq_pending();
}





[PATCH net-next v3 0/4] xen-netback: support for control ring

2016-05-11 Thread Paul Durrant
My recent patch to import an up-to-date include/xen/interface/io/netif.h
from the Xen Project brought in the necessary definitions to support the
new control shared ring and protocol. This patch series updates xen-netback
to support the new ring.

Patch #1 adds the necessary boilerplate to map the control ring and handle
messages. No implementation of the new protocol is included in this patch
so that it can be kept to a reasonable size.

Patch #2 adds the protocol implementation.

Patch #3 adds support for passing hash values calculated by xen-netback to
capable frontends.

Patch #4 adds support for accepting hash values calculated by capable
frontends and using them to set the socket buffer hash.


[PATCH net-next v3 3/4] xen-netback: pass hash value to the frontend

2016-05-11 Thread Paul Durrant
My recent patch to include/xen/interface/io/netif.h defines a new extra
info type that can be used to pass hash values between backend and guest
frontend.

This patch adds code to xen-netback to pass hash values calculated for
guest receive-side packets (i.e. netback transmit side) to the frontend.

Signed-off-by: Paul Durrant 
Acked-by: Wei Liu 
---
 drivers/net/xen-netback/interface.c | 13 ++-
 drivers/net/xen-netback/netback.c   | 78 +++--
 2 files changed, 77 insertions(+), 14 deletions(-)

diff --git a/drivers/net/xen-netback/interface.c 
b/drivers/net/xen-netback/interface.c
index 5a39cdb..1c7f49b 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -158,8 +158,17 @@ static u16 xenvif_select_queue(struct net_device *dev, 
struct sk_buff *skb,
struct xenvif *vif = netdev_priv(dev);
unsigned int size = vif->hash.size;
 
-   if (vif->hash.alg == XEN_NETIF_CTRL_HASH_ALGORITHM_NONE)
-   return fallback(dev, skb) % dev->real_num_tx_queues;
+   if (vif->hash.alg == XEN_NETIF_CTRL_HASH_ALGORITHM_NONE) {
+   u16 index = fallback(dev, skb) % dev->real_num_tx_queues;
+
+   /* Make sure there is no hash information in the socket
+* buffer otherwise it would be incorrectly forwarded
+* to the frontend.
+*/
+   skb_clear_hash(skb);
+
+   return index;
+   }
 
xenvif_set_skb_hash(vif, skb);
 
diff --git a/drivers/net/xen-netback/netback.c 
b/drivers/net/xen-netback/netback.c
index 6509d11..7c72510 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -168,6 +168,8 @@ static bool xenvif_rx_ring_slots_available(struct 
xenvif_queue *queue)
needed = DIV_ROUND_UP(skb->len, XEN_PAGE_SIZE);
if (skb_is_gso(skb))
needed++;
+   if (skb->sw_hash)
+   needed++;
 
do {
prod = queue->rx.sring->req_prod;
@@ -285,6 +287,8 @@ struct gop_frag_copy {
struct xenvif_rx_meta *meta;
int head;
int gso_type;
+   int protocol;
+   int hash_present;
 
struct page *page;
 };
@@ -331,8 +335,15 @@ static void xenvif_setup_copy_gop(unsigned long gfn,
npo->copy_off += *len;
info->meta->size += *len;
 
+   if (!info->head)
+   return;
+
/* Leave a gap for the GSO descriptor. */
-   if (info->head && ((1 << info->gso_type) & queue->vif->gso_mask))
+   if ((1 << info->gso_type) & queue->vif->gso_mask)
+   queue->rx.req_cons++;
+
+   /* Leave a gap for the hash extra segment. */
+   if (info->hash_present)
queue->rx.req_cons++;
 
info->head = 0; /* There must be something in this buffer now */
@@ -367,6 +378,11 @@ static void xenvif_gop_frag_copy(struct xenvif_queue 
*queue, struct sk_buff *skb
.npo = npo,
.head = *head,
.gso_type = XEN_NETIF_GSO_TYPE_NONE,
+   /* xenvif_set_skb_hash() will have either set a s/w
+* hash or cleared the hash depending on
+* whether the frontend wants a hash for this skb.
+*/
+   .hash_present = skb->sw_hash,
};
unsigned long bytes;
 
@@ -555,6 +571,7 @@ void xenvif_kick_thread(struct xenvif_queue *queue)
 
 static void xenvif_rx_action(struct xenvif_queue *queue)
 {
+   struct xenvif *vif = queue->vif;
s8 status;
u16 flags;
struct xen_netif_rx_response *resp;
@@ -590,9 +607,10 @@ static void xenvif_rx_action(struct xenvif_queue *queue)
gnttab_batch_copy(queue->grant_copy_op, npo.copy_prod);
 
while ((skb = __skb_dequeue()) != NULL) {
+   struct xen_netif_extra_info *extra = NULL;
 
if ((1 << queue->meta[npo.meta_cons].gso_type) &
-   queue->vif->gso_prefix_mask) {
+   vif->gso_prefix_mask) {
resp = RING_GET_RESPONSE(&queue->rx,
 queue->rx.rsp_prod_pvt++);
 
@@ -610,7 +628,7 @@ static void xenvif_rx_action(struct xenvif_queue *queue)
queue->stats.tx_bytes += skb->len;
queue->stats.tx_packets++;
 
-   status = xenvif_check_gop(queue->vif,
+   status = xenvif_check_gop(vif,
  XENVIF_RX_CB(skb)->meta_slots_used,
  &npo);
 
@@ -632,21 +650,57 @@ static void xenvif_rx_action(struct xenvif_queue *queue)
flags);
 
if ((1 << queue->meta[npo.meta_cons].gso_type) &
-   queue->vif->gso_mask) {
-   struct xen_netif_extra_info *gso =
-   (struct xen_netif_extra_info *)
+  

[PATCH net-next v3 1/4] xen-netback: add control ring boilerplate

2016-05-11 Thread Paul Durrant
My recent patch to include/xen/interface/io/netif.h defines a new shared
ring (in addition to the rx and tx rings) for passing control messages
from a VM frontend driver to a backend driver.

This patch adds the necessary code to xen-netback to map this new shared
ring, should it be created by a frontend, but does not add implementations
for any of the defined protocol messages. These are added in a subsequent
patch for clarity.

Signed-off-by: Paul Durrant 
Cc: Wei Liu 
---

v2:
 - Changed error handling style in connect_ctrl_ring()
---
 drivers/net/xen-netback/common.h|  28 +++---
 drivers/net/xen-netback/interface.c | 101 +---
 drivers/net/xen-netback/netback.c   |  99 +--
 drivers/net/xen-netback/xenbus.c|  79 
 4 files changed, 277 insertions(+), 30 deletions(-)

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index f44b388..093a12a 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -260,6 +260,11 @@ struct xenvif {
struct dentry *xenvif_dbg_root;
 #endif
 
+   struct xen_netif_ctrl_back_ring ctrl;
+   struct task_struct *ctrl_task;
+   wait_queue_head_t ctrl_wq;
+   unsigned int ctrl_irq;
+
/* Miscellaneous private stuff. */
struct net_device *dev;
 };
@@ -285,10 +290,15 @@ struct xenvif *xenvif_alloc(struct device *parent,
 int xenvif_init_queue(struct xenvif_queue *queue);
 void xenvif_deinit_queue(struct xenvif_queue *queue);
 
-int xenvif_connect(struct xenvif_queue *queue, unsigned long tx_ring_ref,
-  unsigned long rx_ring_ref, unsigned int tx_evtchn,
-  unsigned int rx_evtchn);
-void xenvif_disconnect(struct xenvif *vif);
+int xenvif_connect_data(struct xenvif_queue *queue,
+   unsigned long tx_ring_ref,
+   unsigned long rx_ring_ref,
+   unsigned int tx_evtchn,
+   unsigned int rx_evtchn);
+void xenvif_disconnect_data(struct xenvif *vif);
+int xenvif_connect_ctrl(struct xenvif *vif, grant_ref_t ring_ref,
+   unsigned int evtchn);
+void xenvif_disconnect_ctrl(struct xenvif *vif);
 void xenvif_free(struct xenvif *vif);
 
 int xenvif_xenbus_init(void);
@@ -300,10 +310,10 @@ int xenvif_queue_stopped(struct xenvif_queue *queue);
 void xenvif_wake_queue(struct xenvif_queue *queue);
 
 /* (Un)Map communication rings. */
-void xenvif_unmap_frontend_rings(struct xenvif_queue *queue);
-int xenvif_map_frontend_rings(struct xenvif_queue *queue,
- grant_ref_t tx_ring_ref,
- grant_ref_t rx_ring_ref);
+void xenvif_unmap_frontend_data_rings(struct xenvif_queue *queue);
+int xenvif_map_frontend_data_rings(struct xenvif_queue *queue,
+  grant_ref_t tx_ring_ref,
+  grant_ref_t rx_ring_ref);
 
 /* Check for SKBs from frontend and schedule backend processing */
 void xenvif_napi_schedule_or_enable_events(struct xenvif_queue *queue);
@@ -318,6 +328,8 @@ void xenvif_kick_thread(struct xenvif_queue *queue);
 
 int xenvif_dealloc_kthread(void *data);
 
+int xenvif_ctrl_kthread(void *data);
+
 void xenvif_rx_queue_tail(struct xenvif_queue *queue, struct sk_buff *skb);
 
 void xenvif_carrier_on(struct xenvif *vif);
diff --git a/drivers/net/xen-netback/interface.c 
b/drivers/net/xen-netback/interface.c
index f5231a2..78a10d2 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -128,6 +128,15 @@ irqreturn_t xenvif_interrupt(int irq, void *dev_id)
return IRQ_HANDLED;
 }
 
+irqreturn_t xenvif_ctrl_interrupt(int irq, void *dev_id)
+{
+   struct xenvif *vif = dev_id;
+
+   wake_up(&vif->ctrl_wq);
+
+   return IRQ_HANDLED;
+}
+
 int xenvif_queue_stopped(struct xenvif_queue *queue)
 {
struct net_device *dev = queue->vif->dev;
@@ -527,9 +536,66 @@ void xenvif_carrier_on(struct xenvif *vif)
rtnl_unlock();
 }
 
-int xenvif_connect(struct xenvif_queue *queue, unsigned long tx_ring_ref,
-  unsigned long rx_ring_ref, unsigned int tx_evtchn,
-  unsigned int rx_evtchn)
+int xenvif_connect_ctrl(struct xenvif *vif, grant_ref_t ring_ref,
+   unsigned int evtchn)
+{
+   struct net_device *dev = vif->dev;
+   void *addr;
+   struct xen_netif_ctrl_sring *shared;
+   struct task_struct *task;
+   int err = -ENOMEM;
+
+   err = xenbus_map_ring_valloc(xenvif_to_xenbus_device(vif),
+    &ring_ref, 1, &addr);
+   if (err)
+   goto err;
+
+   shared = (struct xen_netif_ctrl_sring *)addr;
+   BACK_RING_INIT(&vif->ctrl, shared, XEN_PAGE_SIZE);
+
+   init_waitqueue_head(&vif->ctrl_wq);
+
+   err = bind_interdomain_evtchn_to_irqhandler(vif->domid, evtchn,
+   

[PATCH net-next v3 2/4] xen-netback: add control protocol implementation

2016-05-11 Thread Paul Durrant
My recent patch to include/xen/interface/io/netif.h defines a new shared
ring (in addition to the rx and tx rings) for passing control messages
from a VM frontend driver to a backend driver.

A previous patch added the necessary boilerplate for mapping the control
ring from the frontend, should it be created. This patch adds
implementations for each of the defined protocol messages.

Signed-off-by: Paul Durrant 
Cc: Wei Liu 
---

v3:
 - Remove unintentional label rename

v2:
 - Use RCU list for hash cache
---
 drivers/net/xen-netback/Makefile|   2 +-
 drivers/net/xen-netback/common.h|  46 +
 drivers/net/xen-netback/hash.c  | 386 
 drivers/net/xen-netback/interface.c |  24 +++
 drivers/net/xen-netback/netback.c   |  49 -
 5 files changed, 504 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/xen-netback/hash.c

diff --git a/drivers/net/xen-netback/Makefile b/drivers/net/xen-netback/Makefile
index e346e81..11e02be 100644
--- a/drivers/net/xen-netback/Makefile
+++ b/drivers/net/xen-netback/Makefile
@@ -1,3 +1,3 @@
 obj-$(CONFIG_XEN_NETDEV_BACKEND) := xen-netback.o
 
-xen-netback-y := netback.o xenbus.o interface.o
+xen-netback-y := netback.o xenbus.o interface.o hash.o
diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index 093a12a..84d6cbd 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -220,6 +220,35 @@ struct xenvif_mcast_addr {
 
 #define XEN_NETBK_MCAST_MAX 64
 
+#define XEN_NETBK_MAX_HASH_KEY_SIZE 40
+#define XEN_NETBK_MAX_HASH_MAPPING_SIZE 128
+#define XEN_NETBK_HASH_TAG_SIZE 40
+
+struct xenvif_hash_cache_entry {
+   struct list_head link;
+   struct rcu_head rcu;
+   u8 tag[XEN_NETBK_HASH_TAG_SIZE];
+   unsigned int len;
+   u32 val;
+   int seq;
+};
+
+struct xenvif_hash_cache {
+   spinlock_t lock;
+   struct list_head list;
+   unsigned int count;
+   atomic_t seq;
+};
+
+struct xenvif_hash {
+   unsigned int alg;
+   u32 flags;
+   u8 key[XEN_NETBK_MAX_HASH_KEY_SIZE];
+   u32 mapping[XEN_NETBK_MAX_HASH_MAPPING_SIZE];
+   unsigned int size;
+   struct xenvif_hash_cache cache;
+};
+
 struct xenvif {
/* Unique identifier for this interface. */
domid_t  domid;
@@ -251,6 +280,8 @@ struct xenvif {
unsigned int num_queues; /* active queues, resource allocated */
unsigned int stalled_queues;
 
+   struct xenvif_hash hash;
+
struct xenbus_watch credit_watch;
struct xenbus_watch mcast_ctrl_watch;
 
@@ -353,6 +384,7 @@ extern bool separate_tx_rx_irq;
 extern unsigned int rx_drain_timeout_msecs;
 extern unsigned int rx_stall_timeout_msecs;
 extern unsigned int xenvif_max_queues;
+extern unsigned int xenvif_hash_cache_size;
 
 #ifdef CONFIG_DEBUG_FS
 extern struct dentry *xen_netback_dbg_root;
@@ -366,4 +398,18 @@ void xenvif_skb_zerocopy_complete(struct xenvif_queue 
*queue);
 bool xenvif_mcast_match(struct xenvif *vif, const u8 *addr);
 void xenvif_mcast_addr_list_free(struct xenvif *vif);
 
+/* Hash */
+void xenvif_init_hash(struct xenvif *vif);
+void xenvif_deinit_hash(struct xenvif *vif);
+
+u32 xenvif_set_hash_alg(struct xenvif *vif, u32 alg);
+u32 xenvif_get_hash_flags(struct xenvif *vif, u32 *flags);
+u32 xenvif_set_hash_flags(struct xenvif *vif, u32 flags);
+u32 xenvif_set_hash_key(struct xenvif *vif, u32 gref, u32 len);
+u32 xenvif_set_hash_mapping_size(struct xenvif *vif, u32 size);
+u32 xenvif_set_hash_mapping(struct xenvif *vif, u32 gref, u32 len,
+   u32 off);
+
+void xenvif_set_skb_hash(struct xenvif *vif, struct sk_buff *skb);
+
 #endif /* __XEN_NETBACK__COMMON_H__ */
diff --git a/drivers/net/xen-netback/hash.c b/drivers/net/xen-netback/hash.c
new file mode 100644
index 000..47edfe9
--- /dev/null
+++ b/drivers/net/xen-netback/hash.c
@@ -0,0 +1,386 @@
+/*
+ * Copyright (c) 2016 Citrix Systems Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS 

[PATCH nf V2] netfilter: fix oops in nfqueue during netns error unwinding

2016-05-11 Thread Florian Westphal
Under full load (unshare() in loop -> OOM conditions) we can
get kernel panic:

BUG: unable to handle kernel NULL pointer dereference at 0008
IP: [] nfqnl_nf_hook_drop+0x35/0x70
[..]
task: 88012dfa3840 ti: 88012dffc000 task.ti: 88012dffc000
RIP: 0010:[]  [] 
nfqnl_nf_hook_drop+0x35/0x70
RSP: :88012dfffd80  EFLAGS: 00010206
RAX: 0008 RBX: 81add0c0 RCX: 88013fd8
[..]
Call Trace:
 [] nf_queue_nf_hook_drop+0x18/0x20
 [] nf_unregister_net_hook+0xdb/0x150
 [] netfilter_net_exit+0x2f/0x60
 [] ops_exit_list.isra.4+0x38/0x60
 [] setup_net+0xc2/0x120
 [] copy_net_ns+0x79/0x120
 [] create_new_namespaces+0x11b/0x1e0
 [] unshare_nsproxy_namespaces+0x57/0xa0
 [] SyS_unshare+0x1b2/0x340
 [] entry_SYSCALL_64_fastpath+0x1e/0xa8
Code: 65 00 48 89 e5 41 56 41 55 41 54 53 83 e8 01 48 8b 97 70 12 00 00 48 98 
49 89 f4 4c 8b 74 c2 18 4d 8d 6e 08 49 81 c6 88 00 00 00 <49> 8b 5d 00 48 85 db 
74 1a 48 89 df 4c 89 e2 48 c7 c6 90 68 47

Problem is that we call into the nfqueue backend to zap
packets that might be queued to userspace when we unregister a
netfilter hook.

However, this assumes that the backend was initialized and
net_generic(net, nfnl_queue_net_id) returns valid memory.

This is only true if the hook unregister happens in the netns exit path.
If it happens during error unwind because a netns init hook returned
an error condition (e.g. out of memory), then the result of
net_generic(net, nfnl_queue_net_id) is undefined.

Only do the cleanup for namespaces that were on the
net_namespace_list list (i.e., all netns ->init() functions were ok).

Cc: "Eric W. Biederman" 
Reported-by: Dale Whitfield 
Fixes: 8405a8fff3f ("netfilter: nf_qeueue: Drop queue entries on 
nf_unregister_hook")
Signed-off-by: Florian Westphal 
---
 AFAICS this works fine as well -- if netns was never on the
 net_namespace_list no packets can be queued so we don't need
 to care if the nf_queue init hook got called or not.

diff --git a/net/netfilter/nf_queue.c b/net/netfilter/nf_queue.c
index 5baa8e2..9722819 100644
--- a/net/netfilter/nf_queue.c
+++ b/net/netfilter/nf_queue.c
@@ -102,6 +102,13 @@ void nf_queue_nf_hook_drop(struct net *net, struct 
nf_hook_ops *ops)
 {
const struct nf_queue_handler *qh;
 
+   /* netns wasn't initialized, error unwind in progress.
+* It's possible that the nfq netns init function was not even
+* called, in which case nfq pernetns data is in undefined state.
+*/
+   if (!net->list.next)
+   return;
+
rcu_read_lock();
qh = rcu_dereference(queue_handler);
if (qh)
-- 
2.7.3



RE: [PATCH net-next v2 2/4] xen-netback: add control protocol implementation

2016-05-11 Thread Paul Durrant
> -Original Message-
> From: Paul Durrant [mailto:paul.durr...@citrix.com]
> Sent: 11 May 2016 16:16
> To: xen-de...@lists.xenproject.org; netdev@vger.kernel.org
> Cc: Paul Durrant; Wei Liu
> Subject: [PATCH net-next v2 2/4] xen-netback: add control protocol
> implementation
> 
> My recent patch to include/xen/interface/io/netif.h defines a new shared
> ring (in addition to the rx and tx rings) for passing control messages
> from a VM frontend driver to a backend driver.
> 
> A previous patch added the necessary boilerplate for mapping the control
> ring from the frontend, should it be created. This patch adds
> implementations for each of the defined protocol messages.
> 
> Signed-off-by: Paul Durrant 
> Cc: Wei Liu 
> ---
> 
> v2:
>  - Use RCU list for hash cache
> ---
>  drivers/net/xen-netback/Makefile|   2 +-
>  drivers/net/xen-netback/common.h|  46 +
>  drivers/net/xen-netback/hash.c  | 386
> 
>  drivers/net/xen-netback/interface.c |  28 ++-
>  drivers/net/xen-netback/netback.c   |  49 -
>  5 files changed, 506 insertions(+), 5 deletions(-)
>  create mode 100644 drivers/net/xen-netback/hash.c
> 
> diff --git a/drivers/net/xen-netback/Makefile b/drivers/net/xen-
> netback/Makefile
> index e346e81..11e02be 100644
> --- a/drivers/net/xen-netback/Makefile
> +++ b/drivers/net/xen-netback/Makefile
> @@ -1,3 +1,3 @@
>  obj-$(CONFIG_XEN_NETDEV_BACKEND) := xen-netback.o
> 
> -xen-netback-y := netback.o xenbus.o interface.o
> +xen-netback-y := netback.o xenbus.o interface.o hash.o
> diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-
> netback/common.h
> index 093a12a..84d6cbd 100644
> --- a/drivers/net/xen-netback/common.h
> +++ b/drivers/net/xen-netback/common.h
> @@ -220,6 +220,35 @@ struct xenvif_mcast_addr {
> 
>  #define XEN_NETBK_MCAST_MAX 64
> 
> +#define XEN_NETBK_MAX_HASH_KEY_SIZE 40
> +#define XEN_NETBK_MAX_HASH_MAPPING_SIZE 128
> +#define XEN_NETBK_HASH_TAG_SIZE 40
> +
> +struct xenvif_hash_cache_entry {
> + struct list_head link;
> + struct rcu_head rcu;
> + u8 tag[XEN_NETBK_HASH_TAG_SIZE];
> + unsigned int len;
> + u32 val;
> + int seq;
> +};
> +
> +struct xenvif_hash_cache {
> + spinlock_t lock;
> + struct list_head list;
> + unsigned int count;
> + atomic_t seq;
> +};
> +
> +struct xenvif_hash {
> + unsigned int alg;
> + u32 flags;
> + u8 key[XEN_NETBK_MAX_HASH_KEY_SIZE];
> + u32 mapping[XEN_NETBK_MAX_HASH_MAPPING_SIZE];
> + unsigned int size;
> + struct xenvif_hash_cache cache;
> +};
> +
>  struct xenvif {
>   /* Unique identifier for this interface. */
>   domid_t  domid;
> @@ -251,6 +280,8 @@ struct xenvif {
>   unsigned int num_queues; /* active queues, resource allocated */
>   unsigned int stalled_queues;
> 
> + struct xenvif_hash hash;
> +
>   struct xenbus_watch credit_watch;
>   struct xenbus_watch mcast_ctrl_watch;
> 
> @@ -353,6 +384,7 @@ extern bool separate_tx_rx_irq;
>  extern unsigned int rx_drain_timeout_msecs;
>  extern unsigned int rx_stall_timeout_msecs;
>  extern unsigned int xenvif_max_queues;
> +extern unsigned int xenvif_hash_cache_size;
> 
>  #ifdef CONFIG_DEBUG_FS
>  extern struct dentry *xen_netback_dbg_root;
> @@ -366,4 +398,18 @@ void xenvif_skb_zerocopy_complete(struct
> xenvif_queue *queue);
>  bool xenvif_mcast_match(struct xenvif *vif, const u8 *addr);
>  void xenvif_mcast_addr_list_free(struct xenvif *vif);
> 
> +/* Hash */
> +void xenvif_init_hash(struct xenvif *vif);
> +void xenvif_deinit_hash(struct xenvif *vif);
> +
> +u32 xenvif_set_hash_alg(struct xenvif *vif, u32 alg);
> +u32 xenvif_get_hash_flags(struct xenvif *vif, u32 *flags);
> +u32 xenvif_set_hash_flags(struct xenvif *vif, u32 flags);
> +u32 xenvif_set_hash_key(struct xenvif *vif, u32 gref, u32 len);
> +u32 xenvif_set_hash_mapping_size(struct xenvif *vif, u32 size);
> +u32 xenvif_set_hash_mapping(struct xenvif *vif, u32 gref, u32 len,
> + u32 off);
> +
> +void xenvif_set_skb_hash(struct xenvif *vif, struct sk_buff *skb);
> +
>  #endif /* __XEN_NETBACK__COMMON_H__ */
> diff --git a/drivers/net/xen-netback/hash.c b/drivers/net/xen-
> netback/hash.c
> new file mode 100644
> index 000..47edfe9
> --- /dev/null
> +++ b/drivers/net/xen-netback/hash.c
> @@ -0,0 +1,386 @@
> +/*
> + * Copyright (c) 2016 Citrix Systems Inc.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License version 2
> + * as published by the Free Software Foundation; or, when distributed
> + * separately from the Linux kernel or incorporated into other
> + * software packages, subject to the following license:
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> copy
> + * of this source file (the "Software"), to deal in the Software without
> + * restriction, including without 

[PATCH net-next v2 1/4] xen-netback: add control ring boilerplate

2016-05-11 Thread Paul Durrant
My recent patch to include/xen/interface/io/netif.h defines a new shared
ring (in addition to the rx and tx rings) for passing control messages
from a VM frontend driver to a backend driver.

This patch adds the necessary code to xen-netback to map this new shared
ring, should it be created by a frontend, but does not add implementations
for any of the defined protocol messages. These are added in a subsequent
patch for clarity.

Signed-off-by: Paul Durrant 
Cc: Wei Liu 
---

v2:
 - Changed error handling style in connect_ctrl_ring()
---
 drivers/net/xen-netback/common.h|  28 +++---
 drivers/net/xen-netback/interface.c | 101 +---
 drivers/net/xen-netback/netback.c   |  99 +--
 drivers/net/xen-netback/xenbus.c|  79 
 4 files changed, 277 insertions(+), 30 deletions(-)

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index f44b388..093a12a 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -260,6 +260,11 @@ struct xenvif {
struct dentry *xenvif_dbg_root;
 #endif
 
+   struct xen_netif_ctrl_back_ring ctrl;
+   struct task_struct *ctrl_task;
+   wait_queue_head_t ctrl_wq;
+   unsigned int ctrl_irq;
+
/* Miscellaneous private stuff. */
struct net_device *dev;
 };
@@ -285,10 +290,15 @@ struct xenvif *xenvif_alloc(struct device *parent,
 int xenvif_init_queue(struct xenvif_queue *queue);
 void xenvif_deinit_queue(struct xenvif_queue *queue);
 
-int xenvif_connect(struct xenvif_queue *queue, unsigned long tx_ring_ref,
-  unsigned long rx_ring_ref, unsigned int tx_evtchn,
-  unsigned int rx_evtchn);
-void xenvif_disconnect(struct xenvif *vif);
+int xenvif_connect_data(struct xenvif_queue *queue,
+   unsigned long tx_ring_ref,
+   unsigned long rx_ring_ref,
+   unsigned int tx_evtchn,
+   unsigned int rx_evtchn);
+void xenvif_disconnect_data(struct xenvif *vif);
+int xenvif_connect_ctrl(struct xenvif *vif, grant_ref_t ring_ref,
+   unsigned int evtchn);
+void xenvif_disconnect_ctrl(struct xenvif *vif);
 void xenvif_free(struct xenvif *vif);
 
 int xenvif_xenbus_init(void);
@@ -300,10 +310,10 @@ int xenvif_queue_stopped(struct xenvif_queue *queue);
 void xenvif_wake_queue(struct xenvif_queue *queue);
 
 /* (Un)Map communication rings. */
-void xenvif_unmap_frontend_rings(struct xenvif_queue *queue);
-int xenvif_map_frontend_rings(struct xenvif_queue *queue,
- grant_ref_t tx_ring_ref,
- grant_ref_t rx_ring_ref);
+void xenvif_unmap_frontend_data_rings(struct xenvif_queue *queue);
+int xenvif_map_frontend_data_rings(struct xenvif_queue *queue,
+  grant_ref_t tx_ring_ref,
+  grant_ref_t rx_ring_ref);
 
 /* Check for SKBs from frontend and schedule backend processing */
 void xenvif_napi_schedule_or_enable_events(struct xenvif_queue *queue);
@@ -318,6 +328,8 @@ void xenvif_kick_thread(struct xenvif_queue *queue);
 
 int xenvif_dealloc_kthread(void *data);
 
+int xenvif_ctrl_kthread(void *data);
+
 void xenvif_rx_queue_tail(struct xenvif_queue *queue, struct sk_buff *skb);
 
 void xenvif_carrier_on(struct xenvif *vif);
diff --git a/drivers/net/xen-netback/interface.c 
b/drivers/net/xen-netback/interface.c
index f5231a2..78a10d2 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -128,6 +128,15 @@ irqreturn_t xenvif_interrupt(int irq, void *dev_id)
return IRQ_HANDLED;
 }
 
+irqreturn_t xenvif_ctrl_interrupt(int irq, void *dev_id)
+{
+   struct xenvif *vif = dev_id;
+
+   wake_up(&vif->ctrl_wq);
+
+   return IRQ_HANDLED;
+}
+
 int xenvif_queue_stopped(struct xenvif_queue *queue)
 {
struct net_device *dev = queue->vif->dev;
@@ -527,9 +536,66 @@ void xenvif_carrier_on(struct xenvif *vif)
rtnl_unlock();
 }
 
-int xenvif_connect(struct xenvif_queue *queue, unsigned long tx_ring_ref,
-  unsigned long rx_ring_ref, unsigned int tx_evtchn,
-  unsigned int rx_evtchn)
+int xenvif_connect_ctrl(struct xenvif *vif, grant_ref_t ring_ref,
+   unsigned int evtchn)
+{
+   struct net_device *dev = vif->dev;
+   void *addr;
+   struct xen_netif_ctrl_sring *shared;
+   struct task_struct *task;
+   int err = -ENOMEM;
+
+   err = xenbus_map_ring_valloc(xenvif_to_xenbus_device(vif),
+&ring_ref, 1, &addr);
+   if (err)
+   goto err;
+
+   shared = (struct xen_netif_ctrl_sring *)addr;
+   BACK_RING_INIT(&vif->ctrl, shared, XEN_PAGE_SIZE);
+
+   init_waitqueue_head(&vif->ctrl_wq);
+
+   err = bind_interdomain_evtchn_to_irqhandler(vif->domid, evtchn,
+   

[PATCH net-next v2 0/4] xen-netback: support for control ring

2016-05-11 Thread Paul Durrant
My recent patch to import an up-to-date include/xen/interface/io/netif.h
from the Xen Project brought in the necessary definitions to support the
new control shared ring and protocol. This patch series updates xen-netback
to support the new ring.

Patch #1 adds the necessary boilerplate to map the control ring and handle
messages. No implementation of the new protocol is included in this patch
so that it can be kept to a reasonable size.

Patch #2 adds the protocol implementation.

Patch #3 adds support for passing hash values calculated by xen-netback to
capable frontends.

Patch #4 adds support for accepting hash values calculated by capable
frontends and using them to set the socket buffer hash.


[PATCH net-next v2 4/4] xen-netback: use hash value from the frontend

2016-05-11 Thread Paul Durrant
My recent patch to include/xen/interface/io/netif.h defines a new extra
info type that can be used to pass hash values between backend and guest
frontend.

This patch adds code to xen-netback to use the value in a hash extra
info fragment passed from the guest frontend in a transmit-side
(i.e. netback receive side) packet to set the skb hash accordingly.

Signed-off-by: Paul Durrant 
Acked-by: Wei Liu 
---
 drivers/net/xen-netback/netback.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/drivers/net/xen-netback/netback.c 
b/drivers/net/xen-netback/netback.c
index 7c72510..a5b5aad 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -1509,6 +1509,33 @@ static void xenvif_tx_build_gops(struct xenvif_queue 
*queue,
}
}
 
+   if (extras[XEN_NETIF_EXTRA_TYPE_HASH - 1].type) {
+   struct xen_netif_extra_info *extra;
+   enum pkt_hash_types type = PKT_HASH_TYPE_NONE;
+
+   extra = &extras[XEN_NETIF_EXTRA_TYPE_HASH - 1];
+
+   switch (extra->u.hash.type) {
+   case _XEN_NETIF_CTRL_HASH_TYPE_IPV4:
+   case _XEN_NETIF_CTRL_HASH_TYPE_IPV6:
+   type = PKT_HASH_TYPE_L3;
+   break;
+
+   case _XEN_NETIF_CTRL_HASH_TYPE_IPV4_TCP:
+   case _XEN_NETIF_CTRL_HASH_TYPE_IPV6_TCP:
+   type = PKT_HASH_TYPE_L4;
+   break;
+
+   default:
+   break;
+   }
+
+   if (type != PKT_HASH_TYPE_NONE)
+   skb_set_hash(skb,
+*(u32 *)extra->u.hash.value,
+type);
+   }
+
XENVIF_TX_CB(skb)->pending_idx = pending_idx;
 
__skb_put(skb, data_len);
-- 
2.1.4



[PATCH net-next v2 2/4] xen-netback: add control protocol implementation

2016-05-11 Thread Paul Durrant
My recent patch to include/xen/interface/io/netif.h defines a new shared
ring (in addition to the rx and tx rings) for passing control messages
from a VM frontend driver to a backend driver.

A previous patch added the necessary boilerplate for mapping the control
ring from the frontend, should it be created. This patch adds
implementations for each of the defined protocol messages.

Signed-off-by: Paul Durrant 
Cc: Wei Liu 
---

v2:
 - Use RCU list for hash cache
---
 drivers/net/xen-netback/Makefile|   2 +-
 drivers/net/xen-netback/common.h|  46 +
 drivers/net/xen-netback/hash.c  | 386 
 drivers/net/xen-netback/interface.c |  28 ++-
 drivers/net/xen-netback/netback.c   |  49 -
 5 files changed, 506 insertions(+), 5 deletions(-)
 create mode 100644 drivers/net/xen-netback/hash.c

diff --git a/drivers/net/xen-netback/Makefile b/drivers/net/xen-netback/Makefile
index e346e81..11e02be 100644
--- a/drivers/net/xen-netback/Makefile
+++ b/drivers/net/xen-netback/Makefile
@@ -1,3 +1,3 @@
 obj-$(CONFIG_XEN_NETDEV_BACKEND) := xen-netback.o
 
-xen-netback-y := netback.o xenbus.o interface.o
+xen-netback-y := netback.o xenbus.o interface.o hash.o
diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index 093a12a..84d6cbd 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -220,6 +220,35 @@ struct xenvif_mcast_addr {
 
 #define XEN_NETBK_MCAST_MAX 64
 
+#define XEN_NETBK_MAX_HASH_KEY_SIZE 40
+#define XEN_NETBK_MAX_HASH_MAPPING_SIZE 128
+#define XEN_NETBK_HASH_TAG_SIZE 40
+
+struct xenvif_hash_cache_entry {
+   struct list_head link;
+   struct rcu_head rcu;
+   u8 tag[XEN_NETBK_HASH_TAG_SIZE];
+   unsigned int len;
+   u32 val;
+   int seq;
+};
+
+struct xenvif_hash_cache {
+   spinlock_t lock;
+   struct list_head list;
+   unsigned int count;
+   atomic_t seq;
+};
+
+struct xenvif_hash {
+   unsigned int alg;
+   u32 flags;
+   u8 key[XEN_NETBK_MAX_HASH_KEY_SIZE];
+   u32 mapping[XEN_NETBK_MAX_HASH_MAPPING_SIZE];
+   unsigned int size;
+   struct xenvif_hash_cache cache;
+};
+
 struct xenvif {
/* Unique identifier for this interface. */
domid_t  domid;
@@ -251,6 +280,8 @@ struct xenvif {
unsigned int num_queues; /* active queues, resource allocated */
unsigned int stalled_queues;
 
+   struct xenvif_hash hash;
+
struct xenbus_watch credit_watch;
struct xenbus_watch mcast_ctrl_watch;
 
@@ -353,6 +384,7 @@ extern bool separate_tx_rx_irq;
 extern unsigned int rx_drain_timeout_msecs;
 extern unsigned int rx_stall_timeout_msecs;
 extern unsigned int xenvif_max_queues;
+extern unsigned int xenvif_hash_cache_size;
 
 #ifdef CONFIG_DEBUG_FS
 extern struct dentry *xen_netback_dbg_root;
@@ -366,4 +398,18 @@ void xenvif_skb_zerocopy_complete(struct xenvif_queue 
*queue);
 bool xenvif_mcast_match(struct xenvif *vif, const u8 *addr);
 void xenvif_mcast_addr_list_free(struct xenvif *vif);
 
+/* Hash */
+void xenvif_init_hash(struct xenvif *vif);
+void xenvif_deinit_hash(struct xenvif *vif);
+
+u32 xenvif_set_hash_alg(struct xenvif *vif, u32 alg);
+u32 xenvif_get_hash_flags(struct xenvif *vif, u32 *flags);
+u32 xenvif_set_hash_flags(struct xenvif *vif, u32 flags);
+u32 xenvif_set_hash_key(struct xenvif *vif, u32 gref, u32 len);
+u32 xenvif_set_hash_mapping_size(struct xenvif *vif, u32 size);
+u32 xenvif_set_hash_mapping(struct xenvif *vif, u32 gref, u32 len,
+   u32 off);
+
+void xenvif_set_skb_hash(struct xenvif *vif, struct sk_buff *skb);
+
 #endif /* __XEN_NETBACK__COMMON_H__ */
diff --git a/drivers/net/xen-netback/hash.c b/drivers/net/xen-netback/hash.c
new file mode 100644
index 000..47edfe9
--- /dev/null
+++ b/drivers/net/xen-netback/hash.c
@@ -0,0 +1,386 @@
+/*
+ * Copyright (c) 2016 Citrix Systems Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED 

[PATCH net-next v2 3/4] xen-netback: pass hash value to the frontend

2016-05-11 Thread Paul Durrant
My recent patch to include/xen/interface/io/netif.h defines a new extra
info type that can be used to pass hash values between backend and guest
frontend.

This patch adds code to xen-netback to pass hash values calculated for
guest receive-side packets (i.e. netback transmit side) to the frontend.

Signed-off-by: Paul Durrant 
Acked-by: Wei Liu 
---
 drivers/net/xen-netback/interface.c | 13 ++-
 drivers/net/xen-netback/netback.c   | 78 +++--
 2 files changed, 77 insertions(+), 14 deletions(-)

diff --git a/drivers/net/xen-netback/interface.c 
b/drivers/net/xen-netback/interface.c
index 483080f..dcca498 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -158,8 +158,17 @@ static u16 xenvif_select_queue(struct net_device *dev, 
struct sk_buff *skb,
struct xenvif *vif = netdev_priv(dev);
unsigned int size = vif->hash.size;
 
-   if (vif->hash.alg == XEN_NETIF_CTRL_HASH_ALGORITHM_NONE)
-   return fallback(dev, skb) % dev->real_num_tx_queues;
+   if (vif->hash.alg == XEN_NETIF_CTRL_HASH_ALGORITHM_NONE) {
+   u16 index = fallback(dev, skb) % dev->real_num_tx_queues;
+
+   /* Make sure there is no hash information in the socket
+* buffer otherwise it would be incorrectly forwarded
+* to the frontend.
+*/
+   skb_clear_hash(skb);
+
+   return index;
+   }
 
xenvif_set_skb_hash(vif, skb);
 
diff --git a/drivers/net/xen-netback/netback.c 
b/drivers/net/xen-netback/netback.c
index 6509d11..7c72510 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -168,6 +168,8 @@ static bool xenvif_rx_ring_slots_available(struct 
xenvif_queue *queue)
needed = DIV_ROUND_UP(skb->len, XEN_PAGE_SIZE);
if (skb_is_gso(skb))
needed++;
+   if (skb->sw_hash)
+   needed++;
 
do {
prod = queue->rx.sring->req_prod;
@@ -285,6 +287,8 @@ struct gop_frag_copy {
struct xenvif_rx_meta *meta;
int head;
int gso_type;
+   int protocol;
+   int hash_present;
 
struct page *page;
 };
@@ -331,8 +335,15 @@ static void xenvif_setup_copy_gop(unsigned long gfn,
npo->copy_off += *len;
info->meta->size += *len;
 
+   if (!info->head)
+   return;
+
/* Leave a gap for the GSO descriptor. */
-   if (info->head && ((1 << info->gso_type) & queue->vif->gso_mask))
+   if ((1 << info->gso_type) & queue->vif->gso_mask)
+   queue->rx.req_cons++;
+
+   /* Leave a gap for the hash extra segment. */
+   if (info->hash_present)
queue->rx.req_cons++;
 
info->head = 0; /* There must be something in this buffer now */
@@ -367,6 +378,11 @@ static void xenvif_gop_frag_copy(struct xenvif_queue 
*queue, struct sk_buff *skb
.npo = npo,
.head = *head,
.gso_type = XEN_NETIF_GSO_TYPE_NONE,
+   /* xenvif_set_skb_hash() will have either set a s/w
+* hash or cleared the hash depending on
+* whether the frontend wants a hash for this skb.
+*/
+   .hash_present = skb->sw_hash,
};
unsigned long bytes;
 
@@ -555,6 +571,7 @@ void xenvif_kick_thread(struct xenvif_queue *queue)
 
 static void xenvif_rx_action(struct xenvif_queue *queue)
 {
+   struct xenvif *vif = queue->vif;
s8 status;
u16 flags;
struct xen_netif_rx_response *resp;
@@ -590,9 +607,10 @@ static void xenvif_rx_action(struct xenvif_queue *queue)
gnttab_batch_copy(queue->grant_copy_op, npo.copy_prod);
 
	while ((skb = __skb_dequeue(&rxq)) != NULL) {
+   struct xen_netif_extra_info *extra = NULL;
 
if ((1 << queue->meta[npo.meta_cons].gso_type) &
-   queue->vif->gso_prefix_mask) {
+   vif->gso_prefix_mask) {
			resp = RING_GET_RESPONSE(&queue->rx,
 queue->rx.rsp_prod_pvt++);
 
@@ -610,7 +628,7 @@ static void xenvif_rx_action(struct xenvif_queue *queue)
queue->stats.tx_bytes += skb->len;
queue->stats.tx_packets++;
 
-   status = xenvif_check_gop(queue->vif,
+   status = xenvif_check_gop(vif,
  XENVIF_RX_CB(skb)->meta_slots_used,
				  &npo);
 
@@ -632,21 +650,57 @@ static void xenvif_rx_action(struct xenvif_queue *queue)
flags);
 
if ((1 << queue->meta[npo.meta_cons].gso_type) &
-   queue->vif->gso_mask) {
-   struct xen_netif_extra_info *gso =
-   (struct xen_netif_extra_info *)
+  

Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-11 Thread Rik van Riel
On Wed, 2016-05-11 at 07:40 -0700, Eric Dumazet wrote:
> On Wed, May 11, 2016 at 6:13 AM, Hannes Frederic Sowa
>  wrote:
> 
> > This looks racy to me as the ksoftirqd could be in the progress to
> > stop
> > and we would miss another softirq invocation.
> 
> Looking at smpboot_thread_fn(), it looks fine :
> 

Additionally, we are talking about waking up
ksoftirqd on the same CPU.

That means the wakeup code could interrupt
ksoftirqd almost going to sleep, but the
two code paths could not run simultaneously.

That does narrow the scope considerably.

-- 
All rights reversed


signature.asc
Description: This is a digitally signed message part


Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-11 Thread Eric Dumazet
On Wed, May 11, 2016 at 7:38 AM, Paolo Abeni  wrote:

> Uh, we have likely the same issue in the net_rx_action() function, which
> also execute with bh disabled and check for jiffies changes even on
> single core hosts ?!?

That is why we have a loop break after netdev_budget=300 packets.
And a sysctl to eventually tune this.
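The budget knob Eric refers to is visible from userspace; a minimal sketch (falls back to the documented default of 300 if the proc file is unavailable in the current environment):

```shell
# Read the NAPI packet budget that bounds one net_rx_action() round;
# fall back to the default (300) when /proc is not mounted/visible.
budget=$(cat /proc/sys/net/core/netdev_budget 2>/dev/null || echo 300)
echo "netdev_budget=${budget}"
```

Raising it (e.g. `sysctl -w net.core.netdev_budget=600`) lets one softirq round process more packets, trading latency elsewhere on that CPU.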

Same issue for softirq handler, look at commit
34376a50fb1fa095b9d0636fa41ed2e73125f214

Your questions about this central piece of networking code are worrying.

>
> Aren't jiffies updated by the timer interrupt? And thus even with
> bh_disabled ?!?

Exactly my point: jiffies won't be updated in your code, since you block BH.


Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-11 Thread Eric Dumazet
On Wed, May 11, 2016 at 6:13 AM, Hannes Frederic Sowa
 wrote:

> This looks racy to me as the ksoftirqd could be in the progress to stop
> and we would miss another softirq invocation.

Looking at smpboot_thread_fn(), it looks fine :

        if (!ht->thread_should_run(td->cpu)) {
                preempt_enable_no_resched();
                schedule();
        } else {
                __set_current_state(TASK_RUNNING);
                preempt_enable();
                ht->thread_fn(td->cpu);
        }


Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-11 Thread Paolo Abeni
On Wed, 2016-05-11 at 06:08 -0700, Eric Dumazet wrote:
> On Wed, 2016-05-11 at 11:48 +0200, Paolo Abeni wrote:
> > Hi Eric,
> > On Tue, 2016-05-10 at 15:51 -0700, Eric Dumazet wrote:
> > > On Wed, 2016-05-11 at 00:32 +0200, Hannes Frederic Sowa wrote:
> > > 
> > > > Not only did we want to present this solely as a bugfix but also as as
> > > > performance enhancements in case of virtio (as you can see in the cover
> > > > letter). Given that a long time ago there was a tendency to remove
> > > > softirqs completely, we thought it might be very interesting, that a
> > > > threaded napi in general seems to be absolutely viable nowadays and
> > > > might offer new features.
> > > 
> > > Well, you did not fix the bug, you worked around by adding yet another
> > > layer, with another sysctl that admins or programs have to manage.
> > > 
> > > If you have a special need for virtio, do not hide it behind a 'bug fix'
> > > but add it as a features request.
> > > 
> > > This ksoftirqd issue is real and a fix looks very reasonable.
> > > 
> > > Please try this patch, as I had very good success with it.
> > 
> > Thank you for your time and your effort.
> > 
> > I tested your patch on the bare metal "single core" scenario, disabling
> > the unneeded cores with:
> > CPUS=`nproc`
> > for I in `seq 1 $CPUS`; do echo 0  >  
> > /sys/devices/system/node/node0/cpu$I/online; done
> > 
> > And I got a:
> > 
> > [   86.925249] Broke affinity for irq 
> > 
> 
> Was it fatal, or simply a warning that you are removing the cpu that was
> the only allowed cpu in an affinity_mask ?

The above message is emitted with pr_notice() by the x86 version of
fixup_irqs(). It's not fatal; the host is alive and well after that. The
un-patched kernel does not emit it when disabling CPUs.

I'll try to look into this later.

> Looks another bug to fix then ? We disabled CPU hotplug here at Google
> for our production, as it was notoriously buggy. No time to fix dozens
> of issues added by a crowd of developers that do not even know a cpu can
> be unplugged.
> 
> Maybe some caller of local_bh_disable()/local_bh_enable() expected that
> current softirq would be processed. Obviously flaky even before the
> patches.
> 
> > for each irq number generated by a network device.
> > 
> > In this scenario, your patch solves the ksoftirqd issue, performing
> > comparable to the napi threaded patches (with a negative delta in the
> > noise range) and introducing a minor regression with a single flow, in
> > the noise range (3%).
> > 
> > As said in a previous mail, we actually experimented with something similar,
> > but it felt quite hackish.
> 
> Right, we are networking guys, and we feel that messing with such core
> infra is not for us. So we feel comfortable adding a pure networking
> patch.
> 
> > 
> > AFAICS this patch adds three more tests in the fast path and affect all
> > other softirq use case. I'm not sure how to check for regression there.
> 
> It is obvious to me that the ksoftirqd mechanism is not working as intended.
> 
> Fixing it might uncover bugs from parts of the kernel relying on the
> bug, indirectly or directly. Is it a good thing ?
> 
> I can not tell before trying.
> 
> Just by looking at /proc/{ksoftirqd_pid}/sched you can see the problem,
> as we normally schedule ksoftirqd under stress but most of the time,
> the softirq items were processed by other tasks as you found out.
> 
> 
> > 
> > The napi thread patches are actually a new feature, that also fixes the
> > ksoftirqd issue: hunting the ksoftirqd issue has been the initial
> > trigger for this work. I'm sorry for not being clear enough in the cover
> > letter.
> > 
> > The napi thread patches offer additional benefits, i.e. an additional
> > relevant gain in the described test scenario, and do not impact on other
> > subsystems/kernel entities. 
> > 
> > I still think they are worthy, and I bet you would disagree, but could
> > you please articulate more which parts concern you most and/or are more
> > bloated ?
> 
> Just look at the added code. napi_threaded_poll() is very buggy, but
> honestly I do not want to fix the bugs you added there. If you have only
> one vcpu, how can jiffies ever change since you block BH?

Uh, we have likely the same issue in the net_rx_action() function, which
also execute with bh disabled and check for jiffies changes even on
single core hosts ?!?

Aren't jiffies updated by the timer interrupt? And thus even with
bh_disabled ?!?

> I was planning to remove cond_resched_softirq() that we no longer use
> after my recent changes to TCP stack,
> and you call it again (while it is obviously buggy since it does not
> check if a BH is pending, only if a thread needs the cpu)

I missed that, thank you for pointing out.

> I prefer fixing the existing code, really. It took us years to
> understand it and maybe fix it.
> 
> Just think of what will happen if you have 10 devices (10 new threads in
> your model) and one cpu.
> 
> Instead of the nice existing 

Re: [PATCH v9 net-next 4/7] openvswitch: add layer 3 flow/port support

2016-05-11 Thread Jiri Benc
On Wed, 11 May 2016 12:28:14 +0900, Simon Horman wrote:
> I think that at this stage I would prefer to prohibit push_eth() acting
> on a packet which already has an ethernet header. Indeed that is what
> my patch-set already does in its modifications of __ovs_nla_copy_actions().
> 
> The reason that I lean towards prohibiting this is that I do not
> have an easy way to exercise this case within the current patch-set.
> And thus this extra complexity seems well suited to being handled handled
> incrementally as further work.

Works for me. I don't see any real usage for multiple Ethernet headers.

Thanks!

 Jiri


Re: [PATCH v9 net-next 4/7] openvswitch: add layer 3 flow/port support

2016-05-11 Thread Jiri Benc
On Wed, 11 May 2016 12:06:35 +0900, Simon Horman wrote:
> Is this close to what you had in mind?

Yes but see below.

> @@ -739,17 +729,17 @@ int ovs_flow_key_extract(const struct ip_tunnel_info 
> *tun_info,
>   key->phy.skb_mark = skb->mark;
>   ovs_ct_fill_key(skb, key);
>   key->ovs_flow_hash = 0;
> - key->phy.is_layer3 = is_layer3;
> + key->phy.is_layer3 = (tun_info && skb->mac_len == 0);

Do we have to depend on tun_info? It would be nice to support all
ARPHRD_NONE interfaces, not just tunnels. The tun interface (from
the tuntap driver) comes to mind, for example.

> --- a/net/openvswitch/vport-netdev.c
> +++ b/net/openvswitch/vport-netdev.c
> @@ -60,7 +60,21 @@ static void netdev_port_receive(struct sk_buff *skb)
>   if (vport->dev->type == ARPHRD_ETHER) {
>   skb_push(skb, ETH_HLEN);
>   skb_postpush_rcsum(skb, skb->data, ETH_HLEN);
> + } else if (vport->dev->type == ARPHRD_NONE) {
> + if (skb->protocol == htons(ETH_P_TEB)) {
> + struct ethhdr *eth = eth_hdr(skb);
> +
> + if (unlikely(skb->len < ETH_HLEN))
> + goto error;
> +
> + skb->mac_len = ETH_HLEN;
> + if (eth->h_proto == htons(ETH_P_8021Q))
> + skb->mac_len += VLAN_HLEN;
> + } else {
> + skb->mac_len = 0;
> + }

Without putting much thought into this, could this perhaps be left for
parse_ethertype (called from key_extract) to do?

Thanks,

 Jiri


[PATCH iproute2] ip link: Add support for kernel side filtering

2016-05-11 Thread David Ahern
Kernel gained support for filtering link dumps with commit dc599f76c22b
("net: Add support for filtering link dump by master device and kind").
Add support to ip link command. If a user passes master device or
kind to ip link command they are added to the link dump request message.

Signed-off-by: David Ahern 
---
 include/libnetlink.h |  6 ++
 ip/ipaddress.c   | 33 -
 lib/libnetlink.c | 28 
 3 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/include/libnetlink.h b/include/libnetlink.h
index 491263f7e103..f7b85dccef36 100644
--- a/include/libnetlink.h
+++ b/include/libnetlink.h
@@ -38,6 +38,12 @@ int rtnl_wilddump_request(struct rtnl_handle *rth, int fam, 
int type)
 int rtnl_wilddump_req_filter(struct rtnl_handle *rth, int fam, int type,
__u32 filt_mask)
__attribute__((warn_unused_result));
+
+typedef int (*req_filter_fn_t)(struct nlmsghdr *nlh, int reqlen);
+
+int rtnl_wilddump_req_filter_fn(struct rtnl_handle *rth, int fam, int type,
+   req_filter_fn_t fn)
+   __attribute__((warn_unused_result));
 int rtnl_dump_request(struct rtnl_handle *rth, int type, void *req,
 int len)
__attribute__((warn_unused_result));
diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index aac7970e16dd..0692fbacd669 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -1476,6 +1476,36 @@ static int ipaddr_flush(void)
return 1;
 }
 
+static int iplink_filter_req(struct nlmsghdr *nlh, int reqlen)
+{
+   int err;
+
+   err = addattr32(nlh, reqlen, IFLA_EXT_MASK, RTEXT_FILTER_VF);
+   if (err)
+   return err;
+
+   if (filter.master) {
+   err = addattr32(nlh, reqlen, IFLA_MASTER, filter.master);
+   if (err)
+   return err;
+   }
+
+   if (filter.kind) {
+   struct rtattr *linkinfo;
+
+   linkinfo = addattr_nest(nlh, reqlen, IFLA_LINKINFO);
+
+   err = addattr_l(nlh, reqlen, IFLA_INFO_KIND, filter.kind,
+   strlen(filter.kind));
+   if (err)
+   return err;
+
+   addattr_nest_end(nlh, linkinfo);
+   }
+
+   return 0;
+}
+
 static int ipaddr_list_flush_or_save(int argc, char **argv, int action)
 {
struct nlmsg_chain linfo = { NULL, NULL};
@@ -1638,7 +1668,8 @@ static int ipaddr_list_flush_or_save(int argc, char 
**argv, int action)
exit(0);
}
 
-   if (rtnl_wilddump_request(&rth, preferred_family, RTM_GETLINK) < 0) {
+   if (rtnl_wilddump_req_filter_fn(&rth, preferred_family, RTM_GETLINK,
+   iplink_filter_req) < 0) {
perror("Cannot send dump request");
exit(1);
}
diff --git a/lib/libnetlink.c b/lib/libnetlink.c
index a90e52ca2c0a..0adcbf3f6e38 100644
--- a/lib/libnetlink.c
+++ b/lib/libnetlink.c
@@ -129,6 +129,34 @@ int rtnl_wilddump_req_filter(struct rtnl_handle *rth, int 
family, int type,
	return send(rth->fd, (void *)&req, sizeof(req), 0);
 }
 
+int rtnl_wilddump_req_filter_fn(struct rtnl_handle *rth, int family, int type,
+   req_filter_fn_t filter_fn)
+{
+   struct {
+   struct nlmsghdr nlh;
+   struct ifinfomsg ifm;
+   char buf[1024];
+   } req;
+   int err;
+
+   if (!filter_fn)
+   return -EINVAL;
+
+   memset(&req, 0, sizeof(req));
+   req.nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
+   req.nlh.nlmsg_type = type;
+   req.nlh.nlmsg_flags = NLM_F_DUMP|NLM_F_REQUEST;
+   req.nlh.nlmsg_pid = 0;
+   req.nlh.nlmsg_seq = rth->dump = ++rth->seq;
+   req.ifm.ifi_family = family;
+
+   err = filter_fn(&req.nlh, sizeof(req));
+   if (err)
+   return err;
+
+   return send(rth->fd, (void *)&req, sizeof(req), 0);
+}
+
 int rtnl_send(struct rtnl_handle *rth, const void *buf, int len)
 {
return send(rth->fd, buf, len, 0);
-- 
2.1.4



Re: [PATCH v9 net-next 4/7] openvswitch: add layer 3 flow/port support

2016-05-11 Thread Jiri Benc
On Wed, 11 May 2016 10:50:12 +0900, Simon Horman wrote:
> On Tue, May 10, 2016 at 02:01:06PM +0200, Jiri Benc wrote:
> > We have two options here:
> > 
> > 1. As for metadata tunnels all the info is in metadata_dst and we
> >don't need the IP/GRE header for anything, we can make the ipgre
> >interface ARPHRD_NONE in metadata based mode.
> > 
> > 2. We can fix this up in ovs after receiving the packet from
> >ARPHRD_IPGRE interface.
> > 
> > I think the first option is the correct one. We already don't assign
> > dev->header_ops in metadata mode. I'll prepare a patch.
> 
> I agree that 1. seems to be the better approach.

I just sent a patch that fixes this. And we have the same bug in
VXLAN-GPE, I sent a patch, too.

> Sure, if that is your preference I think it should be simple enough to
> implement. I agree that netdev_port_receive() looks like a good place for
> this.

Great, thanks!

 Jiri


[PATCH net-next] vxlan: set mac_header correctly in GPE mode

2016-05-11 Thread Jiri Benc
For VXLAN-GPE, the interface is ARPHRD_NONE, thus we need to reset
mac_header after pulling the outer header.

Fixes: e1e5314de08b ("vxlan: implement GPE")
Signed-off-by: Jiri Benc 
---
 drivers/net/vxlan.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 2f29d20aa08f..e030a804b772 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1338,6 +1338,8 @@ static int vxlan_rcv(struct sock *sk, struct sk_buff *skb)
if (__iptunnel_pull_header(skb, VXLAN_HLEN, protocol, raw_proto,
				   !net_eq(vxlan->net, dev_net(vxlan->dev))))
goto drop;
+   if (raw_proto)
+   skb_reset_mac_header(skb);
 
if (vxlan_collect_metadata(vs)) {
__be32 vni = vxlan_vni(vxlan_hdr(skb)->vx_vni);
-- 
1.8.3.1



[PATCH net] gre: do not keep the GRE header around in collect medata mode

2016-05-11 Thread Jiri Benc
For ipgre interface in collect metadata mode, it doesn't make sense for the
interface to be of ARPHRD_IPGRE type. The outer header of received packets
is not needed, as all the information from it is present in metadata_dst. We
already don't set ipgre_header_ops for collect metadata interfaces, which is
the only consumer of mac_header pointing to the outer IP header.

Just set the interface type to ARPHRD_NONE in collect metadata mode for
ipgre (not gretap, that still correctly stays ARPHRD_ETHER) and reset
mac_header.
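For reference, a collect-metadata ipgre device — the mode this patch switches to ARPHRD_NONE — is created with the `external` flag. A sketch (the device name `gre-md` is hypothetical, and each command is printed rather than failing where ip(8) cannot run or the caller is unprivileged):

```shell
# After this patch, /sys/class/net/<dev>/type for an "external" ipgre
# device reports 65534 (ARPHRD_NONE) instead of 778 (ARPHRD_IPGRE).
for cmd in "ip link add gre-md type gre external" \
           "cat /sys/class/net/gre-md/type"; do
    echo "# $cmd"
    $cmd 2>/dev/null || true
done
```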

Fixes: a64b04d86d14 ("gre: do not assign header_ops in collect metadata mode")
Fixes: 2e15ea390e6f4 ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jiri Benc 
---
 net/ipv4/ip_gre.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 205a2b8a5a84..4cc84212cce1 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -398,7 +398,10 @@ static int ipgre_rcv(struct sk_buff *skb, const struct 
tnl_ptk_info *tpi)
  iph->saddr, iph->daddr, tpi->key);
 
if (tunnel) {
-   skb_pop_mac_header(skb);
+   if (tunnel->dev->type != ARPHRD_NONE)
+   skb_pop_mac_header(skb);
+   else
+   skb_reset_mac_header(skb);
if (tunnel->collect_md) {
__be16 flags;
__be64 tun_id;
@@ -1031,6 +1034,8 @@ static void ipgre_netlink_parms(struct net_device *dev,
struct ip_tunnel *t = netdev_priv(dev);
 
t->collect_md = true;
+   if (dev->type == ARPHRD_IPGRE)
+   dev->type = ARPHRD_NONE;
}
 }
 
-- 
1.8.3.1



Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-11 Thread Hannes Frederic Sowa
On 11.05.2016 15:39, Hannes Frederic Sowa wrote:
> I am fine with that. It is us to show clear benefits or use cases for
> that. If we fail with that, no problem at all that the patches get
> rejected, we don't want to add bloat to the kernel, for sure! At this
> point I still think a possibility to run napi in kthreads will allow
> specific workloads to see an improvement. Maybe the simple branch in
> napi_schedule is just worth so people can play around with it. As it
> shouldn't change behavior we can later on simply remove it.

Actually, I consider it a bug that when we force the kernel to use
threaded irqs that we only schedule the softirq for napi later on and
don't do the processing within the thread.

Bye,
Hannes



Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-11 Thread Hannes Frederic Sowa
Hi all,

On 11.05.2016 15:08, Eric Dumazet wrote:
> On Wed, 2016-05-11 at 11:48 +0200, Paolo Abeni wrote:
>> Hi Eric,
>> On Tue, 2016-05-10 at 15:51 -0700, Eric Dumazet wrote:
>>> On Wed, 2016-05-11 at 00:32 +0200, Hannes Frederic Sowa wrote:
>>>
 Not only did we want to present this solely as a bugfix but also as as
 performance enhancements in case of virtio (as you can see in the cover
 letter). Given that a long time ago there was a tendency to remove
 softirqs completely, we thought it might be very interesting, that a
 threaded napi in general seems to be absolutely viable nowadays and
 might offer new features.
>>>
>>> Well, you did not fix the bug, you worked around by adding yet another
>>> layer, with another sysctl that admins or programs have to manage.
>>>
>>> If you have a special need for virtio, do not hide it behind a 'bug fix'
>>> but add it as a features request.
>>>
>>> This ksoftirqd issue is real and a fix looks very reasonable.
>>>
>>> Please try this patch, as I had very good success with it.
>>
>> Thank you for your time and your effort.
>>
>> I tested your patch on the bare metal "single core" scenario, disabling
>> the unneeded cores with:
>> CPUS=`nproc`
>> for I in `seq 1 $CPUS`; do echo 0  >  
>> /sys/devices/system/node/node0/cpu$I/online; done
>>
>> And I got a:
>>
>> [   86.925249] Broke affinity for irq 
>>
> 
> Was it fatal, or simply a warning that you are removing the cpu that was
> the only allowed cpu in an affinity_mask ?
> 
> Looks another bug to fix then ? We disabled CPU hotplug here at Google
> for our production, as it was notoriously buggy. No time to fix dozens
> of issues added by a crowd of developers that do not even know a cpu can
> be unplugged.
> 
> Maybe some caller of local_bh_disable()/local_bh_enable() expected that
> current softirq would be processed. Obviously flaky even before the
> patches.

Yes, I fear this could come up. If we want to target net or stable maybe
we should maybe special case this patch specifically for net-rx?

>> for each irq number generated by a network device.
>>
>> In this scenario, your patch solves the ksoftirqd issue, performing
>> comparable to the napi threaded patches (with a negative delta in the
>> noise range) and introducing a minor regression with a single flow, in
>> the noise range (3%).
>>
>> As said in a previous mail, we actually experimented with something similar,
>> but it felt quite hackish.
> 
> Right, we are networking guys, and we feel that messing with such core
> infra is not for us. So we feel comfortable adding a pure networking
> patch.

We posted this patch as an RFC. My initial internal proposal only had a
check in ___napi_schedule and completely relied on threaded irqs and
didn't spawn a thread per napi instance in the networking stack. I think
this is the better approach long term, as it allows to configure
threaded irqs per device and doesn't specifically deal with networking
only. NAPI must be aware of when to schedule, obviously, so we need
another check in napi_schedule.

My plan was definitely to go with something more generic, but we didn't
yet know how to express that in a generic way, but relied on the forced
threaded irqs kernel parameter.

>> AFAICS this patch adds three more tests in the fast path and affects all
>> other softirq use cases. I'm not sure how to check for regressions there.
> 
> It is obvious to me that the ksoftirqd mechanism is not working as intended.

Yes.

> Fixing it might uncover bugs from parts of the kernel relying on the
> bug, indirectly or directly. Is it a good thing ?
> 
> I can not tell before trying.
> 
> Just by looking at /proc/{ksoftirqd_pid}/sched you can see the problem,
> as we normally schedule ksoftirqd under stress but most of the time,
> the softirq items were processed by other tasks as you found out.

Exactly, the pending mask gets reset by the task handling the softirq
inline and ksoftirqd runs dry too early not processing any more softirq
notifications.

>> The napi thread patches are actually a new feature, that also fixes the
>> ksoftirqd issue: hunting the ksoftirqd issue has been the initial
>> trigger for this work. I'm sorry for not being clear enough in the cover
>> letter.
>>
>> The napi thread patches offer additional benefits, i.e. an additional
>> relevant gain in the described test scenario, and do not impact on other
>> subsystems/kernel entities. 
>>
>> I still think they are worthy, and I bet you would disagree, but could
>> you please articulate more which parts concern you most and/or are more
>> bloated ?
> 
> Just look at the added code. napi_threaded_poll() is very buggy, but
> honestly I do not want to fix the bugs you added there. If you have only
> one vcpu, how jiffies can ever change since you block BH ?

I think the local_bh_disable/enable needs to be more fine granular,
correct (inside the loop).

> I was planning to remove cond_resched_softirq() that we no longer use
> after my recent changes to 

[PATCH net-next v2 12/14] qed*: IOV support spoof-checking

2016-05-11 Thread Yuval Mintz
Add support in `ndo_set_vf_spoofchk' for allowing PF control over
its VF spoof-checking configuration.

Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/qlogic/qed/qed_l2.c |  4 ++
 drivers/net/ethernet/qlogic/qed/qed_l2.h |  2 +
 drivers/net/ethernet/qlogic/qed/qed_sriov.c  | 91 
 drivers/net/ethernet/qlogic/qed/qed_sriov.h  |  2 +
 drivers/net/ethernet/qlogic/qede/qede_main.c | 11 
 include/linux/qed/qed_iov_if.h   |  2 +
 6 files changed, 112 insertions(+)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_l2.c 
b/drivers/net/ethernet/qlogic/qed/qed_l2.c
index 8d83250..e0275a7 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_l2.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_l2.c
@@ -381,6 +381,10 @@ int qed_sp_vport_update(struct qed_hwfn *p_hwfn,
p_ramrod->common.tx_switching_en = p_params->tx_switching_flg;
p_cmn->update_tx_switching_en_flg = p_params->update_tx_switching_flg;
 
+   p_cmn->anti_spoofing_en = p_params->anti_spoofing_en;
+   val = p_params->update_anti_spoofing_en_flg;
+   p_ramrod->common.update_anti_spoofing_en_flg = val;
+
rc = qed_sp_vport_update_rss(p_hwfn, p_ramrod, p_rss_params);
if (rc) {
/* Return spq entry which is taken in qed_sp_init_request()*/
diff --git a/drivers/net/ethernet/qlogic/qed/qed_l2.h 
b/drivers/net/ethernet/qlogic/qed/qed_l2.h
index fad30ae..a04fb7f 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_l2.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_l2.h
@@ -149,6 +149,8 @@ struct qed_sp_vport_update_params {
u8  update_tx_switching_flg;
u8  tx_switching_flg;
u8  update_approx_mcast_flg;
+   u8  update_anti_spoofing_en_flg;
+   u8  anti_spoofing_en;
u8  update_accept_any_vlan_flg;
u8  accept_any_vlan;
unsigned long   bins[8];
diff --git a/drivers/net/ethernet/qlogic/qed/qed_sriov.c 
b/drivers/net/ethernet/qlogic/qed/qed_sriov.c
index c9a3bb6..804102c 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_sriov.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_sriov.c
@@ -1234,6 +1234,39 @@ out:
 sizeof(struct pfvf_acquire_resp_tlv), vfpf_status);
 }
 
+static int __qed_iov_spoofchk_set(struct qed_hwfn *p_hwfn,
+ struct qed_vf_info *p_vf, bool val)
+{
+   struct qed_sp_vport_update_params params;
+   int rc;
+
+   if (val == p_vf->spoof_chk) {
+   DP_VERBOSE(p_hwfn, QED_MSG_IOV,
+  "Spoofchk value[%d] is already configured\n", val);
+   return 0;
+   }
+
+   memset(&params, 0, sizeof(struct qed_sp_vport_update_params));
+   params.opaque_fid = p_vf->opaque_fid;
+   params.vport_id = p_vf->vport_id;
+   params.update_anti_spoofing_en_flg = 1;
+   params.anti_spoofing_en = val;
+
+   rc = qed_sp_vport_update(p_hwfn, &params, QED_SPQ_MODE_EBLOCK, NULL);
+   if (!rc) {
+   p_vf->spoof_chk = val;
+   p_vf->req_spoofchk_val = p_vf->spoof_chk;
+   DP_VERBOSE(p_hwfn, QED_MSG_IOV,
+  "Spoofchk val[%d] configured\n", val);
+   } else {
+   DP_VERBOSE(p_hwfn, QED_MSG_IOV,
+  "Spoofchk configuration[val:%d] failed for VF[%d]\n",
+  val, p_vf->relative_vf_id);
+   }
+
+   return rc;
+}
+
 static int qed_iov_reconfigure_unicast_vlan(struct qed_hwfn *p_hwfn,
struct qed_vf_info *p_vf)
 {
@@ -1476,6 +1509,8 @@ static void qed_iov_vf_mbx_start_vport(struct qed_hwfn 
*p_hwfn,
 
/* Force configuration if needed on the newly opened vport */
qed_iov_configure_vport_forced(p_hwfn, vf, *p_bitmap);
+
+   __qed_iov_spoofchk_set(p_hwfn, vf, vf->req_spoofchk_val);
}
qed_iov_prepare_resp(p_hwfn, p_ptt, vf, CHANNEL_TLV_VPORT_START,
 sizeof(struct pfvf_def_resp_tlv), status);
@@ -1489,6 +1524,7 @@ static void qed_iov_vf_mbx_stop_vport(struct qed_hwfn 
*p_hwfn,
int rc;
 
vf->vport_instance--;
+   vf->spoof_chk = false;
 
rc = qed_sp_vport_stop(p_hwfn, vf->opaque_fid, vf->vport_id);
if (rc != 0) {
@@ -2782,6 +2818,17 @@ void qed_iov_bulletin_set_forced_vlan(struct qed_hwfn 
*p_hwfn,
qed_iov_configure_vport_forced(p_hwfn, vf_info, feature);
 }
 
+static bool qed_iov_vf_has_vport_instance(struct qed_hwfn *p_hwfn, int vfid)
+{
+   struct qed_vf_info *p_vf_info;
+
+   p_vf_info = qed_iov_get_vf_info(p_hwfn, (u16) vfid, true);
+   if (!p_vf_info)
+   return false;
+
+   return !!p_vf_info->vport_instance;
+}
+
 bool qed_iov_is_vf_stopped(struct qed_hwfn 

[PATCH net-next v2 08/14] qede: Add VF support

2016-05-11 Thread Yuval Mintz
Adding a PCI callback for `sriov_configure' and a new PCI device id for
the VF [+ some minor changes to accommodate differences between PF and VF
in the qede].
Following this, VF creation should be possible and the entire subset of
existing PF functionality that's allowed to VFs should be supported.

Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/qlogic/qede/qede.h |  4 ++
 drivers/net/ethernet/qlogic/qede/qede_ethtool.c | 43 +++-
 drivers/net/ethernet/qlogic/qede/qede_main.c| 52 +
 3 files changed, 90 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede.h 
b/drivers/net/ethernet/qlogic/qede/qede.h
index ff3ac0c..47d6b22 100644
--- a/drivers/net/ethernet/qlogic/qede/qede.h
+++ b/drivers/net/ethernet/qlogic/qede/qede.h
@@ -112,6 +112,10 @@ struct qede_dev {
u32 dp_module;
u8  dp_level;
 
+   u32 flags;
+#define QEDE_FLAG_IS_VF	BIT(0)
+#define IS_VF(edev)	(!!((edev)->flags & QEDE_FLAG_IS_VF))
+
const struct qed_eth_ops*ops;
 
struct qed_dev_eth_info dev_info;
diff --git a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c 
b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
index 0d04f16..1bc7535 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
@@ -151,6 +151,8 @@ static void qede_get_strings_stats(struct qede_dev *edev, 
u8 *buf)
int i, j, k;
 
for (i = 0, j = 0; i < QEDE_NUM_STATS; i++) {
+   if (IS_VF(edev) && qede_stats_arr[i].pf_only)
+   continue;
strcpy(buf + j * ETH_GSTRING_LEN,
   qede_stats_arr[i].string);
j++;
@@ -194,8 +196,11 @@ static void qede_get_ethtool_stats(struct net_device *dev,
 
	mutex_lock(&edev->qede_lock);
 
-   for (sidx = 0; sidx < QEDE_NUM_STATS; sidx++)
+   for (sidx = 0; sidx < QEDE_NUM_STATS; sidx++) {
+   if (IS_VF(edev) && qede_stats_arr[sidx].pf_only)
+   continue;
buf[cnt++] = QEDE_STATS_DATA(edev, sidx);
+   }
 
for (sidx = 0; sidx < QEDE_NUM_RQSTATS; sidx++) {
buf[cnt] = 0;
@@ -214,6 +219,13 @@ static int qede_get_sset_count(struct net_device *dev, int 
stringset)
 
switch (stringset) {
case ETH_SS_STATS:
+   if (IS_VF(edev)) {
+   int i;
+
+   for (i = 0; i < QEDE_NUM_STATS; i++)
+   if (qede_stats_arr[i].pf_only)
+   num_stats--;
+   }
return num_stats + QEDE_NUM_RQSTATS;
case ETH_SS_PRIV_FLAGS:
return QEDE_PRI_FLAG_LEN;
@@ -1142,7 +1154,34 @@ static const struct ethtool_ops qede_ethtool_ops = {
.self_test = qede_self_test,
 };
 
+static const struct ethtool_ops qede_vf_ethtool_ops = {
+   .get_settings = qede_get_settings,
+   .get_drvinfo = qede_get_drvinfo,
+   .get_msglevel = qede_get_msglevel,
+   .set_msglevel = qede_set_msglevel,
+   .get_link = qede_get_link,
+   .get_ringparam = qede_get_ringparam,
+   .set_ringparam = qede_set_ringparam,
+   .get_strings = qede_get_strings,
+   .get_ethtool_stats = qede_get_ethtool_stats,
+   .get_priv_flags = qede_get_priv_flags,
+   .get_sset_count = qede_get_sset_count,
+   .get_rxnfc = qede_get_rxnfc,
+   .set_rxnfc = qede_set_rxnfc,
+   .get_rxfh_indir_size = qede_get_rxfh_indir_size,
+   .get_rxfh_key_size = qede_get_rxfh_key_size,
+   .get_rxfh = qede_get_rxfh,
+   .set_rxfh = qede_set_rxfh,
+   .get_channels = qede_get_channels,
+   .set_channels = qede_set_channels,
+};
+
 void qede_set_ethtool_ops(struct net_device *dev)
 {
-   dev->ethtool_ops = &qede_ethtool_ops;
+   struct qede_dev *edev = netdev_priv(dev);
+
+   if (IS_VF(edev))
+   dev->ethtool_ops = &qede_vf_ethtool_ops;
+   else
+   dev->ethtool_ops = &qede_ethtool_ops;
 }
diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c 
b/drivers/net/ethernet/qlogic/qede/qede_main.c
index 04b15ad..57e3426 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -63,6 +63,7 @@ static const struct qed_eth_ops *qed_ops;
 #define CHIP_NUM_57980S_100	0x1644
 #define CHIP_NUM_57980S_50 0x1654
 #define CHIP_NUM_57980S_25 0x1656
+#define CHIP_NUM_57980S_IOV	0x1664
 
 #ifndef PCI_DEVICE_ID_NX2_57980E
 #define PCI_DEVICE_ID_57980S_40	CHIP_NUM_57980S_40
@@ -71,15 +72,22 @@ static const struct qed_eth_ops *qed_ops;
 #define PCI_DEVICE_ID_57980S_100   CHIP_NUM_57980S_100
 #define PCI_DEVICE_ID_57980S_50	CHIP_NUM_57980S_50
 #define PCI_DEVICE_ID_57980S_25	CHIP_NUM_57980S_25
+#define 

[PATCH net-next v2 11/14] qed*: IOV link control

2016-05-11 Thread Yuval Mintz
This adds support for 2 ndos that allow the PF to tweak the VF's view of the
link - `ndo_set_vf_link_state' to allow it a view independent of the PF's,
and `ndo_set_vf_rate' which would allow the PF to limit the VF speed.

Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/qlogic/qed/qed.h|   2 +
 drivers/net/ethernet/qlogic/qed/qed_dev.c|  76 +
 drivers/net/ethernet/qlogic/qed/qed_sriov.c  | 164 +++
 drivers/net/ethernet/qlogic/qed/qed_sriov.h  |   6 +
 drivers/net/ethernet/qlogic/qede/qede_main.c |  26 +
 include/linux/qed/qed_iov_if.h   |   6 +
 6 files changed, 280 insertions(+)

diff --git a/drivers/net/ethernet/qlogic/qed/qed.h 
b/drivers/net/ethernet/qlogic/qed/qed.h
index d7da645..77323fc 100644
--- a/drivers/net/ethernet/qlogic/qed/qed.h
+++ b/drivers/net/ethernet/qlogic/qed/qed.h
@@ -554,8 +554,10 @@ static inline u8 qed_concrete_to_sw_fid(struct qed_dev 
*cdev,
 
 #define PURE_LB_TC 8
 
+int qed_configure_vport_wfq(struct qed_dev *cdev, u16 vp_id, u32 rate);
 void qed_configure_vp_wfq_on_link_change(struct qed_dev *cdev, u32 
min_pf_rate);
 
+void qed_clean_wfq_db(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt);
 #define QED_LEADING_HWFN(dev)   (>hwfns[0])
 
 /* Other Linux specific common definitions */
diff --git a/drivers/net/ethernet/qlogic/qed/qed_dev.c 
b/drivers/net/ethernet/qlogic/qed/qed_dev.c
index e75e73a..acaa286 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dev.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_dev.c
@@ -1889,6 +1889,32 @@ static int qed_init_wfq_param(struct qed_hwfn *p_hwfn,
return 0;
 }
 
+static int __qed_configure_vport_wfq(struct qed_hwfn *p_hwfn,
+struct qed_ptt *p_ptt, u16 vp_id, u32 rate)
+{
+   struct qed_mcp_link_state *p_link;
+   int rc = 0;
+
+   p_link = &p_hwfn->cdev->hwfns[0].mcp_info->link_output;
+
+   if (!p_link->min_pf_rate) {
+   p_hwfn->qm_info.wfq_data[vp_id].min_speed = rate;
+   p_hwfn->qm_info.wfq_data[vp_id].configured = true;
+   return rc;
+   }
+
+   rc = qed_init_wfq_param(p_hwfn, vp_id, rate, p_link->min_pf_rate);
+
+   if (rc == 0)
+   qed_configure_wfq_for_all_vports(p_hwfn, p_ptt,
+p_link->min_pf_rate);
+   else
+   DP_NOTICE(p_hwfn,
+ "Validation failed while configuring min rate\n");
+
+   return rc;
+}
+
 static int __qed_configure_vp_wfq_on_link_change(struct qed_hwfn *p_hwfn,
 struct qed_ptt *p_ptt,
 u32 min_pf_rate)
@@ -1923,6 +1949,42 @@ static int __qed_configure_vp_wfq_on_link_change(struct 
qed_hwfn *p_hwfn,
return rc;
 }
 
+/* Main API for qed clients to configure vport min rate.
+ * vp_id - vport id in PF Range[0 - (total_num_vports_per_pf - 1)]
+ * rate - Speed in Mbps needs to be assigned to a given vport.
+ */
+int qed_configure_vport_wfq(struct qed_dev *cdev, u16 vp_id, u32 rate)
+{
+   int i, rc = -EINVAL;
+
+   /* Currently not supported; Might change in future */
+   if (cdev->num_hwfns > 1) {
+   DP_NOTICE(cdev,
+ "WFQ configuration is not supported for this device\n");
+   return rc;
+   }
+
+   for_each_hwfn(cdev, i) {
+   struct qed_hwfn *p_hwfn = &cdev->hwfns[i];
+   struct qed_ptt *p_ptt;
+
+   p_ptt = qed_ptt_acquire(p_hwfn);
+   if (!p_ptt)
+   return -EBUSY;
+
+   rc = __qed_configure_vport_wfq(p_hwfn, p_ptt, vp_id, rate);
+
+   if (!rc) {
+   qed_ptt_release(p_hwfn, p_ptt);
+   return rc;
+   }
+
+   qed_ptt_release(p_hwfn, p_ptt);
+   }
+
+   return rc;
+}
+
 /* API to configure WFQ from mcp link change */
 void qed_configure_vp_wfq_on_link_change(struct qed_dev *cdev, u32 min_pf_rate)
 {
@@ -2069,3 +2131,17 @@ int qed_configure_pf_min_bandwidth(struct qed_dev *cdev, 
u8 min_bw)
 
return rc;
 }
+
+void qed_clean_wfq_db(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt)
+{
+   struct qed_mcp_link_state *p_link;
+
+   p_link = &p_hwfn->mcp_info->link_output;
+
+   if (p_link->min_pf_rate)
+   qed_disable_wfq_for_all_vports(p_hwfn, p_ptt,
+  p_link->min_pf_rate);
+
+   memset(p_hwfn->qm_info.wfq_data, 0,
+  sizeof(*p_hwfn->qm_info.wfq_data) * p_hwfn->qm_info.num_vports);
+}
diff --git a/drivers/net/ethernet/qlogic/qed/qed_sriov.c 
b/drivers/net/ethernet/qlogic/qed/qed_sriov.c
index c1b7919..c9a3bb6 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_sriov.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_sriov.c
@@ -2822,6 +2822,46 @@ u16 qed_iov_bulletin_get_forced_vlan(struct qed_hwfn 
*p_hwfn, u16 rel_vf_id)
 

[PATCH net-next v2 14/14] qed*: Tx-switching configuration

2016-05-11 Thread Yuval Mintz
Device should be configured by default to VEB once VFs are active.
This changes the configuration of both PFs' and VFs' vports into enabling
tx-switching once sriov is enabled.

Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/qlogic/qed/qed_dev.c |  3 ++-
 drivers/net/ethernet/qlogic/qed/qed_l2.c  |  4 
 drivers/net/ethernet/qlogic/qed/qed_l2.h  |  1 +
 drivers/net/ethernet/qlogic/qed/qed_main.c|  1 +
 drivers/net/ethernet/qlogic/qed/qed_sp.h  |  3 ++-
 drivers/net/ethernet/qlogic/qed/qed_sp_commands.c |  5 -
 drivers/net/ethernet/qlogic/qed/qed_sriov.c   |  1 +
 drivers/net/ethernet/qlogic/qed/qed_vf.c  | 12 
 drivers/net/ethernet/qlogic/qede/qede_main.c  | 24 ++-
 include/linux/qed/qed_eth_if.h|  2 ++
 include/linux/qed/qed_if.h|  1 +
 11 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_dev.c 
b/drivers/net/ethernet/qlogic/qed/qed_dev.c
index acaa286..6fb6016 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dev.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_dev.c
@@ -688,7 +688,8 @@ static int qed_hw_init_pf(struct qed_hwfn *p_hwfn,
qed_int_igu_enable(p_hwfn, p_ptt, int_mode);
 
/* send function start command */
-   rc = qed_sp_pf_start(p_hwfn, p_tunn, p_hwfn->cdev->mf_mode);
+   rc = qed_sp_pf_start(p_hwfn, p_tunn, p_hwfn->cdev->mf_mode,
+allow_npar_tx_switch);
if (rc)
DP_NOTICE(p_hwfn, "Function start ramrod failed\n");
}
diff --git a/drivers/net/ethernet/qlogic/qed/qed_l2.c 
b/drivers/net/ethernet/qlogic/qed/qed_l2.c
index e0275a7..8fba87dd 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_l2.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_l2.c
@@ -99,6 +99,8 @@ int qed_sp_eth_vport_start(struct qed_hwfn *p_hwfn,
break;
}
 
+   p_ramrod->tx_switching_en = p_params->tx_switching;
+
/* Software Function ID in hwfn (PFs are 0 - 15, VFs are 16 - 135) */
p_ramrod->sw_fid = qed_concrete_to_sw_fid(p_hwfn->cdev,
  p_params->concrete_fid);
@@ -1792,6 +1794,8 @@ static int qed_update_vport(struct qed_dev *cdev,
params->update_vport_active_flg;
sp_params.vport_active_rx_flg = params->vport_active_flg;
sp_params.vport_active_tx_flg = params->vport_active_flg;
+   sp_params.update_tx_switching_flg = params->update_tx_switching_flg;
+   sp_params.tx_switching_flg = params->tx_switching_flg;
sp_params.accept_any_vlan = params->accept_any_vlan;
sp_params.update_accept_any_vlan_flg =
params->update_accept_any_vlan_flg;
diff --git a/drivers/net/ethernet/qlogic/qed/qed_l2.h 
b/drivers/net/ethernet/qlogic/qed/qed_l2.h
index a04fb7f..0021145 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_l2.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_l2.h
@@ -94,6 +94,7 @@ enum qed_tpa_mode {
 struct qed_sp_vport_start_params {
enum qed_tpa_mode tpa_mode;
bool remove_inner_vlan;
+   bool tx_switching;
bool only_untagged;
bool drop_ttl0;
u8 max_buffers_per_cqe;
diff --git a/drivers/net/ethernet/qlogic/qed/qed_main.c 
b/drivers/net/ethernet/qlogic/qed/qed_main.c
index dcb782c..6ffc21d 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_main.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_main.c
@@ -216,6 +216,7 @@ int qed_fill_dev_info(struct qed_dev *cdev,
dev_info->fw_rev = FW_REVISION_VERSION;
dev_info->fw_eng = FW_ENGINEERING_VERSION;
dev_info->mf_mode = cdev->mf_mode;
+   dev_info->tx_switching = true;
} else {
qed_vf_get_fw_version(>hwfns[0], _info->fw_major,
  _info->fw_minor, _info->fw_rev,
diff --git a/drivers/net/ethernet/qlogic/qed/qed_sp.h 
b/drivers/net/ethernet/qlogic/qed/qed_sp.h
index c2999cb..ab5549f 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_sp.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_sp.h
@@ -344,13 +344,14 @@ int qed_sp_init_request(struct qed_hwfn *p_hwfn,
  * @param p_hwfn
  * @param p_tunn
  * @param mode
+ * @param allow_npar_tx_switch
  *
  * @return int
  */
 
 int qed_sp_pf_start(struct qed_hwfn *p_hwfn,
struct qed_tunn_start_params *p_tunn,
-   enum qed_mf_mode mode);
+   enum qed_mf_mode mode, bool allow_npar_tx_switch);
 
 /**
  * @brief qed_sp_pf_stop - PF Function Stop Ramrod
diff --git a/drivers/net/ethernet/qlogic/qed/qed_sp_commands.c 
b/drivers/net/ethernet/qlogic/qed/qed_sp_commands.c
index ed90947..8c555ed 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_sp_commands.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_sp_commands.c
@@ -299,7 +299,7 @@ qed_tunn_set_pf_start_params(struct 

[PATCH net-next v2 06/14] qed: Bulletin and Link

2016-05-11 Thread Yuval Mintz
Up to this point, VF and PF communication always originates from VF.
As a result, VF cannot be notified of any async changes, and specifically
cannot be informed of the current link state.

This introduces the bulletin board, the mechanism through which the PF
is going to communicate async notifications back to the VF. Basically,
it's a well-defined structure agreed by both PF and VF which the VF would
continuously poll and into which the PF would DMA messages when needed.
[Bulletin board is actually allocated and communicated in previous patches
but never before used]

Based on the bulletin infrastructure, the VF can query its link status
and receive said async carrier changes.

Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/qlogic/qed/qed_main.c  |  12 ++-
 drivers/net/ethernet/qlogic/qed/qed_sriov.c | 136 +++-
 drivers/net/ethernet/qlogic/qed/qed_sriov.h |   5 +
 drivers/net/ethernet/qlogic/qed/qed_vf.c| 122 ++
 drivers/net/ethernet/qlogic/qed/qed_vf.h| 156 
 5 files changed, 425 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_main.c 
b/drivers/net/ethernet/qlogic/qed/qed_main.c
index e98610e..dcb782c 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_main.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_main.c
@@ -1119,9 +1119,9 @@ static void qed_fill_link(struct qed_hwfn *hwfn,
	memcpy(&link_caps, qed_mcp_get_link_capabilities(hwfn),
   sizeof(link_caps));
	} else {
-   memset(&params, 0, sizeof(params));
-   memset(&link, 0, sizeof(link));
-   memset(&link_caps, 0, sizeof(link_caps));
+   qed_vf_get_link_params(hwfn, &params);
+   qed_vf_get_link_state(hwfn, &link);
+   qed_vf_get_link_caps(hwfn, &link_caps);
}
 
/* Set the link parameters to pass to protocol driver */
@@ -1224,7 +1224,12 @@ static void qed_fill_link(struct qed_hwfn *hwfn,
 static void qed_get_current_link(struct qed_dev *cdev,
 struct qed_link_output *if_link)
 {
+   int i;
+
	qed_fill_link(&cdev->hwfns[0], if_link);
+
+   for_each_hwfn(cdev, i)
+   qed_inform_vf_link_state(&cdev->hwfns[i]);
 }
 
 void qed_link_update(struct qed_hwfn *hwfn)
@@ -1234,6 +1239,7 @@ void qed_link_update(struct qed_hwfn *hwfn)
struct qed_link_output if_link;
 
	qed_fill_link(hwfn, &if_link);
+   qed_inform_vf_link_state(hwfn);
 
if (IS_LEAD_HWFN(hwfn) && cookie)
op->link_update(cookie, _link);
diff --git a/drivers/net/ethernet/qlogic/qed/qed_sriov.c 
b/drivers/net/ethernet/qlogic/qed/qed_sriov.c
index 82f1eda3..f6540c0 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_sriov.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_sriov.c
@@ -7,6 +7,7 @@
  */
 
 #include 
+#include 
 #include 
 #include "qed_cxt.h"
 #include "qed_hsi.h"
@@ -116,6 +117,41 @@ static struct qed_vf_info *qed_iov_get_vf_info(struct 
qed_hwfn *p_hwfn,
return vf;
 }
 
+int qed_iov_post_vf_bulletin(struct qed_hwfn *p_hwfn,
+int vfid, struct qed_ptt *p_ptt)
+{
+   struct qed_bulletin_content *p_bulletin;
+   int crc_size = sizeof(p_bulletin->crc);
+   struct qed_dmae_params params;
+   struct qed_vf_info *p_vf;
+
+   p_vf = qed_iov_get_vf_info(p_hwfn, (u16) vfid, true);
+   if (!p_vf)
+   return -EINVAL;
+
+   if (!p_vf->vf_bulletin)
+   return -EINVAL;
+
+   p_bulletin = p_vf->bulletin.p_virt;
+
+   /* Increment bulletin board version and compute crc */
+   p_bulletin->version++;
+   p_bulletin->crc = crc32(0, (u8 *)p_bulletin + crc_size,
+   p_vf->bulletin.size - crc_size);
+
+   DP_VERBOSE(p_hwfn, QED_MSG_IOV,
+  "Posting Bulletin 0x%08x to VF[%d] (CRC 0x%08x)\n",
+  p_bulletin->version, p_vf->relative_vf_id, p_bulletin->crc);
+
+   /* propagate bulletin board via dmae to vm memory */
+   memset(&params, 0, sizeof(params));
+   params.flags = QED_DMAE_FLAG_VF_DST;
+   params.dst_vfid = p_vf->abs_vf_id;
+   return qed_dmae_host2host(p_hwfn, p_ptt, p_vf->bulletin.phys,
+ p_vf->vf_bulletin, p_vf->bulletin.size / 4,
+ );
+}
+
 static int qed_iov_pci_cfg_info(struct qed_dev *cdev)
 {
struct qed_hw_sriov_info *iov = cdev->p_iov_info;
@@ -790,6 +826,11 @@ static int qed_iov_release_hw_for_vf(struct qed_hwfn 
*p_hwfn,
return -EINVAL;
}
 
+   if (vf->bulletin.p_virt)
+   memset(vf->bulletin.p_virt, 0, sizeof(*vf->bulletin.p_virt));
+
+   memset(&vf->p_vf_info, 0, sizeof(vf->p_vf_info));
+
if (vf->state != VF_STOPPED) {
/* Stopping the VF */
rc = qed_sp_vf_stop(p_hwfn, vf->concrete_fid, vf->opaque_fid);
@@ -1159,6 +1200,7 @@ static void qed_iov_vf_mbx_acquire(struct 

[PATCH net-next v2 10/14] qed*: Support forced MAC

2016-05-11 Thread Yuval Mintz
Allows the PF to enforce the VF's mac.
i.e., by using `ip link ... vf <num> mac <mac>'.

While a MAC is forced, PF would prevent the VF from configuring any other
MAC.

Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/qlogic/qed/qed_l2.c |   9 ++
 drivers/net/ethernet/qlogic/qed/qed_sriov.c  | 120 +++
 drivers/net/ethernet/qlogic/qed/qed_sriov.h  |   1 +
 drivers/net/ethernet/qlogic/qed/qed_vf.c |  47 +++
 drivers/net/ethernet/qlogic/qed/qed_vf.h |  21 +
 drivers/net/ethernet/qlogic/qede/qede_main.c |  31 +++
 include/linux/qed/qed_eth_if.h   |   3 +
 include/linux/qed/qed_iov_if.h   |   2 +
 8 files changed, 234 insertions(+)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_l2.c 
b/drivers/net/ethernet/qlogic/qed/qed_l2.c
index 7fb6b82..8d83250 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_l2.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_l2.c
@@ -1701,6 +1701,14 @@ static void qed_register_eth_ops(struct qed_dev *cdev,
qed_vf_start_iov_wq(cdev);
 }
 
+static bool qed_check_mac(struct qed_dev *cdev, u8 *mac)
+{
+   if (IS_PF(cdev))
+   return true;
+
+   return qed_vf_check_mac(&cdev->hwfns[0], mac);
+}
+
 static int qed_start_vport(struct qed_dev *cdev,
   struct qed_start_vport_params *params)
 {
@@ -2149,6 +2157,7 @@ static const struct qed_eth_ops qed_eth_ops_pass = {
 #endif
	.fill_dev_info = &qed_fill_eth_dev_info,
	.register_ops = &qed_register_eth_ops,
+   .check_mac = &qed_check_mac,
	.vport_start = &qed_start_vport,
	.vport_stop = &qed_stop_vport,
	.vport_update = &qed_update_vport,
diff --git a/drivers/net/ethernet/qlogic/qed/qed_sriov.c 
b/drivers/net/ethernet/qlogic/qed/qed_sriov.c
index 77d44ba..c1b7919 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_sriov.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_sriov.c
@@ -1295,6 +1295,29 @@ static int qed_iov_configure_vport_forced(struct 
qed_hwfn *p_hwfn,
if (!p_vf->vport_instance)
return -EINVAL;
 
+   if (events & (1 << MAC_ADDR_FORCED)) {
+   /* Since there's no way [currently] of removing the MAC,
+* we can always assume this means we need to force it.
+*/
+   memset(&filter, 0, sizeof(filter));
+   filter.type = QED_FILTER_MAC;
+   filter.opcode = QED_FILTER_REPLACE;
+   filter.is_rx_filter = 1;
+   filter.is_tx_filter = 1;
+   filter.vport_to_add_to = p_vf->vport_id;
+   ether_addr_copy(filter.mac, p_vf->bulletin.p_virt->mac);
+
+   rc = qed_sp_eth_filter_ucast(p_hwfn, p_vf->opaque_fid,
+    &filter, QED_SPQ_MODE_CB, NULL);
+   if (rc) {
+   DP_NOTICE(p_hwfn,
+ "PF failed to configure MAC for VF\n");
+   return rc;
+   }
+
+   p_vf->configured_features |= 1 << MAC_ADDR_FORCED;
+   }
+
if (events & (1 << VLAN_ADDR_FORCED)) {
struct qed_sp_vport_update_params vport_update;
u8 removal;
@@ -2199,6 +,16 @@ static void qed_iov_vf_mbx_ucast_filter(struct qed_hwfn 
*p_hwfn,
goto out;
}
 
+   if ((p_bulletin->valid_bitmap & (1 << MAC_ADDR_FORCED)) &&
+   (params.type == QED_FILTER_MAC ||
+params.type == QED_FILTER_MAC_VLAN)) {
+   if (!ether_addr_equal(p_bulletin->mac, params.mac) ||
+   (params.opcode != QED_FILTER_ADD &&
+params.opcode != QED_FILTER_REPLACE))
+   status = PFVF_STATUS_FORCED;
+   goto out;
+   }
+
rc = qed_iov_chk_ucast(p_hwfn, vf->relative_vf_id, );
if (rc) {
status = PFVF_STATUS_FAILURE;
@@ -2702,6 +2735,30 @@ static int qed_iov_copy_vf_msg(struct qed_hwfn *p_hwfn, 
struct qed_ptt *ptt,
return 0;
 }
 
+static void qed_iov_bulletin_set_forced_mac(struct qed_hwfn *p_hwfn,
+   u8 *mac, int vfid)
+{
+   struct qed_vf_info *vf_info;
+   u64 feature;
+
+   vf_info = qed_iov_get_vf_info(p_hwfn, (u16)vfid, true);
+   if (!vf_info) {
+   DP_NOTICE(p_hwfn->cdev,
+ "Can not set forced MAC, invalid vfid [%d]\n", vfid);
+   return;
+   }
+
+   feature = 1 << MAC_ADDR_FORCED;
+   memcpy(vf_info->bulletin.p_virt->mac, mac, ETH_ALEN);
+
+   vf_info->bulletin.p_virt->valid_bitmap |= feature;
+   /* Forced MAC will disable MAC_ADDR */
+   vf_info->bulletin.p_virt->valid_bitmap &=
+   ~(1 << VFPF_BULLETIN_MAC_ADDR);
+
+   qed_iov_configure_vport_forced(p_hwfn, vf_info, feature);
+}
+
 void qed_iov_bulletin_set_forced_vlan(struct qed_hwfn *p_hwfn,
  u16 pvid, int vfid)
 {
@@ -2736,6 

[PATCH net-next v2 13/14] qed*: support ndo_get_vf_config

2016-05-11 Thread Yuval Mintz
Allows the user to view the VF configuration by observing the PF's
device.

Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/qlogic/qed/qed_sriov.c  | 93 
 drivers/net/ethernet/qlogic/qede/qede_main.c | 12 
 include/linux/qed/qed_iov_if.h   |  3 +
 3 files changed, 108 insertions(+)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_sriov.c 
b/drivers/net/ethernet/qlogic/qed/qed_sriov.c
index 804102c..6af8fd9f 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_sriov.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_sriov.c
@@ -2588,6 +2588,30 @@ void qed_iov_set_link(struct qed_hwfn *p_hwfn,
p_bulletin->capability_speed = p_caps->speed_capabilities;
 }
 
+static void qed_iov_get_link(struct qed_hwfn *p_hwfn,
+u16 vfid,
+struct qed_mcp_link_params *p_params,
+struct qed_mcp_link_state *p_link,
+struct qed_mcp_link_capabilities *p_caps)
+{
+   struct qed_vf_info *p_vf = qed_iov_get_vf_info(p_hwfn,
+  vfid,
+  false);
+   struct qed_bulletin_content *p_bulletin;
+
+   if (!p_vf)
+   return;
+
+   p_bulletin = p_vf->bulletin.p_virt;
+
+   if (p_params)
+   __qed_vf_get_link_params(p_hwfn, p_params, p_bulletin);
+   if (p_link)
+   __qed_vf_get_link_state(p_hwfn, p_link, p_bulletin);
+   if (p_caps)
+   __qed_vf_get_link_caps(p_hwfn, p_caps, p_bulletin);
+}
+
 static void qed_iov_process_mbx_req(struct qed_hwfn *p_hwfn,
struct qed_ptt *p_ptt, int vfid)
 {
@@ -2840,6 +2864,17 @@ bool qed_iov_is_vf_stopped(struct qed_hwfn *p_hwfn, int 
vfid)
return p_vf_info->state == VF_STOPPED;
 }
 
+static bool qed_iov_spoofchk_get(struct qed_hwfn *p_hwfn, int vfid)
+{
+   struct qed_vf_info *vf_info;
+
+   vf_info = qed_iov_get_vf_info(p_hwfn, (u16) vfid, true);
+   if (!vf_info)
+   return false;
+
+   return vf_info->spoof_chk;
+}
+
 int qed_iov_spoofchk_set(struct qed_hwfn *p_hwfn, int vfid, bool val)
 {
struct qed_vf_info *vf;
@@ -2937,6 +2972,23 @@ int qed_iov_configure_min_tx_rate(struct qed_dev *cdev, 
int vfid, u32 rate)
return qed_configure_vport_wfq(cdev, vport_id, rate);
 }
 
+static int qed_iov_get_vf_min_rate(struct qed_hwfn *p_hwfn, int vfid)
+{
+   struct qed_wfq_data *vf_vp_wfq;
+   struct qed_vf_info *vf_info;
+
+   vf_info = qed_iov_get_vf_info(p_hwfn, (u16) vfid, true);
+   if (!vf_info)
+   return 0;
+
+   vf_vp_wfq = &p_hwfn->qm_info.wfq_data[vf_info->vport_id];
+
+   if (vf_vp_wfq->configured)
+   return vf_vp_wfq->min_speed;
+   else
+   return 0;
+}
+
 /**
  * qed_schedule_iov - schedules IOV task for VF and PF
  * @hwfn: hardware function pointer
@@ -3153,6 +3205,46 @@ static int qed_sriov_pf_set_vlan(struct qed_dev *cdev, 
u16 vid, int vfid)
return 0;
 }
 
+static int qed_get_vf_config(struct qed_dev *cdev,
+int vf_id, struct ifla_vf_info *ivi)
+{
+   struct qed_hwfn *hwfn = QED_LEADING_HWFN(cdev);
+   struct qed_public_vf_info *vf_info;
+   struct qed_mcp_link_state link;
+   u32 tx_rate;
+
+   /* Sanitize request */
+   if (IS_VF(cdev))
+   return -EINVAL;
+
+   if (!qed_iov_is_valid_vfid(&cdev->hwfns[0], vf_id, true)) {
+   DP_VERBOSE(cdev, QED_MSG_IOV,
+  "VF index [%d] isn't active\n", vf_id);
+   return -EINVAL;
+   }
+
+   vf_info = qed_iov_get_public_vf_info(hwfn, vf_id, true);
+
+   qed_iov_get_link(hwfn, vf_id, NULL, &link, NULL);
+
+   /* Fill information about VF */
+   ivi->vf = vf_id;
+
+   if (is_valid_ether_addr(vf_info->forced_mac))
+   ether_addr_copy(ivi->mac, vf_info->forced_mac);
+   else
+   ether_addr_copy(ivi->mac, vf_info->mac);
+
+   ivi->vlan = vf_info->forced_vlan;
+   ivi->spoofchk = qed_iov_spoofchk_get(hwfn, vf_id);
+   ivi->linkstate = vf_info->link_state;
+   tx_rate = vf_info->tx_rate;
+   ivi->max_tx_rate = tx_rate ? tx_rate : link.speed;
+   ivi->min_tx_rate = qed_iov_get_vf_min_rate(hwfn, vf_id);
+
+   return 0;
+}
+
 void qed_inform_vf_link_state(struct qed_hwfn *hwfn)
 {
struct qed_mcp_link_capabilities caps;
@@ -3506,6 +3598,7 @@ const struct qed_iov_hv_ops qed_iov_ops_pass = {
.configure = &qed_sriov_configure,
.set_mac = &qed_sriov_pf_set_mac,
.set_vlan = &qed_sriov_pf_set_vlan,
+   .get_config = &qed_get_vf_config,
.set_link_state = &qed_set_vf_link_state,
.set_spoof = &qed_spoof_configure,
.set_rate = &qed_set_vf_rate,
diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c
index 

[PATCH net-next v2 04/14] qed: IOV configure and FLR

2016-05-11 Thread Yuval Mintz
While previous patches have already added the necessary logic to probe
VFs as well as enabling them in the HW, this patch adds the ability to
support VF FLR & SRIOV disable.

It then wraps both flows together into the first IOV callback to be
provided to the protocol driver - `configure'. This would later be used
to enable and disable SRIOV in the adapter.

Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/qlogic/qed/qed_dev.c  |  17 +-
 drivers/net/ethernet/qlogic/qed/qed_dev_api.h  |   4 +-
 drivers/net/ethernet/qlogic/qed/qed_hsi.h  |  11 +-
 drivers/net/ethernet/qlogic/qed/qed_l2.c   |   7 +
 drivers/net/ethernet/qlogic/qed/qed_main.c |   1 +
 drivers/net/ethernet/qlogic/qed/qed_mcp.c  |  72 +++
 drivers/net/ethernet/qlogic/qed/qed_mcp.h  |  12 +
 drivers/net/ethernet/qlogic/qed/qed_reg_addr.h |  10 +
 drivers/net/ethernet/qlogic/qed/qed_sp.h   |   1 +
 drivers/net/ethernet/qlogic/qed/qed_sriov.c| 660 +
 drivers/net/ethernet/qlogic/qed/qed_sriov.h|  33 +-
 drivers/net/ethernet/qlogic/qed/qed_vf.c   |  97 
 drivers/net/ethernet/qlogic/qed/qed_vf.h   |  45 ++
 include/linux/qed/qed_eth_if.h |   4 +
 include/linux/qed/qed_iov_if.h |  20 +
 15 files changed, 983 insertions(+), 11 deletions(-)
 create mode 100644 include/linux/qed/qed_iov_if.h

diff --git a/drivers/net/ethernet/qlogic/qed/qed_dev.c b/drivers/net/ethernet/qlogic/qed/qed_dev.c
index 362e8db..78e25cf 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dev.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_dev.c
@@ -31,6 +31,7 @@
 #include "qed_reg_addr.h"
 #include "qed_sp.h"
 #include "qed_sriov.h"
+#include "qed_vf.h"
 
 /* API common to all protocols */
 enum BAR_ID {
@@ -420,8 +421,7 @@ void qed_resc_setup(struct qed_dev *cdev)
 #define FINAL_CLEANUP_POLL_CNT  (100)
 #define FINAL_CLEANUP_POLL_TIME (10)
 int qed_final_cleanup(struct qed_hwfn *p_hwfn,
- struct qed_ptt *p_ptt,
- u16 id)
+ struct qed_ptt *p_ptt, u16 id, bool is_vf)
 {
u32 command = 0, addr, count = FINAL_CLEANUP_POLL_CNT;
int rc = -EBUSY;
@@ -429,6 +429,9 @@ int qed_final_cleanup(struct qed_hwfn *p_hwfn,
addr = GTT_BAR0_MAP_REG_USDM_RAM +
USTORM_FLR_FINAL_ACK_OFFSET(p_hwfn->rel_pf_id);
 
+   if (is_vf)
+   id += 0x10;
+
command |= X_FINAL_CLEANUP_AGG_INT <<
SDM_AGG_INT_COMP_PARAMS_AGG_INT_INDEX_SHIFT;
command |= 1 << SDM_AGG_INT_COMP_PARAMS_AGG_VECTOR_ENABLE_SHIFT;
@@ -663,7 +666,7 @@ static int qed_hw_init_pf(struct qed_hwfn *p_hwfn,
STORE_RT_REG(p_hwfn, PRS_REG_SEARCH_ROCE_RT_OFFSET, 0);
 
/* Cleanup chip from previous driver if such remains exist */
-   rc = qed_final_cleanup(p_hwfn, p_ptt, rel_pf_id);
+   rc = qed_final_cleanup(p_hwfn, p_ptt, rel_pf_id, false);
if (rc != 0)
return rc;
 
@@ -880,7 +883,7 @@ int qed_hw_stop(struct qed_dev *cdev)
DP_VERBOSE(p_hwfn, NETIF_MSG_IFDOWN, "Stopping hw/fw\n");
 
if (IS_VF(cdev)) {
-   /* To be implemented in a later patch */
+   qed_vf_pf_int_cleanup(p_hwfn);
continue;
}
 
@@ -989,7 +992,9 @@ int qed_hw_reset(struct qed_dev *cdev)
struct qed_hwfn *p_hwfn = &cdev->hwfns[i];
 
if (IS_VF(cdev)) {
-   /* Will be implemented in a later patch */
+   rc = qed_vf_pf_reset(p_hwfn);
+   if (rc)
+   return rc;
continue;
}
 
@@ -1590,7 +1595,7 @@ void qed_hw_remove(struct qed_dev *cdev)
struct qed_hwfn *p_hwfn = &cdev->hwfns[i];
 
if (IS_VF(cdev)) {
-   /* Will be implemented in a later patch */
+   qed_vf_pf_release(p_hwfn);
continue;
}
 
diff --git a/drivers/net/ethernet/qlogic/qed/qed_dev_api.h b/drivers/net/ethernet/qlogic/qed/qed_dev_api.h
index f567371..dde364d 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dev_api.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_dev_api.h
@@ -303,11 +303,11 @@ int qed_fw_rss_eng(struct qed_hwfn *p_hwfn,
  * @param p_hwfn
  * @param p_ptt
  * @param id - For PF, engine-relative. For VF, PF-relative.
+ * @param is_vf - true iff cleanup is made for a VF.
  *
  * @return int
  */
 int qed_final_cleanup(struct qed_hwfn *p_hwfn,
- struct qed_ptt *p_ptt,
- u16 id);
+ struct qed_ptt *p_ptt, u16 id, bool is_vf);
 
 #endif
diff --git a/drivers/net/ethernet/qlogic/qed/qed_hsi.h b/drivers/net/ethernet/qlogic/qed/qed_hsi.h
index c511106..82b7727 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_hsi.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_hsi.h
@@ -30,7 +30,7 
