Re: [PATCH] ixgbe: take online CPU number as MQ max limit when alloc_etherdev_mq()

2016-05-15 Thread ethan zhao

Thanks for your review.

Ethan

On 2016/5/13 20:52, Sergei Shtylyov wrote:

Hello.

On 5/13/2016 8:56 AM, Ethan Zhao wrote:


Allocating 64 Tx/Rx as default doesn't benefit perfomrnace when less


   Performance.


CPUs were assigned. especially when DCB is enabled, so we should take
num_online_cpus() as top limit, and aslo to make sure every TC has


   Also.


at least one queue, take the MAX_TRAFFIC_CLASS as bottom limit of queues
number.

Signed-off-by: Ethan Zhao 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c

index 7df3fe2..1f9769c 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -9105,6 +9105,10 @@ static int ixgbe_probe(struct pci_dev *pdev, 
const struct pci_device_id *ent)

 indices = IXGBE_MAX_RSS_INDICES;
 #endif
 }
+/* Don't allocate too more queues than online cpus number */


   "Too" not needed here. CPUs.


+indices = min_t(int, indices, num_online_cpus());
+/* To make sure TC works, allocate at least 1 queue per TC */
+indices = max_t(int, indices, MAX_TRAFFIC_CLASS);

 netdev = alloc_etherdev_mq(sizeof(struct ixgbe_adapter), indices);
 if (!netdev) {


MBR, Sergei
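For readers following the thread, the net effect of the two clamps in the patch can be sketched in plain C. This is a userspace sketch, not the driver code: `min_t`/`max_t` are open-coded, the helper name is illustrative, and `MAX_TRAFFIC_CLASS` is assumed to be 8 as in the ixgbe headers.

```c
#include <assert.h>

#define MAX_TRAFFIC_CLASS 8 /* bottom limit, per ixgbe (assumed here) */

/* Sketch of the clamping done before alloc_etherdev_mq():
 * cap the queue count at the number of online CPUs, but keep
 * at least one queue per traffic class so DCB still works. */
static int clamp_indices(int indices, int online_cpus)
{
	if (indices > online_cpus)       /* min_t(int, indices, num_online_cpus()) */
		indices = online_cpus;
	if (indices < MAX_TRAFFIC_CLASS) /* max_t(int, indices, MAX_TRAFFIC_CLASS) */
		indices = MAX_TRAFFIC_CLASS;
	return indices;
}
```

Note the ordering: on a 4-CPU box the lower bound wins, so DCB still gets its 8 queues even though that exceeds the CPU count.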






Re: [PATCH net-next] tuntap: introduce tx skb ring

2016-05-15 Thread Michael S. Tsirkin
On Mon, May 16, 2016 at 09:17:01AM +0800, Jason Wang wrote:
> We used to queue tx packets in sk_receive_queue, this is less
> efficient since it requires spinlocks to synchronize between producer
> and consumer.
> 
> This patch tries to address this by using circular buffer which allows
> lockless synchronization. This is done by switching from
> sk_receive_queue to a tx skb ring with a new flag IFF_TX_RING and when
> this is set:

Why do we need a new flag? Is there a userspace-visible
behaviour change?

> 
> - store pointer to skb in circular buffer in tun_net_xmit(), and read
>   it from the circular buffer in tun_do_read().
> - introduce a new proto_ops peek which could be implemented by
>   specific socket which does not use sk_receive_queue.
> - store skb length in circular buffer too, and implement a lockless
>   peek for tuntap.
> - change vhost_net to use proto_ops->peek() instead
> - new spinlocks were introduced to synchronize among producers (and so
>   did for consumers).
> 
> Pktgen test shows about 9% improvement on guest receiving pps:
> 
> Before: ~148pps
> After : ~161pps
> 
> (I'm not sure nonblocking read is still needed, so it was not included
>  in this patch)

How do you mean? Of course we must support blocking and non-blocking
read - userspace uses it.

> Signed-off-by: Jason Wang 
> ---
> ---
>  drivers/net/tun.c   | 157 
> +---
>  drivers/vhost/net.c |  16 -
>  include/linux/net.h |   1 +
>  include/uapi/linux/if_tun.h |   1 +
>  4 files changed, 165 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 425e983..6001ece 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -71,6 +71,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/circ_buf.h>
>  
>  #include 
>  
> @@ -130,6 +131,8 @@ struct tap_filter {
>  #define MAX_TAP_FLOWS  4096
>  
>  #define TUN_FLOW_EXPIRE (3 * HZ)
> +#define TUN_RING_SIZE 256

Can we resize this according to tx_queue_len set by user?

> +#define TUN_RING_MASK (TUN_RING_SIZE - 1)
>  
>  struct tun_pcpu_stats {
>   u64 rx_packets;
> @@ -142,6 +145,11 @@ struct tun_pcpu_stats {
>   u32 rx_frame_errors;
>  };
>  
> +struct tun_desc {
> + struct sk_buff *skb;
> + int len; /* Cached skb len for peeking */
> +};
> +
>  /* A tun_file connects an open character device to a tuntap netdevice. It
>   * also contains all socket related structures (except sock_fprog and 
> tap_filter)
>   * to serve as one transmit queue for tuntap device. The sock_fprog and
> @@ -167,6 +175,13 @@ struct tun_file {
>   };
>   struct list_head next;
>   struct tun_struct *detached;
> + /* reader lock */
> + spinlock_t rlock;
> + unsigned long tail;
> + struct tun_desc tx_descs[TUN_RING_SIZE];
> + /* writer lock */
> + spinlock_t wlock;
> + unsigned long head;
>  };
>  
>  struct tun_flow_entry {
> @@ -515,7 +530,27 @@ static struct tun_struct *tun_enable_queue(struct 
> tun_file *tfile)
>  
>  static void tun_queue_purge(struct tun_file *tfile)
>  {
> + unsigned long head, tail;
> + struct tun_desc *desc;
> + struct sk_buff *skb;
>   skb_queue_purge(&tfile->sk.sk_receive_queue);
> + spin_lock(&tfile->rlock);
> +
> + head = ACCESS_ONCE(tfile->head);
> + tail = tfile->tail;
> +
> + /* read tail before reading descriptor at tail */
> + smp_rmb();

I think you mean read *head* here


> +
> + while (CIRC_CNT(head, tail, TUN_RING_SIZE) >= 1) {
> + desc = &tfile->tx_descs[tail];
> + skb = desc->skb;
> + kfree_skb(skb);
> + tail = (tail + 1) & TUN_RING_MASK;
> + /* read descriptor before incrementing tail. */
> + smp_store_release(&tfile->tail, tail & TUN_RING_MASK);
> + }
> + spin_unlock(&tfile->rlock);
>   skb_queue_purge(&tfile->sk.sk_error_queue);
>  }
>

Barrier pairing seems messed up. Could you tag
each barrier with its pair pls?
E.g. add /* Barrier A for pairing */ Before barrier and
its pair.
  
> @@ -824,6 +859,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, 
> struct net_device *dev)
>   int txq = skb->queue_mapping;
>   struct tun_file *tfile;
>   u32 numqueues = 0;
> + unsigned long flags;
>  
>   rcu_read_lock();
>   tfile = rcu_dereference(tun->tfiles[txq]);
> @@ -888,8 +924,35 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, 
> struct net_device *dev)
>  
>   nf_reset(skb);
>  
> - /* Enqueue packet */
> - skb_queue_tail(&tfile->socket.sk->sk_receive_queue, skb);
> + if (tun->flags & IFF_TX_RING) {
> + unsigned long head, tail;
> +
> + spin_lock_irqsave(&tfile->wlock, flags);
> +
> + head = tfile->head;
> + tail = ACCESS_ONCE(tfile->tail);

this should be acquire

> +
> + if (CIRC_SPACE(head, tail, TUN_RING_SIZE) >= 1) {
> + struct tun_desc *desc = &tfile->tx_descs[head];
> +
> + 

Re: [PATCH net-next] tuntap: introduce tx skb ring

2016-05-15 Thread Eric Dumazet
On Mon, 2016-05-16 at 09:17 +0800, Jason Wang wrote:
> We used to queue tx packets in sk_receive_queue, this is less
> efficient since it requires spinlocks to synchronize between producer
> and consumer.

...

>   struct tun_struct *detached;
> + /* reader lock */
> + spinlock_t rlock;
> + unsigned long tail;
> + struct tun_desc tx_descs[TUN_RING_SIZE];
> + /* writer lock */
> + spinlock_t wlock;
> + unsigned long head;
>  };
>  

Ok, we had these kind of ideas floating around for many other cases,
like qdisc, UDP or af_packet sockets...

I believe we should have a common set of helpers, not hidden in
drivers/net/tun.c but in net/core/skb_ring.c or something, with more
flexibility (like the number of slots)


BTW, why are you using spin_lock_irqsave() in tun_net_xmit() and
tun_peek() ?

BH should be disabled already (in tun_next_xmit()), and we can not
transmit from hard irq.

Thanks.





Re: [PATCH] ixgbe: take online CPU number as MQ max limit when alloc_etherdev_mq()

2016-05-15 Thread ethan zhao

Alexander,

On 2016/5/14 0:46, Alexander Duyck wrote:

On Thu, May 12, 2016 at 10:56 PM, Ethan Zhao  wrote:

Allocating 64 Tx/Rx as default doesn't benefit perfomrnace when less
CPUs were assigned. especially when DCB is enabled, so we should take
num_online_cpus() as top limit, and aslo to make sure every TC has
at least one queue, take the MAX_TRAFFIC_CLASS as bottom limit of queues
number.

Signed-off-by: Ethan Zhao 

What is the harm in allowing the user to specify up to 64 queues if
they want to?  Also what is your opinion based on?  In the case of RSS


 There is no module parameter to specify the queue number in this
 upstream ixgbe driver, so why specify more queues than
 num_online_cpus() via ethtool?

 I couldn't figure out the benefit of doing that.

 But if DCB is turned on after loading, the queues would be 64/64,
 which doesn't make sense if only 16 CPUs are assigned.

traffic the upper limit is only 16 on older NICs, but last I knew the
latest X550 can support more queues for RSS.  Have you only been
testing on older NICs or did you test on the latest hardware as well?
  Would more RSS queues than num_online_cpus() bring better performance?
  Test results show they don't. And even though memory cost is not an
  issue for most of the expensive servers, it is not free for all of them.



If you want to control the number of queues allocated in a given
configuration you should look at the code over in the ixgbe_lib.c, not
  Yes, RSS, RSS with SR-IOV, FCoE, DCB etc. use different queue
  calculation algorithms, but they all take the dev queues allocated in
  alloc_etherdev_mq() as the upper limit.

  If we set 64 as the default here, DCB would say "oh, there are 64
  queues there, I could use them".

ixgbe_main.c.  All you are doing with this patch is denying the user
choice with this change as they then are not allowed to set more

  Yes, it is intended to deny a configuration that brings no benefit.

queues.  Even if they find your decision was wrong for their
configuration.

- Alex


 Thanks,
 Ethan


Re: [patch net-next 07/11] net: hns: dsaf adds support of acpi

2016-05-15 Thread Yankejian (Hackim Yim)


On 2016/5/13 21:12, Andy Shevchenko wrote:
> On Fri, 2016-05-13 at 16:19 +0800, Yisen Zhuang wrote:
>> From: Kejian Yan 
>>
>> Dsaf needs to get configuration parameter by ACPI, so this patch add
>> support of ACPI.
>>
> Looks like at some point better to split driver to core part, and PCI
> and ACPI/DT/platform code.
>
> Too many changes where IS_ENABLED() involved shows as I can imagine bad
> architecture / split of the driver.

Hi Andy,
Actually, we use unified functions as much as possible. The routines for
DT and ACPI may differ: some steps are handled by the BIOS in the ACPI
case but by the OS in the DT case, so we need to distinguish them.
And we will try to reduce the use of IS_ENABLED().

Thanks very much for your suggestions, Andy

Kejian




Re: [patch net-next 06/11] ACPI: bus: move acpi_match_device_ids() to linux/acpi.h

2016-05-15 Thread Yankejian (Hackim Yim)


On 2016/5/13 21:15, Andy Shevchenko wrote:
> On Fri, 2016-05-13 at 16:19 +0800, Yisen Zhuang wrote:
>> From: Hanjun Guo 
>>
>> acpi_match_device_ids() will be used for drivers to match
>> different hardware versions, it will be compiled in non-ACPI
>> case, but acpi_match_device_ids() in acpi_bus.h and it can
>> only be used in ACPI case, so move it to linux/acpi.h and
>> introduce a stub function for it.
> I somehow doubt this is right move.
>
> Like I said in the previous comment the architectural split might make
> this a bit better.
>
> You might use 
>
> #ifdef IS_ENABLED(CONFIG_ACPI)
> #else
> #endif
>
> only once to some big part of code. If kernel is build without ACPI
> support you even will not have this in your driver at all.

Hi Andy,

Thanks for your suggestions. We will add a stub function instead in the next submission.


> -- 
> Andy Shevchenko 
> Intel Finland Oy




[PATCH net-next] tuntap: introduce tx skb ring

2016-05-15 Thread Jason Wang
We used to queue tx packets in sk_receive_queue, this is less
efficient since it requires spinlocks to synchronize between producer
and consumer.

This patch tries to address this by using circular buffer which allows
lockless synchronization. This is done by switching from
sk_receive_queue to a tx skb ring with a new flag IFF_TX_RING and when
this is set:

- store pointer to skb in circular buffer in tun_net_xmit(), and read
  it from the circular buffer in tun_do_read().
- introduce a new proto_ops peek which could be implemented by
  specific socket which does not use sk_receive_queue.
- store skb length in circular buffer too, and implement a lockless
  peek for tuntap.
- change vhost_net to use proto_ops->peek() instead
- new spinlocks were introduced to synchronize among producers (and so
  did for consumers).

Pktgen test shows about 9% improvement on guest receiving pps:

Before: ~148pps
After : ~161pps

(I'm not sure nonblocking read is still needed, so it was not included
 in this patch)
Signed-off-by: Jason Wang 
---
---
 drivers/net/tun.c   | 157 +---
 drivers/vhost/net.c |  16 -
 include/linux/net.h |   1 +
 include/uapi/linux/if_tun.h |   1 +
 4 files changed, 165 insertions(+), 10 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 425e983..6001ece 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -71,6 +71,7 @@
 #include 
 #include 
 #include 
+#include <linux/circ_buf.h>
 
 #include 
 
@@ -130,6 +131,8 @@ struct tap_filter {
 #define MAX_TAP_FLOWS  4096
 
 #define TUN_FLOW_EXPIRE (3 * HZ)
+#define TUN_RING_SIZE 256
+#define TUN_RING_MASK (TUN_RING_SIZE - 1)
 
 struct tun_pcpu_stats {
u64 rx_packets;
@@ -142,6 +145,11 @@ struct tun_pcpu_stats {
u32 rx_frame_errors;
 };
 
+struct tun_desc {
+   struct sk_buff *skb;
+   int len; /* Cached skb len for peeking */
+};
+
 /* A tun_file connects an open character device to a tuntap netdevice. It
  * also contains all socket related structures (except sock_fprog and 
tap_filter)
  * to serve as one transmit queue for tuntap device. The sock_fprog and
@@ -167,6 +175,13 @@ struct tun_file {
};
struct list_head next;
struct tun_struct *detached;
+   /* reader lock */
+   spinlock_t rlock;
+   unsigned long tail;
+   struct tun_desc tx_descs[TUN_RING_SIZE];
+   /* writer lock */
+   spinlock_t wlock;
+   unsigned long head;
 };
 
 struct tun_flow_entry {
@@ -515,7 +530,27 @@ static struct tun_struct *tun_enable_queue(struct tun_file 
*tfile)
 
 static void tun_queue_purge(struct tun_file *tfile)
 {
+   unsigned long head, tail;
+   struct tun_desc *desc;
+   struct sk_buff *skb;
	skb_queue_purge(&tfile->sk.sk_receive_queue);
+	spin_lock(&tfile->rlock);
+
+   head = ACCESS_ONCE(tfile->head);
+   tail = tfile->tail;
+
+   /* read tail before reading descriptor at tail */
+   smp_rmb();
+
+   while (CIRC_CNT(head, tail, TUN_RING_SIZE) >= 1) {
+		desc = &tfile->tx_descs[tail];
+   skb = desc->skb;
+   kfree_skb(skb);
+   tail = (tail + 1) & TUN_RING_MASK;
+   /* read descriptor before incrementing tail. */
+		smp_store_release(&tfile->tail, tail & TUN_RING_MASK);
+   }
+	spin_unlock(&tfile->rlock);
	skb_queue_purge(&tfile->sk.sk_error_queue);
 }
 
@@ -824,6 +859,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct 
net_device *dev)
int txq = skb->queue_mapping;
struct tun_file *tfile;
u32 numqueues = 0;
+   unsigned long flags;
 
rcu_read_lock();
tfile = rcu_dereference(tun->tfiles[txq]);
@@ -888,8 +924,35 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, 
struct net_device *dev)
 
nf_reset(skb);
 
-   /* Enqueue packet */
-	skb_queue_tail(&tfile->socket.sk->sk_receive_queue, skb);
+   if (tun->flags & IFF_TX_RING) {
+   unsigned long head, tail;
+
+		spin_lock_irqsave(&tfile->wlock, flags);
+
+   head = tfile->head;
+   tail = ACCESS_ONCE(tfile->tail);
+
+   if (CIRC_SPACE(head, tail, TUN_RING_SIZE) >= 1) {
+			struct tun_desc *desc = &tfile->tx_descs[head];
+
+   desc->skb = skb;
+   desc->len = skb->len;
+   if (skb_vlan_tag_present(skb))
+   desc->len += VLAN_HLEN;
+
+   /* read descriptor before incrementing head. */
+			smp_store_release(&tfile->head,
+ (head + 1) & TUN_RING_MASK);
+   } else {
+			spin_unlock_irqrestore(&tfile->wlock, flags);
+   goto drop;
+   }
+
+		spin_unlock_irqrestore(&tfile->wlock, flags);
+   } else {
+   /* Enqueue packet */
+   

[PATCH net-next] fq_codel: fix memory limitation drift

2016-05-15 Thread Eric Dumazet
From: Eric Dumazet 

memory_usage must be decreased in dequeue_func(), not in
fq_codel_dequeue(), otherwise packets dropped by Codel algo
are missing this decrease.

Also we need to clear memory_usage in fq_codel_reset()

Fixes: 95b58430abe7 ("fq_codel: add memory limitation per queue")
Signed-off-by: Eric Dumazet 
---
 net/sched/sch_fq_codel.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index bb8bd9314629..6883a8971562 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -262,6 +262,7 @@ static struct sk_buff *dequeue_func(struct codel_vars 
*vars, void *ctx)
if (flow->head) {
skb = dequeue_head(flow);
q->backlogs[flow - q->flows] -= qdisc_pkt_len(skb);
+   q->memory_usage -= skb->truesize;
sch->q.qlen--;
sch->qstats.backlog -= qdisc_pkt_len(skb);
}
@@ -318,7 +319,6 @@ begin:
		list_del_init(&flow->flowchain);
goto begin;
}
-   q->memory_usage -= skb->truesize;
qdisc_bstats_update(sch, skb);
flow->deficit -= qdisc_pkt_len(skb);
/* We cant call qdisc_tree_reduce_backlog() if our qlen is 0,
@@ -355,6 +355,7 @@ static void fq_codel_reset(struct Qdisc *sch)
}
memset(q->backlogs, 0, q->flows_cnt * sizeof(u32));
sch->q.qlen = 0;
+   q->memory_usage = 0;
 }
 
 static const struct nla_policy fq_codel_policy[TCA_FQ_CODEL_MAX + 1] = {

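The invariant this fix restores can be shown with a toy model: memory_usage is charged at enqueue and must be released at the one common point where a packet leaves the backlog (the kernel's dequeue_func()), so packets dropped by the Codel algorithm are accounted exactly like delivered ones. All names below are illustrative, not the kernel's.

```c
static unsigned int memory_usage;

static void toy_enqueue(unsigned int truesize)
{
	memory_usage += truesize;
}

/* Single dequeue point: both delivery and Codel drops go through it,
 * mirroring the decrement being moved into dequeue_func(). */
static unsigned int toy_dequeue_func(unsigned int truesize)
{
	memory_usage -= truesize;
	return truesize;
}

/* Codel-style dequeue: may drop several packets before delivering one.
 * Because the drops also pass through toy_dequeue_func(), the counter
 * cannot drift. */
static unsigned int toy_codel_dequeue(const unsigned int *sizes, int n,
				      int drops)
{
	int i;

	for (i = 0; i < drops && i < n; i++)
		toy_dequeue_func(sizes[i]); /* dropped by the algorithm */
	return (drops < n) ? toy_dequeue_func(sizes[drops]) : 0;
}
```

With the decrement in the caller instead (as before the fix), the two dropped packets would leak 300 bytes from the counter on every such dequeue, which is exactly the drift the patch describes.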



Re: [patch net-next 05/11] net: hns: add uniform interface for phy connection

2016-05-15 Thread Yankejian (Hackim Yim)


On 2016/5/13 21:07, Andy Shevchenko wrote:
> On Fri, 2016-05-13 at 16:19 +0800, Yisen Zhuang wrote:
>> From: Kejian Yan 
>>
>> As device_node is only used by OF case, HNS needs to treat the others
>> cases including ACPI. It needs to use uniform ways to handle both of
>> OF and ACPI. This patch chooses phy_device, and of_phy_connect and
>> of_phy_attach are only used by OF case. It needs to add uniform
>> interface
>> to handle that sequence by both OF and ACPI.
> --- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
>> +++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
>> @@ -987,6 +987,41 @@ static void hns_nic_adjust_link(struct net_device
>> *ndev)
>>  h->dev->ops->adjust_link(h, ndev->phydev->speed, ndev->phydev->duplex);
>>  }
>>  
>> +static
>> +struct phy_device *hns_nic_phy_attach(struct net_device *dev,
>> +  struct phy_device *phy,
>> +  u32 flags,
>> +  phy_interface_t iface)
>> +{
>> +int ret;
>> +
>> +if (!phy)
>> +return NULL;
> No need to use defensive programming here.
>
>> +
>> +ret = phy_attach_direct(dev, phy, flags, iface);
>> +
>> +return ret ? NULL : phy;
> Shouldn't it return an error?
>
>
>> +}
>> +
>> +static
>> +struct phy_device *hns_nic_phy_connect(struct net_device *dev,
>> +   struct phy_device *phy,
>> +   void (*hndlr)(struct
>> net_device *),
>> +   u32 flags,
>> +   phy_interface_t iface)
>> +{
>> +int ret;
>> +
>> +if (!phy)
>> +return NULL;
>> +
>> +phy->dev_flags = flags;
>> +
>> +ret = phy_connect_direct(dev, phy, hndlr, iface);
>> +
>> +return ret ? NULL : phy;
>> +}
>> +
> For now looks that above functions are redundant and you may call them
> directly in below code.

Hi Andy,
Thanks for your suggestions, it will be fixed in the next submission.

MBR,
Kejian

>>  /**
>>   *hns_nic_init_phy - init phy
>>   *@ndev: net device
>> @@ -996,16 +1031,17 @@ static void hns_nic_adjust_link(struct
>> net_device *ndev)
>>  int hns_nic_init_phy(struct net_device *ndev, struct hnae_handle *h)
>>  {
>>  struct hns_nic_priv *priv = netdev_priv(ndev);
>> -struct phy_device *phy_dev = NULL;
>> +struct phy_device *phy_dev = h->phy_dev;
>>  
>> -if (!h->phy_node)
>> +if (!h->phy_dev)
>>  return 0;
>>  
>>  if (h->phy_if != PHY_INTERFACE_MODE_XGMII)
>> -phy_dev = of_phy_connect(ndev, h->phy_node,
>> - hns_nic_adjust_link, 0, h->phy_if);
>> +phy_dev = hns_nic_phy_connect(ndev, phy_dev,
>> +  hns_nic_adjust_link,
>> +  0, h->phy_if);
>>  else
>> -phy_dev = of_phy_attach(ndev, h->phy_node, 0, h->phy_if);
>> +phy_dev = hns_nic_phy_attach(ndev, phy_dev, 0, h->phy_if);
>




Re: [PATCH v5 net-next 02/14] net: define gso types for IPx over IPv4 and IPv6

2016-05-15 Thread Jeff Kirsher
On Sun, 2016-05-15 at 16:42 -0700, Tom Herbert wrote:
> This patch defines two new GSO definitions SKB_GSO_IPXIP4 and
> SKB_GSO_IPXIP6 along with corresponding NETIF_F_GSO_IPXIP4 and
> NETIF_F_GSO_IPXIP6. These are used to described IP in IP
> tunnel and what the outer protocol is. The inner protocol
> can be deduced from other GSO types (e.g. SKB_GSO_TCPV4 and
> SKB_GSO_TCPV6). The GSO types of SKB_GSO_IPIP and SKB_GSO_SIT
> are removed (these are both instances of SKB_GSO_IPXIP4).
> SKB_GSO_IPXIP6 will be used when support for GSO with IP
> encapsulation over IPv6 is added.
> 
> Signed-off-by: Tom Herbert 

Acked-by: Jeff Kirsher 
For the Intel driver changes...

> ---
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c  |  5 ++---
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c |  4 ++--
>  drivers/net/ethernet/intel/i40e/i40e_main.c   |  3 +--
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  3 +--
>  drivers/net/ethernet/intel/i40evf/i40e_txrx.c |  3 +--
>  drivers/net/ethernet/intel/i40evf/i40evf_main.c   |  3 +--
>  drivers/net/ethernet/intel/igb/igb_main.c |  3 +--
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  3 +--
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |  3 +--
>  include/linux/netdev_features.h   | 12 ++--
>  include/linux/netdevice.h |  4 ++--
>  include/linux/skbuff.h    |  4 ++--
>  net/core/ethtool.c    |  4 ++--
>  net/ipv4/af_inet.c    |  2 +-
>  net/ipv4/ipip.c   |  2 +-
>  net/ipv6/ip6_offload.c    |  4 ++--
>  net/ipv6/sit.c    |  4 ++--
>  net/netfilter/ipvs/ip_vs_xmit.c   | 17 +++--
>  18 files changed, 36 insertions(+), 47 deletions(-)




linux-next: manual merge of the wireless-drivers-next tree with the net-next tree

2016-05-15 Thread Stephen Rothwell
Hi Kalle,

Today's linux-next merge of the wireless-drivers-next tree got a
conflict in:

  drivers/net/wireless/intel/iwlwifi/mvm/tx.c

between commit:

  909b27f70643 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")

from the net-next tree and commit:

  a525d0eab17d ("Merge tag 'iwlwifi-for-kalle-2016-05-04' of 
git://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/iwlwifi-fixes")

from the wireless-drivers-next tree.

I fixed it up (I think that the net-next tree merge lost the changes
to iwl_mvm_set_tx_cmd() associated with commit 5c08b0f5026f ("iwlwifi:
mvm: don't override the rate with the AMSDU len")) and can carry the
fix as necessary. This is now fixed as far as linux-next is concerned,
but any non trivial conflicts should be mentioned to your upstream
maintainer when your tree is submitted for merging.  You may also want
to consider cooperating with the maintainer of the conflicting tree to
minimise any particularly complex conflicts.

-- 
Cheers,
Stephen Rothwell




[PATCH v5 net-next 04/14] ipv6: Change "final" protocol processing for encapsulation

2016-05-15 Thread Tom Herbert
When performing foo-over-UDP, UDP packets are processed by the
encapsulation handler which returns another protocol to process.
This may result in processing two (or more) protocols in the
loop that are marked as INET6_PROTO_FINAL. The actions taken
for hitting a final protocol, in particular the skb_postpull_rcsum,
can only be performed once.

This patch adds a check of whether a final protocol has been seen. The
rules are:
  - If the final protocol has not been seen any protocol is processed
(final and non-final). In the case of a final protocol, the final
actions are taken (like the skb_postpull_rcsum)
  - If a final protocol has been seen (e.g. an encapsulating UDP
header) then no further non-final protocols are allowed
(e.g. extension headers). For more final protocols the
final actions are not taken (e.g. skb_postpull_rcsum).
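The two rules above amount to a small state machine over a "have seen a final protocol" flag. The sketch below models just that decision logic (names and the enum are illustrative; this is not the kernel code, which acts on the skb directly).

```c
enum disposition {
	DELIVER_WITH_FINAL_ACTIONS, /* first final protocol seen */
	DELIVER,                    /* non-final, or a final after a final */
	DISCARD                     /* non-final after a final: not allowed */
};

/* Decide how to handle the next protocol in the ip6_input loop,
 * per the rules in the commit message. */
static enum disposition classify(int *have_final, int proto_is_final)
{
	if (*have_final) {
		if (!proto_is_final)
			return DISCARD;
		return DELIVER; /* final actions were already taken once */
	}
	if (proto_is_final) {
		*have_final = 1;
		return DELIVER_WITH_FINAL_ACTIONS;
	}
	return DELIVER;
}
```

For foo-over-UDP this gives the intended sequence: extension headers process normally, the encapsulating UDP header takes the final actions (skb_postpull_rcsum), the inner final protocol is still delivered without repeating them, and a stray extension header after the encapsulation is discarded.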

Signed-off-by: Tom Herbert 
---
 net/ipv6/ip6_input.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index d35dff2..94611e4 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -223,6 +223,7 @@ static int ip6_input_finish(struct net *net, struct sock 
*sk, struct sk_buff *sk
unsigned int nhoff;
int nexthdr;
bool raw;
+   bool have_final = false;
 
/*
 *  Parse extension headers
@@ -242,9 +243,21 @@ resubmit_final:
if (ipprot) {
int ret;
 
-   if (ipprot->flags & INET6_PROTO_FINAL) {
+   if (have_final) {
+   if (!(ipprot->flags & INET6_PROTO_FINAL)) {
+   /* Once we've seen a final protocol don't
+* allow encapsulation on any non-final
+* ones. This allows foo in UDP encapsulation
+* to work.
+*/
+   goto discard;
+   }
+   } else if (ipprot->flags & INET6_PROTO_FINAL) {
const struct ipv6hdr *hdr;
 
+   /* Only do this once for first final protocol */
+   have_final = true;
+
/* Free reference early: we don't need it any more,
   and it may hold ip_conntrack module loaded
   indefinitely. */
-- 
2.8.0.rc2



[PATCH v5 net-next 09/14] ip6_tun: Add infrastructure for doing encapsulation

2016-05-15 Thread Tom Herbert
Add encap_hlen and ip_tunnel_encap structure to ip6_tnl. Add functions
for getting encap hlen, setting up encap on a tunnel, performing
encapsulation operation.

Signed-off-by: Tom Herbert 
---
 include/net/ip6_tunnel.h  | 58 ++
 net/ipv4/ip_tunnel_core.c |  5 +++
 net/ipv6/ip6_tunnel.c | 89 +--
 3 files changed, 141 insertions(+), 11 deletions(-)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index fb9e015..d325c81 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -52,10 +52,68 @@ struct ip6_tnl {
__u32 o_seqno;  /* The last output seqno */
int hlen;   /* tun_hlen + encap_hlen */
int tun_hlen;   /* Precalculated header length */
+   int encap_hlen; /* Encap header length (FOU,GUE) */
+   struct ip_tunnel_encap encap;
int mlink;
+};
 
+struct ip6_tnl_encap_ops {
+   size_t (*encap_hlen)(struct ip_tunnel_encap *e);
+   int (*build_header)(struct sk_buff *skb, struct ip_tunnel_encap *e,
+   u8 *protocol, struct flowi6 *fl6);
 };
 
+extern const struct ip6_tnl_encap_ops __rcu *
+   ip6tun_encaps[MAX_IPTUN_ENCAP_OPS];
+
+int ip6_tnl_encap_add_ops(const struct ip6_tnl_encap_ops *ops,
+ unsigned int num);
+int ip6_tnl_encap_del_ops(const struct ip6_tnl_encap_ops *ops,
+ unsigned int num);
+int ip6_tnl_encap_setup(struct ip6_tnl *t,
+   struct ip_tunnel_encap *ipencap);
+
+static inline int ip6_encap_hlen(struct ip_tunnel_encap *e)
+{
+   const struct ip6_tnl_encap_ops *ops;
+   int hlen = -EINVAL;
+
+   if (e->type == TUNNEL_ENCAP_NONE)
+   return 0;
+
+   if (e->type >= MAX_IPTUN_ENCAP_OPS)
+   return -EINVAL;
+
+   rcu_read_lock();
+   ops = rcu_dereference(ip6tun_encaps[e->type]);
+   if (likely(ops && ops->encap_hlen))
+   hlen = ops->encap_hlen(e);
+   rcu_read_unlock();
+
+   return hlen;
+}
+
+static inline int ip6_tnl_encap(struct sk_buff *skb, struct ip6_tnl *t,
+   u8 *protocol, struct flowi6 *fl6)
+{
+   const struct ip6_tnl_encap_ops *ops;
+   int ret = -EINVAL;
+
+   if (t->encap.type == TUNNEL_ENCAP_NONE)
+   return 0;
+
+   if (t->encap.type >= MAX_IPTUN_ENCAP_OPS)
+   return -EINVAL;
+
+   rcu_read_lock();
+   ops = rcu_dereference(ip6tun_encaps[t->encap.type]);
+   if (likely(ops && ops->build_header))
+		ret = ops->build_header(skb, &t->encap, protocol, fl6);
+   rcu_read_unlock();
+
+   return ret;
+}
+
 /* Tunnel encapsulation limit destination sub-option */
 
 struct ipv6_tlv_tnl_enc_lim {
diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index cc66a20..afd6b59 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include <net/ip6_tunnel.h>
 #include 
 #include 
 #include 
@@ -51,6 +52,10 @@ const struct ip_tunnel_encap_ops __rcu *
iptun_encaps[MAX_IPTUN_ENCAP_OPS] __read_mostly;
 EXPORT_SYMBOL(iptun_encaps);
 
+const struct ip6_tnl_encap_ops __rcu *
+   ip6tun_encaps[MAX_IPTUN_ENCAP_OPS] __read_mostly;
+EXPORT_SYMBOL(ip6tun_encaps);
+
 void iptunnel_xmit(struct sock *sk, struct rtable *rt, struct sk_buff *skb,
   __be32 src, __be32 dst, __u8 proto,
   __u8 tos, __u8 ttl, __be16 df, bool xnet)
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index e79330f..9f0ea85 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1010,7 +1010,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
*dev, __u8 dsfield,
struct dst_entry *dst = NULL, *ndst = NULL;
struct net_device *tdev;
int mtu;
-   unsigned int max_headroom = sizeof(struct ipv6hdr);
+   unsigned int max_headroom = sizeof(struct ipv6hdr) + t->hlen;
int err = -1;
 
/* NBMA tunnel */
@@ -1063,7 +1063,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
*dev, __u8 dsfield,
 t->parms.name);
goto tx_err_dst_release;
}
-   mtu = dst_mtu(dst) - sizeof(*ipv6h);
+   mtu = dst_mtu(dst) - sizeof(*ipv6h) - t->hlen;
if (encap_limit >= 0) {
max_headroom += 8;
mtu -= 8;
@@ -1125,10 +1125,14 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
*dev, __u8 dsfield,
}
 
max_headroom = LL_RESERVED_SPACE(dst->dev) + sizeof(struct ipv6hdr)
-   + dst->header_len;
+   + dst->header_len + t->hlen;
if (max_headroom > dev->needed_headroom)
dev->needed_headroom = max_headroom;
 
+	err = ip6_tnl_encap(skb, t, &proto, fl6);
+   if (err)
+   return err;
+
skb_push(skb, sizeof(struct 

[PATCH v5 net-next 03/14] ipv6: Fix nexthdr for reinjection

2016-05-15 Thread Tom Herbert
In ip6_input_finish the nexthdr protocol is retrieved from the
next header offset that is returned in the cb of the skb.
This method does not work for UDP encapsulation that may not
even have a concept of a nexthdr field (e.g. FOU).

This patch checks for a final protocol (INET6_PROTO_FINAL) when a
protocol handler returns > 1. If the protocol is not final then
resubmission is performed on nhoff value. If the protocol is final
then the nexthdr is taken to be the return value.

Signed-off-by: Tom Herbert 
---
 net/ipv6/ip6_input.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index f185cbc..d35dff2 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -236,6 +236,7 @@ resubmit:
nhoff = IP6CB(skb)->nhoff;
nexthdr = skb_network_header(skb)[nhoff];
 
+resubmit_final:
raw = raw6_local_deliver(skb, nexthdr);
ipprot = rcu_dereference(inet6_protos[nexthdr]);
if (ipprot) {
@@ -263,10 +264,21 @@ resubmit:
goto discard;
 
ret = ipprot->handler(skb);
-   if (ret > 0)
-   goto resubmit;
-   else if (ret == 0)
+   if (ret > 0) {
+   if (ipprot->flags & INET6_PROTO_FINAL) {
+   /* Not an extension header, most likely UDP
+* encapsulation. Use return value as nexthdr
+* protocol not nhoff (which presumably is
+* not set by handler).
+*/
+   nexthdr = ret;
+   goto resubmit_final;
+   } else {
+   goto resubmit;
+   }
+   } else if (ret == 0) {
__IP6_INC_STATS(net, idev, IPSTATS_MIB_INDELIVERS);
+   }
} else {
if (!raw) {
if (xfrm6_policy_check(NULL, XFRM_POLICY_IN, skb)) {
-- 
2.8.0.rc2



[PATCH v5 net-next 08/14] fou: Support IPv6 in fou

2016-05-15 Thread Tom Herbert
This patch adds receive path support for IPv6 with fou.

- Add address family to fou structure for open sockets. This supports
  AF_INET and AF_INET6. Lookups for fou ports are performed on both the
  port number and family.
- In fou and gue receive adjust tot_len in IPv4 header or payload_len
  based on address family.
- Allow AF_INET6 in FOU_ATTR_AF netlink attribute.
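The per-family length fixup in the second bullet can be sketched abstractly: IPv4's tot_len and IPv6's payload_len cover different spans (the IPv6 fixed header is excluded from payload_len), but stripping `len` bytes of UDP/FOU headers subtracts `len` from whichever field the family uses. The struct below is a toy stand-in, not the kernel's iphdr/ipv6hdr.

```c
enum toy_family { TOY_AF_INET, TOY_AF_INET6 };

struct toy_hdr {
	enum toy_family family;
	unsigned int ip_len; /* tot_len (IPv4) or payload_len (IPv6) */
};

/* Mirror of fou_recv_pull()'s length adjustment: the subtraction is
 * the same either way; only the field written differs per family in
 * the real code. */
static void toy_recv_pull(struct toy_hdr *h, unsigned int len)
{
	h->ip_len -= len;
}
```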

Signed-off-by: Tom Herbert 
---
 net/ipv4/fou.c | 47 +++
 1 file changed, 35 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index f4f2ddd..5f9207c 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -21,6 +21,7 @@ struct fou {
u8 protocol;
u8 flags;
__be16 port;
+   u8 family;
u16 type;
struct list_head list;
struct rcu_head rcu;
@@ -47,14 +48,17 @@ static inline struct fou *fou_from_sock(struct sock *sk)
return sk->sk_user_data;
 }
 
-static int fou_recv_pull(struct sk_buff *skb, size_t len)
+static int fou_recv_pull(struct sk_buff *skb, struct fou *fou, size_t len)
 {
-   struct iphdr *iph = ip_hdr(skb);
-
/* Remove 'len' bytes from the packet (UDP header and
 * FOU header if present).
 */
-   iph->tot_len = htons(ntohs(iph->tot_len) - len);
+   if (fou->family == AF_INET)
+   ip_hdr(skb)->tot_len = htons(ntohs(ip_hdr(skb)->tot_len) - len);
+   else
+   ipv6_hdr(skb)->payload_len =
+   htons(ntohs(ipv6_hdr(skb)->payload_len) - len);
+
__skb_pull(skb, len);
skb_postpull_rcsum(skb, udp_hdr(skb), len);
skb_reset_transport_header(skb);
@@ -68,7 +72,7 @@ static int fou_udp_recv(struct sock *sk, struct sk_buff *skb)
if (!fou)
return 1;
 
-   if (fou_recv_pull(skb, sizeof(struct udphdr)))
+   if (fou_recv_pull(skb, fou, sizeof(struct udphdr)))
goto drop;
 
return -fou->protocol;
@@ -141,7 +145,11 @@ static int gue_udp_recv(struct sock *sk, struct sk_buff *skb)
 
hdrlen = sizeof(struct guehdr) + optlen;
 
-   ip_hdr(skb)->tot_len = htons(ntohs(ip_hdr(skb)->tot_len) - len);
+   if (fou->family == AF_INET)
+   ip_hdr(skb)->tot_len = htons(ntohs(ip_hdr(skb)->tot_len) - len);
+   else
+   ipv6_hdr(skb)->payload_len =
+   htons(ntohs(ipv6_hdr(skb)->payload_len) - len);
 
/* Pull csum through the guehdr now . This can be used if
 * there is a remote checksum offload.
@@ -426,7 +434,8 @@ static int fou_add_to_port_list(struct net *net, struct fou *fou)
 
mutex_lock(>fou_lock);
list_for_each_entry(fout, >fou_list, list) {
-   if (fou->port == fout->port) {
+   if (fou->port == fout->port &&
+   fou->family == fout->family) {
mutex_unlock(>fou_lock);
return -EALREADY;
}
@@ -471,8 +480,9 @@ static int fou_create(struct net *net, struct fou_cfg *cfg,
 
sk = sock->sk;
 
-   fou->flags = cfg->flags;
fou->port = cfg->udp_config.local_udp_port;
+   fou->family = cfg->udp_config.family;
+   fou->flags = cfg->flags;
fou->type = cfg->type;
fou->sock = sock;
 
@@ -524,12 +534,13 @@ static int fou_destroy(struct net *net, struct fou_cfg *cfg)
 {
struct fou_net *fn = net_generic(net, fou_net_id);
__be16 port = cfg->udp_config.local_udp_port;
+   u8 family = cfg->udp_config.family;
int err = -EINVAL;
struct fou *fou;
 
mutex_lock(>fou_lock);
list_for_each_entry(fou, >fou_list, list) {
-   if (fou->port == port) {
+   if (fou->port == port && fou->family == family) {
fou_release(fou);
err = 0;
break;
@@ -567,8 +578,15 @@ static int parse_nl_config(struct genl_info *info,
if (info->attrs[FOU_ATTR_AF]) {
u8 family = nla_get_u8(info->attrs[FOU_ATTR_AF]);
 
-   if (family != AF_INET)
-   return -EINVAL;
+   switch (family) {
+   case AF_INET:
+   break;
+   case AF_INET6:
+   cfg->udp_config.ipv6_v6only = 1;
+   break;
+   default:
+   return -EAFNOSUPPORT;
+   }
 
cfg->udp_config.family = family;
}
@@ -659,6 +677,7 @@ static int fou_nl_cmd_get_port(struct sk_buff *skb, struct genl_info *info)
struct fou_cfg cfg;
struct fou *fout;
__be16 port;
+   u8 family;
int ret;
 
ret = parse_nl_config(info, );
@@ -668,6 +687,10 @@ static int fou_nl_cmd_get_port(struct sk_buff *skb, struct genl_info *info)
if (port == 0)
return -EINVAL;
 
+   family = cfg.udp_config.family;
+   if (family != 

[PATCH v5 net-next 02/14] net: define gso types for IPx over IPv4 and IPv6

2016-05-15 Thread Tom Herbert
This patch defines two new GSO definitions SKB_GSO_IPXIP4 and
SKB_GSO_IPXIP6 along with corresponding NETIF_F_GSO_IPXIP4 and
NETIF_F_GSO_IPXIP6. These are used to describe an IP-in-IP
tunnel and what the outer protocol is. The inner protocol
can be deduced from other GSO types (e.g. SKB_GSO_TCPV4 and
SKB_GSO_TCPV6). The GSO types of SKB_GSO_IPIP and SKB_GSO_SIT
are removed (these are both instances of SKB_GSO_IPXIP4).
SKB_GSO_IPXIP6 will be used when support for GSO with IP
encapsulation over IPv6 is added.
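The consolidation can be pictured as a mapping from the old per-tunnel-type flags to the new outer-protocol flags; the flag values below are illustrative, not the kernel's actual bit assignments:

```c
#include <stdbool.h>

/* Illustrative GSO flag bits (not the kernel's real values). */
enum {
	GSO_TCPV4  = 1 << 0,
	GSO_TCPV6  = 1 << 1,
	GSO_IPXIP4 = 1 << 2,  /* any IP-in-IP with an IPv4 outer header */
	GSO_IPXIP6 = 1 << 3,  /* any IP-in-IP with an IPv6 outer header */
};

/* Old SKB_GSO_IPIP (IPv4 in IPv4) and SKB_GSO_SIT (IPv6 in IPv4) both
 * become GSO_IPXIP4: the outer protocol is IPv4 in either case, and the
 * inner protocol stays recoverable from the remaining GSO bits. */
static unsigned int legacy_to_ipxip(bool was_sit)
{
	(void)was_sit;		/* outer header is IPv4 either way */
	return GSO_IPXIP4;
}

/* Deduce the inner family from the other GSO bits (e.g. TCPV4/TCPV6). */
static int inner_family(unsigned int gso_type)
{
	return (gso_type & GSO_TCPV6) ? 6 : 4;
}
```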

Signed-off-by: Tom Herbert 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c  |  5 ++---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |  4 ++--
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  3 +--
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  3 +--
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |  3 +--
 drivers/net/ethernet/intel/i40evf/i40evf_main.c   |  3 +--
 drivers/net/ethernet/intel/igb/igb_main.c |  3 +--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  3 +--
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |  3 +--
 include/linux/netdev_features.h   | 12 ++--
 include/linux/netdevice.h |  4 ++--
 include/linux/skbuff.h|  4 ++--
 net/core/ethtool.c|  4 ++--
 net/ipv4/af_inet.c|  2 +-
 net/ipv4/ipip.c   |  2 +-
 net/ipv6/ip6_offload.c|  4 ++--
 net/ipv6/sit.c|  4 ++--
 net/netfilter/ipvs/ip_vs_xmit.c   | 17 +++--
 18 files changed, 36 insertions(+), 47 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index d465bd7..0a5b770 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -13259,12 +13259,11 @@ static int bnx2x_init_dev(struct bnx2x *bp, struct pci_dev *pdev,
NETIF_F_RXHASH | NETIF_F_HW_VLAN_CTAG_TX;
if (!chip_is_e1x) {
dev->hw_features |= NETIF_F_GSO_GRE | NETIF_F_GSO_UDP_TUNNEL |
-   NETIF_F_GSO_IPIP | NETIF_F_GSO_SIT;
+   NETIF_F_GSO_IPXIP4;
dev->hw_enc_features =
NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM | NETIF_F_SG |
NETIF_F_TSO | NETIF_F_TSO_ECN | NETIF_F_TSO6 |
-   NETIF_F_GSO_IPIP |
-   NETIF_F_GSO_SIT |
+   NETIF_F_GSO_IPXIP4 |
NETIF_F_GSO_GRE | NETIF_F_GSO_UDP_TUNNEL;
}
 
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 5a0dca3..bfc1e94 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -6311,7 +6311,7 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
dev->hw_features = NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM | NETIF_F_SG |
   NETIF_F_TSO | NETIF_F_TSO6 |
   NETIF_F_GSO_UDP_TUNNEL | NETIF_F_GSO_GRE |
-  NETIF_F_GSO_IPIP | NETIF_F_GSO_SIT |
+  NETIF_F_GSO_IPXIP4 |
   NETIF_F_GSO_UDP_TUNNEL_CSUM | NETIF_F_GSO_GRE_CSUM |
   NETIF_F_GSO_PARTIAL | NETIF_F_RXHASH |
   NETIF_F_RXCSUM | NETIF_F_LRO | NETIF_F_GRO;
@@ -6321,7 +6321,7 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
NETIF_F_TSO | NETIF_F_TSO6 |
NETIF_F_GSO_UDP_TUNNEL | NETIF_F_GSO_GRE |
NETIF_F_GSO_UDP_TUNNEL_CSUM | NETIF_F_GSO_GRE_CSUM |
-   NETIF_F_GSO_IPIP | NETIF_F_GSO_SIT |
+   NETIF_F_GSO_IPXIP4;
NETIF_F_GSO_PARTIAL;
dev->gso_partial_features = NETIF_F_GSO_UDP_TUNNEL_CSUM |
NETIF_F_GSO_GRE_CSUM;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 1cd0ebf..242a1ff 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -9083,8 +9083,7 @@ static int i40e_config_netdev(struct i40e_vsi *vsi)
   NETIF_F_TSO6 |
   NETIF_F_GSO_GRE  |
   NETIF_F_GSO_GRE_CSUM |
-  NETIF_F_GSO_IPIP |
-  NETIF_F_GSO_SIT  |
+  NETIF_F_GSO_IPXIP4   |
   NETIF_F_GSO_UDP_TUNNEL   |
   

[PATCH v5 net-next 14/14] ip4ip6: Support for GSO/GRO

2016-05-15 Thread Tom Herbert
Signed-off-by: Tom Herbert 
---
 include/net/inet_common.h |  5 +
 net/ipv4/af_inet.c| 12 +++-
 net/ipv6/ip6_offload.c| 33 -
 net/ipv6/ip6_tunnel.c |  3 +++
 4 files changed, 47 insertions(+), 6 deletions(-)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 109e3ee..5d68342 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -39,6 +39,11 @@ int inet_ctl_sock_create(struct sock **sk, unsigned short 
family,
 int inet_recv_error(struct sock *sk, struct msghdr *msg, int len,
int *addr_len);
 
+struct sk_buff **inet_gro_receive(struct sk_buff **head, struct sk_buff *skb);
+int inet_gro_complete(struct sk_buff *skb, int nhoff);
+struct sk_buff *inet_gso_segment(struct sk_buff *skb,
+netdev_features_t features);
+
 static inline void inet_ctl_sock_destroy(struct sock *sk)
 {
if (sk)
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 25040b1..377424e 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1192,8 +1192,8 @@ int inet_sk_rebuild_header(struct sock *sk)
 }
 EXPORT_SYMBOL(inet_sk_rebuild_header);
 
-static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
-   netdev_features_t features)
+struct sk_buff *inet_gso_segment(struct sk_buff *skb,
+netdev_features_t features)
 {
bool udpfrag = false, fixedid = false, encap;
struct sk_buff *segs = ERR_PTR(-EINVAL);
@@ -1280,9 +1280,9 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 out:
return segs;
 }
+EXPORT_SYMBOL(inet_gso_segment);
 
-static struct sk_buff **inet_gro_receive(struct sk_buff **head,
-struct sk_buff *skb)
+struct sk_buff **inet_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 {
const struct net_offload *ops;
struct sk_buff **pp = NULL;
@@ -1398,6 +1398,7 @@ out:
 
return pp;
 }
+EXPORT_SYMBOL(inet_gro_receive);
 
 static struct sk_buff **ipip_gro_receive(struct sk_buff **head,
 struct sk_buff *skb)
@@ -1449,7 +1450,7 @@ int inet_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
return -EINVAL;
 }
 
-static int inet_gro_complete(struct sk_buff *skb, int nhoff)
+int inet_gro_complete(struct sk_buff *skb, int nhoff)
 {
__be16 newlen = htons(skb->len - nhoff);
struct iphdr *iph = (struct iphdr *)(skb->data + nhoff);
@@ -1479,6 +1480,7 @@ out_unlock:
 
return err;
 }
+EXPORT_SYMBOL(inet_gro_complete);
 
 static int ipip_gro_complete(struct sk_buff *skb, int nhoff)
 {
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 332d6a0..22e90e5 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -16,6 +16,7 @@
 
 #include 
 #include 
+#include 
 
 #include "ip6_offload.h"
 
@@ -268,6 +269,21 @@ static struct sk_buff **sit_ip6ip6_gro_receive(struct sk_buff **head,
return ipv6_gro_receive(head, skb);
 }
 
+static struct sk_buff **ip4ip6_gro_receive(struct sk_buff **head,
+  struct sk_buff *skb)
+{
+   /* GRO receive for ip4ip6 (IPv4 in IPv6) */
+
+   if (NAPI_GRO_CB(skb)->encap_mark) {
+   NAPI_GRO_CB(skb)->flush = 1;
+   return NULL;
+   }
+
+   NAPI_GRO_CB(skb)->encap_mark = 1;
+
+   return inet_gro_receive(head, skb);
+}
+
 static int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
 {
const struct net_offload *ops;
@@ -307,6 +323,13 @@ static int ip6ip6_gro_complete(struct sk_buff *skb, int nhoff)
return ipv6_gro_complete(skb, nhoff);
 }
 
+static int ip4ip6_gro_complete(struct sk_buff *skb, int nhoff)
+{
+   skb->encapsulation = 1;
+   skb_shinfo(skb)->gso_type |= SKB_GSO_IPXIP6;
+   return inet_gro_complete(skb, nhoff);
+}
+
 static struct packet_offload ipv6_packet_offload __read_mostly = {
.type = cpu_to_be16(ETH_P_IPV6),
.callbacks = {
@@ -324,6 +347,14 @@ static const struct net_offload sit_offload = {
},
 };
 
+static const struct net_offload ip4ip6_offload = {
+   .callbacks = {
+   .gso_segment= inet_gso_segment,
+   .gro_receive= ip4ip6_gro_receive,
+   .gro_complete   = ip4ip6_gro_complete,
+   },
+};
+
 static const struct net_offload ip6ip6_offload = {
.callbacks = {
.gso_segment= ipv6_gso_segment,
@@ -331,7 +362,6 @@ static const struct net_offload ip6ip6_offload = {
.gro_complete   = ip6ip6_gro_complete,
},
 };
-
 static int __init ipv6_offload_init(void)
 {
 
@@ -344,6 +374,7 @@ static int __init ipv6_offload_init(void)
 
inet_add_offload(_offload, IPPROTO_IPV6);
inet6_add_offload(_offload, IPPROTO_IPV6);
+   inet6_add_offload(_offload, IPPROTO_IPIP);
 
return 

[PATCH v5 net-next 11/14] ip6_gre: Add support for fou/gue encapsulation

2016-05-15 Thread Tom Herbert
Add netlink and setup for encapsulation
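The netlink plumbing follows the same shape as the existing IPv4 GRE code: collect the four optional IFLA_GRE_ENCAP_* attributes into an ip_tunnel_encap and report whether any were present. A standalone sketch of that "parse optional fields, report presence" pattern (toy types below, not the kernel's nlattr API):

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a netlink attribute: a NULL pointer means absent. */
struct attr { uint16_t value; };

struct encap_cfg { uint16_t type, flags, sport, dport; };

/* Returns true iff at least one optional attribute was supplied,
 * mirroring ip6gre_netlink_encap_parms(). */
static bool parse_encap(const struct attr *type, const struct attr *flags,
			const struct attr *sport, const struct attr *dport,
			struct encap_cfg *cfg)
{
	bool present = false;

	*cfg = (struct encap_cfg){0};	/* memset(ipencap, 0, ...) */
	if (type)  { present = true; cfg->type  = type->value;  }
	if (flags) { present = true; cfg->flags = flags->value; }
	if (sport) { present = true; cfg->sport = sport->value; }
	if (dport) { present = true; cfg->dport = dport->value; }
	return present;
}
```

The boolean return lets newlink/changelink skip ip6_tnl_encap_setup() entirely when no encapsulation attributes were given.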

Signed-off-by: Tom Herbert 
---
 net/ipv6/ip6_gre.c | 77 +++---
 1 file changed, 74 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 4541fa5..f040bcf 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -1022,9 +1022,7 @@ static int ip6gre_tunnel_init_common(struct net_device *dev)
}
 
tunnel->tun_hlen = gre_calc_hlen(tunnel->parms.o_flags);
-
-   tunnel->hlen = tunnel->tun_hlen;
-
+   tunnel->hlen = tunnel->tun_hlen + tunnel->encap_hlen;
t_hlen = tunnel->hlen + sizeof(struct ipv6hdr);
 
dev->hard_header_len = LL_MAX_HEADER + t_hlen;
@@ -1290,15 +1288,57 @@ static void ip6gre_tap_setup(struct net_device *dev)
dev->priv_flags &= ~IFF_TX_SKB_SHARING;
 }
 
+static bool ip6gre_netlink_encap_parms(struct nlattr *data[],
+  struct ip_tunnel_encap *ipencap)
+{
+   bool ret = false;
+
+   memset(ipencap, 0, sizeof(*ipencap));
+
+   if (!data)
+   return ret;
+
+   if (data[IFLA_GRE_ENCAP_TYPE]) {
+   ret = true;
+   ipencap->type = nla_get_u16(data[IFLA_GRE_ENCAP_TYPE]);
+   }
+
+   if (data[IFLA_GRE_ENCAP_FLAGS]) {
+   ret = true;
+   ipencap->flags = nla_get_u16(data[IFLA_GRE_ENCAP_FLAGS]);
+   }
+
+   if (data[IFLA_GRE_ENCAP_SPORT]) {
+   ret = true;
+   ipencap->sport = nla_get_be16(data[IFLA_GRE_ENCAP_SPORT]);
+   }
+
+   if (data[IFLA_GRE_ENCAP_DPORT]) {
+   ret = true;
+   ipencap->dport = nla_get_be16(data[IFLA_GRE_ENCAP_DPORT]);
+   }
+
+   return ret;
+}
+
 static int ip6gre_newlink(struct net *src_net, struct net_device *dev,
struct nlattr *tb[], struct nlattr *data[])
 {
struct ip6_tnl *nt;
struct net *net = dev_net(dev);
struct ip6gre_net *ign = net_generic(net, ip6gre_net_id);
+   struct ip_tunnel_encap ipencap;
int err;
 
nt = netdev_priv(dev);
+
+   if (ip6gre_netlink_encap_parms(data, )) {
+   int err = ip6_tnl_encap_setup(nt, );
+
+   if (err < 0)
+   return err;
+   }
+
ip6gre_netlink_parms(data, >parms);
 
if (ip6gre_tunnel_find(net, >parms, dev->type))
@@ -1345,10 +1385,18 @@ static int ip6gre_changelink(struct net_device *dev, struct nlattr *tb[],
struct net *net = nt->net;
struct ip6gre_net *ign = net_generic(net, ip6gre_net_id);
struct __ip6_tnl_parm p;
+   struct ip_tunnel_encap ipencap;
 
if (dev == ign->fb_tunnel_dev)
return -EINVAL;
 
+   if (ip6gre_netlink_encap_parms(data, )) {
+   int err = ip6_tnl_encap_setup(nt, );
+
+   if (err < 0)
+   return err;
+   }
+
ip6gre_netlink_parms(data, );
 
t = ip6gre_tunnel_locate(net, , 0);
@@ -1400,6 +1448,14 @@ static size_t ip6gre_get_size(const struct net_device *dev)
nla_total_size(4) +
/* IFLA_GRE_FLAGS */
nla_total_size(4) +
+   /* IFLA_GRE_ENCAP_TYPE */
+   nla_total_size(2) +
+   /* IFLA_GRE_ENCAP_FLAGS */
+   nla_total_size(2) +
+   /* IFLA_GRE_ENCAP_SPORT */
+   nla_total_size(2) +
+   /* IFLA_GRE_ENCAP_DPORT */
+   nla_total_size(2) +
0;
 }
 
@@ -1422,6 +1478,17 @@ static int ip6gre_fill_info(struct sk_buff *skb, const struct net_device *dev)
nla_put_be32(skb, IFLA_GRE_FLOWINFO, p->flowinfo) ||
nla_put_u32(skb, IFLA_GRE_FLAGS, p->flags))
goto nla_put_failure;
+
+   if (nla_put_u16(skb, IFLA_GRE_ENCAP_TYPE,
+   t->encap.type) ||
+   nla_put_be16(skb, IFLA_GRE_ENCAP_SPORT,
+t->encap.sport) ||
+   nla_put_be16(skb, IFLA_GRE_ENCAP_DPORT,
+t->encap.dport) ||
+   nla_put_u16(skb, IFLA_GRE_ENCAP_FLAGS,
+   t->encap.flags))
+   goto nla_put_failure;
+
return 0;
 
 nla_put_failure:
@@ -1440,6 +1507,10 @@ static const struct nla_policy ip6gre_policy[IFLA_GRE_MAX + 1] = {
[IFLA_GRE_ENCAP_LIMIT] = { .type = NLA_U8 },
[IFLA_GRE_FLOWINFO]= { .type = NLA_U32 },
[IFLA_GRE_FLAGS]   = { .type = NLA_U32 },
+   [IFLA_GRE_ENCAP_TYPE]   = { .type = NLA_U16 },
+   [IFLA_GRE_ENCAP_FLAGS]  = { .type = NLA_U16 },
+   [IFLA_GRE_ENCAP_SPORT]  = { .type = NLA_U16 },
+   [IFLA_GRE_ENCAP_DPORT]  = { .type = NLA_U16 },
 };
 
 static struct rtnl_link_ops ip6gre_link_ops __read_mostly = {
-- 
2.8.0.rc2



[PATCH v5 net-next 05/14] net: Cleanup encap items in ip_tunnels.h

2016-05-15 Thread Tom Herbert
Consolidate all the ip_tunnel_encap definitions in one spot in the
header file. Also, move ip_encap_hlen and ip_tunnel_encap from
ip_tunnel.c to ip_tunnels.h so they can be called without a dependency
on ip_tunnel module. Similarly, move iptun_encaps to ip_tunnel_core.c.
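The inlined helpers follow a simple dispatch pattern: an ops table indexed by encap type, with a bounds check and a NULL check before the indirect call. A userspace sketch of that pattern (RCU omitted; names and the error value are illustrative):

```c
#include <stddef.h>

struct encap_ops {
	int (*encap_hlen)(int cfg);
};

#define MAX_ENCAP_OPS 8
#define ENCAP_NONE    0

static const struct encap_ops *encaps[MAX_ENCAP_OPS];

/* Mirrors the shape of ip_encap_hlen(): 0 for no encapsulation, an
 * error for unknown/unregistered types, else ask the registered ops. */
static int encap_hlen(int type, int cfg)
{
	const struct encap_ops *ops;

	if (type == ENCAP_NONE)
		return 0;
	if (type < 0 || type >= MAX_ENCAP_OPS)
		return -1;
	ops = encaps[type];	/* rcu_dereference() in the kernel */
	if (!ops || !ops->encap_hlen)
		return -1;
	return ops->encap_hlen(cfg);
}

/* Example registration, as a fou-like module would do. */
static int fou_hlen(int cfg) { (void)cfg; return 8; /* UDP header */ }
static const struct encap_ops fou_ops = { .encap_hlen = fou_hlen };
```

Because the table and the bounds/NULL checks live in the header, a tunnel driver can resolve the header length without linking against the ip_tunnel module.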

Signed-off-by: Tom Herbert 
---
 include/net/ip_tunnels.h  | 76 ---
 net/ipv4/ip_tunnel.c  | 45 
 net/ipv4/ip_tunnel_core.c |  4 +++
 3 files changed, 62 insertions(+), 63 deletions(-)

diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index d916b43..dbf 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -171,22 +171,6 @@ struct ip_tunnel_net {
struct ip_tunnel __rcu *collect_md_tun;
 };
 
-struct ip_tunnel_encap_ops {
-   size_t (*encap_hlen)(struct ip_tunnel_encap *e);
-   int (*build_header)(struct sk_buff *skb, struct ip_tunnel_encap *e,
-   u8 *protocol, struct flowi4 *fl4);
-};
-
-#define MAX_IPTUN_ENCAP_OPS 8
-
-extern const struct ip_tunnel_encap_ops __rcu *
-   iptun_encaps[MAX_IPTUN_ENCAP_OPS];
-
-int ip_tunnel_encap_add_ops(const struct ip_tunnel_encap_ops *op,
-   unsigned int num);
-int ip_tunnel_encap_del_ops(const struct ip_tunnel_encap_ops *op,
-   unsigned int num);
-
 static inline void ip_tunnel_key_init(struct ip_tunnel_key *key,
  __be32 saddr, __be32 daddr,
  u8 tos, u8 ttl, __be32 label,
@@ -251,8 +235,6 @@ void ip_tunnel_delete_net(struct ip_tunnel_net *itn, struct rtnl_link_ops *ops);
 void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
const struct iphdr *tnl_params, const u8 protocol);
 int ip_tunnel_ioctl(struct net_device *dev, struct ip_tunnel_parm *p, int cmd);
-int ip_tunnel_encap(struct sk_buff *skb, struct ip_tunnel *t,
-   u8 *protocol, struct flowi4 *fl4);
 int __ip_tunnel_change_mtu(struct net_device *dev, int new_mtu, bool strict);
 int ip_tunnel_change_mtu(struct net_device *dev, int new_mtu);
 
@@ -271,9 +253,67 @@ int ip_tunnel_changelink(struct net_device *dev, struct nlattr *tb[],
 int ip_tunnel_newlink(struct net_device *dev, struct nlattr *tb[],
  struct ip_tunnel_parm *p);
 void ip_tunnel_setup(struct net_device *dev, int net_id);
+
+struct ip_tunnel_encap_ops {
+   size_t (*encap_hlen)(struct ip_tunnel_encap *e);
+   int (*build_header)(struct sk_buff *skb, struct ip_tunnel_encap *e,
+   u8 *protocol, struct flowi4 *fl4);
+};
+
+#define MAX_IPTUN_ENCAP_OPS 8
+
+extern const struct ip_tunnel_encap_ops __rcu *
+   iptun_encaps[MAX_IPTUN_ENCAP_OPS];
+
+int ip_tunnel_encap_add_ops(const struct ip_tunnel_encap_ops *op,
+   unsigned int num);
+int ip_tunnel_encap_del_ops(const struct ip_tunnel_encap_ops *op,
+   unsigned int num);
+
 int ip_tunnel_encap_setup(struct ip_tunnel *t,
  struct ip_tunnel_encap *ipencap);
 
+static inline int ip_encap_hlen(struct ip_tunnel_encap *e)
+{
+   const struct ip_tunnel_encap_ops *ops;
+   int hlen = -EINVAL;
+
+   if (e->type == TUNNEL_ENCAP_NONE)
+   return 0;
+
+   if (e->type >= MAX_IPTUN_ENCAP_OPS)
+   return -EINVAL;
+
+   rcu_read_lock();
+   ops = rcu_dereference(iptun_encaps[e->type]);
+   if (likely(ops && ops->encap_hlen))
+   hlen = ops->encap_hlen(e);
+   rcu_read_unlock();
+
+   return hlen;
+}
+
+static inline int ip_tunnel_encap(struct sk_buff *skb, struct ip_tunnel *t,
+ u8 *protocol, struct flowi4 *fl4)
+{
+   const struct ip_tunnel_encap_ops *ops;
+   int ret = -EINVAL;
+
+   if (t->encap.type == TUNNEL_ENCAP_NONE)
+   return 0;
+
+   if (t->encap.type >= MAX_IPTUN_ENCAP_OPS)
+   return -EINVAL;
+
+   rcu_read_lock();
+   ops = rcu_dereference(iptun_encaps[t->encap.type]);
+   if (likely(ops && ops->build_header))
+   ret = ops->build_header(skb, >encap, protocol, fl4);
+   rcu_read_unlock();
+
+   return ret;
+}
+
 /* Extract dsfield from inner protocol */
 static inline u8 ip_tunnel_get_dsfield(const struct iphdr *iph,
   const struct sk_buff *skb)
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index a69ed94..d8f5e0a 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -443,29 +443,6 @@ drop:
 }
 EXPORT_SYMBOL_GPL(ip_tunnel_rcv);
 
-static int ip_encap_hlen(struct ip_tunnel_encap *e)
-{
-   const struct ip_tunnel_encap_ops *ops;
-   int hlen = -EINVAL;
-
-   if (e->type == TUNNEL_ENCAP_NONE)
-   return 0;
-
-   if (e->type >= MAX_IPTUN_ENCAP_OPS)
-   return -EINVAL;
-
-   

[PATCH v5 net-next 06/14] fou: Call setup_udp_tunnel_sock

2016-05-15 Thread Tom Herbert
Use helper function to set up UDP tunnel related information for a fou
socket.

Signed-off-by: Tom Herbert 
---
 net/ipv4/fou.c | 50 --
 1 file changed, 16 insertions(+), 34 deletions(-)

diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index eeec7d6..6cbc725 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -448,31 +448,13 @@ static void fou_release(struct fou *fou)
kfree_rcu(fou, rcu);
 }
 
-static int fou_encap_init(struct sock *sk, struct fou *fou, struct fou_cfg *cfg)
-{
-   udp_sk(sk)->encap_rcv = fou_udp_recv;
-   udp_sk(sk)->gro_receive = fou_gro_receive;
-   udp_sk(sk)->gro_complete = fou_gro_complete;
-   fou_from_sock(sk)->protocol = cfg->protocol;
-
-   return 0;
-}
-
-static int gue_encap_init(struct sock *sk, struct fou *fou, struct fou_cfg *cfg)
-{
-   udp_sk(sk)->encap_rcv = gue_udp_recv;
-   udp_sk(sk)->gro_receive = gue_gro_receive;
-   udp_sk(sk)->gro_complete = gue_gro_complete;
-
-   return 0;
-}
-
 static int fou_create(struct net *net, struct fou_cfg *cfg,
  struct socket **sockp)
 {
struct socket *sock = NULL;
struct fou *fou = NULL;
struct sock *sk;
+   struct udp_tunnel_sock_cfg tunnel_cfg;
int err;
 
/* Open UDP socket */
@@ -491,33 +473,33 @@ static int fou_create(struct net *net, struct fou_cfg *cfg,
 
fou->flags = cfg->flags;
fou->port = cfg->udp_config.local_udp_port;
+   fou->type = cfg->type;
+   fou->sock = sock;
+
+   memset(_cfg, 0, sizeof(tunnel_cfg));
+   tunnel_cfg.encap_type = 1;
+   tunnel_cfg.sk_user_data = fou;
+   tunnel_cfg.encap_destroy = NULL;
 
/* Initial for fou type */
switch (cfg->type) {
case FOU_ENCAP_DIRECT:
-   err = fou_encap_init(sk, fou, cfg);
-   if (err)
-   goto error;
+   tunnel_cfg.encap_rcv = fou_udp_recv;
+   tunnel_cfg.gro_receive = fou_gro_receive;
+   tunnel_cfg.gro_complete = fou_gro_complete;
+   fou->protocol = cfg->protocol;
break;
case FOU_ENCAP_GUE:
-   err = gue_encap_init(sk, fou, cfg);
-   if (err)
-   goto error;
+   tunnel_cfg.encap_rcv = gue_udp_recv;
+   tunnel_cfg.gro_receive = gue_gro_receive;
+   tunnel_cfg.gro_complete = gue_gro_complete;
break;
default:
err = -EINVAL;
goto error;
}
 
-   fou->type = cfg->type;
-
-   udp_sk(sk)->encap_type = 1;
-   udp_encap_enable();
-
-   sk->sk_user_data = fou;
-   fou->sock = sock;
-
-   inet_inc_convert_csum(sk);
+   setup_udp_tunnel_sock(net, sock, _cfg);
 
sk->sk_allocation = GFP_ATOMIC;
 
-- 
2.8.0.rc2



[PATCH v5 net-next 12/14] ip6_tunnel: Add support for fou/gue encapsulation

2016-05-15 Thread Tom Herbert
Add netlink and setup for encapsulation

Signed-off-by: Tom Herbert 
---
 net/ipv6/ip6_tunnel.c | 72 +++
 1 file changed, 72 insertions(+)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 9f0ea85..093bdba 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1796,13 +1796,55 @@ static void ip6_tnl_netlink_parms(struct nlattr *data[],
parms->proto = nla_get_u8(data[IFLA_IPTUN_PROTO]);
 }
 
+static bool ip6_tnl_netlink_encap_parms(struct nlattr *data[],
+   struct ip_tunnel_encap *ipencap)
+{
+   bool ret = false;
+
+   memset(ipencap, 0, sizeof(*ipencap));
+
+   if (!data)
+   return ret;
+
+   if (data[IFLA_IPTUN_ENCAP_TYPE]) {
+   ret = true;
+   ipencap->type = nla_get_u16(data[IFLA_IPTUN_ENCAP_TYPE]);
+   }
+
+   if (data[IFLA_IPTUN_ENCAP_FLAGS]) {
+   ret = true;
+   ipencap->flags = nla_get_u16(data[IFLA_IPTUN_ENCAP_FLAGS]);
+   }
+
+   if (data[IFLA_IPTUN_ENCAP_SPORT]) {
+   ret = true;
+   ipencap->sport = nla_get_be16(data[IFLA_IPTUN_ENCAP_SPORT]);
+   }
+
+   if (data[IFLA_IPTUN_ENCAP_DPORT]) {
+   ret = true;
+   ipencap->dport = nla_get_be16(data[IFLA_IPTUN_ENCAP_DPORT]);
+   }
+
+   return ret;
+}
+
 static int ip6_tnl_newlink(struct net *src_net, struct net_device *dev,
   struct nlattr *tb[], struct nlattr *data[])
 {
struct net *net = dev_net(dev);
struct ip6_tnl *nt, *t;
+   struct ip_tunnel_encap ipencap;
 
nt = netdev_priv(dev);
+
+   if (ip6_tnl_netlink_encap_parms(data, )) {
+   int err = ip6_tnl_encap_setup(nt, );
+
+   if (err < 0)
+   return err;
+   }
+
ip6_tnl_netlink_parms(data, >parms);
 
t = ip6_tnl_locate(net, >parms, 0);
@@ -1819,10 +1861,17 @@ static int ip6_tnl_changelink(struct net_device *dev, struct nlattr *tb[],
struct __ip6_tnl_parm p;
struct net *net = t->net;
struct ip6_tnl_net *ip6n = net_generic(net, ip6_tnl_net_id);
+   struct ip_tunnel_encap ipencap;
 
if (dev == ip6n->fb_tnl_dev)
return -EINVAL;
 
+   if (ip6_tnl_netlink_encap_parms(data, )) {
+   int err = ip6_tnl_encap_setup(t, );
+
+   if (err < 0)
+   return err;
+   }
ip6_tnl_netlink_parms(data, );
 
t = ip6_tnl_locate(net, , 0);
@@ -1863,6 +1912,14 @@ static size_t ip6_tnl_get_size(const struct net_device *dev)
nla_total_size(4) +
/* IFLA_IPTUN_PROTO */
nla_total_size(1) +
+   /* IFLA_IPTUN_ENCAP_TYPE */
+   nla_total_size(2) +
+   /* IFLA_IPTUN_ENCAP_FLAGS */
+   nla_total_size(2) +
+   /* IFLA_IPTUN_ENCAP_SPORT */
+   nla_total_size(2) +
+   /* IFLA_IPTUN_ENCAP_DPORT */
+   nla_total_size(2) +
0;
 }
 
@@ -1880,6 +1937,17 @@ static int ip6_tnl_fill_info(struct sk_buff *skb, const struct net_device *dev)
nla_put_u32(skb, IFLA_IPTUN_FLAGS, parm->flags) ||
nla_put_u8(skb, IFLA_IPTUN_PROTO, parm->proto))
goto nla_put_failure;
+
+   if (nla_put_u16(skb, IFLA_IPTUN_ENCAP_TYPE,
+   tunnel->encap.type) ||
+   nla_put_be16(skb, IFLA_IPTUN_ENCAP_SPORT,
+tunnel->encap.sport) ||
+   nla_put_be16(skb, IFLA_IPTUN_ENCAP_DPORT,
+tunnel->encap.dport) ||
+   nla_put_u16(skb, IFLA_IPTUN_ENCAP_FLAGS,
+   tunnel->encap.flags))
+   goto nla_put_failure;
+
return 0;
 
 nla_put_failure:
@@ -1903,6 +1971,10 @@ static const struct nla_policy ip6_tnl_policy[IFLA_IPTUN_MAX + 1] = {
[IFLA_IPTUN_FLOWINFO]   = { .type = NLA_U32 },
[IFLA_IPTUN_FLAGS]  = { .type = NLA_U32 },
[IFLA_IPTUN_PROTO]  = { .type = NLA_U8 },
+   [IFLA_IPTUN_ENCAP_TYPE] = { .type = NLA_U16 },
+   [IFLA_IPTUN_ENCAP_FLAGS]= { .type = NLA_U16 },
+   [IFLA_IPTUN_ENCAP_SPORT]= { .type = NLA_U16 },
+   [IFLA_IPTUN_ENCAP_DPORT]= { .type = NLA_U16 },
 };
 
 static struct rtnl_link_ops ip6_link_ops __read_mostly = {
-- 
2.8.0.rc2



[PATCH v5 net-next 10/14] fou: Add encap ops for IPv6 tunnels

2016-05-15 Thread Tom Herbert
This patch adds a new fou6 module that provides encapsulation
operations for IPv6.
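The core of the module is fou6_build_udp(), which prepends a UDP header and computes the checksum over the IPv6 pseudo-header. The header layout itself can be sketched in isolation (a toy stand-in for struct udphdr; checksum handling omitted):

```c
#include <stdint.h>
#include <arpa/inet.h>

/* Toy stand-in for struct udphdr (all fields big-endian on the wire). */
struct udp_hdr {
	uint16_t source;
	uint16_t dest;
	uint16_t len;	/* header + payload, like uh->len = htons(skb->len) */
	uint16_t check;	/* udp6_set_csum() fills this from the v6 pseudo-header */
};

static void build_udp(struct udp_hdr *uh, uint16_t sport, uint16_t dport,
		      uint16_t payload_len)
{
	uh->source = htons(sport);
	uh->dest   = htons(dport);
	uh->len    = htons((uint16_t)(sizeof(*uh) + payload_len));
	uh->check  = 0;	/* left to the checksum helper */
}
```

Unlike IPv4, a zero UDP checksum is not allowed over IPv6, which is why the real code always calls udp6_set_csum() (honoring TUNNEL_ENCAP_FLAG_CSUM6) rather than leaving the field clear.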

Signed-off-by: Tom Herbert 
---
 include/net/fou.h |   2 +-
 net/ipv6/Makefile |   1 +
 net/ipv6/fou6.c   | 140 ++
 3 files changed, 142 insertions(+), 1 deletion(-)
 create mode 100644 net/ipv6/fou6.c

diff --git a/include/net/fou.h b/include/net/fou.h
index 7d2fda2..f5cc691 100644
--- a/include/net/fou.h
+++ b/include/net/fou.h
@@ -9,7 +9,7 @@
 #include 
 
 size_t fou_encap_hlen(struct ip_tunnel_encap *e);
-static size_t gue_encap_hlen(struct ip_tunnel_encap *e);
+size_t gue_encap_hlen(struct ip_tunnel_encap *e);
 
 int __fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
   u8 *protocol, __be16 *sport, int type);
diff --git a/net/ipv6/Makefile b/net/ipv6/Makefile
index 5e9d6bf..7ec3129 100644
--- a/net/ipv6/Makefile
+++ b/net/ipv6/Makefile
@@ -42,6 +42,7 @@ obj-$(CONFIG_IPV6_VTI) += ip6_vti.o
 obj-$(CONFIG_IPV6_SIT) += sit.o
 obj-$(CONFIG_IPV6_TUNNEL) += ip6_tunnel.o
 obj-$(CONFIG_IPV6_GRE) += ip6_gre.o
+obj-$(CONFIG_NET_FOU) += fou6.o
 
 obj-y += addrconf_core.o exthdrs_core.o ip6_checksum.o ip6_icmp.o
 obj-$(CONFIG_INET) += output_core.o protocol.o $(ipv6-offload)
diff --git a/net/ipv6/fou6.c b/net/ipv6/fou6.c
new file mode 100644
index 000..c972d0b
--- /dev/null
+++ b/net/ipv6/fou6.c
@@ -0,0 +1,140 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static void fou6_build_udp(struct sk_buff *skb, struct ip_tunnel_encap *e,
+  struct flowi6 *fl6, u8 *protocol, __be16 sport)
+{
+   struct udphdr *uh;
+
+   skb_push(skb, sizeof(struct udphdr));
+   skb_reset_transport_header(skb);
+
+   uh = udp_hdr(skb);
+
+   uh->dest = e->dport;
+   uh->source = sport;
+   uh->len = htons(skb->len);
+   udp6_set_csum(!(e->flags & TUNNEL_ENCAP_FLAG_CSUM6), skb,
+ >saddr, >daddr, skb->len);
+
+   *protocol = IPPROTO_UDP;
+}
+
+int fou6_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
+ u8 *protocol, struct flowi6 *fl6)
+{
+   __be16 sport;
+   int err;
+   int type = e->flags & TUNNEL_ENCAP_FLAG_CSUM6 ?
+   SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
+
+   err = __fou_build_header(skb, e, protocol, , type);
+   if (err)
+   return err;
+
+   fou6_build_udp(skb, e, fl6, protocol, sport);
+
+   return 0;
+}
+EXPORT_SYMBOL(fou6_build_header);
+
+int gue6_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
+ u8 *protocol, struct flowi6 *fl6)
+{
+   __be16 sport;
+   int err;
+   int type = e->flags & TUNNEL_ENCAP_FLAG_CSUM6 ?
+   SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
+
+   err = __gue_build_header(skb, e, protocol, , type);
+   if (err)
+   return err;
+
+   fou6_build_udp(skb, e, fl6, protocol, sport);
+
+   return 0;
+}
+EXPORT_SYMBOL(gue6_build_header);
+
+#ifdef CONFIG_NET_FOU_IP_TUNNELS
+
+static const struct ip6_tnl_encap_ops fou_ip6tun_ops = {
+   .encap_hlen = fou_encap_hlen,
+   .build_header = fou6_build_header,
+};
+
+static const struct ip6_tnl_encap_ops gue_ip6tun_ops = {
+   .encap_hlen = gue_encap_hlen,
+   .build_header = gue6_build_header,
+};
+
+static int ip6_tnl_encap_add_fou_ops(void)
+{
+   int ret;
+
+   ret = ip6_tnl_encap_add_ops(_ip6tun_ops, TUNNEL_ENCAP_FOU);
+   if (ret < 0) {
+   pr_err("can't add fou6 ops\n");
+   return ret;
+   }
+
+   ret = ip6_tnl_encap_add_ops(_ip6tun_ops, TUNNEL_ENCAP_GUE);
+   if (ret < 0) {
+   pr_err("can't add gue6 ops\n");
+   ip6_tnl_encap_del_ops(_ip6tun_ops, TUNNEL_ENCAP_FOU);
+   return ret;
+   }
+
+   return 0;
+}
+
+static void ip6_tnl_encap_del_fou_ops(void)
+{
+   ip6_tnl_encap_del_ops(_ip6tun_ops, TUNNEL_ENCAP_FOU);
+   ip6_tnl_encap_del_ops(_ip6tun_ops, TUNNEL_ENCAP_GUE);
+}
+
+#else
+
+static int ip6_tnl_encap_add_fou_ops(void)
+{
+   return 0;
+}
+
+static void ip6_tnl_encap_del_fou_ops(void)
+{
+}
+
+#endif
+
+static int __init fou6_init(void)
+{
+   int ret;
+
+   ret = ip6_tnl_encap_add_fou_ops();
+
+   return ret;
+}
+
+static void __exit fou6_fini(void)
+{
+   ip6_tnl_encap_del_fou_ops();
+}
+
+module_init(fou6_init);
+module_exit(fou6_fini);
+MODULE_AUTHOR("Tom Herbert ");
+MODULE_LICENSE("GPL");
-- 
2.8.0.rc2



[PATCH v5 net-next 07/14] fou: Split out {fou,gue}_build_header

2016-05-15 Thread Tom Herbert
Create __fou_build_header and __gue_build_header. These implement the
protocol-generic parts of building the fou and gue headers.
fou_build_header and gue_build_header implement the IPv4-specific
parts and call the __*_build_header functions.
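The refactor follows a common shape: a protocol-neutral core that computes the shared state (here, the UDP source port), and thin per-family wrappers that then emit the IPv4 or IPv6 UDP header. A minimal sketch of that shape (names and the port hash below are simplified stand-ins, not the kernel's udp_flow_src_port()):

```c
#include <stdint.h>

/* Protocol-neutral core: pick a source port (fixed if configured,
 * otherwise flow-hash based -- stubbed here), as __fou_build_header
 * does after handling GSO offloads. */
static int core_build(uint16_t cfg_sport, uint32_t flow_hash, uint16_t *sport)
{
	*sport = cfg_sport ? cfg_sport
			   : (uint16_t)(49152 + (flow_hash % 16384));
	return 0;
}

/* Per-family wrappers differ only in how the UDP header is emitted. */
static int build_v4(uint16_t cfg_sport, uint32_t hash, uint16_t *sport)
{
	int err = core_build(cfg_sport, hash, sport);

	/* here fou_build_udp() would push an IPv4-checksummed UDP header */
	return err;
}

static int build_v6(uint16_t cfg_sport, uint32_t hash, uint16_t *sport)
{
	int err = core_build(cfg_sport, hash, sport);

	/* here fou6_build_udp() would push an IPv6-checksummed UDP header */
	return err;
}
```

This is what lets the later fou6 patch reuse __fou_build_header/__gue_build_header unchanged while supplying only the IPv6 wrappers.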

Signed-off-by: Tom Herbert 
---
 include/net/fou.h |  8 
 net/ipv4/fou.c| 47 +--
 2 files changed, 41 insertions(+), 14 deletions(-)

diff --git a/include/net/fou.h b/include/net/fou.h
index 19b8a0c..7d2fda2 100644
--- a/include/net/fou.h
+++ b/include/net/fou.h
@@ -11,9 +11,9 @@
 size_t fou_encap_hlen(struct ip_tunnel_encap *e);
 static size_t gue_encap_hlen(struct ip_tunnel_encap *e);
 
-int fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
-u8 *protocol, struct flowi4 *fl4);
-int gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
-u8 *protocol, struct flowi4 *fl4);
+int __fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
+  u8 *protocol, __be16 *sport, int type);
+int __gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
+  u8 *protocol, __be16 *sport, int type);
 
 #endif
diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index 6cbc725..f4f2ddd 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -780,6 +780,22 @@ static void fou_build_udp(struct sk_buff *skb, struct ip_tunnel_encap *e,
*protocol = IPPROTO_UDP;
 }
 
+int __fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
+  u8 *protocol, __be16 *sport, int type)
+{
+   int err;
+
+   err = iptunnel_handle_offloads(skb, type);
+   if (err)
+   return err;
+
+   *sport = e->sport ? : udp_flow_src_port(dev_net(skb->dev),
+   skb, 0, 0, false);
+
+   return 0;
+}
+EXPORT_SYMBOL(__fou_build_header);
+
 int fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
 u8 *protocol, struct flowi4 *fl4)
 {
@@ -788,26 +804,21 @@ int fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
__be16 sport;
int err;
 
-   err = iptunnel_handle_offloads(skb, type);
+   err = __fou_build_header(skb, e, protocol, , type);
if (err)
return err;
 
-   sport = e->sport ? : udp_flow_src_port(dev_net(skb->dev),
-  skb, 0, 0, false);
fou_build_udp(skb, e, fl4, protocol, sport);
 
return 0;
 }
 EXPORT_SYMBOL(fou_build_header);
 
-int gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
-u8 *protocol, struct flowi4 *fl4)
+int __gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
+  u8 *protocol, __be16 *sport, int type)
 {
-   int type = e->flags & TUNNEL_ENCAP_FLAG_CSUM ? SKB_GSO_UDP_TUNNEL_CSUM :
-  SKB_GSO_UDP_TUNNEL;
struct guehdr *guehdr;
size_t hdrlen, optlen = 0;
-   __be16 sport;
void *data;
bool need_priv = false;
int err;
@@ -826,8 +837,8 @@ int gue_build_header(struct sk_buff *skb, struct 
ip_tunnel_encap *e,
return err;
 
/* Get source port (based on flow hash) before skb_push */
-   sport = e->sport ? : udp_flow_src_port(dev_net(skb->dev),
-  skb, 0, 0, false);
+   *sport = e->sport ? : udp_flow_src_port(dev_net(skb->dev),
+   skb, 0, 0, false);
 
hdrlen = sizeof(struct guehdr) + optlen;
 
@@ -872,6 +883,22 @@ int gue_build_header(struct sk_buff *skb, struct 
ip_tunnel_encap *e,
 
}
 
+   return 0;
+}
+EXPORT_SYMBOL(__gue_build_header);
+
+int gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
+u8 *protocol, struct flowi4 *fl4)
+{
+   int type = e->flags & TUNNEL_ENCAP_FLAG_CSUM ? SKB_GSO_UDP_TUNNEL_CSUM :
+  SKB_GSO_UDP_TUNNEL;
+   __be16 sport;
+   int err;
+
+   err = __gue_build_header(skb, e, protocol, &sport, type);
+   if (err)
+   return err;
+
fou_build_udp(skb, e, fl4, protocol, sport);
 
return 0;
-- 
2.8.0.rc2



[PATCH v5 net-next 00/14] ipv6: Enable GUEoIPv6 and more fixes for v6 tunneling

2016-05-15 Thread Tom Herbert
This patch set:
  - Fixes GRE6 to process translate flags correctly from configuration
  - Adds support for GSO and GRO for ip6ip6 and ip4ip6
  - Add support for FOU and GUE in IPv6
  - Support GRE, ip6ip6 and ip4ip6 over FOU/GUE
  - Fixes ip6_input to deal with UDP encapsulations
  - Some other minor fixes

v2:
  - Removed a check of GSO types in MPLS
  - Define GSO type SKB_GSO_IPXIP6 and SKB_GSO_IPXIP4 (based on input
from Alexander)
  - Don't define GSO types specifically for IP6IP6 and IP4IP6, above
fix makes that unnecessary
  - Don't bother clearing encapsulation flag in UDP tunnel segment
(another item suggested by Alexander).

v3:
  - Address some minor comments from Alexander

v4:
  - Rebase on changes to fix IP TX tunnels
  - Fix MTU issues in ip4ip6, ip6ip6
  - Add test data for above

v5:
  - Address feedback from Shmulik Ladkani regarding extension header
code that does not return the next header but instead relies
on returning the value via nhoff. The solution here is to fix EH
processing to return the nexthdr value.
  - Refactored IPv4 encaps so that we won't need to create
an ip6_tunnel_core.c when adding encap support for IPv6.

Tested:
   Tested a variety of cases, but not the full matrix (which is quite
   large now). Most of the obvious cases (e.g. GRE) work fine. There are
   probably still some issues with GSO/GRO being effective in all cases.

- IPv4/GRE/GUE/IPv6 with RCO
  1 TCP_STREAM
6616 Mbps
  200 TCP_RR
1244043 tps
141/243/446 90/95/99% latencies
86.61% CPU utilization

- IPv6/GRE/GUE/IPv6 with RCO
  1 TCP_STREAM
6940 Mbps
  200 TCP_RR
1270903 tps
138/236/440 90/95/99% latencies
87.51% CPU utilization

 - IP6IP6
  1 TCP_STREAM
2576 Mbps
  200 TCP_RR
498981 tps
388/498/631 90/95/99% latencies
19.75% CPU utilization (1 CPU saturated)

 - IP6IP6/GUE with RCO
  1 TCP_STREAM
2031 Mbps
  200 TCP_RR
1233818 tps
143/244/451 90/95/99% latencies
87.57 CPU utilization

 - IP4IP6
  1 TCP_STREAM
2371 Mbps
  200 TCP_RR
763774 tps
250/318/466 90/95/99% latencies
35.25% CPU utilization (1 CPU saturated)

 - IP4IP6/GUE with RCO
  1 TCP_STREAM
2054 Mbps
  200 TCP_RR
1196385 tps
148/251/460 90/95/99% latencies
87.56 CPU utilization

 - GRE with keyid
  200 TCP_RR
744173 tps
258/332/461 90/95/99% latencies
34.59% CPU utilization (1 CPU saturated)
  

Tom Herbert (14):
  gso: Remove arbitrary checks for unsupported GSO
  net: define gso types for IPx over IPv4 and IPv6
  ipv6: Fix nexthdr for reinjection
  ipv6: Change "final" protocol processing for encapsulation
  net: Cleanup encap items in ip_tunnels.h
  fou: Call setup_udp_tunnel_sock
  fou: Split out {fou,gue}_build_header
  fou: Support IPv6 in fou
  ip6_tun: Add infrastructure for doing encapsulation
  fou: Add encap ops for IPv6 tunnels
  ip6_gre: Add support for fou/gue encapsulation
  ip6_tunnel: Add support for fou/gue encapsulation
  ip6ip6: Support for GSO/GRO
  ip4ip6: Support for GSO/GRO

 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c  |   5 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |   4 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   3 +-
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |   3 +-
 drivers/net/ethernet/intel/i40evf/i40evf_main.c   |   3 +-
 drivers/net/ethernet/intel/igb/igb_main.c |   3 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   3 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |   3 +-
 include/linux/netdev_features.h   |  12 +-
 include/linux/netdevice.h |   4 +-
 include/linux/skbuff.h|   4 +-
 include/net/fou.h |  10 +-
 include/net/inet_common.h |   5 +
 include/net/ip6_tunnel.h  |  58 
 include/net/ip_tunnels.h  |  76 +++---
 net/core/ethtool.c|   4 +-
 net/ipv4/af_inet.c|  32 ++---
 net/ipv4/fou.c| 144 +++
 net/ipv4/gre_offload.c|  14 --
 net/ipv4/ip_tunnel.c  |  45 --
 net/ipv4/ip_tunnel_core.c |   9 ++
 net/ipv4/ipip.c   |   2 +-
 net/ipv4/tcp_offload.c|  19 ---
 net/ipv4/udp_offload.c|  10 --
 net/ipv6/Makefile |   1 +
 net/ipv6/fou6.c   | 140 ++
 net/ipv6/ip6_gre.c|  77 +-
 net/ipv6/ip6_input.c  

[PATCH v5 net-next 13/14] ip6ip6: Support for GSO/GRO

2016-05-15 Thread Tom Herbert
Signed-off-by: Tom Herbert 
---
 net/ipv6/ip6_offload.c | 24 +---
 net/ipv6/ip6_tunnel.c  |  3 +++
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 787e55f..332d6a0 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -253,9 +253,11 @@ out:
return pp;
 }
 
-static struct sk_buff **sit_gro_receive(struct sk_buff **head,
-   struct sk_buff *skb)
+static struct sk_buff **sit_ip6ip6_gro_receive(struct sk_buff **head,
+  struct sk_buff *skb)
 {
+   /* Common GRO receive for SIT and IP6IP6 */
+
if (NAPI_GRO_CB(skb)->encap_mark) {
NAPI_GRO_CB(skb)->flush = 1;
return NULL;
@@ -298,6 +300,13 @@ static int sit_gro_complete(struct sk_buff *skb, int nhoff)
return ipv6_gro_complete(skb, nhoff);
 }
 
+static int ip6ip6_gro_complete(struct sk_buff *skb, int nhoff)
+{
+   skb->encapsulation = 1;
+   skb_shinfo(skb)->gso_type |= SKB_GSO_IPXIP6;
+   return ipv6_gro_complete(skb, nhoff);
+}
+
 static struct packet_offload ipv6_packet_offload __read_mostly = {
.type = cpu_to_be16(ETH_P_IPV6),
.callbacks = {
@@ -310,11 +319,19 @@ static struct packet_offload ipv6_packet_offload 
__read_mostly = {
 static const struct net_offload sit_offload = {
.callbacks = {
.gso_segment= ipv6_gso_segment,
-   .gro_receive= sit_gro_receive,
+   .gro_receive= sit_ip6ip6_gro_receive,
.gro_complete   = sit_gro_complete,
},
 };
 
+static const struct net_offload ip6ip6_offload = {
+   .callbacks = {
+   .gso_segment= ipv6_gso_segment,
+   .gro_receive= sit_ip6ip6_gro_receive,
+   .gro_complete   = ip6ip6_gro_complete,
+   },
+};
+
 static int __init ipv6_offload_init(void)
 {
 
@@ -326,6 +343,7 @@ static int __init ipv6_offload_init(void)
dev_add_offload(&ipv6_packet_offload);

inet_add_offload(&sit_offload, IPPROTO_IPV6);
+   inet6_add_offload(&ip6ip6_offload, IPPROTO_IPV6);
 
return 0;
 }
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 093bdba..0219bfa 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1238,6 +1238,9 @@ ip6ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
*dev)
if (t->parms.flags & IP6_TNL_F_USE_ORIG_FWMARK)
fl6.flowi6_mark = skb->mark;
 
+   if (iptunnel_handle_offloads(skb, SKB_GSO_IPXIP6))
+   return -1;
+
err = ip6_tnl_xmit(skb, dev, dsfield, &fl6, encap_limit, &mtu,
   IPPROTO_IPV6);
if (err != 0) {
-- 
2.8.0.rc2



[PATCH v5 net-next 01/14] gso: Remove arbitrary checks for unsupported GSO

2016-05-15 Thread Tom Herbert
In several gso_segment functions there are checks of gso_type against
a seemingly arbitrary list of SKB_GSO_* flags. This seems like an
attempt to identify unsupported GSO types, but since the stack is
the one that sets these GSO types in the first place, the checks seem
unnecessary. If a combination isn't valid in the first place, the
stack should not allow setting it.

This is a code simplification, especially for adding new GSO types.

Signed-off-by: Tom Herbert 
---
 net/ipv4/af_inet.c | 18 --
 net/ipv4/gre_offload.c | 14 --
 net/ipv4/tcp_offload.c | 19 ---
 net/ipv4/udp_offload.c | 10 --
 net/ipv6/ip6_offload.c | 18 --
 net/ipv6/udp_offload.c | 13 -
 net/mpls/mpls_gso.c|  9 -
 7 files changed, 101 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 2e6e65f..7f08d45 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1205,24 +1205,6 @@ static struct sk_buff *inet_gso_segment(struct sk_buff 
*skb,
int ihl;
int id;
 
-   if (unlikely(skb_shinfo(skb)->gso_type &
-~(SKB_GSO_TCPV4 |
-  SKB_GSO_UDP |
-  SKB_GSO_DODGY |
-  SKB_GSO_TCP_ECN |
-  SKB_GSO_GRE |
-  SKB_GSO_GRE_CSUM |
-  SKB_GSO_IPIP |
-  SKB_GSO_SIT |
-  SKB_GSO_TCPV6 |
-  SKB_GSO_UDP_TUNNEL |
-  SKB_GSO_UDP_TUNNEL_CSUM |
-  SKB_GSO_TCP_FIXEDID |
-  SKB_GSO_TUNNEL_REMCSUM |
-  SKB_GSO_PARTIAL |
-  0)))
-   goto out;
-
skb_reset_network_header(skb);
nhoff = skb_network_header(skb) - skb_mac_header(skb);
if (unlikely(!pskb_may_pull(skb, sizeof(*iph
diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
index e88190a..ecd1e09 100644
--- a/net/ipv4/gre_offload.c
+++ b/net/ipv4/gre_offload.c
@@ -26,20 +26,6 @@ static struct sk_buff *gre_gso_segment(struct sk_buff *skb,
int gre_offset, outer_hlen;
bool need_csum, ufo;
 
-   if (unlikely(skb_shinfo(skb)->gso_type &
-   ~(SKB_GSO_TCPV4 |
- SKB_GSO_TCPV6 |
- SKB_GSO_UDP |
- SKB_GSO_DODGY |
- SKB_GSO_TCP_ECN |
- SKB_GSO_TCP_FIXEDID |
- SKB_GSO_GRE |
- SKB_GSO_GRE_CSUM |
- SKB_GSO_IPIP |
- SKB_GSO_SIT |
- SKB_GSO_PARTIAL)))
-   goto out;
-
if (!skb->encapsulation)
goto out;
 
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 02737b6..5c59649 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -83,25 +83,6 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
 
if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
/* Packet is from an untrusted source, reset gso_segs. */
-   int type = skb_shinfo(skb)->gso_type;
-
-   if (unlikely(type &
-~(SKB_GSO_TCPV4 |
-  SKB_GSO_DODGY |
-  SKB_GSO_TCP_ECN |
-  SKB_GSO_TCP_FIXEDID |
-  SKB_GSO_TCPV6 |
-  SKB_GSO_GRE |
-  SKB_GSO_GRE_CSUM |
-  SKB_GSO_IPIP |
-  SKB_GSO_SIT |
-  SKB_GSO_UDP_TUNNEL |
-  SKB_GSO_UDP_TUNNEL_CSUM |
-  SKB_GSO_TUNNEL_REMCSUM |
-  0) ||
-!(type & (SKB_GSO_TCPV4 |
-  SKB_GSO_TCPV6
-   goto out;
 
skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss);
 
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 6b7459c..81f253b 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -209,16 +209,6 @@ static struct sk_buff *udp4_ufo_fragment(struct sk_buff 
*skb,
 
if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
/* Packet is from an untrusted source, reset gso_segs. */
-   int type = skb_shinfo(skb)->gso_type;
-
-   if (unlikely(type & ~(SKB_GSO_UDP | SKB_GSO_DODGY |
- SKB_GSO_UDP_TUNNEL |
- SKB_GSO_UDP_TUNNEL_CSUM |
- SKB_GSO_TUNNEL_REMCSUM |
- SKB_GSO_IPIP |
- 

[PATCH 1/2] net: ethernet: ftgmac100: use phydev from struct net_device

2016-05-15 Thread Philippe Reynes
The private structure contains a pointer to phydev, but struct
net_device already contains such a pointer. So we can remove the
phydev pointer from the private structure and update the driver to
use the one in struct net_device.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/faraday/ftgmac100.c |   24 
 1 files changed, 8 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index 84384e1..9cc23c3 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -71,7 +71,6 @@ struct ftgmac100 {
struct napi_struct napi;
 
struct mii_bus *mii_bus;
-   struct phy_device *phydev;
int old_speed;
 };
 
@@ -807,7 +806,7 @@ err:
 static void ftgmac100_adjust_link(struct net_device *netdev)
 {
struct ftgmac100 *priv = netdev_priv(netdev);
-   struct phy_device *phydev = priv->phydev;
+   struct phy_device *phydev = netdev->phydev;
int ier;
 
if (phydev->speed == priv->old_speed)
@@ -850,7 +849,6 @@ static int ftgmac100_mii_probe(struct ftgmac100 *priv)
return PTR_ERR(phydev);
}
 
-   priv->phydev = phydev;
return 0;
 }
 
@@ -942,17 +940,13 @@ static void ftgmac100_get_drvinfo(struct net_device 
*netdev,
 static int ftgmac100_get_settings(struct net_device *netdev,
  struct ethtool_cmd *cmd)
 {
-   struct ftgmac100 *priv = netdev_priv(netdev);
-
-   return phy_ethtool_gset(priv->phydev, cmd);
+   return phy_ethtool_gset(netdev->phydev, cmd);
 }
 
 static int ftgmac100_set_settings(struct net_device *netdev,
  struct ethtool_cmd *cmd)
 {
-   struct ftgmac100 *priv = netdev_priv(netdev);
-
-   return phy_ethtool_sset(priv->phydev, cmd);
+   return phy_ethtool_sset(netdev->phydev, cmd);
 }
 
 static const struct ethtool_ops ftgmac100_ethtool_ops = {
@@ -1085,7 +1079,7 @@ static int ftgmac100_open(struct net_device *netdev)
ftgmac100_init_hw(priv);
ftgmac100_start_hw(priv, 10);
 
-   phy_start(priv->phydev);
+   phy_start(netdev->phydev);
 
napi_enable(&priv->napi);
netif_start_queue(netdev);
@@ -1111,7 +1105,7 @@ static int ftgmac100_stop(struct net_device *netdev)
 
netif_stop_queue(netdev);
napi_disable(&priv->napi);
-   phy_stop(priv->phydev);
+   phy_stop(netdev->phydev);
 
ftgmac100_stop_hw(priv);
free_irq(priv->irq, netdev);
@@ -1152,9 +1146,7 @@ static int ftgmac100_hard_start_xmit(struct sk_buff *skb,
 /* optional */
 static int ftgmac100_do_ioctl(struct net_device *netdev, struct ifreq *ifr, 
int cmd)
 {
-   struct ftgmac100 *priv = netdev_priv(netdev);
-
-   return phy_mii_ioctl(priv->phydev, ifr, cmd);
+   return phy_mii_ioctl(netdev->phydev, ifr, cmd);
 }
 
 static const struct net_device_ops ftgmac100_netdev_ops = {
@@ -1275,7 +1267,7 @@ static int ftgmac100_probe(struct platform_device *pdev)
return 0;
 
 err_register_netdev:
-   phy_disconnect(priv->phydev);
+   phy_disconnect(netdev->phydev);
 err_mii_probe:
mdiobus_unregister(priv->mii_bus);
 err_register_mdiobus:
@@ -1301,7 +1293,7 @@ static int __exit ftgmac100_remove(struct platform_device 
*pdev)
 
unregister_netdev(netdev);
 
-   phy_disconnect(priv->phydev);
+   phy_disconnect(netdev->phydev);
mdiobus_unregister(priv->mii_bus);
mdiobus_free(priv->mii_bus);
 
-- 
1.7.4.4



[PATCH 2/2] net: ethernet: ftgmac100: use phy_ethtool_{get|set}_link_ksettings

2016-05-15 Thread Philippe Reynes
There are two generic functions phy_ethtool_{get|set}_link_ksettings,
so we can use them instead of defining the same code in the driver.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/faraday/ftgmac100.c |   16 ++--
 1 files changed, 2 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index 9cc23c3..e7cf313 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -937,23 +937,11 @@ static void ftgmac100_get_drvinfo(struct net_device 
*netdev,
strlcpy(info->bus_info, dev_name(&netdev->dev), sizeof(info->bus_info));
 }
 
-static int ftgmac100_get_settings(struct net_device *netdev,
- struct ethtool_cmd *cmd)
-{
-   return phy_ethtool_gset(netdev->phydev, cmd);
-}
-
-static int ftgmac100_set_settings(struct net_device *netdev,
- struct ethtool_cmd *cmd)
-{
-   return phy_ethtool_sset(netdev->phydev, cmd);
-}
-
 static const struct ethtool_ops ftgmac100_ethtool_ops = {
-   .set_settings   = ftgmac100_set_settings,
-   .get_settings   = ftgmac100_get_settings,
.get_drvinfo= ftgmac100_get_drvinfo,
.get_link   = ethtool_op_get_link,
+   .get_link_ksettings = phy_ethtool_get_link_ksettings,
+   .set_link_ksettings = phy_ethtool_set_link_ksettings,
 };
 
 /**
-- 
1.7.4.4



[PATCH 1/2] net: ethernet: gianfar: use phydev from struct net_device

2016-05-15 Thread Philippe Reynes
The private structure contains a pointer to phydev, but struct
net_device already contains such a pointer. So we can remove the
phydev pointer from the private structure and update the driver to
use the one in struct net_device.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/freescale/gianfar.c |   42 +++--
 drivers/net/ethernet/freescale/gianfar.h |1 -
 drivers/net/ethernet/freescale/gianfar_ethtool.c |   24 +++--
 3 files changed, 35 insertions(+), 32 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar.c 
b/drivers/net/ethernet/freescale/gianfar.c
index a580041..7615e06 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -999,7 +999,7 @@ static int gfar_hwtstamp_get(struct net_device *netdev, 
struct ifreq *ifr)
 
 static int gfar_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 {
-   struct gfar_private *priv = netdev_priv(dev);
+   struct phy_device *phydev = dev->phydev;
 
if (!netif_running(dev))
return -EINVAL;
@@ -1009,10 +1009,10 @@ static int gfar_ioctl(struct net_device *dev, struct 
ifreq *rq, int cmd)
if (cmd == SIOCGHWTSTAMP)
return gfar_hwtstamp_get(dev, rq);
 
-   if (!priv->phydev)
+   if (!phydev)
return -ENODEV;
 
-   return phy_mii_ioctl(priv->phydev, rq, cmd);
+   return phy_mii_ioctl(phydev, rq, cmd);
 }
 
 static u32 cluster_entry_per_class(struct gfar_private *priv, u32 rqfar,
@@ -1635,7 +1635,7 @@ static int gfar_suspend(struct device *dev)
gfar_start_wol_filer(priv);
 
} else {
-   phy_stop(priv->phydev);
+   phy_stop(ndev->phydev);
}
 
return 0;
@@ -1664,7 +1664,7 @@ static int gfar_resume(struct device *dev)
gfar_filer_restore_table(priv);
 
} else {
-   phy_start(priv->phydev);
+   phy_start(ndev->phydev);
}
 
gfar_start(priv);
@@ -1698,8 +1698,8 @@ static int gfar_restore(struct device *dev)
priv->oldspeed = 0;
priv->oldduplex = -1;
 
-   if (priv->phydev)
-   phy_start(priv->phydev);
+   if (ndev->phydev)
+   phy_start(ndev->phydev);
 
netif_device_attach(ndev);
enable_napi(priv);
@@ -1778,6 +1778,7 @@ static int init_phy(struct net_device *dev)
priv->device_flags & FSL_GIANFAR_DEV_HAS_GIGABIT ?
GFAR_SUPPORTED_GBIT : 0;
phy_interface_t interface;
+   struct phy_device *phydev;
 
priv->oldlink = 0;
priv->oldspeed = 0;
@@ -1785,9 +1786,9 @@ static int init_phy(struct net_device *dev)
 
interface = gfar_get_interface(dev);
 
-   priv->phydev = of_phy_connect(dev, priv->phy_node, &adjust_link, 0,
- interface);
-   if (!priv->phydev) {
+   phydev = of_phy_connect(dev, priv->phy_node, &adjust_link, 0,
+   interface);
+   if (!phydev) {
dev_err(&dev->dev, "could not attach to PHY\n");
return -ENODEV;
}
@@ -1796,11 +1797,11 @@ static int init_phy(struct net_device *dev)
gfar_configure_serdes(dev);
 
/* Remove any features not supported by the controller */
-   priv->phydev->supported &= (GFAR_SUPPORTED | gigabit_support);
-   priv->phydev->advertising = priv->phydev->supported;
+   phydev->supported &= (GFAR_SUPPORTED | gigabit_support);
+   phydev->advertising = phydev->supported;
 
/* Add support for flow control, but don't advertise it by default */
-   priv->phydev->supported |= (SUPPORTED_Pause | SUPPORTED_Asym_Pause);
+   phydev->supported |= (SUPPORTED_Pause | SUPPORTED_Asym_Pause);
 
return 0;
 }
@@ -1944,7 +1945,7 @@ void stop_gfar(struct net_device *dev)
/* disable ints and gracefully shut down Rx/Tx DMA */
gfar_halt(priv);
 
-   phy_stop(priv->phydev);
+   phy_stop(dev->phydev);
 
free_skb_resources(priv);
 }
@@ -2204,7 +2205,7 @@ int startup_gfar(struct net_device *ndev)
priv->oldspeed = 0;
priv->oldduplex = -1;
 
-   phy_start(priv->phydev);
+   phy_start(ndev->phydev);
 
enable_napi(priv);
 
@@ -2572,8 +2573,7 @@ static int gfar_close(struct net_device *dev)
stop_gfar(dev);
 
/* Disconnect from the PHY */
-   phy_disconnect(priv->phydev);
-   priv->phydev = NULL;
+   phy_disconnect(dev->phydev);
 
gfar_free_irq(priv);
 
@@ -3379,7 +3379,7 @@ static irqreturn_t gfar_interrupt(int irq, void *grp_id)
 static void adjust_link(struct net_device *dev)
 {
struct gfar_private *priv = netdev_priv(dev);
-   struct phy_device *phydev = priv->phydev;
+   struct phy_device *phydev = dev->phydev;
 
if (unlikely(phydev->link != priv->oldlink ||
 (phydev->link && (phydev->duplex != 

[PATCH 2/2] net: ethernet: gianfar: use phy_ethtool_{get|set}_link_ksettings

2016-05-15 Thread Philippe Reynes
There are two generic functions phy_ethtool_{get|set}_link_ksettings,
so we can use them instead of defining the same code in the driver.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/freescale/gianfar_ethtool.c |   27 +
 1 files changed, 2 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar_ethtool.c 
b/drivers/net/ethernet/freescale/gianfar_ethtool.c
index 94a8dc5..56588f2 100644
--- a/drivers/net/ethernet/freescale/gianfar_ethtool.c
+++ b/drivers/net/ethernet/freescale/gianfar_ethtool.c
@@ -184,29 +184,6 @@ static void gfar_gdrvinfo(struct net_device *dev,
strlcpy(drvinfo->bus_info, "N/A", sizeof(drvinfo->bus_info));
 }
 
-
-static int gfar_set_ksettings(struct net_device *dev,
- const struct ethtool_link_ksettings *cmd)
-{
-   struct phy_device *phydev = dev->phydev;
-
-   if (!phydev)
-   return -ENODEV;
-
-   return phy_ethtool_ksettings_set(phydev, cmd);
-}
-
-static int gfar_get_ksettings(struct net_device *dev,
- struct ethtool_link_ksettings *cmd)
-{
-   struct phy_device *phydev = dev->phydev;
-
-   if (!phydev)
-   return -ENODEV;
-
-   return phy_ethtool_ksettings_get(phydev, cmd);
-}
-
 /* Return the length of the register structure */
 static int gfar_reglen(struct net_device *dev)
 {
@@ -1580,6 +1557,6 @@ const struct ethtool_ops gfar_ethtool_ops = {
.set_rxnfc = gfar_set_nfc,
.get_rxnfc = gfar_get_nfc,
.get_ts_info = gfar_get_ts_info,
-   .get_link_ksettings = gfar_get_ksettings,
-   .set_link_ksettings = gfar_set_ksettings,
+   .get_link_ksettings = phy_ethtool_get_link_ksettings,
+   .set_link_ksettings = phy_ethtool_set_link_ksettings,
 };
-- 
1.7.4.4



Re: OpenWRT wrong adjustment of fq_codel defaults (Was: [Codel] fq_codel_drop vs a udp flood)

2016-05-15 Thread Roman Yeryomin
On 16 May 2016 at 02:07, Eric Dumazet  wrote:
> On Mon, 2016-05-16 at 01:34 +0300, Roman Yeryomin wrote:
>
>> qdisc fq_codel 8003: parent :3 limit 1024p flows 16 quantum 1514
>> target 80.0ms ce_threshold 32us interval 100.0ms ecn
>>  Sent 1601271168 bytes 1057706 pkt (dropped 1422304, overlimits 0 requeues 
>> 17)
>>  backlog 1541252b 1018p requeues 17
>>   maxpacket 1514 drop_overlimit 1422304 new_flow_count 35 ecn_mark 0
>>   new_flows_len 0 old_flows_len 1
>
> Why do you have ce_threshold set ? You really should not (even if it
> does not matter for the kind of traffic you have at this moment)

No idea, it was there always. How do I unset it? Setting it to 0 doesn't help.

> If your expected link speed is around 1Gbps, or 80,000 packets per
> second, then you have to understand that 1024 packets limit is about 12
> ms at most.
>
> Even if the queue is full, max sojourn time of a packet would be 12 ms.
>
> I really do not see how 'target 80 ms' could be hit.

Well, as I said, I've tried different options. Neither target 20ms (as
Dave proposed) nor 12ms saves the situation.

> You basically have FQ, with no Codel effect, but with the associated
> cost of Codel (having to take timestamps)
>
>
>


Re: OpenWRT wrong adjustment of fq_codel defaults (Was: [Codel] fq_codel_drop vs a udp flood)

2016-05-15 Thread Eric Dumazet
On Mon, 2016-05-16 at 01:34 +0300, Roman Yeryomin wrote:

> qdisc fq_codel 8003: parent :3 limit 1024p flows 16 quantum 1514
> target 80.0ms ce_threshold 32us interval 100.0ms ecn
>  Sent 1601271168 bytes 1057706 pkt (dropped 1422304, overlimits 0 requeues 17)
>  backlog 1541252b 1018p requeues 17
>   maxpacket 1514 drop_overlimit 1422304 new_flow_count 35 ecn_mark 0
>   new_flows_len 0 old_flows_len 1

Why do you have ce_threshold set ? You really should not (even if it
does not matter for the kind of traffic you have at this moment)

If your expected link speed is around 1Gbps, or 80,000 packets per
second, then you have to understand that 1024 packets limit is about 12
ms at most.

Even if the queue is full, max sojourn time of a packet would be 12 ms.

I really do not see how 'target 80 ms' could be hit.

You basically have FQ, with no Codel effect, but with the associated
cost of Codel (having to take timestamps)





Re: OpenWRT wrong adjustment of fq_codel defaults (Was: [Codel] fq_codel_drop vs a udp flood)

2016-05-15 Thread Roman Yeryomin
On 7 May 2016 at 12:57, Kevin Darbyshire-Bryant wrote:
>
>
> On 06/05/16 10:42, Jesper Dangaard Brouer wrote:
>> Hi Felix,
>>
>> This is an important fix for OpenWRT, please read!
>>
>> OpenWRT changed the default fq_codel sch->limit from 10240 to 1024,
>> without also adjusting q->flows_cnt.  Eric explains below that you must
>> also adjust the buckets (q->flows_cnt) for this not to break. (Just
>> adjust it to 128)
>>
>> Problematic OpenWRT commit in question:
>>  http://git.openwrt.org/?p=openwrt.git;a=patch;h=12cd6578084e
>>  12cd6578084e ("kernel: revert fq_codel quantum override to prevent it from 
>> causing too much cpu load with higher speed (#21326)")
> I 'pull requested' this to the lede-staging tree on github.
> https://github.com/lede-project/staging/pull/11
>
> One way or another Felix & co should see the change :-)

If you would follow the white rabbit, you would see that it doesn't help

>>
>>
>> I also highly recommend you cherry-pick this very recent commit:
>>  net-next: 9d18562a2278 ("fq_codel: add batch ability to fq_codel_drop()")
>>  https://git.kernel.org/davem/net-next/c/9d18562a227
>>
>> This should fix very high CPU usage in-case fq_codel goes into drop mode.
>> The problem is that drop mode was considered rare, and implementation
>> wise it was chosen to be more expensive (to save cycles on normal mode).
>> Unfortunately is it easy to trigger with an UDP flood. Drop mode is
>> especially expensive for smaller devices, as it scans a 4K big array,
>> thus 64 cache misses for small devices!
>>
>> The fix is to allow drop-mode to bulk-drop more packets when entering
>> drop-mode (default 64 bulk drop).  That way we don't suddenly
>> experience a significantly higher processing cost per packet, but
>> instead can amortize this.
> I haven't done the above cherry-pick patch & backport patch creation for
> 4.4/4.1/3.18 yet - maybe if $dayjob permits time and no one else beats
> me to it :-)
>
> Kevin
>


Re: OpenWRT wrong adjustment of fq_codel defaults (Was: [Codel] fq_codel_drop vs a udp flood)

2016-05-15 Thread Roman Yeryomin
On 6 May 2016 at 22:43, Dave Taht  wrote:
> On Fri, May 6, 2016 at 11:56 AM, Roman Yeryomin  wrote:
>> On 6 May 2016 at 21:43, Roman Yeryomin  wrote:
>>> On 6 May 2016 at 15:47, Jesper Dangaard Brouer  wrote:

 I've created a OpenWRT ticket[1] on this issue, as it seems that someone[2]
 closed Felix'es OpenWRT email account (bad choice! emails bouncing).
 Sounds like OpenWRT and the LEDE https://www.lede-project.org/ project
 is in some kind of conflict.

 OpenWRT ticket [1] https://dev.openwrt.org/ticket/22349

 [2] 
 http://thread.gmane.org/gmane.comp.embedded.openwrt.devel/40298/focus=40335
>>>
>>> OK, so, after porting the patch to 4.1 openwrt kernel and playing a
>>> bit with fq_codel limits I was able to get 420Mbps UDP like this:
>>> tc qdisc replace dev wlan0 parent :1 fq_codel flows 16 limit 256
>>
>> Forgot to mention, I've reduced drop_batch_size down to 32
>
> 0) Not clear to me if that's the right line, there are 4 wifi queues,
> and the third one
> is the BE queue.

That was an example, sorry, should have stated that. I've applied same
settings to all 4 queues.

> That is too low a limit, also, for normal use. And:
> for the purpose of this particular UDP test, flows 16 is ok, but not
> ideal.

I played with different combinations, it doesn't make any
(significant) difference: 20-30Mbps, not more.
What numbers would you propose?

> 1) What's the tcp number (with a simultaneous ping) with this latest patchset?
> (I care about tcp performance a lot more than udp floods - surviving a
> udp flood yes, performance, no)

During the test (both TCP and UDP) it's roughly 5ms on average, ~2ms
when not running tests. Actually I'm now wondering if target is working
at all, because I had the same result with target 80ms..
So, yes, latency is good, but performance is poor.

> before/after?
>
> tc -s qdisc show dev wlan0 during/after results?

during the test:

qdisc mq 0: root
 Sent 1600496000 bytes 1057194 pkt (dropped 1421568, overlimits 0 requeues 17)
 backlog 1545794b 1021p requeues 17
qdisc fq_codel 8001: parent :1 limit 1024p flows 16 quantum 1514
target 80.0ms ce_threshold 32us interval 100.0ms ecn
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 8002: parent :2 limit 1024p flows 16 quantum 1514
target 80.0ms ce_threshold 32us interval 100.0ms ecn
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 8003: parent :3 limit 1024p flows 16 quantum 1514
target 80.0ms ce_threshold 32us interval 100.0ms ecn
 Sent 1601271168 bytes 1057706 pkt (dropped 1422304, overlimits 0 requeues 17)
 backlog 1541252b 1018p requeues 17
  maxpacket 1514 drop_overlimit 1422304 new_flow_count 35 ecn_mark 0
  new_flows_len 0 old_flows_len 1
qdisc fq_codel 8004: parent :4 limit 1024p flows 16 quantum 1514
target 80.0ms ce_threshold 32us interval 100.0ms ecn
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0


after the test (60sec):

qdisc mq 0: root
 Sent 3084996052 bytes 2037744 pkt (dropped 2770176, overlimits 0 requeues 28)
 backlog 0b 0p requeues 28
qdisc fq_codel 8001: parent :1 limit 1024p flows 16 quantum 1514
target 80.0ms ce_threshold 32us interval 100.0ms ecn
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 8002: parent :2 limit 1024p flows 16 quantum 1514
target 80.0ms ce_threshold 32us interval 100.0ms ecn
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 8003: parent :3 limit 1024p flows 16 quantum 1514
target 80.0ms ce_threshold 32us interval 100.0ms ecn
 Sent 3084996052 bytes 2037744 pkt (dropped 2770176, overlimits 0 requeues 28)
 backlog 0b 0p requeues 28
  maxpacket 1514 drop_overlimit 2770176 new_flow_count 64 ecn_mark 0
  new_flows_len 0 old_flows_len 1
qdisc fq_codel 8004: parent :4 limit 1024p flows 16 quantum 1514
target 80.0ms ce_threshold 32us interval 100.0ms ecn
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0


> IF you are doing builds for the archer c7v2, I can join in on this... (?)

I'm not but I have c7 somewhere, so I can do a build for it and also
test, so we are on the same page.

> I did do a test of the ath10k "before", fq_codel *never engaged*, and
> tcp induced latencies 

Re: [PATCH net-next] net: also make sch_handle_egress() drop monitor ready

2016-05-15 Thread Alexei Starovoitov
On Sun, May 15, 2016 at 11:28:29PM +0200, Daniel Borkmann wrote:
> Follow-up for 8a3a4c6e7b34 ("net: make sch_handle_ingress() drop
> monitor ready") to also make the egress side drop monitor ready.
> 
> Also here only TC_ACT_SHOT is a clear indication that something
> went wrong. Hence don't provide false positives to drop monitors
> such as 'perf record -e skb:kfree_skb ...'.
> 
> Signed-off-by: Daniel Borkmann 

Acked-by: Alexei Starovoitov 



[PATCH net-next] net: also make sch_handle_egress() drop monitor ready

2016-05-15 Thread Daniel Borkmann
Follow-up for 8a3a4c6e7b34 ("net: make sch_handle_ingress() drop
monitor ready") to also make the egress side drop monitor ready.

Also here only TC_ACT_SHOT is a clear indication that something
went wrong. Hence don't provide false positives to drop monitors
such as 'perf record -e skb:kfree_skb ...'.

Signed-off-by: Daniel Borkmann 
---
 net/core/dev.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 12436d1..904ff43 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3186,12 +3186,12 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct 
net_device *dev)
case TC_ACT_SHOT:
qdisc_qstats_cpu_drop(cl->q);
*ret = NET_XMIT_DROP;
-   goto drop;
+   kfree_skb(skb);
+   return NULL;
case TC_ACT_STOLEN:
case TC_ACT_QUEUED:
*ret = NET_XMIT_SUCCESS;
-drop:
-   kfree_skb(skb);
+   consume_skb(skb);
return NULL;
case TC_ACT_REDIRECT:
/* No need to push/pop skb's mac_header here on egress! */
-- 
1.9.3



Re: BUG: use-after-free in netlink_dump

2016-05-15 Thread Cong Wang
On Sun, May 15, 2016 at 8:24 AM, Baozeng Ding  wrote:
> Hi all,
> I've got the following report (use-after-free in netlink_dump) while running
> syzkaller.
> Unfortunately no reproducer. The kernel version is 4.6.0-rc2+.
...
> Call Trace:
>  [< inline >] __dump_stack lib/dump_stack.c:15
>  [] dump_stack+0xb3/0x112 lib/dump_stack.c:51
>  [] print_trailer+0x10d/0x190 mm/slub.c:667
>  [] object_err+0x2f/0x40 mm/slub.c:674
>  [< inline >] print_address_description mm/kasan/report.c:179
>  [] kasan_report_error+0x218/0x530 mm/kasan/report.c:275
>  [< inline >] kasan_report mm/kasan/report.c:297
>  [] __asan_report_load4_noabort+0x3e/0x40
> mm/kasan/report.c:317
>  [< inline >] ? nlmsg_put_answer include/net/netlink.h:471
>  [] ? netlink_dump+0x4eb/0xa40
> net/netlink/af_netlink.c:2120
>  [< inline >] nlmsg_put_answer include/net/netlink.h:471
>  [] netlink_dump+0x4eb/0xa40 net/netlink/af_netlink.c:2120
>  [] netlink_recvmsg+0x8fb/0xe00
> net/netlink/af_netlink.c:1869

Similar to what Richard reported, I think the problem is cb->skb,
which is exposed to other threads since cb is per netlink socket
(cb = &nlk->cb). IOW, the cb->skb is freed by one thread at the
end of netlink_dump() meanwhile the other thread is still using
it via NETLINK_CB(cb->skb).portid.

I am guessing we miss some skb_get():

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index aeefe12..142bb39 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2184,7 +2184,7 @@ int __netlink_dump_start(struct sock *ssk,
struct sk_buff *skb,
cb->data = control->data;
cb->module = control->module;
cb->min_dump_alloc = control->min_dump_alloc;
-   cb->skb = skb;
+   cb->skb = skb_get(skb);

nlk->cb_running = true;

meanwhile the cb->skb is still "freed" by the consume_skb(cb->skb).


Re: BUG: net/ipv4: KASAN: use-after-free in tcp_v4_rcv

2016-05-15 Thread Eric Dumazet
On Mon, 2016-05-16 at 00:02 +0800, Baozeng Ding wrote:
> Hi all,
> I've got the following report of a use-after-free in tcp_v4_rcv while running
> syzkaller.
> Unfortunately no reproducer. The kernel version is 4.6.0-rc2+.
> 
> ===
> BUG: KASAN: use-after-free in tcp_v4_rcv+0x2144/0x2c20 at addr 
> 8800380279c0
> Write of size 8 by task syz-executor/7055
> =
> BUG skbuff_head_cache (Tainted: GB D): kasan: bad access 
> detected
> -
> 
> INFO: Freed in e1000_clean+0xa08/0x24a0 age=6364136532 cpu=2226773637 pid=-1
> [< inline >] napi_poll net/core/dev.c:5087
> [<  none  >] net_rx_action+0x751/0xd80 net/core/dev.c:5152
> [<  none  >] __do_softirq+0x22b/0x8da kernel/softirq.c:273
> [< inline >] invoke_softirq kernel/softirq.c:350
> [<  none  >] irq_exit+0x15d/0x190 kernel/softirq.c:391
> [< inline >] exiting_irq ./arch/x86/include/asm/apic.h:658
> [<  none  >] do_IRQ+0x86/0x1a0 arch/x86/kernel/irq.c:252
> [<  none  >] ret_from_intr+0x0/0x20 arch/x86/entry/entry_64.S:454
> [<  none  >] kfree_skbmem+0xe6/0x100 net/core/skbuff.c:622
> [<  none  >] __slab_free+0x1e8/0x300 mm/slub.c:2657
> [< inline >] slab_free mm/slub.c:2810
> [<  none  >] kmem_cache_free+0x298/0x320 mm/slub.c:2819
> [<  none  >] kfree_skbmem+0xe6/0x100 net/core/skbuff.c:622
> [<  none  >] __kfree_skb+0x1d/0x20 net/core/skbuff.c:684
> [<  none  >] kfree_skb+0x107/0x310 net/core/skbuff.c:704
> [<  none  >] packet_rcv_spkt+0xd8/0x4a0 net/packet/af_packet.c:1822
> [< inline >] deliver_skb net/core/dev.c:1814
> [< inline >] deliver_ptype_list_skb net/core/dev.c:1829
> [<  none  >] __netif_receive_skb_core+0x134a/0x3060 
> net/core/dev.c:4143
> [<  none  >] __netif_receive_skb+0x2a/0x160 net/core/dev.c:4198
> 

Above stack trace looks suspicious.

It looks like __netif_receive_skb() is called from a context with BH
enabled.

Some hard irq is happening, and invoke_softirq() enters __do_softirq()

Getting more depth in this stack trace would be nice ?


> 
> Call Trace:
>   [< inline >] __dump_stack lib/dump_stack.c:15
>   [] dump_stack+0xb3/0x112 lib/dump_stack.c:51
>   [] print_trailer+0x10d/0x190 mm/slub.c:667
>   [] object_err+0x2f/0x40 mm/slub.c:674
>   [< inline >] print_address_description mm/kasan/report.c:179
>   [] kasan_report_error+0x218/0x530 mm/kasan/report.c:275
>   [] ? tcp_v4_rcv+0x1d14/0x2c20 net/ipv4/tcp_ipv4.c:1653
>   [< inline >] kasan_report mm/kasan/report.c:297
>   [] __asan_report_store8_noabort+0x3e/0x40 
> mm/kasan/report.c:323
>   [< inline >] ? nf_reset include/linux/skbuff.h:3464
>   [] ? tcp_v4_rcv+0x1c21/0x2c20 net/ipv4/tcp_ipv4.c:1639
>   [< inline >] ? __sk_add_backlog include/net/sock.h:810
>   [< inline >] ? sk_add_backlog include/net/sock.h:843
>   [] ? tcp_v4_rcv+0x2144/0x2c20 net/ipv4/tcp_ipv4.c:1659
>   [< inline >] __sk_add_backlog include/net/sock.h:810
>   [< inline >] sk_add_backlog include/net/sock.h:843
>   [] tcp_v4_rcv+0x2144/0x2c20 net/ipv4/tcp_ipv4.c:1659
>   [] ? raw_local_deliver+0x7c1/0xae0 net/ipv4/raw.c:221
>   [] ? nf_iterate+0x1aa/0x230 net/netfilter/core.c:289
>   [] ? nf_iterate+0x230/0x230 net/netfilter/core.c:268
>   [] ip_local_deliver_finish+0x2b0/0xa50 
> net/ipv4/ip_input.c:216
>   [< inline >] ? __skb_pull include/linux/skbuff.h:1900
>   [] ? ip_local_deliver_finish+0x12a/0xa50 
> net/ipv4/ip_input.c:194
>   [< inline >] NF_HOOK_THRESH include/linux/netfilter.h:219
>   [< inline >] NF_HOOK include/linux/netfilter.h:242
>   [] ip_local_deliver+0x1b3/0x350 net/ipv4/ip_input.c:257
>   [] ? ip_call_ra_chain+0x540/0x540 
> net/ipv4/ip_input.c:163
>   [] ? ip_rcv_finish+0x1ab0/0x1ab0 
> include/net/net_namespace.h:259
>   [< inline >] dst_input include/net/dst.h:510
>   [] ip_rcv_finish+0x679/0x1ab0 net/ipv4/ip_input.c:388
>   [] ? sk_filter+0x7f/0xe50 net/core/filter.c:94
>   [< inline >] NF_HOOK_THRESH include/linux/netfilter.h:219
>   [< inline >] NF_HOOK include/linux/netfilter.h:242
>   [] ip_rcv+0x963/0x10c0 net/ipv4/ip_input.c:478
>   [] ? ip_local_deliver+0x350/0x350 
> net/ipv4/ip_input.c:250
>   [] ? skb_release_data+0x3d2/0x430 net/core/skbuff.c:599
>   [] ? inet_del_offload+0x40/0x40 ??:?
>   [] ? packet_rcv_spkt+0xdd/0x4a0 
> net/packet/af_packet.c:1822
>   [] ? ip_local_deliver+0x350/0x350 
> net/ipv4/ip_input.c:250
>   [] __netif_receive_skb_core+0x168d/0x3060 
> net/core/dev.c:4160
>   [] ? netif_wake_subqueue+0x220/0x220 
> include/linux/compiler.h:222
>   [< inline >] ? ktime_get_real include/linux/timekeeping.h:179
>   [< inline >] ? __net_timestamp include/linux/skbuff.h:3099

Re: r8169: Unconditionally disabling ASPM

2016-05-15 Thread Francois Romieu
Paul Menzel  :
[...]
> As over five years have passed now, do you think that is still needed?
> I wonder why no module parameter was added back then, where users could
> enable ASPM if it works on their systems? Because there is no such
> situation and it always fails?

It was enabled again (d64ec841517a25f6d468bde9f67e5b4cffdc67c7) then
disabled (4521e1a94279ce610d3f9b7945c17d581f804242). It's closer
to 3.5 years :o)

Module parameters are frowned upon.

Lin, is there some interest in selectively [*] enabling (or disabling)
ASPM support in the r8169 driver or will it be unreliable ?

[*] Based on DMI information for instance.

-- 
Ueimor


Re: [PATCH] ethernet:arc: Fix racing of TX ring buffer

2016-05-15 Thread Francois Romieu
Shuyu Wei  :
[...]
> I still have a question, is it possible that tx_clean() runs
> between   priv->tx_buff[*txbd_curr].skb = skb   and   dma_wmb()?

A (previous) run can take place after priv->tx_buff[*txbd_curr].skb and
before *info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len).

So, yes, the driver must check in arc_emac_tx_clean() that 1) either
txbd_dirty != txbd_curr or 2) "info" is not consistent with a still-not-used
status word. Please be patient with me and get rid of the useless "i"

diff --git a/drivers/net/ethernet/arc/emac_main.c 
b/drivers/net/ethernet/arc/emac_main.c
index a3a9392..337ea3b 100644
--- a/drivers/net/ethernet/arc/emac_main.c
+++ b/drivers/net/ethernet/arc/emac_main.c
@@ -153,9 +153,8 @@ static void arc_emac_tx_clean(struct net_device *ndev)
 {
struct arc_emac_priv *priv = netdev_priv(ndev);
	struct net_device_stats *stats = &ndev->stats;
-   unsigned int i;
 
-   for (i = 0; i < TX_BD_NUM; i++) {
+   while (priv->txbd_dirty != priv->txbd_curr) {
		unsigned int *txbd_dirty = &priv->txbd_dirty;
		struct arc_emac_bd *txbd = &priv->txbd[*txbd_dirty];
		struct buffer_state *tx_buff = &priv->tx_buff[*txbd_dirty];

-- 
Ueimor


Re: [PATCH] nf_conntrack: avoid kernel pointer value leak in slab name

2016-05-15 Thread Linus Torvalds
On Sat, May 14, 2016 at 2:31 PM, Linus Torvalds
 wrote:
>
> "u64" is indeed "unsigned long long" on x86 and many other
> architectures, but on alpha and ia64 it's just "unsigned long".

Actually, I take that back.

In the kernel, it seems to always be "unsigned long long", even on
alpha and ia64.

We do have a "int-l64.h" file that typedef's __u64 to be just unsigned
long, and yes, that file is included for alpha and ia64, but it seems
that that only happens when __KERNEL__ is not defined.

So it does seem like using "%llu" and u64 is fine. Not in general, but
inside the kernel it's ok.

  Linus


Re: [net-next 00/13][pull request] 40GbE Intel Wired LAN Driver Updates 2016-05-14

2016-05-15 Thread David Miller
From: Jeff Kirsher 
Date: Sat, 14 May 2016 21:57:22 -0700

> This series contains updates to i40e and i40evf.

Pulled, thanks Jeff.


Re: [PATCH net-next v2 0/9] bnxt_en: updates for net-next.

2016-05-15 Thread David Miller
From: Michael Chan 
Date: Sun, 15 May 2016 03:04:42 -0400

> Non-critical bug fixes, improvements, a new ethtool feature, and a new
> device ID.
> 
> v2: Fixed a bug in bnxt_get_module_eeprom() found by Ben Hutchings.

Series applied, thanks.



r8169: Unconditionally disabling ASPM

2016-05-15 Thread Paul Menzel
Dear Linux folks,


Running the Firmware Test Suite (fwts) [1] on an ASRock E350M1, it
suggests that ASPM should be enabled.

The module r8169 disables ASPM since the commit below.

commit ba04c7c93bbcb48ce880cf75b6e9dffcd79d4c7b
Author: Stanislaw Gruszka 
Date:   Tue Feb 22 02:00:11 2011 +

r8169: disable ASPM

For some time is known that ASPM is causing troubles on r8169, i.e. make
device randomly stop working without any errors in dmesg.

Currently Tomi Leppikangas reports that system with r8169 device hangs
with MCE errors when ASPM is enabled:
https://bugzilla.redhat.com/show_bug.cgi?id=642861#c4

Lets disable ASPM for r8169 devices at all, to avoid problems with
r8169 PCIe devices at least for some users.

Reported-by: Tomi Leppikangas 
Cc: sta...@kernel.org
Signed-off-by: Stanislaw Gruszka 
Signed-off-by: David S. Miller 

As over five years have passed now, do you think that is still needed?
I wonder why no module parameter was added back then, where users could
enable ASPM if it works on their systems? Because there is no such
situation and it always fails?


Thanks,

Paul


[1] https://wiki.ubuntu.com/FirmwareTestSuite



Re: [PATCH v11 net-next 0/1] introduce Hyper-V VM Sockets(hv_sock)

2016-05-15 Thread David Miller
From: Dexuan Cui 
Date: Sun, 15 May 2016 09:52:42 -0700

> Changes since v10
> 
> 1) add module params: send_ring_page, recv_ring_page. They can be used to
> enlarge the ringbuffer size to get better performance, e.g.,
> # modprobe hv_sock  recv_ring_page=16 send_ring_page=16
> By default, recv_ring_page is 3 and send_ring_page is 2.
> 
> 2) add module param max_socket_number (the default is 1024).
> A user can enlarge the number to create more than 1024 hv_sock sockets.
> By default, 1024 sockets take about 1024 * (3+2+1+1) * 4KB = 28M bytes.
> (Here 1+1 means 1 page for send/recv buffers per connection, respectively.)

This is papering around my objections, and creates module parameters which
I am fundamentally against.

You're making the facility unusable by default, just to work around my
memory consumption concerns.

What will end up happening is that everyone will simply increase the
values.

You're not really addressing the core issue, and I will be ignoring your
future submissions of this change until you do.


[PATCH iproute2 -next] ingress, clsact: don't add TCA_OPTIONS to nl msg

2016-05-15 Thread Daniel Borkmann
In ingress and clsact qdisc TCA_OPTIONS are ignored, since it's
parameterless. In tc, we add an empty addattr_l(... TCA_OPTIONS,
NULL, 0) to the netlink message nevertheless. This has the
side effect that when someone tries a 'tc qdisc replace' and
already an existing such qdisc is present, tc fails with
EINVAL here.

Reason is that in the kernel, this invokes qdisc_change() when
such requested qdisc is already present. When TCA_OPTIONS are
passed to modify parameters, it looks whether qdisc implements
.change() callback, and if not present (like in both cases here)
it returns with error. Rather than adding an empty stub to the
kernel that ignores TCA_OPTIONS again, just don't add TCA_OPTIONS
to the netlink message in the first place.

Before:

  # tc qdisc replace dev foo clsact# first try
  # tc qdisc replace dev foo clsact# second one
  RTNETLINK answers: Invalid argument

After:

  # tc qdisc replace dev foo clsact
  # tc qdisc replace dev foo clsact
  # tc qdisc replace dev foo clsact

Signed-off-by: Daniel Borkmann 
---
 tc/q_clsact.c  | 1 -
 tc/q_ingress.c | 1 -
 2 files changed, 2 deletions(-)

diff --git a/tc/q_clsact.c b/tc/q_clsact.c
index 0c05dbd..e2a1a71 100644
--- a/tc/q_clsact.c
+++ b/tc/q_clsact.c
@@ -18,7 +18,6 @@ static int clsact_parse_opt(struct qdisc_util *qu, int argc, 
char **argv,
return -1;
}
 
-   addattr_l(n, 1024, TCA_OPTIONS, NULL, 0);
return 0;
 }
 
diff --git a/tc/q_ingress.c b/tc/q_ingress.c
index c3c9b40..31699a8 100644
--- a/tc/q_ingress.c
+++ b/tc/q_ingress.c
@@ -34,7 +34,6 @@ static int ingress_parse_opt(struct qdisc_util *qu, int argc, 
char **argv,
}
}
 
-   addattr_l(n, 1024, TCA_OPTIONS, NULL, 0);
return 0;
 }
 
-- 
1.9.3



BUG: net/ipv4: KASAN: use-after-free in tcp_sendmsg

2016-05-15 Thread Baozeng Ding

Hi all,
I've got the following report of a use-after-free in tcp_sendmsg (net/ipv4)
while running syzkaller.

Unfortunately no reproducer. The kernel version is 4.6.0-rc2+.

==
BUG: KASAN: use-after-free in release_sock+0x4a0/0x510 at addr 
8800380279c0

Read of size 8 by task sshd/7035
=
BUG skbuff_head_cache (Tainted: GB D): kasan: bad access 
detected

-

INFO: Freed in e1000_clean+0xa08/0x24a0 age=6364136656 cpu=2226773637 pid=-1
[< inline >] napi_poll net/core/dev.c:5087
[<  none  >] net_rx_action+0x751/0xd80 net/core/dev.c:5152
[<  none  >] __do_softirq+0x22b/0x8da kernel/softirq.c:273
[< inline >] invoke_softirq kernel/softirq.c:350
[<  none  >] irq_exit+0x15d/0x190 kernel/softirq.c:391
[< inline >] exiting_irq ./arch/x86/include/asm/apic.h:658
[<  none  >] do_IRQ+0x86/0x1a0 arch/x86/kernel/irq.c:252
[<  none  >] ret_from_intr+0x0/0x20 arch/x86/entry/entry_64.S:454
[<  none  >] kfree_skbmem+0xe6/0x100 net/core/skbuff.c:622
[<  none  >] __slab_free+0x1e8/0x300 mm/slub.c:2657
[< inline >] slab_free mm/slub.c:2810
[<  none  >] kmem_cache_free+0x298/0x320 mm/slub.c:2819
[<  none  >] kfree_skbmem+0xe6/0x100 net/core/skbuff.c:622
[<  none  >] __kfree_skb+0x1d/0x20 net/core/skbuff.c:684
[<  none  >] kfree_skb+0x107/0x310 net/core/skbuff.c:704
[<  none  >] packet_rcv_spkt+0xd8/0x4a0 net/packet/af_packet.c:1822
[< inline >] deliver_skb net/core/dev.c:1814
[< inline >] deliver_ptype_list_skb net/core/dev.c:1829
[<  none  >] __netif_receive_skb_core+0x134a/0x3060 
net/core/dev.c:4143

[<  none  >] __netif_receive_skb+0x2a/0x160 net/core/dev.c:4198

Call Trace:
 [< inline >] __dump_stack lib/dump_stack.c:15
 [] dump_stack+0xb3/0x112 lib/dump_stack.c:51
 [] print_trailer+0x10d/0x190 mm/slub.c:667
 [] object_err+0x2f/0x40 mm/slub.c:674
 [< inline >] print_address_description mm/kasan/report.c:179
 [] kasan_report_error+0x218/0x530 mm/kasan/report.c:275
 [< inline >] ? rdtsc ./arch/x86/include/asm/msr.h:155
 [< inline >] ? rdtsc_ordered ./arch/x86/include/asm/msr.h:183
 [] ? delay_tsc+0x18/0x70 arch/x86/lib/delay.c:58
 [< inline >] kasan_report mm/kasan/report.c:297
 [] __asan_report_load8_noabort+0x3e/0x40 
mm/kasan/report.c:318
 [< inline >] ? __raw_spin_unlock 
include/linux/spinlock_api_smp.h:153
 [] ? _raw_spin_unlock+0x20/0x30 
kernel/locking/spinlock.c:183

 [< inline >] ? __release_sock net/core/sock.c:1984
 [] ? release_sock+0x4a0/0x510 net/core/sock.c:2442
 [< inline >] __release_sock net/core/sock.c:1984
 [] release_sock+0x4a0/0x510 net/core/sock.c:2442
 [] tcp_sendmsg+0x1de/0x2a90 net/ipv4/tcp.c:1293
 [] ? tcp_sendpage+0x1820/0x1820 
include/linux/skbuff.h:1491

 [< inline >] ? sock_rps_record_flow include/net/sock.h:878
 [] ? inet_sendmsg+0x73/0x4c0 net/ipv4/af_inet.c:733
 [< inline >] ? rcu_read_unlock include/linux/rcupdate.h:922
 [< inline >] ? sock_rps_record_flow_hash include/net/sock.h:871
 [< inline >] ? sock_rps_record_flow include/net/sock.h:878
 [] ? inet_sendmsg+0x1fa/0x4c0 net/ipv4/af_inet.c:733
 [] inet_sendmsg+0x2f5/0x4c0 net/ipv4/af_inet.c:740
 [< inline >] ? sock_rps_record_flow include/net/sock.h:878
 [] ? inet_sendmsg+0x73/0x4c0 net/ipv4/af_inet.c:733
 [] ? inet_recvmsg+0x4a0/0x4a0 
include/linux/compiler.h:222

 [< inline >] sock_sendmsg_nosec net/socket.c:612
 [] sock_sendmsg+0xca/0x110 net/socket.c:622
 [] sock_write_iter+0x216/0x3a0 net/socket.c:821
 [] ? sock_sendmsg+0x110/0x110 net/socket.c:612
 [] ? iov_iter_init+0xaf/0x1d0 lib/iov_iter.c:359
 [< inline >] new_sync_write fs/read_write.c:518
 [] __vfs_write+0x300/0x4b0 fs/read_write.c:531
 [] ? do_iter_readv_writev+0x2b0/0x2b0 
fs/read_write.c:707
 [] ? retarget_shared_pending+0x210/0x210 
include/linux/signal.h:117

 [< inline >] ? spin_unlock_irq include/linux/spinlock.h:357
 [] ? __set_current_blocked+0x80/0xa0 
kernel/signal.c:2490
 [] ? apparmor_file_permission+0x22/0x30 
security/apparmor/lsm.c:446

 [] ? rw_verify_area+0x102/0x2c0 fs/read_write.c:448
 [] vfs_write+0x167/0x4a0 fs/read_write.c:578
 [< inline >] SYSC_write fs/read_write.c:625
 [] SyS_write+0x111/0x220 fs/read_write.c:617
 [] ? SyS_read+0x220/0x220 fs/read_write.c:599
 [] ? trace_hardirqs_on_thunk+0x1b/0x1d 
arch/x86/entry/thunk_64.S:42
 [] entry_SYSCALL_64_fastpath+0x23/0xc1 
arch/x86/entry/entry_64.S:207

Memory state around the buggy address:
 880038027880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 880038027900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>880038027980: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb

BUG: net/ipv4: KASAN: use-after-free in tcp_v4_rcv

2016-05-15 Thread Baozeng Ding

Hi all,
I've got the following report of a use-after-free in tcp_v4_rcv while running
syzkaller.

Unfortunately no reproducer. The kernel version is 4.6.0-rc2+.

===
BUG: KASAN: use-after-free in tcp_v4_rcv+0x2144/0x2c20 at addr 
8800380279c0

Write of size 8 by task syz-executor/7055
=
BUG skbuff_head_cache (Tainted: GB D): kasan: bad access 
detected

-

INFO: Freed in e1000_clean+0xa08/0x24a0 age=6364136532 cpu=2226773637 pid=-1
[< inline >] napi_poll net/core/dev.c:5087
[<  none  >] net_rx_action+0x751/0xd80 net/core/dev.c:5152
[<  none  >] __do_softirq+0x22b/0x8da kernel/softirq.c:273
[< inline >] invoke_softirq kernel/softirq.c:350
[<  none  >] irq_exit+0x15d/0x190 kernel/softirq.c:391
[< inline >] exiting_irq ./arch/x86/include/asm/apic.h:658
[<  none  >] do_IRQ+0x86/0x1a0 arch/x86/kernel/irq.c:252
[<  none  >] ret_from_intr+0x0/0x20 arch/x86/entry/entry_64.S:454
[<  none  >] kfree_skbmem+0xe6/0x100 net/core/skbuff.c:622
[<  none  >] __slab_free+0x1e8/0x300 mm/slub.c:2657
[< inline >] slab_free mm/slub.c:2810
[<  none  >] kmem_cache_free+0x298/0x320 mm/slub.c:2819
[<  none  >] kfree_skbmem+0xe6/0x100 net/core/skbuff.c:622
[<  none  >] __kfree_skb+0x1d/0x20 net/core/skbuff.c:684
[<  none  >] kfree_skb+0x107/0x310 net/core/skbuff.c:704
[<  none  >] packet_rcv_spkt+0xd8/0x4a0 net/packet/af_packet.c:1822
[< inline >] deliver_skb net/core/dev.c:1814
[< inline >] deliver_ptype_list_skb net/core/dev.c:1829
[<  none  >] __netif_receive_skb_core+0x134a/0x3060 
net/core/dev.c:4143

[<  none  >] __netif_receive_skb+0x2a/0x160 net/core/dev.c:4198


Call Trace:
 [< inline >] __dump_stack lib/dump_stack.c:15
 [] dump_stack+0xb3/0x112 lib/dump_stack.c:51
 [] print_trailer+0x10d/0x190 mm/slub.c:667
 [] object_err+0x2f/0x40 mm/slub.c:674
 [< inline >] print_address_description mm/kasan/report.c:179
 [] kasan_report_error+0x218/0x530 mm/kasan/report.c:275
 [] ? tcp_v4_rcv+0x1d14/0x2c20 net/ipv4/tcp_ipv4.c:1653
 [< inline >] kasan_report mm/kasan/report.c:297
 [] __asan_report_store8_noabort+0x3e/0x40 
mm/kasan/report.c:323

 [< inline >] ? nf_reset include/linux/skbuff.h:3464
 [] ? tcp_v4_rcv+0x1c21/0x2c20 net/ipv4/tcp_ipv4.c:1639
 [< inline >] ? __sk_add_backlog include/net/sock.h:810
 [< inline >] ? sk_add_backlog include/net/sock.h:843
 [] ? tcp_v4_rcv+0x2144/0x2c20 net/ipv4/tcp_ipv4.c:1659
 [< inline >] __sk_add_backlog include/net/sock.h:810
 [< inline >] sk_add_backlog include/net/sock.h:843
 [] tcp_v4_rcv+0x2144/0x2c20 net/ipv4/tcp_ipv4.c:1659
 [] ? raw_local_deliver+0x7c1/0xae0 net/ipv4/raw.c:221
 [] ? nf_iterate+0x1aa/0x230 net/netfilter/core.c:289
 [] ? nf_iterate+0x230/0x230 net/netfilter/core.c:268
 [] ip_local_deliver_finish+0x2b0/0xa50 
net/ipv4/ip_input.c:216

 [< inline >] ? __skb_pull include/linux/skbuff.h:1900
 [] ? ip_local_deliver_finish+0x12a/0xa50 
net/ipv4/ip_input.c:194

 [< inline >] NF_HOOK_THRESH include/linux/netfilter.h:219
 [< inline >] NF_HOOK include/linux/netfilter.h:242
 [] ip_local_deliver+0x1b3/0x350 net/ipv4/ip_input.c:257
 [] ? ip_call_ra_chain+0x540/0x540 
net/ipv4/ip_input.c:163
 [] ? ip_rcv_finish+0x1ab0/0x1ab0 
include/net/net_namespace.h:259

 [< inline >] dst_input include/net/dst.h:510
 [] ip_rcv_finish+0x679/0x1ab0 net/ipv4/ip_input.c:388
 [] ? sk_filter+0x7f/0xe50 net/core/filter.c:94
 [< inline >] NF_HOOK_THRESH include/linux/netfilter.h:219
 [< inline >] NF_HOOK include/linux/netfilter.h:242
 [] ip_rcv+0x963/0x10c0 net/ipv4/ip_input.c:478
 [] ? ip_local_deliver+0x350/0x350 
net/ipv4/ip_input.c:250

 [] ? skb_release_data+0x3d2/0x430 net/core/skbuff.c:599
 [] ? inet_del_offload+0x40/0x40 ??:?
 [] ? packet_rcv_spkt+0xdd/0x4a0 
net/packet/af_packet.c:1822
 [] ? ip_local_deliver+0x350/0x350 
net/ipv4/ip_input.c:250
 [] __netif_receive_skb_core+0x168d/0x3060 
net/core/dev.c:4160
 [] ? netif_wake_subqueue+0x220/0x220 
include/linux/compiler.h:222

 [< inline >] ? ktime_get_real include/linux/timekeeping.h:179
 [< inline >] ? __net_timestamp include/linux/skbuff.h:3099
 [] ? netif_receive_skb_internal+0x125/0x390 
net/core/dev.c:4207

 [< inline >] ? __net_timestamp include/linux/skbuff.h:3099
 [] ? netif_receive_skb_internal+0x14a/0x390 
net/core/dev.c:4207

 [] __netif_receive_skb+0x2a/0x160 net/core/dev.c:4198
 [] netif_receive_skb_internal+0x1b5/0x390 
net/core/dev.c:4226

 [< inline >] ? __net_timestamp include/linux/skbuff.h:3099
 [] ? netif_receive_skb_internal+0x14a/0x390 
net/core/dev.c:4207

 [] ? dev_cpu_callback+0x690/0x690 net/core/dev.c:7755
 [] 

Re: [patch net-next 1/4] netdevice: add SW statistics ndo

2016-05-15 Thread Andrew Lunn
> I think we don't understand each other. HW stats always include SW
> stats. Because whatever goes in or out goes through HW.

Hi Jiri

Bit of a corner case, but what about multicast and broadcast? Can you
do the replication in the switch, so that only a single copy is sent
from the host to the switch?

Andrew


BUG: use-after-free in netlink_dump

2016-05-15 Thread Baozeng Ding

Hi all,
I've got the following report (use-after-free in netlink_dump) while 
running syzkaller.

Unfortunately no reproducer. The kernel version is 4.6.0-rc2+.

==
BUG: KASAN: use-after-free in netlink_dump+0x4eb/0xa40 at addr 
880036ae7988

Read of size 4 by task syz-executor/14596
=
BUG kmalloc-1024 (Tainted: GB  ): kasan: bad access detected
-

INFO: Allocated in 0x age=18446681375777959590 cpu=0 pid=0
[<  none  >] __alloc_skb+0xf0/0x5f0 net/core/skbuff.c:230
[<  none  >] ___slab_alloc+0x4c7/0x500 mm/slub.c:2446
[<  none  >] __slab_alloc+0x4c/0x90 mm/slub.c:2475
[< inline >] slab_alloc_node mm/slub.c:2538
[<  none  >] __kmalloc_node_track_caller+0xba/0x420 mm/slub.c:4095
[<  none  >] __kmalloc_reserve.isra.33+0x41/0xe0 
net/core/skbuff.c:137

[<  none  >] __alloc_skb+0xf0/0x5f0 net/core/skbuff.c:230
[< inline >] alloc_skb include/linux/skbuff.h:895
[< inline >] netlink_alloc_large_skb net/netlink/af_netlink.c:1086
[<  none  >] netlink_sendmsg+0x8cd/0xcb0 
net/netlink/af_netlink.c:1761

[< inline >] sock_sendmsg_nosec net/socket.c:612
[<  none  >] sock_sendmsg+0xca/0x110 net/socket.c:622
[<  none  >] ___sys_sendmsg+0x728/0x860 net/socket.c:1946
[<  none  >] __sys_sendmsg+0xd1/0x170 net/socket.c:1980
[< inline >] SYSC_sendmsg net/socket.c:1991
[<  none  >] SyS_sendmsg+0x2d/0x50 net/socket.c:1987
[<  none  >] entry_SYSCALL_64_fastpath+0x23/0xc1 
arch/x86/entry/entry_64.S:207

INFO: Freed in 0x1000f2d5f age=18446681375777959590 cpu=0 pid=0
[< inline >] skb_free_head net/core/skbuff.c:579
[<  none  >] skb_release_data+0x361/0x430 net/core/skbuff.c:610
[<  none  >] __slab_free+0x1e8/0x300 mm/slub.c:2657
[< inline >] slab_free mm/slub.c:2810
[<  none  >] kfree+0x255/0x2d0 mm/slub.c:3661
[< inline >] skb_free_head net/core/skbuff.c:579
[<  none  >] skb_release_data+0x361/0x430 net/core/skbuff.c:610
[<  none  >] skb_release_all+0x4a/0x60 net/core/skbuff.c:669
[< inline >] __kfree_skb net/core/skbuff.c:683
[<  none  >] consume_skb+0x11b/0x2f0 net/core/skbuff.c:756
[< inline >] netlink_unicast_kernel net/netlink/af_netlink.c:1215
[<  none  >] netlink_unicast+0x5aa/0x890 
net/netlink/af_netlink.c:1240
[<  none  >] netlink_sendmsg+0x981/0xcb0 
net/netlink/af_netlink.c:1786

[< inline >] sock_sendmsg_nosec net/socket.c:612
[<  none  >] sock_sendmsg+0xca/0x110 net/socket.c:622
[<  none  >] ___sys_sendmsg+0x728/0x860 net/socket.c:1946
[<  none  >] __sys_sendmsg+0xd1/0x170 net/socket.c:1980
[< inline >] SYSC_sendmsg net/socket.c:1991
[<  none  >] SyS_sendmsg+0x2d/0x50 net/socket.c:1987
[<  none  >] entry_SYSCALL_64_fastpath+0x23/0xc1 
arch/x86/entry/entry_64.S:207
INFO: Slab 0xeadab800 objects=24 used=8 fp=0x880036ae7980 
flags=0x1fffc004080

INFO: Object 0x880036ae7978 @offset=31096 fp=0x
CPU: 0 PID: 14596 Comm: syz-executor Tainted: GB 4.6.0-rc2+ #16

Call Trace:
 [< inline >] __dump_stack lib/dump_stack.c:15
 [] dump_stack+0xb3/0x112 lib/dump_stack.c:51
 [] print_trailer+0x10d/0x190 mm/slub.c:667
 [] object_err+0x2f/0x40 mm/slub.c:674
 [< inline >] print_address_description mm/kasan/report.c:179
 [] kasan_report_error+0x218/0x530 mm/kasan/report.c:275
 [< inline >] kasan_report mm/kasan/report.c:297
 [] __asan_report_load4_noabort+0x3e/0x40 
mm/kasan/report.c:317

 [< inline >] ? nlmsg_put_answer include/net/netlink.h:471
 [] ? netlink_dump+0x4eb/0xa40 
net/netlink/af_netlink.c:2120

 [< inline >] nlmsg_put_answer include/net/netlink.h:471
 [] netlink_dump+0x4eb/0xa40 
net/netlink/af_netlink.c:2120
 [] netlink_recvmsg+0x8fb/0xe00 
net/netlink/af_netlink.c:1869
 [] ? netlink_dump+0xa40/0xa40 
include/linux/skbuff.h:1980
 [] ? rw_copy_check_uvector+0x1c3/0x260 
fs/read_write.c:818

 [] ? import_iovec+0x216/0x3c0 lib/iov_iter.c:811
 [] ? iov_iter_get_pages_alloc+0x960/0x960 
lib/iov_iter.c:629
 [] ? security_socket_recvmsg+0x8f/0xc0 
security/security.c:1244

 [< inline >] sock_recvmsg_nosec net/socket.c:714
 [] sock_recvmsg+0x9d/0xb0 net/socket.c:722
 [] ? __sock_recv_wifi_status+0x180/0x180 
./arch/x86/include/asm/bitops.h:311

 [] ___sys_recvmsg+0x259/0x540 net/socket.c:2104
 [< inline >] ? sock_sendmsg_nosec net/socket.c:612
 [] ? ___sys_sendmsg+0x860/0x860 net/socket.c:1943
 [< inline >] ? rcu_read_unlock include/linux/rcupdate.h:922
 [] ? __fget+0x20c/0x3b0 fs/file.c:712
 [< inline >] ? rcu_lock_release include/linux/rcupdate.h:491
 [< inline >] ? 

Re: [PATCH RFT 1/2] phylib: add device reset GPIO support

2016-05-15 Thread Andrew Lunn
> >I think there could be similar code one layer above to handle one gpio
> >line for multiple phys.
> 
>Ah, you want me to recognize some MAC/MDIO bound prop (e.g.
> "mdio-reset-gpios") in of_mdiobus_register()? I'll think about it
> now that my patch needs fixing anyway...

Hi Sergi

It does not need to be you implementing this, your hardware does not
need it. However, if you do feel like doing it, great.

 Andrew


[PATCH 4/4] batman-adv: Fix refcnt leak in batadv_v_neigh_*

2016-05-15 Thread Antonio Quartulli
From: Sven Eckelmann 

The functions batadv_neigh_ifinfo_get increase the reference counter of the
batadv_neigh_ifinfo. These have to be reduced again when the reference is
not used anymore to correctly free the objects.

Fixes: 9786906022eb ("batman-adv: B.A.T.M.A.N. V - implement neighbor 
comparison API calls")
Signed-off-by: Sven Eckelmann 
Signed-off-by: Marek Lindner 
Signed-off-by: Antonio Quartulli 
---
 net/batman-adv/bat_v.c | 32 +---
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/net/batman-adv/bat_v.c b/net/batman-adv/bat_v.c
index e81ad4b8e5c8..7fd477583eb0 100644
--- a/net/batman-adv/bat_v.c
+++ b/net/batman-adv/bat_v.c
@@ -257,14 +257,23 @@ static int batadv_v_neigh_cmp(struct batadv_neigh_node 
*neigh1,
  struct batadv_hard_iface *if_outgoing2)
 {
struct batadv_neigh_ifinfo *ifinfo1, *ifinfo2;
+   int ret = 0;
 
ifinfo1 = batadv_neigh_ifinfo_get(neigh1, if_outgoing1);
+   if (WARN_ON(!ifinfo1))
+   goto err_ifinfo1;
+
ifinfo2 = batadv_neigh_ifinfo_get(neigh2, if_outgoing2);
+   if (WARN_ON(!ifinfo2))
+   goto err_ifinfo2;
 
-   if (WARN_ON(!ifinfo1 || !ifinfo2))
-   return 0;
+   ret = ifinfo1->bat_v.throughput - ifinfo2->bat_v.throughput;
 
-   return ifinfo1->bat_v.throughput - ifinfo2->bat_v.throughput;
+   batadv_neigh_ifinfo_put(ifinfo2);
+err_ifinfo2:
+   batadv_neigh_ifinfo_put(ifinfo1);
+err_ifinfo1:
+   return ret;
 }
 
 static bool batadv_v_neigh_is_sob(struct batadv_neigh_node *neigh1,
@@ -274,17 +283,26 @@ static bool batadv_v_neigh_is_sob(struct 
batadv_neigh_node *neigh1,
 {
struct batadv_neigh_ifinfo *ifinfo1, *ifinfo2;
u32 threshold;
+   bool ret = false;
 
ifinfo1 = batadv_neigh_ifinfo_get(neigh1, if_outgoing1);
-   ifinfo2 = batadv_neigh_ifinfo_get(neigh2, if_outgoing2);
+   if (WARN_ON(!ifinfo1))
+   goto err_ifinfo1;
 
-   if (WARN_ON(!ifinfo1 || !ifinfo2))
-   return false;
+   ifinfo2 = batadv_neigh_ifinfo_get(neigh2, if_outgoing2);
+   if (WARN_ON(!ifinfo2))
+   goto err_ifinfo2;
 
threshold = ifinfo1->bat_v.throughput / 4;
threshold = ifinfo1->bat_v.throughput - threshold;
 
-   return ifinfo2->bat_v.throughput > threshold;
+   ret = ifinfo2->bat_v.throughput > threshold;
+
+   batadv_neigh_ifinfo_put(ifinfo2);
+err_ifinfo2:
+   batadv_neigh_ifinfo_put(ifinfo1);
+err_ifinfo1:
+   return ret;
 }
 
 static struct batadv_algo_ops batadv_batman_v __read_mostly = {
-- 
2.8.2



[PATCH v11 net-next 0/1] introduce Hyper-V VM Sockets(hv_sock)

2016-05-15 Thread Dexuan Cui
Hyper-V Sockets (hv_sock) supplies a byte-stream based communication
mechanism between the host and the guest. It's somewhat like TCP over
VMBus, but the transportation layer (VMBus) is much simpler than IP.

With Hyper-V Sockets, applications between the host and the guest can talk
to each other directly by the traditional BSD-style socket APIs.

Hyper-V Sockets is only available on new Windows hosts, like Windows Server
2016. More info is in this article "Make your own integration services":
https://msdn.microsoft.com/en-us/virtualization/hyperv_on_windows/develop/make_mgmt_service

The patch implements the necessary support in the guest side by
introducing a new socket address family AF_HYPERV.

You can also get the patch by:
https://github.com/dcui/linux/commits/decui/hv_sock/net-next/20160512_v10

Note: the VMBus driver side's supporting patches have been in the mainline
tree.

I know the kernel has already had a VM Sockets driver (AF_VSOCK) based
on VMware VMCI (net/vmw_vsock/, drivers/misc/vmw_vmci), and KVM is
proposing AF_VSOCK of virtio version:
http://marc.info/?l=linux-netdev=145952064004765=2

However, though Hyper-V Sockets may seem conceptually similar to
AF_VSOCK, there are differences in the transportation layer, and IMO these
make direct code reuse impractical:

1. In AF_VSOCK, the endpoint type is: , but in
AF_HYPERV, the endpoint type is: . Here GUID
is 128-bit.

2. AF_VSOCK supports SOCK_DGRAM, while AF_HYPERV doesn't.

3. AF_VSOCK supports some special sock opts, like SO_VM_SOCKETS_BUFFER_SIZE,
SO_VM_SOCKETS_BUFFER_MIN/MAX_SIZE and SO_VM_SOCKETS_CONNECT_TIMEOUT.
These are meaningless to AF_HYPERV.

4. Some of AF_VSOCK's VMCI transportation ops are meaningless to AF_HYPERV/VMBus,
like .notify_recv_init
.notify_recv_pre_block
.notify_recv_pre_dequeue
.notify_recv_post_dequeue
.notify_send_init
.notify_send_pre_block
.notify_send_pre_enqueue
.notify_send_post_enqueue
etc.

So I think we'd better introduce a new address family: AF_HYPERV.

Please review the patch.

Looking forward to your comments, especially comments from David. :-)

Changes since v1:
- updated "[PATCH 6/7] hvsock: introduce Hyper-V VM Sockets feature"
- added __init and __exit for the module init/exit functions
- net/hv_sock/Kconfig: "default m" -> "default m if HYPERV"
- MODULE_LICENSE: "Dual MIT/GPL" -> "Dual BSD/GPL"

Changes since v2:
- fixed various coding issue pointed out by David Miller
- fixed indentation issues
- removed pr_debug in net/hv_sock/af_hvsock.c
- used reverse-Christmas-tree style for local variables.
- EXPORT_SYMBOL -> EXPORT_SYMBOL_GPL

Changes since v3:
- fixed a few coding issue pointed by Vitaly Kuznetsov and Dan Carpenter
- fixed the ret value in vmbus_recvpacket_hvsock on error
- fixed the style of multi-line comment: vmbus_get_hvsock_rw_status()

Changes since v4 (https://lkml.org/lkml/2015/7/28/404):
- addressed all the comments about V4.
- treat the hvsock offers/channels as special VMBus devices
- add a mechanism to pass hvsock events to the hvsock driver
- fixed some corner cases with proper locking when a connection is closed
- rebased to the latest Greg's tree

Changes since v5 (https://lkml.org/lkml/2015/12/24/103):
- addressed the coding style issues (Vitaly Kuznetsov & David Miller, thanks!)
- used a better coding for the per-channel rescind callback (Thank Vitaly!)
- avoided the introduction of new VMBUS driver APIs vmbus_sendpacket_hvsock()
and vmbus_recvpacket_hvsock() and used vmbus_sendpacket()/vmbus_recvpacket()
in the higher level (i.e., the vmsock driver). Thank Vitaly!

Changes since v6 (http://lkml.iu.edu/hypermail/linux/kernel/1601.3/01813.html)
- only a few minor changes of coding style and comments

Changes since v7
- a few minor changes of coding style: thanks, Joe Perches!
- added some lines of comments about GUID/UUID before the struct sockaddr_hv.

Changes since v8
- removed the unnecessary __packed for some definitions: thanks, David!
- hvsock_open_connection: use offer.u.pipe.user_def[0] to know the connection
direction, and reorganized the function
- reorganized the code according to suggestions from Cathy Avery: split big
functions into small ones, set .setsockopt and getsockopt to
sock_no_setsockopt/sock_no_getsockopt
- inline'd some small list helper functions

Changes since v9
- minimized struct hvsock_sock by making the send/recv buffers pointers.
   the buffers are allocated by kmalloc() in __hvsock_create() now.
- minimized the sizes of the send/recv buffers and the vmbus ringbuffers.

Changes since v10

1) add module params: send_ring_page, recv_ring_page. They can be used to
enlarge the ringbuffer size to get better performance, e.g.,
# modprobe hv_sock  recv_ring_page=16 send_ring_page=16
By default, recv_ring_page is 3 and send_ring_page is 2.

2) add module param max_socket_number (the default is 1024).
A user can enlarge the number to create more than 1024 hv_sock sockets.
By default, 1024 sockets take about 1024 * (3+2+1+1) * 4KB = 28M bytes.
(Here 1+1 means 1 page for send/recv buffers per connection, respectively.)

[PATCH 3/4] batman-adv: Fix double neigh_node_put in batadv_v_ogm_route_update

2016-05-15 Thread Antonio Quartulli
From: Sven Eckelmann 

The router is put down twice when it was non-NULL and either orig_ifinfo is
NULL afterwards or batman-adv receives a packet with the same sequence
number. This will end up in a use-after-free when the batadv_neigh_node is
removed because the reference counter ended up too early at 0.

Fixes: 9323158ef9f4 ("batman-adv: OGMv2 - implement originators logic")
Signed-off-by: Sven Eckelmann 
Signed-off-by: Antonio Quartulli 
---
 net/batman-adv/bat_v_ogm.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/batman-adv/bat_v_ogm.c b/net/batman-adv/bat_v_ogm.c
index d9bcbe6e7d65..91df28a100f9 100644
--- a/net/batman-adv/bat_v_ogm.c
+++ b/net/batman-adv/bat_v_ogm.c
@@ -529,8 +529,10 @@ static void batadv_v_ogm_route_update(struct batadv_priv 
*bat_priv,
goto out;
}
 
-   if (router)
+   if (router) {
batadv_neigh_node_put(router);
+   router = NULL;
+   }
 
/* Update routes, and check if the OGM is from the best next hop */
batadv_v_ogm_orig_update(bat_priv, orig_node, neigh_node, ogm2,
-- 
2.8.2



[PATCH 2/4] batman-adv: Avoid nullptr derefence in batadv_v_neigh_is_sob

2016-05-15 Thread Antonio Quartulli
From: Sven Eckelmann 

batadv_neigh_ifinfo_get can return NULL when it can no longer find (even if
only temporarily) the neigh_ifinfo in the list neigh->ifinfo_list. This
has to be checked to avoid kernel Oopses when the ifinfo is dereferenced.

This is a situation which isn't expected, but it is already handled by functions
like batadv_v_neigh_cmp. The same kind of warning is therefore used before
the function returns without dereferencing the pointers.

Fixes: 9786906022eb ("batman-adv: B.A.T.M.A.N. V - implement neighbor comparison API calls")
Signed-off-by: Sven Eckelmann 
Signed-off-by: Marek Lindner 
Signed-off-by: Antonio Quartulli 
---
 net/batman-adv/bat_v.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/net/batman-adv/bat_v.c b/net/batman-adv/bat_v.c
index 4026f198a734..e81ad4b8e5c8 100644
--- a/net/batman-adv/bat_v.c
+++ b/net/batman-adv/bat_v.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -277,6 +278,9 @@ static bool batadv_v_neigh_is_sob(struct batadv_neigh_node 
*neigh1,
ifinfo1 = batadv_neigh_ifinfo_get(neigh1, if_outgoing1);
ifinfo2 = batadv_neigh_ifinfo_get(neigh2, if_outgoing2);
 
+   if (WARN_ON(!ifinfo1 || !ifinfo2))
+   return false;
+
threshold = ifinfo1->bat_v.throughput / 4;
threshold = ifinfo1->bat_v.throughput - threshold;
 
-- 
2.8.2



[PATCH v11 net-next 1/1] hv_sock: introduce Hyper-V Sockets

2016-05-15 Thread Dexuan Cui
Hyper-V Sockets (hv_sock) supplies a byte-stream based communication
mechanism between the host and the guest. It's somewhat like TCP over
VMBus, but the transportation layer (VMBus) is much simpler than IP.

With Hyper-V Sockets, applications between the host and the guest can talk
to each other directly by the traditional BSD-style socket APIs.

Hyper-V Sockets is only available on new Windows hosts, like Windows Server
2016. More info is in this article "Make your own integration services":
https://msdn.microsoft.com/en-us/virtualization/hyperv_on_windows/develop/make_mgmt_service

The patch implements the necessary support in the guest side by introducing
a new socket address family AF_HYPERV.

Signed-off-by: Dexuan Cui 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Vitaly Kuznetsov 
Cc: Cathy Avery 
---

You can also get the patch on this branch:
https://github.com/dcui/linux/commits/decui/hv_sock/net-next/20160515_v11

For the change log before v10, please see https://lkml.org/lkml/2016/5/4/532

In v10, the main changes consist of
1) minimize struct hvsock_sock by making the send/recv buffers pointers.
   the buffers are allocated by kmalloc() in __hvsock_create().
2) minimize the sizes of the send/recv buffers and the vmbus ringbuffers.


In v11, the changes are:
1) add module params: send_ring_page, recv_ring_page. They can be used to
enlarge the ringbuffer size to get better performance, e.g.,
# modprobe hv_sock  recv_ring_page=16 send_ring_page=16
By default, recv_ring_page is 3 and send_ring_page is 2.

2) add module param max_socket_number (the default is 1024).
A user can enlarge the number to create more than 1024 hv_sock sockets.
By default, 1024 sockets take about 1024 * (3+2+1+1) * 4KB = 28M bytes.
(Here 1+1 means 1 page for send/recv buffers per connection, respectively.)

3) implement the TODO in hvsock_shutdown().

4) fix a bug in hvsock_close_connection():
   I remove "sk->sk_socket->state = SS_UNCONNECTED;" -- actually this line
is not really useful. For a connection triggered by a host app’s connect(),
sk->sk_socket remains NULL before the connection is accepted by the server
app (in Linux VM): see hvsock_accept() -> hvsock_accept_wait() ->
sock_graft(connected, newsock). If the host app exits before the server
app’s accept() returns, the host can send a rescind-message to close the
connection and later, in the Linux VM’s message handler
(i.e. vmbus_onoffer_rescind()), Linux will get a NULL de-referencing crash.

5) fix a bug in hvsock_open_connection()
  I move the vmbus_set_chn_rescind_callback() to a later place, because
when vmbus_open() fails, hvsock_close_connection() can do nothing and we
count on vmbus_onoffer_rescind() -> vmbus_device_unregister() to clean up
the device.

6) some stylistic modifications.


 MAINTAINERS |2 +
 include/linux/hyperv.h  |   14 +
 include/linux/socket.h  |4 +-
 include/net/af_hvsock.h |   78 +++
 include/uapi/linux/hyperv.h |   25 +
 net/Kconfig |1 +
 net/Makefile|1 +
 net/hv_sock/Kconfig |   10 +
 net/hv_sock/Makefile|3 +
 net/hv_sock/af_hvsock.c | 1520 +++
 10 files changed, 1657 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index b57df66..c9fe2c6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5271,7 +5271,9 @@ F:drivers/pci/host/pci-hyperv.c
 F: drivers/net/hyperv/
 F: drivers/scsi/storvsc_drv.c
 F: drivers/video/fbdev/hyperv_fb.c
+F: net/hv_sock/
 F: include/linux/hyperv.h
+F: include/net/af_hvsock.h
 F: tools/hv/
 F: Documentation/ABI/stable/sysfs-bus-vmbus
 
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index aa0fadc..7be7237 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1338,4 +1338,18 @@ extern __u32 vmbus_proto_version;
 
 int vmbus_send_tl_connect_request(const uuid_le *shv_guest_servie_id,
  const uuid_le *shv_host_servie_id);
+struct vmpipe_proto_header {
+   u32 pkt_type;
+   u32 data_size;
+};
+
+#define HVSOCK_HEADER_LEN  (sizeof(struct vmpacket_descriptor) + \
+sizeof(struct vmpipe_proto_header))
+
+/* See 'prev_indices' in hv_ringbuffer_read(), hv_ringbuffer_write() */
+#define PREV_INDICES_LEN   (sizeof(u64))
+
+#define HVSOCK_PKT_LEN(payload_len)(HVSOCK_HEADER_LEN + \
+   ALIGN((payload_len), 8) + \
+   PREV_INDICES_LEN)
 #endif /* _HYPERV_H */
diff --git a/include/linux/socket.h b/include/linux/socket.h
index b5cc5a6..0b68b58 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -202,8 +202,9 @@ struct ucred {
 #define AF_VSOCK   40  /* vSockets */
#define AF_KCM 41  /* Kernel Connection Multiplexor */

pull request [net]: batman-adv 20160515

2016-05-15 Thread Antonio Quartulli
Hello David,

although we are extremely late in the release cycle, we have 4 fixes
which would really be worth merging before releasing linux-4.6.

As you can read in the git tag below, each of them can lead to a
kernel crash or to an unstable system.

We came up with several fixes after having tested our new B.A.T.M.A.N. V
code at the Wireless Battle Mesh in Porto (PT) at the beginning of the month,
however, what I am sending here is the minimum subset that we thought was
extremely important to avoid easy kernel crashes. The change footprint is
also rather small.


Please pull or let me know if you would prefer to get this through net-next.

If you decide to pull, you will hit some conflicts when merging net into
net-next, but I can send you some instructions to ease the process.


Thanks a lot!
Antonio


The following changes since commit b91506586206140154b0b44cccf88c8cc0a4dca5:

  Merge branch 'xgene-fixes' (2016-05-13 21:12:07 -0400)

are available in the git repository at:

  git://git.open-mesh.org/linux-merge.git tags/batman-adv-fix-for-davem

for you to fetch changes up to 6b892c1cb0805acee5d4ddd9e7878ed076c1b7c7:

  batman-adv: Fix refcnt leak in batadv_v_neigh_* (2016-05-14 15:51:39 +0800)


During the Wireless Battle Mesh v9 in Porto (PT) at the beginning of
May, we managed to uncover and fix some important bugs in our
new B.A.T.M.A.N. V algorithm. These are the most critical fixes we
came up with, aimed at avoiding easy kernel crashes:
- avoid potential crash due to NULL pointer dereference in
  B.A.T.M.A.N. V routine when a neigh_ifinfo object is not found, by
  Sven Eckelmann
- avoid crash due to double kref_put on neigh_node object in
  B.A.T.M.A.N. V routine leading to use-after-free, by Sven
  Eckelmann (this crash can be always replicated)
- avoid use-after-free of skb when counting outgoing bytes, by Florian
  Westphal
- fix neigh_ifinfo object reference counting imbalance when using
  B.A.T.M.A.N. V, by Sven Eckelmann. Such imbalance may lead to the
  impossibility of releasing the related netdev object on shutdown.


Florian Westphal (1):
  batman-adv: fix skb deref after free

Sven Eckelmann (3):
  batman-adv: Avoid nullptr derefence in batadv_v_neigh_is_sob
  batman-adv: Fix double neigh_node_put in batadv_v_ogm_route_update
  batman-adv: Fix refcnt leak in batadv_v_neigh_*

 net/batman-adv/bat_v.c | 30 ++
 net/batman-adv/bat_v_ogm.c |  4 +++-
 net/batman-adv/routing.c   |  4 +++-
 3 files changed, 32 insertions(+), 6 deletions(-)


[PATCH 1/4] batman-adv: fix skb deref after free

2016-05-15 Thread Antonio Quartulli
From: Florian Westphal 

batadv_send_skb_to_orig() calls dev_queue_xmit() so we can't use skb->len.

Fixes: 953324776d6d ("batman-adv: network coding - buffer unicast packets before forward")
Signed-off-by: Florian Westphal 
Reviewed-by: Sven Eckelmann 
Signed-off-by: Marek Lindner 
Signed-off-by: Antonio Quartulli 
---
 net/batman-adv/routing.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/batman-adv/routing.c b/net/batman-adv/routing.c
index b781bf753250..0c0c30e101a1 100644
--- a/net/batman-adv/routing.c
+++ b/net/batman-adv/routing.c
@@ -601,6 +601,7 @@ static int batadv_route_unicast_packet(struct sk_buff *skb,
struct batadv_unicast_packet *unicast_packet;
struct ethhdr *ethhdr = eth_hdr(skb);
int res, hdr_len, ret = NET_RX_DROP;
+   unsigned int len;
 
unicast_packet = (struct batadv_unicast_packet *)skb->data;
 
@@ -641,6 +642,7 @@ static int batadv_route_unicast_packet(struct sk_buff *skb,
if (hdr_len > 0)
batadv_skb_set_priority(skb, hdr_len);
 
+   len = skb->len;
res = batadv_send_skb_to_orig(skb, orig_node, recv_if);
 
/* translate transmit result into receive result */
@@ -648,7 +650,7 @@ static int batadv_route_unicast_packet(struct sk_buff *skb,
/* skb was transmitted and consumed */
batadv_inc_counter(bat_priv, BATADV_CNT_FORWARD);
batadv_add_counter(bat_priv, BATADV_CNT_FORWARD_BYTES,
-  skb->len + ETH_HLEN);
+  len + ETH_HLEN);
 
ret = NET_RX_SUCCESS;
} else if (res == NET_XMIT_POLICED) {
-- 
2.8.2



[PATCH 1/5] ethtool: move option parsing related codes into function

2016-05-15 Thread kan . liang
From: Kan Liang 

Move option parsing code into find_option function.
No behavior changes.

Signed-off-by: Kan Liang 
---
 ethtool.c | 49 +++--
 1 file changed, 31 insertions(+), 18 deletions(-)

diff --git a/ethtool.c b/ethtool.c
index 0cd0d4f..bd0583c 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -4223,6 +4223,29 @@ static int show_usage(struct cmd_context *ctx)
return 0;
 }
 
+static int find_option(int argc, char **argp)
+{
+   const char *opt;
+   size_t len;
+   int k;
+
+   for (k = 0; args[k].opts; k++) {
+   opt = args[k].opts;
+   for (;;) {
+   len = strcspn(opt, "|");
+   if (strncmp(*argp, opt, len) == 0 &&
+   (*argp)[len] == 0)
+   return k;
+
+   if (opt[len] == 0)
+   break;
+   opt += len + 1;
+   }
+   }
+
+   return -1;
+}
+
 int main(int argc, char **argp)
 {
int (*func)(struct cmd_context *);
@@ -4240,24 +4263,14 @@ int main(int argc, char **argp)
 */
if (argc == 0)
exit_bad_args();
-   for (k = 0; args[k].opts; k++) {
-   const char *opt;
-   size_t len;
-   opt = args[k].opts;
-   for (;;) {
-   len = strcspn(opt, "|");
-   if (strncmp(*argp, opt, len) == 0 &&
-   (*argp)[len] == 0) {
-   argp++;
-   argc--;
-   func = args[k].func;
-   want_device = args[k].want_device;
-   goto opt_found;
-   }
-   if (opt[len] == 0)
-   break;
-   opt += len + 1;
-   }
+
+   k = find_option(argc, argp);
+   if (k >= 0) {
+   argp++;
+   argc--;
+   func = args[k].func;
+   want_device = args[k].want_device;
+   goto opt_found;
}
if ((*argp)[0] == '-')
exit_bad_args();
-- 
2.5.0



[PATCH 2/5] ethtool: move cmdline_coalesce out of do_scoalesce

2016-05-15 Thread kan . liang
From: Kan Liang 

Moving cmdline_coalesce out of do_scoalesce, so it can be shared with
other functions.
No behavior change.

Signed-off-by: Kan Liang 
---
 ethtool.c | 147 +++---
 1 file changed, 74 insertions(+), 73 deletions(-)

diff --git a/ethtool.c b/ethtool.c
index bd0583c..86724a2 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -1883,85 +1883,86 @@ static int do_gcoalesce(struct cmd_context *ctx)
return 0;
 }
 
+static struct ethtool_coalesce s_ecoal;
+static s32 coal_stats_wanted = -1;
+static int coal_adaptive_rx_wanted = -1;
+static int coal_adaptive_tx_wanted = -1;
+static s32 coal_sample_rate_wanted = -1;
+static s32 coal_pkt_rate_low_wanted = -1;
+static s32 coal_pkt_rate_high_wanted = -1;
+static s32 coal_rx_usec_wanted = -1;
+static s32 coal_rx_frames_wanted = -1;
+static s32 coal_rx_usec_irq_wanted = -1;
+static s32 coal_rx_frames_irq_wanted = -1;
+static s32 coal_tx_usec_wanted = -1;
+static s32 coal_tx_frames_wanted = -1;
+static s32 coal_tx_usec_irq_wanted = -1;
+static s32 coal_tx_frames_irq_wanted = -1;
+static s32 coal_rx_usec_low_wanted = -1;
+static s32 coal_rx_frames_low_wanted = -1;
+static s32 coal_tx_usec_low_wanted = -1;
+static s32 coal_tx_frames_low_wanted = -1;
+static s32 coal_rx_usec_high_wanted = -1;
+static s32 coal_rx_frames_high_wanted = -1;
+static s32 coal_tx_usec_high_wanted = -1;
+static s32 coal_tx_frames_high_wanted = -1;
+
+static struct cmdline_info cmdline_coalesce[] = {
+   { "adaptive-rx", CMDL_BOOL, &coal_adaptive_rx_wanted,
+ &s_ecoal.use_adaptive_rx_coalesce },
+   { "adaptive-tx", CMDL_BOOL, &coal_adaptive_tx_wanted,
+ &s_ecoal.use_adaptive_tx_coalesce },
+   { "sample-interval", CMDL_S32, &coal_sample_rate_wanted,
+ &s_ecoal.rate_sample_interval },
+   { "stats-block-usecs", CMDL_S32, &coal_stats_wanted,
+ &s_ecoal.stats_block_coalesce_usecs },
+   { "pkt-rate-low", CMDL_S32, &coal_pkt_rate_low_wanted,
+ &s_ecoal.pkt_rate_low },
+   { "pkt-rate-high", CMDL_S32, &coal_pkt_rate_high_wanted,
+ &s_ecoal.pkt_rate_high },
+   { "rx-usecs", CMDL_S32, &coal_rx_usec_wanted,
+ &s_ecoal.rx_coalesce_usecs },
+   { "rx-frames", CMDL_S32, &coal_rx_frames_wanted,
+ &s_ecoal.rx_max_coalesced_frames },
+   { "rx-usecs-irq", CMDL_S32, &coal_rx_usec_irq_wanted,
+ &s_ecoal.rx_coalesce_usecs_irq },
+   { "rx-frames-irq", CMDL_S32, &coal_rx_frames_irq_wanted,
+ &s_ecoal.rx_max_coalesced_frames_irq },
+   { "tx-usecs", CMDL_S32, &coal_tx_usec_wanted,
+ &s_ecoal.tx_coalesce_usecs },
+   { "tx-frames", CMDL_S32, &coal_tx_frames_wanted,
+ &s_ecoal.tx_max_coalesced_frames },
+   { "tx-usecs-irq", CMDL_S32, &coal_tx_usec_irq_wanted,
+ &s_ecoal.tx_coalesce_usecs_irq },
+   { "tx-frames-irq", CMDL_S32, &coal_tx_frames_irq_wanted,
+ &s_ecoal.tx_max_coalesced_frames_irq },
+   { "rx-usecs-low", CMDL_S32, &coal_rx_usec_low_wanted,
+ &s_ecoal.rx_coalesce_usecs_low },
+   { "rx-frames-low", CMDL_S32, &coal_rx_frames_low_wanted,
+ &s_ecoal.rx_max_coalesced_frames_low },
+   { "tx-usecs-low", CMDL_S32, &coal_tx_usec_low_wanted,
+ &s_ecoal.tx_coalesce_usecs_low },
+   { "tx-frames-low", CMDL_S32, &coal_tx_frames_low_wanted,
+ &s_ecoal.tx_max_coalesced_frames_low },
+   { "rx-usecs-high", CMDL_S32, &coal_rx_usec_high_wanted,
+ &s_ecoal.rx_coalesce_usecs_high },
+   { "rx-frames-high", CMDL_S32, &coal_rx_frames_high_wanted,
+ &s_ecoal.rx_max_coalesced_frames_high },
+   { "tx-usecs-high", CMDL_S32, &coal_tx_usec_high_wanted,
+ &s_ecoal.tx_coalesce_usecs_high },
+   { "tx-frames-high", CMDL_S32, &coal_tx_frames_high_wanted,
+ &s_ecoal.tx_max_coalesced_frames_high },
+};
 static int do_scoalesce(struct cmd_context *ctx)
 {
-   struct ethtool_coalesce ecoal;
int gcoalesce_changed = 0;
-   s32 coal_stats_wanted = -1;
-   int coal_adaptive_rx_wanted = -1;
-   int coal_adaptive_tx_wanted = -1;
-   s32 coal_sample_rate_wanted = -1;
-   s32 coal_pkt_rate_low_wanted = -1;
-   s32 coal_pkt_rate_high_wanted = -1;
-   s32 coal_rx_usec_wanted = -1;
-   s32 coal_rx_frames_wanted = -1;
-   s32 coal_rx_usec_irq_wanted = -1;
-   s32 coal_rx_frames_irq_wanted = -1;
-   s32 coal_tx_usec_wanted = -1;
-   s32 coal_tx_frames_wanted = -1;
-   s32 coal_tx_usec_irq_wanted = -1;
-   s32 coal_tx_frames_irq_wanted = -1;
-   s32 coal_rx_usec_low_wanted = -1;
-   s32 coal_rx_frames_low_wanted = -1;
-   s32 coal_tx_usec_low_wanted = -1;
-   s32 coal_tx_frames_low_wanted = -1;
-   s32 coal_rx_usec_high_wanted = -1;
-   s32 coal_rx_frames_high_wanted = -1;
-   s32 coal_tx_usec_high_wanted = -1;
-   s32 coal_tx_frames_high_wanted = -1;
-   struct cmdline_info cmdline_coalesce[] = {
-   { "adaptive-rx", CMDL_BOOL, &coal_adaptive_rx_wanted,
- &ecoal.use_adaptive_rx_coalesce },
-   { 

[PATCH 4/5] ethtool: support per queue sub command --show-coalesce

2016-05-15 Thread kan . liang
From: Kan Liang 

Get all masked queues' coalesce settings from the kernel and dump them one by one.

Example:

 $ sudo ./ethtool --set-perqueue-command eth5 queue_mask 0x11
   --show-coalesce
 Queue: 0
 Adaptive RX: off  TX: off
 stats-block-usecs: 0
 sample-interval: 0
 pkt-rate-low: 0
 pkt-rate-high: 0

 rx-usecs: 222
 rx-frames: 0
 rx-usecs-irq: 0
 rx-frames-irq: 256

 tx-usecs: 222
 tx-frames: 0
 tx-usecs-irq: 0
 tx-frames-irq: 256

 rx-usecs-low: 0
 rx-frame-low: 0
 tx-usecs-low: 0
 tx-frame-low: 0

 rx-usecs-high: 0
 rx-frame-high: 0
 tx-usecs-high: 0
 tx-frame-high: 0

 Queue: 4
 Adaptive RX: off  TX: off
 stats-block-usecs: 0
 sample-interval: 0
 pkt-rate-low: 0
 pkt-rate-high: 0

 rx-usecs: 222
 rx-frames: 0
 rx-usecs-irq: 0
 rx-frames-irq: 256

 tx-usecs: 222
 tx-frames: 0
 tx-usecs-irq: 0
 tx-frames-irq: 256

 rx-usecs-low: 0
 rx-frame-low: 0
 tx-usecs-low: 0
 tx-frame-low: 0

 rx-usecs-high: 0
 rx-frame-high: 0
 tx-usecs-high: 0
 tx-frame-high: 0

Signed-off-by: Kan Liang 
---
 ethtool.8.in |  2 +-
 ethtool.c| 62 ++--
 2 files changed, 61 insertions(+), 3 deletions(-)

diff --git a/ethtool.8.in b/ethtool.8.in
index 26d01cb..210ec8c 100644
--- a/ethtool.8.in
+++ b/ethtool.8.in
@@ -937,7 +937,7 @@ Sets the specific queues which the sub command is applied 
to.
 If queue_mask is not set, the sub command will be applied to all queues.
 .TP
 .B sub_command
-Sets the sub command.
+Sets the sub command. The supported sub commands include --show-coalesce.
 .RE
 .SH BUGS
 Not supported (in part or whole) on all network drivers.
diff --git a/ethtool.c b/ethtool.c
index ba741f0..a966bf8 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -1219,6 +1219,29 @@ static int dump_coalesce(const struct ethtool_coalesce 
*ecoal)
return 0;
 }
 
+void dump_per_queue_coalesce(struct ethtool_per_queue_op *per_queue_opt,
+__u32 *queue_mask)
+{
+   char *addr;
+   int i;
+
+   addr = (char *)per_queue_opt + sizeof(*per_queue_opt);
+   for (i = 0; i < __KERNEL_DIV_ROUND_UP(MAX_NUM_QUEUE, 32); i++) {
+   int queue = i * 32;
+   __u32 mask = queue_mask[i];
+
+   while (mask > 0) {
+   if (mask & 0x1) {
+   fprintf(stdout, "Queue: %d\n", queue);
+   dump_coalesce((struct ethtool_coalesce *)addr);
+   addr += sizeof(struct ethtool_coalesce);
+   }
+   mask = mask >> 1;
+   queue++;
+   }
+   }
+}
+
 struct feature_state {
u32 off_flags;
struct ethtool_gfeatures features;
@@ -4198,7 +4221,8 @@ static const struct option {
  " [ advertise %x ]\n"
  " [ tx-lpi on|off ]\n"
  " [ tx-timer %d ]\n"},
-   { "--set-perqueue-command", 1, do_perqueue, "Set per queue command",
+   { "--set-perqueue-command", 1, do_perqueue, "Set per queue command. "
+ "The supported sub commands include --show-coalesce",
  " [queue_mask %x] SUB_COMMAND\n"},
{ "-h|--help", 0, show_usage, "Show this help" },
{ "--version", 0, do_version, "Show version number" },
@@ -4302,8 +4326,31 @@ static int find_max_num_queues(struct cmd_context *ctx)
return MAX(MAX(echannels.rx_count, echannels.tx_count), 
echannels.combined_count);
 }
 
+static struct ethtool_per_queue_op *
+get_per_queue_coalesce(struct cmd_context *ctx,
+  __u32 *queue_mask, int n_queues)
+{
+   struct ethtool_per_queue_op *per_queue_opt;
+
+   per_queue_opt = malloc(sizeof(*per_queue_opt) + n_queues * 
sizeof(struct ethtool_coalesce));
+   if (!per_queue_opt)
+   return NULL;
+
+   memcpy(per_queue_opt->queue_mask, queue_mask, 
__KERNEL_DIV_ROUND_UP(MAX_NUM_QUEUE, 32) * sizeof(__u32));
+   per_queue_opt->cmd = ETHTOOL_PERQUEUE;
+   per_queue_opt->sub_command = ETHTOOL_GCOALESCE;
+   if (send_ioctl(ctx, per_queue_opt)) {
+   free(per_queue_opt);
+   perror("Cannot get device per queue parameters");
+   return NULL;
+   }
+
+   return per_queue_opt;
+}
+
 static int do_perqueue(struct cmd_context *ctx)
 {
+   struct ethtool_per_queue_op *per_queue_opt;
__u32 queue_mask[__KERNEL_DIV_ROUND_UP(MAX_NUM_QUEUE, 32)] = {0};
int i, n_queues = 0;
 
@@ -4342,7 +4389,18 @@ static int do_perqueue(struct cmd_context *ctx)
if (i < 0)
exit_bad_args();
 
-   /* no sub_command support yet */
+   if (strstr(args[i].opts, "--show-coalesce") != NULL) {
+   per_queue_opt = get_per_queue_coalesce(ctx, queue_mask, 
n_queues);
+   if (per_queue_opt == NULL) {
+   perror("Cannot get device per queue parameters");
+   return 

[PATCH 5/5] ethtool: support per queue sub command --coalesce

2016-05-15 Thread kan . liang
From: Kan Liang 

This patch uses a similar way as do_scoalesce to set coalesce per queue.
It reads the current settings, changes them, and writes them back to the
kernel for each masked queue.

Example:

 $ sudo ./ethtool --set-perqueue-command eth5 queue_mask 0x1 --coalesce
 rx-usecs 10 tx-usecs 5
 $ sudo ./ethtool --set-perqueue-command eth5 queue_mask 0x1
 --show-coalesce

 Queue: 0
 Adaptive RX: on  TX: on
 stats-block-usecs: 0
 sample-interval: 0
 pkt-rate-low: 0
 pkt-rate-high: 0

 rx-usecs: 10
 rx-frames: 0
 rx-usecs-irq: 0
 rx-frames-irq: 256

 tx-usecs: 5
 tx-frames: 0
 tx-usecs-irq: 0
 tx-frames-irq: 256

 rx-usecs-low: 0
 rx-frame-low: 0
 tx-usecs-low: 0
 tx-frame-low: 0

 rx-usecs-high: 0
 rx-frame-high: 0
 tx-usecs-high: 0
 tx-frame-high: 0

Signed-off-by: Kan Liang 
---
 ethtool.8.in |  2 +-
 ethtool.c| 58 +-
 2 files changed, 58 insertions(+), 2 deletions(-)

diff --git a/ethtool.8.in b/ethtool.8.in
index 210ec8c..0e42180 100644
--- a/ethtool.8.in
+++ b/ethtool.8.in
@@ -937,7 +937,7 @@ Sets the specific queues which the sub command is applied 
to.
 If queue_mask is not set, the sub command will be applied to all queues.
 .TP
 .B sub_command
-Sets the sub command. The supported sub commands include --show-coalesce.
+Sets the sub command. The supported sub commands include --show-coalesce and 
--coalesce.
 .RE
 .SH BUGS
 Not supported (in part or whole) on all network drivers.
diff --git a/ethtool.c b/ethtool.c
index a966bf8..55ba26c 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -4222,7 +4222,7 @@ static const struct option {
  " [ tx-lpi on|off ]\n"
  " [ tx-timer %d ]\n"},
{ "--set-perqueue-command", 1, do_perqueue, "Set per queue command. "
- "The supported sub commands include --show-coalesce",
+ "The supported sub commands include --show-coalesce, --coalesce",
  " [queue_mask %x] SUB_COMMAND\n"},
{ "-h|--help", 0, show_usage, "Show this help" },
{ "--version", 0, do_version, "Show version number" },
@@ -4348,6 +4348,52 @@ get_per_queue_coalesce(struct cmd_context *ctx,
return per_queue_opt;
 }
 
+static void __set_per_queue_coalesce(int queue)
+{
+   int changed = 0;
+
+   do_generic_set(cmdline_coalesce, ARRAY_SIZE(cmdline_coalesce),
+  &changed);
+
+   if (!changed)
+   fprintf(stderr, "Queue %d, no coalesce parameters changed\n", 
queue);
+}
+
+static void set_per_queue_coalesce(struct cmd_context *ctx,
+  struct ethtool_per_queue_op *per_queue_opt)
+{
+   __u32 *queue_mask = per_queue_opt->queue_mask;
+   char *addr = (char *)per_queue_opt + sizeof(*per_queue_opt);
+   int gcoalesce_changed = 0;
+   int i;
+
+   parse_generic_cmdline(ctx, &gcoalesce_changed,
+ cmdline_coalesce, ARRAY_SIZE(cmdline_coalesce));
+
+   for (i = 0; i < __KERNEL_DIV_ROUND_UP(MAX_NUM_QUEUE, 32); i++) {
+   int queue = i * 32;
+   __u32 mask = queue_mask[i];
+
+   while (mask > 0) {
+   if (mask & 0x1) {
+   memcpy(&s_ecoal, addr, sizeof(struct 
ethtool_coalesce));
+   __set_per_queue_coalesce(queue);
+   memcpy(addr, &s_ecoal, sizeof(struct 
ethtool_coalesce));
+   addr += sizeof(struct ethtool_coalesce);
+   }
+   mask = mask >> 1;
+   queue++;
+   }
+   }
+
+   per_queue_opt->cmd = ETHTOOL_PERQUEUE;
+   per_queue_opt->sub_command = ETHTOOL_SCOALESCE;
+
+   if (send_ioctl(ctx, per_queue_opt))
+   perror("Cannot set device per queue parameters");
+
+}
+
 static int do_perqueue(struct cmd_context *ctx)
 {
struct ethtool_per_queue_op *per_queue_opt;
@@ -4397,6 +4443,16 @@ static int do_perqueue(struct cmd_context *ctx)
}
dump_per_queue_coalesce(per_queue_opt, queue_mask);
free(per_queue_opt);
+   } else if (strstr(args[i].opts, "--coalesce") != NULL) {
+   ctx->argc--;
+   ctx->argp++;
+   per_queue_opt = get_per_queue_coalesce(ctx, queue_mask, 
n_queues);
+   if (per_queue_opt == NULL) {
+   perror("Cannot get device per queue parameters");
+   return -EFAULT;
+   }
+   set_per_queue_coalesce(ctx, per_queue_opt);
+   free(per_queue_opt);
} else {
perror("The subcommand is not supported yet");
return -EOPNOTSUPP;
-- 
2.5.0



[PATCH 3/5] ethtool: introduce new ioctl for per queue setting

2016-05-15 Thread kan . liang
From: Kan Liang 

Introduce a new ioctl for per-queue parameter setting.
Users can apply commands to specific queues by setting SUB_COMMAND and
queue_mask as following command.

 ethtool --set-perqueue-command DEVNAME [queue_mask %x] SUB_COMMAND

If queue_mask is not set, the SUB_COMMAND will be applied to all queues.

The following patches will enable SUB_COMMANDs for per queue setting.

Signed-off-by: Kan Liang 
---
 ethtool.8.in |  19 
 ethtool.c| 100 +++
 2 files changed, 119 insertions(+)

diff --git a/ethtool.8.in b/ethtool.8.in
index 009711d..26d01cb 100644
--- a/ethtool.8.in
+++ b/ethtool.8.in
@@ -339,6 +339,13 @@ ethtool \- query or control network driver and hardware settings
 .B2 tx-lpi on off
 .BN tx-timer
 .BN advertise
+.HP
+.B ethtool \-\-set\-perqueue\-command
+.I devname
+.RB [ queue_mask
+.IR %x ]
+.I sub_command
+.RB ...
 .
 .\" Adjust lines (i.e. full justification) and hyphenate.
 .ad
@@ -920,6 +927,18 @@ Values are as for
 Sets the amount of time the device should stay in idle mode prior to asserting
 its Tx LPI (in microseconds). This has meaning only when Tx LPI is enabled.
 .RE
+.TP
+.B \-\-set\-perqueue\-command
+Sets sub command to specific queues.
+.RS 4
+.TP
+.B queue_mask %x
+Sets the specific queues which the sub command is applied to.
+If queue_mask is not set, the sub command will be applied to all queues.
+.TP
+.B sub_command
+Sets the sub command.
+.RE
 .SH BUGS
 Not supported (in part or whole) on all network drivers.
 .SH AUTHOR
diff --git a/ethtool.c b/ethtool.c
index 86724a2..ba741f0 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -4037,6 +4037,8 @@ static int do_seee(struct cmd_context *ctx)
return 0;
 }
 
+static int do_perqueue(struct cmd_context *ctx);
+
 #ifndef TEST_ETHTOOL
 int send_ioctl(struct cmd_context *ctx, void *cmd)
 {
@@ -4196,6 +4198,8 @@ static const struct option {
  " [ advertise %x ]\n"
  " [ tx-lpi on|off ]\n"
  " [ tx-timer %d ]\n"},
+   { "--set-perqueue-command", 1, do_perqueue, "Set per queue command",
+ " [queue_mask %x] SUB_COMMAND\n"},
{ "-h|--help", 0, show_usage, "Show this help" },
{ "--version", 0, do_version, "Show version number" },
{}
@@ -4247,6 +4251,102 @@ static int find_option(int argc, char **argp)
return -1;
 }
 
+static int set_queue_mask(u32 *queue_mask, char *str)
+{
+   int len = strlen(str);
+   int index = __KERNEL_DIV_ROUND_UP(len * 4, 32);
+   char tmp[9];
+   char *end = str + len;
+   int i, num;
+   __u32 mask;
+   int n_queues = 0;
+
+   if (len > MAX_NUM_QUEUE)
+   return -EINVAL;
+
+   for (i = 0; i < index; i++) {
+   num = end - str;
+
+   if (num >= 8) {
+   end -= 8;
+   num = 8;
+   } else {
+   end = str;
+   }
+   strncpy(tmp, end, num);
+   tmp[num] = '\0';
+
+   queue_mask[i] = strtoul(tmp, NULL, 16);
+
+   mask = queue_mask[i];
+   while (mask > 0) {
+   if (mask & 0x1)
+   n_queues++;
+   mask = mask >> 1;
+   }
+   }
+
+   return n_queues;
+}
+
+#define MAX(x, y) (x > y ? x : y)
+
+static int find_max_num_queues(struct cmd_context *ctx)
+{
+   struct ethtool_channels echannels;
+
+   echannels.cmd = ETHTOOL_GCHANNELS;
+   if (send_ioctl(ctx, &echannels))
+   return -1;
+
+   return MAX(MAX(echannels.rx_count, echannels.tx_count), echannels.combined_count);
+}
+
+static int do_perqueue(struct cmd_context *ctx)
+{
+   __u32 queue_mask[__KERNEL_DIV_ROUND_UP(MAX_NUM_QUEUE, 32)] = {0};
+   int i, n_queues = 0;
+
+   if (ctx->argc == 0)
+   exit_bad_args();
+
+   /*
+* The sub commands will be applied to
+* all queues if no queue_mask set
+*/
+   if (strncmp(*ctx->argp, "queue_mask", 10)) {
+   n_queues = find_max_num_queues(ctx);
+   if (n_queues < 0) {
+   perror("Cannot get number of queues");
+   return -EFAULT;
+   }
+   for (i = 0; i < n_queues / 32; i++)
+   queue_mask[i] = ~0;
+   queue_mask[i] = (1 << (n_queues - i * 32)) - 1;
+   fprintf(stdout, "The sub commands will be applied"
+   " to all %d queues\n", n_queues);
+   } else {
+   ctx->argc--;
+   ctx->argp++;
+   n_queues = set_queue_mask(queue_mask, *ctx->argp);
+   if (n_queues < 0) {
+   perror("Invalid queue mask");
+   return n_queues;
+   }
+   ctx->argc--;
+   

[PATCH] net/hsr: Use setup_timer and mod_timer.

2016-05-15 Thread Muhammad Falak R Wani
The function setup_timer combines the initialization of a timer with the
initialization of the timer's function and data fields. The multiline
code for timer initialization is now replaced with setup_timer.

Also, quoting the mod_timer() function comment:
-> mod_timer() is a more efficient way to update the expire field of an
   active timer (if the timer is inactive it will be activated).

Use setup_timer() and mod_timer() to set up and arm a timer, making the
code compact and aiding readability.

Signed-off-by: Muhammad Falak R Wani 
---
 net/hsr/hsr_device.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/net/hsr/hsr_device.c b/net/hsr/hsr_device.c
index 386cbce..16737cd 100644
--- a/net/hsr/hsr_device.c
+++ b/net/hsr/hsr_device.c
@@ -461,13 +461,9 @@ int hsr_dev_finalize(struct net_device *hsr_dev, struct net_device *slave[2],
hsr->sequence_nr = HSR_SEQNR_START;
hsr->sup_sequence_nr = HSR_SUP_SEQNR_START;
 
-   init_timer(&hsr->announce_timer);
-   hsr->announce_timer.function = hsr_announce;
-   hsr->announce_timer.data = (unsigned long) hsr;
+   setup_timer(&hsr->announce_timer, hsr_announce, (unsigned long)hsr);
 
-   init_timer(&hsr->prune_timer);
-   hsr->prune_timer.function = hsr_prune_nodes;
-   hsr->prune_timer.data = (unsigned long) hsr;
+   setup_timer(&hsr->prune_timer, hsr_prune_nodes, (unsigned long)hsr);
 
ether_addr_copy(hsr->sup_multicast_addr, def_multicast_addr);
hsr->sup_multicast_addr[ETH_ALEN - 1] = multicast_spec;
@@ -502,8 +498,7 @@ int hsr_dev_finalize(struct net_device *hsr_dev, struct net_device *slave[2],
if (res)
goto fail;
 
-   hsr->prune_timer.expires = jiffies + msecs_to_jiffies(PRUNE_PERIOD);
-   add_timer(&hsr->prune_timer);
+   mod_timer(&hsr->prune_timer, jiffies + msecs_to_jiffies(PRUNE_PERIOD));
 
return 0;
 
-- 
1.9.1



Re: [PATCH] ethernet:arc: Fix racing of TX ring buffer

2016-05-15 Thread Shuyu Wei
On Sun, May 15, 2016 at 11:19:53AM +0200, Francois Romieu wrote:
> 
> static void arc_emac_tx_clean(struct net_device *ndev)
> {
> [...]
> for (i = 0; i < TX_BD_NUM; i++) {
unsigned int *txbd_dirty = &priv->txbd_dirty;
struct arc_emac_bd *txbd = &priv->txbd[*txbd_dirty];
struct buffer_state *tx_buff = &priv->tx_buff[*txbd_dirty];
> struct sk_buff *skb = tx_buff->skb;
> unsigned int info = le32_to_cpu(txbd->info);
> 
> if ((info & FOR_EMAC) || !txbd->data || !skb)
> break;
> ^
> 
> -> the "break" statement prevents reading all txbds. At most one extra
>descriptor is read and this driver isn't in the Mpps business.
> 

You are right, I forgot the break statement.

> > I tried your advice, Tx throughput can only reach 5.52MB/s.
> 
> Even with the original code above ?

Yes, I left tx_clean unmodified and took your advice below.
I tested it again just now; this time throughput does reach 9.8MB/s.
Maybe last time the CPU was not idle.

I still have a question: is it possible that tx_clean() runs
between   priv->tx_buff[*txbd_curr].skb = skb   and   dma_wmb()?

--- a/drivers/net/ethernet/arc/emac_main.c
+++ b/drivers/net/ethernet/arc/emac_main.c
@@ -685,13 +685,15 @@ static int arc_emac_tx(struct sk_buff *skb, struct net_device *ndev)
wmb();
 
skb_tx_timestamp(skb);
+   priv->tx_buff[*txbd_curr].skb = skb;
+
+   dma_wmb();
 
*info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);
 
/* Make sure info word is set */
wmb();
 
-   priv->tx_buff[*txbd_curr].skb = skb;
 
/* Increment index to point to the next BD */
*txbd_curr = (*txbd_curr + 1) % TX_BD_NUM;



[PATCH net-next 3/5] qed: Reset link on IOV disable

2016-05-15 Thread Yuval Mintz
From: Manish Chopra 

PF updates its VFs' bulletin boards with link configurations whenever
the physical carrier changes or whenever the hypervisor user explicitly
configures the VFs' link via the hypervisor's PF.

Since the bulletin board is getting cleaned as part of the IOV disable
flow on the PF side, re-enabling SR-IOV would lead to a VF that sees the
carrier as 'down' until an event occurs that causes the PF to re-fill the
bulletin with the link configuration.

To fix this, we simply reflect the link state during these flows, giving
the later VFs a default that matches the PF's link state.

Signed-off-by: Manish Chopra 
Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/qlogic/qed/qed_sriov.c | 90 -
 1 file changed, 51 insertions(+), 39 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_sriov.c b/drivers/net/ethernet/qlogic/qed/qed_sriov.c
index 7b6b4a0..a977f39 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_sriov.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_sriov.c
@@ -806,9 +806,51 @@ static int qed_iov_init_hw_for_vf(struct qed_hwfn *p_hwfn,
return rc;
 }
 
+static void qed_iov_set_link(struct qed_hwfn *p_hwfn,
+u16 vfid,
+struct qed_mcp_link_params *params,
+struct qed_mcp_link_state *link,
+struct qed_mcp_link_capabilities *p_caps)
+{
+   struct qed_vf_info *p_vf = qed_iov_get_vf_info(p_hwfn,
+  vfid,
+  false);
+   struct qed_bulletin_content *p_bulletin;
+
+   if (!p_vf)
+   return;
+
+   p_bulletin = p_vf->bulletin.p_virt;
+   p_bulletin->req_autoneg = params->speed.autoneg;
+   p_bulletin->req_adv_speed = params->speed.advertised_speeds;
+   p_bulletin->req_forced_speed = params->speed.forced_speed;
+   p_bulletin->req_autoneg_pause = params->pause.autoneg;
+   p_bulletin->req_forced_rx = params->pause.forced_rx;
+   p_bulletin->req_forced_tx = params->pause.forced_tx;
+   p_bulletin->req_loopback = params->loopback_mode;
+
+   p_bulletin->link_up = link->link_up;
+   p_bulletin->speed = link->speed;
+   p_bulletin->full_duplex = link->full_duplex;
+   p_bulletin->autoneg = link->an;
+   p_bulletin->autoneg_complete = link->an_complete;
+   p_bulletin->parallel_detection = link->parallel_detection;
+   p_bulletin->pfc_enabled = link->pfc_enabled;
+   p_bulletin->partner_adv_speed = link->partner_adv_speed;
+   p_bulletin->partner_tx_flow_ctrl_en = link->partner_tx_flow_ctrl_en;
+   p_bulletin->partner_rx_flow_ctrl_en = link->partner_rx_flow_ctrl_en;
+   p_bulletin->partner_adv_pause = link->partner_adv_pause;
+   p_bulletin->sfp_tx_fault = link->sfp_tx_fault;
+
+   p_bulletin->capability_speed = p_caps->speed_capabilities;
+}
+
 static int qed_iov_release_hw_for_vf(struct qed_hwfn *p_hwfn,
 struct qed_ptt *p_ptt, u16 rel_vf_id)
 {
+   struct qed_mcp_link_capabilities caps;
+   struct qed_mcp_link_params params;
+   struct qed_mcp_link_state link;
struct qed_vf_info *vf = NULL;
int rc = 0;
 
@@ -823,6 +865,15 @@ static int qed_iov_release_hw_for_vf(struct qed_hwfn *p_hwfn,
 
memset(&vf->p_vf_info, 0, sizeof(vf->p_vf_info));
 
+   /* Get the link configuration back in bulletin so
+* that when VFs are re-enabled they get the actual
+* link configuration.
+*/
+   memcpy(&params, qed_mcp_get_link_params(p_hwfn), sizeof(params));
+   memcpy(&link, qed_mcp_get_link_state(p_hwfn), sizeof(link));
+   memcpy(&caps, qed_mcp_get_link_capabilities(p_hwfn), sizeof(caps));
+   qed_iov_set_link(p_hwfn, rel_vf_id, &params, &link, &caps);
+
if (vf->state != VF_STOPPED) {
/* Stopping the VF */
rc = qed_sp_vf_stop(p_hwfn, vf->concrete_fid, vf->opaque_fid);
@@ -2542,45 +2593,6 @@ int qed_iov_mark_vf_flr(struct qed_hwfn *p_hwfn, u32 *p_disabled_vfs)
return found;
 }
 
-void qed_iov_set_link(struct qed_hwfn *p_hwfn,
- u16 vfid,
- struct qed_mcp_link_params *params,
- struct qed_mcp_link_state *link,
- struct qed_mcp_link_capabilities *p_caps)
-{
-   struct qed_vf_info *p_vf = qed_iov_get_vf_info(p_hwfn,
-  vfid,
-  false);
-   struct qed_bulletin_content *p_bulletin;
-
-   if (!p_vf)
-   return;
-
-   p_bulletin = p_vf->bulletin.p_virt;
-   p_bulletin->req_autoneg = params->speed.autoneg;
-   p_bulletin->req_adv_speed = params->speed.advertised_speeds;
-   p_bulletin->req_forced_speed = params->speed.forced_speed;
-   

[PATCH net-next 0/5] qed: IOV enhancements and fixups

2016-05-15 Thread Yuval Mintz
Hi Dave,

This is a follow-up on the recent patch series that adds SR-IOV support
to qed. All content here is iov-related fixups [nothing terminal] and
enhancements.

Please consider applying this series to `net-next'.

Thanks,
Yuval

Manish Chopra (1):
  qed: Reset link on IOV disable

Yuval Mintz (4):
  qed: Correct PF-sanity check
  qed: Improve VF interrupt reset
  qed: Allow more than 16 VFs
  qed: VFs gracefully accept lack of PM

 drivers/net/ethernet/qlogic/qed/qed_int.c  |  59 ++--
 drivers/net/ethernet/qlogic/qed/qed_int.h  |  20 +
 drivers/net/ethernet/qlogic/qed/qed_main.c |   2 +-
 drivers/net/ethernet/qlogic/qed/qed_reg_addr.h |   2 +
 drivers/net/ethernet/qlogic/qed/qed_sriov.c| 119 +
 5 files changed, 99 insertions(+), 103 deletions(-)

-- 
1.9.3



[PATCH net-next 4/5] qed: Allow more than 16 VFs

2016-05-15 Thread Yuval Mintz
In multi-function modes, PFs are currently limited to using 16 VFs, but
that limitation also currently applies when only a single PCI function is
exposed, where no such restriction should exist.

This lifts the restriction for the default mode; the user should be able
to start the maximum number of VFs that appears in the PCI config space.

Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/qlogic/qed/qed_sriov.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_sriov.c b/drivers/net/ethernet/qlogic/qed/qed_sriov.c
index a977f39..c325ee8 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_sriov.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_sriov.c
@@ -3099,6 +3099,9 @@ static int qed_sriov_enable(struct qed_dev *cdev, int num)
goto err;
}
 
+   if (IS_MF_DEFAULT(hwfn))
+   limit = MAX_NUM_VFS_BB / hwfn->num_funcs_on_engine;
+
memset(&sb_cnt_info, 0, sizeof(sb_cnt_info));
qed_int_get_num_sbs(hwfn, &sb_cnt_info);
num_sbs = min_t(int, sb_cnt_info.sb_free_blk, limit);
-- 
1.9.3



[PATCH net-next 2/5] qed: Improve VF interrupt reset

2016-05-15 Thread Yuval Mintz
During the FLR flow, we need to make sure the HW is no longer capable of
writing to host memory as part of its interrupt mechanisms.
While we're at it, unify the logic cleaning the driver's status-blocks
into using a single API function for both PFs and VFs.

Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/qlogic/qed/qed_int.c  | 59 ++
 drivers/net/ethernet/qlogic/qed/qed_int.h  | 20 +
 drivers/net/ethernet/qlogic/qed/qed_reg_addr.h |  2 +
 drivers/net/ethernet/qlogic/qed/qed_sriov.c| 20 +++--
 4 files changed, 41 insertions(+), 60 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_int.c b/drivers/net/ethernet/qlogic/qed/qed_int.c
index bbecfa5..09a6ad3 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_int.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_int.c
@@ -2805,20 +2805,13 @@ void qed_int_igu_disable_int(struct qed_hwfn *p_hwfn,
 }
 
 #define IGU_CLEANUP_SLEEP_LENGTH(1000)
-void qed_int_igu_cleanup_sb(struct qed_hwfn *p_hwfn,
-   struct qed_ptt *p_ptt,
-   u32 sb_id,
-   bool cleanup_set,
-   u16 opaque_fid
-   )
+static void qed_int_igu_cleanup_sb(struct qed_hwfn *p_hwfn,
+  struct qed_ptt *p_ptt,
+  u32 sb_id, bool cleanup_set, u16 opaque_fid)
 {
+   u32 cmd_ctrl = 0, val = 0, sb_bit = 0, sb_bit_addr = 0, data = 0;
u32 pxp_addr = IGU_CMD_INT_ACK_BASE + sb_id;
u32 sleep_cnt = IGU_CLEANUP_SLEEP_LENGTH;
-   u32 data = 0;
-   u32 cmd_ctrl = 0;
-   u32 val = 0;
-   u32 sb_bit = 0;
-   u32 sb_bit_addr = 0;
 
/* Set the data field */
SET_FIELD(data, IGU_CLEANUP_CLEANUP_SET, cleanup_set ? 1 : 0);
@@ -2863,11 +2856,9 @@ void qed_int_igu_cleanup_sb(struct qed_hwfn *p_hwfn,
 
 void qed_int_igu_init_pure_rt_single(struct qed_hwfn *p_hwfn,
 struct qed_ptt *p_ptt,
-u32 sb_id,
-u16 opaque,
-bool b_set)
+u32 sb_id, u16 opaque, bool b_set)
 {
-   int pi;
+   int pi, i;
 
/* Set */
if (b_set)
@@ -2876,6 +2867,22 @@ void qed_int_igu_init_pure_rt_single(struct qed_hwfn *p_hwfn,
/* Clear */
qed_int_igu_cleanup_sb(p_hwfn, p_ptt, sb_id, 0, opaque);
 
+   /* Wait for the IGU SB to cleanup */
+   for (i = 0; i < IGU_CLEANUP_SLEEP_LENGTH; i++) {
+   u32 val;
+
+   val = qed_rd(p_hwfn, p_ptt,
+IGU_REG_WRITE_DONE_PENDING + ((sb_id / 32) * 4));
+   if (val & (1 << (sb_id % 32)))
+   usleep_range(10, 20);
+   else
+   break;
+   }
+   if (i == IGU_CLEANUP_SLEEP_LENGTH)
+   DP_NOTICE(p_hwfn,
+ "Failed SB[0x%08x] still appearing in WRITE_DONE_PENDING\n",
+ sb_id);
+
/* Clear the CAU for the SB */
for (pi = 0; pi < 12; pi++)
qed_wr(p_hwfn, p_ptt,
@@ -2884,13 +2891,11 @@ void qed_int_igu_init_pure_rt_single(struct qed_hwfn *p_hwfn,
 
 void qed_int_igu_init_pure_rt(struct qed_hwfn *p_hwfn,
  struct qed_ptt *p_ptt,
- bool b_set,
- bool b_slowpath)
+ bool b_set, bool b_slowpath)
 {
u32 igu_base_sb = p_hwfn->hw_info.p_igu_info->igu_base_sb;
u32 igu_sb_cnt = p_hwfn->hw_info.p_igu_info->igu_sb_cnt;
-   u32 sb_id = 0;
-   u32 val = 0;
+   u32 sb_id = 0, val = 0;
 
val = qed_rd(p_hwfn, p_ptt, IGU_REG_BLOCK_CONFIGURATION);
val |= IGU_REG_BLOCK_CONFIGURATION_VF_CLEANUP_EN;
@@ -2906,14 +2911,14 @@ void qed_int_igu_init_pure_rt(struct qed_hwfn *p_hwfn,
p_hwfn->hw_info.opaque_fid,
b_set);
 
-   if (b_slowpath) {
-   sb_id = p_hwfn->hw_info.p_igu_info->igu_dsb_id;
-   DP_VERBOSE(p_hwfn, NETIF_MSG_INTR,
-  "IGU cleaning slowpath SB [%d]\n", sb_id);
-   qed_int_igu_init_pure_rt_single(p_hwfn, p_ptt, sb_id,
-   p_hwfn->hw_info.opaque_fid,
-   b_set);
-   }
+   if (!b_slowpath)
+   return;
+
+   sb_id = p_hwfn->hw_info.p_igu_info->igu_dsb_id;
+   DP_VERBOSE(p_hwfn, NETIF_MSG_INTR,
+  "IGU cleaning slowpath SB [%d]\n", sb_id);
+   qed_int_igu_init_pure_rt_single(p_hwfn, p_ptt, sb_id,
+   p_hwfn->hw_info.opaque_fid, b_set);
 }
 
 static u32 qed_int_igu_read_cam_block(struct qed_hwfn  *p_hwfn,
diff --git 

[PATCH net-next 5/5] qed: VFs gracefully accept lack of PM

2016-05-15 Thread Yuval Mintz
A VF's probe might log that it has no PM capability in its PCI configuration
space. As this is a valid configuration, silence such prints.

Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/qlogic/qed/qed_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_main.c b/drivers/net/ethernet/qlogic/qed/qed_main.c
index 6ffc21d..56f6bc1 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_main.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_main.c
@@ -158,7 +158,7 @@ static int qed_init_pci(struct qed_dev *cdev,
}
 
cdev->pci_params.pm_cap = pci_find_capability(pdev, PCI_CAP_ID_PM);
-   if (cdev->pci_params.pm_cap == 0)
+   if (IS_PF(cdev) && !cdev->pci_params.pm_cap)
DP_NOTICE(cdev, "Cannot find power management capability\n");
 
rc = qed_set_coherency_mask(cdev);
-- 
1.9.3



[PATCH net-next 1/5] qed: Correct PF-sanity check

2016-05-15 Thread Yuval Mintz
Seems like something broke in commit 1408cc1fa48c ("qed: Introduce VFs")
and the function no longer verifies that the VF is indeed a valid one.

Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/qlogic/qed/qed_sriov.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_sriov.c b/drivers/net/ethernet/qlogic/qed/qed_sriov.c
index d4df406..2c4f9b0 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_sriov.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_sriov.c
@@ -476,12 +476,12 @@ int qed_iov_hw_info(struct qed_hwfn *p_hwfn)
 static bool qed_iov_pf_sanity_check(struct qed_hwfn *p_hwfn, int vfid)
 {
/* Check PF supports sriov */
-   if (!IS_QED_SRIOV(p_hwfn->cdev) || !IS_PF_SRIOV_ALLOC(p_hwfn))
+   if (IS_VF(p_hwfn->cdev) || !IS_QED_SRIOV(p_hwfn->cdev) ||
+   !IS_PF_SRIOV_ALLOC(p_hwfn))
return false;
 
/* Check VF validity */
-   if (IS_VF(p_hwfn->cdev) || !IS_QED_SRIOV(p_hwfn->cdev) ||
-   !IS_PF_SRIOV_ALLOC(p_hwfn))
+   if (!qed_iov_is_valid_vfid(p_hwfn, vfid, true))
return false;
 
return true;
-- 
1.9.3



Re: [PATCH] ethernet:arc: Fix racing of TX ring buffer

2016-05-15 Thread Francois Romieu
Shuyu Wei  :
[...]
> I don't think taking txbd_curr and txbd_dirty only as hints is a good idea.
> That could be a big waste, since tx_clean have to go through all the txbds.

Sorry if my point was not clear: arc_emac_tx_clean() does not need
to change (at least not for the reason given in the commit message) :o)

Current code:

static void arc_emac_tx_clean(struct net_device *ndev)
{
[...]
for (i = 0; i < TX_BD_NUM; i++) {
unsigned int *txbd_dirty = &priv->txbd_dirty;
struct arc_emac_bd *txbd = &priv->txbd[*txbd_dirty];
struct buffer_state *tx_buff = &priv->tx_buff[*txbd_dirty];
struct sk_buff *skb = tx_buff->skb;
unsigned int info = le32_to_cpu(txbd->info);

if ((info & FOR_EMAC) || !txbd->data || !skb)
break;
^

-> the "break" statement prevents reading all txbds. At most one extra
   descriptor is read and this driver isn't in the Mpps business.

> I tried your advice, Tx throughput can only reach 5.52MB/s.

Even with the original code above ?

> Leaving one sent packet in tx_clean is acceptable if we respect to txbd_curr
> and txbd_dirty, since the ignored packet will be cleaned when new packets
> arrive.

There is no reason to leave a tx packet rotting in the first place. Really.
I doubt it would help BQL, for one.

A packet may rot because of unexpected hardware behavior, and the driver should
cope with it when it is diagnosed, sure. However, you don't want the driver
to open its own unbounded window. Next packet: 10 us, 10 ms, 10 s?

-- 
Ueimor


Re: What ixgbe devices support HWTSTAMP_FILTER_ALL for hardware time stamping?

2016-05-15 Thread Richard Cochran
On Sat, May 14, 2016 at 07:11:50PM -0700, Guy Harris wrote:
> It could do a combination of #2 and #1, where "offers all
> possibilities" is replaced by "opens the adapter, tries each of the
> possibilities, and offers the ones that don't fail" - but, other
> than the current bugs with ETHTOOL_GET_TS_INFO, I don't see any
> advantage to doing only #1, rather than trying #2, perhaps with some
> special-casing to work around the bugs in question, and only falling
> back on actually trying to set the options if we can't ask about
> them.

Right.

Regarding the drivers you found, even if you can't patch them
yourself, please write the maintainers directly with me and netdev on
CC in order to bring attention to the bugs.

The get_maintainer.pl script can help identify the maintainers.

Thanks,
Richard



[PATCH net] net/mlx4_core: Fix access to uninitialized index

2016-05-15 Thread Tariq Toukan
Prevent using uninitialized or negative index when handling
steering entries.

Fixes: b12d93d63c32 ('mlx4: Add support for promiscuous mode in the new steering model.')
Signed-off-by: Tariq Toukan 
Reported-by: Dan Carpenter 
---
 drivers/net/ethernet/mellanox/mlx4/mcg.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/mcg.c b/drivers/net/ethernet/mellanox/mlx4/mcg.c
index 6aa7397..f2d0920 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mcg.c
+++ b/drivers/net/ethernet/mellanox/mlx4/mcg.c
@@ -1102,7 +1102,7 @@ int mlx4_qp_attach_common(struct mlx4_dev *dev, struct mlx4_qp *qp, u8 gid[16],
struct mlx4_cmd_mailbox *mailbox;
struct mlx4_mgm *mgm;
u32 members_count;
-   int index, prev;
+   int index = -1, prev;
int link = 0;
int i;
int err;
@@ -1181,7 +1181,7 @@ int mlx4_qp_attach_common(struct mlx4_dev *dev, struct mlx4_qp *qp, u8 gid[16],
goto out;
 
 out:
-   if (prot == MLX4_PROT_ETH) {
+   if (prot == MLX4_PROT_ETH && index != -1) {
/* manage the steering entry for promisc mode */
if (new_entry)
err = new_steering_entry(dev, port, steer,
-- 
1.8.3.1






[PATCH net-next v2 2/9] bnxt_en: Add Support for ETHTOOL_GMODULEINFO and ETHTOOL_GMODULEEEPROM

2016-05-15 Thread Michael Chan
From: Ajit Khaparde 

Add support to fetch the SFP EEPROM settings from the firmware
and display them via the ethtool -m command.  We support SFP+ and QSFP
modules.

v2: Fixed a bug in bnxt_get_module_eeprom() found by Ben Hutchings.

Signed-off-by: Ajit Khaparde 
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |   1 +
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |  11 ++
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 121 ++
 drivers/net/ethernet/broadcom/bnxt/bnxt_hsi.h |  34 ++
 4 files changed, 167 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 6a5a717..59b2e36 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -4671,6 +4671,7 @@ static int bnxt_update_link(struct bnxt *bp, bool chng_link_state)
link_info->transceiver = resp->xcvr_pkg_type;
link_info->phy_addr = resp->eee_config_phy_addr &
  PORT_PHY_QCFG_RESP_PHY_ADDR_MASK;
+   link_info->module_status = resp->module_status;
 
if (bp->flags & BNXT_FLAG_EEE_CAP) {
struct ethtool_eee *eee = &bp->eee;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 6289635..355843b 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -829,6 +829,7 @@ struct bnxt_link_info {
u16 lp_auto_link_speeds;
u16 force_link_speed;
u32 preemphasis;
+   u8  module_status;
 
/* copy of requested setting from ethtool cmd */
u8  autoneg;
@@ -1121,6 +1122,16 @@ static inline void bnxt_disable_poll(struct bnxt_napi *bnapi)
 
 #endif
 
+#define I2C_DEV_ADDR_A00xa0
+#define I2C_DEV_ADDR_A20xa2
+#define SFP_EEPROM_SFF_8472_COMP_ADDR  0x5e
+#define SFP_EEPROM_SFF_8472_COMP_SIZE  1
+#define SFF_MODULE_ID_SFP  0x3
+#define SFF_MODULE_ID_QSFP 0xc
+#define SFF_MODULE_ID_QSFP_PLUS0xd
+#define SFF_MODULE_ID_QSFP28   0x11
+#define BNXT_MAX_PHY_I2C_RESP_SIZE 64
+
 void bnxt_set_ring_params(struct bnxt *);
 void bnxt_hwrm_cmd_hdr_init(struct bnxt *, void *, u16, u16, u16);
 int _hwrm_send_message(struct bnxt *, void *, u32, int);
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index 28171f9..a38cb04 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -1498,6 +1498,125 @@ static int bnxt_get_eee(struct net_device *dev, struct ethtool_eee *edata)
return 0;
 }
 
+static int bnxt_read_sfp_module_eeprom_info(struct bnxt *bp, u16 i2c_addr,
+   u16 page_number, u16 start_addr,
+   u16 data_length, u8 *buf)
+{
+   struct hwrm_port_phy_i2c_read_input req = {0};
+   struct hwrm_port_phy_i2c_read_output *output = bp->hwrm_cmd_resp_addr;
+   int rc, byte_offset = 0;
+
+   bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_PORT_PHY_I2C_READ, -1, -1);
+   req.i2c_slave_addr = i2c_addr;
+   req.page_number = cpu_to_le16(page_number);
+   req.port_id = cpu_to_le16(bp->pf.port_id);
+   do {
+   u16 xfer_size;
+
+   xfer_size = min_t(u16, data_length, BNXT_MAX_PHY_I2C_RESP_SIZE);
+   data_length -= xfer_size;
+   req.page_offset = cpu_to_le16(start_addr + byte_offset);
+   req.data_length = xfer_size;
+   req.enables = cpu_to_le32(start_addr + byte_offset ?
+PORT_PHY_I2C_READ_REQ_ENABLES_PAGE_OFFSET : 0);
+   mutex_lock(&bp->hwrm_cmd_lock);
+   rc = _hwrm_send_message(bp, &req, sizeof(req),
+   HWRM_CMD_TIMEOUT);
+   if (!rc)
+   memcpy(buf + byte_offset, output->data, xfer_size);
+   mutex_unlock(&bp->hwrm_cmd_lock);
+   byte_offset += xfer_size;
+   } while (!rc && data_length > 0);
+
+   return rc;
+}
+
+static int bnxt_get_module_info(struct net_device *dev,
+   struct ethtool_modinfo *modinfo)
+{
+   struct bnxt *bp = netdev_priv(dev);
+   struct hwrm_port_phy_i2c_read_input req = {0};
+   struct hwrm_port_phy_i2c_read_output *output = bp->hwrm_cmd_resp_addr;
+   int rc;
+
+   /* No point in going further if phy status indicates
+* module is not inserted or if it is powered down or
+* if it is of type 10GBase-T
+*/
+   if 

[PATCH net-next v2 6/9] bnxt_en: Fix length value in dmesg log firmware error message.

2016-05-15 Thread Michael Chan
The len value in the hwrm error message is wrong.  Use the properly adjusted
value in the variable len.

Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index d33b20f..0a83fd8 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -2774,7 +2774,7 @@ static int bnxt_hwrm_do_send_msg(struct bnxt *bp, void *msg, u32 msg_len,
if (i >= tmo_count) {
netdev_err(bp->dev, "Error (timeout: %d) msg {0x%x 0x%x} len:%d\n",
   timeout, le16_to_cpu(req->req_type),
-  le16_to_cpu(req->seq_id), *resp_len);
+  le16_to_cpu(req->seq_id), len);
return -1;
}
 
-- 
1.8.3.1



[PATCH net-next v2 3/9] bnxt_en: Report PCIe link speed and width during driver load

2016-05-15 Thread Michael Chan
From: Ajit Khaparde 

Add code to log a message during driver load indicating PCIe link
speed and width.

The log message will look like this:
bnxt_en :86:00.0 eth0: PCIe: Speed 8.0GT/s Width x8

Signed-off-by: Ajit Khaparde 
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 59b2e36..ba0c3e5 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -6198,6 +6198,22 @@ static int bnxt_set_dflt_rings(struct bnxt *bp)
return rc;
 }
 
+static void bnxt_parse_log_pcie_link(struct bnxt *bp)
+{
+   enum pcie_link_width width = PCIE_LNK_WIDTH_UNKNOWN;
+   enum pci_bus_speed speed = PCI_SPEED_UNKNOWN;
+
+   if (pcie_get_minimum_link(bp->pdev, &speed, &width) ||
+   speed == PCI_SPEED_UNKNOWN || width == PCIE_LNK_WIDTH_UNKNOWN)
+   netdev_info(bp->dev, "Failed to determine PCIe Link Info\n");
+   else
+   netdev_info(bp->dev, "PCIe: Speed %s Width x%d\n",
+   speed == PCIE_SPEED_2_5GT ? "2.5GT/s" :
+   speed == PCIE_SPEED_5_0GT ? "5.0GT/s" :
+   speed == PCIE_SPEED_8_0GT ? "8.0GT/s" :
+   "Unknown", width);
+}
+
 static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 {
static int version_printed;
@@ -6318,6 +6334,8 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
board_info[ent->driver_data].name,
(long)pci_resource_start(pdev, 0), dev->dev_addr);
 
+   bnxt_parse_log_pcie_link(bp);
+
return 0;
 
 init_err:
-- 
1.8.3.1



[PATCH net-next v2 8/9] bnxt_en: Add BCM57314 device ID.

2016-05-15 Thread Michael Chan
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 6def145..f2ac7da 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -78,6 +78,7 @@ enum board_idx {
BCM57402,
BCM57404,
BCM57406,
+   BCM57314,
BCM57304_VF,
BCM57404_VF,
 };
@@ -92,6 +93,7 @@ static const struct {
{ "Broadcom BCM57402 NetXtreme-E Dual-port 10Gb Ethernet" },
{ "Broadcom BCM57404 NetXtreme-E Dual-port 10Gb/25Gb Ethernet" },
{ "Broadcom BCM57406 NetXtreme-E Dual-port 10GBase-T Ethernet" },
+   { "Broadcom BCM57314 NetXtreme-C Dual-port 10Gb/25Gb/40Gb/50Gb Ethernet" },
{ "Broadcom BCM57304 NetXtreme-C Ethernet Virtual Function" },
{ "Broadcom BCM57404 NetXtreme-E Ethernet Virtual Function" },
 };
@@ -103,6 +105,7 @@ static const struct pci_device_id bnxt_pci_tbl[] = {
{ PCI_VDEVICE(BROADCOM, 0x16d0), .driver_data = BCM57402 },
{ PCI_VDEVICE(BROADCOM, 0x16d1), .driver_data = BCM57404 },
{ PCI_VDEVICE(BROADCOM, 0x16d2), .driver_data = BCM57406 },
+   { PCI_VDEVICE(BROADCOM, 0x16df), .driver_data = BCM57314 },
 #ifdef CONFIG_BNXT_SRIOV
{ PCI_VDEVICE(BROADCOM, 0x16cb), .driver_data = BCM57304_VF },
{ PCI_VDEVICE(BROADCOM, 0x16d3), .driver_data = BCM57404_VF },
-- 
1.8.3.1



[PATCH net-next v2 5/9] bnxt_en: Improve the delay logic for firmware response.

2016-05-15 Thread Michael Chan
The current code has 2 problems:

1. The maximum wait time is not long enough.  It is about 60% of the
duration specified by the firmware.  It is calling usleep_range(600, 800)
for every 1 msec we are supposed to wait.

2. The granularity of the delay is too coarse.  Many simple firmware
commands finish in 25 usec or less.

We fix these 2 issues by multiplying the original 1 msec loop counter by
40 and calling usleep_range(25, 40) for each iteration.
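The arithmetic behind the "about 60%" figure can be checked with a quick standalone sketch (not driver code; the 500 msec timeout is a hypothetical stand-in for the firmware-specified value):

```python
# Minimum total sleep achieved by each polling scheme, as a fraction of
# the firmware-specified timeout (all values in microseconds).

def old_min_sleep_us(timeout_ms):
    # old loop: one usleep_range(600, 800) per msec of requested timeout,
    # so at best only 600 us of sleep per 1000 us of budget
    return timeout_ms * 600

def new_min_sleep_us(timeout_ms):
    # new loop: tmo_count = timeout * 40 iterations of usleep_range(25, 40)
    return (timeout_ms * 40) * 25

timeout_ms = 500  # illustrative timeout value, not taken from the driver
print(old_min_sleep_us(timeout_ms) / (timeout_ms * 1000))  # 0.6 -> ~60%
print(new_min_sleep_us(timeout_ms) / (timeout_ms * 1000))  # 1.0 -> full budget
```

So the new loop guarantees the full requested duration while also shrinking the per-iteration granularity from ~600-800 usec to 25-40 usec.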

There is also a second delay loop to wait for the last DMA word to
complete.  This delay loop should be a very short 5 usec wait.

This change results in much faster bring-up/down time:

Before the patch:

time ip link set p4p1 up

real0m0.120s
user0m0.001s
sys 0m0.009s

After the patch:

time ip link set p4p1 up

real0m0.030s
user0m0.000s
sys 0m0.010s

Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index ba0c3e5..d33b20f 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -2718,7 +2718,7 @@ void bnxt_hwrm_cmd_hdr_init(struct bnxt *bp, void *request, u16 req_type,
 static int bnxt_hwrm_do_send_msg(struct bnxt *bp, void *msg, u32 msg_len,
 int timeout, bool silent)
 {
-   int i, intr_process, rc;
+   int i, intr_process, rc, tmo_count;
struct input *req = msg;
u32 *data = msg;
__le32 *resp_len, *valid;
@@ -2747,11 +2747,12 @@ static int bnxt_hwrm_do_send_msg(struct bnxt *bp, void *msg, u32 msg_len,
timeout = DFLT_HWRM_CMD_TIMEOUT;
 
i = 0;
+   tmo_count = timeout * 40;
if (intr_process) {
/* Wait until hwrm response cmpl interrupt is processed */
while (bp->hwrm_intr_seq_id != HWRM_SEQ_ID_INVALID &&
-  i++ < timeout) {
-   usleep_range(600, 800);
+  i++ < tmo_count) {
+   usleep_range(25, 40);
}
 
if (bp->hwrm_intr_seq_id != HWRM_SEQ_ID_INVALID) {
@@ -2762,15 +2763,15 @@ static int bnxt_hwrm_do_send_msg(struct bnxt *bp, void *msg, u32 msg_len,
} else {
/* Check if response len is updated */
resp_len = bp->hwrm_cmd_resp_addr + HWRM_RESP_LEN_OFFSET;
-   for (i = 0; i < timeout; i++) {
+   for (i = 0; i < tmo_count; i++) {
len = (le32_to_cpu(*resp_len) & HWRM_RESP_LEN_MASK) >>
  HWRM_RESP_LEN_SFT;
if (len)
break;
-   usleep_range(600, 800);
+   usleep_range(25, 40);
}
 
-   if (i >= timeout) {
+   if (i >= tmo_count) {
netdev_err(bp->dev, "Error (timeout: %d) msg {0x%x 0x%x} len:%d\n",
   timeout, le16_to_cpu(req->req_type),
   le16_to_cpu(req->seq_id), *resp_len);
@@ -2779,13 +2780,13 @@ static int bnxt_hwrm_do_send_msg(struct bnxt *bp, void *msg, u32 msg_len,
 
/* Last word of resp contains valid bit */
valid = bp->hwrm_cmd_resp_addr + len - 4;
-   for (i = 0; i < timeout; i++) {
+   for (i = 0; i < 5; i++) {
if (le32_to_cpu(*valid) & HWRM_RESP_VALID_MASK)
break;
-   usleep_range(600, 800);
+   udelay(1);
}
 
-   if (i >= timeout) {
+   if (i >= 5) {
netdev_err(bp->dev, "Error (timeout: %d) msg {0x%x 0x%x} len:%d v:%d\n",
   timeout, le16_to_cpu(req->req_type),
   le16_to_cpu(req->seq_id), len, *valid);
-- 
1.8.3.1



[PATCH net-next v2 4/9] bnxt_en: Reduce maximum ring pages if page size is 64K.

2016-05-15 Thread Michael Chan
The chip supports 4K/8K/64K page sizes for the rings and we try to
match it to the CPU PAGE_SIZE.  The current page size limits for the rings
are based on 4K/8K page size. If the page size is 64K, these limits are
too large.  Reduce them appropriately.
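A rough entry count makes the point concrete (a sketch, not driver code; the 16-byte rx_bd size is an assumption for illustration, not taken from this patch):

```python
# Max RX ring entries = descriptors per page * max pages.
RX_BD_SIZE = 16  # assumed sizeof(struct rx_bd), for illustration only

def max_rx_entries(page_size, max_rx_pages):
    return (page_size // RX_BD_SIZE) * max_rx_pages

assert max_rx_entries(4096, 8) == 2048     # 4K pages, original limit
assert max_rx_entries(65536, 8) == 32768   # 64K pages, limit left unchanged: 16x
assert max_rx_entries(65536, 1) == 4096    # 64K pages, reduced limit
```

Left unchanged, the 4K-based limit would allow a ring sixteen times larger than the 4K-page maximum; the reduced limits keep the maxima in the same ballpark.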

Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 355843b..408bb00 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -425,10 +425,17 @@ struct rx_tpa_end_cmp_ext {
 
 #define MAX_TPA64
 
+#if (BNXT_PAGE_SHIFT == 16)
+#define MAX_RX_PAGES   1
+#define MAX_RX_AGG_PAGES   4
+#define MAX_TX_PAGES   1
+#define MAX_CP_PAGES   8
+#else
 #define MAX_RX_PAGES   8
 #define MAX_RX_AGG_PAGES   32
 #define MAX_TX_PAGES   8
 #define MAX_CP_PAGES   64
+#endif
 
 #define RX_DESC_CNT (BNXT_PAGE_SIZE / sizeof(struct rx_bd))
 #define TX_DESC_CNT (BNXT_PAGE_SIZE / sizeof(struct tx_bd))
-- 
1.8.3.1



[PATCH net-next v2 7/9] bnxt_en: Simplify and improve unsupported SFP+ module reporting.

2016-05-15 Thread Michael Chan
The current code is more complicated than necessary and can only report
an unsupported SFP+ module if it is plugged in after the device is up.

Rename bnxt_port_module_event() to bnxt_get_port_module_status().  We
already have the current module_status in the link_info structure, so
just check that and report any unsupported SFP+ module status.  Delete
the unnecessary last_port_module_event.  Call this function at the
end of bnxt_open to report unsupported module already plugged in.

Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 66 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |  1 -
 2 files changed, 30 insertions(+), 37 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 0a83fd8..6def145 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -1263,15 +1263,6 @@ next_rx_no_prod:
((data) &   \
 HWRM_ASYNC_EVENT_CMPL_PORT_CONN_NOT_ALLOWED_EVENT_DATA1_PORT_ID_MASK)
 
-#define BNXT_EVENT_POLICY_MASK \
-   HWRM_ASYNC_EVENT_CMPL_PORT_CONN_NOT_ALLOWED_EVENT_DATA1_ENFORCEMENT_POLICY_MASK
-
-#define BNXT_EVENT_POLICY_SFT  \
-   HWRM_ASYNC_EVENT_CMPL_PORT_CONN_NOT_ALLOWED_EVENT_DATA1_ENFORCEMENT_POLICY_SFT
-
-#define BNXT_GET_EVENT_POLICY(data)\
-   (((data) & BNXT_EVENT_POLICY_MASK) >> BNXT_EVENT_POLICY_SFT)
-
 static int bnxt_async_event_process(struct bnxt *bp,
struct hwrm_async_event_cmpl *cmpl)
 {
@@ -1310,9 +1301,6 @@ static int bnxt_async_event_process(struct bnxt *bp,
if (bp->pf.port_id != port_id)
break;
 
-   bp->link_info.last_port_module_event =
-   BNXT_GET_EVENT_POLICY(data1);
-
set_bit(BNXT_HWRM_PORT_MODULE_SP_EVENT, &bp->sp_event);
break;
}
@@ -4725,6 +4713,33 @@ static int bnxt_update_link(struct bnxt *bp, bool chng_link_state)
return 0;
 }
 
+static void bnxt_get_port_module_status(struct bnxt *bp)
+{
+   struct bnxt_link_info *link_info = &bp->link_info;
+   struct hwrm_port_phy_qcfg_output *resp = &link_info->phy_qcfg_resp;
+   u8 module_status;
+
+   if (bnxt_update_link(bp, true))
+   return;
+
+   module_status = link_info->module_status;
+   switch (module_status) {
+   case PORT_PHY_QCFG_RESP_MODULE_STATUS_DISABLETX:
+   case PORT_PHY_QCFG_RESP_MODULE_STATUS_PWRDOWN:
+   case PORT_PHY_QCFG_RESP_MODULE_STATUS_WARNINGMSG:
+   netdev_warn(bp->dev, "Unqualified SFP+ module detected on port %d\n",
+   bp->pf.port_id);
+   if (bp->hwrm_spec_code >= 0x10201) {
+   netdev_warn(bp->dev, "Module part number %s\n",
+   resp->phy_vendor_partnumber);
+   }
+   if (module_status == PORT_PHY_QCFG_RESP_MODULE_STATUS_DISABLETX)
+   netdev_warn(bp->dev, "TX is disabled\n");
+   if (module_status == PORT_PHY_QCFG_RESP_MODULE_STATUS_PWRDOWN)
+   netdev_warn(bp->dev, "SFP+ module is shutdown\n");
+   }
+}
+
 static void
bnxt_hwrm_set_pause_common(struct bnxt *bp, struct hwrm_port_phy_cfg_input *req)
 {
@@ -5017,7 +5032,8 @@ static int __bnxt_open_nic(struct bnxt *bp, bool irq_re_init, bool link_re_init)
/* Enable TX queues */
bnxt_tx_enable(bp);
mod_timer(&bp->timer, jiffies + bp->current_interval);
-   bnxt_update_link(bp, true);
+   /* Poll link status and check for SFP+ module status */
+   bnxt_get_port_module_status(bp);
 
return 0;
 
@@ -5552,28 +5568,6 @@ bnxt_restart_timer:
mod_timer(&bp->timer, jiffies + bp->current_interval);
 }
 
-static void bnxt_port_module_event(struct bnxt *bp)
-{
-   struct bnxt_link_info *link_info = &bp->link_info;
-   struct hwrm_port_phy_qcfg_output *resp = &link_info->phy_qcfg_resp;
-
-   if (bnxt_update_link(bp, true))
-   return;
-
-   if (link_info->last_port_module_event != 0) {
-   netdev_warn(bp->dev, "Unqualified SFP+ module detected on port %d\n",
-   bp->pf.port_id);
-   if (bp->hwrm_spec_code >= 0x10201) {
-   netdev_warn(bp->dev, "Module part number %s\n",
-   resp->phy_vendor_partnumber);
-   }
-   }
-   if (link_info->last_port_module_event == 1)
-   netdev_warn(bp->dev, "TX is disabled\n");
-   if (link_info->last_port_module_event == 3)
-   netdev_warn(bp->dev, "Shutdown SFP+ module\n");
-}
-
 static void bnxt_cfg_ntp_filters(struct bnxt *);
 
 static void bnxt_sp_task(struct work_struct *work)
@@ -5622,7 +5616,7 @@ static void bnxt_sp_task(struct work_struct *work)
}
 
if 

[PATCH net-next v2 1/9] bnxt_en: Fix invalid max channel parameter in ethtool -l.

2016-05-15 Thread Michael Chan
From: Satish Baddipadige 

When there is only 1 MSI-X vector or in INTA mode, tx and rx pre-set
max channel parameters are shown incorrectly in ethtool -l.  With only 1
vector, bnxt_get_max_rings() will return -ENOMEM.  bnxt_get_channels
should check this return value, and set max_rx/max_tx to 0 if it is
non-zero.

Signed-off-by: Satish Baddipadige 
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index d6e41f2..28171f9 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -327,7 +327,11 @@ static void bnxt_get_channels(struct net_device *dev,
bnxt_get_max_rings(bp, &max_rx_rings, &max_tx_rings, true);
channel->max_combined = max_rx_rings;
 
-   bnxt_get_max_rings(bp, &max_rx_rings, &max_tx_rings, false);
+   if (bnxt_get_max_rings(bp, &max_rx_rings, &max_tx_rings, false)) {
+   max_rx_rings = 0;
+   max_tx_rings = 0;
+   }
+
tcs = netdev_get_num_tc(dev);
if (tcs > 1)
max_tx_rings /= tcs;
-- 
1.8.3.1



[PATCH net-next v2 0/9] bnxt_en: updates for net-next.

2016-05-15 Thread Michael Chan
Non-critical bug fixes, improvements, a new ethtool feature, and a new
device ID.

v2: Fixed a bug in bnxt_get_module_eeprom() found by Ben Hutchings.

Ajit Khaparde (2):
  bnxt_en: Add Support for ETHTOOL_GMODULEINFO and ETHTOOL_GMODULEEEPRO
  bnxt_en: Report PCIe link speed and width during driver load

Michael Chan (6):
  bnxt_en: Reduce maximum ring pages if page size is 64K.
  bnxt_en: Improve the delay logic for firmware response.
  bnxt_en: Fix length value in dmesg log firmware error message.
  bnxt_en: Simplify and improve unsupported SFP+ module reporting.
  bnxt_en: Add BCM57314 device ID.
  bnxt_en: Use dma_rmb() instead of rmb().

Satish Baddipadige (1):
  bnxt_en: Fix invalid max channel parameter in ethtool -l.

 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 111 +++
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |  19 +++-
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 127 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_hsi.h |  34 ++
 4 files changed, 242 insertions(+), 49 deletions(-)

-- 
1.8.3.1



[PATCH net-next v2 9/9] bnxt_en: Use dma_rmb() instead of rmb().

2016-05-15 Thread Michael Chan
Use the weaker but more appropriate dma_rmb() to order the reading of
the completion ring.

Suggested-by: Ajit Khaparde 
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index f2ac7da..643c3ec 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -1433,7 +1433,7 @@ static int bnxt_poll_work(struct bnxt *bp, struct bnxt_napi *bnapi, int budget)
/* The valid test of the entry must be done first before
 * reading any further.
 */
-   rmb();
+   dma_rmb();
if (TX_CMP_TYPE(txcmp) == CMP_TYPE_TX_L2_CMP) {
tx_pkts++;
/* return full budget so NAPI will complete. */
-- 
1.8.3.1



Re: [patch net-next 1/4] netdevice: add SW statistics ndo

2016-05-15 Thread Jiri Pirko
Sun, May 15, 2016 at 06:11:20AM CEST, ro...@cumulusnetworks.com wrote:
>On 5/14/16, 11:46 AM, Jiri Pirko wrote:
>> Sat, May 14, 2016 at 05:47:41PM CEST, ro...@cumulusnetworks.com wrote:
>>> On 5/14/16, 5:49 AM, Jiri Pirko wrote:
 Fri, May 13, 2016 at 08:47:48PM CEST, ro...@cumulusnetworks.com wrote:
>
>[snip]
>  Jiri Pirko 
> ---
>
>>> To me netdev stats is  combined 'SW + HW' stats for that netdev.
>>> ndo_get_stats64 callback into the drivers does the magic of adding HW 
>>> stats
>>> to SW (netdev) stats and returning (see enic_get_stats). HW stats is 
>>> available for netdevs
>>> that are offloaded or are backed by hardware. SW stats is the stats 
>>> that the driver maintains
>>> (logical or physical). HW stats is queried and added to the SW stats.
>> I'm not sure I follow. HW stats already contain SW stats. Because on
>> slow path every packet that is not offloaded and goes through kernel is
>> counted into HW stats as well (because it goes through HW port). 
> yes, correct... we don't want to double count those. But since these 
> stats are
> generally queried from hw, I am calling them HW stats.
> you will not really maintain a software counter for this. But, the driver 
> can maintain its own
> counters for rx and tx errors etc and I call these SW stats. They are 
> counted at the driver.
>
>> If you
>> do HW stats + SW stats, what you get makes no sense. Am I missing 
>> something?
> If you go by my definition of HW and SW stats above, on a 
> ndo_get_stats64() call,
> you will add the SW counters + HW counters and return. In my definition, 
> the pkts
> that was rx'ed or tx'ed successfully are always in the HW count.
>
>> Btw, looking at enic_get_stats, looks exactly what we introduce for
>> mlxsw in this patchset.
> In enic_get_stats, the ones counted in software are the ones taken from 
> 'enic->'
> net_stats->rx_over_errors = enic->rq_truncated_pkts;
> net_stats->rx_crc_errors = enic->rq_bad_fcs;
>
>> With this patchset, we only allow user to se the actual stats for
>> slow-path aka SW stats.
> hmm...ok. But I am not sure how many will use this new attribute.
> When you do 'ip -s link show' you really want all counters on that port
> hardware or software does not matter at that point.
>
> My suggestion to move this to ethtool like attribute is because that is 
> an existing
> way to break down your stats which ever way you want. And the best part 
> is it can be
> customized (say rx_pkts_cpu_saw)
I believe that ethtool is really not a place to expose sw stats. Does
 not make sense.
>>> 2 things:
>>> - i was surprised you don't want your ndo_get_stats64 to be a unified view 
>>> of HW and SW stats
>> Roopa, please, look at the patch 4/4. That is exactly what we are doing.
>> We expose HW stats via ndo_get_stats64 and that is of course including
>> whatever comes through slowpath (non-forwarded in HW).
>
>Maybe i missed it but i did not think it included any rx or tx err counters 
>counted solely
>by the driver.
>>
>>
>>> - by bringing up ethtool like stats (IFLA_STATS_LINK_HW_EXTENDED) I am just 
>>> saying
>>> it has always been a way to breakdown stats. If you don't want to show 
>>> explicit SW stats there,
>>> there is always a way to show HW only statsand now you know the delta 
>>> between the unified stats
>>> and the HW only stats is your SW stats.
>> I think we don't understand each other. HW stats always include SW
>> stats. Because whatever goes in or out goes through HW. Therefore, the
>> "unified stats" you mention are exactly HW stats.
>>
>> This is fine, Patch 4/4 would do to make this correct. However, I think
>> it has value for the user to know what went via slowpath (non-forwarded in HW).
>> And that is exactly exposed by the SW stats we try to add.
>>
>> Is that confusing?
>
>Its not confusing. I understand what you are doing.
>The only point I was making was that most drivers have unified stats via ndo
>and there are also hw stats via ethtool like api (which will also be part of 
>the stats
>api in the future). And sw only stats can be derived from that...which is the 
>way most

The thing is, they can't be derived from it. That is my whole point.
HW - HW = 0.


>people do today.
>But that's fine. If you think it will be useful/easier to have a new
>api/attribute for software only stats for some drivers, sure, fine. Let's
>move on.
>
>
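Jiri's "HW minus HW equals zero" point can be sketched numerically (a toy model, not driver code): every slow-path packet also crosses the HW port, so subtracting the HW counter from a "unified" counter can never recover the slow-path count.

```python
# Toy model: a port forwards 90 packets purely in hardware and punts 10
# to the kernel slow path. The HW counter sees all 100, so the difference
# "unified - HW" is identically zero and never yields the slow-path count.
hw_only_forwarded = 90   # offloaded, never seen by the kernel
slow_path = 10           # went through the kernel (and the HW port)

hw_counter = hw_only_forwarded + slow_path  # HW sees every packet
unified = hw_counter                        # ndo_get_stats64 reports HW stats

assert unified - hw_counter == 0   # derived "SW stats" are always zero
```

Only an explicit SW counter, as proposed in the patchset, captures the 10 slow-path packets.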


Re: [PATCH net-next 2/9] bnxt_en: Add Support for ETHTOOL_GMODULEINFO and ETHTOOL_GMODULEEEPRO

2016-05-15 Thread Michael Chan
On Sat, May 14, 2016 at 6:31 PM, Ben Hutchings  wrote:
> On Sat, 2016-05-14 at 20:29 -0400, Michael Chan wrote:
>> From: Ajit Khaparde 
> [...]
>> + /* Read A2 portion of the EEPROM */
>> + if (length) {
>> + start -= ETH_MODULE_SFF_8436_LEN;
>> + bnxt_read_sfp_module_eeprom_info(bp, I2C_DEV_ADDR_A2, 1, start,
>> +  length, data + start);
>
> The output address calculation (data + start) makes no sense at all.
> If eeprom->offset < ETH_MODULE_SFF_8436_LEN then start == 0 here and
> this read overwrites earlier data in the output buffer.  If
> eeprom->offset > ETH_MODULE_SFF_8436_LEN then start > 0 here and this
> overruns the output buffer.
>
> I think that 'data' should be incremented along with 'start' in the
> previous if-block.
>

Yes, you're right.  We'll fix it and resend.  Thanks.
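The buffer-offset bug Ben describes can be modelled outside the driver (a sketch over fabricated A0/A2 pages; only the 256-byte page length mirrors the SFF-8436 layout): the A2 bytes must land right after the A0 bytes in the output buffer, not at `data + start`.

```python
# Sketch: an SFP+ module EEPROM read that may span the A0 page (first 256
# bytes) and the A2 page. The page contents here are fabricated.
ETH_MODULE_SFF_8436_LEN = 256

def read_module_eeprom(offset, length, a0, a2):
    data = []
    start = offset
    # A0 portion of the EEPROM
    if start < ETH_MODULE_SFF_8436_LEN:
        n = min(length, ETH_MODULE_SFF_8436_LEN - start)
        data += a0[start:start + n]
        start += n       # advance the source offset
        length -= n      # output position is simply len(data)
    # A2 portion: appended right after the A0 bytes (the driver must advance
    # its output pointer along with `start`, which is what Ben suggests)
    if length:
        start -= ETH_MODULE_SFF_8436_LEN
        data += a2[start:start + length]
    return data

a0 = list(range(256))        # fake A0 page
a2 = list(range(256, 512))   # fake A2 page
# a read straddling the page boundary comes back contiguous
assert read_module_eeprom(252, 8, a0, a2) == list(range(252, 260))
```

Writing the A2 chunk at output position `start` instead would either overwrite the A0 bytes (when the read starts in A0, so start was reset to 0) or overrun the buffer (when the read starts in A2, so start exceeds the requested length).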