Re: [PATCH net-next] virtio: Fix affinity for >32 VCPUs

2017-02-05 Thread Jason Wang



On 2017-02-06 15:28, Benjamin Serebrin wrote:

On Sun, Feb 5, 2017 at 11:24 PM, Jason Wang  wrote:

On 2017-02-03 14:19, Ben Serebrin wrote:

From: Benjamin Serebrin

If the number of virtio queue pairs is not equal to the
number of VCPUs, the virtio guest driver doesn't assign
any CPU affinity for the queue interrupts or the xps
aggregation interrupt.

So this is in fact not an affinity fix for #cpus > 32 but adds affinity
for #cpus != #queue pairs.

Fair enough.  I'll adjust the title line in the subsequent version.



Google Compute Engine currently provides 1 queue pair for
every VCPU, but limits that at a maximum of 32 queue pairs.

This code assigns interrupt affinity even when there are more than
32 VCPUs.

Tested:

(on a 64-VCPU VM with debian 8, jessie-backports 4.9.2)

Without the fix we see all queues affinitized to all CPUs:

[...]


   + /* If there are more cpus than queues, then assign the queues'
+* interrupts to the first cpus until we run out.
+*/
 i = 0;
 for_each_online_cpu(cpu) {
+   if (i == vi->max_queue_pairs)
+   break;
 virtqueue_set_affinity(vi->rq[i].vq, cpu);
 virtqueue_set_affinity(vi->sq[i].vq, cpu);
-   netif_set_xps_queue(vi->dev, cpumask_of(cpu), i);
 i++;
 }
   + /* Stripe the XPS affinities across the online CPUs.
+* Hyperthread pairs are typically assigned such that Linux's
+* CPU X and X + (numcpus / 2) are hyperthread twins, so we cause
+* hyperthread twins to share TX queues, in the case where there are
+* more cpus than queues.

Since we use combined queue pairs, why not use the same policy for RX?

XPS is for transmit only.




Yes, but I mean: if you let hyperthread twins share TX
queues (XPS), why not also let them share TX and RX queue interrupts (affinity)?


Thanks


RE: [PATCH iproute2/net-next 1/3] tc: Add support for the sample tc action

2017-02-05 Thread Yotam Gigi
>-Original Message-
>From: Florian Fainelli [mailto:f.faine...@gmail.com]
>Sent: Sunday, February 05, 2017 10:55 PM
>To: Yotam Gigi ; step...@networkplumber.org;
>netdev@vger.kernel.org; Jiri Pirko ; Elad Raz
>
>Subject: Re: [PATCH iproute2/net-next 1/3] tc: Add support for the sample tc 
>action
>
>On 02/05/17 at 12:22, Yotam Gigi wrote:
>>> -Original Message-
>>> From: Florian Fainelli [mailto:f.faine...@gmail.com]
>>> Sent: Sunday, February 05, 2017 8:37 PM
>>> To: Yotam Gigi ; step...@networkplumber.org;
>>> netdev@vger.kernel.org; Jiri Pirko ; Elad Raz
>>> 
>>> Subject: Re: [PATCH iproute2/net-next 1/3] tc: Add support for the sample tc
>action
>>>
>>> On 02/04/2017 11:58 PM, Yotam Gigi wrote:
 The sample tc action allows sampling packets matching a classifier. It
 picks packets at random and samples them using the psample netlink
 channel. The user can specify the psample group to which the packet will be
 sampled, the sampling rate, and the packet truncation size (to save
 kernel-user traffic).

 The sampled packets contain informative metadata, for example, the input
 interface and the original packet length.

 The action syntax:
 tc filter add [...] \
action sample rate RATE group GROUP [trunc SIZE]
[...]

 Where:
   RATE := The sampling rate which is the ratio of packets observed at the
  data source to the samples generated
   GROUP := the psample module sampling group
   SIZE := optional truncation size

 An example for a common usecase of the sample tc action: to sample ingress
 traffic from interface eth1, one may use the commands:

 tc qdisc add dev eth1 handle : ingress

 tc filter add dev eth1 parent : \
matchall action sample rate 12 group 4

 Where the first command adds an ingress qdisc and the second starts
 sampling randomly with an average of one sampled packet per 12 packets
 on dev eth1 to psample group 4.
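A minimal userspace sketch of the 1-in-RATE sampling and truncation described above; it is not the kernel implementation. The names demo_should_sample() and demo_truncated_len() are hypothetical, the random value is supplied by the caller so the logic stays deterministic, and treating a rate or truncation size of 0 as "never sample" and "no truncation" is an assumption made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Decide whether a packet is sampled. `rnd` is a caller-supplied random
 * value; when it is uniformly distributed, about one packet in `rate`
 * is selected. A rate of 0 never samples (assumption of this sketch). */
int demo_should_sample(uint32_t rnd, uint32_t rate)
{
	if (rate == 0)
		return 0;
	return rnd % rate == 0;
}

/* Length of the sampled copy: the packet is cut to `trunc_size` bytes
 * when it is longer; a trunc_size of 0 means no truncation (assumption). */
size_t demo_truncated_len(size_t pkt_len, size_t trunc_size)
{
	if (trunc_size == 0 || pkt_len <= trunc_size)
		return pkt_len;
	return trunc_size;
}
```

With rate 12, a random value of 24 selects the packet and 25 does not; a 1500-byte packet with a truncation size of 128 yields a 128-byte sample.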
>>>
>>> The group argument seems to be mandatory from looking at the code, but
>>> what if just wanted to have a port mirroring between, say sw0p1 and
>>> sw0p2 with the sample rate specified instead (without using the psample
>>> netlink channel at all)? Could we make this group an optional argument
>>> instead?
>>
>> The kernel action currently doesn't support it, and I am not sure it should.
>>
>> If I understand you correctly, you want to make the sample action identical
>> to mirred-mirror, only with random behavior. This can be done using the
>> matchall and mirred action, plus the 'random' gact keyword.
>
>It sounds like we can indeed, with random determ and using the VAL
>argument we should be able to configure the capture divider; thanks!

You're welcome. It took me some time to find that keyword too :)

>
>>
>> The sample action attaches some metadata in addition to the original packet
>> data, and that cannot be achieved by mirroring the packets, thus making it
>> unusable for our usecase. In the former version we attached the metadata
>> using the IFE protocol, but we decided to use a dedicated netlink channel
>> instead.
>
>Yeah I see that now, thanks for the explanation!
>--
>Florian


Re: [PATCHv1] net-next: treewide use is_vlan_dev() helper function.

2017-02-05 Thread Johannes Thumshirn


On 02/04/2017 06:00 PM, Parav Pandit wrote:

This patch makes use of is_vlan_dev() function instead of flag
comparison which is exactly done by is_vlan_dev() helper function.

Signed-off-by: Parav Pandit 
Reviewed-by: Daniel Jurgens 
---



For drivers/scsi/fcoe/fcoe.c:
Acked-by: Johannes Thumshirn 



Re: [PATCH net-next] virtio: Fix affinity for >32 VCPUs

2017-02-05 Thread Benjamin Serebrin
On Sun, Feb 5, 2017 at 11:24 PM, Jason Wang  wrote:
>
>
> On 2017-02-03 14:19, Ben Serebrin wrote:
>>
>> From: Benjamin Serebrin 
>>
>> If the number of virtio queue pairs is not equal to the
>> number of VCPUs, the virtio guest driver doesn't assign
>> any CPU affinity for the queue interrupts or the xps
>> aggregation interrupt.
>
>
> So this is in fact not an affinity fix for #cpus > 32 but adds affinity
> for #cpus != #queue pairs.

Fair enough.  I'll adjust the title line in the subsequent version.


>
>> Google Compute Engine currently provides 1 queue pair for
>> every VCPU, but limits that at a maximum of 32 queue pairs.
>>
>> This code assigns interrupt affinity even when there are more than
>> 32 VCPUs.
>>
>> Tested:
>>
>> (on a 64-VCPU VM with debian 8, jessie-backports 4.9.2)
>>
>> Without the fix we see all queues affinitized to all CPUs:
>
>
> [...]
>
>>   + /* If there are more cpus than queues, then assign the queues'
>> +* interrupts to the first cpus until we run out.
>> +*/
>> i = 0;
>> for_each_online_cpu(cpu) {
>> +   if (i == vi->max_queue_pairs)
>> +   break;
>> virtqueue_set_affinity(vi->rq[i].vq, cpu);
>> virtqueue_set_affinity(vi->sq[i].vq, cpu);
>> -   netif_set_xps_queue(vi->dev, cpumask_of(cpu), i);
>> i++;
>> }
>>   + /* Stripe the XPS affinities across the online CPUs.
>> +* Hyperthread pairs are typically assigned such that Linux's
>> +* CPU X and X + (numcpus / 2) are hyperthread twins, so we cause
>> +* hyperthread twins to share TX queues, in the case where there are
>> +* more cpus than queues.
>
>
> Since we use combined queue pairs, why not use the same policy for RX?

XPS is for transmit only.


> Thanks
>
>
>> +*/
>> +   for (i = 0; i < vi->max_queue_pairs; i++) {
>> +   struct cpumask mask;
>> +   int skip = i;
>> +
>> +   cpumask_clear(&mask);
>> +   for_each_online_cpu(cpu) {
>> +   while (skip--)
>> +   cpu = cpumask_next(cpu, cpu_online_mask);
>> +   if (cpu < num_possible_cpus())
>> +   cpumask_set_cpu(cpu, &mask);
>> +   skip = vi->max_queue_pairs - 1;
>> +   }
>> +   netif_set_xps_queue(vi->dev, &mask, i);
>> +   }
>> +
>> vi->affinity_hint_set = true;
>>   }
>>
>
>


Re: [PATCH net-next] virtio: Fix affinity for >32 VCPUs

2017-02-05 Thread Jason Wang



On 2017-02-03 14:19, Ben Serebrin wrote:

From: Benjamin Serebrin 

If the number of virtio queue pairs is not equal to the
number of VCPUs, the virtio guest driver doesn't assign
any CPU affinity for the queue interrupts or the xps
aggregation interrupt.


So this is in fact not an affinity fix for #cpus > 32 but adds
affinity for #cpus != #queue pairs.



Google Compute Engine currently provides 1 queue pair for
every VCPU, but limits that at a maximum of 32 queue pairs.

This code assigns interrupt affinity even when there are more than
32 VCPUs.

Tested:

(on a 64-VCPU VM with debian 8, jessie-backports 4.9.2)

Without the fix we see all queues affinitized to all CPUs:


[...]

  
+	/* If there are more cpus than queues, then assign the queues'

+* interrupts to the first cpus until we run out.
+*/
i = 0;
for_each_online_cpu(cpu) {
+   if (i == vi->max_queue_pairs)
+   break;
virtqueue_set_affinity(vi->rq[i].vq, cpu);
virtqueue_set_affinity(vi->sq[i].vq, cpu);
-   netif_set_xps_queue(vi->dev, cpumask_of(cpu), i);
i++;
}
  
+	/* Stripe the XPS affinities across the online CPUs.

+* Hyperthread pairs are typically assigned such that Linux's
+* CPU X and X + (numcpus / 2) are hyperthread twins, so we cause
+* hyperthread twins to share TX queues, in the case where there are
+* more cpus than queues.


Since we use combined queue pairs, why not use the same policy for RX?

Thanks


+*/
+   for (i = 0; i < vi->max_queue_pairs; i++) {
+   struct cpumask mask;
+   int skip = i;
+
+   cpumask_clear(&mask);
+   for_each_online_cpu(cpu) {
+   while (skip--)
+   cpu = cpumask_next(cpu, cpu_online_mask);
+   if (cpu < num_possible_cpus())
+   cpumask_set_cpu(cpu, &mask);
+   skip = vi->max_queue_pairs - 1;
+   }
+   netif_set_xps_queue(vi->dev, &mask, i);
+   }
+
vi->affinity_hint_set = true;
  }
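The striping policy described in the patch comment can be sketched in plain userspace C. This is an illustration under stated assumptions, not the driver code: xps_stripe_mask() is a hypothetical helper, CPUs are modeled as bits of a 64-bit mask (bit n is CPU n), and queue i is given every online CPU whose position in the online list is congruent to i modulo the number of queues.

```c
#include <stdint.h>

/* Return the set of CPUs (as a bitmask) that would transmit on `queue`
 * when the online CPUs are striped across `num_queues` queues.
 * Offline CPUs are skipped and do not consume a stripe position. */
uint64_t xps_stripe_mask(uint64_t online_mask, unsigned int num_queues,
			 unsigned int queue)
{
	uint64_t mask = 0;
	unsigned int rank = 0;	/* index of the CPU among online CPUs */
	unsigned int cpu;

	if (num_queues == 0)
		return 0;

	for (cpu = 0; cpu < 64; cpu++) {
		if (!(online_mask & (UINT64_C(1) << cpu)))
			continue;
		if (rank % num_queues == queue)
			mask |= UINT64_C(1) << cpu;
		rank++;
	}
	return mask;
}
```

With 64 online CPUs and 32 queues, queue 0 is assigned CPUs 0 and 32, which matches the "CPU X and X + (numcpus / 2)" pairing mentioned in the comment.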
  





Re: fs, net: deadlock between bind/splice on af_unix

2017-02-05 Thread Cong Wang
On Tue, Jan 31, 2017 at 10:14 AM, Mateusz Guzik  wrote:
> On Mon, Jan 30, 2017 at 10:44:03PM -0800, Cong Wang wrote:
>> Mind being more specific?
>
> Consider 2 threads which bind the same socket, but with different paths.
>
> Currently exactly one file will get created, the one used to bind.
>
> With your patch both threads can succeed creating their respective
> files, but only one will manage to bind. The other one must error out,
> but it already created a file it is unclear what to do with.

In this case, it simply puts the path back:

err = -EINVAL;
if (u->addr)
goto out_up;
[...]

out_up:
mutex_unlock(&u->bindlock);
out_put:
if (err)
path_put(&path);
out:
return err;


Which is what unix_release_sock() does too:

if (path.dentry)
path_put(&path);


Re: [PATCH v4 net-next] net: mvneta: implement .set_wol and .get_wol

2017-02-05 Thread Jisheng Zhang
On Mon, 6 Feb 2017 15:08:48 +0800 Jisheng Zhang wrote:

> Hi Andrew,
> 
> On Mon, 23 Jan 2017 19:10:34 +0100 Andrew Lunn wrote:
> 
> > 
> > On Mon, Jan 23, 2017 at 02:55:07PM +0800, Jisheng Zhang wrote:  
> > > From: Jingju Hou 
> > > 
> > > From: Jingju Hou 
> > > 
> > > The mvneta itself does not support WOL, but the PHY might.
> > > So pass the calls to the PHY
> > > 
> > > Signed-off-by: Jingju Hou 
> > > Signed-off-by: Jisheng Zhang 
> > > ---
> > > since v3:
> > >  - really fix the build error
> > 
> > Keep trying
> > 
> > But maybe tomorrow, after you have taken the pause Dave said you
> > should take, and maybe ask Jingju to really review it, in detail.  
> 
> Jingju is a newbie in the Linux kernel community. She made a mistake
> when trying to send the old patch. I picked up her patch when she went
> on vacation, fixed the error, and sent it out on her behalf.
> 
> >   
> > > 
> > > since v2,v1:
> > >  - using phy_dev member in struct net_device
> > >  - add commit msg
> > > 
> > >  drivers/net/ethernet/marvell/mvneta.c | 21 +
> > >  1 file changed, 21 insertions(+)
> > > 
> > > diff --git a/drivers/net/ethernet/marvell/mvneta.c 
> > > b/drivers/net/ethernet/marvell/mvneta.c
> > > index 6dcc951af0ff..02611fa1c3b8 100644
> > > --- a/drivers/net/ethernet/marvell/mvneta.c
> > > +++ b/drivers/net/ethernet/marvell/mvneta.c
> > > @@ -3929,6 +3929,25 @@ static int mvneta_ethtool_get_rxfh(struct 
> > > net_device *dev, u32 *indir, u8 *key,
> > >   return 0;
> > >  }
> > >  
> > > +static void mvneta_ethtool_get_wol(struct net_device *dev,
> > > +struct ethtool_wolinfo *wol)
> > > +{
> > > + wol->supported = 0;
> > > + wol->wolopts = 0;
> > > +
> > > + if (dev->phydev)
> > > + return phy_ethtool_get_wol(dev->phydev, wol);
> > 
> > This is a void function.  And you are returning a value.  And
> > phy_ethtool_get_wol() is also a void function, so does not actually
> > return anything.  
> 
> Thanks for catching it, fixed in v4, can you please review?

typo, fixed in v5. 



Re: [PATCH v4 net-next] net: mvneta: implement .set_wol and .get_wol

2017-02-05 Thread Jisheng Zhang
Hi Andrew,

On Mon, 23 Jan 2017 19:10:34 +0100 Andrew Lunn wrote:

> 
> On Mon, Jan 23, 2017 at 02:55:07PM +0800, Jisheng Zhang wrote:
> > From: Jingju Hou 
> > 
> > From: Jingju Hou 
> > 
> > The mvneta itself does not support WOL, but the PHY might.
> > So pass the calls to the PHY
> > 
> > Signed-off-by: Jingju Hou 
> > Signed-off-by: Jisheng Zhang 
> > ---
> > since v3:
> >  - really fix the build error  
> 
> Keep trying
> 
> But maybe tomorrow, after you have taken the pause Dave said you
> should take, and maybe ask Jingju to really review it, in detail.

Jingju is a newbie in the Linux kernel community. She made a mistake
when trying to send the old patch. I picked up her patch when she went
on vacation, fixed the error, and sent it out on her behalf.

> 
> > 
> > since v2,v1:
> >  - using phy_dev member in struct net_device
> >  - add commit msg
> > 
> >  drivers/net/ethernet/marvell/mvneta.c | 21 +
> >  1 file changed, 21 insertions(+)
> > 
> > diff --git a/drivers/net/ethernet/marvell/mvneta.c 
> > b/drivers/net/ethernet/marvell/mvneta.c
> > index 6dcc951af0ff..02611fa1c3b8 100644
> > --- a/drivers/net/ethernet/marvell/mvneta.c
> > +++ b/drivers/net/ethernet/marvell/mvneta.c
> > @@ -3929,6 +3929,25 @@ static int mvneta_ethtool_get_rxfh(struct net_device 
> > *dev, u32 *indir, u8 *key,
> > return 0;
> >  }
> >  
> > +static void mvneta_ethtool_get_wol(struct net_device *dev,
> > +  struct ethtool_wolinfo *wol)
> > +{
> > +   wol->supported = 0;
> > +   wol->wolopts = 0;
> > +
> > +   if (dev->phydev)
> > +   return phy_ethtool_get_wol(dev->phydev, wol);  
> 
> This is a void function.  And you are returning a value.  And
> phy_ethtool_get_wol() is also a void function, so does not actually
> return anything.

Thanks for catching it, fixed in v4, can you please review?



Re: [net-next PATCH v2 0/5] XDP adjust head support for virtio

2017-02-05 Thread Jason Wang



On 2017-02-06 12:39, Michael S. Tsirkin wrote:

On Sun, Feb 05, 2017 at 05:36:34PM -0500, David Miller wrote:

From: John Fastabend 
Date: Thu, 02 Feb 2017 19:14:05 -0800


This series adds adjust head support for virtio. The following is my
test setup. I use qemu + virtio as follows,

./x86_64-softmmu/qemu-system-x86_64 \
   -hda /var/lib/libvirt/images/Fedora-test0.img \
   -m 4096  -enable-kvm -smp 2 -netdev tap,id=hn0,queues=4,vhost=on \
   -device virtio-net-pci,netdev=hn0,mq=on,guest_tso4=off,guest_tso6=off,guest_ecn=off,guest_ufo=off,vectors=9

In order to use XDP with virtio until LRO is supported, TSO must be
turned off in the host. The important fields in the above command line
are the following,

   guest_tso4=off,guest_tso6=off,guest_ecn=off,guest_ufo=off

Also note it is possible to consume more queues than can be supported,
because when XDP is enabled for retransmit, XDP attempts to use a queue
per cpu. My standard queue count is 'queues=4'.

After loading the VM I run the relevant XDP test programs in,

   ./samples/bpf

For this series I tested xdp1, xdp2, and xdp_tx_iptunnel. I usually test
with iperf (-d option to get bidirectional traffic), ping, and pktgen.
I also have a modified xdp1 that returns XDP_PASS on any packet to ensure
the normal traffic path to the stack continues to work with XDP loaded.

It would be great to automate this soon. At the moment I do it by hand
which is starting to get tedious.

v2: original series dropped trace points after merge.

Michael, I just want to apply this right now.

I don't think haggling over whether to allocate the adjust_head area
unconditionally or not is a blocker for this series going in.  That
can be addressed trivially in a follow-on patch.

FYI it would just mean we revert most of this patchset except patches 2 and 3 
though.


We want these new reset paths tested as much as possible and each day
we delay this series is detrimental towards that goal.

Thanks.

Well, the point is to avoid resets completely, at the cost of an extra 256 bytes
for packets > 128 bytes on ppc (64k pages) only.

Found a volunteer so I hope to have this idea tested on ppc Tuesday.

And really all we need to know is confirm whether this:
-#define MERGEABLE_BUFFER_MIN_ALIGN_SHIFT ((PAGE_SHIFT + 1) / 2)
+#define MERGEABLE_BUFFER_MIN_ALIGN_SHIFT (PAGE_SHIFT / 2 + 1)

affects performance in a measurable way.
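For reference, the two candidate definitions can be evaluated for concrete page sizes. The helper names below are hypothetical, and PAGE_SHIFT values of 12 (4k pages) and 16 (64k pages) are chosen only as examples; how the resulting shift is consumed by the driver is not shown here.

```c
/* Current definition: ((PAGE_SHIFT + 1) / 2) */
unsigned int demo_align_shift_old(unsigned int page_shift)
{
	return (page_shift + 1) / 2;
}

/* Proposed definition: (PAGE_SHIFT / 2 + 1) */
unsigned int demo_align_shift_new(unsigned int page_shift)
{
	return page_shift / 2 + 1;
}
```

For PAGE_SHIFT 12 the shift goes from 6 to 7, and for PAGE_SHIFT 16 it goes from 8 to 9, so the proposed definition yields a value one larger for both of these even page shifts.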


OK, but I believe we would still need to drop some packets this way, and
would it still work if we allow the headroom size to change in the future?


Thanks



So I would rather wait another day. But the patches themselves
look correct, from that POV.

Acked-by: Michael S. Tsirkin 

but I would prefer that you waited another day for a Tested-by from me too.





Re: [net-next PATCH v2 5/5] virtio_net: XDP support for adjust_head

2017-02-05 Thread Jason Wang



On 2017-02-03 11:16, John Fastabend wrote:

Add support for XDP adjust head by allocating a 256B header region
that XDP programs can grow into. This is only enabled when an XDP
program is loaded.
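The headroom idea can be illustrated with a small userspace sketch. It is not the virtio_net or BPF helper code: struct demo_xdp_buff and the demo_* functions are hypothetical stand-ins, with the 256-byte constant mirroring VIRTIO_XDP_HEADROOM from the patch. A negative delta grows the packet into the headroom; a positive delta strips bytes from the front.

```c
#include <stddef.h>

#define DEMO_XDP_HEADROOM 256	/* mirrors VIRTIO_XDP_HEADROOM in the patch */

struct demo_xdp_buff {
	unsigned char *data_hard_start;	/* start of the buffer, headroom included */
	unsigned char *data;		/* first byte of packet data */
	unsigned char *data_end;	/* one past the last byte of packet data */
};

/* Place a packet of `len` bytes after the reserved headroom.
 * `buf` must hold at least DEMO_XDP_HEADROOM + len bytes. */
void demo_xdp_init(struct demo_xdp_buff *xdp, unsigned char *buf, size_t len)
{
	xdp->data_hard_start = buf;
	xdp->data = buf + DEMO_XDP_HEADROOM;
	xdp->data_end = xdp->data + len;
}

/* Current packet length; headroom does not contribute to it. */
size_t demo_xdp_len(const struct demo_xdp_buff *xdp)
{
	return (size_t)(xdp->data_end - xdp->data);
}

/* Move the packet start by `delta` bytes. Returns 0 on success, or -1 if
 * the move would leave the buffer or consume the whole packet. */
int demo_xdp_adjust_head(struct demo_xdp_buff *xdp, ptrdiff_t delta)
{
	ptrdiff_t headroom = xdp->data - xdp->data_hard_start;
	ptrdiff_t len = xdp->data_end - xdp->data;

	if (delta < 0 && -delta > headroom)
		return -1;
	if (delta > 0 && delta >= len)
		return -1;

	xdp->data += delta;
	return 0;
}
```

After pushing a 14-byte header onto a 100-byte packet, the length becomes 114 and 242 bytes of headroom remain, which is why the receive path recomputes the length from data_end - data after the program runs.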

In order to ensure that we do not have to unwind queue headroom, push
queue setup below bpf_prog_add. It reads better to do a prog ref
unwind vs another queue setup call.

At the moment this code must do a full reset to ensure that old buffers
without headroom (on program add) or with headroom (on program removal)
are not used incorrectly in the datapath. Ideally we would only
have to disable/enable the RX queues being updated, but there is no
API to do this at the moment in virtio, so use the big hammer. In
practice it is likely not that big of a problem, as this will only
happen when XDP is enabled/disabled; changing programs does not
require the reset. There is some risk that the driver may either
have an allocation failure or for some reason fail to correctly
negotiate with the underlying backend; in this case the driver will
be left uninitialized. I have not seen this ever happen on my test
systems, and for what it's worth this same failure case can occur
from probe and other contexts in the virtio framework.

Signed-off-by: John Fastabend 
---
  drivers/net/virtio_net.c |  154 +-
  1 file changed, 125 insertions(+), 29 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 07f9076..52a18b8 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -42,6 +42,9 @@
  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
  #define GOOD_COPY_LEN 128
  
+/* Amount of XDP headroom to prepend to packets for use by xdp_adjust_head */

+#define VIRTIO_XDP_HEADROOM 256
+
  /* RX packet size EWMA. The average packet size is used to determine the packet
   * buffer size when refilling RX rings. As the entire RX ring may be refilled
   * at once, the weight is chosen so that the EWMA will be insensitive to short-
@@ -368,6 +371,7 @@ static bool virtnet_xdp_xmit(struct virtnet_info *vi,
}
  
  	if (vi->mergeable_rx_bufs) {

+   xdp->data -= sizeof(struct virtio_net_hdr_mrg_rxbuf);
/* Zero header and leave csum up to XDP layers */
hdr = xdp->data;
memset(hdr, 0, vi->hdr_len);
@@ -384,7 +388,9 @@ static bool virtnet_xdp_xmit(struct virtnet_info *vi,
num_sg = 2;
sg_init_table(sq->sg, 2);
sg_set_buf(sq->sg, hdr, vi->hdr_len);
-   skb_to_sgvec(skb, sq->sg + 1, 0, skb->len);
+   skb_to_sgvec(skb, sq->sg + 1,
+xdp->data - xdp->data_hard_start,
+xdp->data_end - xdp->data);
}
err = virtqueue_add_outbuf(sq->vq, sq->sg, num_sg,
   data, GFP_ATOMIC);
@@ -412,7 +418,6 @@ static struct sk_buff *receive_small(struct net_device *dev,
struct bpf_prog *xdp_prog;
  
  	len -= vi->hdr_len;

-   skb_trim(skb, len);
  
  	rcu_read_lock();

xdp_prog = rcu_dereference(rq->xdp_prog);
@@ -424,12 +429,16 @@ static struct sk_buff *receive_small(struct net_device 
*dev,
if (unlikely(hdr->hdr.gso_type || hdr->hdr.flags))
goto err_xdp;
  
-		xdp.data = skb->data;

+   xdp.data_hard_start = skb->data;
+   xdp.data = skb->data + VIRTIO_XDP_HEADROOM;
xdp.data_end = xdp.data + len;
act = bpf_prog_run_xdp(xdp_prog, &xdp);
  
  		switch (act) {

case XDP_PASS:
+   /* Recalculate length in case bpf program changed it */
+   __skb_pull(skb, xdp.data - xdp.data_hard_start);


But skb->len is trimmed to len below, which seems wrong.


+   len = xdp.data_end - xdp.data;
break;
case XDP_TX:
if (unlikely(!virtnet_xdp_xmit(vi, rq, &xdp, skb)))
@@ -446,6 +455,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
}
rcu_read_unlock();
  
+	skb_trim(skb, len);

return skb;
  
  err_xdp:

@@ -494,7 +504,7 @@ static struct page *xdp_linearize_page(struct receive_queue 
*rq,
   unsigned int *len)
  {
struct page *page = alloc_page(GFP_ATOMIC);
-   unsigned int page_off = 0;
+   unsigned int page_off = VIRTIO_XDP_HEADROOM;
  
  	if (!page)

return NULL;
@@ -530,7 +540,8 @@ static struct page *xdp_linearize_page(struct receive_queue 
*rq,
put_page(p);
}
  
-	*len = page_off;

+   /* Headroom does not contribute to packet length */
+   *len = page_off - VIRTIO_XDP_HEADROOM;
return page;
  err_buf:
__free_pages(page, 0);
@@ -569,7 +580,7 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
  

Re: [net-next PATCH v2 3/5] virtio_net: remove duplicate queue pair binding in XDP

2017-02-05 Thread Jason Wang



On 2017-02-03 11:15, John Fastabend wrote:

Factor out qp assignment.

Signed-off-by: John Fastabend 


Acked-by: Jason Wang 


---
  drivers/net/virtio_net.c |   20 +++-
  1 file changed, 7 insertions(+), 13 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 3b49363..dba5afb 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -341,15 +341,19 @@ static struct sk_buff *page_to_skb(struct virtnet_info 
*vi,
  
  static bool virtnet_xdp_xmit(struct virtnet_info *vi,

 struct receive_queue *rq,
-struct send_queue *sq,
 struct xdp_buff *xdp,
 void *data)
  {
struct virtio_net_hdr_mrg_rxbuf *hdr;
unsigned int num_sg, len;
+   struct send_queue *sq;
+   unsigned int qp;
void *xdp_sent;
int err;
  
+	qp = vi->curr_queue_pairs - vi->xdp_queue_pairs + smp_processor_id();

+   sq = &vi->sq[qp];
+
/* Free up any pending old buffers before queueing new ones. */
while ((xdp_sent = virtqueue_get_buf(sq->vq, &len)) != NULL) {
if (vi->mergeable_rx_bufs) {
@@ -415,7 +419,6 @@ static struct sk_buff *receive_small(struct net_device *dev,
if (xdp_prog) {
struct virtio_net_hdr_mrg_rxbuf *hdr = buf;
struct xdp_buff xdp;
-   unsigned int qp;
u32 act;
  
  		if (unlikely(hdr->hdr.gso_type || hdr->hdr.flags))

@@ -429,11 +432,7 @@ static struct sk_buff *receive_small(struct net_device 
*dev,
case XDP_PASS:
break;
case XDP_TX:
-   qp = vi->curr_queue_pairs -
-   vi->xdp_queue_pairs +
-   smp_processor_id();
-   if (unlikely(!virtnet_xdp_xmit(vi, rq, &vi->sq[qp],
-  &xdp, skb)))
+   if (unlikely(!virtnet_xdp_xmit(vi, rq, &xdp, skb)))
trace_xdp_exception(vi->dev, xdp_prog, act);
rcu_read_unlock();
goto xdp_xmit;
@@ -560,7 +559,6 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
if (xdp_prog) {
struct page *xdp_page;
struct xdp_buff xdp;
-   unsigned int qp;
void *data;
u32 act;
  
@@ -602,11 +600,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,

}
break;
case XDP_TX:
-   qp = vi->curr_queue_pairs -
-   vi->xdp_queue_pairs +
-   smp_processor_id();
-   if (unlikely(!virtnet_xdp_xmit(vi, rq, &vi->sq[qp],
-  &xdp, data)))
+   if (unlikely(!virtnet_xdp_xmit(vi, rq, &xdp, data)))
trace_xdp_exception(vi->dev, xdp_prog, act);
ewma_pkt_len_add(&rq->mrg_avg_pkt_len, len);
if (unlikely(xdp_page != page))





Re: [net-next PATCH v2 4/5] virtio_net: refactor freeze/restore logic into virtnet reset logic

2017-02-05 Thread Jason Wang



On 2017-02-03 11:16, John Fastabend wrote:

For XDP we will need to reset the queues to allow for buffer headroom
to be configured. In order to do this we need to essentially run the
freeze()/restore() code path. Unfortunately, the locking requirements
of the freeze/restore and reset paths are different, so
we cannot simply reuse the code.

This patch refactors the code path and adds a reset helper routine.

Signed-off-by: John Fastabend 


Acked-by: Jason Wang 


---
  drivers/net/virtio_net.c |   75 --
  drivers/virtio/virtio.c  |   42 ++
  include/linux/virtio.h   |4 ++
  3 files changed, 73 insertions(+), 48 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index dba5afb..07f9076 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1698,6 +1698,49 @@ static void virtnet_init_settings(struct net_device *dev)
.set_settings = virtnet_set_settings,
  };
  
+static void virtnet_freeze_down(struct virtio_device *vdev)

+{
+   struct virtnet_info *vi = vdev->priv;
+   int i;
+
+   /* Make sure no work handler is accessing the device */
+   flush_work(&vi->config_work);
+
+   netif_device_detach(vi->dev);
+   cancel_delayed_work_sync(&vi->refill);
+
+   if (netif_running(vi->dev)) {
+   for (i = 0; i < vi->max_queue_pairs; i++)
+   napi_disable(&vi->rq[i].napi);
+   }
+}
+
+static int init_vqs(struct virtnet_info *vi);
+
+static int virtnet_restore_up(struct virtio_device *vdev)
+{
+   struct virtnet_info *vi = vdev->priv;
+   int err, i;
+
+   err = init_vqs(vi);
+   if (err)
+   return err;
+
+   virtio_device_ready(vdev);
+
+   if (netif_running(vi->dev)) {
+   for (i = 0; i < vi->curr_queue_pairs; i++)
+   if (!try_fill_recv(vi, &vi->rq[i], GFP_KERNEL))
+   schedule_delayed_work(&vi->refill, 0);
+
+   for (i = 0; i < vi->max_queue_pairs; i++)
+   virtnet_napi_enable(&vi->rq[i]);
+   }
+
+   netif_device_attach(vi->dev);
+   return err;
+}
+
  static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
  {
unsigned long int max_sz = PAGE_SIZE - sizeof(struct padded_vnet_hdr);
@@ -2393,21 +2436,9 @@ static void virtnet_remove(struct virtio_device *vdev)
  static int virtnet_freeze(struct virtio_device *vdev)
  {
struct virtnet_info *vi = vdev->priv;
-   int i;
  
  	virtnet_cpu_notif_remove(vi);

-
-   /* Make sure no work handler is accessing the device */
-   flush_work(&vi->config_work);
-
-   netif_device_detach(vi->dev);
-   cancel_delayed_work_sync(&vi->refill);
-
-   if (netif_running(vi->dev)) {
-   for (i = 0; i < vi->max_queue_pairs; i++)
-   napi_disable(&vi->rq[i].napi);
-   }
-
+   virtnet_freeze_down(vdev);
remove_vq_common(vi);
  
  	return 0;

@@ -2416,25 +2447,11 @@ static int virtnet_freeze(struct virtio_device *vdev)
  static int virtnet_restore(struct virtio_device *vdev)
  {
struct virtnet_info *vi = vdev->priv;
-   int err, i;
+   int err;
  
-	err = init_vqs(vi);

+   err = virtnet_restore_up(vdev);
if (err)
return err;
-
-   virtio_device_ready(vdev);
-
-   if (netif_running(vi->dev)) {
-   for (i = 0; i < vi->curr_queue_pairs; i++)
-   if (!try_fill_recv(vi, &vi->rq[i], GFP_KERNEL))
-   schedule_delayed_work(&vi->refill, 0);
-
-   for (i = 0; i < vi->max_queue_pairs; i++)
-   virtnet_napi_enable(&vi->rq[i]);
-   }
-
-   netif_device_attach(vi->dev);
-
virtnet_set_queues(vi, vi->curr_queue_pairs);
  
  	err = virtnet_cpu_notif_add(vi);

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 7062bb0..400d70b 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -100,11 +100,6 @@ static int virtio_uevent(struct device *_dv, struct 
kobj_uevent_env *env)
  dev->id.device, dev->id.vendor);
  }
  
-static void add_status(struct virtio_device *dev, unsigned status)

-{
-   dev->config->set_status(dev, dev->config->get_status(dev) | status);
-}
-
  void virtio_check_driver_offered_feature(const struct virtio_device *vdev,
 unsigned int fbit)
  {
@@ -145,14 +140,15 @@ void virtio_config_changed(struct virtio_device *dev)
  }
  EXPORT_SYMBOL_GPL(virtio_config_changed);
  
-static void virtio_config_disable(struct virtio_device *dev)

+void virtio_config_disable(struct virtio_device *dev)
  {
spin_lock_irq(&dev->config_lock);
dev->config_enabled = false;
spin_unlock_irq(&dev->config_lock);
  }
+EXPORT_SYMBOL_GPL(virtio_config_disable);
  
-static void virtio_config_enable(struct 

Re: Fw: [Bug 193911] New: net_prio.ifpriomap is not aware of the network namespace, and discloses all network interface

2017-02-05 Thread Cong Wang
On Fri, Feb 3, 2017 at 3:53 PM, Stephen Hemminger
 wrote:
>
>
> Begin forwarded message:
>
> Date: Fri, 03 Feb 2017 21:14:28 +
> From: bugzilla-dae...@bugzilla.kernel.org
> To: step...@networkplumber.org
> Subject: [Bug 193911] New: net_prio.ifpriomap is not aware of the network 
> namespace, and discloses all network interface
>
>
> https://bugzilla.kernel.org/show_bug.cgi?id=193911
>
> Bug ID: 193911
>Summary: net_prio.ifpriomap is not aware of the network
> namespace, and discloses all network interface
>Product: Networking
>Version: 2.5
> Kernel Version: 4.9
>   Hardware: All
> OS: Linux
>   Tree: Mainline
> Status: NEW
>   Severity: normal
>   Priority: P1
>  Component: Other
>   Assignee: step...@networkplumber.org
>   Reporter: xga...@email.wm.edu
> Regression: No
>
> The pseudo file net_prio.ifpriomap (under /sys/fs/cgroup/net_prio) contains a
> map of the priorities assigned to traffic starting from processes in a cgroup
> and leaving the system on various interfaces. The data format is in the form
> of [ifname priority].
>
> We find that the kernel handler function hooked at net_prio.ifpriomap is not
> aware of the network namespace, and thus it discloses all network interfaces
> on the physical machine to the containerized applications.
>
> To be more specific, the read operation of net_prio.ifpriomap is handled by
> the function read_priomap. Tracing from this function, we can find that it
> invokes for_each_netdev_rcu and sets the first parameter to the address of
> init_net. It iterates over all network devices of the host regardless of the
> network namespace. Thus, from the view of a container, it can read the names
> of all network devices of the host.

I think that is probably because cgroup files don't provide a net pointer
for the context; if so, we probably need some API similar to
class_create_file_ns().


[PATCH v5 net-next] net: mvneta: implement .set_wol and .get_wol

2017-02-05 Thread Jisheng Zhang
From: Jingju Hou 

The mvneta itself does not support WOL, but the PHY might.
So pass the calls to the PHY

Signed-off-by: Jingju Hou 
Signed-off-by: Jisheng Zhang 
---
since v4:
 - address Andrew Lunn's comment

since v3:
 - really fix the build error

since v2,v1:
 - using phy_dev member in struct net_device
 - add commit msg

 drivers/net/ethernet/marvell/mvneta.c | 21 +
 1 file changed, 21 insertions(+)

diff --git a/drivers/net/ethernet/marvell/mvneta.c 
b/drivers/net/ethernet/marvell/mvneta.c
index de6c47744b8e..0f4d1697be46 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -3927,6 +3927,25 @@ static int mvneta_ethtool_get_rxfh(struct net_device 
*dev, u32 *indir, u8 *key,
return 0;
 }
 
+static void mvneta_ethtool_get_wol(struct net_device *dev,
+  struct ethtool_wolinfo *wol)
+{
+   wol->supported = 0;
+   wol->wolopts = 0;
+
+   if (dev->phydev)
+   phy_ethtool_get_wol(dev->phydev, wol);
+}
+
+static int mvneta_ethtool_set_wol(struct net_device *dev,
+ struct ethtool_wolinfo *wol)
+{
+   if (!dev->phydev)
+   return -EOPNOTSUPP;
+
+   return phy_ethtool_set_wol(dev->phydev, wol);
+}
+
 static const struct net_device_ops mvneta_netdev_ops = {
.ndo_open= mvneta_open,
.ndo_stop= mvneta_stop,
@@ -3956,6 +3975,8 @@ const struct ethtool_ops mvneta_eth_tool_ops = {
.set_rxfh   = mvneta_ethtool_set_rxfh,
.get_link_ksettings = phy_ethtool_get_link_ksettings,
.set_link_ksettings = mvneta_ethtool_set_link_ksettings,
+   .get_wol= mvneta_ethtool_get_wol,
+   .set_wol= mvneta_ethtool_set_wol,
 };
 
 /* Initialize hw */
-- 
2.11.0



Re: [net-next PATCH v2 1/5] virtio_net: wrap rtnl_lock in test for calling with lock already held

2017-02-05 Thread Jason Wang



On 2017-02-03 11:14, John Fastabend wrote:

For XDP use case and to allow ethtool reset tests it is useful to be
able to use reset paths from contexts where rtnl lock is already
held.

This requires updating virtnet_set_queues and free_receive_bufs, the
two places where rtnl_lock is taken in virtio_net. To do this we
use the following pattern,

_foo(...) { do stuff }
foo(...) { rtnl_lock(); _foo(...); rtnl_unlock()};

this allows us to use freeze()/restore() flow from both contexts.
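
A minimal user-space sketch of the pattern above, not the driver code. It assumes a plain flag (cfg_lock_held) stands in for the rtnl lock, and all names in it are hypothetical. set_queues_locked() plays the role of _foo() and set_queues() the role of foo():

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for rtnl_lock(): a non-recursive lock flag. */
static bool cfg_lock_held;

static void cfg_lock(void)
{
	/* Taking a real non-recursive mutex twice would deadlock. */
	assert(!cfg_lock_held);
	cfg_lock_held = true;
}

static void cfg_unlock(void)
{
	assert(cfg_lock_held);
	cfg_lock_held = false;
}

static int queue_pairs;

/* Core helper (the _foo() role): the caller must already hold the lock. */
static int set_queues_locked(int pairs)
{
	assert(cfg_lock_held);
	if (pairs <= 0)
		return -1;
	queue_pairs = pairs;
	return 0;
}

/* Locking wrapper (the foo() role) for callers that do not hold the lock. */
static int set_queues(int pairs)
{
	int err;

	cfg_lock();
	err = set_queues_locked(pairs);
	cfg_unlock();
	return err;
}
```

Callers that already hold the lock, such as an ethtool handler in the real driver, would call the _locked variant directly; everyone else goes through the wrapper.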

Signed-off-by: John Fastabend 


Acked-by: Jason Wang 


---
  drivers/net/virtio_net.c |   31 +--
  1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index bd22cf3..f8ba586 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1342,7 +1342,7 @@ static void virtnet_ack_link_announce(struct virtnet_info 
*vi)
rtnl_unlock();
  }
  
-static int virtnet_set_queues(struct virtnet_info *vi, u16 queue_pairs)

+static int _virtnet_set_queues(struct virtnet_info *vi, u16 queue_pairs)
  {
struct scatterlist sg;
struct net_device *dev = vi->dev;
@@ -1368,6 +1368,16 @@ static int virtnet_set_queues(struct virtnet_info *vi, 
u16 queue_pairs)
return 0;
  }
  
+static int virtnet_set_queues(struct virtnet_info *vi, u16 queue_pairs)

+{
+   int err;
+
+   rtnl_lock();
+   err = _virtnet_set_queues(vi, queue_pairs);
+   rtnl_unlock();
+   return err;
+}
+
  static int virtnet_close(struct net_device *dev)
  {
struct virtnet_info *vi = netdev_priv(dev);
@@ -1620,7 +1630,7 @@ static int virtnet_set_channels(struct net_device *dev,
return -EINVAL;
  
  	get_online_cpus();

-   err = virtnet_set_queues(vi, queue_pairs);
+   err = _virtnet_set_queues(vi, queue_pairs);
if (!err) {
netif_set_real_num_tx_queues(dev, queue_pairs);
netif_set_real_num_rx_queues(dev, queue_pairs);
@@ -1752,7 +1762,7 @@ static int virtnet_xdp_set(struct net_device *dev, struct 
bpf_prog *prog)
return -ENOMEM;
}
  
-	err = virtnet_set_queues(vi, curr_qp + xdp_qp);

+   err = _virtnet_set_queues(vi, curr_qp + xdp_qp);
if (err) {
dev_warn(&dev->dev, "XDP Device queue allocation failure.\n");
return err;
@@ -1761,7 +1771,7 @@ static int virtnet_xdp_set(struct net_device *dev, struct 
bpf_prog *prog)
if (prog) {
prog = bpf_prog_add(prog, vi->max_queue_pairs - 1);
if (IS_ERR(prog)) {
-   virtnet_set_queues(vi, curr_qp);
+   _virtnet_set_queues(vi, curr_qp);
return PTR_ERR(prog);
}
}
@@ -1880,12 +1890,11 @@ static void virtnet_free_queues(struct virtnet_info *vi)
kfree(vi->sq);
  }
  
-static void free_receive_bufs(struct virtnet_info *vi)

+static void _free_receive_bufs(struct virtnet_info *vi)
  {
struct bpf_prog *old_prog;
int i;
  
-	rtnl_lock();

for (i = 0; i < vi->max_queue_pairs; i++) {
while (vi->rq[i].pages)
__free_pages(get_a_page(&vi->rq[i], GFP_KERNEL), 0);
@@ -1895,6 +1904,12 @@ static void free_receive_bufs(struct virtnet_info *vi)
if (old_prog)
bpf_prog_put(old_prog);
}
+}
+
+static void free_receive_bufs(struct virtnet_info *vi)
+{
+   rtnl_lock();
+   _free_receive_bufs(vi);
rtnl_unlock();
  }
  
@@ -2333,9 +2348,7 @@ static int virtnet_probe(struct virtio_device *vdev)

goto free_unregister_netdev;
}
  
-	rtnl_lock();

virtnet_set_queues(vi, vi->curr_queue_pairs);
-   rtnl_unlock();
  
  	/* Assume link up if device can't report link status,

   otherwise get link status from config. */
@@ -2444,9 +2457,7 @@ static int virtnet_restore(struct virtio_device *vdev)
  
  	netif_device_attach(vi->dev);
  
-	rtnl_lock();

virtnet_set_queues(vi, vi->curr_queue_pairs);
-   rtnl_unlock();
  
  	err = virtnet_cpu_notif_add(vi);

if (err)





Re: [net-next PATCH v2 2/5] virtio_net: factor out xdp handler for readability

2017-02-05 Thread Jason Wang



On 2017-02-03 11:15, John Fastabend wrote:

At this point the do_xdp_prog is mostly if/else branches handling
the different modes of virtio_net. So remove it and handle running
the program in the per mode handlers.

Signed-off-by: John Fastabend 


Acked-by: Jason Wang 


---
  drivers/net/virtio_net.c |   86 +++---
  1 file changed, 35 insertions(+), 51 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f8ba586..3b49363 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -399,52 +399,6 @@ static bool virtnet_xdp_xmit(struct virtnet_info *vi,
return true;
  }
  
-static u32 do_xdp_prog(struct virtnet_info *vi,

-  struct receive_queue *rq,
-  struct bpf_prog *xdp_prog,
-  void *data, int len)
-{
-   int hdr_padded_len;
-   struct xdp_buff xdp;
-   void *buf;
-   unsigned int qp;
-   u32 act;
-
-   if (vi->mergeable_rx_bufs) {
-   hdr_padded_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
-   xdp.data = data + hdr_padded_len;
-   xdp.data_end = xdp.data + (len - vi->hdr_len);
-   buf = data;
-   } else { /* small buffers */
-   struct sk_buff *skb = data;
-
-   xdp.data = skb->data;
-   xdp.data_end = xdp.data + len;
-   buf = skb->data;
-   }
-
-   act = bpf_prog_run_xdp(xdp_prog, &xdp);
-   switch (act) {
-   case XDP_PASS:
-   return XDP_PASS;
-   case XDP_TX:
-   qp = vi->curr_queue_pairs -
-   vi->xdp_queue_pairs +
-   smp_processor_id();
-   xdp.data = buf;
-   if (unlikely(!virtnet_xdp_xmit(vi, rq, &vi->sq[qp], &xdp,
-  data)))
-   trace_xdp_exception(vi->dev, xdp_prog, act);
-   return XDP_TX;
-   default:
-   bpf_warn_invalid_xdp_action(act);
-   case XDP_ABORTED:
-   trace_xdp_exception(vi->dev, xdp_prog, act);
-   case XDP_DROP:
-   return XDP_DROP;
-   }
-}
-
  static struct sk_buff *receive_small(struct net_device *dev,
 struct virtnet_info *vi,
 struct receive_queue *rq,
@@ -460,19 +414,34 @@ static struct sk_buff *receive_small(struct net_device 
*dev,
xdp_prog = rcu_dereference(rq->xdp_prog);
if (xdp_prog) {
struct virtio_net_hdr_mrg_rxbuf *hdr = buf;
+   struct xdp_buff xdp;
+   unsigned int qp;
u32 act;
  
  		if (unlikely(hdr->hdr.gso_type || hdr->hdr.flags))

goto err_xdp;
-   act = do_xdp_prog(vi, rq, xdp_prog, skb, len);
+
+   xdp.data = skb->data;
+   xdp.data_end = xdp.data + len;
+   act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
switch (act) {
case XDP_PASS:
break;
case XDP_TX:
+   qp = vi->curr_queue_pairs -
+   vi->xdp_queue_pairs +
+   smp_processor_id();
+   if (unlikely(!virtnet_xdp_xmit(vi, rq, &vi->sq[qp],
+  &xdp, skb)))
+   trace_xdp_exception(vi->dev, xdp_prog, act);
rcu_read_unlock();
goto xdp_xmit;
-   case XDP_DROP:
default:
+   bpf_warn_invalid_xdp_action(act);
+   case XDP_ABORTED:
+   trace_xdp_exception(vi->dev, xdp_prog, act);
+   case XDP_DROP:
goto err_xdp;
}
}
@@ -590,6 +559,9 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
xdp_prog = rcu_dereference(rq->xdp_prog);
if (xdp_prog) {
struct page *xdp_page;
+   struct xdp_buff xdp;
+   unsigned int qp;
+   void *data;
u32 act;
  
  		/* This happens when rx buffer size is underestimated */

@@ -612,8 +584,11 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
if (unlikely(hdr->hdr.gso_type))
goto err_xdp;
  
-		act = do_xdp_prog(vi, rq, xdp_prog,

- page_address(xdp_page) + offset, len);
+   data = page_address(xdp_page) + offset;
+   xdp.data = data + vi->hdr_len;
+   xdp.data_end = xdp.data + (len - vi->hdr_len);
+   act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
switch (act) {
case XDP_PASS:
/* We can only create skb based on xdp_page. */
@@ -627,13 

Re: net: deadlock on genl_mutex

2017-02-05 Thread Cong Wang
On Sun, Jan 29, 2017 at 2:11 AM, Dmitry Vyukov  wrote:
> On Fri, Dec 9, 2016 at 6:08 AM, Cong Wang  wrote:
 Chain exists of:
  Possible unsafe locking scenario:

        CPU0                    CPU1

   lock(genl_mutex);
                                lock(nlk->cb_mutex);
                                lock(genl_mutex);
   lock(rtnl_mutex);

  *** DEADLOCK ***
>>>
>>> This one looks legitimate, because nlk->cb_mutex could be rtnl_mutex.
>>> Let me think about it.
>>
>> Never mind. Actually both reports in this thread are legitimate.
>>
>> I know what happened now, the lock chain is so long, 4 locks are involved
>> to form a chain!!!
>>
>> Let me think about how to break the chain.
>
>
> Cong, any success with breaking the chain?

No luck yet. Each part of the chain seems legit, not sure which
one could be reordered. :-/


Re: Understanding mutual exclusion between rtnl_lock and rcu_read_lock

2017-02-05 Thread Cong Wang
On Fri, Feb 3, 2017 at 7:25 PM, Joel Cunningham  wrote:
> From the documentation in dev.c:
> /*
>  * The @dev_base_head list is protected by @dev_base_lock and the rtnl
>  * semaphore.
>  *
>  * Pure readers hold dev_base_lock for reading, or rcu_read_lock()
>  *
>  * Writers must hold the rtnl semaphore while they loop through the
>  * dev_base_head list, and hold dev_base_lock for writing when they do the
>  * actual updates.  This allows pure readers to access the list even
>  * while a writer is preparing to update it.
>  *
>  * To put it another way, dev_base_lock is held for writing only to
>  * protect against pure readers; the rtnl semaphore provides the
>  * protection against other writers.
>  *
>  * See, for example usages, register_netdevice() and
>  * unregister_netdevice(), which must be called with the rtnl
>  * semaphore held.
>  */

This comment is pretty old, it was added prior to git history.


> Is the correct usage to hold both rtnl_lock() and dev_base_lock when 
> modifying a member of a struct net_device?  The wording seems vague as to 
> which synchronization issue holding both covers.  What does “do the actual 
> update” mean, updating the list or structure member?  If the latter, then 
> maybe the concurrent dev_ioctl() case has never been safe
>

I think dev_base_lock is supposed to speed up the readers when
we only read one or a few fields from netdevice, otherwise it would
be pretty pointless since we already have the RTNL lock.

Unfortunately, as you noticed, not all of these fields are protected
by dev_base_lock, so taking only this read lock is not enough for a
reader to get an atomic result.

RCU doesn't seem to be the solution here, since it still requires
a whole copy of netdevice even if we only update, for example, the MTU.
This is very inconvenient.

It is also kinda messy due to the mix of dev_base_lock, RCU,
and RTNL. Sigh...


[lkp-robot] [net] 5e2f2bdf34: BUG:unable_to_handle_kernel

2017-02-05 Thread kernel test robot

FYI, we noticed the following commit:

commit: 5e2f2bdf34997460bc590e2bf56438fa9ec50322 ("net: ipv6: Keep nexthop of multipath route on admin down")
url: https://github.com/0day-ci/linux/commits/David-Ahern/net-ipv6-Keep-nexthop-of-multipath-route-on-admin-down/20170119-120949


in testcase: trinity
with following parameters:

runtime: 300s

test-description: Trinity is a linux system call fuzz tester.
test-url: http://codemonkey.org.uk/projects/trinity/


on test machine: qemu-system-i386 -enable-kvm -m 256M

caused below changes (please refer to attached dmesg/kmsg for entire 
log/backtrace):


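+-------------------------------------------------------+------------+------------+
|                                                       | 4a7c972644 | 5e2f2bdf34 |
+-------------------------------------------------------+------------+------------+
| boot_successes                                        | 8          | 0          |
| boot_failures                                         | 0          | 8          |
| BUG:unable_to_handle_kernel                           | 0          | 6          |
| Oops                                                  | 0          | 6          |
| Kernel_panic-not_syncing:Fatal_exception_in_interrupt | 0          | 6          |
| WARNING:at_arch/x86/mm/dump_pagetables.c:#note_page   | 0          | 2          |
+-------------------------------------------------------+------------+------------+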



[   14.013077] random: uci: uninitialized urandom read (6 bytes read)
[   14.025067] random: uci: uninitialized urandom read (6 bytes read)
[   14.100439] sysctl (464) used greatest stack depth: 6480 bytes left
[   14.371523] BUG: unable to handle kernel NULL pointer dereference at 0178
[   14.372723] IP: rt6_fill_node+0x473/0x980
[   14.373313] *pdpt = 0c4cc001 *pde =  
[   14.373315] 
[   14.374008] Oops:  [#1] SMP
[   14.374383] CPU: 0 PID: 520 Comm: netifd Not tainted 4.10.0-rc4-00654-g5e2f2bd #1
[   14.375122] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
[   14.376145] task: 8d8c4000 task.stack: 8c4c8000
[   14.376849] EIP: rt6_fill_node+0x473/0x980
[   14.377456] EFLAGS: 00010206 CPU: 0
[   14.377840] EAX:  EBX: 8c4327a0 ECX: 0001 EDX: 0002
[   14.378444] ESI: 8d9dc028 EDI: 0200 EBP: 8c4c9bf4 ESP: 8c4c9b64
[   14.379067]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[   14.379607] CR0: 80050033 CR2: 0178 CR3: 0d8c1240 CR4: 06b0
[   14.380218] Call Trace:
[   14.380469]  ? __local_bh_enable_ip+0x105/0x160
[   14.380902]  rt6_dump_route+0x97/0xb0
[   14.381252]  fib6_dump_node+0x29/0x60
[   14.381595]  fib6_walk_continue+0x2c5/0x320
[   14.382009]  fib6_walk+0x33/0x70
[   14.382370]  inet6_dump_fib+0x31d/0x400
[   14.382748]  ? inet6_dump_fib+0x84/0x400
[   14.383162]  netlink_dump+0x18e/0x3e0
[   14.383531]  __netlink_dump_start+0x178/0x270
[   14.383963]  ? fib6_flush_trees+0x60/0x60
[   14.384349]  rtnetlink_rcv_msg+0x244/0x320
[   14.384744]  ? fib6_flush_trees+0x60/0x60
[   14.385122]  ? fib6_flush_trees+0x60/0x60
[   14.385490]  ? __rtnl_unlock+0x60/0x60
[   14.385848]  netlink_rcv_skb+0x129/0x170
[   14.386367]  ? __rtnl_unlock+0x60/0x60
[   14.386938]  rtnetlink_rcv+0x23/0x30
[   14.387500]  netlink_unicast+0x20f/0x2c0
[   14.388107]  netlink_sendmsg+0x4aa/0x4e0
[   14.388689]  ___sys_sendmsg+0x168/0x3c0
[   14.389323]  ? ___sys_recvmsg+0xf6/0x200
[   14.390040]  ? __lru_cache_add+0xaa/0x120
[   14.390809]  ? do_raw_spin_unlock+0xe5/0x140
[   14.391561]  ? debug_lockdep_rcu_enabled+0x1a/0x20
[   14.392398]  ? __fget_light+0xc1/0x100
[   14.393051]  ? sockfd_lookup_light+0xd9/0x120
[   14.393779]  __sys_sendmsg+0x73/0xb0
[   14.394415]  SyS_socketcall+0x5dc/0xa60
[   14.395133]  do_int80_syscall_32+0xaa/0x290
[   14.395848]  entry_INT80_32+0x36/0x36
[   14.396498] EIP: 0x77729384
[   14.397018] EFLAGS: 0216 CPU: 0
[   14.397620] EAX: ffda EBX: 0010 ECX: 7f9df738 EDX: 7776f000
[   14.398810] ESI: 7f9df738 EDI: 09a28804 EBP: 7f9df788 ESP: 7f9df728
[   14.399846]  DS: 007b ES: 007b FS:  GS: 0033 SS: 007b
[   14.400769] Code: ff 8b 45 88 8b 90 a8 00 00 00 29 d6 89 f2 e8 e5 d7 da ff 
e9 14 fc ff ff c7 46 18 10 00 00 00 8b 83 cc 00 00 00 ff 05 58 da 41 83 <8b> 90 
78 01 00 00 31 c0 85 d2 0f 95 c0 85 d2 8b 3c 85 40 da 41
[   14.404032] EIP: rt6_fill_node+0x473/0x980 SS:ESP: 0068:8c4c9b64
[   14.405027] CR2: 0178
[   14.405365] ---[ end trace 2967256f85d3e8d9 ]---
[   14.405869] Kernel panic - not syncing: Fatal exception in interrupt
[   14.406596] Kernel Offset: disabled


To reproduce:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/wfg/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k  job-script  # job-script is attached in this email



Thanks,
Xiaolong
#
# Automatically generated file; DO NOT EDIT.
# Linux/i386 4.10.0-rc4 Kernel Configuration
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y

Re: [net-next PATCH v2 0/5] XDP adjust head support for virtio

2017-02-05 Thread Michael S. Tsirkin
On Sun, Feb 05, 2017 at 05:36:34PM -0500, David Miller wrote:
> From: John Fastabend 
> Date: Thu, 02 Feb 2017 19:14:05 -0800
> 
> > This series adds adjust head support for virtio. The following is my
> > test setup. I use qemu + virtio as follows,
> > 
> > ./x86_64-softmmu/qemu-system-x86_64 \
> >   -hda /var/lib/libvirt/images/Fedora-test0.img \
> >   -m 4096  -enable-kvm -smp 2 -netdev tap,id=hn0,queues=4,vhost=on \
> >   -device 
> > virtio-net-pci,netdev=hn0,mq=on,guest_tso4=off,guest_tso6=off,guest_ecn=off,guest_ufo=off,vectors=9
> > 
> > In order to use XDP with virtio until LRO is supported TSO must be
> > turned off in the host. The important fields in the above command line
> > are the following,
> > 
> >   guest_tso4=off,guest_tso6=off,guest_ecn=off,guest_ufo=off
> > 
> > Also note it is possible to consume more queues than can be supported
> > because when XDP is enabled for retransmit XDP attempts to use a queue
> > per cpu. My standard queue count is 'queues=4'.
> > 
> > After loading the VM I run the relevant XDP test programs in,
> > 
> >   ./samples/bpf
> > 
> > For this series I tested xdp1, xdp2, and xdp_tx_iptunnel. I usually test
> > with iperf (-d option to get bidirectional traffic), ping, and pktgen.
> > I also have a modified xdp1 that returns XDP_PASS on any packet to ensure
> > the normal traffic path to the stack continues to work with XDP loaded.
> > 
> > It would be great to automate this soon. At the moment I do it by hand
> > which is starting to get tedious.
> > 
> > v2: original series dropped trace points after merge.
> 
> Michael, I just want to apply this right now.
> 
> I don't think haggling over whether to allocate the adjust_head area
> unconditionally or not is a blocker for this series going in.  That
> can be addressed trivially in a follow-on patch.

FYI it would just mean we revert most of this patchset except patches 2 and 3 
though.

> We want these new reset paths tested as much as possible and each day
> we delay this series is detrimental towards that goal.
> 
> Thanks.

Well the point is to avoid resets completely, at the cost of extra 256 bytes
for packets > 128 bytes on ppc (64k pages) only.

Found a volunteer so I hope to have this idea tested on ppc Tuesday.

And really all we need to know is confirm whether this:
-#define MERGEABLE_BUFFER_MIN_ALIGN_SHIFT ((PAGE_SHIFT + 1) / 2)
+#define MERGEABLE_BUFFER_MIN_ALIGN_SHIFT (PAGE_SHIFT / 2 + 1)

affects performance in a measurable way.
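
To make the effect of that one-line change concrete, here is a small stand-alone sketch that evaluates both definitions. It assumes PAGE_SHIFT is 12 for 4 KiB pages and 16 for 64 KiB pages; the function names are hypothetical and only mirror the two macro bodies above:

```c
#include <stdio.h>

/* Current macro body: (PAGE_SHIFT + 1) / 2, i.e. PAGE_SHIFT / 2 rounded up. */
static unsigned int min_align_shift_old(unsigned int page_shift)
{
	return (page_shift + 1) / 2;
}

/* Proposed macro body: PAGE_SHIFT / 2 + 1. */
static unsigned int min_align_shift_new(unsigned int page_shift)
{
	return page_shift / 2 + 1;
}

/* Alignment in bytes that corresponds to a given shift. */
static unsigned long align_bytes(unsigned int shift)
{
	return 1UL << shift;
}

/* Print old and new alignment for one page size. */
static void print_row(unsigned int page_shift)
{
	printf("PAGE_SHIFT=%u old=%lu new=%lu\n", page_shift,
	       align_bytes(min_align_shift_old(page_shift)),
	       align_bytes(min_align_shift_new(page_shift)));
}
```

Under these assumptions the minimum alignment goes from 64 to 128 bytes with 4 KiB pages and from 256 to 512 bytes with 64 KiB pages. The 256-byte difference in the 64 KiB case appears to be the extra cost mentioned above.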

So I would rather wait another day. But the patches themselves
look correct, from that POV.

Acked-by: Michael S. Tsirkin 

but I would prefer that you waited another day for a Tested-by from me too.

-- 
MST



[PATCH net] ipv6: tcp: add a missing tcp_v6_restore_cb()

2017-02-05 Thread Eric Dumazet
From: Eric Dumazet 

Dmitry reported use-after-free in ip6_datagram_recv_specific_ctl()

A similar bug was fixed in commit 8ce48623f0cf ("ipv6: tcp: restore
IP6CB for pktoptions skbs"), but I missed another spot.

tcp_v6_syn_recv_sock() can indeed set np->pktoptions from ireq->pktopts
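
For readers unfamiliar with the control-block shuffle this fix relies on, the sketch below models it in user space. It is an illustration under stated assumptions, not the kernel layout: HDR_OFFSET and HDR_SIZE are hypothetical values, fake_skb is a made-up type, and only the 48-byte size of skb->cb[] is taken from the kernel. The TCP layer moves the IP-layer data deeper into cb[] so it can use the front for its own state, and the restore step moves it back so that code expecting IP6CB at the start of cb[] sees valid data:

```c
#include <string.h>

#define CB_SIZE    48	/* size of skb->cb[] in the kernel */
#define HDR_OFFSET 16	/* hypothetical offset of the saved IP-layer data */
#define HDR_SIZE   20	/* hypothetical size of the IP-layer data */

struct fake_skb {
	unsigned char cb[CB_SIZE];
};

/* Models the fill step: save the IP-layer data deeper in cb[], then let
 * the TCP layer overwrite the front with its own state.
 */
static void fill_cb(struct fake_skb *skb)
{
	/* Source and destination overlap here, hence memmove(). */
	memmove(skb->cb + HDR_OFFSET, skb->cb, HDR_SIZE);
	memset(skb->cb, 0xAA, HDR_OFFSET);
}

/* Models the restore step: move the saved data back to the front. */
static void restore_cb(struct fake_skb *skb)
{
	memmove(skb->cb, skb->cb + HDR_OFFSET, HDR_SIZE);
}
```

If the restore step is skipped, a later reader of the front of cb[] sees TCP state instead of the IP-layer data, which is the kind of stale access the patch is meant to prevent.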

Fixes: 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
Signed-off-by: Eric Dumazet 
Reported-by: Dmitry Vyukov 
---
 net/ipv6/tcp_ipv6.c |   24 +---
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 
cb8929681dc7597eebcc46026e4b6548f4bedadb..eaad72c3d7462b4af09d632fe88466148964e679
 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -991,6 +991,16 @@ static int tcp_v6_conn_request(struct sock *sk, struct 
sk_buff *skb)
return 0; /* don't send reset */
 }
 
+static void tcp_v6_restore_cb(struct sk_buff *skb)
+{
+   /* We need to move header back to the beginning if xfrm6_policy_check()
+* and tcp_v6_fill_cb() are going to be called again.
+* ip6_datagram_recv_specific_ctl() also expects IP6CB to be there.
+*/
+   memmove(IP6CB(skb), &TCP_SKB_CB(skb)->header.h6,
+   sizeof(struct inet6_skb_parm));
+}
+
 static struct sock *tcp_v6_syn_recv_sock(const struct sock *sk, struct sk_buff 
*skb,
 struct request_sock *req,
 struct dst_entry *dst,
@@ -1182,8 +1192,10 @@ static struct sock *tcp_v6_syn_recv_sock(const struct 
sock *sk, struct sk_buff *
  sk_gfp_mask(sk, 
GFP_ATOMIC));
consume_skb(ireq->pktopts);
ireq->pktopts = NULL;
-   if (newnp->pktoptions)
+   if (newnp->pktoptions) {
+   tcp_v6_restore_cb(newnp->pktoptions);
skb_set_owner_r(newnp->pktoptions, newsk);
+   }
}
}
 
@@ -1198,16 +1210,6 @@ static struct sock *tcp_v6_syn_recv_sock(const struct 
sock *sk, struct sk_buff *
return NULL;
 }
 
-static void tcp_v6_restore_cb(struct sk_buff *skb)
-{
-   /* We need to move header back to the beginning if xfrm6_policy_check()
-* and tcp_v6_fill_cb() are going to be called again.
-* ip6_datagram_recv_specific_ctl() also expects IP6CB to be there.
-*/
-   memmove(IP6CB(skb), &TCP_SKB_CB(skb)->header.h6,
-   sizeof(struct inet6_skb_parm));
-}
-
 /* The socket must have it's spinlock held when we get
  * here, unless it is a TCP_LISTEN socket.
  *




Re: [PATCH net-next v2 1/2] Add a helper function to get socket cookie in eBPF

2017-02-05 Thread Eric Dumazet
On Mon, 2017-02-06 at 12:01 +0900, Lorenzo Colitti wrote:
> On Mon, Feb 6, 2017 at 11:17 AM, Chenbo Feng
>  wrote:
> > +BPF_CALL_1(bpf_get_socket_cookie, struct sk_buff *, skb)
> > +{
> > +   return skb->sk ? sock_gen_cookie(skb->sk) : 0;
> > +}
> > +
> 
> Does this code need to increment the socket refcount, or call
> ACCESS_ONCE to get skb->sk? The socket filter codepath should be safe,
> but if this function is called in xt_ebpf, could it race with
> something that sets skb->sk to null?

I do not see how this could possibly happen.

READ_ONCE() would not prevent the 'old' sk from disappearing anyway.





Re: [PATCH net-next 2/2] Add a eBPF helper function to retrieve socket uid

2017-02-05 Thread Lorenzo Colitti
On Fri, Feb 3, 2017 at 10:51 AM, Eric Dumazet  wrote:
> if (sk) {
> sk = sk_to_full_sk(sk);
> if (sk_fullsock(sk))
> return sk->sk_uid;
> }

Sure, though sk_to_full_sk is in inet_sock.h so I have to move some
core around. Options I see:

1. Move sk_to_full_sk from inet_sock.h to sock.h.
2. Move sock_net_uid to inet_sock.h.
3. Move sock_net_uid to sock.c and EXPORT_SYMBOL_GPL it.

Thoughts? #1 seems reasonable, since sk_fullsock is already in sock.h.
#2 would mean that we can't call sock_net_uid from non-inet code.
Currently the only code that accesses sk->sk_uid is inet code, but in
the future perhaps some of the code around the tree that calls
sock_i_uid could be migrated to use at sk->sk_uid instead.
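
The snippet quoted at the top of this message can be modelled in user space as follows. This is a rough sketch under assumptions: the model_* types and functions are hypothetical and only imitate what sk_to_full_sk() and sk_fullsock() are understood to do (map a request socket to its listener, and reject time-wait and request sockets, respectively):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified socket states. */
enum model_state {
	ST_ESTABLISHED,
	ST_LISTEN,
	ST_TIME_WAIT,
	ST_NEW_SYN_RECV,
};

struct model_sock {
	enum model_state state;
	unsigned int uid;		/* only meaningful on full sockets */
	struct model_sock *listener;	/* only set on request sockets */
};

/* Imitates sk_fullsock(): time-wait and request sockets are not full. */
static bool model_fullsock(const struct model_sock *sk)
{
	return sk->state != ST_TIME_WAIT && sk->state != ST_NEW_SYN_RECV;
}

/* Imitates sk_to_full_sk(): a request socket maps to its listener. */
static struct model_sock *model_to_full_sk(struct model_sock *sk)
{
	if (sk && sk->state == ST_NEW_SYN_RECV)
		sk = sk->listener;
	return sk;
}

/* Mirrors the quoted logic; 0 stands in for "no owner available". */
static unsigned int model_socket_uid(struct model_sock *sk)
{
	if (sk) {
		sk = model_to_full_sk(sk);
		if (sk && model_fullsock(sk))
			return sk->uid;
	}
	return 0;
}
```

In this model a request socket reports its listener's uid, while a time-wait socket still falls through to the default, which is why the full-socket check remains after the conversion.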


Re: [PATCH net-next v2 1/2] Add a helper function to get socket cookie in eBPF

2017-02-05 Thread Lorenzo Colitti
On Mon, Feb 6, 2017 at 11:17 AM, Chenbo Feng
 wrote:
> +BPF_CALL_1(bpf_get_socket_cookie, struct sk_buff *, skb)
> +{
> +   return skb->sk ? sock_gen_cookie(skb->sk) : 0;
> +}
> +

Does this code need to increment the socket refcount, or call
ACCESS_ONCE to get skb->sk? The socket filter codepath should be safe,
but if this function is called in xt_ebpf, could it race with
something that sets skb->sk to null?


Re: [PATCH net-next v2 2/2] Add a eBPF helper function to retrieve socket uid

2017-02-05 Thread Eric Dumazet
On Sun, 2017-02-05 at 18:17 -0800, Chenbo Feng wrote:
> From: Chenbo Feng 
> 
> Returns the owner uid of the socket inside a sk_buff. This is useful to
> perform per-UID accounting of network traffic or per-UID packet
> filtering.
> 
> Signed-off-by: Chenbo Feng 
> ---
> +BPF_CALL_1(bpf_get_socket_uid, struct sk_buff *, skb)
> +{
> + struct sock *sk = skb->sk;
> + kuid_t kuid = sock_net_uid(net, sk && sk_fullsock(sk) ?
> +sk : NULL);
> + return (u32)kuid.val;
> +}
> +

Have you considered to use sk_to_full_sk() ?

struct sock *sk = sk_to_full_sk(skb->sk);
kuid_t kuid = sock_net_uid(net, sk);







Re: [PATCH net-next v2 1/2] Add a helper function to get socket cookie in eBPF

2017-02-05 Thread Eric Dumazet
On Sun, 2017-02-05 at 18:17 -0800, Chenbo Feng wrote:
> From: Chenbo Feng 
> 
> Retrieve the socket cookie generated by sock_gen_cookie() from a sk_buff
> with a known socket. Generates a new cookie if one was not yet set. If
> the socket pointer inside sk_buff is NULL, 0 is returned. The helper
> function could be useful in monitoring per socket networking traffic
> statistics and provide a unique socket identifier per namespace.
> 
> Signed-off-by: Chenbo Feng 
> ---

Acked-by: Eric Dumazet 




Re: [PATCH net-next v2 2/2] Add a eBPF helper function to retrieve socket uid

2017-02-05 Thread kbuild test robot
Hi Chenbo,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Chenbo-Feng/Two-Helper-function-about-socket-information/20170206-102835
config: x86_64-randconfig-x014-201706 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   net/core/filter.c: In function 'bpf_get_socket_uid':
>> net/core/filter.c:2618:29: error: 'net' undeclared (first use in this 
>> function)
 kuid_t kuid = sock_net_uid(net, sk && sk_fullsock(sk) ?
^~~
   net/core/filter.c:2618:29: note: each undeclared identifier is reported only 
once for each function it appears in

vim +/net +2618 net/core/filter.c

  2612  .arg1_type  = ARG_PTR_TO_CTX,
  2613  };
  2614  
  2615  BPF_CALL_1(bpf_get_socket_uid, struct sk_buff *, skb)
  2616  {
  2617  struct sock *sk = skb->sk;
> 2618  kuid_t kuid = sock_net_uid(net, sk && sk_fullsock(sk) ?
  2619 sk : NULL);
  2620  return (u32)kuid.val;
  2621  }

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




[PATCH net-next v2 0/2] Two Helper function about socket information

2017-02-05 Thread Chenbo Feng
From: Chenbo Feng 

Introduce two eBPF helper functions to get the socket cookie and
socket uid for each packet. The helper functions are useful when
the *sk field inside sk_buff is not empty.

Chenbo Feng (2):
  Add a helper function to get socket cookie in eBPF
  Add a eBPF helper function to retrieve socket uid

 include/linux/sock_diag.h |  1 +
 include/uapi/linux/bpf.h  | 16 +++-
 net/core/filter.c | 32 
 net/core/sock_diag.c  |  2 +-
 4 files changed, 49 insertions(+), 2 deletions(-)

-- 
2.7.4



[PATCH net-next v2 2/2] Add a eBPF helper function to retrieve socket uid

2017-02-05 Thread Chenbo Feng
From: Chenbo Feng 

Returns the owner uid of the socket inside a sk_buff. This is useful to
perform per-UID accounting of network traffic or per-UID packet
filtering.

Signed-off-by: Chenbo Feng 
---
 include/uapi/linux/bpf.h |  9 -
 net/core/filter.c| 17 +
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6923d21..4854027 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -455,6 +455,12 @@ union bpf_attr {
  * @skb: pointer to skb
  * Return: 8 Bytes non-decreasing number on success or 0 if the socket
  * field is missing inside sk_buff
+ *
+ * u32 bpf_get_socket_uid(skb)
+ * Get the owner uid of the socket stored inside sk_buff.
+ * @skb: pointer to skb
+ * Return: uid of the socket owner on success or 0 if the socket pointer
+ * inside sk_buff is NULL
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -503,7 +509,8 @@ union bpf_attr {
FN(skb_change_head),\
FN(xdp_adjust_head),\
FN(probe_read_str), \
-   FN(get_socket_cookie),
+   FN(get_socket_cookie),  \
+   FN(get_socket_uid),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 632fb91..523ed08 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2612,6 +2612,21 @@ static const struct bpf_func_proto 
bpf_get_socket_cookie_proto = {
.arg1_type  = ARG_PTR_TO_CTX,
 };
 
+BPF_CALL_1(bpf_get_socket_uid, struct sk_buff *, skb)
+{
+   struct sock *sk = skb->sk;
+   kuid_t kuid = sock_net_uid(net, sk && sk_fullsock(sk) ?
+  sk : NULL);
+   return (u32)kuid.val;
+}
+
+static const struct bpf_func_proto bpf_get_socket_uid_proto = {
+   .func   = bpf_get_socket_uid,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+};
+
 static const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
@@ -2637,6 +2652,8 @@ bpf_base_func_proto(enum bpf_func_id func_id)
return bpf_get_trace_printk_proto();
case BPF_FUNC_get_socket_cookie:
return &bpf_get_socket_cookie_proto;
+   case BPF_FUNC_get_socket_uid:
+   return &bpf_get_socket_uid_proto;
default:
return NULL;
}
-- 
2.7.4



[PATCH net-next v2 1/2] Add a helper function to get socket cookie in eBPF

2017-02-05 Thread Chenbo Feng
From: Chenbo Feng 

Retrieve the socket cookie generated by sock_gen_cookie() from a sk_buff
with a known socket. Generates a new cookie if one was not yet set. If
the socket pointer inside sk_buff is NULL, 0 is returned. The helper
function could be useful in monitoring per socket networking traffic
statistics and provide a unique socket identifier per namespace.
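
The lazy cookie assignment described above can be sketched in user space with C11 atomics. This is an illustration under assumptions, not the kernel implementation: cookie_sock, cookie_gen, gen_cookie() and get_socket_cookie() are hypothetical names, and a single global counter stands in for whatever per-namespace generator the kernel uses:

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical socket with a lazily assigned 64-bit cookie (0 = unset). */
struct cookie_sock {
	_Atomic uint64_t cookie;
};

/* Hypothetical global generator; the first cookie handed out is 1. */
static _Atomic uint64_t cookie_gen;

/* Return the socket's cookie, assigning one on first use. */
static uint64_t gen_cookie(struct cookie_sock *sk)
{
	for (;;) {
		uint64_t res = atomic_load(&sk->cookie);
		uint64_t expected = 0;

		if (res)
			return res;
		res = atomic_fetch_add(&cookie_gen, 1) + 1;
		/* Only install our value if nobody else got there first,
		 * then loop to re-read whichever value won.
		 */
		atomic_compare_exchange_strong(&sk->cookie, &expected, res);
	}
}

/* Mirrors the helper's contract: 0 when there is no socket. */
static uint64_t get_socket_cookie(struct cookie_sock *sk)
{
	return sk ? gen_cookie(sk) : 0;
}
```

Because 0 is reserved for "unset" and for "no socket", a caller can use a zero return to detect packets that have no associated socket.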

Signed-off-by: Chenbo Feng 
---
 include/linux/sock_diag.h |  1 +
 include/uapi/linux/bpf.h  |  9 -
 net/core/filter.c | 15 +++
 net/core/sock_diag.c  |  2 +-
 4 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/include/linux/sock_diag.h b/include/linux/sock_diag.h
index a0596ca0..a2f8109 100644
--- a/include/linux/sock_diag.h
+++ b/include/linux/sock_diag.h
@@ -24,6 +24,7 @@ void sock_diag_unregister(const struct sock_diag_handler *h);
 void sock_diag_register_inet_compat(int (*fn)(struct sk_buff *skb, struct 
nlmsghdr *nlh));
 void sock_diag_unregister_inet_compat(int (*fn)(struct sk_buff *skb, struct 
nlmsghdr *nlh));
 
+u64 sock_gen_cookie(struct sock *sk);
 int sock_diag_check_cookie(struct sock *sk, const __u32 *cookie);
 void sock_diag_save_cookie(struct sock *sk, __u32 *cookie);
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e07fd5a..6923d21 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -449,6 +449,12 @@ union bpf_attr {
  * Return:
  *   > 0 length of the string including the trailing NUL on success
  *   < 0 error
+ *
+ * u64 bpf_bpf_get_socket_cookie(skb)
+ * Get the cookie for the socket stored inside sk_buff.
+ * @skb: pointer to skb
+ * Return: 8 Bytes non-decreasing number on success or 0 if the socket
+ * field is missing inside sk_buff
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -496,7 +502,8 @@ union bpf_attr {
FN(get_numa_node_id),   \
FN(skb_change_head),\
FN(xdp_adjust_head),\
-   FN(probe_read_str),
+   FN(probe_read_str), \
+   FN(get_socket_cookie),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 0b753cb..632fb91 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2599,6 +2600,18 @@ static const struct bpf_func_proto 
bpf_xdp_event_output_proto = {
.arg5_type  = ARG_CONST_SIZE,
 };
 
+BPF_CALL_1(bpf_get_socket_cookie, struct sk_buff *, skb)
+{
+   return skb->sk ? sock_gen_cookie(skb->sk) : 0;
+}
+
+static const struct bpf_func_proto bpf_get_socket_cookie_proto = {
+   .func   = bpf_get_socket_cookie,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+};
+
 static const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
@@ -2622,6 +2635,8 @@ bpf_base_func_proto(enum bpf_func_id func_id)
case BPF_FUNC_trace_printk:
if (capable(CAP_SYS_ADMIN))
return bpf_get_trace_printk_proto();
+   case BPF_FUNC_get_socket_cookie:
+   return &bpf_get_socket_cookie_proto;
default:
return NULL;
}
diff --git a/net/core/sock_diag.c b/net/core/sock_diag.c
index 6b10573..acd2a6c 100644
--- a/net/core/sock_diag.c
+++ b/net/core/sock_diag.c
@@ -19,7 +19,7 @@ static int (*inet_rcv_compat)(struct sk_buff *skb, struct 
nlmsghdr *nlh);
 static DEFINE_MUTEX(sock_diag_table_mutex);
 static struct workqueue_struct *broadcast_wq;
 
-static u64 sock_gen_cookie(struct sock *sk)
+u64 sock_gen_cookie(struct sock *sk)
 {
while (1) {
u64 res = atomic64_read(&sk->sk_cookie);
-- 
2.7.4



linux-next: manual merge of the drm tree with the net-next tree

2017-02-05 Thread Stephen Rothwell
Hi Dave,

Today's linux-next merge of the drm tree got a conflict in:

  lib/Kconfig

between commit:

  44091d29f207 ("lib: Introduce priority array area manager")

from the net-next tree and commit:

  cf4a7207b1cb ("lib: Add a simple prime number generator")

from the drm tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc lib/Kconfig
index 5d644f180fe5,1788a1f50d28..
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@@ -550,7 -550,11 +550,14 @@@ config STACKDEPO
  config SBITMAP
bool
  
 +config PARMAN
 +  tristate "parman"
 +
+ config PRIME_NUMBERS
+   tristate "Prime number generator"
+   default n
+   help
+ Provides a helper module to generate prime numbers. Useful for writing
+ test code, especially when checking multiplication and divison.
+ 
  endmenu


Re: [PATCHv1] net-next: treewide use is_vlan_dev() helper function.

2017-02-05 Thread Jonathan Maxwell
On Sun, Feb 5, 2017 at 4:00 AM, Parav Pandit  wrote:
> This patch uses the is_vlan_dev() helper function instead of the open-coded
> flag comparison, which is exactly what is_vlan_dev() does.
>
> Signed-off-by: Parav Pandit 
> Reviewed-by: Daniel Jurgens 
> ---
>  drivers/infiniband/core/cma.c|  6 ++
>  drivers/infiniband/sw/rxe/rxe_net.c  |  2 +-
>  drivers/net/ethernet/broadcom/cnic.c |  2 +-
>  drivers/net/ethernet/chelsio/cxgb3/l2t.c |  2 +-
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c  |  4 ++--
>  drivers/net/ethernet/chelsio/cxgb4/l2t.c |  2 +-
>  drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c |  8 
>  drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c |  4 ++--
>  drivers/net/hyperv/netvsc_drv.c  |  2 +-
>  drivers/scsi/bnx2fc/bnx2fc_fcoe.c|  6 +++---
>  drivers/scsi/cxgbi/libcxgbi.c|  6 +++---
>  drivers/scsi/fcoe/fcoe.c | 13 ++---
>  include/rdma/ib_addr.h   |  6 ++
>  net/hsr/hsr_slave.c  |  3 ++-
>  14 files changed, 31 insertions(+), 35 deletions(-)
>

Neatens the code up nicely.

Acked-by: Jon Maxwell 


[PATCH net-next v1 1/7] bpf: Add missing header to the library

2017-02-05 Thread Mickaël Salaün
Including stddef.h is needed to define size_t.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Arnaldo Carvalho de Melo 
Cc: Daniel Borkmann 
Cc: Wang Nan 
---
 tools/lib/bpf/bpf.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index a2f9853dd882..df6e186da788 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -22,6 +22,7 @@
 #define __BPF_BPF_H
 
 #include 
+#include <stddef.h>
 
 int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
   int max_entries, __u32 map_flags);
-- 
2.11.0



[PATCH net-next v1 6/7] bpf: Use the bpf_load_program() from the library

2017-02-05 Thread Mickaël Salaün
Replace bpf_prog_load() with bpf_load_program() calls.

Use the tools include directory instead of the installed one to allow
builds from other kernels.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Shuah Khan 
---
 tools/testing/selftests/bpf/Makefile|  6 +-
 tools/testing/selftests/bpf/bpf_sys.h   | 21 -
 tools/testing/selftests/bpf/test_tag.c  |  6 --
 tools/testing/selftests/bpf/test_verifier.c |  8 +---
 4 files changed, 14 insertions(+), 27 deletions(-)

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 769a6cb42b4b..712861492278 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -1,12 +1,16 @@
-CFLAGS += -Wall -O2 -I../../../../usr/include
+CFLAGS += -Wall -O2 -I../../../include/uapi -I../../../lib
 
 test_objs = test_verifier test_tag test_maps test_lru_map test_lpm_map
 
 TEST_PROGS := $(test_objs) test_kmod.sh
 TEST_FILES := $(test_objs)
+LIBBPF := ../../../lib/bpf/bpf.o
 
 all: $(test_objs)
 
+test_verifier: $(LIBBPF)
+test_tag: $(LIBBPF)
+
 include ../lib.mk
 
 clean:
diff --git a/tools/testing/selftests/bpf/bpf_sys.h 
b/tools/testing/selftests/bpf/bpf_sys.h
index 6b4565f2a3f2..e7bbe3e5402e 100644
--- a/tools/testing/selftests/bpf/bpf_sys.h
+++ b/tools/testing/selftests/bpf/bpf_sys.h
@@ -84,25 +84,4 @@ static inline int bpf_map_create(enum bpf_map_type type, 
uint32_t size_key,
return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
 }
 
-static inline int bpf_prog_load(enum bpf_prog_type type,
-   const struct bpf_insn *insns, size_t size_insns,
-   const char *license, char *log, size_t size_log)
-{
-   union bpf_attr attr = {};
-
-   attr.prog_type = type;
-   attr.insns = bpf_ptr_to_u64(insns);
-   attr.insn_cnt = size_insns / sizeof(struct bpf_insn);
-   attr.license = bpf_ptr_to_u64(license);
-
-   if (size_log > 0) {
-   attr.log_buf = bpf_ptr_to_u64(log);
-   attr.log_size = size_log;
-   attr.log_level = 1;
-   log[0] = 0;
-   }
-
-   return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
-}
-
 #endif /* __BPF_SYS__ */
diff --git a/tools/testing/selftests/bpf/test_tag.c 
b/tools/testing/selftests/bpf/test_tag.c
index 5f7c602f47d1..b77dc4b03e77 100644
--- a/tools/testing/selftests/bpf/test_tag.c
+++ b/tools/testing/selftests/bpf/test_tag.c
@@ -16,6 +16,8 @@
 #include 
 #include 
 
+#include 
+
 #include "../../../include/linux/filter.h"
 
 #include "bpf_sys.h"
@@ -55,8 +57,8 @@ static int bpf_try_load_prog(int insns, int fd_map,
int fd_prog;
 
bpf_filler(insns, fd_map);
-   fd_prog = bpf_prog_load(BPF_PROG_TYPE_SCHED_CLS, prog, insns *
-   sizeof(struct bpf_insn), "", NULL, 0);
+   fd_prog = bpf_load_program(BPF_PROG_TYPE_SCHED_CLS, prog, insns, "", 0,
+   NULL, 0);
assert(fd_prog > 0);
if (fd_map > 0)
bpf_filler(insns, 0);
diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 0d0912c7f03c..04a549e54f61 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -23,6 +23,8 @@
 #include 
 #include 
 
+#include 
+
 #include "../../../include/linux/filter.h"
 
 #include "bpf_sys.h"
@@ -4456,9 +4458,9 @@ static void do_test_single(struct bpf_test *test, bool 
unpriv,
 
do_test_fixup(test, prog, _f1, _f2, _f3);
 
-   fd_prog = bpf_prog_load(prog_type ? : BPF_PROG_TYPE_SOCKET_FILTER,
-   prog, prog_len * sizeof(struct bpf_insn),
-   "GPL", bpf_vlog, sizeof(bpf_vlog));
+   fd_prog = bpf_load_program(prog_type ? : BPF_PROG_TYPE_SOCKET_FILTER,
+   prog, prog_len, "GPL", 0, bpf_vlog,
+   sizeof(bpf_vlog));
 
expected_ret = unpriv && test->result_unpriv != UNDEF ?
   test->result_unpriv : test->result;
-- 
2.11.0



[PATCH net-next v1 4/7] tools: Sync {,tools/}include/uapi/linux/bpf.h

2017-02-05 Thread Mickaël Salaün
The tools version of this header is out of date; update it to the latest
version from kernel header.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Arnaldo Carvalho de Melo 
Cc: Daniel Borkmann 
Cc: David S. Miller 
---
 tools/include/uapi/linux/bpf.h | 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 0eb0e87dbe9f..e07fd5a324e6 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -63,6 +63,12 @@ struct bpf_insn {
__s32   imm;/* signed immediate constant */
 };
 
+/* Key of an a BPF_MAP_TYPE_LPM_TRIE entry */
+struct bpf_lpm_trie_key {
+   __u32   prefixlen;  /* up to 32 for AF_INET, 128 for AF_INET6 */
+   __u8data[0];/* Arbitrary size */
+};
+
 /* BPF syscall commands, see bpf(2) man-page for details. */
 enum bpf_cmd {
BPF_MAP_CREATE,
@@ -89,6 +95,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_CGROUP_ARRAY,
BPF_MAP_TYPE_LRU_HASH,
BPF_MAP_TYPE_LRU_PERCPU_HASH,
+   BPF_MAP_TYPE_LPM_TRIE,
 };
 
 enum bpf_prog_type {
@@ -430,6 +437,18 @@ union bpf_attr {
  * @xdp_md: pointer to xdp_md
  * @delta: An positive/negative integer to be added to xdp_md.data
  * Return: 0 on success or negative on error
+ *
+ * int bpf_probe_read_str(void *dst, int size, const void *unsafe_ptr)
+ * Copy a NUL terminated string from unsafe address. In case the string
+ * length is smaller than size, the target is not padded with further NUL
+ * bytes. In case the string length is larger than size, just count-1
+ * bytes are copied and the last byte is set to NUL.
+ * @dst: destination address
+ * @size: maximum number of bytes to copy, including the trailing NUL
+ * @unsafe_ptr: unsafe address
+ * Return:
+ *   > 0 length of the string including the trailing NUL on success
+ *   < 0 error
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -476,7 +495,8 @@ union bpf_attr {
FN(set_hash_invalid),   \
FN(get_numa_node_id),   \
FN(skb_change_head),\
-   FN(xdp_adjust_head),
+   FN(xdp_adjust_head),\
+   FN(probe_read_str),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -502,6 +522,7 @@ enum bpf_func_id {
 /* BPF_FUNC_l4_csum_replace flags. */
 #define BPF_F_PSEUDO_HDR   (1ULL << 4)
 #define BPF_F_MARK_MANGLED_0   (1ULL << 5)
+#define BPF_F_MARK_ENFORCE (1ULL << 6)
 
 /* BPF_FUNC_clone_redirect and BPF_FUNC_redirect flags. */
 #define BPF_F_INGRESS  (1ULL << 0)
-- 
2.11.0



[PATCH net-next v1 2/7] samples/bpf: Ignore already processed ELF sections

2017-02-05 Thread Mickaël Salaün
Add a missing check for the map fixup loop.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Arnaldo Carvalho de Melo 
Cc: Daniel Borkmann 
---
 samples/bpf/bpf_load.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 396e204888b3..e04fe09d7c2e 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -328,6 +328,8 @@ int load_bpf_file(char *path)
 
/* load programs that need map fixup (relocations) */
for (i = 1; i < ehdr.e_shnum; i++) {
+   if (processed_sec[i])
+   continue;
 
if (get_sec(elf, i, , , , ))
continue;
-- 
2.11.0



[PATCH net-next v1 3/7] samples/bpf: Reset global variables

2017-02-05 Thread Mickaël Salaün
Before loading a new ELF, clean previous kernel version, license and
processed sections.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Arnaldo Carvalho de Melo 
Cc: Daniel Borkmann 
Cc: David S. Miller 
---
 samples/bpf/bpf_load.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index e04fe09d7c2e..b86ee54da2d1 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -277,6 +277,11 @@ int load_bpf_file(char *path)
Elf_Data *data, *data_prog, *symbols = NULL;
char *shname, *shname_prog;
 
+   /* reset global variables */
+   kern_version = 0;
+   memset(license, 0, sizeof(license));
+   memset(processed_sec, 0, sizeof(processed_sec));
+
if (elf_version(EV_CURRENT) == EV_NONE)
return 1;
 
-- 
2.11.0



[PATCH net-next v1 7/7] bpf: Always test unprivileged programs

2017-02-05 Thread Mickaël Salaün
If selftests are run as root, then execute the unprivileged checks as
well. This switches from 240 to 364 tests.

The test numbers are suffixed with "/u" when executed as unprivileged or
with "/p" when executed as privileged.

The geteuid() check is replaced with a capability check.

Handling capabilities requires the libcap dependency.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Shuah Khan 
---
 tools/testing/selftests/bpf/Makefile|  2 +-
 tools/testing/selftests/bpf/test_verifier.c | 68 ++---
 2 files changed, 64 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 712861492278..30bb40261692 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -1,4 +1,4 @@
-CFLAGS += -Wall -O2 -I../../../include/uapi -I../../../lib
+CFLAGS += -Wall -O2 -lcap -I../../../include/uapi -I../../../lib
 
 test_objs = test_verifier test_tag test_maps test_lru_map test_lpm_map
 
diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 04a549e54f61..aa42aef22b85 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 
+#include 
 #include 
 
 #include 
@@ -4498,6 +4499,55 @@ static void do_test_single(struct bpf_test *test, bool 
unpriv,
goto close_fds;
 }
 
+static bool is_admin(void)
+{
+   cap_t caps;
+   cap_flag_value_t sysadmin = CAP_CLEAR;
+   const cap_value_t cap_val = CAP_SYS_ADMIN;
+
+   if (!CAP_IS_SUPPORTED(CAP_SETFCAP)) {
+   perror("cap_get_flag");
+   return false;
+   }
+   caps = cap_get_proc();
+   if (!caps) {
+   perror("cap_get_proc");
+   return false;
+   }
+   if (cap_get_flag(caps, cap_val, CAP_EFFECTIVE, &sysadmin))
+   perror("cap_get_flag");
+   if (cap_free(caps) == -1)
+   perror("cap_free");
+   return (sysadmin == CAP_SET);
+}
+
+static int set_admin(bool admin)
+{
+   cap_t caps;
+   const cap_value_t cap_val = CAP_SYS_ADMIN;
+   int ret = -1;
+
+   caps = cap_get_proc();
+   if (!caps) {
+   perror("cap_get_proc");
+   return -1;
+   }
+   if (cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_val,
+   admin ? CAP_SET : CAP_CLEAR)) {
+   perror("cap_set_flag");
+   goto out;
+   }
+   if (cap_set_proc(caps)) {
+   perror("cap_set_proc");
+   goto out;
+   }
+   ret = 0;
+out:
+   if (cap_free(caps) == -1)
+   perror("cap_free");
+   return ret;
+}
+
 static int do_test(bool unpriv, unsigned int from, unsigned int to)
 {
int i, passes = 0, errors = 0;
@@ -4508,11 +4558,19 @@ static int do_test(bool unpriv, unsigned int from, 
unsigned int to)
/* Program types that are not supported by non-root we
 * skip right away.
 */
-   if (unpriv && test->prog_type)
-   continue;
+   if (!test->prog_type) {
+   if (!unpriv)
+   set_admin(false);
+   printf("#%d/u %s ", i, test->descr);
+   do_test_single(test, true, &passes, &errors);
+   if (!unpriv)
+   set_admin(true);
+   }
 
-   printf("#%d %s ", i, test->descr);
-   do_test_single(test, unpriv, &passes, &errors);
+   if (!unpriv) {
+   printf("#%d/p %s ", i, test->descr);
+   do_test_single(test, false, &passes, &errors);
+   }
}
 
printf("Summary: %d PASSED, %d FAILED\n", passes, errors);
@@ -4524,7 +4582,7 @@ int main(int argc, char **argv)
struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
struct rlimit rlim = { 1 << 20, 1 << 20 };
unsigned int from = 0, to = ARRAY_SIZE(tests);
-   bool unpriv = geteuid() != 0;
+   bool unpriv = !is_admin();
 
if (argc == 3) {
unsigned int l = atoi(argv[argc - 2]);
-- 
2.11.0
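
The run selection in the new do_test() loop can be summarized with the
small sketch below. runs_for_test() is a hypothetical helper that does
not exist in the selftest; it only mirrors the two conditions in the
patch: tests with the default program type get a "/u" run, and a "/p"
run is added when the runner holds CAP_SYS_ADMIN.

```c
#include <stdbool.h>

/*
 * Count how many runs one test produces.
 * prog_type == 0 stands for the default program type, which the patch
 * treats as loadable without privileges; run_as_admin says whether the
 * test runner itself holds CAP_SYS_ADMIN.
 */
static int runs_for_test(int prog_type, bool run_as_admin)
{
	int runs = 0;

	if (prog_type == 0)
		runs++;		/* "/u" run, always attempted */
	if (run_as_admin)
		runs++;		/* "/p" run, privileged runner only */
	return runs;
}
```

Assuming the 240 figure counts each test once when run as root, the
jump to 364 would correspond to 124 tests that use the default program
type and are therefore run twice.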



[PATCH net-next v1 5/7] bpf: Simplify bpf_load_program() error handling in the library

2017-02-05 Thread Mickaël Salaün
Do not call bpf(2) a second time when a program load fails.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Arnaldo Carvalho de Melo 
Cc: Daniel Borkmann 
Cc: Wang Nan 
---
 tools/lib/bpf/bpf.c | 18 ++
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 3ddb58a36d3c..fda3f494f1cd 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -73,7 +73,6 @@ int bpf_load_program(enum bpf_prog_type type, struct bpf_insn 
*insns,
 size_t insns_cnt, char *license,
 __u32 kern_version, char *log_buf, size_t log_buf_sz)
 {
-   int fd;
union bpf_attr attr;
 
bzero(&attr, sizeof(attr));
@@ -81,20 +80,15 @@ int bpf_load_program(enum bpf_prog_type type, struct 
bpf_insn *insns,
attr.insn_cnt = (__u32)insns_cnt;
attr.insns = ptr_to_u64(insns);
attr.license = ptr_to_u64(license);
-   attr.log_buf = ptr_to_u64(NULL);
-   attr.log_size = 0;
-   attr.log_level = 0;
+   attr.log_buf = ptr_to_u64(log_buf);
+   attr.log_size = log_buf_sz;
attr.kern_version = kern_version;
 
-   fd = sys_bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
-   if (fd >= 0 || !log_buf || !log_buf_sz)
-   return fd;
+   if (log_buf && log_buf_sz > 0) {
+   attr.log_level = 1;
+   log_buf[0] = 0;
+   }
 
-   /* Try again with log */
-   attr.log_buf = ptr_to_u64(log_buf);
-   attr.log_size = log_buf_sz;
-   attr.log_level = 1;
-   log_buf[0] = 0;
return sys_bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
 }
 
-- 
2.11.0
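
The resulting single-pass setup of the log fields can be shown in
isolation. The sketch below is not the library code: struct load_attr is
a hypothetical subset of the attributes and setup_log() is a made-up
helper, used only to show that the log is requested up front when a
buffer is supplied, instead of on a retry.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the load attributes; not the real union bpf_attr. */
struct load_attr {
	uint64_t log_buf;
	uint32_t log_size;
	uint32_t log_level;
};

/* Fill in the log fields once, before the single load attempt. */
static void setup_log(struct load_attr *attr, char *log_buf, size_t log_buf_sz)
{
	attr->log_buf = (uint64_t)(uintptr_t)log_buf;
	attr->log_size = (uint32_t)log_buf_sz;
	attr->log_level = 0;

	if (log_buf && log_buf_sz > 0) {
		attr->log_level = 1;	/* ask the verifier for a log */
		log_buf[0] = 0;		/* start from an empty string */
	}
}
```

The trade-off is that a successful load now also pays for log
generation whenever the caller passes a buffer, in exchange for never
issuing the syscall twice.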



[PATCH] net: intel: ixgb: use new api ethtool_{get|set}_link_ksettings

2017-02-05 Thread Philippe Reynes
The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone could test this patch.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/intel/ixgb/ixgb_ethtool.c |   39 ++--
 1 files changed, 23 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgb/ixgb_ethtool.c 
b/drivers/net/ethernet/intel/ixgb/ixgb_ethtool.c
index e5d7255..d10a0d2 100644
--- a/drivers/net/ethernet/intel/ixgb/ixgb_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgb/ixgb_ethtool.c
@@ -94,24 +94,30 @@ struct ixgb_stats {
 #define IXGB_STATS_LEN ARRAY_SIZE(ixgb_gstrings_stats)
 
 static int
-ixgb_get_settings(struct net_device *netdev, struct ethtool_cmd *ecmd)
+ixgb_get_link_ksettings(struct net_device *netdev,
+   struct ethtool_link_ksettings *cmd)
 {
struct ixgb_adapter *adapter = netdev_priv(netdev);
 
-   ecmd->supported = (SUPPORTED_10000baseT_Full | SUPPORTED_FIBRE);
-   ecmd->advertising = (ADVERTISED_10000baseT_Full | ADVERTISED_FIBRE);
-   ecmd->port = PORT_FIBRE;
-   ecmd->transceiver = XCVR_EXTERNAL;
+   ethtool_link_ksettings_zero_link_mode(cmd, supported);
+   ethtool_link_ksettings_add_link_mode(cmd, supported, 10000baseT_Full);
+   ethtool_link_ksettings_add_link_mode(cmd, supported, FIBRE);
+
+   ethtool_link_ksettings_zero_link_mode(cmd, advertising);
+   ethtool_link_ksettings_add_link_mode(cmd, advertising, 10000baseT_Full);
+   ethtool_link_ksettings_add_link_mode(cmd, advertising, FIBRE);
+
+   cmd->base.port = PORT_FIBRE;
 
if (netif_carrier_ok(adapter->netdev)) {
-   ethtool_cmd_speed_set(ecmd, SPEED_10000);
-   ecmd->duplex = DUPLEX_FULL;
+   cmd->base.speed = SPEED_10000;
+   cmd->base.duplex = DUPLEX_FULL;
} else {
-   ethtool_cmd_speed_set(ecmd, SPEED_UNKNOWN);
-   ecmd->duplex = DUPLEX_UNKNOWN;
+   cmd->base.speed = SPEED_UNKNOWN;
+   cmd->base.duplex = DUPLEX_UNKNOWN;
}
 
-   ecmd->autoneg = AUTONEG_DISABLE;
+   cmd->base.autoneg = AUTONEG_DISABLE;
return 0;
 }
 
@@ -126,13 +132,14 @@ void ixgb_set_speed_duplex(struct net_device *netdev)
 }
 
 static int
-ixgb_set_settings(struct net_device *netdev, struct ethtool_cmd *ecmd)
+ixgb_set_link_ksettings(struct net_device *netdev,
+   const struct ethtool_link_ksettings *cmd)
 {
struct ixgb_adapter *adapter = netdev_priv(netdev);
-   u32 speed = ethtool_cmd_speed(ecmd);
+   u32 speed = cmd->base.speed;
 
-   if (ecmd->autoneg == AUTONEG_ENABLE ||
-   (speed + ecmd->duplex != SPEED_10000 + DUPLEX_FULL))
+   if (cmd->base.autoneg == AUTONEG_ENABLE ||
+   (speed + cmd->base.duplex != SPEED_10000 + DUPLEX_FULL))
return -EINVAL;
 
if (netif_running(adapter->netdev)) {
@@ -630,8 +637,6 @@ void ixgb_set_speed_duplex(struct net_device *netdev)
 }
 
 static const struct ethtool_ops ixgb_ethtool_ops = {
-   .get_settings = ixgb_get_settings,
-   .set_settings = ixgb_set_settings,
.get_drvinfo = ixgb_get_drvinfo,
.get_regs_len = ixgb_get_regs_len,
.get_regs = ixgb_get_regs,
@@ -649,6 +654,8 @@ void ixgb_set_speed_duplex(struct net_device *netdev)
.set_phys_id = ixgb_set_phys_id,
.get_sset_count = ixgb_get_sset_count,
.get_ethtool_stats = ixgb_get_ethtool_stats,
+   .get_link_ksettings = ixgb_get_link_ksettings,
+   .set_link_ksettings = ixgb_set_link_ksettings,
 };
 
 void ixgb_set_ethtool_ops(struct net_device *netdev)
-- 
1.7.4.4
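
As background on why the conversion replaces plain u32 assignments with
the zero/add helpers: the ksettings API stores link modes in bitmaps, so
mode numbers are not limited to the 32 bits of the legacy masks. The
userspace sketch below illustrates the idea only; struct link_modes,
SKETCH_NBITS and the link_modes_*() functions are hypothetical and are
not the kernel macros.

```c
#include <limits.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_NBITS	64	/* hypothetical number of link mode bits */
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_WORDS	((SKETCH_NBITS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical link mode bitmap, wide enough for SKETCH_NBITS modes. */
struct link_modes {
	unsigned long bits[SKETCH_WORDS];
};

/* Clear every mode, like the "zero" step at the top of a getter. */
static void link_modes_zero(struct link_modes *m)
{
	memset(m->bits, 0, sizeof(m->bits));
}

/* Set one mode bit; works for bit numbers of 32 and above as well. */
static void link_modes_add(struct link_modes *m, unsigned int bit)
{
	m->bits[bit / BITS_PER_WORD] |= 1UL << (bit % BITS_PER_WORD);
}

/* Report whether a mode bit is set. */
static bool link_modes_test(const struct link_modes *m, unsigned int bit)
{
	return m->bits[bit / BITS_PER_WORD] & (1UL << (bit % BITS_PER_WORD));
}
```

A getter written against such a bitmap zeroes it first and then adds
each supported mode, which is the same shape as the converted
ixgb_get_link_ksettings() above.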



[PATCH] net: intel: igbvf: use new api ethtool_{get|set}_link_ksettings

2017-02-05 Thread Philippe Reynes
The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone could test this patch.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/intel/igbvf/ethtool.c |   38 ++--
 1 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/intel/igbvf/ethtool.c 
b/drivers/net/ethernet/intel/igbvf/ethtool.c
index 8dea1b1..34faa11 100644
--- a/drivers/net/ethernet/intel/igbvf/ethtool.c
+++ b/drivers/net/ethernet/intel/igbvf/ethtool.c
@@ -71,45 +71,45 @@ struct igbvf_stats {
 
 #define IGBVF_TEST_LEN ARRAY_SIZE(igbvf_gstrings_test)
 
-static int igbvf_get_settings(struct net_device *netdev,
- struct ethtool_cmd *ecmd)
+static int igbvf_get_link_ksettings(struct net_device *netdev,
+   struct ethtool_link_ksettings *cmd)
 {
struct igbvf_adapter *adapter = netdev_priv(netdev);
struct e1000_hw *hw = &adapter->hw;
u32 status;
 
-   ecmd->supported   = SUPPORTED_1000baseT_Full;
+   ethtool_link_ksettings_zero_link_mode(cmd, supported);
+   ethtool_link_ksettings_add_link_mode(cmd, supported, 1000baseT_Full);
+   ethtool_link_ksettings_zero_link_mode(cmd, advertising);
+   ethtool_link_ksettings_add_link_mode(cmd, advertising, 1000baseT_Full);
 
-   ecmd->advertising = ADVERTISED_1000baseT_Full;
-
-   ecmd->port = -1;
-   ecmd->transceiver = XCVR_DUMMY1;
+   cmd->base.port = -1;
 
status = er32(STATUS);
if (status & E1000_STATUS_LU) {
if (status & E1000_STATUS_SPEED_1000)
-   ethtool_cmd_speed_set(ecmd, SPEED_1000);
+   cmd->base.speed = SPEED_1000;
else if (status & E1000_STATUS_SPEED_100)
-   ethtool_cmd_speed_set(ecmd, SPEED_100);
+   cmd->base.speed = SPEED_100;
else
-   ethtool_cmd_speed_set(ecmd, SPEED_10);
+   cmd->base.speed = SPEED_10;
 
if (status & E1000_STATUS_FD)
-   ecmd->duplex = DUPLEX_FULL;
+   cmd->base.duplex = DUPLEX_FULL;
else
-   ecmd->duplex = DUPLEX_HALF;
+   cmd->base.duplex = DUPLEX_HALF;
} else {
-   ethtool_cmd_speed_set(ecmd, SPEED_UNKNOWN);
-   ecmd->duplex = DUPLEX_UNKNOWN;
+   cmd->base.speed = SPEED_UNKNOWN;
+   cmd->base.duplex = DUPLEX_UNKNOWN;
}
 
-   ecmd->autoneg = AUTONEG_DISABLE;
+   cmd->base.autoneg = AUTONEG_DISABLE;
 
return 0;
 }
 
-static int igbvf_set_settings(struct net_device *netdev,
- struct ethtool_cmd *ecmd)
+static int igbvf_set_link_ksettings(struct net_device *netdev,
+   const struct ethtool_link_ksettings *cmd)
 {
return -EOPNOTSUPP;
 }
@@ -443,8 +443,6 @@ static void igbvf_get_strings(struct net_device *netdev, 
u32 stringset,
 }
 
 static const struct ethtool_ops igbvf_ethtool_ops = {
-   .get_settings   = igbvf_get_settings,
-   .set_settings   = igbvf_set_settings,
.get_drvinfo= igbvf_get_drvinfo,
.get_regs_len   = igbvf_get_regs_len,
.get_regs   = igbvf_get_regs,
@@ -467,6 +465,8 @@ static void igbvf_get_strings(struct net_device *netdev, 
u32 stringset,
.get_ethtool_stats  = igbvf_get_ethtool_stats,
.get_coalesce   = igbvf_get_coalesce,
.set_coalesce   = igbvf_set_coalesce,
+   .get_link_ksettings = igbvf_get_link_ksettings,
+   .set_link_ksettings = igbvf_set_link_ksettings,
 };
 
 void igbvf_set_ethtool_ops(struct net_device *netdev)
-- 
1.7.4.4



Re: [RFC 2/2] net: emac: add support for device-tree based PHY discovery and setup

2017-02-05 Thread Florian Fainelli
On 02/05/17 at 14:25, Christian Lamparter wrote:
> From: Christian Lamparter 
> 
> This patch adds glue-code that allows the EMAC driver to interface
> with the existing dt-supported PHYs in drivers/net/phy.
> 
> Currently, the emac driver maintains a small library of
> supported phys in a private phy.c file located in the driver's
> directory.
> 
> Support is mostly limited to single ethernet transceivers like the:
> CIS8201, BCM5248, ET1011C, Marvell 88E and 88E1112, AR8035.
> However, routers like the Netgear WNDR4700 and Cisco Meraki MX60(W)
> have a 5-port switch (QCA8327N) attached to the MDIO of the EMAC.
> The switch chip has already a proper phy-driver (qca8k) that uses
> the generic phy library.

Technically, it's a mdio_device in the upstream kernel that registers a
switch with DSA (and a PHY device in the OpenWrt/LEDE downstream
kernel). If your goal is to specifically support that device you should
consider making the EMAC interface with a fixed link PHY to properly
initialize the EMAC <=> CPU port of the switch link, and then declare
the qca8k device as a child MDIO device (not a PHY), similar to what is
done in arch/arm/boot/dts/vf610-zii-dev-rev-b.dts for instance.

> 
> Signed-off-by: Christian Lamparter 
> ---
>  drivers/net/ethernet/ibm/emac/core.c | 188 
> +++
>  drivers/net/ethernet/ibm/emac/core.h |   4 +
>  2 files changed, 192 insertions(+)
> 
> diff --git a/drivers/net/ethernet/ibm/emac/core.c 
> b/drivers/net/ethernet/ibm/emac/core.c
> index 6ead2335a169..ea9234cdb227 100644
> --- a/drivers/net/ethernet/ibm/emac/core.c
> +++ b/drivers/net/ethernet/ibm/emac/core.c
> @@ -42,6 +42,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include 
> @@ -2420,6 +2421,179 @@ static int emac_read_uint_prop(struct device_node 
> *np, const char *name,
>   return 0;
>  }
>  
> +static void emac_adjust_link(struct net_device *ndev)
> +{
> + struct emac_instance *dev = netdev_priv(ndev);
> + struct phy_device *phy = dev->phy_dev;
> +
> + mutex_lock(&dev->link_lock);
> + dev->phy.autoneg = phy->autoneg;
> + dev->phy.speed = phy->speed;
> + dev->phy.duplex = phy->duplex;
> + dev->phy.pause = phy->pause;
> + dev->phy.asym_pause = phy->asym_pause;
> + dev->phy.advertising = phy->advertising;
> + mutex_unlock(&dev->link_lock);

PHYLIB already calls this function with the phy device's mutex held, is this
really needed here?

> +}
> +
> +static int emac_mii_bus_read(struct mii_bus *bus, int addr, int regnum)
> +{
> + return emac_mdio_read(bus->priv, addr, regnum);
> +}
> +
> +static int emac_mii_bus_write(struct mii_bus *bus, int addr, int regnum,
> +   u16 val)
> +{
> + emac_mdio_write(bus->priv, addr, regnum, val);
> + return 0;
> +}
> +
> +static int emac_mii_bus_reset(struct mii_bus *bus)
> +{
> + struct emac_instance *dev = netdev_priv(bus->priv);
> +
> + emac_mii_reset_phy(&dev->phy);

This seems wrong, emac_mii_reset_phy() does a BMCR software reset, which
PHYLIB is already going to do (phy_init_hw), yet you do this here at the
MDIO bus level towards a specific PHY, whereas this should be affecting
the MDIO bus itself (and/or *all* PHY child devices for quirks).

> + return 0;
> +}
> +
> +static int emac_mdio_probe(struct emac_instance *dev)
> +{
> + struct device_node *mii_np;
> + struct mii_bus *bus;
> + int res;
> +
> + bus = mdiobus_alloc();
> + if (!bus)
> + return -ENOMEM;
> +
> + mii_np = of_get_child_by_name(dev->ofdev->dev.of_node, "mdio");
> + if (!mii_np) {
> + dev_err(&dev->ndev->dev, "no mdio definition found.");
> + return -ENODEV;
> + }
> +
> + if (!of_device_is_available(mii_np))
> + return 0;
> +
> + bus->priv = dev->ndev;
> + bus->parent = dev->ndev->dev.parent;
> + bus->name = "emac_mdio";
> + bus->read = &emac_mii_bus_read;
> + bus->write = &emac_mii_bus_write;
> + bus->reset = &emac_mii_bus_reset;
> +
> + snprintf(bus->id, MII_BUS_ID_SIZE, "%s", bus->name);

You should pick a more unique name here, if you ever have a second
instance it would just clash with the previous one.
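
As a rough illustration of what a more unique id could look like, the
userspace sketch below appends a per-instance value to the name. It is
only a sketch: make_bus_id() and SKETCH_BUS_ID_SIZE are hypothetical
(the latter stands in for MII_BUS_ID_SIZE), and using the register base
address as the distinguishing value is an assumption, not a requirement.

```c
#include <stdio.h>

/* Hypothetical buffer size; stands in for the kernel's MII_BUS_ID_SIZE. */
#define SKETCH_BUS_ID_SIZE 61

/*
 * Build "<name>-<base in hex>" so that two instances with different
 * register bases end up with different bus ids.
 * Returns what snprintf() returns.
 */
static int make_bus_id(char *id, size_t len, const char *name,
		       unsigned long base)
{
	return snprintf(id, len, "%s-%lx", name, base);
}
```

With the base address from the binding example, this would give
"emac_mdio-ef600c00" for that instance, and a second EMAC at another
address would no longer clash with it.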

> +
> + res = of_mdiobus_register(bus, mii_np);
> + if (res) {
> + dev_err(&dev->ndev->dev, "cannot register MDIO bus %s\n",
> + bus->name);
> + mdiobus_free(bus);
> + }
> +
> + dev->mii_bus = bus;
> + return res;
> +}
> +
> +static void emac_mdio_cleanup(struct emac_instance *dev)
> +{
> + if (dev->mii_bus) {
> + if (dev->mii_bus->state == MDIOBUS_REGISTERED)
> + mdiobus_unregister(dev->mii_bus);

If you need to make that kind of check, why not separate how the mdio
bus structure's lifecycle is managed? This seems to be there to avoid hitting
the BUG_ON() in mdiobus_unregister().

> + mdiobus_free(dev->mii_bus);
> + 

Re: [net-next PATCH v2 0/5] XDP adjust head support for virtio

2017-02-05 Thread David Miller
From: John Fastabend 
Date: Thu, 02 Feb 2017 19:14:05 -0800

> This series adds adjust head support for virtio. The following is my
> test setup. I use qemu + virtio as follows,
> 
> ./x86_64-softmmu/qemu-system-x86_64 \
>   -hda /var/lib/libvirt/images/Fedora-test0.img \
>   -m 4096  -enable-kvm -smp 2 -netdev tap,id=hn0,queues=4,vhost=on \
>   -device 
> virtio-net-pci,netdev=hn0,mq=on,guest_tso4=off,guest_tso6=off,guest_ecn=off,guest_ufo=off,vectors=9
> 
> In order to use XDP with virtio until LRO is supported TSO must be
> turned off in the host. The important fields in the above command line
> are the following,
> 
>   guest_tso4=off,guest_tso6=off,guest_ecn=off,guest_ufo=off
> 
> Also note it is possible to consume more queues than can be supported
> because when XDP is enabled for retransmit XDP attempts to use a queue
> per cpu. My standard queue count is 'queues=4'.
> 
> After loading the VM I run the relevant XDP test programs in,
> 
>   ./samples/bpf
> 
> For this series I tested xdp1, xdp2, and xdp_tx_iptunnel. I usually test
> with iperf (-d option to get bidirectional traffic), ping, and pktgen.
> I also have a modified xdp1 that returns XDP_PASS on any packet to ensure
> the normal traffic path to the stack continues to work with XDP loaded.
> 
> It would be great to automate this soon. At the moment I do it by hand
> which is starting to get tedious.
> 
> v2: original series dropped trace points after merge.

Michael, I just want to apply this right now.

I don't think haggling over whether to allocate the adjust_head area
unconditionally or not is a blocker for this series going in.  That
can be addressed trivially in a follow-on patch.

We want these new reset paths tested as much as possible and each day
we delay this series is detrimental towards that goal.

Thanks.


Re: [RFC 1/2] dt: emac: document device-tree based phy discovery and setup

2017-02-05 Thread Florian Fainelli
On 02/05/17 at 14:25, Christian Lamparter wrote:
> This patch adds documentation for a new "phy-handle" property
> and "mdio" sub-node. These allow the enumeration of PHYs which
> are supported by the phy library under drivers/net/phy.
> 
> The EMAC ethernet controller in IBM and AMCC 4xx chips is
> currently stuck with a few privately defined phy
> implementations. It has no support for PHYs which
> are supported by the generic phylib.
> 
> Signed-off-by: Christian Lamparter 
> ---
>  .../devicetree/bindings/powerpc/4xx/emac.txt   | 60 
> +-
>  1 file changed, 58 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/devicetree/bindings/powerpc/4xx/emac.txt 
> b/Documentation/devicetree/bindings/powerpc/4xx/emac.txt
> index 712baf6c3e24..0572d053c35a 100644
> --- a/Documentation/devicetree/bindings/powerpc/4xx/emac.txt
> +++ b/Documentation/devicetree/bindings/powerpc/4xx/emac.txt
> @@ -71,6 +71,8 @@
> For Axon it can be absent, though my current driver
> doesn't handle phy-address yet so for now, keep
> 0x00ff in it.
> +- phy-handle : See net/ethernet.txt file; used to describe
> +   configurations where a external PHY is used.
>  - rx-fifo-size-gige : 1 cell, Rx fifo size in bytes for 1000 Mb/sec
> operations (if absent the value is the same as
> rx-fifo-size).  For Axon, either absent or 2048.
> @@ -82,7 +84,18 @@
>  - tah-channel   : 1 cell, optional. If appropriate, channel used on 
> the
> TAH engine.
>  
> -Example:
> +- mdio subnode   : When the EMAC has a phy connected to its local
> +   mdio, which us supported by the kernel's network
> +   PHY library in drivers/net/phy, there must be device
> +   tree subnode with the following required properties:
> + - #address-cells: Must be <1>.
> + - #size-cells: Must be <0>.
> +
> +   For each phy on the mdio bus, there must be a node
> +   with the following fields:
> + - reg: phy id used to communicate to phy.
> + - device_type: Must be "ethernet-phy".

Just provide a reference to
Documentation/devicetree/bindings/net/phy.txt and
Documentation/devicetree/bindings/net/ethernet.txt here. device_type is
not required.

> +Examples:
>  
>   EMAC0: ethernet@4800 {
>   device_type = "network";
> @@ -104,6 +117,50 @@
>   zmii-channel = <0>;
>   };
>  
> + EMAC1: ethernet@ef600c00 {
> + device_type = "network";
> + compatible = "ibm,emac-apm821xx", "ibm,emac4sync";
> + interrupt-parent = <>;
> + interrupts = <0 1>;
> + #interrupt-cells = <1>;
> + #address-cells = <0>;
> + #size-cells = <0>;
> + interrupt-map = <0  0x10 IRQ_TYPE_LEVEL_HIGH /* Status */
> +  1  0x14 IRQ_TYPE_LEVEL_HIGH /* Wake */>;
> + reg = <0xef600c00 0x00c4>;
> + local-mac-address = []; /* Filled in by U-Boot */
> + mal-device = <>;
> + mal-tx-channel = <0>;
> + mal-rx-channel = <0>;
> + cell-index = <0>;
> + max-frame-size = <9000>;
> + rx-fifo-size = <16384>;
> + tx-fifo-size = <2048>;
> + fifo-entry-size = <10>;
> + phy-mode = "rgmii";
> + phy-map = <0x>;

If you have a proper mdio subnode, this property becomes irrelevant and
should be unused.

> + phy-handle = <>;
> + rgmii-device = <>;
> + rgmii-channel = <0>;
> + tah-device = <>;
> + tah-channel = <0>;
> + has-inverted-stacr-oc;
> + has-new-stacr-staopc;
> +
> + mdio {
> + #address-cells = <1>;
> + #size-cells = <0>;
> +
> + phy0: ethernet-phy@0 {
> + device_type = "ethernet-phy";
> + reg = <0>;
> +
> + qca,ar8327-initvals = <
> + 0x0010 0x4000>;
> + };
> + };
> +
> +
>ii) McMAL node
>  
>  Required properties:
> @@ -145,4 +202,3 @@
>  - revision   : as provided by the RGMII new version register if
>  available.
>  For Axon: 0x012a
> -
> 


-- 
Florian


Re: [PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted.

2017-02-05 Thread David Miller
From: Jarno Rajahalme 
Date: Thu,  2 Feb 2017 17:10:00 -0800

> This does not match either of the conntrack tuples above.  Normally
> this does not matter, as the conntrack lookup was already done using
> the tuple (B,A), but if the current packet does not match any flow in
> the OVS datapath, the packet is sent to userspace via an upcall,
> during which the packet's skb is freed, and the conntrack entry
> pointer in the skb is lost.

This is the real bug.

If the metadata for a packet is important, as it obviously is here,
then upcalls should preserve that metadata across the upcall.  This
is exactly how NF_QUEUE handles this problem and so should OVS.


[RFC 1/2] dt: emac: document device-tree based phy discovery and setup

2017-02-05 Thread Christian Lamparter
This patch adds documentation for a new "phy-handle" property
and "mdio" sub-node. These allow the enumeration of PHYs which
are supported by the phy library under drivers/net/phy.

The EMAC ethernet controller in IBM and AMCC 4xx chips is
currently stuck with a few privately defined phy
implementations. It has no support for PHYs which
are supported by the generic phylib.

Signed-off-by: Christian Lamparter 
---
 .../devicetree/bindings/powerpc/4xx/emac.txt   | 60 +-
 1 file changed, 58 insertions(+), 2 deletions(-)

diff --git a/Documentation/devicetree/bindings/powerpc/4xx/emac.txt b/Documentation/devicetree/bindings/powerpc/4xx/emac.txt
index 712baf6c3e24..0572d053c35a 100644
--- a/Documentation/devicetree/bindings/powerpc/4xx/emac.txt
+++ b/Documentation/devicetree/bindings/powerpc/4xx/emac.txt
@@ -71,6 +71,8 @@
  For Axon it can be absent, though my current driver
  doesn't handle phy-address yet so for now, keep
  0x00ff in it.
+- phy-handle   : See net/ethernet.txt file; used to describe
+ configurations where an external PHY is used.
 - rx-fifo-size-gige : 1 cell, Rx fifo size in bytes for 1000 Mb/sec
  operations (if absent the value is the same as
  rx-fifo-size).  For Axon, either absent or 2048.
@@ -82,7 +84,18 @@
 - tah-channel   : 1 cell, optional. If appropriate, channel used on the
  TAH engine.
 
-Example:
+- mdio subnode : When the EMAC has a phy connected to its local
+ mdio, which is supported by the kernel's network
+ PHY library in drivers/net/phy, there must be a device
+ tree subnode with the following required properties:
+   - #address-cells: Must be <1>.
+   - #size-cells: Must be <0>.
+
+ For each phy on the mdio bus, there must be a node
+ with the following fields:
+   - reg: phy id used to communicate to phy.
+   - device_type: Must be "ethernet-phy".
+Examples:
 
EMAC0: ethernet@4800 {
device_type = "network";
@@ -104,6 +117,50 @@
zmii-channel = <0>;
};
 
+   EMAC1: ethernet@ef600c00 {
+   device_type = "network";
+   compatible = "ibm,emac-apm821xx", "ibm,emac4sync";
+   interrupt-parent = <>;
+   interrupts = <0 1>;
+   #interrupt-cells = <1>;
+   #address-cells = <0>;
+   #size-cells = <0>;
+   interrupt-map = <0  0x10 IRQ_TYPE_LEVEL_HIGH /* Status */
+1  0x14 IRQ_TYPE_LEVEL_HIGH /* Wake */>;
+   reg = <0xef600c00 0x00c4>;
+   local-mac-address = []; /* Filled in by U-Boot */
+   mal-device = <>;
+   mal-tx-channel = <0>;
+   mal-rx-channel = <0>;
+   cell-index = <0>;
+   max-frame-size = <9000>;
+   rx-fifo-size = <16384>;
+   tx-fifo-size = <2048>;
+   fifo-entry-size = <10>;
+   phy-mode = "rgmii";
+   phy-map = <0x>;
+   phy-handle = <>;
+   rgmii-device = <>;
+   rgmii-channel = <0>;
+   tah-device = <>;
+   tah-channel = <0>;
+   has-inverted-stacr-oc;
+   has-new-stacr-staopc;
+
+   mdio {
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   phy0: ethernet-phy@0 {
+   device_type = "ethernet-phy";
+   reg = <0>;
+
+   qca,ar8327-initvals = <
+   0x0010 0x4000>;
+   };
+   };
+
+
   ii) McMAL node
 
 Required properties:
@@ -145,4 +202,3 @@
 - revision   : as provided by the RGMII new version register if
   available.
   For Axon: 0x012a
-
-- 
2.11.0



[RFC 2/2] net: emac: add support for device-tree based PHY discovery and setup

2017-02-05 Thread Christian Lamparter
From: Christian Lamparter 

This patch adds glue-code that allows the EMAC driver to interface
with the existing dt-supported PHYs in drivers/net/phy.

Currently, the emac driver maintains a small library of
supported PHYs in a private phy.c file located in the driver's
directory.

The support is limited mostly to single ethernet transceivers like the
CIS8201, BCM5248, ET1011C, Marvell 88E and 88E1112, AR8035.
However, routers like the Netgear WNDR4700 and Cisco Meraki MX60(W)
have a 5-port switch (QCA8327N) attached to the MDIO of the EMAC.
The switch chip already has a proper PHY driver (qca8k) that uses
the generic phy library.

Signed-off-by: Christian Lamparter 
---
 drivers/net/ethernet/ibm/emac/core.c | 188 +++
 drivers/net/ethernet/ibm/emac/core.h |   4 +
 2 files changed, 192 insertions(+)

diff --git a/drivers/net/ethernet/ibm/emac/core.c b/drivers/net/ethernet/ibm/emac/core.c
index 6ead2335a169..ea9234cdb227 100644
--- a/drivers/net/ethernet/ibm/emac/core.c
+++ b/drivers/net/ethernet/ibm/emac/core.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -2420,6 +2421,179 @@ static int emac_read_uint_prop(struct device_node *np, 
const char *name,
return 0;
 }
 
+static void emac_adjust_link(struct net_device *ndev)
+{
+   struct emac_instance *dev = netdev_priv(ndev);
+   struct phy_device *phy = dev->phy_dev;
+
+   mutex_lock(&dev->link_lock);
+   dev->phy.autoneg = phy->autoneg;
+   dev->phy.speed = phy->speed;
+   dev->phy.duplex = phy->duplex;
+   dev->phy.pause = phy->pause;
+   dev->phy.asym_pause = phy->asym_pause;
+   dev->phy.advertising = phy->advertising;
+   mutex_unlock(&dev->link_lock);
+}
+
+static int emac_mii_bus_read(struct mii_bus *bus, int addr, int regnum)
+{
+   return emac_mdio_read(bus->priv, addr, regnum);
+}
+
+static int emac_mii_bus_write(struct mii_bus *bus, int addr, int regnum,
+ u16 val)
+{
+   emac_mdio_write(bus->priv, addr, regnum, val);
+   return 0;
+}
+
+static int emac_mii_bus_reset(struct mii_bus *bus)
+{
+   struct emac_instance *dev = netdev_priv(bus->priv);
+
+   emac_mii_reset_phy(&dev->phy);
+   return 0;
+}
+
+static int emac_mdio_probe(struct emac_instance *dev)
+{
+   struct device_node *mii_np;
+   struct mii_bus *bus;
+   int res;
+
+   bus = mdiobus_alloc();
+   if (!bus)
+   return -ENOMEM;
+
+   mii_np = of_get_child_by_name(dev->ofdev->dev.of_node, "mdio");
+   if (!mii_np) {
+   dev_err(&dev->ndev->dev, "no mdio definition found.");
+   return -ENODEV;
+   }
+
+   if (!of_device_is_available(mii_np))
+   return 0;
+
+   bus->priv = dev->ndev;
+   bus->parent = dev->ndev->dev.parent;
+   bus->name = "emac_mdio";
+   bus->read = &emac_mii_bus_read;
+   bus->write = &emac_mii_bus_write;
+   bus->reset = &emac_mii_bus_reset;
+
+   snprintf(bus->id, MII_BUS_ID_SIZE, "%s", bus->name);
+
+   res = of_mdiobus_register(bus, mii_np);
+   if (res) {
+   dev_err(&dev->ndev->dev, "cannot register MDIO bus %s\n",
+   bus->name);
+   mdiobus_free(bus);
+   }
+
+   dev->mii_bus = bus;
+   return res;
+}
+
+static void emac_mdio_cleanup(struct emac_instance *dev)
+{
+   if (dev->mii_bus) {
+   if (dev->mii_bus->state == MDIOBUS_REGISTERED)
+   mdiobus_unregister(dev->mii_bus);
+   mdiobus_free(dev->mii_bus);
+   dev->mii_bus = NULL;
+   kfree(dev->phy.def);
+   }
+}
+
+static int stub_setup_aneg(struct mii_phy *phy, u32 advertise)
+{
+   return 0;
+}
+
+static int stub_setup_forced(struct mii_phy *phy, int speed, int fd)
+{
+   return 0;
+}
+
+static int stub_poll_link(struct mii_phy *phy)
+{
+   struct net_device *ndev = phy->dev;
+   struct emac_instance *dev = netdev_priv(ndev);
+
+   return dev->opened;
+}
+
+static int stub_read_link(struct mii_phy *phy)
+{
+   struct net_device *ndev = phy->dev;
+   struct emac_instance *dev = netdev_priv(ndev);
+
+   phy_start(dev->phy_dev);
+   return 0;
+}
+
+static const struct mii_phy_ops emac_stub_phy_ops = {
+   .setup_aneg = stub_setup_aneg,
+   .setup_forced   = stub_setup_forced,
+   .poll_link  = stub_poll_link,
+   .read_link  = stub_read_link,
+};
+
+static int emac_probe_dt_phy(struct emac_instance *dev)
+{
+   struct device_node *np = dev->ofdev->dev.of_node;
+   struct device_node *phy_handle;
+   struct net_device *ndev = dev->ndev;
+   int res;
+
+   phy_handle = of_parse_phandle(np, "phy-handle", 0);
+
+   if (phy_handle) {
+   res = emac_mdio_probe(dev);
+   if (res)
+   goto err_cleanup;
+
+   dev->phy.def = 

[PATCH net 2/2] net: phy: Fix PHY driver bind and unbind events

2017-02-05 Thread Florian Fainelli
The PHY library does not deal very well with bind and unbind events. The first
thing we would see is that we were not properly canceling the PHY state machine
workqueue, so we would be crashing while dereferencing phydev->drv since there
is no driver attached anymore.

Once we fix that, there are several things that did not quite work as expected:

- if the PHY state machine was running, we were not stopping it properly, and
  the state machine state would not be marked as such
- when we rebind the driver, nothing would happen, since we would not know which
  state we were in before the unbind

This patch takes the following approach:

- if the PHY was attached, and the state machine was running we would stop it,
  remember where we left off, and schedule the state machine for restart upon
  driver bind
- if the PHY was attached, but HALTED, we would leave it in that state, and not
  alter the state upon driver bind
- in all other cases (detached) we would keep the PHY in DOWN state waiting for
  a network driver to show up, and set PHY_READY on driver bind
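
The condition used by the diff below to decide whether the state machine was
running can be modelled in plain C. This is only a userspace sketch: the enum
is a hypothetical stand-in whose ordering is assumed to match the kernel's
enum phy_state of that era, and the sketch_* names do not exist in PHYLIB.

```c
#include <stdbool.h>

/* Hypothetical stand-in for enum phy_state. The ordering is an assumption;
 * what matters is that the "running" states sort after UP. */
enum sketch_phy_state {
	SK_PHY_DOWN = 0,
	SK_PHY_STARTING,
	SK_PHY_READY,
	SK_PHY_PENDING,
	SK_PHY_UP,
	SK_PHY_AN,
	SK_PHY_RUNNING,
	SK_PHY_NOLINK,
	SK_PHY_FORCING,
	SK_PHY_CHANGELINK,
	SK_PHY_HALTED,
	SK_PHY_RESUMING,
};

/* Same test as in the diff: past UP and not HALTED means the state machine
 * was running when the driver went away. */
bool sketch_should_resume(enum sketch_phy_state state)
{
	return state > SK_PHY_UP && state != SK_PHY_HALTED;
}

/* State recorded at unbind time, following the if/else in phy_remove():
 * a running state is preserved, everything else becomes DOWN. */
enum sketch_phy_state sketch_state_after_unbind(enum sketch_phy_state state)
{
	return sketch_should_resume(state) ? state : SK_PHY_DOWN;
}
```

Under this model a PHY that was RUNNING or NOLINK keeps its state across the
unbind, so the later bind can tell that it has to restart the state machine.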

Suggested-by: Russell King 
Signed-off-by: Florian Fainelli 
---
 drivers/net/phy/phy_device.c | 27 +--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 0d8f4d3847f6..05888bd17b97 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1709,6 +1709,7 @@ static int phy_probe(struct device *dev)
struct phy_device *phydev = to_phy_device(dev);
struct device_driver *drv = phydev->mdio.dev.driver;
struct phy_driver *phydrv = to_phy_driver(drv);
+   bool should_start = false;
int err = 0;
 
phydev->drv = phydrv;
@@ -1758,24 +1759,46 @@ static int phy_probe(struct device *dev)
}
 
/* Set the state to READY by default */
-   phydev->state = PHY_READY;
+   if (phydev->state > PHY_UP && phydev->state != PHY_HALTED)
+   should_start = true;
+   else
+   phydev->state = PHY_READY;
 
if (phydev->drv->probe)
err = phydev->drv->probe(phydev);
 
    mutex_unlock(&phydev->lock);
 
+   if (should_start)
+   phy_start(phydev);
+
return err;
 }
 
 static int phy_remove(struct device *dev)
 {
struct phy_device *phydev = to_phy_device(dev);
+   bool should_stop = false;
+   enum phy_state state;
+
+   cancel_delayed_work_sync(&phydev->state_queue);
 
    mutex_lock(&phydev->lock);
-   phydev->state = PHY_DOWN;
+   state = phydev->state;
+   if (state > PHY_UP && state != PHY_HALTED)
+   should_stop = true;
+   else
+   phydev->state = PHY_DOWN;
    mutex_unlock(&phydev->lock);
 
+   /* phy_stop() sets the state to HALTED, undo that for the ->probe() function
+* to have a chance to resume where we left off
+*/
+   if (should_stop) {
+   phy_stop(phydev);
+   phydev->state = state;
+   }
+
if (phydev->drv->remove)
phydev->drv->remove(phydev);
phydev->drv = NULL;
-- 
2.9.3



[PATCH net 1/2] net: phy: Check phydev->drv

2017-02-05 Thread Florian Fainelli
In preparation for supporting driver bind/unbind properly, sprinkle checks on
phydev->drv where we may call into PHYLIB from user-space or other parts of the
kernel.
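
The two guard shapes used throughout the diff below can be shown with a small
userspace mock. All sketch_* types and functions here are hypothetical
illustrations, not PHYLIB definitions; only the return codes (-EIO when no
driver is bound, -EOPNOTSUPP when an optional callback is missing) follow the
patch.

```c
#include <errno.h>
#include <stddef.h>

struct sketch_phy_device;

/* Hypothetical, trimmed-down driver ops table. */
struct sketch_phy_driver {
	int (*read_status)(struct sketch_phy_device *phydev);
	int (*set_wol)(struct sketch_phy_device *phydev, int wol);
};

struct sketch_phy_device {
	const struct sketch_phy_driver *drv;	/* NULL while no driver is bound */
	int link;
};

/* Example callback used only to exercise the guards. */
int sketch_dummy_read_status(struct sketch_phy_device *phydev)
{
	return phydev->link;
}

/* Mandatory callback: bail out early if the driver is gone. */
int sketch_read_status(struct sketch_phy_device *phydev)
{
	if (!phydev->drv)
		return -EIO;

	return phydev->drv->read_status(phydev);
}

/* Optional callback: test both the driver and the callback pointer. */
int sketch_set_wol(struct sketch_phy_device *phydev, int wol)
{
	if (phydev->drv && phydev->drv->set_wol)
		return phydev->drv->set_wol(phydev, wol);

	return -EOPNOTSUPP;
}
```

With drv set to NULL, sketch_read_status() fails cleanly instead of
dereferencing a NULL pointer, which is the situation an unbound PHY driver
creates.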

Suggested-by: Russell King 
Signed-off-by: Florian Fainelli 
---
 drivers/net/phy/phy.c | 26 ++
 include/linux/phy.h   |  3 +++
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 7cc1b7dcfe05..d6f7838455dd 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -580,7 +580,7 @@ int phy_mii_ioctl(struct phy_device *phydev, struct ifreq 
*ifr, int cmd)
return 0;
 
case SIOCSHWTSTAMP:
-   if (phydev->drv->hwtstamp)
+   if (phydev->drv && phydev->drv->hwtstamp)
return phydev->drv->hwtstamp(phydev, ifr);
/* fall through */
 
@@ -603,6 +603,9 @@ int phy_start_aneg(struct phy_device *phydev)
 {
int err;
 
+   if (!phydev->drv)
+   return -EIO;
+
    mutex_lock(&phydev->lock);
 
if (AUTONEG_DISABLE == phydev->autoneg)
@@ -975,7 +978,7 @@ void phy_state_machine(struct work_struct *work)
 
old_state = phydev->state;
 
-   if (phydev->drv->link_change_notify)
+   if (phydev->drv && phydev->drv->link_change_notify)
phydev->drv->link_change_notify(phydev);
 
switch (phydev->state) {
@@ -1286,6 +1289,9 @@ EXPORT_SYMBOL(phy_write_mmd_indirect);
  */
 int phy_init_eee(struct phy_device *phydev, bool clk_stop_enable)
 {
+   if (!phydev->drv)
+   return -EIO;
+
/* According to 802.3az,the EEE is supported only in full duplex-mode.
 * Also EEE feature is active when core is operating with MII, GMII
 * or RGMII (all kinds). Internal PHYs are also allowed to proceed and
@@ -1363,6 +1369,9 @@ EXPORT_SYMBOL(phy_init_eee);
  */
 int phy_get_eee_err(struct phy_device *phydev)
 {
+   if (!phydev->drv)
+   return -EIO;
+
return phy_read_mmd_indirect(phydev, MDIO_PCS_EEE_WK_ERR, MDIO_MMD_PCS);
 }
 EXPORT_SYMBOL(phy_get_eee_err);
@@ -1379,6 +1388,9 @@ int phy_ethtool_get_eee(struct phy_device *phydev, struct 
ethtool_eee *data)
 {
int val;
 
+   if (!phydev->drv)
+   return -EIO;
+
/* Get Supported EEE */
val = phy_read_mmd_indirect(phydev, MDIO_PCS_EEE_ABLE, MDIO_MMD_PCS);
if (val < 0)
@@ -1412,6 +1424,9 @@ int phy_ethtool_set_eee(struct phy_device *phydev, struct 
ethtool_eee *data)
 {
int val = ethtool_adv_to_mmd_eee_adv_t(data->advertised);
 
+   if (!phydev->drv)
+   return -EIO;
+
/* Mask prohibited EEE modes */
val &= ~phydev->eee_broken_modes;
 
@@ -1423,7 +1438,7 @@ EXPORT_SYMBOL(phy_ethtool_set_eee);
 
 int phy_ethtool_set_wol(struct phy_device *phydev, struct ethtool_wolinfo *wol)
 {
-   if (phydev->drv->set_wol)
+   if (phydev->drv && phydev->drv->set_wol)
return phydev->drv->set_wol(phydev, wol);
 
return -EOPNOTSUPP;
@@ -1432,7 +1447,7 @@ EXPORT_SYMBOL(phy_ethtool_set_wol);
 
 void phy_ethtool_get_wol(struct phy_device *phydev, struct ethtool_wolinfo 
*wol)
 {
-   if (phydev->drv->get_wol)
+   if (phydev->drv && phydev->drv->get_wol)
phydev->drv->get_wol(phydev, wol);
 }
 EXPORT_SYMBOL(phy_ethtool_get_wol);
@@ -1468,6 +1483,9 @@ int phy_ethtool_nway_reset(struct net_device *ndev)
if (!phydev)
return -ENODEV;
 
+   if (!phydev->drv)
+   return -EIO;
+
return genphy_restart_aneg(phydev);
 }
 EXPORT_SYMBOL(phy_ethtool_nway_reset);
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 7fc1105605bf..231e07bb0d76 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -802,6 +802,9 @@ int phy_stop_interrupts(struct phy_device *phydev);
 
 static inline int phy_read_status(struct phy_device *phydev)
 {
+   if (!phydev->drv)
+   return -EIO;
+
return phydev->drv->read_status(phydev);
 }
 
-- 
2.9.3



[PATCH net 0/2] net: phy: Unbind/bind fixes

2017-02-05 Thread Florian Fainelli
Hi all,

This patch series addresses the inability to safely unbind and bind
PHY drivers by making the appropriate checks throughout PHYLIB where we
may be directly responding to user-space queries, as well as from within
the kernel state machine.

The second patch makes unbind -> bind work by taking care of the
PHY state machine state.

Florian Fainelli (2):
  net: phy: Check phydev->drv
  net: phy: Fix PHY driver bind and unbind events

 drivers/net/phy/phy.c| 26 ++
 drivers/net/phy/phy_device.c | 27 +--
 include/linux/phy.h  |  3 +++
 3 files changed, 50 insertions(+), 6 deletions(-)

-- 
2.9.3



Re: [PATCH net] ip6_gre: fix ip6gre_err() invalid reads

2017-02-05 Thread David Miller
From: Eric Dumazet 
Date: Sat, 04 Feb 2017 23:18:55 -0800

> From: Eric Dumazet 
> 
> Andrey Konovalov reported out of bound accesses in ip6gre_err()
> 
> If GRE flags contains GRE_KEY, the following expression
> *(((__be32 *)p) + (grehlen / 4) - 1)
> 
> accesses data ~40 bytes after the expected point, since
> grehlen includes the size of IPv6 headers.
> 
> Let's use a "struct gre_base_hdr *greh" pointer to make this
> code more readable. 
> 
> p[1] becomes greh->protocol.
> grehlen is the GRE header length.
> 
> Fixes: c12b395a4664 ("gre: Support GRE over IPv6")
> Signed-off-by: Eric Dumazet 
> Reported-by: Andrey Konovalov 

So the bug is that we include offset twice in the calculation.

Applied and queued up for -stable, thanks.
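
The arithmetic behind the quoted expression can be checked with a short sketch.
It assumes a GRE header carrying only the key (4 bytes of base header plus a
4-byte key) and the fixed 40-byte IPv6 header; sketch_key_byte_offset() is a
hypothetical helper, not a kernel function.

```c
#include <stddef.h>

/* Byte offset, relative to the start of the GRE header, read by
 * *(((__be32 *)p) + (hlen / 4) - 1): the last 32-bit word of hlen bytes. */
size_t sketch_key_byte_offset(size_t hlen)
{
	return (hlen / 4 - 1) * 4;
}
```

With the GRE header length alone (8 bytes) the key is read at offset 4. If the
40 bytes of IPv6 header are counted in as well (48 bytes), the read lands at
offset 44, which is 40 bytes past the key.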



Re: [PATCH v2 net-next 00/12] net: get rid of __napi_complete()

2017-02-05 Thread David Miller
From: Eric Dumazet 
Date: Sat,  4 Feb 2017 15:24:50 -0800

> This patch series removes __napi_complete() calls, in an effort
> to make NAPI API simpler and generalize GRO and napi_complete_done()

Nice cleanup, series applied, thanks Eric.


Re: [PATCH iproute2/net-next 1/3] tc: Add support for the sample tc action

2017-02-05 Thread Florian Fainelli
Le 02/05/17 à 12:22, Yotam Gigi a écrit :
>> -Original Message-
>> From: Florian Fainelli [mailto:f.faine...@gmail.com]
>> Sent: Sunday, February 05, 2017 8:37 PM
>> To: Yotam Gigi ; step...@networkplumber.org;
>> netdev@vger.kernel.org; Jiri Pirko ; Elad Raz
>> 
>> Subject: Re: [PATCH iproute2/net-next 1/3] tc: Add support for the sample tc 
>> action
>>
>> On 02/04/2017 11:58 PM, Yotam Gigi wrote:
>>> The sample tc action allows sampling packets matching a classifier. It
>>> picks packets at random and samples them using the psample netlink
>>> channel. The user can specify the psample group, which the packet will be
>>> sampled to, the sampling rate and the packet truncation (to save
>>> kernel-user traffic).
>>>
>>> The sampled packets contain informative metadata, for example, the input
>>> interface and the original packet length.
>>>
>>> The action syntax:
>>> tc filter add [...] \
>>> action sample rate RATE group GROUP [trunc SIZE]
>>> [...]
>>>
>>> Where:
>>>   RATE := The sampling rate which is the ratio of packets observed at the
>>>   data source to the samples generated
>>>   GROUP := the psample module sampling group
>>>   SIZE := optional truncation size
>>>
>>> An example for a common usecase of the sample tc action: to sample ingress
>>> traffic from interface eth1, one may use the commands:
>>>
>>> tc qdisc add dev eth1 handle : ingress
>>>
>>> tc filter add dev eth1 parent : \
>>>matchall action sample rate 12 group 4
>>>
>>> Where the first command adds an ingress qdisc and the second starts
>>> sampling randomly with an average of one sampled packet per 12 packets
>>> on dev eth1 to psample group 4.
>>
>> The group argument seems to be mandatory from looking at the code, but
>> what if I just wanted to have a port mirroring between, say sw0p1 and
>> sw0p2 with the sample rate specified instead (without using the psample
>> netlink channel at all)? Could we make this group an optional argument
>> instead?
> 
> The kernel action currently doesn't support it, and I am not sure it should.
> 
> If I understand you correctly, you want to make the sample action identical
> to mirred-mirror, only with random behavior. This can be done using the 
> matchall and mirred action, plus the 'random' gact keyword.

It sounds like we can indeed, with random determ and using the VAL
argument we should be able to configure the capture divider; thanks!

> 
> The sample action attaches some metadata in addition to the original packet
> data, and that cannot be achieved by mirroring the packets, thus making it
> unusable for our usecase. In the former version we attached the metadata
> using the IFE protocol, but we decided to use a dedicated netlink channel 
> instead.

Yeah I see that now, thanks for the explanation!
-- 
Florian


RE: [PATCH iproute2/net-next 1/3] tc: Add support for the sample tc action

2017-02-05 Thread Yotam Gigi
>-Original Message-
>From: Florian Fainelli [mailto:f.faine...@gmail.com]
>Sent: Sunday, February 05, 2017 8:37 PM
>To: Yotam Gigi ; step...@networkplumber.org;
>netdev@vger.kernel.org; Jiri Pirko ; Elad Raz
>
>Subject: Re: [PATCH iproute2/net-next 1/3] tc: Add support for the sample tc 
>action
>
>On 02/04/2017 11:58 PM, Yotam Gigi wrote:
>> The sample tc action allows sampling packets matching a classifier. It
>> picks packets at random and samples them using the psample netlink
>> channel. The user can specify the psample group, which the packet will be
>> sampled to, the sampling rate and the packet truncation (to save
>> kernel-user traffic).
>>
>> The sampled packets contain informative metadata, for example, the input
>> interface and the original packet length.
>>
>> The action syntax:
>> tc filter add [...] \
>>  action sample rate RATE group GROUP [trunc SIZE]
>>  [...]
>>
>> Where:
>>   RATE := The sampling rate which is the ratio of packets observed at the
>>data source to the samples generated
>>   GROUP := the psample module sampling group
>>   SIZE := optional truncation size
>>
>> An example for a common usecase of the sample tc action: to sample ingress
>> traffic from interface eth1, one may use the commands:
>>
>> tc qdisc add dev eth1 handle : ingress
>>
>> tc filter add dev eth1 parent : \
>>matchall action sample rate 12 group 4
>>
>> Where the first command adds an ingress qdisc and the second starts
>> sampling randomly with an average of one sampled packet per 12 packets
>> on dev eth1 to psample group 4.
>
>The group argument seems to be mandatory from looking at the code, but
>what if I just wanted to have a port mirroring between, say sw0p1 and
>sw0p2 with the sample rate specified instead (without using the psample
>netlink channel at all)? Could we make this group an optional argument
>instead?

The kernel action currently doesn't support it, and I am not sure it should.

If I understand you correctly, you want to make the sample action identical
to mirred-mirror, only with random behavior. This can be done using the 
matchall and mirred action, plus the 'random' gact keyword. 

The sample action attaches some metadata in addition to the original packet
data, and that cannot be achieved by mirroring the packets, thus making it
unusable for our usecase. In the former version we attached the metadata
using the IFE protocol, but we decided to use a dedicated netlink channel 
instead.

>
>Thanks!
>--
>Florian


Re: [PATCH v3 net] bpf: add bpf_sk_netns_id() helper

2017-02-05 Thread Andy Lutomirski
On Sun, Feb 5, 2017 at 11:24 AM, David Ahern  wrote:
> On 2/3/17 8:34 PM, Alexei Starovoitov wrote:
>> Therefore introduce 'u64 bpf_sk_netns_id(sk);' helper. It returns
>> unique value that identifies netns of given socket or dev_net(skb->dev)
>> The upper 32-bits of the return value contain device id where namespace
>> filesystem resides and lower 32-bits contain inode number within that 
>> filesystem.
>> It's the same as
>>  struct stat st;
>>  stat("/proc/pid/ns/net", &st);
>>  return (st->st_dev << 32)  | st->st_ino;
>>
> ...
>
>> can be considered a new feature, whereas for cgroup_sock
>> programs that modify sk->bound_dev_if (like 'ip vrf' does)
>> it's a bug fix, since 'ip vrf' needs to be netns aware.
>>
>
>
> LGTM.
>
> Reviewed-by: David Ahern 
> Tested-by: David Ahern 
>
> Updated patches for vrf in network namespaces are here:
> https://github.com/dsahern/iproute2 vrf/ip-vrf
>
>
> root@kenny-jessie2:~# ip vrf exec red bash
>
> root@kenny-jessie2:red:~# ping -c1 -w1 10.100.1.254
> PING 10.100.1.254 (10.100.1.254) 56(84) bytes of data.
> 64 bytes from 10.100.1.254: icmp_seq=1 ttl=64 time=0.230 ms
>
> --- 10.100.1.254 ping statistics ---
> 1 packets transmitted, 1 received, 0% packet loss, time 0ms
> rtt min/avg/max/mdev = 0.230/0.230/0.230/0.000 ms
>
> root@kenny-jessie2:red:~# unshare -n
>
> root@kenny-jessie2:red:~# ping -c1 -w1 10.100.1.254
> connect: Network is unreachable
>
>
> Andy: thank you for the feedback on the 'ip vrf' use case. I believe this 
> kernel patch + my iproute2 patches address the issues mentioned so far. 
> Specifically, the transcript above shows your concern about 'unshare -n' case 
> is handled. In one of the many responses last night, you mentioned I have 'a 
> somewhat kludgey fix that gets the "ip netns" case'. If you can elaborate on 
> 'somewhat kludgey', I can fix those this week as well.

What I meant was: fixing it in iproute2 without kernel changes would
be kludgey and incomplete.  You seem to have fixed it by depending on
this kernel patch.

--Andy


Re: [PATCH v3 net] bpf: add bpf_sk_netns_id() helper

2017-02-05 Thread David Ahern
On 2/3/17 8:34 PM, Alexei Starovoitov wrote:
> Therefore introduce 'u64 bpf_sk_netns_id(sk);' helper. It returns
> unique value that identifies netns of given socket or dev_net(skb->dev)
> The upper 32-bits of the return value contain device id where namespace
> filesystem resides and lower 32-bits contain inode number within that 
> filesystem.
> It's the same as
>  struct stat st;
>  stat("/proc/pid/ns/net", &st);
>  return (st->st_dev << 32)  | st->st_ino;
> 
...

> can be considered a new feature, whereas for cgroup_sock
> programs that modify sk->bound_dev_if (like 'ip vrf' does)
> it's a bug fix, since 'ip vrf' needs to be netns aware.
> 


LGTM.

Reviewed-by: David Ahern 
Tested-by: David Ahern 

Updated patches for vrf in network namespaces are here:
https://github.com/dsahern/iproute2 vrf/ip-vrf


root@kenny-jessie2:~# ip vrf exec red bash

root@kenny-jessie2:red:~# ping -c1 -w1 10.100.1.254
PING 10.100.1.254 (10.100.1.254) 56(84) bytes of data.
64 bytes from 10.100.1.254: icmp_seq=1 ttl=64 time=0.230 ms

--- 10.100.1.254 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.230/0.230/0.230/0.000 ms

root@kenny-jessie2:red:~# unshare -n

root@kenny-jessie2:red:~# ping -c1 -w1 10.100.1.254
connect: Network is unreachable


Andy: thank you for the feedback on the 'ip vrf' use case. I believe this 
kernel patch + my iproute2 patches address the issues mentioned so far. 
Specifically, the transcript above shows your concern about 'unshare -n' case 
is handled. In one of the many responses last night, you mentioned I have 'a 
somewhat kludgey fix that gets the "ip netns" case'. If you can elaborate on 
'somewhat kludgey', I can fix those this week as well.
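
For reference, the identifier layout described in the quoted patch text (device
id in the upper 32 bits, inode number in the lower 32 bits) can be computed in
userspace roughly as follows. sketch_netns_id() is a hypothetical helper, and
truncating the inode number to 32 bits is an assumption based on that
description.

```c
#include <stdint.h>
#include <sys/stat.h>

/* Combine st_dev and st_ino as described: device id in the upper 32 bits,
 * inode number (truncated to 32 bits, an assumption) in the lower 32 bits. */
uint64_t sketch_netns_id(const struct stat *st)
{
	return ((uint64_t)st->st_dev << 32) | (uint32_t)st->st_ino;
}
```

A caller would fill the struct with stat() on a netns file such as
/proc/self/ns/net (the path is an example choice) and pass it to the helper.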


[PATCH 3.10 257/319] net: sctp, forbid negative length

2017-02-05 Thread Willy Tarreau
From: Jiri Slaby 

commit a4b8e71b05c27bae6bad3bdecddbc6b68a3ad8cf upstream.

Most of the getsockopt handlers in net/sctp/socket.c check len against
the sizeof some structure like:
if (len < sizeof(int))
return -EINVAL;

At first glance, the check seems to be correct. But since len is int
and sizeof returns size_t, the int gets converted to the unsigned size_t too. So
the test returns false for negative lengths. Yes, (-1 < sizeof(long)) is
false.

Fix this in sctp by explicitly checking len < 0 before any getsockopt
handler is called.

Note that sctp_getsockopt_events already handled the negative case.
Since we added the < 0 check elsewhere, this one can be removed.
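
The conversion pitfall can be reproduced outside the kernel. In this sketch the
explicit cast spells out the conversion that the compiler applies implicitly in
"len < sizeof(int)"; both function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Same result as "len < sizeof(int)": len is converted to size_t first,
 * so a negative len turns into a huge value and is not reported as small. */
bool sketch_len_too_small_unsafe(int len)
{
	return (size_t)len < sizeof(int);
}

/* Reject negative lengths before the unsigned comparison. */
bool sketch_len_too_small_safe(int len)
{
	return len < 0 || (size_t)len < sizeof(int);
}
```

For len = -1 the first variant returns false and lets the negative length
through, while the second one catches it.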

If not checked, this is the result:
UBSAN: Undefined behaviour in ../mm/page_alloc.c:2722:19
shift exponent 52 is too large for 32-bit type 'int'
CPU: 1 PID: 24535 Comm: syz-executor Not tainted 4.8.1-0-syzkaller #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
  88006d99f2a8 b2f7bdea 41b58ab3
 b4363c14 b2f7bcde 88006d99f2d0 88006d99f270
   0034 b5096422
Call Trace:
 [] ? __ubsan_handle_shift_out_of_bounds+0x29c/0x300
...
 [] ? kmalloc_order+0x24/0x90
 [] ? kmalloc_order_trace+0x24/0x220
 [] ? __kmalloc+0x330/0x540
 [] ? sctp_getsockopt_local_addrs+0x174/0xca0 [sctp]
 [] ? sctp_getsockopt+0x10d/0x1b0 [sctp]
 [] ? sock_common_getsockopt+0xb9/0x150
 [] ? SyS_getsockopt+0x1a5/0x270

Signed-off-by: Jiri Slaby 
Cc: Vlad Yasevich 
Cc: Neil Horman 
Cc: "David S. Miller" 
Cc: linux-s...@vger.kernel.org
Cc: netdev@vger.kernel.org
Acked-by: Neil Horman 
Signed-off-by: David S. Miller 
Signed-off-by: Willy Tarreau 
---
 net/sctp/socket.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index bdc3fb6..86e7352 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -4259,7 +4259,7 @@ static int sctp_getsockopt_disable_fragments(struct sock 
*sk, int len,
 static int sctp_getsockopt_events(struct sock *sk, int len, char __user 
*optval,
  int __user *optlen)
 {
-   if (len <= 0)
+   if (len == 0)
return -EINVAL;
if (len > sizeof(struct sctp_event_subscribe))
len = sizeof(struct sctp_event_subscribe);
@@ -5770,6 +5770,9 @@ SCTP_STATIC int sctp_getsockopt(struct sock *sk, int 
level, int optname,
if (get_user(len, optlen))
return -EFAULT;
 
+   if (len < 0)
+   return -EINVAL;
+
sctp_lock_sock(sk);
 
switch (optname) {
-- 
2.8.0.rc2.1.gbe9624a



Re: [PATCH iproute2/net-next 1/3] tc: Add support for the sample tc action

2017-02-05 Thread Florian Fainelli
On 02/04/2017 11:58 PM, Yotam Gigi wrote:
> The sample tc action allows sampling packets matching a classifier. It
> picks packets at random and samples them using the psample netlink
> channel. The user can specify the psample group, which the packet will be
> sampled to, the sampling rate and the packet truncation (to save
> kernel-user traffic).
> 
> The sampled packets contain informative metadata, for example, the input
> interface and the original packet length.
> 
> The action syntax:
> tc filter add [...] \
>   action sample rate RATE group GROUP [trunc SIZE]
>   [...]
> 
> Where:
>   RATE := The sampling rate which is the ratio of packets observed at the
> data source to the samples generated
>   GROUP := the psample module sampling group
>   SIZE := optional truncation size
> 
> An example for a common usecase of the sample tc action: to sample ingress
> traffic from interface eth1, one may use the commands:
> 
> tc qdisc add dev eth1 handle : ingress
> 
> tc filter add dev eth1 parent : \
>matchall action sample rate 12 group 4
> 
> Where the first command adds an ingress qdisc and the second starts
> sampling randomly with an average of one sampled packet per 12 packets
> on dev eth1 to psample group 4.
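
As a rough illustration of what "rate 12" and the optional truncation mean, a
1-in-RATE sampler can be sketched as below. This is an assumption-laden model,
not the kernel action: the modulo test, the treatment of a zero truncation size
as "no truncation", and both function names are illustrative only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* With a uniformly random rnd, this is true about once every `rate`
 * packets. A rate of 0 never samples. */
bool sketch_should_sample(uint32_t rnd, uint32_t rate)
{
	return rate != 0 && rnd % rate == 0;
}

/* Length of the sampled copy: the whole packet unless a truncation size is
 * given and the packet is longer than it. */
size_t sketch_sampled_len(size_t pkt_len, size_t trunc_size)
{
	if (trunc_size == 0 || pkt_len <= trunc_size)
		return pkt_len;

	return trunc_size;
}
```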

The group argument seems to be mandatory from looking at the code, but
what if I just wanted to have a port mirroring between, say sw0p1 and
sw0p2 with the sample rate specified instead (without using the psample
netlink channel at all)? Could we make this group an optional argument
instead?

Thanks!
-- 
Florian


Re: [PATCH net-next] net: dsa: mv88e6xxx: Add watchdog interrupt handler

2017-02-05 Thread Florian Fainelli


On 02/05/2017 08:52 AM, Andrew Lunn wrote:
>>> +static irqreturn_t mv88e6xxx_g2_watchdog_thread_fn(int irq, void *dev_id)
>>> +{
>>> +   u16 reg;
>>> +
>>> +   struct mv88e6xxx_chip *chip = dev_id;
>>> +
>>> +   mv88e6xxx_g2_read(chip, GLOBAL2_WDOG_CONTROL, &reg);
>>> +
>>> +   dev_info(chip->dev, "Watchdog event: %04x", reg);
>>
>> Should this be 0x%04x just to illustrate the value is hexadecimal? And
> 
> Yes, the prefix would make sense.
> 
>> should this be dev_info_once()?
> 
> Since the next action is to disable the interrupt, and there is no
> code path to reenable it, _once does not seem needed. And if for some
> reason the kernel log is spammed with these messages, i want to
> know. It means my interrupt handling code is badly broken!

Sure, that makes sense then, thanks!
-- 
Florian
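
The point about the 0x prefix can be illustrated with a small
userspace sketch that uses snprintf() in place of dev_info(). The
helper name format_wdog_event() and its with_prefix flag are
hypothetical and exist only for this comparison.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format a 16-bit register value the way the
 * log message would, with or without an explicit 0x prefix.
 */
int format_wdog_event(char *buf, size_t len, uint16_t reg, int with_prefix)
{
	if (with_prefix)
		return snprintf(buf, len, "Watchdog event: 0x%04x",
				(unsigned int)reg);
	return snprintf(buf, len, "Watchdog event: %04x",
			(unsigned int)reg);
}
```

For a register value of 0x0010 the unprefixed form yields "0010",
which is easy to misread as decimal ten, while the prefixed form
yields "0x0010".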


[PATCH] net: intel: igb: use new api ethtool_{get|set}_link_ksettings

2017-02-05 Thread Philippe Reynes
The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone could test this patch.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/intel/igb/igb_ethtool.c |  108 ++
 1 files changed, 59 insertions(+), 49 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c b/drivers/net/ethernet/intel/igb/igb_ethtool.c
index 737b664..3e6281a 100644
--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
@@ -144,7 +144,8 @@ enum igb_diagnostics_results {
 };
 #define IGB_TEST_LEN (sizeof(igb_gstrings_test) / ETH_GSTRING_LEN)
 
-static int igb_get_settings(struct net_device *netdev, struct ethtool_cmd *ecmd)
+static int igb_get_link_ksettings(struct net_device *netdev,
+ struct ethtool_link_ksettings *cmd)
 {
struct igb_adapter *adapter = netdev_priv(netdev);
struct e1000_hw *hw = &adapter->hw;
@@ -152,11 +153,12 @@ static int igb_get_settings(struct net_device *netdev, struct ethtool_cmd *ecmd)
struct e1000_sfp_flags *eth_flags = &dev_spec->eth_flags;
u32 status;
u32 speed;
+   u32 supported, advertising;
 
status = rd32(E1000_STATUS);
if (hw->phy.media_type == e1000_media_type_copper) {
 
-   ecmd->supported = (SUPPORTED_10baseT_Half |
+   supported = (SUPPORTED_10baseT_Half |
   SUPPORTED_10baseT_Full |
   SUPPORTED_100baseT_Half |
   SUPPORTED_100baseT_Full |
@@ -164,63 +166,61 @@ static int igb_get_settings(struct net_device *netdev, struct ethtool_cmd *ecmd)
   SUPPORTED_Autoneg |
   SUPPORTED_TP |
   SUPPORTED_Pause);
-   ecmd->advertising = ADVERTISED_TP;
+   advertising = ADVERTISED_TP;
 
if (hw->mac.autoneg == 1) {
-   ecmd->advertising |= ADVERTISED_Autoneg;
+   advertising |= ADVERTISED_Autoneg;
/* the e1000 autoneg seems to match ethtool nicely */
-   ecmd->advertising |= hw->phy.autoneg_advertised;
+   advertising |= hw->phy.autoneg_advertised;
}
 
-   ecmd->port = PORT_TP;
-   ecmd->phy_address = hw->phy.addr;
-   ecmd->transceiver = XCVR_INTERNAL;
+   cmd->base.port = PORT_TP;
+   cmd->base.phy_address = hw->phy.addr;
} else {
-   ecmd->supported = (SUPPORTED_FIBRE |
+   supported = (SUPPORTED_FIBRE |
   SUPPORTED_1000baseKX_Full |
   SUPPORTED_Autoneg |
   SUPPORTED_Pause);
-   ecmd->advertising = (ADVERTISED_FIBRE |
+   advertising = (ADVERTISED_FIBRE |
 ADVERTISED_1000baseKX_Full);
if (hw->mac.type == e1000_i354) {
if ((hw->device_id ==
 E1000_DEV_ID_I354_BACKPLANE_2_5GBPS) &&
!(status & E1000_STATUS_2P5_SKU_OVER)) {
-   ecmd->supported |= SUPPORTED_2500baseX_Full;
-   ecmd->supported &=
+   supported |= SUPPORTED_2500baseX_Full;
+   supported &=
~SUPPORTED_1000baseKX_Full;
-   ecmd->advertising |= ADVERTISED_2500baseX_Full;
-   ecmd->advertising &=
+   advertising |= ADVERTISED_2500baseX_Full;
+   advertising &=
~ADVERTISED_1000baseKX_Full;
}
}
if (eth_flags->e100_base_fx) {
-   ecmd->supported |= SUPPORTED_100baseT_Full;
-   ecmd->advertising |= ADVERTISED_100baseT_Full;
+   supported |= SUPPORTED_100baseT_Full;
+   advertising |= ADVERTISED_100baseT_Full;
}
if (hw->mac.autoneg == 1)
-   ecmd->advertising |= ADVERTISED_Autoneg;
+   advertising |= ADVERTISED_Autoneg;
 
-   ecmd->port = PORT_FIBRE;
-   ecmd->transceiver = XCVR_EXTERNAL;
+   cmd->base.port = PORT_FIBRE;
}
if (hw->mac.autoneg != 1)
-   ecmd->advertising &= ~(ADVERTISED_Pause |
+   advertising &= ~(ADVERTISED_Pause |
   ADVERTISED_Asym_Pause);
 
switch (hw->fc.requested_mode) {
case e1000_fc_full:
-   

[PATCH net] udp: properly cope with csum errors

2017-02-05 Thread Eric Dumazet
From: Eric Dumazet 

Dmitry reported that UDP sockets being destroyed would trigger the
WARN_ON(atomic_read(&sk->sk_rmem_alloc)); in inet_sock_destruct().

It turns out we do not properly destroy skb(s) that have wrong UDP
checksum.
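
The idea of the fix, an optional callback that undoes the receive
memory accounting when a buffer is unlinked from the queue, can be
sketched in userspace C as follows. All names here (struct buf,
struct rxq, rxq_enqueue(), rxq_drop(), uncharge()) are hypothetical
stand-ins, the queue holds a single buffer, and the locking and
MSG_PEEK handling of the real function are left out.

```c
#include <assert.h>
#include <stddef.h>

struct buf {			/* stand-in for struct sk_buff */
	size_t truesize;
};

struct rxq {			/* stand-in for struct sock */
	struct buf *head;	/* single-slot "receive queue" */
	size_t rmem_alloc;	/* bytes currently charged to the socket */
};

typedef void (*drop_destructor_t)(struct rxq *q, struct buf *b);

/* Queue a buffer and charge its size to the socket. */
void rxq_enqueue(struct rxq *q, struct buf *b)
{
	q->head = b;
	q->rmem_alloc += b->truesize;
}

/* Unlink b if it is still queued and run the optional destructor so
 * that the accounting done at enqueue time can be undone.
 * Returns 0 on success, -1 if b was no longer queued.
 */
int rxq_drop(struct rxq *q, struct buf *b, drop_destructor_t destructor)
{
	if (q->head != b)
		return -1;
	q->head = NULL;
	if (destructor)
		destructor(q, b);
	return 0;
}

/* Example destructor: release the memory charge taken at enqueue time. */
void uncharge(struct rxq *q, struct buf *b)
{
	q->rmem_alloc -= b->truesize;
}
```

Dropping a buffer with a NULL destructor leaves rmem_alloc non-zero,
which is the kind of leftover charge the WARN_ON() complains about;
passing uncharge() brings the counter back to zero.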

Thanks again to syzkaller team.

Fixes: 7c13f97ffde6 ("udp: do fwd memory scheduling on dequeue")
Reported-by: Dmitry Vyukov 
Signed-off-by: Eric Dumazet 
Cc: Paolo Abeni 
Cc: Hannes Frederic Sowa 
---
 include/net/sock.h  |4 +++-
 net/core/datagram.c |8 ++--
 net/ipv4/udp.c  |2 +-
 net/ipv6/udp.c  |2 +-
 4 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index f0e867f58722..c4f5e6fca17c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2006,7 +2006,9 @@ void sk_reset_timer(struct sock *sk, struct timer_list *timer,
 void sk_stop_timer(struct sock *sk, struct timer_list *timer);
 
 int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb,
-   unsigned int flags);
+   unsigned int flags,
+   void (*destructor)(struct sock *sk,
+  struct sk_buff *skb));
 int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
 int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
 
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 662bea587165..ea633342ab0d 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -332,7 +332,9 @@ void __skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb, int len)
 EXPORT_SYMBOL(__skb_free_datagram_locked);
 
 int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb,
-   unsigned int flags)
+   unsigned int flags,
+   void (*destructor)(struct sock *sk,
+  struct sk_buff *skb))
 {
int err = 0;
 
@@ -342,6 +344,8 @@ int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb,
if (skb == skb_peek(&sk->sk_receive_queue)) {
__skb_unlink(skb, &sk->sk_receive_queue);
atomic_dec(&skb->users);
+   if (destructor)
+   destructor(sk, skb);
err = 0;
}
spin_unlock_bh(&sk->sk_receive_queue.lock);
@@ -375,7 +379,7 @@ EXPORT_SYMBOL(__sk_queue_drop_skb);
 
 int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags)
 {
-   int err = __sk_queue_drop_skb(sk, skb, flags);
+   int err = __sk_queue_drop_skb(sk, skb, flags, NULL);
 
kfree_skb(skb);
sk_mem_reclaim_partial(sk);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 1307a7c2e544..8aab7d78d25b 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1501,7 +1501,7 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock,
return err;
 
 csum_copy_err:
-   if (!__sk_queue_drop_skb(sk, skb, flags)) {
+   if (!__sk_queue_drop_skb(sk, skb, flags, udp_skb_destructor)) {
UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS, is_udplite);
UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
}
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 4d5c4eee4b3f..8990856f5101 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -441,7 +441,7 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
return err;
 
 csum_copy_err:
-   if (!__sk_queue_drop_skb(sk, skb, flags)) {
+   if (!__sk_queue_drop_skb(sk, skb, flags, udp_skb_destructor)) {
if (is_udp4) {
UDP_INC_STATS(sock_net(sk),
  UDP_MIB_CSUMERRORS, is_udplite);




Re: [PATCH net-next] net: dsa: mv88e6xxx: Add watchdog interrupt handler

2017-02-05 Thread Andrew Lunn
> > +static irqreturn_t mv88e6xxx_g2_watchdog_thread_fn(int irq, void *dev_id)
> > +{
> > +   u16 reg;
> > +
> > +   struct mv88e6xxx_chip *chip = dev_id;
> > +
> > +   mv88e6xxx_g2_read(chip, GLOBAL2_WDOG_CONTROL, &reg);
> > +
> > +   dev_info(chip->dev, "Watchdog event: %04x", reg);
> 
> Should this be 0x%04x just to illustrate the value is hexadecimal? And

Yes, the prefix would make sense.

> should this be dev_info_once()?

Since the next action is to disable the interrupt, and there is no
code path to reenable it, _once does not seem needed. And if for some
reason the kernel log is spammed with these messages, I want to
know. It means my interrupt handling code is badly broken!

  Andrew



[PATCH v2 2/2] mac80211: aes-cmac: switch to shash CMAC driver

2017-02-05 Thread Ard Biesheuvel
Instead of open coding the CMAC algorithm in the mac80211 driver using
byte-wide xors and calls into the crypto layer for each block of data,
instantiate a cmac(aes) synchronous hash and pass all the data into it
directly. This not only simplifies the code, it also allows the use
of more efficient and more secure implementations, especially on
platforms where SIMD ciphers have considerable setup time.

Signed-off-by: Ard Biesheuvel 
---
 net/mac80211/aes_cmac.c | 130 +---
 net/mac80211/aes_cmac.h |  11 +-
 net/mac80211/key.h  |   2 +-
 3 files changed, 35 insertions(+), 108 deletions(-)

diff --git a/net/mac80211/aes_cmac.c b/net/mac80211/aes_cmac.c
index d0bd5fff5f0a..0d4d2af52a56 100644
--- a/net/mac80211/aes_cmac.c
+++ b/net/mac80211/aes_cmac.c
@@ -22,126 +22,52 @@
 #define CMAC_TLEN_256 16 /* CMAC TLen = 128 bits (16 octets) */
 #define AAD_LEN 20
 
-
-void gf_mulx(u8 *pad)
-{
-   int i, carry;
-
-   carry = pad[0] & 0x80;
-   for (i = 0; i < AES_BLOCK_SIZE - 1; i++)
-   pad[i] = (pad[i] << 1) | (pad[i + 1] >> 7);
-   pad[AES_BLOCK_SIZE - 1] <<= 1;
-   if (carry)
-   pad[AES_BLOCK_SIZE - 1] ^= 0x87;
-}
-
-void aes_cmac_vector(struct crypto_cipher *tfm, size_t num_elem,
-const u8 *addr[], const size_t *len, u8 *mac,
-size_t mac_len)
-{
-   u8 cbc[AES_BLOCK_SIZE], pad[AES_BLOCK_SIZE];
-   const u8 *pos, *end;
-   size_t i, e, left, total_len;
-
-   memset(cbc, 0, AES_BLOCK_SIZE);
-
-   total_len = 0;
-   for (e = 0; e < num_elem; e++)
-   total_len += len[e];
-   left = total_len;
-
-   e = 0;
-   pos = addr[0];
-   end = pos + len[0];
-
-   while (left >= AES_BLOCK_SIZE) {
-   for (i = 0; i < AES_BLOCK_SIZE; i++) {
-   cbc[i] ^= *pos++;
-   if (pos >= end) {
-   e++;
-   pos = addr[e];
-   end = pos + len[e];
-   }
-   }
-   if (left > AES_BLOCK_SIZE)
-   crypto_cipher_encrypt_one(tfm, cbc, cbc);
-   left -= AES_BLOCK_SIZE;
-   }
-
-   memset(pad, 0, AES_BLOCK_SIZE);
-   crypto_cipher_encrypt_one(tfm, pad, pad);
-   gf_mulx(pad);
-
-   if (left || total_len == 0) {
-   for (i = 0; i < left; i++) {
-   cbc[i] ^= *pos++;
-   if (pos >= end) {
-   e++;
-   pos = addr[e];
-   end = pos + len[e];
-   }
-   }
-   cbc[left] ^= 0x80;
-   gf_mulx(pad);
-   }
-
-   for (i = 0; i < AES_BLOCK_SIZE; i++)
-   pad[i] ^= cbc[i];
-   crypto_cipher_encrypt_one(tfm, pad, pad);
-   memcpy(mac, pad, mac_len);
-}
-
-
-void ieee80211_aes_cmac(struct crypto_cipher *tfm, const u8 *aad,
+void ieee80211_aes_cmac(struct crypto_shash *tfm, const u8 *aad,
const u8 *data, size_t data_len, u8 *mic)
 {
-   const u8 *addr[3];
-   size_t len[3];
-   u8 zero[CMAC_TLEN];
+   struct shash_desc *desc;
+   u8 buf[sizeof(*desc) + crypto_shash_descsize(tfm)] CRYPTO_MINALIGN_ATTR;
+   u8 out[crypto_shash_digestsize(tfm)];
 
-   memset(zero, 0, CMAC_TLEN);
-   addr[0] = aad;
-   len[0] = AAD_LEN;
-   addr[1] = data;
-   len[1] = data_len - CMAC_TLEN;
-   addr[2] = zero;
-   len[2] = CMAC_TLEN;
+   desc = (struct shash_desc *)buf;
+   desc->tfm = tfm;
 
-   aes_cmac_vector(tfm, 3, addr, len, mic, CMAC_TLEN);
+   crypto_shash_init(desc);
+   crypto_shash_update(desc, aad, AAD_LEN);
+   crypto_shash_update(desc, data, data_len - CMAC_TLEN);
+   crypto_shash_finup(desc, (u8[CMAC_TLEN]){}, CMAC_TLEN, out);
+
+   memcpy(mic, out, CMAC_TLEN);
 }
 
-void ieee80211_aes_cmac_256(struct crypto_cipher *tfm, const u8 *aad,
+void ieee80211_aes_cmac_256(struct crypto_shash *tfm, const u8 *aad,
const u8 *data, size_t data_len, u8 *mic)
 {
-   const u8 *addr[3];
-   size_t len[3];
-   u8 zero[CMAC_TLEN_256];
+   struct shash_desc *desc;
+   u8 buf[sizeof(*desc) + crypto_shash_descsize(tfm)] CRYPTO_MINALIGN_ATTR;
 
-   memset(zero, 0, CMAC_TLEN_256);
-   addr[0] = aad;
-   len[0] = AAD_LEN;
-   addr[1] = data;
-   len[1] = data_len - CMAC_TLEN_256;
-   addr[2] = zero;
-   len[2] = CMAC_TLEN_256;
+   desc = (struct shash_desc *)buf;
+   desc->tfm = tfm;
 
-   aes_cmac_vector(tfm, 3, addr, len, mic, CMAC_TLEN_256);
+   crypto_shash_init(desc);
+   crypto_shash_update(desc, aad, AAD_LEN);
+   crypto_shash_update(desc, data, data_len - CMAC_TLEN_256);
+   crypto_shash_finup(desc, (u8[CMAC_TLEN_256]){}, CMAC_TLEN_256, mic);
 }
 

[PATCH v2 0/2] mac80211: use crypto shash for AES cmac

2017-02-05 Thread Ard Biesheuvel
This is something I spotted while working on AES in various modes for
ARM and arm64.

The mac80211 aes_cmac code reimplements the CMAC algorithm based on the
core AES cipher, which is rather restrictive in how platforms can satisfy
the dependency on this algorithm. For instance, SIMD implementations may
have a considerable setup time, which cannot be amortized over the entire
input when calling into the crypto API one block at a time. Also, it prevents
the use of more secure fixed time implementations, since not all AES drivers
expose the cipher interface.

So switch aes_cmac to use a cmac(aes) shash. Before updating the aes_cmac code
in patch #2, the FILS AEAD code is moved to using a cmac(aes) shash supplied by
the crypto API so that we can remove the open coded version entirely in the
second patch.

NOTE: Jouni has been kind enough to test patch #2, and confirmed that it works.
  I have not tested patch #1 myself, mainly because the test methodology
  requires downloading Ubuntu installer images, and I am currently on a
  metered 3G connection (and will be for another couple of weeks)

Ard Biesheuvel (2):
  mac80211: fils_aead: Use crypto api CMAC shash rather than bare cipher
  mac80211: aes-cmac: switch to shash CMAC driver

 net/mac80211/Kconfig |   1 +
 net/mac80211/aes_cmac.c  | 130 +---
 net/mac80211/aes_cmac.h  |  15 +--
 net/mac80211/fils_aead.c |  74 +--
 net/mac80211/key.h   |   2 +-
 5 files changed, 70 insertions(+), 152 deletions(-)

-- 
2.7.4



[PATCH v2 1/2] mac80211: fils_aead: Use crypto api CMAC shash rather than bare cipher

2017-02-05 Thread Ard Biesheuvel
Switch the FILS AEAD code to use a cmac(aes) shash instantiated by the
crypto API rather than reusing the open coded implementation in
aes_cmac_vector(). This makes the code more understandable, and allows
platforms to implement cmac(aes) in a more secure (*) and efficient way
than is typically possible when using the AES cipher directly.

So replace the crypto_cipher by a crypto_shash, and update the aes_s2v()
routine to call the shash interface directly.

* In particular, the generic table based AES implementation is sensitive
  to known-plaintext timing attacks on the key, to which AES based MAC
  algorithms are especially vulnerable, given that their plaintext is not
  usually secret. Time invariant alternatives are available (e.g., based
  on SIMD algorithms), but may incur a setup cost that is prohibitive when
  operating on a single block at a time, which is why they don't usually
  expose the cipher API.

Signed-off-by: Ard Biesheuvel 
---
 net/mac80211/Kconfig |  1 +
 net/mac80211/aes_cmac.h  |  4 --
 net/mac80211/fils_aead.c | 74 +---
 3 files changed, 35 insertions(+), 44 deletions(-)

diff --git a/net/mac80211/Kconfig b/net/mac80211/Kconfig
index 3891cbd2adea..76e30f4797fb 100644
--- a/net/mac80211/Kconfig
+++ b/net/mac80211/Kconfig
@@ -6,6 +6,7 @@ config MAC80211
select CRYPTO_AES
select CRYPTO_CCM
select CRYPTO_GCM
+   select CRYPTO_CMAC
select CRC32
---help---
  This option enables the hardware independent IEEE 802.11
diff --git a/net/mac80211/aes_cmac.h b/net/mac80211/aes_cmac.h
index c827e1d5de8b..3702041f44fd 100644
--- a/net/mac80211/aes_cmac.h
+++ b/net/mac80211/aes_cmac.h
@@ -11,10 +11,6 @@
 
 #include 
 
-void gf_mulx(u8 *pad);
-void aes_cmac_vector(struct crypto_cipher *tfm, size_t num_elem,
-const u8 *addr[], const size_t *len, u8 *mac,
-size_t mac_len);
 struct crypto_cipher *ieee80211_aes_cmac_key_setup(const u8 key[],
   size_t key_len);
 void ieee80211_aes_cmac(struct crypto_cipher *tfm, const u8 *aad,
diff --git a/net/mac80211/fils_aead.c b/net/mac80211/fils_aead.c
index ecfdd97758a3..a294a57e856d 100644
--- a/net/mac80211/fils_aead.c
+++ b/net/mac80211/fils_aead.c
@@ -9,66 +9,60 @@
 
 #include 
 #include 
+#include 
 #include 
 
 #include "ieee80211_i.h"
 #include "aes_cmac.h"
 #include "fils_aead.h"
 
-static int aes_s2v(struct crypto_cipher *tfm,
+static void gf_mulx(u8 *pad)
+{
+   u64 a = get_unaligned_be64(pad);
+   u64 b = get_unaligned_be64(pad + 8);
+
+   put_unaligned_be64((a << 1) | (b >> 63), pad);
+   put_unaligned_be64((b << 1) ^ ((a >> 63) ? 0x87 : 0), pad + 8);
+}
+
+static int aes_s2v(struct crypto_shash *tfm,
   size_t num_elem, const u8 *addr[], size_t len[], u8 *v)
 {
u8 d[AES_BLOCK_SIZE], tmp[AES_BLOCK_SIZE];
+   struct shash_desc *desc;
+   u8 buf[sizeof(*desc) + crypto_shash_descsize(tfm)] CRYPTO_MINALIGN_ATTR;
size_t i;
-   const u8 *data[2];
-   size_t data_len[2], data_elems;
+
+   desc = (struct shash_desc *)buf;
+   desc->tfm = tfm;
 
/* D = AES-CMAC(K, <zero>) */
-   memset(tmp, 0, AES_BLOCK_SIZE);
-   data[0] = tmp;
-   data_len[0] = AES_BLOCK_SIZE;
-   aes_cmac_vector(tfm, 1, data, data_len, d, AES_BLOCK_SIZE);
+   crypto_shash_digest(desc, (u8[AES_BLOCK_SIZE]){}, AES_BLOCK_SIZE, d);
 
for (i = 0; i < num_elem - 1; i++) {
/* D = dbl(D) xor AES_CMAC(K, Si) */
gf_mulx(d); /* dbl */
-   aes_cmac_vector(tfm, 1, &addr[i], &len[i], tmp,
-   AES_BLOCK_SIZE);
+   crypto_shash_digest(desc, addr[i], len[i], tmp);
crypto_xor(d, tmp, AES_BLOCK_SIZE);
}
 
+   crypto_shash_init(desc);
+
if (len[i] >= AES_BLOCK_SIZE) {
/* len(Sn) >= 128 */
-   size_t j;
-   const u8 *pos;
-
/* T = Sn xorend D */
-
-   /* Use a temporary buffer to perform xorend on Sn (addr[i]) to
-* avoid modifying the const input argument.
-*/
-   data[0] = addr[i];
-   data_len[0] = len[i] - AES_BLOCK_SIZE;
-   pos = addr[i] + data_len[0];
-   for (j = 0; j < AES_BLOCK_SIZE; j++)
-   tmp[j] = pos[j] ^ d[j];
-   data[1] = tmp;
-   data_len[1] = AES_BLOCK_SIZE;
-   data_elems = 2;
+   crypto_shash_update(desc, addr[i], len[i] - AES_BLOCK_SIZE);
+   crypto_xor(d, addr[i] + len[i] - AES_BLOCK_SIZE,
+  AES_BLOCK_SIZE);
} else {
/* len(Sn) < 128 */
/* T = dbl(D) xor pad(Sn) */
gf_mulx(d); /* dbl */
-   memset(tmp, 0, AES_BLOCK_SIZE);
-   memcpy(tmp, addr[i], len[i]);
- 
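
The gf_mulx() rewrite in this patch replaces the byte-wise shift loop
from aes_cmac.c with two 64-bit big-endian words. A userspace sketch
along the following lines could be used to check that both forms
double a 16-byte block identically. load_be64() and store_be64() are
hypothetical stand-ins for the kernel's get_unaligned_be64() and
put_unaligned_be64(), and gf_mulx_agree() is a made-up test helper;
none of this is part of the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16

/* Byte-wise doubling, following the version removed from aes_cmac.c. */
void gf_mulx_bytes(uint8_t *pad)
{
	int i, carry;

	carry = pad[0] & 0x80;
	for (i = 0; i < BLOCK_SIZE - 1; i++)
		pad[i] = (uint8_t)((pad[i] << 1) | (pad[i + 1] >> 7));
	pad[BLOCK_SIZE - 1] <<= 1;
	if (carry)
		pad[BLOCK_SIZE - 1] ^= 0x87;
}

/* Portable big-endian load, standing in for get_unaligned_be64(). */
uint64_t load_be64(const uint8_t *p)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Portable big-endian store, standing in for put_unaligned_be64(). */
void store_be64(uint64_t v, uint8_t *p)
{
	int i;

	for (i = 7; i >= 0; i--) {
		p[i] = (uint8_t)v;
		v >>= 8;
	}
}

/* 64-bit doubling, following the fils_aead.c version in the patch. */
void gf_mulx_u64(uint8_t *pad)
{
	uint64_t a = load_be64(pad);
	uint64_t b = load_be64(pad + 8);

	store_be64((a << 1) | (b >> 63), pad);
	store_be64((b << 1) ^ ((a >> 63) ? 0x87 : 0), pad + 8);
}

/* Returns 1 if both variants produce the same result for block. */
int gf_mulx_agree(const uint8_t *block)
{
	uint8_t x[BLOCK_SIZE], y[BLOCK_SIZE];

	memcpy(x, block, BLOCK_SIZE);
	memcpy(y, block, BLOCK_SIZE);
	gf_mulx_bytes(x);
	gf_mulx_u64(y);
	return memcmp(x, y, BLOCK_SIZE) == 0;
}
```

Blocks with the top bit set (which triggers the 0x87 reduction) and
blocks with a carry crossing the boundary between the two 64-bit
halves are the interesting inputs for gf_mulx_agree().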

Re: [PATCH v2 1/2] net: phy: dp83867: Port mirroring support in the DP83867 TI's PHY driver

2017-02-05 Thread Lukasz Majewski
Hi Andrew,

> On Sat, Feb 04, 2017 at 05:02:11PM +0100, Lukasz Majewski wrote:
> > This patch adds support for enabling or disabling the port
> > mirroring (at CFG4 register) feature of the DP83867 TI's PHY device.
> 
> As we discussed before, "port mirroring" is bad naming. Yes, we should
> use it, because that is what the datasheet calls this feature.

That was my goal - to use naming from datasheet.

> But the
> commit message should also contain a description of what this means,
> and reference that the linux name for this concept is lane swapping.

Ok. No problem with that.

> 
> > +enum {
> 
> Maybe give the 0 value a name. DP83867_PORT_MIRROING_KEEP?

I can add this - no problem.

> 
> > +   DP83867_PORT_MIRROING_EN = 1,
> > +   DP83867_PORT_MIRROING_DIS,
> > +};
> > +
> 
> That extra enum value can then make this more obvious:
>   
> if (dp83867->port_mirroring != DP83867_PORT_MIRROING_KEEP)
>   dp83867_config_port_mirroring(phydev);
> 
> On the first reading of the patch, I thought you were setting mirroring
> on/off under all conditions, but in fact you don't. This makes it
> clearer.

Ok. I see your point.

> 
>   Thanks
>   Andrew

Thanks for review :-)


Best regards,

Lukasz Majewski

--

DENX Software Engineering GmbH,  Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: w...@denx.de


Re: [net-next 2/8] net/mlx5: Configure cache line size for start and end padding

2017-02-05 Thread Saeed Mahameed
On Thu, Feb 2, 2017 at 4:47 PM, Daniel Jurgens  wrote:
> On 2/1/2017 5:12 AM, David Laight wrote:
>> From: Saeed Mahameed
>>> Sent: 31 January 2017 20:59
>>> From: Daniel Jurgens 
>>>
>>> There is a hardware feature that will pad the start or end of a DMA to
>>> be cache line aligned to avoid RMWs on the last cache line. The default
>>> cache line size setting for this feature is 64B. This change configures
>>> the hardware to use 128B alignment on systems with 128B cache lines.
>> What guarantees that the extra bytes are actually inside the receive skb's
>> head and tail room?
>>
>>   David
>>
>>
> The hardware won't over write the length of the posted buffer.  This feature 
> is already enabled and defaults to 64B stride, this patch just configures it 
> properly for 128B cache line sizes.
>
Right, and the next patch will make sure the RX stride is aligned to 128B
when a 128B cache line size is configured in the HW.

> Thanks for reviewing it.
>
> Dan
>


Re: [net-next 5/8] net/mlx5e: Calc vlan_tag_present only once on xmit

2017-02-05 Thread Saeed Mahameed
On Wed, Feb 1, 2017 at 1:20 PM, David Laight  wrote:
> From: Saeed Mahameed
>> Sent: 31 January 2017 20:59
>> Cache skb_vlan_tag_present(skb) and pass it wherever needed in xmit
>> routines.
> ...
>
> Does this actually generate better code?

Only in case skb pointer is kept in memory (we will save up to 3
skb->vlan_tci dereferences in that case).

> It is quite likely that your 'vlan_present' variable ends up being on stack.
> Whereas the 'skb' is likely to be in a register.

Can I assume this is likely to be true on all archs?

> In which case the two loads are likely to be much the same and your
> change has added a write to the stack.
>
> David
>
>


Re: [PATCH] [net-next] net/mlx5e: fix another maybe-uninitialized false-positive

2017-02-05 Thread Or Gerlitz
On Fri, Feb 3, 2017 at 6:37 PM, Arnd Bergmann  wrote:
> In commit abeffce ("net/mlx5e: Fix a -Wmaybe-uninitialized warning"), I fixed
> a gcc warning for the ipv4 offload handling. Now we get the same warning for
> the added ipv6 support:
>
> drivers/net/ethernet/mellanox/mlx5/core/en_tc.c:815:40: warning: 'out_dev' may be used uninitialized in this function [-Wmaybe-uninitialized]
>
> We can apply the same workaround here as well.
>
> Fixes: ce99f6b97fcd ("net/mlx5e: Support SRIOV TC encapsulation offloads for IPv6 tunnels")
> Signed-off-by: Arnd Bergmann 

Frustrating... I don't see the warning with gcc 5.3.1 [1], but still,
the patch is okay.

Acked-by: Or Gerlitz 

[1] I used #make KCFLAGS='-Wmaybe-uninitialized'
M=drivers/net/ethernet/mellanox/mlx5/core   -j something


Re: [PATCHv3 net-next 11/12] net: mvpp2: switch to build_skb() in the RX path

2017-02-05 Thread Marcin Wojtas
Hi Thomas,

How about switching to napi_alloc_frag() in mvpp2_rx_refill(), which
is called in the hot path? It may be an easy way to get some performance gain.

Best regards,
Marcin

2017-02-02 16:51 GMT+01:00 Thomas Petazzoni:
> This commit adapts the mvpp2 RX path to use the build_skb() method. Not
> only is build_skb() now the recommended mechanism, but it also
> simplifies the addition of support for the PPv2.2 variant.
>
> Indeed, without build_skb(), we have to keep track for each RX
> descriptor of the physical address of the packet buffer, and the virtual
> address of the SKB. However, in PPv2.2 running on a 64-bit platform,
> there is not enough space in the descriptor to store the virtual address
> of the SKB. So having to take care only of the address of the packet
> buffer, and building the SKB upon reception helps in supporting PPv2.2.
>
> The implementation is fairly straightforward:
>
>  - mvpp2_skb_alloc() is renamed to mvpp2_buf_alloc() and no longer
>allocates a SKB. Instead, it allocates a buffer using the new
>mvpp2_frag_alloc() function, with enough space for the data and SKB.
>
>  - The initialization of the RX buffers in mvpp2_bm_bufs_add() as well
>as the refill of the RX buffers in mvpp2_rx_refill() is adjusted
>accordingly.
>
>  - Finally, the mvpp2_rx() is modified to use build_skb().
>
> Signed-off-by: Thomas Petazzoni 
> ---
>  drivers/net/ethernet/marvell/mvpp2.c | 77 
> +---
>  1 file changed, 55 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/net/ethernet/marvell/mvpp2.c b/drivers/net/ethernet/marvell/mvpp2.c
> index ec8f452..4132dc8 100644
> --- a/drivers/net/ethernet/marvell/mvpp2.c
> +++ b/drivers/net/ethernet/marvell/mvpp2.c
> @@ -918,6 +918,7 @@ struct mvpp2_bm_pool {
> int buf_size;
> /* Packet size */
> int pkt_size;
> +   int frag_size;
>
> /* BPPE virtual base address */
> u32 *virt_addr;
> @@ -3354,6 +3355,22 @@ static void mvpp2_cls_oversize_rxq_set(struct mvpp2_port *port)
> mvpp2_write(port->priv, MVPP2_CLS_SWFWD_PCTRL_REG, val);
>  }
>
> +static void *mvpp2_frag_alloc(const struct mvpp2_bm_pool *pool)
> +{
> +   if (likely(pool->frag_size <= PAGE_SIZE))
> +   return netdev_alloc_frag(pool->frag_size);
> +   else
> +   return kmalloc(pool->frag_size, GFP_ATOMIC);
> +}
> +
> +static void mvpp2_frag_free(const struct mvpp2_bm_pool *pool, void *data)
> +{
> +   if (likely(pool->frag_size <= PAGE_SIZE))
> +   skb_free_frag(data);
> +   else
> +   kfree(data);
> +}
> +
>  /* Buffer Manager configuration routines */
>
>  /* Create pool */
> @@ -3428,7 +3445,8 @@ static void mvpp2_bm_bufs_free(struct device *dev, struct mvpp2 *priv,
>
> if (!vaddr)
> break;
> -   dev_kfree_skb_any((struct sk_buff *)vaddr);
> +
> +   mvpp2_frag_free(bm_pool, (void *)vaddr);
> }
>
> /* Update BM driver with number of buffers removed from pool */
> @@ -3542,29 +3560,28 @@ static void mvpp2_rxq_short_pool_set(struct mvpp2_port *port,
> mvpp2_write(port->priv, MVPP2_RXQ_CONFIG_REG(prxq), val);
>  }
>
> -/* Allocate skb for BM pool */
> -static struct sk_buff *mvpp2_skb_alloc(struct mvpp2_port *port,
> -  struct mvpp2_bm_pool *bm_pool,
> -  dma_addr_t *buf_phys_addr,
> -  gfp_t gfp_mask)
> +static void *mvpp2_buf_alloc(struct mvpp2_port *port,
> +struct mvpp2_bm_pool *bm_pool,
> +dma_addr_t *buf_phys_addr,
> +gfp_t gfp_mask)
>  {
> -   struct sk_buff *skb;
> dma_addr_t phys_addr;
> +   void *data;
>
> -   skb = __dev_alloc_skb(bm_pool->pkt_size, gfp_mask);
> -   if (!skb)
> +   data = mvpp2_frag_alloc(bm_pool);
> +   if (!data)
> return NULL;
>
> -   phys_addr = dma_map_single(port->dev->dev.parent, skb->head,
> +   phys_addr = dma_map_single(port->dev->dev.parent, data,
>MVPP2_RX_BUF_SIZE(bm_pool->pkt_size),
> DMA_FROM_DEVICE);
> if (unlikely(dma_mapping_error(port->dev->dev.parent, phys_addr))) {
> -   dev_kfree_skb_any(skb);
> +   mvpp2_frag_free(bm_pool, data);
> return NULL;
> }
> *buf_phys_addr = phys_addr;
>
> -   return skb;
> +   return data;
>  }
>
>  /* Set pool number in a BM cookie */
> @@ -3620,9 +3637,9 @@ static void mvpp2_pool_refill(struct mvpp2_port *port, u32 bm,
>  static int mvpp2_bm_bufs_add(struct mvpp2_port *port,
>  struct mvpp2_bm_pool *bm_pool, int buf_num)
>  {
> -   struct sk_buff *skb;
> int i, 

Re: [patch net-next] net/mlx4_en: fix a condition

2017-02-05 Thread Tariq Toukan


On 03/02/2017 11:54 AM, Dan Carpenter wrote:

There is a "||" vs "|" typo here so we test 0x1 instead of 0x6.

Fixes: 1f8176f7352a ("net/mlx4_en: Check the enabling pptx/pprx flags in SET_PORT wrapper flow")
Signed-off-by: Dan Carpenter 

diff --git a/drivers/net/ethernet/mellanox/mlx4/port.c b/drivers/net/ethernet/mellanox/mlx4/port.c
index 5053c949148f..4e36e287d605 100644
--- a/drivers/net/ethernet/mellanox/mlx4/port.c
+++ b/drivers/net/ethernet/mellanox/mlx4/port.c
@@ -1395,7 +1395,7 @@ static int mlx4_common_set_port(struct mlx4_dev *dev, int slave, u32 in_mod,
  gen_context);
  
  			if (gen_context->flags &

-   (MLX4_FLAG_V_PPRX_MASK || MLX4_FLAG_V_PPTX_MASK))
+   (MLX4_FLAG_V_PPRX_MASK | MLX4_FLAG_V_PPTX_MASK))
mlx4_en_set_port_global_pause(dev, slave,
  gen_context);
  

Reviewed-by: Tariq Toukan 
Thanks Dan.
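
To see why the original test was wrong, the two expressions can be
compared in a small userspace sketch. The function names are
hypothetical, and the mask values used in the usage note (0x2 and 0x4)
are an assumption chosen to match the combined 0x6 mentioned in the
commit message; the real definitions live in the mlx4 headers.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy form: (rx_mask || tx_mask) is a logical OR, which evaluates
 * to 1 whenever either mask is non-zero, so only bit 0 of flags is
 * actually tested.
 */
int pause_flags_set_buggy(uint32_t flags, uint32_t rx_mask, uint32_t tx_mask)
{
	return (flags & (rx_mask || tx_mask)) != 0;
}

/* Fixed form: the bitwise OR combines both masks, so either flag bit
 * being set makes the test true.
 */
int pause_flags_set_fixed(uint32_t flags, uint32_t rx_mask, uint32_t tx_mask)
{
	return (flags & (rx_mask | tx_mask)) != 0;
}
```

With masks 0x2 and 0x4, flags of 0x2 or 0x4 are missed by the buggy
form and caught by the fixed one, while flags of 0x1 wrongly pass the
buggy test.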