[PATCH net] net: pktgen: fix pkt_size

2016-09-29 Thread Paolo Abeni
The commit 879c7220e828 ("net: pktgen: Observe needed_headroom
of the device") increased the 'pkt_overhead' field value by
LL_RESERVED_SPACE.
As a side effect the generated packet size, computed as:

/* Eth + IPh + UDPh + mpls */
datalen = pkt_dev->cur_pkt_size - 14 - 20 - 8 -
  pkt_dev->pkt_overhead;

is decreased by the same value.
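For example, on a plain Ethernet device with no extra needed_headroom,
LL_RESERVED_SPACE() evaluates to 16, so a configured pkt_size of 300
(no MPLS/VLAN) used to produce a UDP payload of 300 - 14 - 20 - 8 = 258
bytes, but only 258 - 16 = 242 bytes after the above commit.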
The above slightly changed the behavior of existing pktgen users
and made the interface somewhat inconsistent.
Fix it by restoring the previous pkt_overhead value and using
LL_RESERVED_SPACE as extralen in skb allocation.
Also, change pktgen_alloc_skb() to reserve only part of the headroom,
so that the caller can still prefetch from the start of the link-layer
header.

Fixes: 879c7220e828 ("net: pktgen: Observe needed_headroom of the device")
Suggested-by: Ben Greear 
Signed-off-by: Paolo Abeni 
---
 net/core/pktgen.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index bbd118b..5c9b397 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -2286,7 +2286,7 @@ out:
 
 static inline void set_pkt_overhead(struct pktgen_dev *pkt_dev)
 {
-   pkt_dev->pkt_overhead = LL_RESERVED_SPACE(pkt_dev->odev);
+   pkt_dev->pkt_overhead = 0;
pkt_dev->pkt_overhead += pkt_dev->nr_labels*sizeof(u32);
pkt_dev->pkt_overhead += VLAN_TAG_SIZE(pkt_dev);
pkt_dev->pkt_overhead += SVLAN_TAG_SIZE(pkt_dev);
@@ -2777,13 +2777,13 @@ static void pktgen_finalize_skb(struct pktgen_dev *pkt_dev, struct sk_buff *skb,
 }
 
 static struct sk_buff *pktgen_alloc_skb(struct net_device *dev,
-   struct pktgen_dev *pkt_dev,
-   unsigned int extralen)
+   struct pktgen_dev *pkt_dev)
 {
+   unsigned int extralen = LL_RESERVED_SPACE(dev);
struct sk_buff *skb = NULL;
-   unsigned int size = pkt_dev->cur_pkt_size + 64 + extralen +
-   pkt_dev->pkt_overhead;
+   unsigned int size;
 
+   size = pkt_dev->cur_pkt_size + 64 + extralen + pkt_dev->pkt_overhead;
if (pkt_dev->flags & F_NODE) {
int node = pkt_dev->node >= 0 ? pkt_dev->node : numa_node_id();
 
@@ -2796,8 +2796,9 @@ static struct sk_buff *pktgen_alloc_skb(struct net_device *dev,
 skb = __netdev_alloc_skb(dev, size, GFP_NOWAIT);
}
 
+   /* the caller prefetches from skb->data and reserves for the mac hdr */
if (likely(skb))
-   skb_reserve(skb, LL_RESERVED_SPACE(dev));
+   skb_reserve(skb, extralen - 16);
 
return skb;
 }
@@ -2830,16 +2831,14 @@ static struct sk_buff *fill_packet_ipv4(struct net_device *odev,
mod_cur_headers(pkt_dev);
queue_map = pkt_dev->cur_queue_map;
 
-   datalen = (odev->hard_header_len + 16) & ~0xf;
-
-   skb = pktgen_alloc_skb(odev, pkt_dev, datalen);
+   skb = pktgen_alloc_skb(odev, pkt_dev);
if (!skb) {
sprintf(pkt_dev->result, "No memory");
return NULL;
}
 
prefetchw(skb->data);
-   skb_reserve(skb, datalen);
+   skb_reserve(skb, 16);
 
/*  Reserve for ethernet and IP header  */
eth = (__u8 *) skb_push(skb, 14);
@@ -2959,7 +2958,7 @@ static struct sk_buff *fill_packet_ipv6(struct net_device *odev,
mod_cur_headers(pkt_dev);
queue_map = pkt_dev->cur_queue_map;
 
-   skb = pktgen_alloc_skb(odev, pkt_dev, 16);
+   skb = pktgen_alloc_skb(odev, pkt_dev);
if (!skb) {
sprintf(pkt_dev->result, "No memory");
return NULL;
-- 
1.8.3.1



Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket

2016-09-29 Thread Eric Dumazet
On Thu, 2016-09-29 at 10:08 -0400, Tom Herbert wrote:

> It addresses  the issue that Rick Jones pointed out was happening with
> XPS. When packets are sent for a flow that has no socket and XPS is
> enabled then each packet uses the XPS queue based on the running CPU.
> Since the thread sending on a flow can be rescheduled on different
> CPUs this is creating ooo packets. In this case the ooo is being
> caused by interaction with XPS.
> 

Nope, your patch does not address the problem properly.

I am not sure I want to spend more time explaining the issue.

Let's talk about this in Tokyo next week.




Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket

2016-09-29 Thread Tom Herbert
On Thu, Sep 29, 2016 at 12:54 AM, Eric Dumazet  wrote:
> On Wed, 2016-09-28 at 20:54 -0700, Tom Herbert wrote:
>> xps_flows maintains a per device flow table that is indexed by the
>> skbuff hash. The table is only consulted when there is no queue saved in
>> a transmit socket for an skbuff.
>>
>> Each entry in the flow table contains a queue index and a queue
>> pointer. The queue pointer is set when a queue is chosen using a
>> flow table entry. This pointer is set to the head pointer in the
>> transmit queue (which is maintained by BQL).
>>
>> The new function get_xps_flows_index that looks up flows in the
>> xps_flows table. The entry returned gives the last queue a matching flow
>> used. The returned queue is compared against the normal XPS queue. If
>> they are different, then we only switch if the tail pointer in the TX
>> queue has advanced past the pointer saved in the entry. In this
>> way OOO should be avoided when XPS wants to use a different queue.
>
> There is something I do not understand with this.
>
> If this OOO avoidance is tied to BQL, it means that packets sitting in a
> qdisc wont be part of the detection.
>
> So packets of flow X could possibly be queued on multiple qdiscs.
>
Hi Eric,

I'm not sure I understand your concern. If packets for flow X can be
queued on multiple qdiscs, wouldn't that mean the flow will be subject
to ooo transmission regardless of this patch? In other words, here
we're trying to keep packets for the flow in order as they are emitted
from the qdiscs, which we assume maintain the ordering of packets
within a flow.

Tom

>
>


Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket

2016-09-29 Thread Eric Dumazet
On Thu, 2016-09-29 at 08:53 -0400, Tom Herbert wrote:
> On Thu, Sep 29, 2016 at 12:54 AM, Eric Dumazet  wrote:
> > On Wed, 2016-09-28 at 20:54 -0700, Tom Herbert wrote:
> >> xps_flows maintains a per device flow table that is indexed by the
> >> skbuff hash. The table is only consulted when there is no queue saved in
> >> a transmit socket for an skbuff.
> >>
> >> Each entry in the flow table contains a queue index and a queue
> >> pointer. The queue pointer is set when a queue is chosen using a
> >> flow table entry. This pointer is set to the head pointer in the
> >> transmit queue (which is maintained by BQL).
> >>
> >> The new function get_xps_flows_index that looks up flows in the
> >> xps_flows table. The entry returned gives the last queue a matching flow
> >> used. The returned queue is compared against the normal XPS queue. If
> >> they are different, then we only switch if the tail pointer in the TX
> >> queue has advanced past the pointer saved in the entry. In this
> >> way OOO should be avoided when XPS wants to use a different queue.
> >
> > There is something I do not understand with this.
> >
> > If this OOO avoidance is tied to BQL, it means that packets sitting in a
> > qdisc wont be part of the detection.
> >
> > So packets of flow X could possibly be queued on multiple qdiscs.
> >
> Hi Eric,
> 
> I'm not sure I understand your concern. If packets for flow X can be
> queued on multiple qdiscs wouldn't that mean that flow will be subject
> to ooo transmission regardless of this patch? In other words here
> we're trying to keep packets for the flow in order as they are emitted
> from the qdiscs which we assume maintains ordering of packets in a
> flow.

Well, then what is this patch series solving?

You have a producer of packets running on 8 vcpus in a VM.

Packets are exiting the VM and need to be queued on a mq NIC in the
hypervisor.

Flow X can be scheduled on any of these 8 vcpus, so XPS is currently
selecting different TXQ.

Your patch aims to detect this and steer packets to one TXQ, not by
using a static hash() % num_real_queues (as RSS would have done on a NIC
at RX), but by trying to please XPS (i.e. selecting a queue close to the
CPU producing the packet; a VM with one vcpu, and cpu bound, would
really be happy not to spread packets all over the queues).

Your mechanism relies on counters updated at the time packets are given
to the NIC, not at the time packets are given to the txq (and eventually
sit for a while in the qdisc right before BQL layer)

dev_queue_xmit() is not talking to the device directly...


All I am saying is that the producer/consumer counters you added are not
at the right place.

You tried hard to not change the drivers, by adding something to
existing BQL. But packets can sit in a qdisc for a while...

So you added 2 potential cache line misses per outgoing packet, but
with no guarantee that OOO is really avoided.





Re: [PATCH net-next v3 2/3] udp: implement memory accounting helpers

2016-09-29 Thread Eric Dumazet
On Thu, 2016-09-29 at 11:31 +0200, Paolo Abeni wrote:
> On Wed, 2016-09-28 at 18:42 -0700, Eric Dumazet wrote:
> > On Wed, 2016-09-28 at 12:52 +0200, Paolo Abeni wrote:
> > 
> > > +static void udp_rmem_release(struct sock *sk, int partial)
> > > +{
> > > + struct udp_sock *up = udp_sk(sk);
> > > + int fwd, amt;
> > > +
> > > + if (partial && !udp_under_memory_pressure(sk))
> > > + return;
> > > +
> > > + /* we can have concurrent release; if we catch any conflict
> > > +  * we let only one of them do the work
> > > +  */
> > > + if (atomic_dec_if_positive(&up->can_reclaim) < 0)
> > > + return;
> > > +
> > > + fwd = __udp_forward(up, atomic_read(&sk->sk_rmem_alloc));
> > > + if (fwd < SK_MEM_QUANTUM + partial) {
> > > + atomic_inc(&up->can_reclaim);
> > > + return;
> > > + }
> > > +
> > > + amt = (fwd - partial) & ~(SK_MEM_QUANTUM - 1);
> > > + atomic_sub(amt, &up->mem_allocated);
> > > + atomic_inc(&up->can_reclaim);
> > > +
> > > + __sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
> > > + sk->sk_forward_alloc = fwd - amt;
> > > +}
> > 
> > 
> > This is racy... all these atomics make me nervous...
> 
> Ah, perhaps I got it: if we have a concurrent memory scheduling, we
> could end up with a value of mem_allocated below the real need. 
> 
> That mismatch will not drift: at worst we can end up with mem_allocated
> being single SK_MEM_QUANTUM below what is strictly needed.
> 
> A possible alternative could be:
> 
> static void udp_rmem_release(struct sock *sk, int partial)
> {
>   struct udp_sock *up = udp_sk(sk);
>   int fwd, amt, alloc_old, alloc;
> 
>   if (partial && !udp_under_memory_pressure(sk))
>   return;
> 
>   alloc = atomic_read(&up->mem_allocated);
>   fwd = alloc - atomic_read(&sk->sk_rmem_alloc);
>   if (fwd < SK_MEM_QUANTUM + partial)
>   return;
> 
>   amt = (fwd - partial) & ~(SK_MEM_QUANTUM - 1);
>   alloc_old = atomic_cmpxchg(&up->mem_allocated, alloc, alloc - amt);
>   /* if a concurrent update is detected, just do nothing; if said update
>* is due to another memory release, that release take care of
>* reclaiming the memory for us, too.
>* Otherwise we will be able to release on later dequeue, since
>* we will eventually stop colliding with the writer when it will
>* consume all the fwd allocated memory
>*/
>   if (alloc_old != alloc)
>   return;
> 
>   __sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
>   sk->sk_forward_alloc = fwd - amt;

Can still be done from multiple cpus.

Add some ndelay() or udelay() before to simulate fact that current cpu
could be interrupted by an NMI handler (perf for example)... or hard IRQ
handler...

Then make sure your tests involve 16 concurrent cpus dealing with one
udp socket...

> }
> 
> which is even more lazy in reclaiming but should never underestimate the
> needed forward allocation, and under pressure should eventually free the
> needed memory.


If this code is rarely used, why don't you simply use a real spinlock,
so that we do not have to worry about all this ?

A spinlock acquisition/release is a _single_ locked operation, faster
than the 3 atomics you have in the last version.
Spinlock code (ticket or MCS) avoids starvation.

Then, you can safely update multiple fields in the socket.

And you get nice lockdep support as a bonus.

cmpxchg() is fine when a single field needs exclusion. But there you
have multiple fields to update at once:

sk_memory_allocated_add() and sk_memory_allocated_sub() can work using 
atomic_long_add_return() and atomic_long_sub() because their caller owns
the socket lock and can safely update sk->sk_forward_alloc without
additional locking, but UDP won't have this luxury after your patches.
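
A minimal sketch of the spinlock variant, purely to illustrate the idea:
it assumes udp_sock gains a 'reclaim_lock' spinlock next to the
'mem_allocated' counter introduced by this series (both names are
illustrative, this is not the actual proposal):

static void udp_rmem_release(struct sock *sk, int partial)
{
	struct udp_sock *up = udp_sk(sk);
	int fwd, amt;

	if (partial && !udp_under_memory_pressure(sk))
		return;

	/* serialize reclaim; every field below is updated under the lock */
	spin_lock_bh(&up->reclaim_lock);
	fwd = atomic_read(&up->mem_allocated) - atomic_read(&sk->sk_rmem_alloc);
	if (fwd >= SK_MEM_QUANTUM + partial) {
		amt = (fwd - partial) & ~(SK_MEM_QUANTUM - 1);
		atomic_sub(amt, &up->mem_allocated);
		__sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
		sk->sk_forward_alloc = fwd - amt;
	}
	spin_unlock_bh(&up->reclaim_lock);
}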





[NetDev] [ANNOUNCE] Netdev 1.2 program announced

2016-09-29 Thread Hajime Tazaki

Hello,

The program of Linux netdev 1.2 is now online !

http://netdevconf.org/1.2/schedule.html

As we stated, all registered people are now on the mailing
list, people at netdevconf.org. If you wish to communicate
with the attendees of netdev 1.2, feel free to exchange
messages.


-- Hajime



Re: [PATCH net-next v3 2/3] udp: implement memory accounting helpers

2016-09-29 Thread Eric Dumazet
On Thu, Sep 29, 2016 at 7:34 AM, Paolo Abeni  wrote:
> On Thu, 2016-09-29 at 07:13 -0700, Eric Dumazet wrote:
>> On Thu, 2016-09-29 at 16:01 +0200, Paolo Abeni wrote:
>>
>> > When we reach __sk_mem_reduce_allocated() we are sure we can free the
>> > specified amount of memory, so we only need to ensure consistent
>> > sk_prot->memory_allocated updates. The current atomic operation suffices
>> > to this.
>>
>> Then why are you updating sk->sk_forward_alloc using racy operations ?
>>
>> If this is not needed or racy, do not do it.
>
> Thank you for all the feedback.
>
> The actual forward allocated memory value is:
>
> atomic_read(&up->mem_allocated) - atomic_read(&sk->sk_rmem_alloc).
>
> sk_forward_alloc is updated only to hint to the user space the forward
> allocated memory value via the diag interface.
>
> If such information is not needed we can drop the update, and
> sk_forward_alloc will always be seen as 0 even when the socket has some
> forward allocation.

The information is needed and we want an accurate one, really.

Think about debugging on a live server, some stuck or mad sockets ;)

Please consider adding a proper accessor, able to deal with the UDP
peculiarities.


int sk_forward_alloc_get(const struct sock *sk)
{
if (sk is not UDP)
   return sk->sk_forward_alloc;

   return the precise amount using the private fields that UDP maintains,
}
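
A minimal concrete sketch of such an accessor, assuming the
'mem_allocated' counter from this series; the exact UDP check and the
function name are illustrative only:

int sk_forward_alloc_get(struct sock *sk)
{
	/* non-UDP sockets keep using the plain field */
	if (sk->sk_protocol != IPPROTO_UDP)
		return sk->sk_forward_alloc;

	/* UDP: derive the value from the private counters it maintains */
	return atomic_read(&udp_sk(sk)->mem_allocated) -
	       atomic_read(&sk->sk_rmem_alloc);
}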


Then use this accessor in these places :

net/ipv4/af_inet.c:155: WARN_ON(sk->sk_forward_alloc);
net/core/sock_diag.c:66:mem[SK_MEMINFO_FWD_ALLOC] = sk->sk_forward_alloc;
net/ipv4/inet_diag.c:191:   .idiag_fmem = sk->sk_forward_alloc,
net/sched/em_meta.c:462:dst->value = sk->sk_forward_alloc;

Thanks.


Re: [PATCH net-next v3 2/3] udp: implement memory accounting helpers

2016-09-29 Thread Eric Dumazet
On Thu, 2016-09-29 at 16:01 +0200, Paolo Abeni wrote:

> When we reach __sk_mem_reduce_allocated() we are sure we can free the
> specified amount of memory, so we only need to ensure consistent
> sk_prot->memory_allocated updates. The current atomic operation suffices
> to this.

Then why are you updating sk->sk_forward_alloc using racy operations ?

If this is not needed or racy, do not do it.

So that we can remove this annoying thing that a dynamic checker could
very well detect.





Re: [PATCH RFC iproute2] iplink: Support envhdrlen

2016-09-29 Thread Jiri Benc
On Tue, 27 Sep 2016 17:55:40 +0900, Toshiaki Makita wrote:
> This adds support for envhdrlen.
> 
> Example:
>  # ip link set eno1 envhdrlen 8

I don't see why this should be user visible, let alone requiring user
to set it. This should be transparent, kernel should compute the value
as needed based on the configuration and set it up. Requiring the
administrator to pick up a calculator and sum up all the vlan, mpls and
whatever header lengths is silly.

I realize that we currently have no easy way to do that. Especially
with lwtunnels and stuff like MPLS where we don't easily know the
number of tags. But every uAPI we introduce will have to be supported
forever and going a particular way just because it is easy to implement
is not sustainable.

At the very least, it should be configurable from the other direction.
I.e. telling which interfaces can be used by vlans or MPLS (if it
cannot be inferred automatically) and configuring maximum number of
tags on the given vlan/mpls/whatever interface/route/whatever.

 Jiri


Re: [PATCH net-next v3 2/3] udp: implement memory accounting helpers

2016-09-29 Thread Paolo Abeni
On Thu, 2016-09-29 at 07:13 -0700, Eric Dumazet wrote:
> On Thu, 2016-09-29 at 16:01 +0200, Paolo Abeni wrote:
> 
> > When we reach __sk_mem_reduce_allocated() we are sure we can free the
> > specified amount of memory, so we only need to ensure consistent
> > sk_prot->memory_allocated updates. The current atomic operation suffices
> > to this.
> 
> Then why are you updating sk->sk_forward_alloc using racy operations ?
> 
> If this is not needed or racy, do not do it.

Thank you for all the feedback.

The actual forward allocated memory value is:

atomic_read(&up->mem_allocated) - atomic_read(&sk->sk_rmem_alloc).

sk_forward_alloc is updated only to hint the forward allocated memory
value to user space via the diag interface.

If such information is not needed we can drop the update, and
sk_forward_alloc will always be seen as 0 even when the socket has some
forward allocation.

Cheers,

Paolo







Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket

2016-09-29 Thread Tom Herbert
On Thu, Sep 29, 2016 at 9:18 AM, Eric Dumazet  wrote:
> On Thu, 2016-09-29 at 08:53 -0400, Tom Herbert wrote:
>> On Thu, Sep 29, 2016 at 12:54 AM, Eric Dumazet  
>> wrote:
>> > On Wed, 2016-09-28 at 20:54 -0700, Tom Herbert wrote:
>> >> xps_flows maintains a per device flow table that is indexed by the
>> >> skbuff hash. The table is only consulted when there is no queue saved in
>> >> a transmit socket for an skbuff.
>> >>
>> >> Each entry in the flow table contains a queue index and a queue
>> >> pointer. The queue pointer is set when a queue is chosen using a
>> >> flow table entry. This pointer is set to the head pointer in the
>> >> transmit queue (which is maintained by BQL).
>> >>
>> >> The new function get_xps_flows_index that looks up flows in the
>> >> xps_flows table. The entry returned gives the last queue a matching flow
>> >> used. The returned queue is compared against the normal XPS queue. If
>> >> they are different, then we only switch if the tail pointer in the TX
>> >> queue has advanced past the pointer saved in the entry. In this
>> >> way OOO should be avoided when XPS wants to use a different queue.
>> >
>> > There is something I do not understand with this.
>> >
>> > If this OOO avoidance is tied to BQL, it means that packets sitting in a
>> > qdisc wont be part of the detection.
>> >
>> > So packets of flow X could possibly be queued on multiple qdiscs.
>> >
>> Hi Eric,
>>
>> I'm not sure I understand your concern. If packets for flow X can be
>> queued on multiple qdiscs wouldn't that mean that flow will be subject
>> to ooo transmission regardless of this patch? In other words here
>> we're trying to keep packets for the flow in order as they are emitted
>> from the qdiscs which we assume maintains ordering of packets in a
>> flow.
>
> Well, then what this patch series is solving ?
>
It addresses  the issue that Rick Jones pointed out was happening with
XPS. When packets are sent for a flow that has no socket and XPS is
enabled then each packet uses the XPS queue based on the running CPU.
Since the thread sending on a flow can be rescheduled on different
CPUs this is creating ooo packets. In this case the ooo is being
caused by interaction with XPS.

> You have a producer of packets running on 8 vcpus in a VM.
>
> Packets are exiting the VM and need to be queued on a mq NIC in the
> hypervisor.
>
> Flow X can be scheduled on any of these 8 vcpus, so XPS is currently
> selecting different TXQ.
>
> Your patch aims to detect this and steer packets to one TXQ, but not
> using a static hash() % num_real_queues (as RSS would have done on a NIC
> at RX), but trying to please XPS (as : selecting a queue close to the
> CPU producing the packet. VM with one vcpu, and cpu bounded, would
> really be happy to not spread packets all over the queues)
>
You're not describing this patch, you're describing how XPS works in
the first place. This patch is done to make socketless packets work
well with XPS. If all you want is a static hash() % num_real_queues
then just disable XPS to get that.

> Your mechanism relies on counters updated at the time packets are given
> to the NIC, not at the time packets are given to the txq (and eventually
> sit for a while in the qdisc right before BQL layer)
>
> dev_queue_xmit() is not talking to the device directly...
>
>
> All I am saying is that the producer/consumer counters you added are not
> at the right place.
>
> You tried hard to not change the drivers, by adding something to
> existing BQL. But packets can sit in a qdisc for a while...
>
But again, this patch is to prevent ooo being caused by an interaction
with XPS. It does not address ooo being caused by qdiscs or VMs, but
then I don't see that as being the issue reported by Rick.

> So you added 2 potential cache lines misses per outgoing packet, but
> with no guarantee that OOO are really avoided.
>
>
>


Re: [PATCH net-next v3 2/3] udp: implement memory accounting helpers

2016-09-29 Thread Paolo Abeni
On Thu, 2016-09-29 at 06:24 -0700, Eric Dumazet wrote:
> On Thu, 2016-09-29 at 11:31 +0200, Paolo Abeni wrote:
> > On Wed, 2016-09-28 at 18:42 -0700, Eric Dumazet wrote:
> > > On Wed, 2016-09-28 at 12:52 +0200, Paolo Abeni wrote:
> > > 
> > > > +static void udp_rmem_release(struct sock *sk, int partial)
> > > > +{
> > > > +   struct udp_sock *up = udp_sk(sk);
> > > > +   int fwd, amt;
> > > > +
> > > > +   if (partial && !udp_under_memory_pressure(sk))
> > > > +   return;
> > > > +
> > > > +   /* we can have concurrent release; if we catch any conflict
> > > > +* we let only one of them do the work
> > > > +*/
> > > > +   if (atomic_dec_if_positive(&up->can_reclaim) < 0)
> > > > +   return;
> > > > +
> > > > +   fwd = __udp_forward(up, atomic_read(&sk->sk_rmem_alloc));
> > > > +   if (fwd < SK_MEM_QUANTUM + partial) {
> > > > +   atomic_inc(&up->can_reclaim);
> > > > +   return;
> > > > +   }
> > > > +
> > > > +   amt = (fwd - partial) & ~(SK_MEM_QUANTUM - 1);
> > > > +   atomic_sub(amt, &up->mem_allocated);
> > > > +   atomic_inc(&up->can_reclaim);
> > > > +
> > > > +   __sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
> > > > +   sk->sk_forward_alloc = fwd - amt;
> > > > +}
> > > 
> > > 
> > > This is racy... all these atomics make me nervous...
> > 
> > Ah, perhaps I got it: if we have a concurrent memory scheduling, we
> > could end up with a value of mem_allocated below the real need. 
> > 
> > That mismatch will not drift: at worst we can end up with mem_allocated
> > being single SK_MEM_QUANTUM below what is strictly needed.
> > 
> > A possible alternative could be:
> > 
> > static void udp_rmem_release(struct sock *sk, int partial)
> > {
> > struct udp_sock *up = udp_sk(sk);
> > int fwd, amt, alloc_old, alloc;
> > 
> > if (partial && !udp_under_memory_pressure(sk))
> > return;
> > 
> > alloc = atomic_read(&up->mem_allocated);
> > fwd = alloc - atomic_read(&sk->sk_rmem_alloc);
> > if (fwd < SK_MEM_QUANTUM + partial)
> > return;
> > 
> > amt = (fwd - partial) & ~(SK_MEM_QUANTUM - 1);
> > alloc_old = atomic_cmpxchg(&up->mem_allocated, alloc, alloc - amt);
> > /* if a concurrent update is detected, just do nothing; if said update
> >  * is due to another memory release, that release take care of
> >  * reclaiming the memory for us, too.
> >  * Otherwise we will be able to release on later dequeue, since
> >  * we will eventually stop colliding with the writer when it will
> >  * consume all the fwd allocated memory
> >  */
> > if (alloc_old != alloc)
> > return;
> > 
> > __sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
> > sk->sk_forward_alloc = fwd - amt;
> 
> Can still be done from multiple cpus.
> 
> Add some ndelay() or udelay() before to simulate fact that current cpu
> could be interrupted by an NMI handler (perf for example)... or hard IRQ
> handler...
> 
> Then make sure your tests involve 16 concurrent cpus dealing with one
> udp socket...

Thank you again for reviewing this.

I'm working on this sort of test right now.

> 
> > }
> > 
> > which is even more lazy in reclaiming but should never underestimate the
> > needed forward allocation, and under pressure should eventually free the
> > needed memory.
> 
> 
> If this code is rarely used, why don't you simply use a real spinlock,
> so that we do not have to worry about all this ?
> 
> A spinlock  acquisition/release is a _single_ locked operation.
> Faster than the 3 atomic you got in last version.
> spinlock code (ticket or MCS) avoids starvation.

I'd like to avoid adding a lock, if possible, to avoid any possible
source of contention.

> Then, you can safely update multiple fields in the socket.
> 
> And you get nice lockdep support as a bonus.
> 
> cmpxchg() is fine when a single field need an exclusion. But there you
> have multiple fields to update at once :
> 
> sk_memory_allocated_add() and sk_memory_allocated_sub() can work using 
> atomic_long_add_return() and atomic_long_sub() because their caller owns
> the socket lock and can safely update sk->sk_forward_alloc without
> additional locking, but UDP wont have this luxury after your patches.

When we reach __sk_mem_reduce_allocated() we are sure we can free the
specified amount of memory, so we only need to ensure consistent
sk_prot->memory_allocated updates. The current atomic operation suffices
for this.

Paolo




Re: [PATCH net-next v3 2/3] udp: implement memory accounting helpers

2016-09-29 Thread Paolo Abeni
On Thu, 2016-09-29 at 07:49 -0700, Eric Dumazet wrote:
> On Thu, Sep 29, 2016 at 7:34 AM, Paolo Abeni  wrote:
> > On Thu, 2016-09-29 at 07:13 -0700, Eric Dumazet wrote:
> >> On Thu, 2016-09-29 at 16:01 +0200, Paolo Abeni wrote:
> >>
> >> > When we reach __sk_mem_reduce_allocated() we are sure we can free the
> >> > specified amount of memory, so we only need to ensure consistent
> >> > sk_prot->memory_allocated updates. The current atomic operation suffices
> >> > to this.
> >>
> >> Then why are you updating sk->sk_forward_alloc using racy operations ?
> >>
> >> If this is not needed or racy, do not do it.
> >
> > Thank you for all the feedback.
> >
> > The actual forward allocated memory value is:
> >
> > atomic_read(&up->mem_allocated) - atomic_read(&sk->sk_rmem_alloc).
> >
> > sk_forward_alloc is updated only to hint to the user space the forward
> > allocated memory value via the diag interface.
> >
> > If such information is not needed we can drop the update, and
> > sk_forward_alloc will always be seen as 0 even when the socket has some
> > forward allocation.
> 
> The information is needed and we want an accurate one, really.
> 
> Think about debugging on a live server, some stuck or mad sockets ;)
> 
> Please consider adding a proper accessor, able to deal with the UDP
> peculiarities.

Nice suggestion, thanks! I'll try that in v4. Perhaps it's worth adding a
struct proto helper for this?

I'm sorry, but I won't be in Tokyo, so I'll probably produce some
traffic on netdev in the meantime.

Cheers,

Paolo



pull-request: wireless-drivers-next 2016-09-29

2016-09-29 Thread Kalle Valo
Hi Dave,

this should be the last wireless-drivers-next pull request for 4.9, from
now on only important bugfixes. Nothing really special stands out,
iwlwifi being most active but other drivers also getting attention. More
details in the signed tag. Please let me know if there are any problems.

Or actually I had one problem. While doing a test merge I noticed that
net-next fails to compile for me, but I don't think this is anything
wireless related:

  CC  net/netfilter/core.o
net/netfilter/core.c: In function 'nf_set_hooks_head':
net/netfilter/core.c:96:149: error: 'struct net_device' has no member named 'nf_hooks_ingress'

Kalle

The following changes since commit fd9527f404d51e50f40dac0d9a69f2eff3dac33e:

  Merge branch 'ip_tunnel-collect_md' (2016-09-17 10:13:16 -0400)

are available in the git repository at:


  git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next.git 
tags/wireless-drivers-next-for-davem-2016-09-29

for you to fetch changes up to 15b95a15950238eff4d7f24be1716086eea67835:

  Merge ath-next from 
git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git (2016-09-28 
16:37:33 +0300)



wireless-drivers-next patches for 4.9

Major changes:

iwlwifi

* work for new hardware support continues
* dynamic queue allocation stabilization
* improvements in the MSIx code
* multiqueue support work continues
* new firmware version support (API 26)
* add 8275 series support
* add 9560 series support
* add support for MU-MIMO sniffer
* add support for RRM by scan
* add support for "reverse" rx packet injection faking hw descriptors
* migrate to devm memory allocation handling
* Remove support for older firmwares (API older than -17 and -22)

wl12xx

* support booting the same rootfs with both wl12xx and wl18xx

hostap

* mark the driver as obsolete

ath9k

* disable RNG by default


Amitkumar Karwar (1):
  mwifiex: code rearrangement in mwifiex_usb_host_to_card()

Arend Van Spriel (4):
  brcmfmac: rework pointer trickery in brcmf_proto_bcdc_query_dcmd()
  brcmfmac: fix memory leak in brcmf_flowring_add_tdls_peer()
  brcmfmac: initialize variable in brcmf_sdiod_regrl()
  brcmfmac: remove worker from .ndo_set_mac_address() callback

Arik Nemtsov (1):
  iwlwifi: move BIOS MCC retrieval to common code

Aviya Erenfeld (1):
  iwlwifi: mvm: add support for MU-MIMO air sniffer

Avrahams Stern (1):
  iwlwifi: mvm: Add support for RRM by scan

Ben Greear (3):
  ath10k: fix typo in logging message
  ath10k: document cycle count related counters
  ath10k: support up to 64 vdevs

Bob Copeland (1):
  mwifiex: fix error handling in mwifiex_create_custom_regdomain

Cathy Luo (2):
  mwifiex: fix kernel crash for USB chipsets
  mwifiex: fix race condition causing tx timeout

Chaehyun Lim (1):
  ath6kl: fix return value in ath6kl_wmi_set_pvb_cmd

Colin Ian King (1):
  mwifiex: fix null pointer deference when adapter is null

Emmanuel Grumbach (4):
  iwlwifi: mvm: bump max API to 26
  iwlwifi: don't export trace points that are used in iwlwifi only
  iwlwifi: mvm: fix typo in TC_CMD_SEC_KEY_FROM_TABLE
  iwlwifi: mvm: initialise ADD_STA before sending it to the firmware

Ganapathi Bhat (1):
  mwifiex: cfg80211 set_default_mgmt_key handler

Haim Dreyfuss (4):
  iwlwifi: pcie: Configure shared interrupt vector in MSIX mode
  iwlwifi: pcie: Set affinity mask for rx interrupt vectors per cpu
  iwlwifi: pcie: replace possible_cpus() with online_cpus() in MSIX mode
  iwlwifi: check for valid ethernet address provided by OEM

Hante Meuleman (5):
  brcmfmac: ignore 11d configuration errors
  brcmfmac: remove unnecessary null pointer check
  brcmfmac: fix clearing entry IPv6 address
  brcmfmac: fix out of bound access on clearing wowl wake indicator
  brcmfmac: simplify mapping of auth type

Ido Yariv (1):
  iwlwifi: mvm: Add mem debugfs entry

Jes Sorensen (4):
  rtl8xxxu: Implement 8192e specific power down sequence
  rtl8xxxu: Fix off by one error calculating pubq
  rtl8xxxu: Clean up llt_init() API
  rtl8xxxu: Use a struct rtl8xxxu_fileops * in rtl8xxxu_init_device()

Joe Perches (2):
  ath10k: spelling and miscellaneous neatening
  rtlwifi: Add switch variable to 'switch case not processed' messages

Johannes Berg (11):
  iwlwifi: mvm: make RSS RX more robust
  iwlwifi: mvm: remove pointless _bh from spinlock in timer
  iwlwifi: mvm: tighten BAID range check
  iwlwifi: mvm: compare full command ID
  iwlwifi: mvm: make iwl_mvm_update_sta() an inline
  iwlwifi: mvm: document passing unexpected Block Ack Request frames
  iwlwifi: mvm: move AP-specific code to right function
  iwlwifi: mvm: use LIST_HEAD() macro
  iwlwifi: pcie: use LIST_HEAD() macro
  iwlwifi: pcie: avoid variable 

Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket

2016-09-29 Thread Eric Dumazet
On Thu, 2016-09-29 at 07:51 -0700, Eric Dumazet wrote:
> On Thu, 2016-09-29 at 10:08 -0400, Tom Herbert wrote:
> 
> > It addresses  the issue that Rick Jones pointed out was happening with
> > XPS. When packets are sent for a flow that has no socket and XPS is
> > enabled then each packet uses the XPS queue based on the running CPU.
> > Since the thread sending on a flow can be rescheduled on different
> > CPUs this is creating ooo packets. In this case the ooo is being
> > caused by interaction with XPS.
> > 
> 
> Nope, your patch does not address the problem properly.
> 
> I am not sure I want to spend more time explaining the issue.
> 
> Lets talk about this in Tokyo next week.
> 

Just as a reminder, sorry to bother you, stating some obvious facts for
both of us. We have public exchanges, so we also need to re-explain how
things work.

Queue selection on xmit happens before we hit the qdisc and its delays.

So when you access txq->dql.num_completed_ops and
txq->dql.num_enqueue_ops you can observe values that do not change for a
while.

Say a thread runs on a VM, and sends 2 packets P1, P2 on the same flow
(skb_get_hash() returns the same value for these 2 packets)

P1 is sent on behalf of CPU 1, we pickup queue txq1, and queue the
packet on its qdisc . Transmit does not happen because of some
constraints like rate limiting or scheduling constraints.

P2 is sent on behalf of CPU 2, we pickup queue txq2, notice that prior
packet chose txq1. We check txq1->dql and decide it is fine to use txq2,
since the dql params of txq1 were not changed yet.

( txq->dql.num_completed_ops == ent.queue_ptr )

Note that in RFS case, we have the guarantee that we observe 'live
queues' since they are the per cpu backlog.

So input_queue_head_incr() and input_queue_tail_incr_save() are
correctly doing the OOO prevention, because a queued packet immediately
changes the state.

So really your patch works if you have no qdisc, or a non congested
qdisc. (Think if P1 is dropped by a full pfifo or pfifo_fast : We really
want to avoid steering P2, P3, ..., PN on this full pfifo while maybe
other txq are idle). Strange attractors are back (check commit
9b462d02d6dd6 )

You could avoid (ab)using BQL with a different method, grabbing
skb->destructor for the packets that are socketless

The hash table would simply track the sum of skb->truesize to allow flow
migration. This would be self contained and not intrusive.
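
A self-contained sketch of that alternative, only to make the idea
concrete; the table, its size and the function names are illustrative,
not a proposal for the actual patch:

#include <linux/skbuff.h>
#include <linux/atomic.h>

#define XPS_FLOWS_SIZE 256

struct xps_flow_entry {
	u16	 queue_index;	/* last txq used by this flow */
	atomic_t pending;	/* sum of skb->truesize not yet freed */
};

static struct xps_flow_entry xps_flow_table[XPS_FLOWS_SIZE];

/* runs when the skb is finally freed, i.e. after the NIC consumed it */
static void xps_flow_destructor(struct sk_buff *skb)
{
	struct xps_flow_entry *e;

	e = &xps_flow_table[skb_get_hash(skb) & (XPS_FLOWS_SIZE - 1)];
	atomic_sub(skb->truesize, &e->pending);
}

/* called at queue selection time, for socketless skbs only */
static u16 xps_flow_hold_queue(struct sk_buff *skb, u16 xps_queue)
{
	struct xps_flow_entry *e;

	e = &xps_flow_table[skb_get_hash(skb) & (XPS_FLOWS_SIZE - 1)];

	/* stick to the old queue while this flow still has bytes in flight */
	if (atomic_read(&e->pending) && e->queue_index != xps_queue)
		xps_queue = e->queue_index;

	e->queue_index = xps_queue;
	atomic_add(skb->truesize, &e->pending);
	skb->destructor = xps_flow_destructor;

	return xps_queue;
}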






Re: [PATCH RFC net-next] bnx2x: avoid printing unnecessary messages during register dump

2016-09-29 Thread Guilherme G. Piccoli


On 09/27/2016 11:43 PM, David Miller wrote:
> From: "Guilherme G. Piccoli" 
> Date: Tue, 27 Sep 2016 15:33:54 -0300
> 
>> The bnx2x driver prints multiple error messages during register dump,
>> with "ethtool -d" for example. The driver even warn that many messages
>> might be seen during the register dump, but they are harmless. A typical
>> kernel log after register dump looks like this:
>>
>>   [9.375] bnx2x: [bnx2x_get_regs:987(net0)]Generating register dump. Might 
>> trigger harmless GRC timeouts
>>   [9.439] bnx2x: [bnx2x_attn_int_deasserted3:4342(net0)]LATCHED attention 
>> 0x0400 (masked)
>>   [9.439] bnx2x: [bnx2x_attn_int_deasserted3:4346(net0)]GRC time-out 
>> 0x010580cd
>>   [...]
>>
>> The notation [...] means that some messages were supressed - in our
>> tests we saw 78 more "LATCHED attention" and "GRC time-out" messages,
>> supressed here.
>>
>> This patch avoid these messages to be printed on register dump instead
>> of just warn they are harmless.
>>
>> Signed-off-by: Guilherme G. Piccoli 
> 
> Although "ethtool -d" is really a debugging facility, I still think that
> serious care should be placed into arranging what gets dumped in such
> a way that such access timeouts and errors are minimized.
> 

David, thanks for your comment. I confess I didn't understand your
statement quite well. You're saying we shouldn't dump registers that will
cause timeouts, is that it?

If yes, I guess this is a valid point. We will however lose some debug
information (as you mentioned, 'ethtool -d' is a debug facility). Now,
since I'm no expert in QLogic adapter hw/fw, I want to ask Yuval/Ariel
why those timeouts are hit anyway. Are they completely harmless?

In my understanding/opinion, hiding the messages entirely (as this patch
does) OR avoiding the timeouts by disabling some registers' dump are both
better alternatives than the current behavior of the driver.

Thanks,



Guilherme



[PATCH net-next 06/10] net: dsa: mv88e6xxx: rename mv88e6xxx_vtu_stu_entry

2016-09-29 Thread Vivien Didelot
The STU (if the switch has one) is abstracted and accessed through the
VTU operations and data registers.

Thus rename the mv88e6xxx_vtu_stu_entry struct to mv88e6xxx_vtu_entry.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx/chip.c  | 46 +--
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h |  5 +---
 2 files changed, 24 insertions(+), 27 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 9056d9e..d805661 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -1265,7 +1265,7 @@ static int _mv88e6xxx_vtu_stu_flush(struct mv88e6xxx_chip 
*chip)
 }
 
 static int _mv88e6xxx_vtu_stu_data_read(struct mv88e6xxx_chip *chip,
-   struct mv88e6xxx_vtu_stu_entry *entry,
+   struct mv88e6xxx_vtu_entry *entry,
unsigned int nibble_offset)
 {
u16 regs[3];
@@ -1290,19 +1290,19 @@ static int _mv88e6xxx_vtu_stu_data_read(struct 
mv88e6xxx_chip *chip,
 }
 
 static int mv88e6xxx_vtu_data_read(struct mv88e6xxx_chip *chip,
-  struct mv88e6xxx_vtu_stu_entry *entry)
+  struct mv88e6xxx_vtu_entry *entry)
 {
return _mv88e6xxx_vtu_stu_data_read(chip, entry, 0);
 }
 
 static int mv88e6xxx_stu_data_read(struct mv88e6xxx_chip *chip,
-  struct mv88e6xxx_vtu_stu_entry *entry)
+  struct mv88e6xxx_vtu_entry *entry)
 {
return _mv88e6xxx_vtu_stu_data_read(chip, entry, 2);
 }
 
 static int _mv88e6xxx_vtu_stu_data_write(struct mv88e6xxx_chip *chip,
-struct mv88e6xxx_vtu_stu_entry *entry,
+struct mv88e6xxx_vtu_entry *entry,
 unsigned int nibble_offset)
 {
u16 regs[3] = { 0 };
@@ -1327,13 +1327,13 @@ static int _mv88e6xxx_vtu_stu_data_write(struct 
mv88e6xxx_chip *chip,
 }
 
 static int mv88e6xxx_vtu_data_write(struct mv88e6xxx_chip *chip,
-   struct mv88e6xxx_vtu_stu_entry *entry)
+   struct mv88e6xxx_vtu_entry *entry)
 {
return _mv88e6xxx_vtu_stu_data_write(chip, entry, 0);
 }
 
 static int mv88e6xxx_stu_data_write(struct mv88e6xxx_chip *chip,
-   struct mv88e6xxx_vtu_stu_entry *entry)
+   struct mv88e6xxx_vtu_entry *entry)
 {
return _mv88e6xxx_vtu_stu_data_write(chip, entry, 2);
 }
@@ -1345,9 +1345,9 @@ static int _mv88e6xxx_vtu_vid_write(struct mv88e6xxx_chip 
*chip, u16 vid)
 }
 
 static int _mv88e6xxx_vtu_getnext(struct mv88e6xxx_chip *chip,
- struct mv88e6xxx_vtu_stu_entry *entry)
+ struct mv88e6xxx_vtu_entry *entry)
 {
-   struct mv88e6xxx_vtu_stu_entry next = { 0 };
+   struct mv88e6xxx_vtu_entry next = { 0 };
u16 val;
int err;
 
@@ -1407,7 +1407,7 @@ static int mv88e6xxx_port_vlan_dump(struct dsa_switch 
*ds, int port,
int (*cb)(struct switchdev_obj *obj))
 {
struct mv88e6xxx_chip *chip = ds->priv;
-   struct mv88e6xxx_vtu_stu_entry next;
+   struct mv88e6xxx_vtu_entry next;
u16 pvid;
int err;
 
@@ -1458,7 +1458,7 @@ unlock:
 }
 
 static int _mv88e6xxx_vtu_loadpurge(struct mv88e6xxx_chip *chip,
-   struct mv88e6xxx_vtu_stu_entry *entry)
+   struct mv88e6xxx_vtu_entry *entry)
 {
u16 op = GLOBAL_VTU_OP_VTU_LOAD_PURGE;
u16 reg = 0;
@@ -1507,9 +1507,9 @@ loadpurge:
 }
 
 static int _mv88e6xxx_stu_getnext(struct mv88e6xxx_chip *chip, u8 sid,
- struct mv88e6xxx_vtu_stu_entry *entry)
+ struct mv88e6xxx_vtu_entry *entry)
 {
-   struct mv88e6xxx_vtu_stu_entry next = { 0 };
+   struct mv88e6xxx_vtu_entry next = { 0 };
u16 val;
int err;
 
@@ -1549,7 +1549,7 @@ static int _mv88e6xxx_stu_getnext(struct mv88e6xxx_chip 
*chip, u8 sid,
 }
 
 static int _mv88e6xxx_stu_loadpurge(struct mv88e6xxx_chip *chip,
-   struct mv88e6xxx_vtu_stu_entry *entry)
+   struct mv88e6xxx_vtu_entry *entry)
 {
u16 reg = 0;
int err;
@@ -1652,7 +1652,7 @@ static int _mv88e6xxx_port_fid_set(struct mv88e6xxx_chip 
*chip,
 static int _mv88e6xxx_fid_new(struct mv88e6xxx_chip *chip, u16 *fid)
 {
DECLARE_BITMAP(fid_bitmap, MV88E6XXX_N_FID);
-   struct mv88e6xxx_vtu_stu_entry vlan;
+   struct mv88e6xxx_vtu_entry vlan;
int i, err;
 
bitmap_zero(fid_bitmap, MV88E6XXX_N_FID);
@@ -1694,10 +1694,10 @@ static int _mv88e6xxx_fid_new(struct mv88e6xxx_chip 
*chip, u16 *fid)
 }
 
 

Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket

2016-09-29 Thread Rick Jones

On 09/29/2016 06:18 AM, Eric Dumazet wrote:

Well, then what this patch series is solving ?

You have a producer of packets running on 8 vcpus in a VM.

Packets are exiting the VM and need to be queued on a mq NIC in the
hypervisor.

Flow X can be scheduled on any of these 8 vcpus, so XPS is currently
selecting different TXQ.


Just for completeness, in my testing, the VMs were single-vCPU.

rick jones


[PATCH net-next 08/10] net: dsa: mv88e6xxx: add chip-wide ops

2016-09-29 Thread Vivien Didelot
Introduce a mv88e6xxx_ops structure to describe supported chip-wide
functions and assign the correct variant to the chip models.

For the moment, add only PHY access routines. This allows us to get rid
of the PHY ops structures and the usage of PHY flags.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx/chip.c  | 136 +++---
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h |  28 ---
 2 files changed, 121 insertions(+), 43 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index ad31d3e..83a3769 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -238,10 +238,10 @@ static int mv88e6xxx_phy_read(struct mv88e6xxx_chip 
*chip, int phy,
 {
int addr = phy; /* PHY devices addresses start at 0x0 */
 
-   if (!chip->phy_ops)
+   if (!chip->info->ops->phy_read)
return -EOPNOTSUPP;
 
-   return chip->phy_ops->read(chip, addr, reg, val);
+   return chip->info->ops->phy_read(chip, addr, reg, val);
 }
 
 static int mv88e6xxx_phy_write(struct mv88e6xxx_chip *chip, int phy,
@@ -249,10 +249,10 @@ static int mv88e6xxx_phy_write(struct mv88e6xxx_chip 
*chip, int phy,
 {
int addr = phy; /* PHY devices addresses start at 0x0 */
 
-   if (!chip->phy_ops)
+   if (!chip->info->ops->phy_write)
return -EOPNOTSUPP;
 
-   return chip->phy_ops->write(chip, addr, reg, val);
+   return chip->info->ops->phy_write(chip, addr, reg, val);
 }
 
 static int mv88e6xxx_phy_page_get(struct mv88e6xxx_chip *chip, int phy, u8 
page)
@@ -515,11 +515,6 @@ static int mv88e6xxx_phy_ppu_write(struct mv88e6xxx_chip 
*chip, int addr,
return err;
 }
 
-static const struct mv88e6xxx_bus_ops mv88e6xxx_phy_ppu_ops = {
-   .read = mv88e6xxx_phy_ppu_read,
-   .write = mv88e6xxx_phy_ppu_write,
-};
-
 static bool mv88e6xxx_6065_family(struct mv88e6xxx_chip *chip)
 {
return chip->info->family == MV88E6XXX_FAMILY_6065;
@@ -3214,6 +3209,91 @@ static int mv88e6xxx_set_eeprom(struct dsa_switch *ds,
return err;
 }
 
+static const struct mv88e6xxx_ops mv88e6085_ops = {
+   .phy_read = mv88e6xxx_phy_ppu_read,
+   .phy_write = mv88e6xxx_phy_ppu_write,
+};
+
+static const struct mv88e6xxx_ops mv88e6095_ops = {
+   .phy_read = mv88e6xxx_phy_ppu_read,
+   .phy_write = mv88e6xxx_phy_ppu_write,
+};
+
+static const struct mv88e6xxx_ops mv88e6123_ops = {
+   .phy_read = mv88e6xxx_read,
+   .phy_write = mv88e6xxx_write,
+};
+
+static const struct mv88e6xxx_ops mv88e6131_ops = {
+   .phy_read = mv88e6xxx_phy_ppu_read,
+   .phy_write = mv88e6xxx_phy_ppu_write,
+};
+
+static const struct mv88e6xxx_ops mv88e6161_ops = {
+   .phy_read = mv88e6xxx_read,
+   .phy_write = mv88e6xxx_write,
+};
+
+static const struct mv88e6xxx_ops mv88e6165_ops = {
+   .phy_read = mv88e6xxx_read,
+   .phy_write = mv88e6xxx_write,
+};
+
+static const struct mv88e6xxx_ops mv88e6171_ops = {
+   .phy_read = mv88e6xxx_g2_smi_phy_read,
+   .phy_write = mv88e6xxx_g2_smi_phy_write,
+};
+
+static const struct mv88e6xxx_ops mv88e6172_ops = {
+   .phy_read = mv88e6xxx_g2_smi_phy_read,
+   .phy_write = mv88e6xxx_g2_smi_phy_write,
+};
+
+static const struct mv88e6xxx_ops mv88e6175_ops = {
+   .phy_read = mv88e6xxx_g2_smi_phy_read,
+   .phy_write = mv88e6xxx_g2_smi_phy_write,
+};
+
+static const struct mv88e6xxx_ops mv88e6176_ops = {
+   .phy_read = mv88e6xxx_g2_smi_phy_read,
+   .phy_write = mv88e6xxx_g2_smi_phy_write,
+};
+
+static const struct mv88e6xxx_ops mv88e6185_ops = {
+   .phy_read = mv88e6xxx_phy_ppu_read,
+   .phy_write = mv88e6xxx_phy_ppu_write,
+};
+
+static const struct mv88e6xxx_ops mv88e6240_ops = {
+   .phy_read = mv88e6xxx_g2_smi_phy_read,
+   .phy_write = mv88e6xxx_g2_smi_phy_write,
+};
+
+static const struct mv88e6xxx_ops mv88e6320_ops = {
+   .phy_read = mv88e6xxx_g2_smi_phy_read,
+   .phy_write = mv88e6xxx_g2_smi_phy_write,
+};
+
+static const struct mv88e6xxx_ops mv88e6321_ops = {
+   .phy_read = mv88e6xxx_g2_smi_phy_read,
+   .phy_write = mv88e6xxx_g2_smi_phy_write,
+};
+
+static const struct mv88e6xxx_ops mv88e6350_ops = {
+   .phy_read = mv88e6xxx_g2_smi_phy_read,
+   .phy_write = mv88e6xxx_g2_smi_phy_write,
+};
+
+static const struct mv88e6xxx_ops mv88e6351_ops = {
+   .phy_read = mv88e6xxx_g2_smi_phy_read,
+   .phy_write = mv88e6xxx_g2_smi_phy_write,
+};
+
+static const struct mv88e6xxx_ops mv88e6352_ops = {
+   .phy_read = mv88e6xxx_g2_smi_phy_read,
+   .phy_write = mv88e6xxx_g2_smi_phy_write,
+};
+
 static const struct mv88e6xxx_info mv88e6xxx_table[] = {
[MV88E6085] = {
.prod_num = PORT_SWITCH_ID_PROD_NUM_6085,
@@ -3225,6 +3305,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = {
.global1_addr = 0x1b,
.age_time_coeff = 15000,
 

[PATCH net-next 10/10] net: dsa: mv88e6xxx: add eeprom ops

2016-09-29 Thread Vivien Didelot
Remove EEPROM flags in favor of new {get,set}_eeprom chip-wide
functions in the mv88e6xxx_ops structure.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx/chip.c  | 34 +-
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h | 16 +---
 2 files changed, 26 insertions(+), 24 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index e40b71b..883fd98 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -3168,13 +3168,11 @@ static int mv88e6xxx_get_eeprom(struct dsa_switch *ds,
struct mv88e6xxx_chip *chip = ds->priv;
int err;
 
+   if (!chip->info->ops->get_eeprom)
+   return -EOPNOTSUPP;
+
mutex_lock(&chip->reg_lock);
-
-   if (mv88e6xxx_has(chip, MV88E6XXX_FLAGS_EEPROM16))
-   err = mv88e6xxx_g2_get_eeprom16(chip, eeprom, data);
-   else
-   err = -EOPNOTSUPP;
-
+   err = chip->info->ops->get_eeprom(chip, eeprom, data);
mutex_unlock(&chip->reg_lock);
 
if (err)
@@ -3191,16 +3189,14 @@ static int mv88e6xxx_set_eeprom(struct dsa_switch *ds,
struct mv88e6xxx_chip *chip = ds->priv;
int err;
 
+   if (!chip->info->ops->set_eeprom)
+   return -EOPNOTSUPP;
+
if (eeprom->magic != 0xc3ec4951)
return -EINVAL;
 
mutex_lock(&chip->reg_lock);
-
-   if (mv88e6xxx_has(chip, MV88E6XXX_FLAGS_EEPROM16))
-   err = mv88e6xxx_g2_set_eeprom16(chip, eeprom, data);
-   else
-   err = -EOPNOTSUPP;
-
+   err = chip->info->ops->set_eeprom(chip, eeprom, data);
mutex_unlock(&chip->reg_lock);
 
return err;
@@ -3249,6 +3245,8 @@ static const struct mv88e6xxx_ops mv88e6171_ops = {
 };
 
 static const struct mv88e6xxx_ops mv88e6172_ops = {
+   .get_eeprom = mv88e6xxx_g2_get_eeprom16,
+   .set_eeprom = mv88e6xxx_g2_set_eeprom16,
.set_switch_mac = mv88e6xxx_g2_set_switch_mac,
.phy_read = mv88e6xxx_g2_smi_phy_read,
.phy_write = mv88e6xxx_g2_smi_phy_write,
@@ -3261,6 +3259,8 @@ static const struct mv88e6xxx_ops mv88e6175_ops = {
 };
 
 static const struct mv88e6xxx_ops mv88e6176_ops = {
+   .get_eeprom = mv88e6xxx_g2_get_eeprom16,
+   .set_eeprom = mv88e6xxx_g2_set_eeprom16,
.set_switch_mac = mv88e6xxx_g2_set_switch_mac,
.phy_read = mv88e6xxx_g2_smi_phy_read,
.phy_write = mv88e6xxx_g2_smi_phy_write,
@@ -3273,18 +3273,24 @@ static const struct mv88e6xxx_ops mv88e6185_ops = {
 };
 
 static const struct mv88e6xxx_ops mv88e6240_ops = {
+   .get_eeprom = mv88e6xxx_g2_get_eeprom16,
+   .set_eeprom = mv88e6xxx_g2_set_eeprom16,
.set_switch_mac = mv88e6xxx_g2_set_switch_mac,
.phy_read = mv88e6xxx_g2_smi_phy_read,
.phy_write = mv88e6xxx_g2_smi_phy_write,
 };
 
 static const struct mv88e6xxx_ops mv88e6320_ops = {
+   .get_eeprom = mv88e6xxx_g2_get_eeprom16,
+   .set_eeprom = mv88e6xxx_g2_set_eeprom16,
.set_switch_mac = mv88e6xxx_g2_set_switch_mac,
.phy_read = mv88e6xxx_g2_smi_phy_read,
.phy_write = mv88e6xxx_g2_smi_phy_write,
 };
 
 static const struct mv88e6xxx_ops mv88e6321_ops = {
+   .get_eeprom = mv88e6xxx_g2_get_eeprom16,
+   .set_eeprom = mv88e6xxx_g2_set_eeprom16,
.set_switch_mac = mv88e6xxx_g2_set_switch_mac,
.phy_read = mv88e6xxx_g2_smi_phy_read,
.phy_write = mv88e6xxx_g2_smi_phy_write,
@@ -3303,6 +3309,8 @@ static const struct mv88e6xxx_ops mv88e6351_ops = {
 };
 
 static const struct mv88e6xxx_ops mv88e6352_ops = {
+   .get_eeprom = mv88e6xxx_g2_get_eeprom16,
+   .set_eeprom = mv88e6xxx_g2_set_eeprom16,
.set_switch_mac = mv88e6xxx_g2_set_switch_mac,
.phy_read = mv88e6xxx_g2_smi_phy_read,
.phy_write = mv88e6xxx_g2_smi_phy_write,
@@ -3825,7 +3833,7 @@ static int mv88e6xxx_probe(struct mdio_device *mdiodev)
if (IS_ERR(chip->reset))
return PTR_ERR(chip->reset);
 
-   if (mv88e6xxx_has(chip, MV88E6XXX_FLAGS_EEPROM16) &&
+   if (chip->info->ops->get_eeprom &&
!of_property_read_u32(np, "eeprom-length", &eeprom_len))
chip->eeprom_len = eeprom_len;
 
diff --git a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
index d04184c..e572121 100644
--- a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
+++ b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
@@ -424,8 +424,6 @@ enum mv88e6xxx_cap {
MV88E6XXX_CAP_G2_PVT_ADDR,  /* (0x0b) Cross Chip Port VLAN Addr */
MV88E6XXX_CAP_G2_PVT_DATA,  /* (0x0c) Cross Chip Port VLAN Data */
MV88E6XXX_CAP_G2_POT,   /* (0x0f) Priority Override Table */
-   MV88E6XXX_CAP_G2_EEPROM_CMD,/* (0x14) EEPROM Command */
-   MV88E6XXX_CAP_G2_EEPROM_DATA,   /* (0x15) EEPROM Data */
 
/* PHY Polling Unit.
 * See GLOBAL_CONTROL_PPU_ENABLE and GLOBAL_STATUS_PPU_POLLING.
@@ -473,8 

[PATCH net-next 04/10] net: dsa: mv88e6xxx: expose mv88e6xxx_num_databases

2016-09-29 Thread Vivien Didelot
The mv88e6xxx_num_databases helper will be used by shared code, so move
it inline into the header file.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx/chip.c  | 5 -
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h | 5 +
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index b7eecc9..6a55bba 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -560,11 +560,6 @@ static bool mv88e6xxx_6352_family(struct mv88e6xxx_chip 
*chip)
return chip->info->family == MV88E6XXX_FAMILY_6352;
 }
 
-static unsigned int mv88e6xxx_num_databases(struct mv88e6xxx_chip *chip)
-{
-   return chip->info->num_databases;
-}
-
 /* We expect the switch to perform auto negotiation if there is a real
  * phy. However, in the case of a fixed link phy, we force the port
  * settings from the fixed link settings.
diff --git a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
index 6c8584f..20fe6c6 100644
--- a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
+++ b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
@@ -732,6 +732,11 @@ static inline bool mv88e6xxx_has(struct mv88e6xxx_chip 
*chip,
return (chip->info->flags & flags) == flags;
 }
 
+static inline unsigned int mv88e6xxx_num_databases(struct mv88e6xxx_chip *chip)
+{
+   return chip->info->num_databases;
+}
+
 int mv88e6xxx_read(struct mv88e6xxx_chip *chip, int addr, int reg, u16 *val);
 int mv88e6xxx_write(struct mv88e6xxx_chip *chip, int addr, int reg, u16 val);
 int mv88e6xxx_update(struct mv88e6xxx_chip *chip, int addr, int reg,
-- 
2.10.0



Re: [PATCH net v2] L2TP:Adjust intf MTU,factor underlay L3,overlay L2

2016-09-29 Thread James Chapman
On 22/09/16 21:52, R. Parameswaran wrote:
> From ed585bdd6d3d2b3dec58d414f514cd764d89159d Mon Sep 17 00:00:00 2001
> From: "R. Parameswaran" 
> Date: Thu, 22 Sep 2016 13:19:25 -0700
> Subject: [PATCH] L2TP:Adjust intf MTU,factor underlay L3,overlay L2
>
> Take into account all of the tunnel encapsulation headers when setting
> up the MTU on the L2TP logical interface device. Otherwise, packets
> created by the applications on top of the L2TP layer are larger
> than they ought to be, relative to the underlay MTU, leading to
> needless fragmentation once the outer IP encap is added.
>
> Specifically, take into account the (outer, underlay) IP header
> imposed on the encapsulated L2TP packet, and the Layer 2 header
> imposed on the inner IP packet prior to L2TP encapsulation.
>
> Do not assume an Ethernet (non-jumbo) underlay. Use the PMTU mechanism
> and the dst entry in the L2TP tunnel socket to directly pull up
> the underlay MTU (as the baseline number on top of which the
> encapsulation headers are factored in).  Fall back to Ethernet MTU
> if this fails.
>
> Signed-off-by: R. Parameswaran 
>
> Reviewed-by: "N. Prachanda" ,
> Reviewed-by: "R. Shearman" ,
> Reviewed-by: "D. Fawcus" 
> ---
>  net/l2tp/l2tp_eth.c | 48 
>  1 file changed, 44 insertions(+), 4 deletions(-)
>
> diff --git a/net/l2tp/l2tp_eth.c b/net/l2tp/l2tp_eth.c
> index 57fc5a4..dbcd6bd 100644
> --- a/net/l2tp/l2tp_eth.c
> +++ b/net/l2tp/l2tp_eth.c
> @@ -30,6 +30,9 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> +#include 
>  
>  #include "l2tp_core.h"
>  
> @@ -206,6 +209,46 @@ static void l2tp_eth_show(struct seq_file *m, void *arg)
>  }
>  #endif
>  
> +static void l2tp_eth_adjust_mtu(struct l2tp_tunnel *tunnel,
> + struct l2tp_session *session,
> + struct net_device *dev)
> +{
> + unsigned int overhead = 0;
> + struct dst_entry *dst;
> +
> + if (session->mtu != 0) {
> + dev->mtu = session->mtu;
> + dev->needed_headroom += session->hdr_len;
> + if (tunnel->encap == L2TP_ENCAPTYPE_UDP)
> + dev->needed_headroom += sizeof(struct udphdr);
> + return;
> + }
> + overhead = session->hdr_len;
> + /* Adjust MTU, factor overhead - underlay L3 hdr, overlay L2 hdr*/
> + if (tunnel->sock->sk_family == AF_INET)
> + overhead += (ETH_HLEN + sizeof(struct iphdr));
> + else if (tunnel->sock->sk_family == AF_INET6)
> + overhead += (ETH_HLEN + sizeof(struct ipv6hdr));
What about options in the IP header? If certain options are set on the
socket, the IP header may be larger.

> + /* Additionally, if the encap is UDP, account for UDP header size */
> + if (tunnel->encap == L2TP_ENCAPTYPE_UDP)
> + overhead += sizeof(struct udphdr);
> + /* If PMTU discovery was enabled, use discovered MTU on L2TP device */
> + dst = sk_dst_get(tunnel->sock);
> + if (dst) {
> + u32 pmtu = dst_mtu(dst);
> +
> + if (pmtu != 0)
> + dev->mtu = pmtu;
> + dst_release(dst);
> + }
> + /* else (no PMTUD) L2TP dev MTU defaulted to Ethernet MTU in caller */
> + session->mtu = dev->mtu - overhead;
> + dev->mtu = session->mtu;
> + dev->needed_headroom += session->hdr_len;
> + if (tunnel->encap == L2TP_ENCAPTYPE_UDP)
> + dev->needed_headroom += sizeof(struct udphdr);
> +}
> +
>  static int l2tp_eth_create(struct net *net, u32 tunnel_id, u32 session_id, 
> u32 peer_session_id, struct l2tp_session_cfg *cfg)
>  {
>   struct net_device *dev;
> @@ -255,11 +298,8 @@ static int l2tp_eth_create(struct net *net, u32 
> tunnel_id, u32 session_id, u32 p
>   }
>  
>   dev_net_set(dev, net);
> - if (session->mtu == 0)
> - session->mtu = dev->mtu - session->hdr_len;
> - dev->mtu = session->mtu;
> - dev->needed_headroom += session->hdr_len;
>  
> + l2tp_eth_adjust_mtu(tunnel, session, dev);
>   priv = netdev_priv(dev);
>   priv->dev = dev;
>   priv->session = session;





Re: [PATCH net v2] L2TP:Adjust intf MTU,factor underlay L3,overlay L2

2016-09-29 Thread James Chapman
On 29/09/16 03:36, R. Parameswaran wrote:
> I agree that something like 2. below would be needed in the long run (it 
> will need some effort and redesign -e.g. how do I lookup the parent tunnel 
> from the socket when receiving a PMTU update, existing pointer chain runs 
> from tunnel to socket).
>> 2) Add code to handle PMTU events that land on the UDP tunnel
>>socket.

Another function pointer could be added to struct udp_sock, similar to
encap_rcv, such that the PMTU event could be handled by the UDP encap
protocol implementation.
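
A rough sketch of that idea, purely illustrative: the 'encap_mtu_update'
hook and the L2TP helper below do not exist today, the names are made up.

/* Suppose struct udp_sock grows a callback next to encap_rcv:
 *	void (*encap_mtu_update)(struct sock *sk, u32 new_pmtu);
 * invoked from the UDP ICMP error handler when a frag-needed /
 * packet-too-big notification lands on the tunnel socket.
 */
static void l2tp_udp_mtu_update(struct sock *sk, u32 new_pmtu)
{
	/* for L2TP tunnel sockets, sk_user_data points at the tunnel */
	struct l2tp_tunnel *tunnel = sk->sk_user_data;

	if (!tunnel)
		return;

	/* hypothetical helper: subtract the encap overhead and update
	 * the MTU of every session/device on this tunnel
	 */
	l2tp_tunnel_recalc_mtu(tunnel, new_pmtu);
}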

James




[PATCH net-next 05/10] net: dsa: mv88e6xxx: add mv88e6xxx_num_ports helper

2016-09-29 Thread Vivien Didelot
Add an mv88e6xxx_num_ports helper instead of digging in the chip info
structure.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx/chip.c  | 30 +++---
 drivers/net/dsa/mv88e6xxx/global2.c   |  8 
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h |  5 +
 3 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 6a55bba..9056d9e 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -613,7 +613,7 @@ static void mv88e6xxx_adjust_link(struct dsa_switch *ds, 
int port,
reg |= PORT_PCS_CTRL_DUPLEX_FULL;
 
if ((mv88e6xxx_6352_family(chip) || mv88e6xxx_6351_family(chip)) &&
-   (port >= chip->info->num_ports - 2)) {
+   (port >= mv88e6xxx_num_ports(chip) - 2)) {
if (phydev->interface == PHY_INTERFACE_MODE_RGMII_RXID)
reg |= PORT_PCS_CTRL_RGMII_DELAY_RXCLK;
if (phydev->interface == PHY_INTERFACE_MODE_RGMII_TXID)
@@ -1112,7 +1112,7 @@ static int _mv88e6xxx_port_state(struct mv88e6xxx_chip 
*chip, int port,
 static int _mv88e6xxx_port_based_vlan_map(struct mv88e6xxx_chip *chip, int 
port)
 {
struct net_device *bridge = chip->ports[port].bridge_dev;
-   const u16 mask = (1 << chip->info->num_ports) - 1;
+   const u16 mask = (1 << mv88e6xxx_num_ports(chip)) - 1;
struct dsa_switch *ds = chip->ds;
u16 output_ports = 0;
u16 reg;
@@ -1123,7 +1123,7 @@ static int _mv88e6xxx_port_based_vlan_map(struct 
mv88e6xxx_chip *chip, int port)
if (dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port)) {
output_ports = mask;
} else {
-   for (i = 0; i < chip->info->num_ports; ++i) {
+   for (i = 0; i < mv88e6xxx_num_ports(chip); ++i) {
/* allow sending frames to every group member */
if (bridge && chip->ports[i].bridge_dev == bridge)
output_ports |= BIT(i);
@@ -1279,7 +1279,7 @@ static int _mv88e6xxx_vtu_stu_data_read(struct 
mv88e6xxx_chip *chip,
return err;
}
 
-   for (i = 0; i < chip->info->num_ports; ++i) {
+   for (i = 0; i < mv88e6xxx_num_ports(chip); ++i) {
unsigned int shift = (i % 4) * 4 + nibble_offset;
u16 reg = regs[i / 4];
 
@@ -1308,7 +1308,7 @@ static int _mv88e6xxx_vtu_stu_data_write(struct 
mv88e6xxx_chip *chip,
u16 regs[3] = { 0 };
int i, err;
 
-   for (i = 0; i < chip->info->num_ports; ++i) {
+   for (i = 0; i < mv88e6xxx_num_ports(chip); ++i) {
unsigned int shift = (i % 4) * 4 + nibble_offset;
u8 data = entry->data[i];
 
@@ -1658,7 +1658,7 @@ static int _mv88e6xxx_fid_new(struct mv88e6xxx_chip 
*chip, u16 *fid)
bitmap_zero(fid_bitmap, MV88E6XXX_N_FID);
 
/* Set every FID bit used by the (un)bridged ports */
-   for (i = 0; i < chip->info->num_ports; ++i) {
+   for (i = 0; i < mv88e6xxx_num_ports(chip); ++i) {
err = _mv88e6xxx_port_fid_get(chip, i, fid);
if (err)
return err;
@@ -1708,7 +1708,7 @@ static int _mv88e6xxx_vtu_new(struct mv88e6xxx_chip 
*chip, u16 vid,
return err;
 
/* exclude all ports except the CPU and DSA ports */
-   for (i = 0; i < chip->info->num_ports; ++i)
+   for (i = 0; i < mv88e6xxx_num_ports(chip); ++i)
vlan.data[i] = dsa_is_cpu_port(ds, i) || dsa_is_dsa_port(ds, i)
? GLOBAL_VTU_DATA_MEMBER_TAG_UNMODIFIED
: GLOBAL_VTU_DATA_MEMBER_TAG_NON_MEMBER;
@@ -1797,7 +1797,7 @@ static int mv88e6xxx_port_check_hw_vlan(struct dsa_switch 
*ds, int port,
if (vlan.vid > vid_end)
break;
 
-   for (i = 0; i < chip->info->num_ports; ++i) {
+   for (i = 0; i < mv88e6xxx_num_ports(chip); ++i) {
if (dsa_is_dsa_port(ds, i) || dsa_is_cpu_port(ds, i))
continue;
 
@@ -1959,7 +1959,7 @@ static int _mv88e6xxx_port_vlan_del(struct mv88e6xxx_chip 
*chip,
 
/* keep the VLAN unless all ports are excluded */
vlan.valid = false;
-   for (i = 0; i < chip->info->num_ports; ++i) {
+   for (i = 0; i < mv88e6xxx_num_ports(chip); ++i) {
if (dsa_is_cpu_port(ds, i) || dsa_is_dsa_port(ds, i))
continue;
 
@@ -2340,7 +2340,7 @@ static int mv88e6xxx_port_bridge_join(struct dsa_switch 
*ds, int port,
/* Assign the bridge and remap each port's VLANTable */
chip->ports[port].bridge_dev = bridge;
 
-   for (i = 0; i < chip->info->num_ports; ++i) {
+   for (i = 0; i < mv88e6xxx_num_ports(chip); ++i) {
if (chip->ports[i].bridge_dev == bridge) {
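
(The helper itself is outside the quoted hunks; presumably it is just a thin
accessor in mv88e6xxx.h, along these lines:)

	static inline unsigned int mv88e6xxx_num_ports(struct mv88e6xxx_chip *chip)
	{
		return chip->info->num_ports;
	}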
  

[PATCH net-next 07/10] net: dsa: mv88e6xxx: rename mv88e6xxx_ops

2016-09-29 Thread Vivien Didelot
The mv88e6xxx_ops is used to describe how to access the chip registers.
It can be through SMI (via an MDIO bus), or via another interface such
as crafted remote management frames.

The correct BUS operations structure is chosen at runtime, depending on
the chip address and connectivity.

We will need the mv88e6xxx_ops name for future chip-wide operation
structure, thus rename mv88e6xxx_ops to more explicit mv88e6xxx_bus_ops.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx/chip.c  | 10 +-
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h |  8 
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index d805661..ad31d3e 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -98,7 +98,7 @@ static int mv88e6xxx_smi_single_chip_write(struct 
mv88e6xxx_chip *chip,
return 0;
 }
 
-static const struct mv88e6xxx_ops mv88e6xxx_smi_single_chip_ops = {
+static const struct mv88e6xxx_bus_ops mv88e6xxx_smi_single_chip_ops = {
.read = mv88e6xxx_smi_single_chip_read,
.write = mv88e6xxx_smi_single_chip_write,
 };
@@ -180,7 +180,7 @@ static int mv88e6xxx_smi_multi_chip_write(struct 
mv88e6xxx_chip *chip,
return 0;
 }
 
-static const struct mv88e6xxx_ops mv88e6xxx_smi_multi_chip_ops = {
+static const struct mv88e6xxx_bus_ops mv88e6xxx_smi_multi_chip_ops = {
.read = mv88e6xxx_smi_multi_chip_read,
.write = mv88e6xxx_smi_multi_chip_write,
 };
@@ -515,7 +515,7 @@ static int mv88e6xxx_phy_ppu_write(struct mv88e6xxx_chip 
*chip, int addr,
return err;
 }
 
-static const struct mv88e6xxx_ops mv88e6xxx_phy_ppu_ops = {
+static const struct mv88e6xxx_bus_ops mv88e6xxx_phy_ppu_ops = {
.read = mv88e6xxx_phy_ppu_read,
.write = mv88e6xxx_phy_ppu_write,
 };
@@ -3479,12 +3479,12 @@ static struct mv88e6xxx_chip 
*mv88e6xxx_alloc_chip(struct device *dev)
return chip;
 }
 
-static const struct mv88e6xxx_ops mv88e6xxx_g2_smi_phy_ops = {
+static const struct mv88e6xxx_bus_ops mv88e6xxx_g2_smi_phy_ops = {
.read = mv88e6xxx_g2_smi_phy_read,
.write = mv88e6xxx_g2_smi_phy_write,
 };
 
-static const struct mv88e6xxx_ops mv88e6xxx_phy_ops = {
+static const struct mv88e6xxx_bus_ops mv88e6xxx_phy_ops = {
.read = mv88e6xxx_read,
.write = mv88e6xxx_write,
 };
diff --git a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h 
b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
index ee7873a..b9ef769 100644
--- a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
+++ b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
@@ -648,7 +648,7 @@ struct mv88e6xxx_vtu_entry {
u8  data[DSA_MAX_PORTS];
 };
 
-struct mv88e6xxx_ops;
+struct mv88e6xxx_bus_ops;
 
 struct mv88e6xxx_priv_port {
struct net_device *bridge_dev;
@@ -669,14 +669,14 @@ struct mv88e6xxx_chip {
/* The MII bus and the address on the bus that is used to
 * communication with the switch
 */
-   const struct mv88e6xxx_ops *smi_ops;
+   const struct mv88e6xxx_bus_ops *smi_ops;
struct mii_bus *bus;
int sw_addr;
 
/* Handles automatic disabling and re-enabling of the PHY
 * polling unit.
 */
-   const struct mv88e6xxx_ops *phy_ops;
+   const struct mv88e6xxx_bus_ops *phy_ops;
struct mutexppu_mutex;
int ppu_disabled;
struct work_struct  ppu_work;
@@ -705,7 +705,7 @@ struct mv88e6xxx_chip {
struct mii_bus *mdio_bus;
 };
 
-struct mv88e6xxx_ops {
+struct mv88e6xxx_bus_ops {
int (*read)(struct mv88e6xxx_chip *chip, int addr, int reg, u16 *val);
int (*write)(struct mv88e6xxx_chip *chip, int addr, int reg, u16 val);
 };
-- 
2.10.0



Re: [PATCH v2 3/3] net: make net namespace sysctls belong to container's owner

2016-09-29 Thread Dmitry Torokhov
Hi David,

On Wed, Aug 10, 2016 at 2:36 PM, Dmitry Torokhov wrote:
> If net namespace is attached to a user namespace let's make container's
> root owner of sysctls affecting said network namespace instead of global
> root.
>
> This also allows us to clean up net_ctl_permissions() because we do not
> need to fudge permissions anymore for the container's owner since it now
> owns the objects in question.
>
> Acked-by: "Eric W. Biederman" 
> Signed-off-by: Dmitry Torokhov 

I was looking at linux-next today, and I noticed that, when you merged
my patch, you basically reverted the following commit:

commit d6e0d306449bcb5fa3c80e7a3edf11d45abf9ae9
Author: Tyler Hicks 
Date:   Thu Jun 2 23:43:22 2016 -0500

net: Use ns_capable_noaudit() when determining net sysctl permissions

The capability check should not be audited since it is only being used
to determine the inode permissions. A failed check does not indicate a
violation of security policy but, when an LSM is enabled, a denial audit
message was being generated.

The denial audit message caused confusion for some application authors
because root-running Go applications always triggered the denial. To
prevent this confusion, the capability check in net_ctl_permissions() is
switched to the noaudit variant.

BugLink: https://launchpad.net/bugs/1465724

Signed-off-by: Tyler Hicks 
Acked-by: Serge E. Hallyn 
Signed-off-by: James Morris 

Thanks!

> ---
>  fs/proc/proc_sysctl.c  |  5 +
>  include/linux/sysctl.h |  4 
>  net/sysctl_net.c   | 29 -
>  3 files changed, 29 insertions(+), 9 deletions(-)
>
> diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> index 5e57c3e..28f9085 100644
> --- a/fs/proc/proc_sysctl.c
> +++ b/fs/proc/proc_sysctl.c
> @@ -430,6 +430,7 @@ static int sysctl_perm(struct ctl_table_header *head, 
> struct ctl_table *table, i
>  static struct inode *proc_sys_make_inode(struct super_block *sb,
> struct ctl_table_header *head, struct ctl_table *table)
>  {
> +   struct ctl_table_root *root = head->root;
> struct inode *inode;
> struct proc_inode *ei;
>
> @@ -457,6 +458,10 @@ static struct inode *proc_sys_make_inode(struct 
> super_block *sb,
> if (is_empty_dir(head))
> make_empty_dir_inode(inode);
> }
> +
> +   if (root->set_ownership)
> +		root->set_ownership(head, table, &inode->i_uid, &inode->i_gid);
> +
>  out:
> return inode;
>  }
> diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
> index fa7bc29..55bec2f 100644
> --- a/include/linux/sysctl.h
> +++ b/include/linux/sysctl.h
> @@ -25,6 +25,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>
>  /* For the /proc/sys support */
> @@ -156,6 +157,9 @@ struct ctl_table_root {
> struct ctl_table_set default_set;
> struct ctl_table_set *(*lookup)(struct ctl_table_root *root,
>struct nsproxy *namespaces);
> +   void (*set_ownership)(struct ctl_table_header *head,
> + struct ctl_table *table,
> + kuid_t *uid, kgid_t *gid);
> int (*permissions)(struct ctl_table_header *head, struct ctl_table 
> *table);
>  };
>
> diff --git a/net/sysctl_net.c b/net/sysctl_net.c
> index ed98c1f..5bc1a3d 100644
> --- a/net/sysctl_net.c
> +++ b/net/sysctl_net.c
> @@ -42,26 +42,37 @@ static int net_ctl_permissions(struct ctl_table_header 
> *head,
>struct ctl_table *table)
>  {
> struct net *net = container_of(head->set, struct net, sysctls);
> -   kuid_t root_uid = make_kuid(net->user_ns, 0);
> -   kgid_t root_gid = make_kgid(net->user_ns, 0);
>
> /* Allow network administrator to have same access as root. */
> -   if (ns_capable(net->user_ns, CAP_NET_ADMIN) ||
> -   uid_eq(root_uid, current_euid())) {
> +   if (ns_capable(net->user_ns, CAP_NET_ADMIN)) {
> int mode = (table->mode >> 6) & 7;
> return (mode << 6) | (mode << 3) | mode;
> }
> -   /* Allow netns root group to have the same access as the root group */
> -   if (in_egroup_p(root_gid)) {
> -   int mode = (table->mode >> 3) & 7;
> -   return (mode << 3) | mode;
> -   }
> +
> return table->mode;
>  }
>
> +static void net_ctl_set_ownership(struct ctl_table_header *head,
> + struct ctl_table *table,
> + kuid_t *uid, kgid_t *gid)
> +{
> +   struct net *net = container_of(head->set, struct net, sysctls);
> +   kuid_t ns_root_uid;
> +   kgid_t ns_root_gid;
> +
> +   ns_root_uid = make_kuid(net->user_ns, 0);
> + 
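
(The hunk is truncated here; presumably it goes on to map the netns root
uid/gid onto the inode when they are valid, roughly:)

	ns_root_uid = make_kuid(net->user_ns, 0);
	if (uid_valid(ns_root_uid))
		*uid = ns_root_uid;

	ns_root_gid = make_kgid(net->user_ns, 0);
	if (gid_valid(ns_root_gid))
		*gid = ns_root_gid;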

[PATCH net-next 09/10] net: dsa: mv88e6xxx: add set_switch_mac to ops

2016-09-29 Thread Vivien Didelot
Add a set_switch_mac chip-wide function to mv88e6xxx_ops and remove
MV88E6XXX_FLAG_G2_SWITCH_MAC flags.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx/chip.c  | 28 +---
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h |  8 ++--
 2 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 83a3769..e40b71b 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -2909,14 +2909,11 @@ static int mv88e6xxx_set_addr(struct dsa_switch *ds, u8 
*addr)
struct mv88e6xxx_chip *chip = ds->priv;
int err;
 
+   if (!chip->info->ops->set_switch_mac)
+   return -EOPNOTSUPP;
+
	mutex_lock(&chip->reg_lock);
-
-   /* Has an indirect Switch MAC/WoL/WoF register in Global 2? */
-   if (mv88e6xxx_has(chip, MV88E6XXX_FLAG_G2_SWITCH_MAC))
-   err = mv88e6xxx_g2_set_switch_mac(chip, addr);
-   else
-   err = mv88e6xxx_g1_set_switch_mac(chip, addr);
-
+   err = chip->info->ops->set_switch_mac(chip, addr);
	mutex_unlock(&chip->reg_lock);
 
return err;
@@ -3210,86 +3207,103 @@ static int mv88e6xxx_set_eeprom(struct dsa_switch *ds,
 }
 
 static const struct mv88e6xxx_ops mv88e6085_ops = {
+   .set_switch_mac = mv88e6xxx_g1_set_switch_mac,
.phy_read = mv88e6xxx_phy_ppu_read,
.phy_write = mv88e6xxx_phy_ppu_write,
 };
 
 static const struct mv88e6xxx_ops mv88e6095_ops = {
+   .set_switch_mac = mv88e6xxx_g1_set_switch_mac,
.phy_read = mv88e6xxx_phy_ppu_read,
.phy_write = mv88e6xxx_phy_ppu_write,
 };
 
 static const struct mv88e6xxx_ops mv88e6123_ops = {
+   .set_switch_mac = mv88e6xxx_g2_set_switch_mac,
.phy_read = mv88e6xxx_read,
.phy_write = mv88e6xxx_write,
 };
 
 static const struct mv88e6xxx_ops mv88e6131_ops = {
+   .set_switch_mac = mv88e6xxx_g1_set_switch_mac,
.phy_read = mv88e6xxx_phy_ppu_read,
.phy_write = mv88e6xxx_phy_ppu_write,
 };
 
 static const struct mv88e6xxx_ops mv88e6161_ops = {
+   .set_switch_mac = mv88e6xxx_g2_set_switch_mac,
.phy_read = mv88e6xxx_read,
.phy_write = mv88e6xxx_write,
 };
 
 static const struct mv88e6xxx_ops mv88e6165_ops = {
+   .set_switch_mac = mv88e6xxx_g2_set_switch_mac,
.phy_read = mv88e6xxx_read,
.phy_write = mv88e6xxx_write,
 };
 
 static const struct mv88e6xxx_ops mv88e6171_ops = {
+   .set_switch_mac = mv88e6xxx_g2_set_switch_mac,
.phy_read = mv88e6xxx_g2_smi_phy_read,
.phy_write = mv88e6xxx_g2_smi_phy_write,
 };
 
 static const struct mv88e6xxx_ops mv88e6172_ops = {
+   .set_switch_mac = mv88e6xxx_g2_set_switch_mac,
.phy_read = mv88e6xxx_g2_smi_phy_read,
.phy_write = mv88e6xxx_g2_smi_phy_write,
 };
 
 static const struct mv88e6xxx_ops mv88e6175_ops = {
+   .set_switch_mac = mv88e6xxx_g2_set_switch_mac,
.phy_read = mv88e6xxx_g2_smi_phy_read,
.phy_write = mv88e6xxx_g2_smi_phy_write,
 };
 
 static const struct mv88e6xxx_ops mv88e6176_ops = {
+   .set_switch_mac = mv88e6xxx_g2_set_switch_mac,
.phy_read = mv88e6xxx_g2_smi_phy_read,
.phy_write = mv88e6xxx_g2_smi_phy_write,
 };
 
 static const struct mv88e6xxx_ops mv88e6185_ops = {
+   .set_switch_mac = mv88e6xxx_g1_set_switch_mac,
.phy_read = mv88e6xxx_phy_ppu_read,
.phy_write = mv88e6xxx_phy_ppu_write,
 };
 
 static const struct mv88e6xxx_ops mv88e6240_ops = {
+   .set_switch_mac = mv88e6xxx_g2_set_switch_mac,
.phy_read = mv88e6xxx_g2_smi_phy_read,
.phy_write = mv88e6xxx_g2_smi_phy_write,
 };
 
 static const struct mv88e6xxx_ops mv88e6320_ops = {
+   .set_switch_mac = mv88e6xxx_g2_set_switch_mac,
.phy_read = mv88e6xxx_g2_smi_phy_read,
.phy_write = mv88e6xxx_g2_smi_phy_write,
 };
 
 static const struct mv88e6xxx_ops mv88e6321_ops = {
+   .set_switch_mac = mv88e6xxx_g2_set_switch_mac,
.phy_read = mv88e6xxx_g2_smi_phy_read,
.phy_write = mv88e6xxx_g2_smi_phy_write,
 };
 
 static const struct mv88e6xxx_ops mv88e6350_ops = {
+   .set_switch_mac = mv88e6xxx_g2_set_switch_mac,
.phy_read = mv88e6xxx_g2_smi_phy_read,
.phy_write = mv88e6xxx_g2_smi_phy_write,
 };
 
 static const struct mv88e6xxx_ops mv88e6351_ops = {
+   .set_switch_mac = mv88e6xxx_g2_set_switch_mac,
.phy_read = mv88e6xxx_g2_smi_phy_read,
.phy_write = mv88e6xxx_g2_smi_phy_write,
 };
 
 static const struct mv88e6xxx_ops mv88e6352_ops = {
+   .set_switch_mac = mv88e6xxx_g2_set_switch_mac,
.phy_read = mv88e6xxx_g2_smi_phy_read,
.phy_write = mv88e6xxx_g2_smi_phy_write,
 };
diff --git a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h 
b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
index 8e12902..d04184c 100644
--- a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
+++ b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
@@ -423,7 +423,6 @@ 

[PATCH net-next 00/10] net: dsa: mv88e6xxx: Global (1) cosmetics

2016-09-29 Thread Vivien Didelot
The Global (1) internal SMI device of Marvell switches is a set of
registers providing support to different units for MAC addresses (ATU),
VLANs (VTU), PHY polling (PPU), etc.

Chips (like 88E6060) may use a different address for it, or have
subtleties in the units (e.g. different number of databases, changing
how registers must be accessed), making it hard to maintain properly.

This patchset is a first step to polish the Global (1) support, with no
functional changes though. Here's basically what it does:

  - add helpers to access Global1 registers (same for Global2)
  - remove a few family checks (VTU/STU FID registers)
  - s/mv88e6xxx_vtu_stu_entry/mv88e6xxx_vtu_entry/
  - add a per-chip mv88e6xxx_ops structure of function pointers:

  struct mv88e6xxx_ops {
int (*get_eeprom)(struct mv88e6xxx_chip *chip,
  struct ethtool_eeprom *eeprom, u8 *data);
int (*set_eeprom)(struct mv88e6xxx_chip *chip,
  struct ethtool_eeprom *eeprom, u8 *data);

int (*set_switch_mac)(struct mv88e6xxx_chip *chip, u8 *addr);

int (*phy_read)(struct mv88e6xxx_chip *chip, int addr, int reg,
u16 *val);
int (*phy_write)(struct mv88e6xxx_chip *chip, int addr, int reg,
 u16 val);
  };

Future patchsets will add ATU/VTU ops, software reset, etc.

Vivien Didelot (10):
  net: dsa: mv88e6xxx: add global1 helpers
  net: dsa: mv88e6xxx: abstract REG_GLOBAL2
  net: dsa: mv88e6xxx: add flags for FID registers
  net: dsa: mv88e6xxx: expose mv88e6xxx_num_databases
  net: dsa: mv88e6xxx: add mv88e6xxx_num_ports helper
  net: dsa: mv88e6xxx: rename mv88e6xxx_vtu_stu_entry
  net: dsa: mv88e6xxx: rename mv88e6xxx_ops
  net: dsa: mv88e6xxx: add chip-wide ops
  net: dsa: mv88e6xxx: add set_switch_mac to ops
  net: dsa: mv88e6xxx: add eeprom ops

 drivers/net/dsa/mv88e6xxx/Makefile|   1 +
 drivers/net/dsa/mv88e6xxx/chip.c  | 804 +++---
 drivers/net/dsa/mv88e6xxx/global1.c   |  34 ++
 drivers/net/dsa/mv88e6xxx/global1.h   |  23 +
 drivers/net/dsa/mv88e6xxx/global2.c   |  86 ++--
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h | 102 +++--
 6 files changed, 607 insertions(+), 443 deletions(-)
 create mode 100644 drivers/net/dsa/mv88e6xxx/global1.c
 create mode 100644 drivers/net/dsa/mv88e6xxx/global1.h

-- 
2.10.0



[PATCH net-next 03/10] net: dsa: mv88e6xxx: add flags for FID registers

2016-09-29 Thread Vivien Didelot
Add flags to describe the presence of Global 1 ATU FID register (0x01)
and VTU FID register (0x02), instead of checking families.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx/chip.c  | 16 +++-
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h | 24 
 2 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 98dee2c..b7eecc9 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -565,16 +565,6 @@ static unsigned int mv88e6xxx_num_databases(struct 
mv88e6xxx_chip *chip)
return chip->info->num_databases;
 }
 
-static bool mv88e6xxx_has_fid_reg(struct mv88e6xxx_chip *chip)
-{
-   /* Does the device have dedicated FID registers for ATU and VTU ops? */
-   if (mv88e6xxx_6097_family(chip) || mv88e6xxx_6165_family(chip) ||
-   mv88e6xxx_6351_family(chip) || mv88e6xxx_6352_family(chip))
-   return true;
-
-   return false;
-}
-
 /* We expect the switch to perform auto negotiation if there is a real
  * phy. However, in the case of a fixed link phy, we force the port
  * settings from the fixed link settings.
@@ -978,7 +968,7 @@ static int _mv88e6xxx_atu_cmd(struct mv88e6xxx_chip *chip, 
u16 fid, u16 cmd)
u16 val;
int err;
 
-   if (mv88e6xxx_has_fid_reg(chip)) {
+   if (mv88e6xxx_has(chip, MV88E6XXX_FLAG_G1_ATU_FID)) {
err = mv88e6xxx_g1_write(chip, GLOBAL_ATU_FID, fid);
if (err)
return err;
@@ -1386,7 +1376,7 @@ static int _mv88e6xxx_vtu_getnext(struct mv88e6xxx_chip 
*chip,
if (err)
return err;
 
-   if (mv88e6xxx_has_fid_reg(chip)) {
+   if (mv88e6xxx_has(chip, MV88E6XXX_FLAG_G1_VTU_FID)) {
		err = mv88e6xxx_g1_read(chip, GLOBAL_VTU_FID, &val);
if (err)
return err;
@@ -1498,7 +1488,7 @@ static int _mv88e6xxx_vtu_loadpurge(struct mv88e6xxx_chip 
*chip,
return err;
}
 
-   if (mv88e6xxx_has_fid_reg(chip)) {
+   if (mv88e6xxx_has(chip, MV88E6XXX_FLAG_G1_VTU_FID)) {
reg = entry->fid & GLOBAL_VTU_FID_MASK;
err = mv88e6xxx_g1_write(chip, GLOBAL_VTU_FID, reg);
if (err)
diff --git a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h 
b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
index 2f10108..6c8584f 100644
--- a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
+++ b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
@@ -170,8 +170,8 @@
 #define GLOBAL_MAC_01  0x01
 #define GLOBAL_MAC_23  0x02
 #define GLOBAL_MAC_45  0x03
-#define GLOBAL_ATU_FID 0x01/* 6097 6165 6351 6352 */
-#define GLOBAL_VTU_FID 0x02/* 6097 6165 6351 6352 */
+#define GLOBAL_ATU_FID 0x01
+#define GLOBAL_VTU_FID 0x02
 #define GLOBAL_VTU_FID_MASK0xfff
 #define GLOBAL_VTU_SID 0x03/* 6097 6165 6351 6352 */
 #define GLOBAL_VTU_SID_MASK0x3f
@@ -408,6 +408,11 @@ enum mv88e6xxx_cap {
 */
MV88E6XXX_CAP_SERDES,
 
+   /* Switch Global (1) Registers.
+*/
+   MV88E6XXX_CAP_G1_ATU_FID,   /* (0x01) ATU FID Register */
+   MV88E6XXX_CAP_G1_VTU_FID,   /* (0x02) VTU FID Register */
+
/* Switch Global 2 Registers.
 * The device contains a second set of global 16-bit registers.
 */
@@ -460,6 +465,9 @@ enum mv88e6xxx_cap {
 
 #define MV88E6XXX_FLAG_SERDES  BIT_ULL(MV88E6XXX_CAP_SERDES)
 
+#define MV88E6XXX_FLAG_G1_ATU_FID  BIT_ULL(MV88E6XXX_CAP_G1_ATU_FID)
+#define MV88E6XXX_FLAG_G1_VTU_FID  BIT_ULL(MV88E6XXX_CAP_G1_VTU_FID)
+
 #define MV88E6XXX_FLAG_GLOBAL2 BIT_ULL(MV88E6XXX_CAP_GLOBAL2)
 #define MV88E6XXX_FLAG_G2_MGMT_EN_2X   BIT_ULL(MV88E6XXX_CAP_G2_MGMT_EN_2X)
 #define MV88E6XXX_FLAG_G2_MGMT_EN_0X   BIT_ULL(MV88E6XXX_CAP_G2_MGMT_EN_0X)
@@ -519,7 +527,9 @@ enum mv88e6xxx_cap {
 MV88E6XXX_FLAGS_MULTI_CHIP)
 
 #define MV88E6XXX_FLAGS_FAMILY_6097\
-   (MV88E6XXX_FLAG_GLOBAL2 |   \
+   (MV88E6XXX_FLAG_G1_ATU_FID |\
+MV88E6XXX_FLAG_G1_VTU_FID |\
+MV88E6XXX_FLAG_GLOBAL2 |   \
 MV88E6XXX_FLAG_G2_MGMT_EN_2X | \
 MV88E6XXX_FLAG_G2_MGMT_EN_0X | \
 MV88E6XXX_FLAG_G2_POT |\
@@ -531,7 +541,9 @@ enum mv88e6xxx_cap {
 MV88E6XXX_FLAGS_PVT)
 
 #define MV88E6XXX_FLAGS_FAMILY_6165\
-   (MV88E6XXX_FLAG_GLOBAL2 |   \
+   (MV88E6XXX_FLAG_G1_ATU_FID |\
+MV88E6XXX_FLAG_G1_VTU_FID |\
+MV88E6XXX_FLAG_GLOBAL2 |   \
 MV88E6XXX_FLAG_G2_MGMT_EN_2X | \
 MV88E6XXX_FLAG_G2_MGMT_EN_0X | \
 MV88E6XXX_FLAG_G2_SWITCH_MAC | \
@@ -570,6 +582,8 @@ enum mv88e6xxx_cap {
 
 #define MV88E6XXX_FLAGS_FAMILY_6351\
(MV88E6XXX_FLAG_EDSA |  \
+MV88E6XXX_FLAG_G1_ATU_FID 

[PATCH net-next 01/10] net: dsa: mv88e6xxx: add global1 helpers

2016-09-29 Thread Vivien Didelot
The Global (1) internal SMI device is an extended set of registers
containing ATU, PPU, VTU, STU, etc.

It is present on every switch, usually at SMI address 0x1B. But old
models such as 88E6060 access it at address 0xF, thus using REG_GLOBAL
is erroneous.

Add a global1_addr info member used by mv88e6xxx_g1_{read,write} and
mv88e6xxx_g1_wait helpers in a new global1.c file.

This patch finally removes _mv88e6xxx_reg_{read,write}, in favor on the
appropriate helpers. No functional changes here.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx/Makefile|   1 +
 drivers/net/dsa/mv88e6xxx/chip.c  | 505 +-
 drivers/net/dsa/mv88e6xxx/global1.c   |  34 +++
 drivers/net/dsa/mv88e6xxx/global1.h   |  23 ++
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h |   2 +-
 5 files changed, 306 insertions(+), 259 deletions(-)
 create mode 100644 drivers/net/dsa/mv88e6xxx/global1.c
 create mode 100644 drivers/net/dsa/mv88e6xxx/global1.h

diff --git a/drivers/net/dsa/mv88e6xxx/Makefile 
b/drivers/net/dsa/mv88e6xxx/Makefile
index 6971039..10ce820 100644
--- a/drivers/net/dsa/mv88e6xxx/Makefile
+++ b/drivers/net/dsa/mv88e6xxx/Makefile
@@ -1,3 +1,4 @@
 obj-$(CONFIG_NET_DSA_MV88E6XXX) += mv88e6xxx.o
 mv88e6xxx-objs := chip.o
+mv88e6xxx-objs += global1.o
 mv88e6xxx-$(CONFIG_NET_DSA_MV88E6XXX_GLOBAL2) += global2.o
diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index b2c25da..98dee2c 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -31,6 +31,7 @@
 #include 
 
 #include "mv88e6xxx.h"
+#include "global1.h"
 #include "global2.h"
 
 static void assert_reg_lock(struct mv88e6xxx_chip *chip)
@@ -361,46 +362,27 @@ int mv88e6xxx_update(struct mv88e6xxx_chip *chip, int 
addr, int reg, u16 update)
return mv88e6xxx_write(chip, addr, reg, val);
 }
 
-static int _mv88e6xxx_reg_read(struct mv88e6xxx_chip *chip, int addr, int reg)
+static int mv88e6xxx_ppu_disable(struct mv88e6xxx_chip *chip)
 {
u16 val;
-   int err;
+   int i, err;
 
-   err = mv88e6xxx_read(chip, addr, reg, &val);
+   err = mv88e6xxx_g1_read(chip, GLOBAL_CONTROL, &val);
if (err)
return err;
 
-   return val;
-}
-
-static int _mv88e6xxx_reg_write(struct mv88e6xxx_chip *chip, int addr,
-   int reg, u16 val)
-{
-   return mv88e6xxx_write(chip, addr, reg, val);
-}
-
-static int mv88e6xxx_ppu_disable(struct mv88e6xxx_chip *chip)
-{
-   int ret;
-   int i;
-
-   ret = _mv88e6xxx_reg_read(chip, REG_GLOBAL, GLOBAL_CONTROL);
-   if (ret < 0)
-   return ret;
-
-   ret = _mv88e6xxx_reg_write(chip, REG_GLOBAL, GLOBAL_CONTROL,
-  ret & ~GLOBAL_CONTROL_PPU_ENABLE);
-   if (ret)
-   return ret;
+   err = mv88e6xxx_g1_write(chip, GLOBAL_CONTROL,
+val & ~GLOBAL_CONTROL_PPU_ENABLE);
+   if (err)
+   return err;
 
for (i = 0; i < 16; i++) {
-   ret = _mv88e6xxx_reg_read(chip, REG_GLOBAL, GLOBAL_STATUS);
-   if (ret < 0)
-   return ret;
+   err = mv88e6xxx_g1_read(chip, GLOBAL_STATUS, &val);
+   if (err)
+   return err;
 
usleep_range(1000, 2000);
-   if ((ret & GLOBAL_STATUS_PPU_MASK) !=
-   GLOBAL_STATUS_PPU_POLLING)
+   if ((val & GLOBAL_STATUS_PPU_MASK) != GLOBAL_STATUS_PPU_POLLING)
return 0;
}
 
@@ -409,25 +391,25 @@ static int mv88e6xxx_ppu_disable(struct mv88e6xxx_chip 
*chip)
 
 static int mv88e6xxx_ppu_enable(struct mv88e6xxx_chip *chip)
 {
-   int ret, err, i;
+   u16 val;
+   int i, err;
 
-   ret = _mv88e6xxx_reg_read(chip, REG_GLOBAL, GLOBAL_CONTROL);
-   if (ret < 0)
-   return ret;
+   err = mv88e6xxx_g1_read(chip, GLOBAL_CONTROL, &val);
+   if (err)
+   return err;
 
-   err = _mv88e6xxx_reg_write(chip, REG_GLOBAL, GLOBAL_CONTROL,
-  ret | GLOBAL_CONTROL_PPU_ENABLE);
+   err = mv88e6xxx_g1_write(chip, GLOBAL_CONTROL,
+val | GLOBAL_CONTROL_PPU_ENABLE);
if (err)
return err;
 
for (i = 0; i < 16; i++) {
-   ret = _mv88e6xxx_reg_read(chip, REG_GLOBAL, GLOBAL_STATUS);
-   if (ret < 0)
-   return ret;
+   err = mv88e6xxx_g1_read(chip, GLOBAL_STATUS, &val);
+   if (err)
+   return err;
 
usleep_range(1000, 2000);
-   if ((ret & GLOBAL_STATUS_PPU_MASK) ==
-   GLOBAL_STATUS_PPU_POLLING)
+   if ((val & GLOBAL_STATUS_PPU_MASK) == GLOBAL_STATUS_PPU_POLLING)
return 0;
}
 
@@ -663,12 +645,12 @@ out:
 
 static int 

[PATCH net-next 02/10] net: dsa: mv88e6xxx: abstract REG_GLOBAL2

2016-09-29 Thread Vivien Didelot
Similarly to the ports, phys, and Global SMI devices, abstract the SMI
device address of the Global 2 registers in a few g2 static helpers.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx/global2.c   | 78 ++-
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h |  1 -
 2 files changed, 49 insertions(+), 30 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/global2.c 
b/drivers/net/dsa/mv88e6xxx/global2.c
index 99ed028..f31d553 100644
--- a/drivers/net/dsa/mv88e6xxx/global2.c
+++ b/drivers/net/dsa/mv88e6xxx/global2.c
@@ -14,6 +14,28 @@
 #include "mv88e6xxx.h"
 #include "global2.h"
 
+#define ADDR_GLOBAL2   0x1c
+
+static int mv88e6xxx_g2_read(struct mv88e6xxx_chip *chip, int reg, u16 *val)
+{
+   return mv88e6xxx_read(chip, ADDR_GLOBAL2, reg, val);
+}
+
+static int mv88e6xxx_g2_write(struct mv88e6xxx_chip *chip, int reg, u16 val)
+{
+   return mv88e6xxx_write(chip, ADDR_GLOBAL2, reg, val);
+}
+
+static int mv88e6xxx_g2_update(struct mv88e6xxx_chip *chip, int reg, u16 
update)
+{
+   return mv88e6xxx_update(chip, ADDR_GLOBAL2, reg, update);
+}
+
+static int mv88e6xxx_g2_wait(struct mv88e6xxx_chip *chip, int reg, u16 mask)
+{
+   return mv88e6xxx_wait(chip, ADDR_GLOBAL2, reg, mask);
+}
+
 /* Offset 0x06: Device Mapping Table register */
 
 static int mv88e6xxx_g2_device_mapping_write(struct mv88e6xxx_chip *chip,
@@ -21,7 +43,7 @@ static int mv88e6xxx_g2_device_mapping_write(struct 
mv88e6xxx_chip *chip,
 {
u16 val = (target << 8) | (port & 0xf);
 
-   return mv88e6xxx_update(chip, REG_GLOBAL2, GLOBAL2_DEVICE_MAPPING, val);
+   return mv88e6xxx_g2_update(chip, GLOBAL2_DEVICE_MAPPING, val);
 }
 
 static int mv88e6xxx_g2_set_device_mapping(struct mv88e6xxx_chip *chip)
@@ -58,7 +80,7 @@ static int mv88e6xxx_g2_trunk_mask_write(struct 
mv88e6xxx_chip *chip, int num,
if (hask)
val |= GLOBAL2_TRUNK_MASK_HASK;
 
-   return mv88e6xxx_update(chip, REG_GLOBAL2, GLOBAL2_TRUNK_MASK, val);
+   return mv88e6xxx_g2_update(chip, GLOBAL2_TRUNK_MASK, val);
 }
 
 /* Offset 0x08: Trunk Mapping Table register */
@@ -69,7 +91,7 @@ static int mv88e6xxx_g2_trunk_mapping_write(struct 
mv88e6xxx_chip *chip, int id,
const u16 port_mask = BIT(chip->info->num_ports) - 1;
u16 val = (id << 11) | (map & port_mask);
 
-   return mv88e6xxx_update(chip, REG_GLOBAL2, GLOBAL2_TRUNK_MAPPING, val);
+   return mv88e6xxx_g2_update(chip, GLOBAL2_TRUNK_MAPPING, val);
 }
 
 static int mv88e6xxx_g2_clear_trunk(struct mv88e6xxx_chip *chip)
@@ -105,15 +127,15 @@ static int mv88e6xxx_g2_clear_irl(struct mv88e6xxx_chip 
*chip)
/* Init all Ingress Rate Limit resources of all ports */
for (port = 0; port < chip->info->num_ports; ++port) {
/* XXX newer chips (like 88E6390) have different 2-bit ops */
-   err = mv88e6xxx_write(chip, REG_GLOBAL2, GLOBAL2_IRL_CMD,
- GLOBAL2_IRL_CMD_OP_INIT_ALL |
- (port << 8));
+   err = mv88e6xxx_g2_write(chip, GLOBAL2_IRL_CMD,
+GLOBAL2_IRL_CMD_OP_INIT_ALL |
+(port << 8));
if (err)
break;
 
/* Wait for the operation to complete */
-   err = mv88e6xxx_wait(chip, REG_GLOBAL2, GLOBAL2_IRL_CMD,
-GLOBAL2_IRL_CMD_BUSY);
+   err = mv88e6xxx_g2_wait(chip, GLOBAL2_IRL_CMD,
+   GLOBAL2_IRL_CMD_BUSY);
if (err)
break;
}
@@ -128,7 +150,7 @@ static int mv88e6xxx_g2_switch_mac_write(struct 
mv88e6xxx_chip *chip,
 {
u16 val = (pointer << 8) | data;
 
-   return mv88e6xxx_update(chip, REG_GLOBAL2, GLOBAL2_SWITCH_MAC, val);
+   return mv88e6xxx_g2_update(chip, GLOBAL2_SWITCH_MAC, val);
 }
 
 int mv88e6xxx_g2_set_switch_mac(struct mv88e6xxx_chip *chip, u8 *addr)
@@ -151,7 +173,7 @@ static int mv88e6xxx_g2_pot_write(struct mv88e6xxx_chip 
*chip, int pointer,
 {
u16 val = (pointer << 8) | (data & 0x7);
 
-   return mv88e6xxx_update(chip, REG_GLOBAL2, GLOBAL2_PRIO_OVERRIDE, val);
+   return mv88e6xxx_g2_update(chip, GLOBAL2_PRIO_OVERRIDE, val);
 }
 
 static int mv88e6xxx_g2_clear_pot(struct mv88e6xxx_chip *chip)
@@ -174,16 +196,16 @@ static int mv88e6xxx_g2_clear_pot(struct mv88e6xxx_chip 
*chip)
 
 static int mv88e6xxx_g2_eeprom_wait(struct mv88e6xxx_chip *chip)
 {
-   return mv88e6xxx_wait(chip, REG_GLOBAL2, GLOBAL2_EEPROM_CMD,
- GLOBAL2_EEPROM_CMD_BUSY |
- GLOBAL2_EEPROM_CMD_RUNNING);
+   return mv88e6xxx_g2_wait(chip, GLOBAL2_EEPROM_CMD,
+GLOBAL2_EEPROM_CMD_BUSY |
+GLOBAL2_EEPROM_CMD_RUNNING);
 }
 
 static int 

[PATCH net v2] tg3: Avoid NULL pointer dereference in tg3_io_error_detected()

2016-09-29 Thread Guilherme G. Piccoli
From: Milton Miller 

While the driver is probing the adapter, an error may occur before the
netdev structure is allocated and attached to pci_dev. In this case,
not only is netdev unavailable, but the tg3 private structure is also
unavailable, since it is just pointer math from the NULL netdev, so
those dereferences must be skipped.

The following trace is seen when the error is triggered:

  [1.402247] Unable to handle kernel paging request for data at address 
0x1a99
  [1.402410] Faulting instruction address: 0xc07e33f8
  [1.402450] Oops: Kernel access of bad area, sig: 11 [#1]
  [1.402481] SMP NR_CPUS=2048 NUMA PowerNV
  [1.402513] Modules linked in:
  [1.402545] CPU: 0 PID: 651 Comm: eehd Not tainted 4.4.0-36-generic #55-Ubuntu
  [1.402591] task: c01fe4e42a20 ti: c01fe4e88000 task.ti: 
c01fe4e88000
  [1.402742] NIP: c07e33f8 LR: c07e3164 CTR: c0595ea0
  [1.402787] REGS: c01fe4e8b790 TRAP: 0300   Not tainted  (4.4.0-36-generic)
  [1.402832] MSR: 90019033   CR: 28000422  
XER: 2000
  [1.403058] CFAR: c0008468 DAR: 1a99 DSISR: 4200 
SOFTE: 1
  GPR00: c07e3164 c01fe4e8ba10 c15c5e00 
  GPR04: 0001  0039 0299
  GPR08:  0001 c01fe4e88000 0006
  GPR12:  cfb4 c00e6558 c03ca1bffd00
  GPR16:    
  GPR20:    c0d52768
  GPR24: c0d52740 0100 c03ca1b52000 0002
  GPR28: 0900  c152a0c0 c03ca1b52000
  [1.404226] NIP [c07e33f8] tg3_io_error_detected+0x308/0x340
  [1.404265] LR [c07e3164] tg3_io_error_detected+0x74/0x340

This patch avoids the NULL pointer dereference by moving the access after
the netdev NULL pointer check on tg3_io_error_detected(). Also, we add a
check for netdev being NULL on tg3_io_resume() [suggested by Michael Chan].
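
(Why the fault lands at a small data address like 0x1a99: netdev_priv() is
pure pointer arithmetic, roughly

	static inline void *netdev_priv(const struct net_device *dev)
	{
		return (char *)dev + ALIGN(sizeof(struct net_device), NETDEV_ALIGN);
	}

so with netdev == NULL, tp is a small non-NULL value and writing
tp->pcierr_recovery dereferences a bogus address near zero; checking netdev
first avoids touching tp at all.)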

Fixes: 0486a063b1ff ("tg3: prevent ifup/ifdown during PCI error recovery")
Fixes: dfc8f370316b ("net/tg3: Release IRQs on permanent error")
Tested-by: Guilherme G. Piccoli 
Signed-off-by: Milton Miller 
Signed-off-by: Guilherme G. Piccoli 
---

  * v2 changelog: added netdev NULL check on tg3_io_resume() as per
Michael Chan's suggestion.

 drivers/net/ethernet/broadcom/tg3.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c 
b/drivers/net/ethernet/broadcom/tg3.c
index a2551bc..ea967df 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -18122,14 +18122,14 @@ static pci_ers_result_t tg3_io_error_detected(struct 
pci_dev *pdev,
 
rtnl_lock();
 
-   /* We needn't recover from permanent error */
-   if (state == pci_channel_io_frozen)
-   tp->pcierr_recovery = true;
-
/* We probably don't have netdev yet */
if (!netdev || !netif_running(netdev))
goto done;
 
+   /* We needn't recover from permanent error */
+   if (state == pci_channel_io_frozen)
+   tp->pcierr_recovery = true;
+
tg3_phy_stop(tp);
 
tg3_netif_stop(tp);
@@ -18226,7 +18226,7 @@ static void tg3_io_resume(struct pci_dev *pdev)
 
rtnl_lock();
 
-   if (!netif_running(netdev))
+   if (!netdev || !netif_running(netdev))
goto done;
 
tg3_full_lock(tp, 0);
-- 
2.1.0



[PATCH v6 6/7] ipv6: Remove useless parameter in __snmp6_fill_statsdev

2016-09-29 Thread Jia He
The parameter 'items' (always ICMP6_MIB_MAX) is useless for
__snmp6_fill_statsdev.

Signed-off-by: Jia He 
---
 net/ipv6/addrconf.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 2f1f5d4..35d4baa 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4961,18 +4961,18 @@ static inline size_t inet6_if_nlmsg_size(void)
 }
 
 static inline void __snmp6_fill_statsdev(u64 *stats, atomic_long_t *mib,
- int items, int bytes)
+   int bytes)
 {
int i;
-   int pad = bytes - sizeof(u64) * items;
+   int pad = bytes - sizeof(u64) * ICMP6_MIB_MAX;
BUG_ON(pad < 0);
 
/* Use put_unaligned() because stats may not be aligned for u64. */
-   put_unaligned(items, &stats[0]);
-   for (i = 1; i < items; i++)
+   put_unaligned(ICMP6_MIB_MAX, &stats[0]);
+   for (i = 1; i < ICMP6_MIB_MAX; i++)
	put_unaligned(atomic_long_read(&mib[i]), &stats[i]);
 
-   memset(&stats[items], 0, pad);
+   memset(&stats[ICMP6_MIB_MAX], 0, pad);
 }
 
 static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib,
@@ -5005,7 +5005,7 @@ static void snmp6_fill_stats(u64 *stats, struct inet6_dev 
*idev, int attrtype,
 offsetof(struct ipstats_mib, syncp));
break;
case IFLA_INET6_ICMP6STATS:
-   __snmp6_fill_statsdev(stats, idev->stats.icmpv6dev->mibs, 
ICMP6_MIB_MAX, bytes);
+   __snmp6_fill_statsdev(stats, idev->stats.icmpv6dev->mibs, 
bytes);
break;
}
 }
-- 
2.5.5



[PATCH v6 3/7] proc: Reduce cache miss in snmp6_seq_show

2016-09-29 Thread Jia He
This is to use the generic interfaces snmp_get_cpu_field{,64}_batch to
aggregate the data by going through all the items of each cpu sequentially.

Signed-off-by: Jia He 
---
 net/ipv6/proc.c | 30 ++
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/net/ipv6/proc.c b/net/ipv6/proc.c
index 679253d0..cc8e3ae 100644
--- a/net/ipv6/proc.c
+++ b/net/ipv6/proc.c
@@ -30,6 +30,11 @@
 #include 
 #include 
 
+#define MAX4(a, b, c, d) \
+   max_t(u32, max_t(u32, a, b), max_t(u32, c, d))
+#define SNMP_MIB_MAX MAX4(UDP_MIB_MAX, TCP_MIB_MAX, \
+   IPSTATS_MIB_MAX, ICMP_MIB_MAX)
+
 static int sockstat6_seq_show(struct seq_file *seq, void *v)
 {
struct net *net = seq->private;
@@ -191,25 +196,34 @@ static void snmp6_seq_show_item(struct seq_file *seq, 
void __percpu *pcpumib,
atomic_long_t *smib,
const struct snmp_mib *itemlist)
 {
+   unsigned long buff[SNMP_MIB_MAX];
int i;
-   unsigned long val;
 
-   for (i = 0; itemlist[i].name; i++) {
-   val = pcpumib ?
-   snmp_fold_field(pcpumib, itemlist[i].entry) :
-   atomic_long_read(smib + itemlist[i].entry);
-   seq_printf(seq, "%-32s\t%lu\n", itemlist[i].name, val);
+   if (pcpumib) {
+   memset(buff, 0, sizeof(unsigned long) * SNMP_MIB_MAX);
+
+   snmp_get_cpu_field_batch(buff, itemlist, pcpumib);
+   for (i = 0; itemlist[i].name; i++)
+   seq_printf(seq, "%-32s\t%lu\n",
+  itemlist[i].name, buff[i]);
+   } else {
+   for (i = 0; itemlist[i].name; i++)
+   seq_printf(seq, "%-32s\t%lu\n", itemlist[i].name,
+  atomic_long_read(smib + itemlist[i].entry));
}
 }
 
 static void snmp6_seq_show_item64(struct seq_file *seq, void __percpu *mib,
  const struct snmp_mib *itemlist, size_t 
syncpoff)
 {
+   u64 buff64[SNMP_MIB_MAX];
int i;
 
+   memset(buff64, 0, sizeof(unsigned long) * SNMP_MIB_MAX);
+
+   snmp_get_cpu_field64_batch(buff64, itemlist, mib, syncpoff);
for (i = 0; itemlist[i].name; i++)
-   seq_printf(seq, "%-32s\t%llu\n", itemlist[i].name,
-  snmp_fold_field64(mib, itemlist[i].entry, syncpoff));
+   seq_printf(seq, "%-32s\t%llu\n", itemlist[i].name, buff64[i]);
 }
 
 static int snmp6_seq_show(struct seq_file *seq, void *v)
-- 
2.5.5



[PATCH v6 0/7] Reduce cache miss for snmp_fold_field

2016-09-29 Thread Jia He
On a PowerPC server with a large number of CPUs (160), besides commit
a3a773726c9f ("net: Optimize snmp stat aggregation by walking all
the percpu data at once"), I observed several other snmp_fold_field
callsites which cause a high cache-miss rate.
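
As a rough illustration of the change (not actual kernel code): the per-item
fold walks every cpu once per counter, while the batch interfaces walk each
cpu's per-cpu block once for all counters, which is much friendlier to the
cache:

	/* before: outer loop over counters, inner loop over cpus */
	for (i = 0; itemlist[i].name; i++)
		for_each_possible_cpu(c)
			buff[i] += snmp_get_cpu_field(mib, c, itemlist[i].entry);

	/* after: outer loop over cpus, inner loop over counters */
	for_each_possible_cpu(c)
		for (i = 0; itemlist[i].name; i++)
			buff[i] += snmp_get_cpu_field(mib, c, itemlist[i].entry);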

test source code:

My simple test case, which read from the procfs items endlessly:
/***/
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#define LINELEN  2560
int main(int argc, char **argv)
{
int i;
int fd = -1 ;
int rdsize = 0;
char buf[LINELEN+1];

buf[LINELEN] = 0;
memset(buf,0,LINELEN);

if(1 >= argc) {
printf("file name empty\n");
return -1;
}

fd = open(argv[1], O_RDWR, 0644);
if(0 > fd){
printf("open error\n");
return -2;
}

for(i=0;i<0x;i++) {
while(0 < (rdsize = read(fd,buf,LINELEN))){
//nothing here
}

lseek(fd, 0, SEEK_SET);
}

close(fd);
return 0;
}
/**/

compile and run:

gcc test.c -o test

perf stat -d -e cache-misses ./test /proc/net/snmp
perf stat -d -e cache-misses ./test /proc/net/snmp6
perf stat -d -e cache-misses ./test /proc/net/sctp/snmp
perf stat -d -e cache-misses ./test /proc/net/xfrm_stat

before the patch set:

 Performance counter stats for 'system wide':

 355911097  cache-misses                                                 [40.08%]
2356829300  L1-dcache-loads                                              [60.04%]
 355642645  L1-dcache-load-misses     #   15.09% of all L1-dcache hits   [60.02%]
 346544541  LLC-loads                                                    [59.97%]
    389763  LLC-load-misses           #    0.11% of all LL-cache hits    [40.02%]

   6.245162638 seconds time elapsed

After the patch set:
===
 Performance counter stats for 'system wide':

 194992476  cache-misses                                                 [40.03%]
6718051877  L1-dcache-loads                                              [60.07%]
 194871921  L1-dcache-load-misses     #    2.90% of all L1-dcache hits   [60.11%]
 187632232  LLC-loads                                                    [60.04%]
    464466  LLC-load-misses           #    0.25% of all LL-cache hits    [39.89%]

   6.868422769 seconds time elapsed
The cache-miss rate can be reduced from 15% to 2.9%

changelog
=
v6:
- correct v5 
v5:
- order local variables from longest to shortest line
v4:
- move memset into one block of if statement in snmp6_seq_show_item
- remove the changes in netstat_seq_show considerred the stack usage is too 
large
v3:
- introduce generic interface (suggested by Marcelo Ricardo Leitner)
- use max_t instead of self defined macro (suggested by David Miller)
v2:
- fix bug in udplite statistics.
- snmp_seq_show is split into 2 parts

Jia He (7):
  net:snmp: Introduce generic interfaces for snmp_get_cpu_field{,64}
  proc: Reduce cache miss in snmp_seq_show
  proc: Reduce cache miss in snmp6_seq_show
  proc: Reduce cache miss in sctp_snmp_seq_show
  proc: Reduce cache miss in xfrm_statistics_seq_show
  ipv6: Remove useless parameter in __snmp6_fill_statsdev
  net: Suppress the "Comparison to NULL could be written" warnings

 include/net/ip.h |  23 
 net/ipv4/proc.c  | 102 +++
 net/ipv6/addrconf.c  |  12 +++---
 net/ipv6/proc.c  |  30 +++
 net/sctp/proc.c  |  10 +++--
 net/xfrm/xfrm_proc.c |  10 -
 6 files changed, 129 insertions(+), 58 deletions(-)

-- 
2.5.5



[PATCH v6 7/7] net: Suppress the "Comparison to NULL could be written" warnings

2016-09-29 Thread Jia He
This is to suppress the checkpatch.pl warning "Comparison to NULL
could be written". No functional changes here.

Signed-off-by: Jia He 
---
 net/ipv4/proc.c | 32 
 net/sctp/proc.c |  2 +-
 2 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 9d7a39a..b39faf6 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -357,22 +357,22 @@ static void icmp_put(struct seq_file *seq)
atomic_long_t *ptr = net->mib.icmpmsg_statistics->mibs;
 
seq_puts(seq, "\nIcmp: InMsgs InErrors InCsumErrors");
-   for (i = 0; icmpmibmap[i].name != NULL; i++)
+   for (i = 0; icmpmibmap[i].name; i++)
seq_printf(seq, " In%s", icmpmibmap[i].name);
seq_puts(seq, " OutMsgs OutErrors");
-   for (i = 0; icmpmibmap[i].name != NULL; i++)
+   for (i = 0; icmpmibmap[i].name; i++)
seq_printf(seq, " Out%s", icmpmibmap[i].name);
seq_printf(seq, "\nIcmp: %lu %lu %lu",
snmp_fold_field(net->mib.icmp_statistics, ICMP_MIB_INMSGS),
snmp_fold_field(net->mib.icmp_statistics, ICMP_MIB_INERRORS),
snmp_fold_field(net->mib.icmp_statistics, ICMP_MIB_CSUMERRORS));
-   for (i = 0; icmpmibmap[i].name != NULL; i++)
+   for (i = 0; icmpmibmap[i].name; i++)
seq_printf(seq, " %lu",
   atomic_long_read(ptr + icmpmibmap[i].index));
seq_printf(seq, " %lu %lu",
snmp_fold_field(net->mib.icmp_statistics, ICMP_MIB_OUTMSGS),
snmp_fold_field(net->mib.icmp_statistics, ICMP_MIB_OUTERRORS));
-   for (i = 0; icmpmibmap[i].name != NULL; i++)
+   for (i = 0; icmpmibmap[i].name; i++)
seq_printf(seq, " %lu",
   atomic_long_read(ptr + (icmpmibmap[i].index | 
0x100)));
 }
@@ -389,7 +389,7 @@ static int snmp_seq_show_ipstats(struct seq_file *seq, void 
*v)
memset(buff64, 0, IPSTATS_MIB_MAX * sizeof(u64));
 
seq_puts(seq, "Ip: Forwarding DefaultTTL");
-   for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
+   for (i = 0; snmp4_ipstats_list[i].name; i++)
seq_printf(seq, " %s", snmp4_ipstats_list[i].name);
 
seq_printf(seq, "\nIp: %d %d",
@@ -400,7 +400,7 @@ static int snmp_seq_show_ipstats(struct seq_file *seq, void 
*v)
snmp_get_cpu_field64_batch(buff64, snmp4_ipstats_list,
   net->mib.ip_statistics,
   offsetof(struct ipstats_mib, syncp));
-   for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
+   for (i = 0; snmp4_ipstats_list[i].name; i++)
seq_printf(seq, " %llu", buff64[i]);
 
return 0;
@@ -415,13 +415,13 @@ static int snmp_seq_show_tcp_udp(struct seq_file *seq, 
void *v)
memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
 
seq_puts(seq, "\nTcp:");
-   for (i = 0; snmp4_tcp_list[i].name != NULL; i++)
+   for (i = 0; snmp4_tcp_list[i].name; i++)
seq_printf(seq, " %s", snmp4_tcp_list[i].name);
 
seq_puts(seq, "\nTcp:");
snmp_get_cpu_field_batch(buff, snmp4_tcp_list,
 net->mib.tcp_statistics);
-   for (i = 0; snmp4_tcp_list[i].name != NULL; i++) {
+   for (i = 0; snmp4_tcp_list[i].name; i++) {
/* MaxConn field is signed, RFC 2012 */
if (snmp4_tcp_list[i].entry == TCP_MIB_MAXCONN)
seq_printf(seq, " %ld", buff[i]);
@@ -434,10 +434,10 @@ static int snmp_seq_show_tcp_udp(struct seq_file *seq, 
void *v)
snmp_get_cpu_field_batch(buff, snmp4_udp_list,
 net->mib.udp_statistics);
seq_puts(seq, "\nUdp:");
-   for (i = 0; snmp4_udp_list[i].name != NULL; i++)
+   for (i = 0; snmp4_udp_list[i].name; i++)
seq_printf(seq, " %s", snmp4_udp_list[i].name);
seq_puts(seq, "\nUdp:");
-   for (i = 0; snmp4_udp_list[i].name != NULL; i++)
+   for (i = 0; snmp4_udp_list[i].name; i++)
seq_printf(seq, " %lu", buff[i]);
 
memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
@@ -446,10 +446,10 @@ static int snmp_seq_show_tcp_udp(struct seq_file *seq, 
void *v)
seq_puts(seq, "\nUdpLite:");
snmp_get_cpu_field_batch(buff, snmp4_udp_list,
 net->mib.udplite_statistics);
-   for (i = 0; snmp4_udp_list[i].name != NULL; i++)
+   for (i = 0; snmp4_udp_list[i].name; i++)
seq_printf(seq, " %s", snmp4_udp_list[i].name);
seq_puts(seq, "\nUdpLite:");
-   for (i = 0; snmp4_udp_list[i].name != NULL; i++)
+   for (i = 0; snmp4_udp_list[i].name; i++)
seq_printf(seq, " %lu", buff[i]);
 
seq_putc(seq, '\n');
@@ -492,21 +492,21 @@ static int netstat_seq_show(struct seq_file *seq, void *v)
struct net *net = seq->private;
 

[PATCH v6 1/7] net:snmp: Introduce generic interfaces for snmp_get_cpu_field{,64}

2016-09-29 Thread Jia He
This introduces the generic interfaces snmp_get_cpu_field{,64}_batch.
They swap the order of the two for-loops used to collect the percpu
statistics, so the data is aggregated by walking all the items of each
cpu sequentially.

Signed-off-by: Jia He 
Suggested-by: Marcelo Ricardo Leitner 
---
 include/net/ip.h | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/include/net/ip.h b/include/net/ip.h
index 9742b92..bc43c0f 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -219,6 +219,29 @@ static inline u64 snmp_fold_field64(void __percpu *mib, 
int offt, size_t syncp_o
 }
 #endif
 
+#define snmp_get_cpu_field64_batch(buff64, stats_list, mib_statistic, offset) \
+{ \
+   int i, c; \
+   for_each_possible_cpu(c) { \
+   for (i = 0; stats_list[i].name; i++) \
+   buff64[i] += snmp_get_cpu_field64( \
+   mib_statistic, \
+   c, stats_list[i].entry, \
+   offset); \
+   } \
+}
+
+#define snmp_get_cpu_field_batch(buff, stats_list, mib_statistic) \
+{ \
+   int i, c; \
+   for_each_possible_cpu(c) { \
+   for (i = 0; stats_list[i].name; i++) \
+   buff[i] += snmp_get_cpu_field( \
+   mib_statistic, \
+   c, stats_list[i].entry); \
+   } \
+}
+
 void inet_get_local_port_range(struct net *net, int *low, int *high);
 
 #ifdef CONFIG_SYSCTL
-- 
2.5.5



[PATCH v6 5/7] proc: Reduce cache miss in xfrm_statistics_seq_show

2016-09-29 Thread Jia He
This is to use the generic interfaces snmp_get_cpu_field{,64}_batch to
aggregate the data by going through all the items of each cpu sequentially.

Signed-off-by: Jia He 
---
 net/xfrm/xfrm_proc.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_proc.c b/net/xfrm/xfrm_proc.c
index 9c4fbd8..ba2b539 100644
--- a/net/xfrm/xfrm_proc.c
+++ b/net/xfrm/xfrm_proc.c
@@ -50,12 +50,18 @@ static const struct snmp_mib xfrm_mib_list[] = {
 
 static int xfrm_statistics_seq_show(struct seq_file *seq, void *v)
 {
+   unsigned long buff[LINUX_MIB_XFRMMAX];
struct net *net = seq->private;
int i;
+
+   memset(buff, 0, sizeof(unsigned long) * LINUX_MIB_XFRMMAX);
+
+   snmp_get_cpu_field_batch(buff, xfrm_mib_list,
+net->mib.xfrm_statistics);
for (i = 0; xfrm_mib_list[i].name; i++)
seq_printf(seq, "%-24s\t%lu\n", xfrm_mib_list[i].name,
-  snmp_fold_field(net->mib.xfrm_statistics,
-  xfrm_mib_list[i].entry));
+   buff[i]);
+
return 0;
 }
 
-- 
2.5.5



[PATCH v6 2/7] proc: Reduce cache miss in snmp_seq_show

2016-09-29 Thread Jia He
This is to use the generic interfaces snmp_get_cpu_field{,64}_batch to
aggregate the data by going through all the items of each cpu sequentially.
Then snmp_seq_show is split into 2 parts to avoid build warning "the frame
size" larger than 1024.

Signed-off-by: Jia He 
---
 net/ipv4/proc.c | 70 ++---
 1 file changed, 47 insertions(+), 23 deletions(-)

diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 9f665b6..9d7a39a 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -46,6 +46,8 @@
 #include 
 #include 
 
+#define TCPUDP_MIB_MAX max_t(u32, UDP_MIB_MAX, TCP_MIB_MAX)
+
 /*
  * Report socket allocation statistics [m...@utu.fi]
  */
@@ -378,13 +380,15 @@ static void icmp_put(struct seq_file *seq)
 /*
  * Called from the PROCfs module. This outputs /proc/net/snmp.
  */
-static int snmp_seq_show(struct seq_file *seq, void *v)
+static int snmp_seq_show_ipstats(struct seq_file *seq, void *v)
 {
-   int i;
struct net *net = seq->private;
+   u64 buff64[IPSTATS_MIB_MAX];
+   int i;
 
-   seq_puts(seq, "Ip: Forwarding DefaultTTL");
+   memset(buff64, 0, IPSTATS_MIB_MAX * sizeof(u64));
 
+   seq_puts(seq, "Ip: Forwarding DefaultTTL");
for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
seq_printf(seq, " %s", snmp4_ipstats_list[i].name);
 
@@ -393,57 +397,77 @@ static int snmp_seq_show(struct seq_file *seq, void *v)
   net->ipv4.sysctl_ip_default_ttl);
 
BUILD_BUG_ON(offsetof(struct ipstats_mib, mibs) != 0);
+   snmp_get_cpu_field64_batch(buff64, snmp4_ipstats_list,
+  net->mib.ip_statistics,
+  offsetof(struct ipstats_mib, syncp));
for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
-   seq_printf(seq, " %llu",
-  snmp_fold_field64(net->mib.ip_statistics,
-snmp4_ipstats_list[i].entry,
-offsetof(struct ipstats_mib, 
syncp)));
+   seq_printf(seq, " %llu", buff64[i]);
 
-   icmp_put(seq);  /* RFC 2011 compatibility */
-   icmpmsg_put(seq);
+   return 0;
+}
+
+static int snmp_seq_show_tcp_udp(struct seq_file *seq, void *v)
+{
+   unsigned long buff[TCPUDP_MIB_MAX];
+   struct net *net = seq->private;
+   int i;
+
+   memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
 
seq_puts(seq, "\nTcp:");
for (i = 0; snmp4_tcp_list[i].name != NULL; i++)
seq_printf(seq, " %s", snmp4_tcp_list[i].name);
 
seq_puts(seq, "\nTcp:");
+   snmp_get_cpu_field_batch(buff, snmp4_tcp_list,
+net->mib.tcp_statistics);
for (i = 0; snmp4_tcp_list[i].name != NULL; i++) {
/* MaxConn field is signed, RFC 2012 */
if (snmp4_tcp_list[i].entry == TCP_MIB_MAXCONN)
-   seq_printf(seq, " %ld",
-  snmp_fold_field(net->mib.tcp_statistics,
-  snmp4_tcp_list[i].entry));
+   seq_printf(seq, " %ld", buff[i]);
else
-   seq_printf(seq, " %lu",
-  snmp_fold_field(net->mib.tcp_statistics,
-  snmp4_tcp_list[i].entry));
+   seq_printf(seq, " %lu", buff[i]);
}
 
+   memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
+
+   snmp_get_cpu_field_batch(buff, snmp4_udp_list,
+net->mib.udp_statistics);
seq_puts(seq, "\nUdp:");
for (i = 0; snmp4_udp_list[i].name != NULL; i++)
seq_printf(seq, " %s", snmp4_udp_list[i].name);
-
seq_puts(seq, "\nUdp:");
for (i = 0; snmp4_udp_list[i].name != NULL; i++)
-   seq_printf(seq, " %lu",
-  snmp_fold_field(net->mib.udp_statistics,
-  snmp4_udp_list[i].entry));
+   seq_printf(seq, " %lu", buff[i]);
+
+   memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
 
/* the UDP and UDP-Lite MIBs are the same */
seq_puts(seq, "\nUdpLite:");
+   snmp_get_cpu_field_batch(buff, snmp4_udp_list,
+net->mib.udplite_statistics);
for (i = 0; snmp4_udp_list[i].name != NULL; i++)
seq_printf(seq, " %s", snmp4_udp_list[i].name);
-
seq_puts(seq, "\nUdpLite:");
for (i = 0; snmp4_udp_list[i].name != NULL; i++)
-   seq_printf(seq, " %lu",
-  snmp_fold_field(net->mib.udplite_statistics,
-  snmp4_udp_list[i].entry));
+   seq_printf(seq, " %lu", buff[i]);
 
seq_putc(seq, '\n');
return 0;
 }
 
+static int 

[PATCH v6 4/7] proc: Reduce cache miss in sctp_snmp_seq_show

2016-09-29 Thread Jia He
This is to use the generic interfaces snmp_get_cpu_field{,64}_batch to
aggregate the data by going through all the items of each cpu sequentially.

Signed-off-by: Jia He 
---
 net/sctp/proc.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/sctp/proc.c b/net/sctp/proc.c
index ef8ba77..09e16c2 100644
--- a/net/sctp/proc.c
+++ b/net/sctp/proc.c
@@ -73,13 +73,17 @@ static const struct snmp_mib sctp_snmp_list[] = {
 /* Display sctp snmp mib statistics(/proc/net/sctp/snmp). */
 static int sctp_snmp_seq_show(struct seq_file *seq, void *v)
 {
+   unsigned long buff[SCTP_MIB_MAX];
struct net *net = seq->private;
int i;
 
+   memset(buff, 0, sizeof(unsigned long) * SCTP_MIB_MAX);
+
+   snmp_get_cpu_field_batch(buff, sctp_snmp_list,
+net->sctp.sctp_statistics);
for (i = 0; sctp_snmp_list[i].name != NULL; i++)
seq_printf(seq, "%-32s\t%ld\n", sctp_snmp_list[i].name,
-  snmp_fold_field(net->sctp.sctp_statistics,
- sctp_snmp_list[i].entry));
+   buff[i]);
 
return 0;
 }
-- 
2.5.5



Re: [PATCH V2 for-next 0/8] Bug Fixes and Code Improvement in HNS driver

2016-09-29 Thread David Miller
From: Salil Mehta 
Date: Thu, 29 Sep 2016 18:09:08 +0100

> This patch-set introduces fixes for some bugs, potential problems
> and code improvements identified during internal review and
> testing of the Hisilicon Network Subsystem driver.
> 
> Submit Change
> V1->V2: This addresses the feedbacks provided by David Miller
> and Doug Ledford

So Doug my understanding is if this makes it through review
this is going to be merged into your tree, you prepare a
branch for me, and then I pull from that?

Thanks in advance.


Re: [PATCH v1] mlx4: remove unused fields

2016-09-29 Thread David Miller
From: David Decotigny 
Date: Wed, 28 Sep 2016 11:00:04 -0700

> From: David Decotigny 
> 
> This also can address following UBSAN warnings:
> [   36.640343] 
> 
> [   36.648772] UBSAN: Undefined behaviour in 
> drivers/net/ethernet/mellanox/mlx4/fw.c:857:26
> [   36.656853] shift exponent 64 is too large for 32-bit type 'int'
> [   36.663348] 
> 
> [   36.671783] 
> 
> [   36.680213] UBSAN: Undefined behaviour in 
> drivers/net/ethernet/mellanox/mlx4/fw.c:861:27
> [   36.688297] shift exponent 35 is too large for 32-bit type 'int'
> [   36.694702] 
> 
> 
> Tested:
>   reboot with UBSAN, no warning.
> 
> Signed-off-by: David Decotigny 

Applied to net-next, thanks.


Re: [PATCH v8 0/8] thunderbolt: Introducing Thunderbolt(TM) Networking

2016-09-29 Thread David Miller
From: Amir Levy 
Date: Wed, 28 Sep 2016 17:44:22 +0300

> This driver enables Thunderbolt Networking on non-Apple platforms
> running Linux.

Greg, any idea where this should get merged once fully vetted?  I can
take it through the net-next tree, but I'm fine with another more
appropriate tree taking it as well.

Thanks!


Re: [PATCH net v2] L2TP:Adjust intf MTU,factor underlay L3,overlay L2

2016-09-29 Thread R. Parameswaran

Hi James,

On Thu, 29 Sep 2016, James Chapman wrote:

> On 22/09/16 21:52, R. Parameswaran wrote:
> > From ed585bdd6d3d2b3dec58d414f514cd764d89159d Mon Sep 17 00:00:00 2001
> > From: "R. Parameswaran" 
> > Date: Thu, 22 Sep 2016 13:19:25 -0700
> > Subject: [PATCH] L2TP:Adjust intf MTU,factor underlay L3,overlay L2
> >
> > Take into account all of the tunnel encapsulation headers when setting
> > up the MTU on the L2TP logical interface device. Otherwise, packets
> > created by the applications on top of the L2TP layer are larger
> > than they ought to be, relative to the underlay MTU, leading to
> > needless fragmentation once the outer IP encap is added.
> >
> > Specifically, take into account the (outer, underlay) IP header
> > imposed on the encapsulated L2TP packet, and the Layer 2 header
> > imposed on the inner IP packet prior to L2TP encapsulation.
> >
> > Do not assume an Ethernet (non-jumbo) underlay. Use the PMTU mechanism
> > and the dst entry in the L2TP tunnel socket to directly pull up
> > the underlay MTU (as the baseline number on top of which the
> > encapsulation headers are factored in).  Fall back to Ethernet MTU
> > if this fails.
> >
> > Signed-off-by: R. Parameswaran 
> >
> > Reviewed-by: "N. Prachanda" ,
> > Reviewed-by: "R. Shearman" ,
> > Reviewed-by: "D. Fawcus" 
> > ---
> >  net/l2tp/l2tp_eth.c | 48 
> >  1 file changed, 44 insertions(+), 4 deletions(-)
> >
> > diff --git a/net/l2tp/l2tp_eth.c b/net/l2tp/l2tp_eth.c
> > index 57fc5a4..dbcd6bd 100644
> > --- a/net/l2tp/l2tp_eth.c
> > +++ b/net/l2tp/l2tp_eth.c
> > @@ -30,6 +30,9 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> > +#include 
> > +#include 
> >  
> >  #include "l2tp_core.h"
> >  
> > @@ -206,6 +209,46 @@ static void l2tp_eth_show(struct seq_file *m, void 
> > *arg)
> >  }
> >  #endif
> >  
> > +static void l2tp_eth_adjust_mtu(struct l2tp_tunnel *tunnel,
> > +   struct l2tp_session *session,
> > +   struct net_device *dev)
> > +{
> > +   unsigned int overhead = 0;
> > +   struct dst_entry *dst;
> > +
> > +   if (session->mtu != 0) {
> > +   dev->mtu = session->mtu;
> > +   dev->needed_headroom += session->hdr_len;
> > +   if (tunnel->encap == L2TP_ENCAPTYPE_UDP)
> > +   dev->needed_headroom += sizeof(struct udphdr);
> > +   return;
> > +   }
> > +   overhead = session->hdr_len;
> > +   /* Adjust MTU, factor overhead - underlay L3 hdr, overlay L2 hdr*/
> > +   if (tunnel->sock->sk_family == AF_INET)
> > +   overhead += (ETH_HLEN + sizeof(struct iphdr));
> > +   else if (tunnel->sock->sk_family == AF_INET6)
> > +   overhead += (ETH_HLEN + sizeof(struct ipv6hdr));
> What about options in the IP header? If certain options are set on the
> socket, the IP header may be larger.
> 

Thanks for the reply - it looks like IP options can only be 
enabled through setsockopt on an application's socket (if there's any 
other way to turn on IP options, please let me know - I didn't see any 
sysctl setting for transmit). This scenario would come 
into the picture when an application opens a raw IP or UDP socket such that it 
routes into the L2TP logical interface.
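To illustrate the per-socket nature of this - a minimal user-space sketch,
not taken from the patch, assuming <sys/socket.h>, <netinet/in.h> and
<netinet/ip.h> and an already-opened socket fd:

	/* Enable 4 bytes of IP options on this one socket only; other
	 * sockets and the L2TP interface MTU are unaffected.
	 */
	unsigned char opts[4] = { IPOPT_NOP, IPOPT_NOP, IPOPT_NOP, IPOPT_EOL };

	if (setsockopt(fd, IPPROTO_IP, IP_OPTIONS, opts, sizeof(opts)) < 0)
		perror("setsockopt(IP_OPTIONS)");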

If you take the case of a plain IP (ethernet) interface, even if an
application opened a socket turning on IP options, it would not change
the MTU of the underlying interface, and it would not affect other 
applications transacting packets on the same interface. I know it's not an 
exact parallel to this case, but since the IP option control is per 
application, we probably should not factor it into the L2TP logical interface?
We cannot affect other applications/processes running on the same L2TP 
tunnel. Also, since the application using IP options knows that it has turned 
on IP options, maybe we can count on it to factor the size of the options 
into the size of the payload it sends into the socket, or to set the MTU on the 
L2TP interface through config? 

Other than this, I don't see keepalives or anything else where the 
kernel would source its own packet into the L2TP interface, outside of 
an application-injected packet - if there is something like that, please
let me know. The user space L2TP daemon would probably fall into the 
category of applications.

thanks,

Ramkumar 


> > +   /* Additionally, if the encap is UDP, account for UDP header size */
> > +   if (tunnel->encap == L2TP_ENCAPTYPE_UDP)
> > +   overhead += sizeof(struct udphdr);
> > +   /* If PMTU discovery was enabled, use discovered MTU on L2TP device */
> > +   dst = sk_dst_get(tunnel->sock);
> > +   if (dst) {
> > +   u32 pmtu = dst_mtu(dst);
> > +
> > +   if (pmtu != 0)
> > +   dev->mtu = pmtu;
> > +   dst_release(dst);
> > +   }
> > +   /* else (no PMTUD) L2TP dev MTU defaulted 

Re: [PATCH] ipv6 addrconf: implement RFC7559 router solicitation backoff

2016-09-29 Thread David Miller
From: Maciej Żenczykowski 
Date: Tue, 27 Sep 2016 23:57:58 -0700

> From: Maciej Żenczykowski 
> 
> This implements:
>   https://tools.ietf.org/html/rfc7559
> 
> Backoff is performed according to RFC3315 section 14:
>   https://tools.ietf.org/html/rfc3315#section-14
> 
> We allow setting /proc/sys/net/ipv6/conf/*/router_solicitations
> to a negative value meaning an unlimited number of retransmits,
> and we make this the new default (inline with the RFC).
> 
> We also add a new setting:
>   /proc/sys/net/ipv6/conf/*/router_solicitation_max_interval
> defaulting to 1 hour (per RFC recommendation).
> 
> Signed-off-by: Maciej Żenczykowski 
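For reference, the RFC 3315 section 14 retransmission backoff that the patch
follows looks roughly like this - an illustrative sketch only, not the kernel
code; rand_factor stands for a uniform random value in [-0.1, 0.1]:

	/* Compute the next router solicitation timeout; rt == 0 means this
	 * is the first transmission.
	 */
	static double next_rs_timeout(double rt, double irt, double mrt,
				      double rand_factor)
	{
		if (rt == 0)
			return irt + rand_factor * irt;	/* initial RT */
		rt = 2 * rt + rand_factor * rt;		/* exponential backoff */
		if (mrt > 0 && rt > mrt)
			rt = mrt + rand_factor * mrt;	/* cap at MRT */
		return rt;
	}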

Applied to net-next, thanks.


Re: [PATCH v2 3/3] net: make net namespace sysctls belong to container's owner

2016-09-29 Thread David Miller
From: Dmitry Torokhov 
Date: Thu, 29 Sep 2016 08:46:05 -0700

> Hi David,
> 
> On Wed, Aug 10, 2016 at 2:36 PM, Dmitry Torokhov
>  wrote:
>> If net namespace is attached to a user namespace let's make container's
>> root owner of sysctls affecting said network namespace instead of global
>> root.
>>
>> This also allows us to clean up net_ctl_permissions() because we do not
>> need to fudge permissions anymore for the container's owner since it now
>> owns the objects in question.
>>
>> Acked-by: "Eric W. Biederman" 
>> Signed-off-by: Dmitry Torokhov 
> 
> I was looking at linux-next today, and I noticed that, when you merged
> my patch, you basically reverted the following commit:
> 
> commit d6e0d306449bcb5fa3c80e7a3edf11d45abf9ae9
> Author: Tyler Hicks 
> Date:   Thu Jun 2 23:43:22 2016 -0500
> 
> net: Use ns_capable_noaudit() when determining net sysctl permissions

Please send me a fixup patch for this, sorry.


Re: [PATCH RFC net-next] bnx2x: avoid printing unnecessary messages during register dump

2016-09-29 Thread David Miller
From: "Guilherme G. Piccoli" 
Date: Thu, 29 Sep 2016 13:19:39 -0300

> David, thanks for your comment. I confess I didn't understand your
> statement quite well. You say we shouldn't dump registers that will
> cause timeouts, that's it?

Yes, basically.

If this happened infrequently for one or two registers maybe
that would be OK, but this seems to time out on many registers,
especially if the device is up and operational during the
"ethtool -d", and that's a bit much.


Re: [PATCH net-next 00/10] net: dsa: mv88e6xxx: Global (1) cosmetics

2016-09-29 Thread David Miller
From: Vivien Didelot 
Date: Thu, 29 Sep 2016 12:21:52 -0400

> The Global (1) internal SMI device of Marvell switches is a set of
> registers providing support to different units for MAC addresses (ATU),
> VLANs (VTU), PHY polling (PPU), etc.
 ...

Looks like a very nice set of cleanups to me.

Series applied, thanks!


Re: pull-request: wireless-drivers-next 2016-09-29

2016-09-29 Thread David Miller
From: Kalle Valo 
Date: Thu, 29 Sep 2016 19:57:28 +0300

> this should be the last wireless-drivers-next pull request for 4.9, from
> now on only important bugfixes. Nothing really special stands out,
> iwlwifi being most active but other drivers also getting attention. More
> details in the signed tag. Please let me know if there are any problems.

Pulled, thanks Kalle.

> Or actually I had one problem. While doing a test merge I noticed that
> net-next fails to compile for me, but I don't think this is anything
> wireless related:
> 
>   CC  net/netfilter/core.o
> net/netfilter/core.c: In function 'nf_set_hooks_head':
> net/netfilter/core.c:96:149: error: 'struct net_device' has no member named 
> 'nf_hooks_ingress'

Yes, I am aware of this build issue and will tackle it myself if someone
doesn't beat me to it.

Thanks again.


Re: [PATCH net v2] tg3: Avoid NULL pointer dereference in tg3_io_error_detected()

2016-09-29 Thread David Miller
From: "Guilherme G. Piccoli" 
Date: Thu, 29 Sep 2016 13:24:08 -0300

> From: Milton Miller 
> 
> While the driver is probing the adapter, an error may occur before the
> netdev structure is allocated and attached to pci_dev. In this case,
> not only is netdev not available, but the tg3 private structure is also
> not available, as it is just math from the NULL pointer, so dereferences
> must be skipped.
> 
> The following trace is seen when the error is triggered:
 ...
> This patch avoids the NULL pointer dereference by moving the access after
> the netdev NULL pointer check on tg3_io_error_detected(). Also, we add a
> check for netdev being NULL on tg3_io_resume() [suggested by Michael Chan].
> 
> Fixes: 0486a063b1ff ("tg3: prevent ifup/ifdown during PCI error recovery")
> Fixes: dfc8f370316b ("net/tg3: Release IRQs on permanent error")
> Tested-by: Guilherme G. Piccoli 
> Signed-off-by: Milton Miller 
> Signed-off-by: Guilherme G. Piccoli 
> ---
> 
>   * v2 changelog: added netdev NULL check on tg3_io_resume() as per
> Michael Chan's suggestion.

Applied and queued up for -stable, thanks!


Re: [PATCH net-next 0/6] rxrpc: Fixes and adjustments

2016-09-29 Thread David Miller
From: David Howells <dhowe...@redhat.com>
Date: Thu, 29 Sep 2016 23:15:27 +0100

> This set of patches contains some fixes and adjustments:
 ...
> The patches can be found here also:
> 
>   
> http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite
> 
> Tagged thusly:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
>   rxrpc-rewrite-20160929

Pulled, thanks David.


[PATCH V2 for-next 8/8] net: hns: delete redundant broadcast packet filter process

2016-09-29 Thread Salil Mehta
From: Daode Huang 

Broadcast packets are now filtered in the hardware, so this processing
is no longer needed in the driver; just delete it.

Signed-off-by: Daode Huang 
Reviewed-by: Yisen Zhuang 
Signed-off-by: Salil Mehta 
---
 drivers/net/ethernet/hisilicon/hns/hns_enet.c |   10 --
 1 file changed, 10 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c 
b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
index 5494e0e..a7abe11 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
@@ -574,7 +574,6 @@ static int hns_nic_poll_rx_skb(struct hns_nic_ring_data 
*ring_data,
struct sk_buff *skb;
struct hnae_desc *desc;
struct hnae_desc_cb *desc_cb;
-   struct ethhdr *eh;
unsigned char *va;
int bnum, length, i;
int pull_len;
@@ -600,7 +599,6 @@ static int hns_nic_poll_rx_skb(struct hns_nic_ring_data 
*ring_data,
ring->stats.sw_err_cnt++;
return -ENOMEM;
}
-   skb_reset_mac_header(skb);
 
prefetchw(skb->data);
length = le16_to_cpu(desc->rx.pkt_len);
@@ -682,14 +680,6 @@ out_bnum_err:
return -EFAULT;
}
 
-   /* filter out multicast pkt with the same src mac as this port */
-   eh = eth_hdr(skb);
-   if (unlikely(is_multicast_ether_addr(eh->h_dest) &&
-ether_addr_equal(ndev->dev_addr, eh->h_source))) {
-   dev_kfree_skb_any(skb);
-   return -EFAULT;
-   }
-
ring->stats.rx_pkts++;
ring->stats.rx_bytes += skb->len;
 
-- 
1.7.9.5




[PATCH V2 for-next 4/8] net: hns: delete repeat read fbd num after while

2016-09-29 Thread Salil Mehta
From: Daode Huang 

Because the received packets are handled after napi completes, delete the
check before submitting. This removes the code that re-reads the fbd number
register, which reduces CPU usage while receiving packets.

Signed-off-by: Daode Huang 
Reviewed-by: Yisen Zhuang 
Signed-off-by: Salil Mehta 
---
 drivers/net/ethernet/hisilicon/hns/hns_enet.c |   15 ++-
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c 
b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
index e6bfc51..09ed237 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
@@ -760,7 +760,7 @@ static int hns_nic_rx_poll_one(struct hns_nic_ring_data 
*ring_data,
 {
struct hnae_ring *ring = ring_data->ring;
struct sk_buff *skb;
-   int num, bnum, ex_num;
+   int num, bnum;
 #define RCB_NOF_ALLOC_RX_BUFF_ONCE 16
int recv_pkts, recv_bds, clean_count, err;
int unused_count = hns_desc_unused(ring);
@@ -770,7 +770,7 @@ static int hns_nic_rx_poll_one(struct hns_nic_ring_data 
*ring_data,
 
recv_pkts = 0, recv_bds = 0, clean_count = 0;
num -= unused_count;
-recv:
+
while (recv_pkts < budget && recv_bds < num) {
/* reuse or realloc buffers */
if (clean_count + unused_count >= RCB_NOF_ALLOC_RX_BUFF_ONCE) {
@@ -798,17 +798,6 @@ recv:
recv_pkts++;
}
 
-   /* make all data has been write before submit */
-   if (recv_pkts < budget) {
-   ex_num = readl_relaxed(ring->io_base + RCB_REG_FBDNUM);
-   ex_num -= unused_count;
-   if (ex_num > clean_count) {
-   num += ex_num - clean_count;
-   rmb(); /*complete read rx ring bd number*/
-   goto recv;
-   }
-   }
-
 out:
/* make all data has been write before submit */
if (clean_count + unused_count > 0)
-- 
1.7.9.5




[PATCH V2 for-next 6/8] net: hns: fix the bug of forwarding table

2016-09-29 Thread Salil Mehta
From: Daode Huang 

As the sub queue id in the broadcast forwarding table is always
set to absolute queue 0 rather than the interface's relative queue 0,
received broadcast packets loop back to the rcb.
This patch sets the sub queue id to the relative queue 0 of each port.

Signed-off-by: Daode Huang 
Reviewed-by: Yisen Zhuang 
Signed-off-by: Salil Mehta 
---
PATCH V2: Addressed comments by David Miller
  Link: https://lkml.org/lkml/2016/9/28/390
PATCH V1: Initial Submit
---
 drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c |8 ++--
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.c |   13 ++---
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.h |2 ++
 3 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c 
b/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
index e0f9cdc..2d0cb60 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
@@ -207,6 +207,7 @@ static int hns_ae_set_multicast_one(struct hnae_handle 
*handle, void *addr)
int ret;
char *mac_addr = (char *)addr;
struct hns_mac_cb *mac_cb = hns_get_mac_cb(handle);
+   u8 port_num;
 
assert(mac_cb);
 
@@ -221,8 +222,11 @@ static int hns_ae_set_multicast_one(struct hnae_handle 
*handle, void *addr)
return ret;
}
 
-   ret = hns_mac_set_multi(mac_cb, DSAF_BASE_INNER_PORT_NUM,
-   mac_addr, true);
+   ret = hns_mac_get_inner_port_num(mac_cb, handle->vf_id, &port_num);
+   if (ret)
+   return ret;
+
+   ret = hns_mac_set_multi(mac_cb, port_num, mac_addr, true);
if (ret)
dev_err(handle->owner_dev,
"mac add mul_mac:%pM port%d  fail, ret = %#x!\n",
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.c 
b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.c
index a68eef0..151fd6e 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.c
@@ -141,9 +141,10 @@ void hns_mac_adjust_link(struct hns_mac_cb *mac_cb, int 
speed, int duplex)
  *@port_num:port number
  *
  */
-static int hns_mac_get_inner_port_num(struct hns_mac_cb *mac_cb,
- u8 vmid, u8 *port_num)
+int hns_mac_get_inner_port_num(struct hns_mac_cb *mac_cb, u8 vmid, u8 
*port_num)
 {
+   int q_num_per_vf, vf_num_per_port;
+   int vm_queue_id;
u8 tmp_port;
 
if (mac_cb->dsaf_dev->dsaf_mode <= DSAF_MODE_ENABLE) {
@@ -174,6 +175,12 @@ static int hns_mac_get_inner_port_num(struct hns_mac_cb 
*mac_cb,
return -EINVAL;
}
 
+   q_num_per_vf = mac_cb->dsaf_dev->rcb_common[0]->max_q_per_vf;
+   vf_num_per_port = mac_cb->dsaf_dev->rcb_common[0]->max_vfn;
+
+   vm_queue_id = vmid * q_num_per_vf +
+   vf_num_per_port * q_num_per_vf * mac_cb->mac_id;
+
switch (mac_cb->dsaf_dev->dsaf_mode) {
case DSAF_MODE_ENABLE_FIX:
tmp_port = 0;
@@ -193,7 +200,7 @@ static int hns_mac_get_inner_port_num(struct hns_mac_cb 
*mac_cb,
case DSAF_MODE_DISABLE_6PORT_2VM:
case DSAF_MODE_DISABLE_6PORT_4VM:
case DSAF_MODE_DISABLE_6PORT_16VM:
-   tmp_port = vmid;
+   tmp_port = vm_queue_id;
break;
default:
dev_err(mac_cb->dev, "dsaf mode invalid,%s mac%d!\n",
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.h 
b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.h
index 4cbdf14..d3a1f72 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.h
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.h
@@ -461,5 +461,7 @@ void hns_set_led_opt(struct hns_mac_cb *mac_cb);
 int hns_cpld_led_set_id(struct hns_mac_cb *mac_cb,
enum hnae_led_state status);
 void hns_mac_set_promisc(struct hns_mac_cb *mac_cb, u8 en);
+int hns_mac_get_inner_port_num(struct hns_mac_cb *mac_cb,
+  u8 vmid, u8 *port_num);
 
 #endif /* _HNS_DSAF_MAC_H */
-- 
1.7.9.5




Re: [PATCH -next] net: macb: fix missing unlock on error in macb_start_xmit()

2016-09-29 Thread Nicolas Ferre
On 12/09/2016 at 19:40, Helmut Buchsbaum wrote:
> On 09/10/2016 01:17 PM, Wei Yongjun wrote:
>> From: Wei Yongjun 
>>
>> Fix missing unlock before return from function macb_start_xmit()
>> in the error handling case.
>>
>> Fixes: 007e4ba3ee13 ("net: macb: initialize checksum when using
>> checksum offloading")
>> Signed-off-by: Wei Yongjun 
>> ---
>>  drivers/net/ethernet/cadence/macb.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ethernet/cadence/macb.c 
>> b/drivers/net/ethernet/cadence/macb.c
>> index 0294b6a..63144bb 100644
>> --- a/drivers/net/ethernet/cadence/macb.c
>> +++ b/drivers/net/ethernet/cadence/macb.c
>> @@ -1398,7 +1398,7 @@ static int macb_start_xmit(struct sk_buff *skb, struct 
>> net_device *dev)
>>
>>  if (macb_clear_csum(skb)) {
>>  dev_kfree_skb_any(skb);
>> -return NETDEV_TX_OK;
>> +goto unlock;
>>  }
>>
>>  /* Map socket buffer for DMA transfer */
>>
> You are definitely right. Sorry I missed that obvious point and for 
> causing any inconveniences.
> 
> BTW, I see there are obviously quite a few users of MACB 
> implementations. I'm just curious if anybody else ever encountered the 
> checksum problem or if this is a matter of the Zynq implementation only.

I've just verified that we are affected by this issue as well on sama5d2
(Microchip / Atmel cortex-A5 MPUs).

Thanks for the fix,
-- 
Nicolas Ferre


[PATCH V2 for-next 2/8] net: hns: bug fix about setting coalsecs-usecs to 0

2016-09-29 Thread Salil Mehta
From: Daode Huang 

When rx/tx coalesce-usecs is set to 0, interrupt coalescing is
disabled, but there is an interrupt rate limit set to 1us, which
causes no interrupts to occur. This patch disables the interrupt rate
limit when coalesce-usecs is set to 0, and restores it to 1 otherwise.

Signed-off-by: Daode Huang 
Reviewed-by: Yisen Zhuang 
Signed-off-by: Salil Mehta 
---
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c |   16 
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h |4 
 2 files changed, 20 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c 
b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c
index ef11077..f0ed80d6 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c
@@ -543,6 +543,22 @@ int hns_rcb_set_coalesce_usecs(
"error: coalesce_usecs setting supports 0~1023us\n");
return -EINVAL;
}
+
+   if (!AE_IS_VER1(rcb_common->dsaf_dev->dsaf_ver)) {
+   if (timeout == 0)
+   /* set timeout to 0, Disable gap time */
+   dsaf_set_reg_field(rcb_common->io_base,
+  RCB_INT_GAP_TIME_REG + port_idx * 4,
+  PPE_INT_GAPTIME_M, PPE_INT_GAPTIME_B,
+  0);
+   else
+   /* set timeout non 0, restore gap time to 1 */
+   dsaf_set_reg_field(rcb_common->io_base,
+  RCB_INT_GAP_TIME_REG + port_idx * 4,
+  PPE_INT_GAPTIME_M, PPE_INT_GAPTIME_B,
+  1);
+   }
+
hns_rcb_set_port_timeout(rcb_common, port_idx, timeout);
return 0;
 }
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h 
b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h
index 4b8b803..878950a 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h
@@ -417,6 +417,7 @@
 #define RCB_CFG_OVERTIME_REG   0x9300
 #define RCB_CFG_PKTLINE_INT_NUM_REG0x9304
 #define RCB_CFG_OVERTIME_INT_NUM_REG   0x9308
+#define RCB_INT_GAP_TIME_REG   0x9400
 #define RCB_PORT_CFG_OVERTIME_REG  0x9430
 
 #define RCB_RING_RX_RING_BASEADDR_L_REG0x0
@@ -898,6 +899,9 @@
 #define PPE_CNT_CLR_CE_B   0
 #define PPE_CNT_CLR_SNAP_EN_B  1
 
+#define PPE_INT_GAPTIME_B  0
+#define PPE_INT_GAPTIME_M  0x3ff
+
 #define PPE_COMMON_CNT_CLR_CE_B0
 #define PPE_COMMON_CNT_CLR_SNAP_EN_B   1
 #define RCB_COM_TSO_MODE_B 0
-- 
1.7.9.5




[PATCH V2 for-next 0/8] Bug Fixes and Code Improvement in HNS driver

2016-09-29 Thread Salil Mehta
This patch-set introduces fixes for some bugs, potential problems
and code improvements identified during internal review and
testing of the Hisilicon Network Subsystem driver.

Submit Change
V1->V2: This addresses the feedback provided by David Miller
and Doug Ledford

Daode Huang (6):
  net: hns: bug fix about setting coalsecs-usecs to 0
  net: hns: add fini_process for v2 napi process
  net: hns: delete repeat read fbd num after while
  net: hns: fix the bug of forwarding table
  net: hns: bug fix about broadcast/multicast packets
  net: hns: delete redundant broadcast packet filter process

Kejian Yan (1):
  net: hns: fix port not available after testing loopback

lipeng (1):
  net: hns: fix port unavailable after hnae_reserve_buffer_map fail

 drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c  |   11 +-
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.c  |   13 ++-
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.h  |2 +
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c |   10 --
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.h |1 -
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c  |   16 +++
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h  |4 +
 drivers/net/ethernet/hisilicon/hns/hns_enet.c  |  107 +---
 drivers/net/ethernet/hisilicon/hns/hns_ethtool.c   |7 ++
 9 files changed, 118 insertions(+), 53 deletions(-)

-- 
1.7.9.5




[PATCH V2 for-next 1/8] net: hns: fix port unavailable after hnae_reserve_buffer_map fail

2016-09-29 Thread Salil Mehta
From: lipeng 

When hnae_reserve_buffer_map fails, the loop breaks and some buffer
descriptors have no available memory, therefore the port will
be unavailable.

Signed-off-by: Peng Li 
Reviewed-by: Yisen Zhuang 
Signed-off-by: Salil Mehta 
---
 drivers/net/ethernet/hisilicon/hns/hns_enet.c |   23 ++-
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c 
b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
index d7e1f8c..32ff270 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
@@ -747,6 +747,14 @@ static void hns_nic_rx_up_pro(struct hns_nic_ring_data 
*ring_data,
ndev->last_rx = jiffies;
 }
 
+static int hns_desc_unused(struct hnae_ring *ring)
+{
+   int ntc = ring->next_to_clean;
+   int ntu = ring->next_to_use;
+
+   return ((ntc >= ntu) ? 0 : ring->desc_num) + ntc - ntu;
+}
+
 static int hns_nic_rx_poll_one(struct hns_nic_ring_data *ring_data,
   int budget, void *v)
 {
@@ -755,17 +763,21 @@ static int hns_nic_rx_poll_one(struct hns_nic_ring_data 
*ring_data,
int num, bnum, ex_num;
 #define RCB_NOF_ALLOC_RX_BUFF_ONCE 16
int recv_pkts, recv_bds, clean_count, err;
+   int unused_count = hns_desc_unused(ring);
 
num = readl_relaxed(ring->io_base + RCB_REG_FBDNUM);
rmb(); /* make sure num taken effect before the other data is touched */
 
recv_pkts = 0, recv_bds = 0, clean_count = 0;
+   num -= unused_count;
 recv:
while (recv_pkts < budget && recv_bds < num) {
/* reuse or realloc buffers */
-   if (clean_count >= RCB_NOF_ALLOC_RX_BUFF_ONCE) {
-   hns_nic_alloc_rx_buffers(ring_data, clean_count);
+   if (clean_count + unused_count >= RCB_NOF_ALLOC_RX_BUFF_ONCE) {
+   hns_nic_alloc_rx_buffers(ring_data,
+clean_count + unused_count);
clean_count = 0;
+   unused_count = hns_desc_unused(ring);
}
 
/* poll one pkt */
@@ -789,7 +801,7 @@ recv:
/* make all data has been write before submit */
if (recv_pkts < budget) {
ex_num = readl_relaxed(ring->io_base + RCB_REG_FBDNUM);
-
+   ex_num -= unused_count;
if (ex_num > clean_count) {
num += ex_num - clean_count;
rmb(); /*complete read rx ring bd number*/
@@ -799,8 +811,9 @@ recv:
 
 out:
/* make all data has been write before submit */
-   if (clean_count > 0)
-   hns_nic_alloc_rx_buffers(ring_data, clean_count);
+   if (clean_count + unused_count > 0)
+   hns_nic_alloc_rx_buffers(ring_data,
+clean_count + unused_count);
 
return recv_pkts;
 }
-- 
1.7.9.5




[PATCH V2 for-next 5/8] net: hns: fix port not available after testing loopback

2016-09-29 Thread Salil Mehta
From: Kejian Yan 

After running command "ethtool -t eth0", eth0 can not be connected to
network. It is caused by the changing the inner loopback register and
this register cannot be changed when hns connected to network. The
routine of setting this register needs to be removed and using promisc
mode to let the packet looped back pass by dsaf mode.

Reported-by: Jun He 
Signed-off-by: Kejian Yan 
Reviewed-by: Yisen Zhaung 
Signed-off-by: Salil Mehta 
---
 drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c  |3 ---
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c |   10 --
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.h |1 -
 drivers/net/ethernet/hisilicon/hns/hns_ethtool.c   |7 +++
 4 files changed, 7 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c 
b/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
index e28d960..e0f9cdc 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
@@ -678,9 +678,6 @@ static int hns_ae_config_loopback(struct hnae_handle 
*handle,
ret = -EINVAL;
}
 
-   if (!ret)
-   hns_dsaf_set_inner_lb(mac_cb->dsaf_dev, mac_cb->mac_id, en);
-
return ret;
 }
 
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c 
b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
index 9283bc6..827d8fb 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
@@ -768,16 +768,6 @@ void hns_dsaf_set_promisc_mode(struct dsaf_device 
*dsaf_dev, u32 en)
 DSAF_CFG_MIX_MODE_S, !!en);
 }
 
-void hns_dsaf_set_inner_lb(struct dsaf_device *dsaf_dev, u32 mac_id, u32 en)
-{
-   if (AE_IS_VER1(dsaf_dev->dsaf_ver) ||
-   dsaf_dev->mac_cb[mac_id]->mac_type == HNAE_PORT_DEBUG)
-   return;
-
-   dsaf_set_dev_bit(dsaf_dev, DSAFV2_SERDES_LBK_0_REG + 4 * mac_id,
-DSAFV2_SERDES_LBK_EN_B, !!en);
-}
-
 /**
  * hns_dsaf_tbl_stat_en - tbl
  * @dsaf_id: dsa fabric id
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.h 
b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.h
index 35df187..c494fc5 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.h
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.h
@@ -466,6 +466,5 @@ void hns_dsaf_get_rx_mac_pause_en(struct dsaf_device 
*dsaf_dev, int mac_id,
  u32 *en);
 int hns_dsaf_set_rx_mac_pause_en(struct dsaf_device *dsaf_dev, int mac_id,
 u32 en);
-void hns_dsaf_set_inner_lb(struct dsaf_device *dsaf_dev, u32 mac_id, u32 en);
 
 #endif /* __HNS_DSAF_MAIN_H__ */
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c 
b/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c
index ab33487..fa91ce3 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c
@@ -342,6 +342,13 @@ static int __lb_setup(struct net_device *ndev,
break;
}
 
+   if (!ret) {
+   if (loop == MAC_LOOP_NONE)
+   h->dev->ops->set_promisc_mode(
+   h, ndev->flags & IFF_PROMISC);
+   else
+   h->dev->ops->set_promisc_mode(h, 1);
+   }
return ret;
 }
 
-- 
1.7.9.5




[PATCH V2 for-next 3/8] net: hns: add fini_process for v2 napi process

2016-09-29 Thread Salil Mehta
From: Daode Huang 

This patch adds fini_process for v2; it handles the packets received
by the hardware in the napi process. With this patch, the number of
hardware irqs per second drops by 50%.

Signed-off-by: Daode Huang 
Reviewed-by: Yisen Zhuang 
Signed-off-by: Salil Mehta 
---
 drivers/net/ethernet/hisilicon/hns/hns_enet.c |   45 +
 1 file changed, 38 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c 
b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
index 32ff270..e6bfc51 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
@@ -823,6 +823,8 @@ static void hns_nic_rx_fini_pro(struct hns_nic_ring_data 
*ring_data)
struct hnae_ring *ring = ring_data->ring;
int num = 0;
 
+   ring_data->ring->q->handle->dev->ops->toggle_ring_irq(ring, 0);
+
/* for hardware bug fixed */
num = readl_relaxed(ring->io_base + RCB_REG_FBDNUM);
 
@@ -834,6 +836,20 @@ static void hns_nic_rx_fini_pro(struct hns_nic_ring_data 
*ring_data)
}
 }
 
+static void hns_nic_rx_fini_pro_v2(struct hns_nic_ring_data *ring_data)
+{
+   struct hnae_ring *ring = ring_data->ring;
+   int num = 0;
+
+   num = readl_relaxed(ring->io_base + RCB_REG_FBDNUM);
+
+   if (num == 0)
+   ring_data->ring->q->handle->dev->ops->toggle_ring_irq(
+   ring, 0);
+   else
+   napi_schedule(&ring_data->napi);
+}
+
 static inline void hns_nic_reclaim_one_desc(struct hnae_ring *ring,
int *bytes, int *pkts)
 {
@@ -935,7 +951,11 @@ static int hns_nic_tx_poll_one(struct hns_nic_ring_data 
*ring_data,
 static void hns_nic_tx_fini_pro(struct hns_nic_ring_data *ring_data)
 {
struct hnae_ring *ring = ring_data->ring;
-   int head = readl_relaxed(ring->io_base + RCB_REG_HEAD);
+   int head;
+
+   ring_data->ring->q->handle->dev->ops->toggle_ring_irq(ring, 0);
+
+   head = readl_relaxed(ring->io_base + RCB_REG_HEAD);
 
if (head != ring->next_to_clean) {
ring_data->ring->q->handle->dev->ops->toggle_ring_irq(
@@ -945,6 +965,18 @@ static void hns_nic_tx_fini_pro(struct hns_nic_ring_data 
*ring_data)
}
 }
 
+static void hns_nic_tx_fini_pro_v2(struct hns_nic_ring_data *ring_data)
+{
+   struct hnae_ring *ring = ring_data->ring;
+   int head = readl_relaxed(ring->io_base + RCB_REG_HEAD);
+
+   if (head == ring->next_to_clean)
+   ring_data->ring->q->handle->dev->ops->toggle_ring_irq(
+   ring, 0);
+   else
+   napi_schedule(&ring_data->napi);
+}
+
 static void hns_nic_tx_clr_all_bufs(struct hns_nic_ring_data *ring_data)
 {
struct hnae_ring *ring = ring_data->ring;
@@ -976,10 +1008,7 @@ static int hns_nic_common_poll(struct napi_struct *napi, 
int budget)
 
if (clean_complete >= 0 && clean_complete < budget) {
napi_complete(napi);
-   ring_data->ring->q->handle->dev->ops->toggle_ring_irq(
-   ring_data->ring, 0);
-   if (ring_data->fini_process)
-   ring_data->fini_process(ring_data);
+   ring_data->fini_process(ring_data);
return 0;
}
 
@@ -1755,7 +1784,8 @@ static int hns_nic_init_ring_data(struct hns_nic_priv 
*priv)
rd->queue_index = i;
rd->ring = &h->qs[i]->tx_ring;
rd->poll_one = hns_nic_tx_poll_one;
-   rd->fini_process = is_ver1 ? hns_nic_tx_fini_pro : NULL;
+   rd->fini_process = is_ver1 ? hns_nic_tx_fini_pro :
+   hns_nic_tx_fini_pro_v2;
 
netif_napi_add(priv->netdev, &rd->napi,
   hns_nic_common_poll, NIC_TX_CLEAN_MAX_NUM);
@@ -1767,7 +1797,8 @@ static int hns_nic_init_ring_data(struct hns_nic_priv 
*priv)
rd->ring = &h->qs[i - h->q_num]->rx_ring;
rd->poll_one = hns_nic_rx_poll_one;
rd->ex_process = hns_nic_rx_up_pro;
-   rd->fini_process = is_ver1 ? hns_nic_rx_fini_pro : NULL;
+   rd->fini_process = is_ver1 ? hns_nic_rx_fini_pro :
+   hns_nic_rx_fini_pro_v2;
 
netif_napi_add(priv->netdev, &rd->napi,
   hns_nic_common_poll, NIC_RX_CLEAN_MAX_NUM);
-- 
1.7.9.5




[PATCH V2 for-next 7/8] net: hns: bug fix about broadcast/multicast packets

2016-09-29 Thread Salil Mehta
From: Daode Huang 

When the dsaf receives a broadcast packet, it filters the packet by
comparing the receive queue number with the destination queue number
(taken from the forwarding table); if they are the same, the packet is
filtered, otherwise the packet is looped back. So this patch selects
queue 0 to send broadcast and multicast packets.

Signed-off-by: Daode Huang 
Reviewed-by: Yisen Zhuang 
Signed-off-by: Salil Mehta 
---
PATCH v2: Addressed comments by David Miller
  Link: https://lkml.org/lkml/2016/9/28/390
PATCH V1: Initial Submit
---
 drivers/net/ethernet/hisilicon/hns/hns_enet.c |   16 
 1 file changed, 16 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c 
b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
index 09ed237..5494e0e 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
@@ -1597,6 +1597,21 @@ struct rtnl_link_stats64 *hns_nic_get_stats64(struct 
net_device *ndev,
return stats;
 }
 
+static u16
+hns_nic_select_queue(struct net_device *ndev, struct sk_buff *skb,
+void *accel_priv, select_queue_fallback_t fallback)
+{
+   struct ethhdr *eth_hdr = (struct ethhdr *)skb->data;
+   struct hns_nic_priv *priv = netdev_priv(ndev);
+
+   /* fix hardware broadcast/multicast packets queue loopback */
+   if (!AE_IS_VER1(priv->enet_ver) &&
+   is_multicast_ether_addr(eth_hdr->h_dest))
+   return 0;
+   else
+   return fallback(ndev, skb);
+}
+
 static const struct net_device_ops hns_nic_netdev_ops = {
.ndo_open = hns_nic_net_open,
.ndo_stop = hns_nic_net_stop,
@@ -1612,6 +1627,7 @@ static const struct net_device_ops hns_nic_netdev_ops = {
.ndo_poll_controller = hns_nic_poll_controller,
 #endif
.ndo_set_rx_mode = hns_nic_set_rx_mode,
+   .ndo_select_queue = hns_nic_select_queue,
 };
 
 static void hns_nic_update_link_status(struct net_device *netdev)
-- 
1.7.9.5




Re: [PATCH net] net: pktgen: fix pkt_size

2016-09-29 Thread Sergei Shtylyov

Hello.

On 09/29/2016 05:04 PM, Paolo Abeni wrote:


The commit 879c7220e828 ("net: pktgen: Observe needed_headroom
of the device") increased the 'pkt_overhead' field value by
LL_RESERVED_SPACE.
As a side effect the generated packet size, computed as:

/* Eth + IPh + UDPh + mpls */
datalen = pkt_dev->cur_pkt_size - 14 - 20 - 8 -
  pkt_dev->pkt_overhead;

is decreased by the same value.
The above changed slightly the behavior of existing pktgen users,
and made the interface somewhat inconsistent.
Fix it by restoring the previous pkt_overhead value and using
LL_RESERVED_SPACE as extralen in skb allocation.
Also, change pktgen_alloc_skb() to only partially reserve
the headroom to allow the caller to prefetch from ll header
start.

Fixes: 879c7220e828 ("net: pktgen: Observe needed_headroom of the device")
Suggested-by: Ben Greear 
Signed-off-by: Paolo Abeni 
---
 net/core/pktgen.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index bbd118b..5c9b397 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c

[...]

@@ -2796,8 +2796,9 @@ static struct sk_buff *pktgen_alloc_skb(struct net_device 
*dev,
 skb = __netdev_alloc_skb(dev, size, GFP_NOWAIT);
}

+   /* the caller prefetch from skb->data and reserve for the mac hdr */


   Pre-fetches and reserves?

[...]

MBR, Sergei



Re: next-20160929 build: 2 failures 4 warnings (next-20160929)

2016-09-29 Thread Mark Brown
On Thu, Sep 29, 2016 at 12:40:35PM +0100, Build bot for Mark Brown wrote:

For the past couple of days -next has been failing to build an ARM
allmodconfig due to:

>   arm-allmodconfig
> ERROR: "__aeabi_uldivmod" [net/netfilter/xt_hashlimit.ko] undefined!

which appears to be triggered by 11d5f15723c9 (netfilter: xt_hashlimit:
Create revision 2 to support higher pps rates) introducing a division of
a 64 bit number which should be done using do_div().
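A minimal sketch of the difference, for illustration only (hypothetical
variables, not the xt_hashlimit code; do_div() comes from <asm/div64.h>,
div64_u64() from <linux/math64.h>):

	u64 bytes = some_64bit_value;	/* hypothetical 64-bit dividend */
	u32 rem;

	/* q = bytes / interval;  -- a plain '/' on a u64 emits a call to
	 * __aeabi_uldivmod (__udivdi3) on 32-bit targets, which the kernel
	 * does not provide.
	 */

	rem = do_div(bytes, interval);	/* divides in place, returns remainder */
	/* 'bytes' now holds the quotient; use div64_u64() instead when the
	 * divisor itself is 64 bit.
	 */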




[PATCH net-next v3] netfilter: xt_hashlimit: Fix link error in 32bit arch because of 64bit division

2016-09-29 Thread Vishwanath Pai
v2:
Remove unnecessary div64_u64 around constants

v3:
remove backslashes

--

Fix link error in 32bit arch because of 64bit division

Division of 64bit integers will cause a linker error: undefined reference
to `__udivdi3'. Fix this by replacing the divisions with div64_u64.

Signed-off-by: Vishwanath Pai 
Fixes: 11d5f15723c9 ("netfilter: xt_hashlimit: Create revision 2 to ...")

---
 net/netfilter/xt_hashlimit.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c
index 44a095e..faf65f1 100644
--- a/net/netfilter/xt_hashlimit.c
+++ b/net/netfilter/xt_hashlimit.c
@@ -467,17 +467,18 @@ static u64 user2credits(u64 user, int revision)
/* If multiplying would overflow... */
if (user > 0xFFFFFFFF / (HZ*CREDITS_PER_JIFFY_v1))
/* Divide first. */
-   return (user / XT_HASHLIMIT_SCALE) *\
-   HZ * CREDITS_PER_JIFFY_v1;
+   return div64_u64(user, XT_HASHLIMIT_SCALE)
+   * HZ * CREDITS_PER_JIFFY_v1;
 
-   return (user * HZ * CREDITS_PER_JIFFY_v1) \
-   / XT_HASHLIMIT_SCALE;
+   return div64_u64((user * HZ * CREDITS_PER_JIFFY_v1),
+XT_HASHLIMIT_SCALE);
} else {
if (user > 0xFFFFFFFF / (HZ*CREDITS_PER_JIFFY))
-   return (user / XT_HASHLIMIT_SCALE_v2) *\
-   HZ * CREDITS_PER_JIFFY;
+   return div64_u64(user, XT_HASHLIMIT_SCALE_v2)
+   * HZ * CREDITS_PER_JIFFY;
 
-   return (user * HZ * CREDITS_PER_JIFFY) / XT_HASHLIMIT_SCALE_v2;
+   return div64_u64((user * HZ * CREDITS_PER_JIFFY),
+XT_HASHLIMIT_SCALE_v2);
}
 }
 
-- 
1.9.1



Re: pull-request: wireless-drivers-next 2016-09-29

2016-09-29 Thread Pablo Neira Ayuso
On Thu, Sep 29, 2016 at 07:57:28PM +0300, Kalle Valo wrote:
> Hi Dave,
> 
> this should be the last wireless-drivers-next pull request for 4.9, from
> now on only important bugfixes. Nothing really special stands out,
> iwlwifi being most active but other drivers also getting attention. More
> details in the signed tag. Please let me know if there are any problems.
> 
> Or actually I had one problem. While doing a test merge I noticed that
> net-next fails to compile for me, but I don't think this is anything
> wireless related:
> 
>   CC  net/netfilter/core.o
> net/netfilter/core.c: In function 'nf_set_hooks_head':
> net/netfilter/core.c:96:149: error: 'struct net_device' has no member named 
> 'nf_hooks_ingress'

That's my problem, will be sending a pull request to fix this asap,
thanks.


Re: [PATCH net v2] tg3: Avoid NULL pointer dereference in tg3_io_error_detected()

2016-09-29 Thread Michael Chan
On Thu, Sep 29, 2016 at 9:24 AM, Guilherme G. Piccoli
 wrote:
> From: Milton Miller 
>
> While the driver is probing the adapter, an error may occur before the
> netdev structure is allocated and attached to pci_dev. In this case,
> not only is netdev not available, but the tg3 private structure is also
> not available, as it is just math from the NULL pointer, so dereferences
> must be skipped.
>
> The following trace is seen when the error is triggered:
>
>   [1.402247] Unable to handle kernel paging request for data at address 
> 0x1a99
>   [1.402410] Faulting instruction address: 0xc07e33f8
>   [1.402450] Oops: Kernel access of bad area, sig: 11 [#1]
>   [1.402481] SMP NR_CPUS=2048 NUMA PowerNV
>   [1.402513] Modules linked in:
>   [1.402545] CPU: 0 PID: 651 Comm: eehd Not tainted 4.4.0-36-generic 
> #55-Ubuntu
>   [1.402591] task: c01fe4e42a20 ti: c01fe4e88000 task.ti: 
> c01fe4e88000
>   [1.402742] NIP: c07e33f8 LR: c07e3164 CTR: c0595ea0
>   [1.402787] REGS: c01fe4e8b790 TRAP: 0300   Not tainted  
> (4.4.0-36-generic)
>   [1.402832] MSR: 90019033   CR: 28000422  
> XER: 2000
>   [1.403058] CFAR: c0008468 DAR: 1a99 DSISR: 4200 
> SOFTE: 1
>   GPR00: c07e3164 c01fe4e8ba10 c15c5e00 
>   GPR04: 0001  0039 0299
>   GPR08:  0001 c01fe4e88000 0006
>   GPR12:  cfb4 c00e6558 c03ca1bffd00
>   GPR16:    
>   GPR20:    c0d52768
>   GPR24: c0d52740 0100 c03ca1b52000 0002
>   GPR28: 0900  c152a0c0 c03ca1b52000
>   [1.404226] NIP [c07e33f8] tg3_io_error_detected+0x308/0x340
>   [1.404265] LR [c07e3164] tg3_io_error_detected+0x74/0x340
>
> This patch avoids the NULL pointer dereference by moving the access after
> the netdev NULL pointer check on tg3_io_error_detected(). Also, we add a
> check for netdev being NULL on tg3_io_resume() [suggested by Michael Chan].
>
> Fixes: 0486a063b1ff ("tg3: prevent ifup/ifdown during PCI error recovery")
> Fixes: dfc8f370316b ("net/tg3: Release IRQs on permanent error")
> Tested-by: Guilherme G. Piccoli 
> Signed-off-by: Milton Miller 
> Signed-off-by: Guilherme G. Piccoli 

Acked-by: Michael Chan 


Re: next-20160929 build: 2 failures 4 warnings (next-20160929)

2016-09-29 Thread Vishwanath Pai
On 09/29/2016 02:47 PM, Mark Brown wrote:
> On Thu, Sep 29, 2016 at 12:40:35PM +0100, Build bot for Mark Brown wrote:
> 
> For the past couple of days -next has been failing to build an ARM
> allmodconfig due to:
> 
>>  arm-allmodconfig
>> ERROR: "__aeabi_uldivmod" [net/netfilter/xt_hashlimit.ko] undefined!
> 
> which appears to be triggered by 11d5f15723c9 (netfilter: xt_hashlimit:
> Create revision 2 to support higher pps rates) introducing a division of
> a 64 bit number which should be done using do_div().
> 

I sent a patch for this to netdev a couple of days ago; it hasn't
made it to net-next yet. Here's the latest one:

[PATCH net-next v3] netfilter: xt_hashlimit: Fix link error in 32bit
arch because of 64bit division

This should fix the link error.

-Vishwanath


[PATCH net-next 2/2] openvswitch: remove skb_mpls_header

2016-09-29 Thread Jiri Benc
skb_mpls_header is equivalent to skb_network_header now. There's no reason
to keep it.

Signed-off-by: Jiri Benc 
---
 include/net/mpls.h| 11 ---
 net/openvswitch/actions.c | 10 +-
 2 files changed, 5 insertions(+), 16 deletions(-)

diff --git a/include/net/mpls.h b/include/net/mpls.h
index 5b3b5addfb08..fde22d0b0ec1 100644
--- a/include/net/mpls.h
+++ b/include/net/mpls.h
@@ -25,15 +25,4 @@ static inline bool eth_p_mpls(__be16 eth_type)
eth_type == htons(ETH_P_MPLS_MC);
 }
 
-/*
- * For non-MPLS skbs this will correspond to the network header.
- * For MPLS skbs it will be before the network_header as the MPLS
- * label stack lies between the end of the mac header and the network
- * header. That is, for MPLS skbs the end of the mac header
- * is the top of the MPLS label stack.
- */
-static inline unsigned char *skb_mpls_header(struct sk_buff *skb)
-{
-   return skb_mac_header(skb) + skb->mac_len;
-}
 #endif
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 863e992dfbc0..bf4211f5c974 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -180,7 +180,7 @@ static int push_mpls(struct sk_buff *skb, struct 
sw_flow_key *key,
skb_reset_mac_header(skb);
skb_set_network_header(skb, skb->mac_len);
 
-   new_mpls_lse = (__be32 *)skb_mpls_header(skb);
+   new_mpls_lse = (__be32 *)skb_network_header(skb);
*new_mpls_lse = mpls->mpls_lse;
 
skb_postpush_rcsum(skb, new_mpls_lse, MPLS_HLEN);
@@ -202,7 +202,7 @@ static int pop_mpls(struct sk_buff *skb, struct sw_flow_key 
*key,
if (unlikely(err))
return err;
 
-   skb_postpull_rcsum(skb, skb_mpls_header(skb), MPLS_HLEN);
+   skb_postpull_rcsum(skb, skb_network_header(skb), MPLS_HLEN);
 
memmove(skb_mac_header(skb) + MPLS_HLEN, skb_mac_header(skb),
skb->mac_len);
@@ -211,10 +211,10 @@ static int pop_mpls(struct sk_buff *skb, struct 
sw_flow_key *key,
skb_reset_mac_header(skb);
skb_set_network_header(skb, skb->mac_len);
 
-   /* skb_mpls_header() is used to locate the ethertype
+   /* skb_network_header() is used to locate the ethertype
 * field correctly in the presence of VLAN tags.
 */
-   hdr = (struct ethhdr *)(skb_mpls_header(skb) - ETH_HLEN);
+   hdr = (struct ethhdr *)(skb_network_header(skb) - ETH_HLEN);
update_ethertype(skb, hdr, ethertype);
if (eth_p_mpls(skb->protocol))
skb->protocol = ethertype;
@@ -234,7 +234,7 @@ static int set_mpls(struct sk_buff *skb, struct sw_flow_key 
*flow_key,
if (unlikely(err))
return err;
 
-   stack = (__be32 *)skb_mpls_header(skb);
+   stack = (__be32 *)skb_network_header(skb);
lse = OVS_MASKED(*stack, *mpls_lse, *mask);
if (skb->ip_summed == CHECKSUM_COMPLETE) {
__be32 diff[] = { ~(*stack), lse };
-- 
1.8.3.1



[PATCH net-next 1/2] openvswitch: mpls: set network header correctly on key extract

2016-09-29 Thread Jiri Benc
After commit 48d2ab609b6b ("net: mpls: Fixups for GSO"), MPLS handling in
openvswitch was changed to have the network header pointing to the start of the
MPLS headers and the inner_network_header pointing after the MPLS headers.

However, key_extract was missed by the mentioned commit, causing incorrect
headers to be set when an MPLS packet just enters the bridge or after it is
recirculated.
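For reference, the layout this relies on (an illustrative sketch only, not
part of the patch):

	/*
	 * mac_header          network_header            inner_network_header
	 *     |                     |                             |
	 *     v                     v                             v
	 *     | Ethernet header     | MPLS label stack            | L3 (IP) ...
	 *     |<----- mac_len ----->|
	 */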

Fixes: 48d2ab609b6b ("net: mpls: Fixups for GSO")
Signed-off-by: Jiri Benc 
---
 net/openvswitch/flow.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 634cc10d6dee..c8c82e109c68 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -633,12 +633,7 @@ static int key_extract(struct sk_buff *skb, struct 
sw_flow_key *key)
} else if (eth_p_mpls(key->eth.type)) {
size_t stack_len = MPLS_HLEN;
 
-   /* In the presence of an MPLS label stack the end of the L2
-* header and the beginning of the L3 header differ.
-*
-* Advance network_header to the beginning of the L3
-* header. mac_len corresponds to the end of the L2 header.
-*/
+   skb_set_inner_network_header(skb, skb->mac_len);
while (1) {
__be32 lse;
 
@@ -646,12 +641,12 @@ static int key_extract(struct sk_buff *skb, struct 
sw_flow_key *key)
if (unlikely(error))
return 0;
 
-   memcpy(&lse, skb_network_header(skb), MPLS_HLEN);
+   memcpy(&lse, skb_inner_network_header(skb), MPLS_HLEN);
 
if (stack_len == MPLS_HLEN)
memcpy(&key->mpls.top_lse, &lse, MPLS_HLEN);
 
-   skb_set_network_header(skb, skb->mac_len + stack_len);
+   skb_set_inner_network_header(skb, skb->mac_len + 
stack_len);
if (lse & htonl(MPLS_LS_S_MASK))
break;
 
-- 
1.8.3.1



[PATCH net-next 0/2] openvswitch: mpls fix and clean up

2016-09-29 Thread Jiri Benc
Convert the last remaining place in openvswitch to the new mpls skb layout,
which was forgotten in the mpls GSO rework. The GSO rework also allows for some nice
cleanup in the second patch.

Jiri Benc (2):
  openvswitch: mpls: set network header correctly on key extract
  openvswitch: remove skb_mpls_header

 include/net/mpls.h| 11 ---
 net/openvswitch/actions.c | 10 +-
 net/openvswitch/flow.c| 11 +++
 3 files changed, 8 insertions(+), 24 deletions(-)

-- 
1.8.3.1



Re: [PATCH] brcmfmac: implement more accurate skb tracking

2016-09-29 Thread Rafał Miłecki
On 27 September 2016 at 11:24, Arend Van Spriel
 wrote:
> On 26-9-2016 14:38, Rafał Miłecki wrote:
>> On 26 September 2016 at 14:13, Rafał Miłecki  wrote:
>>> On 26 September 2016 at 13:46, Arend Van Spriel
>>>  wrote:
 On 26-9-2016 12:23, Rafał Miłecki wrote:
> From: Rafał Miłecki 
>
> We need to track 802.1x packets to know if there are any pending ones
> for transmission. This is required for performing key update in the
> firmware.

 The problem we are trying to solve is a pretty old one. The problem is
 that wpa_supplicant uses two separate code paths: EAPOL messaging
 through the data path and key configuration through nl80211.
>>>
>>> Can I find it described/reported somewhere?
>>>
>>>
> Unfortunately our old tracking code wasn't very accurate. It was
> treating skb as pending as soon as it was passed by the netif. Actual
> handling packet to the firmware was happening later as brcmfmac
> internally queues them and uses its own worker(s).

 That does not seem right. As soon as we get a 1x packet we need to wait
 with key configuration regardless whether it is still in the driver or
 handed over to firmware already.
>>>
>>> OK, thanks.
>>
>> Actually, it's not OK. I was trying to report/describe/discuss this
>> problem for over a week. I couldn't get much of answer from you.
>>
>> I had to come with a patch I worked on for quite some time. Only then
>> you decided to react and reply with a reason for a nack. I see this
>> patch may be wrong (but it's still hard to know what's going wrong
>> without a proper hostapd bug report). I'd expect you to somehow work &
>> communicate with open source community.
>
> We do or at least make an honest attempt, but there is more on our plate
> so responses may be delayed. It also does not help when you get anal and
> preachy when we do respond. Also not OK. In this case the delay is
> caused because I had to pick up the thread(s) as Hante is on vacation
> (he needed a break :-p ). However, you started sending patches so I
> decided to look at and respond to those. Sorry if you felt like we left
> you hanging to dry.

I believe I get easily irritated because of the communication experience I
have had so far :(


Over a year ago I reported brcmfmac can't recover from failed
register_netdev(ice). This bug remains unfixed.

In 2014 I reported a problem with 80 MHz support. I didn't have hardware
to fix & test it on my own (you weren't able/allowed to send me one of
your PCIe cards). It remained broken until I fixed it a year later.

You missed my report of a crash caused by a missing eth_type_trans
call and came up with a patch on your own a month later.

Earlier this year I reported to you a problem with BCM4366 and multiple
interfaces. I didn't get much help. 3 months later I came up with a patch
to work around the problem, but you said there's a better way to do
this. It took me 2 weeks to figure out a new wlioctl API for that
while all I needed was a simple hint about "interface_remove".

Right now I'm waiting to get any answer from you about the 4366c0
firmware. It's still less than 2 weeks since I asked for it, but a
simple ETA would be nice. I'm actually not sure if I should report
more problems to you, so as not to distract you from pending things.

Problems with brcmf_netdev_wait_pend8021x were reported multiple
times over the last few months. When I finally got time for that, it took me
a week to debug them.


As you can see, it takes me months to get help on some things. And in
a few cases I never got much help at all. Yes, I was hoping to have you
more involved in brcmfmac development and problem solving. I guess
things didn't meet my expectations and I got grumpy & preachy.

-- 
Rafał


[net-next PATCH] i40e: Clean up handling of msg_enable flags and debug parameter

2016-09-29 Thread Alexander Duyck
So the i40e driver had a really convoluted configuration for how to handle
the debug flags contained in msg_enable.  Part of the issue is that the
driver has its own 32 bit mask that it was using to track a separate set of
debug features.  From what I can tell it was trying to use the upper 4 bits
to determine if the value was meant to represent a bit-mask or the numeric
value provided by debug level.

What this patch does is clean this up by compressing those 4 bits into bit
31; as a result we just have to check whether the value is negative to
determine if we are looking at a debug level (positive) or a debug mask
(negative).  The debug level will populate the msg_level, and
the debug mask will populate the debug_mask in the hardware struct.
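Roughly, the intended interpretation of the module parameter would be the
following - an illustrative sketch only, not the actual patch code, and the
default message bits are a placeholder:

	if (debug < 0)				/* bit 31 set: debug mask */
		pf->hw.debug_mask = (u32)debug;
	else					/* message level */
		pf->msg_enable = netif_msg_init(debug,
						NETIF_MSG_DRV | NETIF_MSG_PROBE |
						NETIF_MSG_LINK);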

I added similar logic for ethtool.  If the value being provided has bit 31
set we assume the value being provided is a debug mask, otherwise we assume
it is a msg_enable mask.  For displaying we only provide the msg_enable,
and if debug_mask is in use we will print it to the dmesg log.

Lastly I removed the debugfs interface.  It is redundant with what we
already have in ethtool and really doesn't belong anyway.

Signed-off-by: Alexander Duyck 
---

So I am running this through a slightly different process than standard
because there are some items here that might be objectionable, so I want to
have that ironed out before I deal with the out-of-tree Intel driver.

I just want to verify whether this fix will work for net-next or if I
need to look at taking a different approach.  Once I get the go/no-go, I
will back-port this into the out-of-tree i40e driver.

 drivers/net/ethernet/intel/i40e/i40e_debugfs.c |   18 ---
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |7 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c|   28 
 3 files changed, 20 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_debugfs.c 
b/drivers/net/ethernet/intel/i40e/i40e_debugfs.c
index 0c1875b..acb0f13 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_debugfs.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_debugfs.c
@@ -1210,24 +1210,6 @@ static ssize_t i40e_dbg_command_write(struct file *filp,
dev_info(&pf->pdev->dev,
 "dump debug fwdata   
\n");
}
-
-   } else if (strncmp(cmd_buf, "msg_enable", 10) == 0) {
-   u32 level;
-   cnt = sscanf(&cmd_buf[10], "%i", &level);
-   if (cnt) {
-   if (I40E_DEBUG_USER & level) {
-   pf->hw.debug_mask = level;
-   dev_info(&pf->pdev->dev,
-"set hw.debug_mask = 0x%08x\n",
-pf->hw.debug_mask);
-   }
-   pf->msg_enable = level;
-   dev_info(>pdev->dev, "set msg_enable = 0x%08x\n",
-pf->msg_enable);
-   } else {
-   dev_info(>pdev->dev, "msg_enable = 0x%08x\n",
-pf->msg_enable);
-   }
} else if (strncmp(cmd_buf, "pfr", 3) == 0) {
dev_info(>pdev->dev, "debugfs: forcing PFR\n");
i40e_do_reset_safe(pf, BIT(__I40E_PF_RESET_REQUESTED));
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index e79a920..b133a77 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -978,6 +978,10 @@ static u32 i40e_get_msglevel(struct net_device *netdev)
 {
struct i40e_netdev_priv *np = netdev_priv(netdev);
struct i40e_pf *pf = np->vsi->back;
+   u32 debug_mask = pf->hw.debug_mask;
+
+   if (debug_mask)
+   netdev_info(netdev, "i40e debug_mask: 0x%08X\n", debug_mask);
 
return pf->msg_enable;
 }
@@ -989,7 +993,8 @@ static void i40e_set_msglevel(struct net_device *netdev, 
u32 data)
 
if (I40E_DEBUG_USER & data)
pf->hw.debug_mask = data;
-   pf->msg_enable = data;
+   else
+   pf->msg_enable = data;
 }
 
 static int i40e_get_regs_len(struct net_device *netdev)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index b93fa2a..b24c6eb 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -93,8 +93,8 @@ MODULE_DEVICE_TABLE(pci, i40e_pci_tbl);
 
 #define I40E_MAX_VF_COUNT 128
 static int debug = -1;
-module_param(debug, int, 0);
-MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)");
+module_param(debug, uint, 0);
+MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all), Debug mask 
(0x8XXX)");
 
 MODULE_AUTHOR("Intel Corporation, ");
 MODULE_DESCRIPTION("Intel(R) 

net-next tree broken with CONFIG_NETFILTER_INGRESS=n

2016-09-29 Thread Marcel Holtmann
Hi Dave,

the net-next tree is broken since a few days now when 
CONFIG_NETFILTER_INGRESS=n is set.

  CC  net/netfilter/core.o
In file included from ./include/linux/linkage.h:4:0,
 from ./include/linux/kernel.h:6,
 from net/netfilter/core.c:10:
net/netfilter/core.c: In function ‘nf_set_hooks_head’:
net/netfilter/core.c:96:30: error: ‘struct net_device’ has no member named 
‘nf_hooks_ingress’
   rcu_assign_pointer(reg->dev->nf_hooks_ingress, entry);
  ^
./include/linux/compiler.h:299:17: note: in definition of macro ‘WRITE_ONCE’
  union { typeof(x) __val; char __c[1]; } __u = \
 ^
Regards

Marcel



[PATCH v3 net 1/2] net/sched: act_vlan: Push skb->data to mac_header prior calling skb_vlan_*() functions

2016-09-29 Thread Shmulik Ladkani
Generic skb_vlan_push/skb_vlan_pop functions don't properly handle the
case where the input skb data pointer does not point at the mac header:

- They're doing push/pop, but fail to properly unwind data back to its
  original location.
  For example, in the skb_vlan_push case, any subsequent
  'skb_push(skb, skb->mac_len)' calls make the skb->data point 4 bytes
  BEFORE start of frame, leading to bogus frames that may be transmitted.

- They update rcsum per the added/removed 4 bytes tag.
  Alas if data is originally after the vlan/eth headers, then these
  bytes were already pulled out of the csum.

OTOH calling skb_vlan_push/skb_vlan_pop with skb->data at mac_header
presents no issues.
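To put some concrete (illustrative) numbers on the first point above - this
walk-through is not part of the patch:

	/* act_vlan at ingress: skb->data sits at the network header.
	 *   mac_header = M, data = M + 14 (offset 14), mac_len = 14
	 *
	 * old skb_vlan_push() with an hwaccel tag present:
	 *   __skb_push(skb, 14)   -> data = M
	 *   __vlan_insert_tag()   -> data = M - 4, mac_header = M - 4, mac_len = 18
	 *   __skb_pull(skb, 14)   -> data = M + 10, while the IP header is still at M + 14
	 *
	 * a later skb_push(skb, skb->mac_len):
	 *   data = M + 10 - 18 = M - 8, i.e. 4 bytes before the new frame start (M - 4)
	 */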

act_vlan is the only caller of skb_vlan_*() that has skb->data pointing
at the network header (upon ingress).
Other callers (ovs, bpf) already adjust skb->data to the mac_header.

This patch fixes act_vlan to point to the mac_header prior to calling the
skb_vlan_*() functions, as other callers do.

Signed-off-by: Shmulik Ladkani 
Cc: Daniel Borkmann 
Cc: Pravin Shelar 
Cc: Jiri Pirko 
---
 net/sched/act_vlan.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/net/sched/act_vlan.c b/net/sched/act_vlan.c
index 691409de3e..4ffc6c13a5 100644
--- a/net/sched/act_vlan.c
+++ b/net/sched/act_vlan.c
@@ -36,6 +36,12 @@ static int tcf_vlan(struct sk_buff *skb, const struct 
tc_action *a,
bstats_update(&v->tcf_bstats, skb);
action = v->tcf_action;
 
+   /* Ensure 'data' points at mac_header prior calling vlan manipulating
+* functions.
+*/
+   if (skb_at_tc_ingress(skb))
+   skb_push_rcsum(skb, skb->mac_len);
+
switch (v->tcfv_action) {
case TCA_VLAN_ACT_POP:
err = skb_vlan_pop(skb);
@@ -57,6 +63,9 @@ drop:
action = TC_ACT_SHOT;
v->tcf_qstats.drops++;
 unlock:
+   if (skb_at_tc_ingress(skb))
+   skb_pull_rcsum(skb, skb->mac_len);
+
spin_unlock(&v->tcf_lock);
return action;
 }
-- 
2.7.4



[PATCH v3 net 2/2] net: skbuff: Limit skb_vlan_pop/push() to expect skb->data at mac header

2016-09-29 Thread Shmulik Ladkani
skb_vlan_pop/push were too generic, trying to support the cases where
skb->data is at mac header, and cases where skb->data is arbitrarily
elsewhere.

Supporting an arbitrary skb->data was complex and bogus:
 - It failed to unwind skb->data to its original location post actual
   pop/push.
   (Also, semantic is not well defined for unwinding: If data was into
the eth header, need to use same offset from start; But if data was
at network header or beyond, need to adjust the original offset
according to the push/pull)
 - It mangled the rcsum post actual push/pop, without taking into account
   that the eth bytes might already have been pulled out of the csum.

Most callers (ovs, bpf) already had their skb->data at mac_header upon
invoking skb_vlan_pop/push.
Last caller that failed to do so (act_vlan) has been recently fixed.

Therefore, to simplify things, no longer support arbitrary skb->data
inputs for skb_vlan_pop/push().

skb->data is expected to be exactly at mac_header; WARN otherwise.

Signed-off-by: Shmulik Ladkani 
Cc: Daniel Borkmann 
Cc: Pravin Shelar 
Cc: Jiri Pirko 
---
 v3: Instead of correcting unwinding of skb->data in skb_vlan_pop/push,
 just kill the support for arbitraray skb->data inputs, and assume
 given skb->data always points at mac_header.
 Fix act_vlan, the sole user not adehering to this assumption.

 v2: Instead of reducing mac_len by 4 bytes, which was found incorrect,
 fix the problem of wrong unwinding of 'skb->data'

 net/core/skbuff.c | 37 ++---
 1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3864b4b68f..8c38263cdf 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4478,13 +4478,18 @@ EXPORT_SYMBOL(skb_ensure_writable);
 static int __skb_vlan_pop(struct sk_buff *skb, u16 *vlan_tci)
 {
struct vlan_hdr *vhdr;
-   unsigned int offset = skb->data - skb_mac_header(skb);
+   int offset = skb->data - skb_mac_header(skb);
int err;
 
-   __skb_push(skb, offset);
+   if (WARN_ONCE(offset,
+ "__skb_vlan_pop got skb with skb->data not at mac header 
(offset %d)\n",
+ offset)) {
+   return -EINVAL;
+   }
+
err = skb_ensure_writable(skb, VLAN_ETH_HLEN);
if (unlikely(err))
-   goto pull;
+   return err;
 
skb_postpull_rcsum(skb, skb->data + (2 * ETH_ALEN), VLAN_HLEN);
 
@@ -4501,12 +4506,13 @@ static int __skb_vlan_pop(struct sk_buff *skb, u16 
*vlan_tci)
skb_set_network_header(skb, ETH_HLEN);
 
skb_reset_mac_len(skb);
-pull:
-   __skb_pull(skb, offset);
 
return err;
 }
 
+/* Pop a vlan tag either from hwaccel or from payload.
+ * Expects skb->data at mac header.
+ */
 int skb_vlan_pop(struct sk_buff *skb)
 {
u16 vlan_tci;
@@ -4541,29 +4547,30 @@ int skb_vlan_pop(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(skb_vlan_pop);
 
+/* Push a vlan tag either into hwaccel or into payload (if hwaccel tag 
present).
+ * Expects skb->data at mac header.
+ */
 int skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci)
 {
if (skb_vlan_tag_present(skb)) {
-   unsigned int offset = skb->data - skb_mac_header(skb);
+   int offset = skb->data - skb_mac_header(skb);
int err;
 
-   /* __vlan_insert_tag expect skb->data pointing to mac header.
-* So change skb->data before calling it and change back to
-* original position later
-*/
-   __skb_push(skb, offset);
+   if (WARN_ONCE(offset,
+ "skb_vlan_push got skb with skb->data not at mac 
header (offset %d)\n",
+ offset)) {
+   return -EINVAL;
+   }
+
err = __vlan_insert_tag(skb, skb->vlan_proto,
skb_vlan_tag_get(skb));
-   if (err) {
-   __skb_pull(skb, offset);
+   if (err)
return err;
-   }
 
skb->protocol = skb->vlan_proto;
skb->mac_len += VLAN_HLEN;
 
skb_postpush_rcsum(skb, skb->data + (2 * ETH_ALEN), VLAN_HLEN);
-   __skb_pull(skb, offset);
}
__vlan_hwaccel_put_tag(skb, vlan_proto, vlan_tci);
return 0;
-- 
2.7.4



Re: [PATCH/RFC 00/12] Programming Open vSwitch (-like) flows into hardware using SwitchDev

2016-09-29 Thread Simon Horman
Hi Or,

On Wed, Sep 28, 2016 at 04:54:40PM +0300, Or Gerlitz wrote:
> On Wed, Sep 28, 2016 at 3:42 PM, Simon Horman
>  wrote:
> 
> > A different approach, not implemented by this patch-set, is for user-space
> > to program flows into hardware by some other means, for example TC, and/or
> > the (kernel) datapath.
> 
> Right, and we've submitted that code to the OVS community 24h ago [1].
> 
> This was done along the feedback we've got for the last two years (since
> the  LPC 2014 networking micro-conf). It allows offloading from
> multiple user-space
> applications through a single UAPI -- the TC one (currently we did
> flower, but the OVSD
> patch set can be extended to use whatever TC offloads are supported by
> the port driver,
> e.g U32, eBPF) and integration with 3rd party policy modules  running
> in user-space.
> 
> Lets hear people opinions and see where we go from now.
> 
> > I believe that approach does not conflict with this one.
> >  And there is some scope to share infrastructure in the kernel
> 
> maybe, possibly
> 
> We've having a talk in netdev 1.2 on offloading HW offloading of OVS
> and similar applications,
> I would encourage people to come and approach me and/or Rony Efraim
> from Mellanox before/after
> the talk to discuss that F2F, would love to get feedbacks, and also here...

Thanks for putting my post in context with the work you mention.
I am looking forward to some F2F discussions next week.

> Or.
> 
> [1] pointers to patches implementing the 2nd approach
> 
> cover-letter http://openvswitch.org/pipermail/dev/2016-September/079952.html
> 
> patches
> 
> https://patchwork.ozlabs.org/patch/675560/
> https://patchwork.ozlabs.org/patch/675567/
> https://patchwork.ozlabs.org/patch/675565/
> https://patchwork.ozlabs.org/patch/675559/
> https://patchwork.ozlabs.org/patch/675564/
> https://patchwork.ozlabs.org/patch/675563/
> https://patchwork.ozlabs.org/patch/675568/
> https://patchwork.ozlabs.org/patch/675566/
> https://patchwork.ozlabs.org/patch/675562/
> 
> [2] http://www.netdevconf.org/1.2/session.html?rony-efraim-1


Re: [PATCH v3 net 1/2] net/sched: act_vlan: Push skb->data to mac_header prior calling skb_vlan_*() functions

2016-09-29 Thread Shmulik Ladkani
David,

On Thu, 29 Sep 2016 12:10:40 +0300 Shmulik Ladkani  
wrote:
> This patch fixes act_vlan to point to the mac_header prior calling
> skb_vlan_*() functions, as other callers do.
> 

This 1/2 patch fixes the problem detailed in [1] for act_vlan,
last known caller of skb_vlan_*() with skb->data not at mac_header.

I think it's a good candidate for -stable; it fixes the observed bug and
it is rather focused.

Subsequent 2/2 patch hermetically seals the future possibility that one
might call skb_vlan_*() with skb->data not at mac_header.

This might go to stable as well, but not strictly required.

Thanks,
Shmulik

[1] https://patchwork.ozlabs.org/patch/676111/


Re: [PATCH] ipv6 addrconf: remove addrconf_sysctl_hop_limit()

2016-09-29 Thread Erik Kline
Seems fine to me.

Acked-by: Erik Kline 


Re: [PATCH RFC 5/6] net: phy: Trigger state machine on state change and not polling.

2016-09-29 Thread Andrew Lunn
On Wed, Sep 28, 2016 at 02:31:54PM -0700, Florian Fainelli wrote:
> On 09/28/2016 01:32 AM, Andrew Lunn wrote:
> > The phy_start() is used to indicate the PHY is now ready to do its
> > work. The state is changed, normally to PHY_UP which means that both
> > the MAC and the PHY are ready.
> > 
> > If the phy driver is using polling, when the next poll happens, the
> > state machine notices the PHY is now in PHY_UP, and kicks off
> > auto-negotiation, if needed.
> > 
> > If however, the PHY is using interrupts, there is no polling. The phy
> > is stuck in PHY_UP until the next interrupt comes along. And there is
> > no reason for the PHY to interrupt.
> > 
> > Have phy_start() schedule the state machine to run, which both speeds
> > up the polling use case, and makes the interrupt use case actually
> > work.
> > 
> > This problems exists whenever there is a state change which will not
> > cause an interrupt. Trigger the state machine in these cases,
> > e.g. phy_error().
> > 
> > Signed-off-by: Andrew Lunn 
> 
> No particular objections, this should also fix this:
> 
> http://lists.openwall.net/netdev/2016/05/17/147

Hi Florian

Yes, i was thinking it probably should have a fixes: tag and be a
separate patch. The hard part will be figuring out when this actually
broke, or does it go all the way back to when interrupt support was
added?

Andrew
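
A rough sketch of the idea in this thread (not the merged patch; the only
assumption is the existing phylib 'state_queue' delayed work): after a state
change that cannot itself raise an interrupt, kick the state machine right
away instead of waiting up to a second for the next poll tick.

/* Sketch only: run the phylib state machine immediately after a state
 * change (e.g. phy_start() moving to PHY_UP, or phy_error()) that will
 * not generate an interrupt.  phydev->state_queue is the delayed work
 * that normally drives the polled state machine once per second.
 */
static void phy_kick_state_machine(struct phy_device *phydev)
{
	queue_delayed_work(system_power_efficient_wq,
			   &phydev->state_queue, 0);
}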


Re: [PATCH net-next v3 2/3] udp: implement memory accounting helpers

2016-09-29 Thread Paolo Abeni
Hi Eric,

On Wed, 2016-09-28 at 18:42 -0700, Eric Dumazet wrote:
> On Wed, 2016-09-28 at 12:52 +0200, Paolo Abeni wrote:
> 
> > +static void udp_rmem_release(struct sock *sk, int partial)
> > +{
> > +   struct udp_sock *up = udp_sk(sk);
> > +   int fwd, amt;
> > +
> > +   if (partial && !udp_under_memory_pressure(sk))
> > +   return;
> > +
> > +   /* we can have concurrent release; if we catch any conflict
> > +* we let only one of them do the work
> > +*/
> > +   if (atomic_dec_if_positive(&up->can_reclaim) < 0)
> > +   return;
> > +
> > +   fwd = __udp_forward(up, atomic_read(&sk->sk_rmem_alloc));
> > +   if (fwd < SK_MEM_QUANTUM + partial) {
> > +   atomic_inc(&up->can_reclaim);
> > +   return;
> > +   }
> > +
> > +   amt = (fwd - partial) & ~(SK_MEM_QUANTUM - 1);
> > +   atomic_sub(amt, &up->mem_allocated);
> > +   atomic_inc(&up->can_reclaim);
> > +
> > +   __sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
> > +   sk->sk_forward_alloc = fwd - amt;
> > +}

Thank you for reviewing this!

> This is racy... 

Could you please elaborate? 

> all these atomics make me nervous...

I'd like to drop some of them if possible.

atomic_inc(&up->can_reclaim);

could probably be replaced with atomic_set(&up->can_reclaim, 1) since we
don't have concurrent processes doing that and can_reclaim.counter is
known to be 0 at that point.
Performance wise the impact is minimal, since in normal condition we do
the reclaim only on socket shutdown.

Paolo



[PATCH] ipv6 addrconf: remove addrconf_sysctl_hop_limit()

2016-09-29 Thread Maciej Żenczykowski
From: Maciej Żenczykowski 

This is an effective no-op in terms of user observable behaviour.

By preventing the overwrite of non-null extra1/extra2 fields
in addrconf_sysctl() we can enable the use of proc_dointvec_minmax().

This allows us to eliminate the constant min/max (1..255) trampoline
function that is addrconf_sysctl_hop_limit().

This is nice because it simplifies the code, and allows future
sysctls with constant min/max limits to also not require trampolines.

We still can't eliminate the trampoline for mtu because it isn't
actually a constant (it depends on other tunables of the device)
and thus requires at-write-time logic to enforce range.

Signed-off-by: Maciej Żenczykowski 
---
 net/ipv6/addrconf.c | 31 ++-
 1 file changed, 14 insertions(+), 17 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 2f1f5d439788..8bd2d06eefe7 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -5467,20 +5467,6 @@ int addrconf_sysctl_forward(struct ctl_table *ctl, int 
write,
 }
 
 static
-int addrconf_sysctl_hop_limit(struct ctl_table *ctl, int write,
-  void __user *buffer, size_t *lenp, loff_t *ppos)
-{
-   struct ctl_table lctl;
-   int min_hl = 1, max_hl = 255;
-
-   lctl = *ctl;
-   lctl.extra1 = &min_hl;
-   lctl.extra2 = &max_hl;
-
-   return proc_dointvec_minmax(&lctl, write, buffer, lenp, ppos);
-}
-
-static
 int addrconf_sysctl_mtu(struct ctl_table *ctl, int write,
void __user *buffer, size_t *lenp, loff_t *ppos)
 {
@@ -5713,6 +5699,9 @@ int addrconf_sysctl_ignore_routes_with_linkdown(struct 
ctl_table *ctl,
return ret;
 }
 
+static const int one = 1;
+static const int two_five_five = 255;
+
 static const struct ctl_table addrconf_sysctl[] = {
{
.procname   = "forwarding",
@@ -5726,7 +5715,9 @@ static const struct ctl_table addrconf_sysctl[] = {
.data   = &ipv6_devconf.hop_limit,
.maxlen = sizeof(int),
.mode   = 0644,
-   .proc_handler   = addrconf_sysctl_hop_limit,
+   .proc_handler   = proc_dointvec_minmax,
+   .extra1 = (void *)&one,
+   .extra2 = (void *)&two_five_five,
},
{
.procname   = "mtu",
@@ -6044,8 +6035,14 @@ static int __addrconf_sysctl_register(struct net *net, 
char *dev_name,
 
for (i = 0; table[i].data; i++) {
table[i].data += (char *)p - (char *)&ipv6_devconf;
-   table[i].extra1 = idev; /* embedded; no ref */
-   table[i].extra2 = net;
+   /* If one of these is already set, then it is not safe to
+* overwrite either of them: this makes proc_dointvec_minmax
+* usable.
+*/
+   if (!table[i].extra1 && !table[i].extra2) {
+   table[i].extra1 = idev; /* embedded; no ref */
+   table[i].extra2 = net;
+   }
}
 
snprintf(path, sizeof(path), "net/ipv6/conf/%s", dev_name);
-- 
2.8.0.rc3.226.g39d4020



Re: [PATCH v1] mlx4: remove unused fields

2016-09-29 Thread Tariq Toukan


On 28/09/2016 9:00 PM, David Decotigny wrote:

From: David Decotigny 

This can also address the following UBSAN warnings:
[   36.640343] 

[   36.648772] UBSAN: Undefined behaviour in 
drivers/net/ethernet/mellanox/mlx4/fw.c:857:26
[   36.656853] shift exponent 64 is too large for 32-bit type 'int'
[   36.663348] 

[   36.671783] 

[   36.680213] UBSAN: Undefined behaviour in 
drivers/net/ethernet/mellanox/mlx4/fw.c:861:27
[   36.688297] shift exponent 35 is too large for 32-bit type 'int'
[   36.694702] 


Tested:
   reboot with UBSAN, no warning.

Signed-off-by: David Decotigny 
---
  drivers/net/ethernet/mellanox/mlx4/fw.c | 4 
  drivers/net/ethernet/mellanox/mlx4/fw.h | 2 --
  2 files changed, 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.c 
b/drivers/net/ethernet/mellanox/mlx4/fw.c
index 090bf81..f9cbc67 100644
--- a/drivers/net/ethernet/mellanox/mlx4/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx4/fw.c
@@ -853,12 +853,8 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct 
mlx4_dev_cap *dev_cap)
dev_cap->max_eqs = 1 << (field & 0xf);
MLX4_GET(field, outbox, QUERY_DEV_CAP_RSVD_MTT_OFFSET);
dev_cap->reserved_mtts = 1 << (field >> 4);
-   MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_MRW_SZ_OFFSET);
-   dev_cap->max_mrw_sz = 1 << field;
MLX4_GET(field, outbox, QUERY_DEV_CAP_RSVD_MRW_OFFSET);
dev_cap->reserved_mrws = 1 << (field & 0xf);
-   MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_MTT_SEG_OFFSET);
-   dev_cap->max_mtt_seg = 1 << (field & 0x3f);
MLX4_GET(size, outbox, QUERY_DEV_CAP_NUM_SYS_EQ_OFFSET);
dev_cap->num_sys_eqs = size & 0xfff;
MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_REQ_QP_OFFSET);
diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.h 
b/drivers/net/ethernet/mellanox/mlx4/fw.h
index f11614f..5343a05 100644
--- a/drivers/net/ethernet/mellanox/mlx4/fw.h
+++ b/drivers/net/ethernet/mellanox/mlx4/fw.h
@@ -80,9 +80,7 @@ struct mlx4_dev_cap {
int max_eqs;
int num_sys_eqs;
int reserved_mtts;
-   int max_mrw_sz;
int reserved_mrws;
-   int max_mtt_seg;
int max_requester_per_qp;
int max_responder_per_qp;
int max_rdma_global;

Reviewed-by: Tariq Toukan 
Thanks.
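
For readers unfamiliar with the UBSAN reports quoted above: shifting a value
of type 'int' by 32 or more bits is undefined behaviour in C, which is what
'1 << field' does once the firmware-reported field reaches 64.  The patch
simply deletes the unused fields; purely as an illustration (helper name and
bounds are hypothetical), a defensive version of the pattern would widen the
constant and bound the exponent:

/* Illustration only, not part of the patch: widen the shifted constant
 * and reject out-of-range exponents before using a device-reported
 * value, so the shift is always defined.
 */
static inline u64 field_to_size(u8 field)
{
	if (field >= 64)
		return 0;	/* arbitrary: caller treats as invalid */
	return 1ULL << field;
}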


[PATCH v3 net-next 1/4] net/sched: act_mirred: Rename tcfm_ok_push to tcfm_mac_header_xmit and make it a bool

2016-09-29 Thread Shmulik Ladkani
'tcfm_ok_push' specifies whether a mac_len sized push is needed upon
egress to the target device (if action is performed at ingress).

Rename it to 'tcfm_mac_header_xmit' as this is actually an attribute of
the target device (and use a bool instead of int).

This allows to decouple the attribute from the action to be taken.

Signed-off-by: Shmulik Ladkani 
---
 v2: declare tcfm_mac_header_xmit as bool instead of int

 include/net/tc_act/tc_mirred.h |  2 +-
 net/sched/act_mirred.c | 11 ++-
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/include/net/tc_act/tc_mirred.h b/include/net/tc_act/tc_mirred.h
index 62770add15..95431092c4 100644
--- a/include/net/tc_act/tc_mirred.h
+++ b/include/net/tc_act/tc_mirred.h
@@ -8,7 +8,7 @@ struct tcf_mirred {
struct tc_actioncommon;
int tcfm_eaction;
int tcfm_ifindex;
-   int tcfm_ok_push;
+   booltcfm_mac_header_xmit;
struct net_device __rcu *tcfm_dev;
struct list_headtcfm_list;
 };
diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index 667dc382df..16e17a887b 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -60,11 +60,12 @@ static int tcf_mirred_init(struct net *net, struct nlattr 
*nla,
 {
struct tc_action_net *tn = net_generic(net, mirred_net_id);
struct nlattr *tb[TCA_MIRRED_MAX + 1];
+   bool mac_header_xmit = false;
struct tc_mirred *parm;
struct tcf_mirred *m;
struct net_device *dev;
-   int ret, ok_push = 0;
bool exists = false;
+   int ret;
 
if (nla == NULL)
return -EINVAL;
@@ -102,10 +103,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr 
*nla,
case ARPHRD_IPGRE:
case ARPHRD_VOID:
case ARPHRD_NONE:
-   ok_push = 0;
+   mac_header_xmit = false;
break;
default:
-   ok_push = 1;
+   mac_header_xmit = true;
break;
}
} else {
@@ -136,7 +137,7 @@ static int tcf_mirred_init(struct net *net, struct nlattr 
*nla,
dev_put(rcu_dereference_protected(m->tcfm_dev, 1));
dev_hold(dev);
rcu_assign_pointer(m->tcfm_dev, dev);
-   m->tcfm_ok_push = ok_push;
+   m->tcfm_mac_header_xmit = mac_header_xmit;
}
 
if (ret == ACT_P_CREATED) {
@@ -181,7 +182,7 @@ static int tcf_mirred(struct sk_buff *skb, const struct 
tc_action *a,
goto out;
 
if (!(at & AT_EGRESS)) {
-   if (m->tcfm_ok_push)
+   if (m->tcfm_mac_header_xmit)
skb_push_rcsum(skb2, skb->mac_len);
}
 
-- 
2.7.4



Re: How to submit potential patch in linux kernel

2016-09-29 Thread Daniel Baluta
Hi Shyam,

On Thu, Sep 29, 2016 at 1:02 PM, Andrew Lunn  wrote:
> On Thu, Sep 29, 2016 at 04:32:29AM +0530, Shyam Saini wrote:
>> Hi everyone,
>>
>> I'm Shyam, final year undergraduate student. I wanted to know how one can
>> submit potential linux kernel patch in networking subsystem.
>
> Documentation/SubmittingPatches is a good starting point.
>
> Then post a patch and we will help you learn the rest of the process.


You can have a look at https://kernelnewbies.org/FirstKernelPatch.
There is also this presentation from Greg:
https://www.youtube.com/watch?v=LLBrBBImJt4

thanks,
Daniel.


Re: [PATCH v2 net-next 4/4] net/sched: act_mirred: Implement ingress actions

2016-09-29 Thread Shmulik Ladkani
Hi Eric,

On Tue, 27 Sep 2016 14:27:13 -0700 Eric Dumazet  wrote:
> 
> Since this runs lockless, another cpu might change m->tcfm_eaction in
> the middle, and you could call dev_queue_xmit(skb2) while the skb2 was
> prepared for the opposite action.

Well, it seems members of 'struct tcf_mirred' are out of sync wrt each
other, even in existing code, regardless of this patch:

- 'tcfm_dev' may be assigned, but 'tcfm_ok_push' not yet updated,
  may result in skb_push_rcsum being called/not called

- 'tcfm_eaction' is changed, in between "mirror is always swallowed" to
  the final 'out:' label,
  may result in wrong tc_verd assigned (or lack of assignment)

Seems the whole "params" need be rcu_dereferenced, like in
tunnel_key_act, or like your suggestion in
  https://patchwork.ozlabs.org/patch/667680/.

I'm gonna fix the new problem you pointed out, by reading-once
'tcfm_eaction' early (right when tcfm_dev is dereferenced) knowing this
is just "keeping things as is wrt running lockless", without introducing
any new non-coherent code.

Thanks,
Shmulik
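
The "whole params need be rcu_dereferenced" remark above corresponds roughly
to the shape below (all names hypothetical, shown only to illustrate the
pattern): every per-action field lives behind one RCU-managed struct, so a
reader always sees a single coherent snapshot.

/* Hypothetical sketch of an RCU-managed parameter block: eaction, dev
 * and the push/pull decision can never come from two different updates.
 */
struct mirred_params {
	int			eaction;
	bool			mac_header_xmit;
	struct net_device	*dev;
	struct rcu_head		rcu;
};

struct mirred_holder {
	struct mirred_params __rcu *params;
};

static int mirred_read_eaction(struct mirred_holder *h)
{
	struct mirred_params *p;
	int eaction;

	rcu_read_lock();
	p = rcu_dereference(h->params);	/* one coherent snapshot */
	eaction = p->eaction;
	rcu_read_unlock();

	return eaction;
}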


[PATCH v3 net-next 0/4] act_mirred: Ingress actions support

2016-09-29 Thread Shmulik Ladkani
This patch series implements action mirred 'ingress' actions
TCA_INGRESS_REDIR and TCA_INGRESS_MIRROR.

This allows attaching filters whose target is to hand matching skbs into
the rx processing of a specified device.

v3:
  in 4/4, addressed non coherency due to reading m->tcfm_eaction multiple
  times, as spotted by Eric Dumazet
v2:
  in 1/4, declare tcfm_mac_header_xmit as bool instead of int

Shmulik Ladkani (4):
  net/sched: act_mirred: Rename tcfm_ok_push to tcfm_mac_header_xmit and
make it a bool
  net/sched: act_mirred: Refactor detection whether dev needs xmit at
mac header
  net/sched: tc_mirred: Rename public predicates
'is_tcf_mirred_redirect' and 'is_tcf_mirred_mirror'
  net/sched: act_mirred: Implement ingress actions

 drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c  |  2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c|  2 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c |  4 +-
 .../net/ethernet/netronome/nfp/nfp_net_offload.c   |  2 +-
 include/net/tc_act/tc_mirred.h |  6 +-
 net/sched/act_mirred.c | 84 --
 7 files changed, 73 insertions(+), 29 deletions(-)

-- 
2.7.4



[PATCH v3 net-next 3/4] net/sched: tc_mirred: Rename public predicates 'is_tcf_mirred_redirect' and 'is_tcf_mirred_mirror'

2016-09-29 Thread Shmulik Ladkani
These accessors are used in various drivers that support tc offloading,
to detect properties of a given 'tc_action'.

'is_tcf_mirred_redirect' tests that the action is TCA_EGRESS_REDIR.
'is_tcf_mirred_mirror' tests that the action is TCA_EGRESS_MIRROR.

As a prep towards supporting INGRESS redir/mirror, rename these
predicates to reflect their true meaning:
  s/is_tcf_mirred_redirect/is_tcf_mirred_egress_redirect/
  s/is_tcf_mirred_mirror/is_tcf_mirred_egress_mirror/

Signed-off-by: Shmulik Ladkani 
Cc: Hariprasad S 
Cc: Jeff Kirsher 
Cc: Saeed Mahameed 
Cc: Jiri Pirko 
Cc: Ido Schimmel 
Cc: Jakub Kicinski 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c| 2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c| 2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c  | 2 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c   | 4 +++-
 drivers/net/ethernet/netronome/nfp/nfp_net_offload.c | 2 +-
 include/net/tc_act/tc_mirred.h   | 4 ++--
 6 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c
index 49d2debb33..52af62e0ec 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c
@@ -113,7 +113,7 @@ static int fill_action_fields(struct adapter *adap,
}
 
/* Re-direct to specified port in hardware. */
-   if (is_tcf_mirred_redirect(a)) {
+   if (is_tcf_mirred_egress_redirect(a)) {
struct net_device *n_dev;
unsigned int i, index;
bool found = false;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index a244d9a672..784b0b98ab 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8410,7 +8410,7 @@ static int parse_tc_actions(struct ixgbe_adapter *adapter,
}
 
/* Redirect to a VF or a offloaded macvlan */
-   if (is_tcf_mirred_redirect(a)) {
+   if (is_tcf_mirred_egress_redirect(a)) {
int ifindex = tcf_mirred_ifindex(a);
 
err = handle_redirect_action(adapter, ifindex, queue,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index a350b7171e..957a464489 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -404,7 +404,7 @@ static int parse_tc_fdb_actions(struct mlx5e_priv *priv, 
struct tcf_exts *exts,
continue;
}
 
-   if (is_tcf_mirred_redirect(a)) {
+   if (is_tcf_mirred_egress_redirect(a)) {
int ifindex = tcf_mirred_ifindex(a);
struct net_device *out_dev;
struct mlx5e_priv *out_priv;
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index fd74d1064f..6a4f9c4664 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -1237,8 +1237,10 @@ static int mlxsw_sp_port_add_cls_matchall(struct 
mlxsw_sp_port *mlxsw_sp_port,
 
tcf_exts_to_list(cls->exts, &actions);
list_for_each_entry(a, &actions, list) {
-   if (!is_tcf_mirred_mirror(a) || protocol != htons(ETH_P_ALL))
+   if (!is_tcf_mirred_egress_mirror(a) ||
+   protocol != htons(ETH_P_ALL)) {
return -ENOTSUPP;
+   }
 
err = mlxsw_sp_port_add_cls_matchall_mirror(mlxsw_sp_port, cls,
a, ingress);
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_offload.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_offload.c
index 8acfb631a0..cfed40c0e3 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_offload.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_offload.c
@@ -128,7 +128,7 @@ nfp_net_bpf_get_act(struct nfp_net *nn, struct 
tc_cls_bpf_offload *cls_bpf)
if (is_tcf_gact_shot(a))
return NN_ACT_TC_DROP;
 
-   if (is_tcf_mirred_redirect(a) &&
+   if (is_tcf_mirred_egress_redirect(a) &&
tcf_mirred_ifindex(a) == nn->netdev->ifindex)
return NN_ACT_TC_REDIR;
}
diff --git a/include/net/tc_act/tc_mirred.h b/include/net/tc_act/tc_mirred.h
index 95431092c4..604bc31e23 100644
--- a/include/net/tc_act/tc_mirred.h
+++ b/include/net/tc_act/tc_mirred.h
@@ -14,7 +14,7 @@ struct tcf_mirred {
 };
 #define to_mirred(a) 

[PATCH v3 net-next 2/4] net/sched: act_mirred: Refactor detection whether dev needs xmit at mac header

2016-09-29 Thread Shmulik Ladkani
Move detection logic that tests whether device expects skb data to point
at mac_header upon xmit into a function.

Signed-off-by: Shmulik Ladkani 
---
 net/sched/act_mirred.c | 28 +++-
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index 16e17a887b..69dcce8c75 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -54,6 +54,20 @@ static const struct nla_policy mirred_policy[TCA_MIRRED_MAX 
+ 1] = {
 static int mirred_net_id;
 static struct tc_action_ops act_mirred_ops;
 
+static bool dev_is_mac_header_xmit(const struct net_device *dev)
+{
+   switch (dev->type) {
+   case ARPHRD_TUNNEL:
+   case ARPHRD_TUNNEL6:
+   case ARPHRD_SIT:
+   case ARPHRD_IPGRE:
+   case ARPHRD_VOID:
+   case ARPHRD_NONE:
+   return false;
+   }
+   return true;
+}
+
 static int tcf_mirred_init(struct net *net, struct nlattr *nla,
   struct nlattr *est, struct tc_action **a, int ovr,
   int bind)
@@ -96,19 +110,7 @@ static int tcf_mirred_init(struct net *net, struct nlattr 
*nla,
tcf_hash_release(*a, bind);
return -ENODEV;
}
-   switch (dev->type) {
-   case ARPHRD_TUNNEL:
-   case ARPHRD_TUNNEL6:
-   case ARPHRD_SIT:
-   case ARPHRD_IPGRE:
-   case ARPHRD_VOID:
-   case ARPHRD_NONE:
-   mac_header_xmit = false;
-   break;
-   default:
-   mac_header_xmit = true;
-   break;
-   }
+   mac_header_xmit = dev_is_mac_header_xmit(dev);
} else {
dev = NULL;
}
-- 
2.7.4



[PATCH v3 net-next 4/4] net/sched: act_mirred: Implement ingress actions

2016-09-29 Thread Shmulik Ladkani
Up until now, 'action mirred' supported only egress actions (either
TCA_EGRESS_REDIR or TCA_EGRESS_MIRROR).

This patch implements the corresponding ingress actions
TCA_INGRESS_REDIR and TCA_INGRESS_MIRROR.

This allows attaching filters whose target is to hand matching skbs into
the rx processing of a specified device.

Signed-off-by: Shmulik Ladkani 
Cc: Jamal Hadi Salim 
Cc: Eric Dumazet 
---
 v3: Addressed non coherency due to reading m->tcfm_eaction multiple times,
 as spotted by Eric Dumazet

 net/sched/act_mirred.c | 51 --
 1 file changed, 45 insertions(+), 6 deletions(-)

diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index 69dcce8c75..22dcfd68e6 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -33,6 +33,25 @@
 static LIST_HEAD(mirred_list);
 static DEFINE_SPINLOCK(mirred_list_lock);
 
+static bool tcf_mirred_is_act_redirect(int action)
+{
+   return action == TCA_EGRESS_REDIR || action == TCA_INGRESS_REDIR;
+}
+
+static u32 tcf_mirred_act_direction(int action)
+{
+   switch (action) {
+   case TCA_EGRESS_REDIR:
+   case TCA_EGRESS_MIRROR:
+   return AT_EGRESS;
+   case TCA_INGRESS_REDIR:
+   case TCA_INGRESS_MIRROR:
+   return AT_INGRESS;
+   default:
+   BUG();
+   }
+}
+
 static void tcf_mirred_release(struct tc_action *a, int bind)
 {
struct tcf_mirred *m = to_mirred(a);
@@ -97,6 +116,8 @@ static int tcf_mirred_init(struct net *net, struct nlattr 
*nla,
switch (parm->eaction) {
case TCA_EGRESS_MIRROR:
case TCA_EGRESS_REDIR:
+   case TCA_INGRESS_REDIR:
+   case TCA_INGRESS_MIRROR:
break;
default:
if (exists)
@@ -156,15 +177,20 @@ static int tcf_mirred(struct sk_buff *skb, const struct 
tc_action *a,
  struct tcf_result *res)
 {
struct tcf_mirred *m = to_mirred(a);
+   bool m_mac_header_xmit;
struct net_device *dev;
struct sk_buff *skb2;
-   int retval, err;
+   int retval, err = 0;
+   int m_eaction;
+   int mac_len;
u32 at;
 
tcf_lastuse_update(&m->tcf_tm);
bstats_cpu_update(this_cpu_ptr(m->common.cpu_bstats), skb);
 
rcu_read_lock();
+   m_mac_header_xmit = READ_ONCE(m->tcfm_mac_header_xmit);
+   m_eaction = READ_ONCE(m->tcfm_eaction);
retval = READ_ONCE(m->tcf_action);
dev = rcu_dereference(m->tcfm_dev);
if (unlikely(!dev)) {
@@ -183,23 +209,36 @@ static int tcf_mirred(struct sk_buff *skb, const struct 
tc_action *a,
if (!skb2)
goto out;
 
-   if (!(at & AT_EGRESS)) {
-   if (m->tcfm_mac_header_xmit)
+   /* If action's target direction differs than filter's direction,
+* and devices expect a mac header on xmit, then mac push/pull is
+* needed.
+*/
+   if (at != tcf_mirred_act_direction(m_eaction) && m_mac_header_xmit) {
+   if (at & AT_EGRESS) {
+   /* caught at egress, act ingress: pull mac */
+   mac_len = skb_network_header(skb) - skb_mac_header(skb);
+   skb_pull_rcsum(skb2, mac_len);
+   } else {
+   /* caught at ingress, act egress: push mac */
skb_push_rcsum(skb2, skb->mac_len);
+   }
}
 
/* mirror is always swallowed */
-   if (m->tcfm_eaction != TCA_EGRESS_MIRROR)
+   if (tcf_mirred_is_act_redirect(m_eaction))
skb2->tc_verd = SET_TC_FROM(skb2->tc_verd, at);
 
skb2->skb_iif = skb->dev->ifindex;
skb2->dev = dev;
-   err = dev_queue_xmit(skb2);
+   if (tcf_mirred_act_direction(m_eaction) & AT_EGRESS)
+   err = dev_queue_xmit(skb2);
+   else
+   netif_receive_skb(skb2);
 
if (err) {
 out:
qstats_overlimit_inc(this_cpu_ptr(m->common.cpu_qstats));
-   if (m->tcfm_eaction != TCA_EGRESS_MIRROR)
+   if (tcf_mirred_is_act_redirect(m_eaction))
retval = TC_ACT_SHOT;
}
rcu_read_unlock();
-- 
2.7.4



Re: [PATCH net-next v3 2/3] udp: implement memory accounting helpers

2016-09-29 Thread Paolo Abeni
On Wed, 2016-09-28 at 18:42 -0700, Eric Dumazet wrote:
> On Wed, 2016-09-28 at 12:52 +0200, Paolo Abeni wrote:
> 
> > +static void udp_rmem_release(struct sock *sk, int partial)
> > +{
> > +   struct udp_sock *up = udp_sk(sk);
> > +   int fwd, amt;
> > +
> > +   if (partial && !udp_under_memory_pressure(sk))
> > +   return;
> > +
> > +   /* we can have concurrent release; if we catch any conflict
> > +* we let only one of them do the work
> > +*/
> > +   if (atomic_dec_if_positive(&up->can_reclaim) < 0)
> > +   return;
> > +
> > +   fwd = __udp_forward(up, atomic_read(&sk->sk_rmem_alloc));
> > +   if (fwd < SK_MEM_QUANTUM + partial) {
> > +   atomic_inc(&up->can_reclaim);
> > +   return;
> > +   }
> > +
> > +   amt = (fwd - partial) & ~(SK_MEM_QUANTUM - 1);
> > +   atomic_sub(amt, &up->mem_allocated);
> > +   atomic_inc(&up->can_reclaim);
> > +
> > +   __sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
> > +   sk->sk_forward_alloc = fwd - amt;
> > +}
> 
> 
> This is racy... all these atomics make me nervous...

Ah, perhaps I got it: if we have a concurrent memory scheduling, we
could end up with a value of mem_allocated below the real need. 

That mismatch will not drift: at worst we can end up with mem_allocated
being a single SK_MEM_QUANTUM below what is strictly needed.

A possible alternative could be:

static void udp_rmem_release(struct sock *sk, int partial)
{
struct udp_sock *up = udp_sk(sk);
int fwd, amt, alloc_old, alloc;

if (partial && !udp_under_memory_pressure(sk))
return;

alloc = atomic_read(&up->mem_allocated);
fwd = alloc - atomic_read(&sk->sk_rmem_alloc);
if (fwd < SK_MEM_QUANTUM + partial)
return;

amt = (fwd - partial) & ~(SK_MEM_QUANTUM - 1);
alloc_old = atomic_cmpxchg(&up->mem_allocated, alloc, alloc - amt);
/* if a concurrent update is detected, just do nothing; if said update
 * is due to another memory release, that release take care of
 * reclaiming the memory for us, too.
 * Otherwise we will be able to release on later dequeue, since
 * we will eventually stop colliding with the writer when it will
 * consume all the fwd allocated memory
 */
if (alloc_old != alloc)
return;

__sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
sk->sk_forward_alloc = fwd - amt;
}

which is even more lazy in reclaiming but should never underestimate the
needed forward allocation, and under pressure should eventually free the
needed memory.





Re: [PATCH] ipv6 addrconf: implement RFC7559 router solicitation backoff

2016-09-29 Thread Erik Kline
Passes my local unittest for this behaviour.

Acked-by: Erik Kline 


Re: How to submit potential patch in linux kernel

2016-09-29 Thread Andrew Lunn
On Thu, Sep 29, 2016 at 04:32:29AM +0530, Shyam Saini wrote:
> Hi everyone,
> 
> I'm Shyam, final year undergraduate student. I wanted to know how one can
> submit potential linux kernel patch in networking subsystem.

Documentation/SubmittingPatches is a good starting point.

Then post a patch and we will help you learn the rest of the process.

 Andrew


Re: [RFC] net: store port/representative id in metadata_dst

2016-09-29 Thread Jakub Kicinski
On Fri, 23 Sep 2016 14:20:40 -0700, John Fastabend wrote:
> On 16-09-23 01:45 PM, Jakub Kicinski wrote:
> > On Fri, 23 Sep 2016 13:25:10 -0700, John Fastabend wrote:  
> >> On 16-09-23 01:17 PM, Jakub Kicinski wrote:  
> >>> On Fri, 23 Sep 2016 10:22:59 -0700, Samudrala, Sridhar wrote:
>  On 9/23/2016 8:29 AM, Jakub Kicinski wrote:
> >>  [...]  
> >>  [...]
> 
>  The 'accel' parameter in dev_queue_xmit_accel() is currently only passed
>  to ndo_select_queue() via netdev_pick_tx() and is used to select the tx 
>  queue.
>  Also, it is not passed all the way to the driver specific xmit routine.  
>  Doesn't it require
>  changing all the driver xmit routines if we want to pass this parameter?
> 
> >>  [...]
> 
>  Yes.  The VFPR netdevs don't have any HW queues associated with them and 
>  we would like
>  to use the PF queues for the xmit.
>  I was also looking into some way of passing the port id via skb 
>  parameter to the
>  dev_queue_xmit() call so that the PF xmit routine can do a directed 
>  transmit to a specifc VF.
>  Is skb->cb an option to pass this info?
>  dst_metadata approach would work  too if it is acceptable.
> >>>
> >>> I don't think we can trust skb->cb to be set to anything meaningful
> >>> when the skb is received by the lower device.   
> >>
> >> Agreed. I wouldn't recommend using skb->cb. How about passing it through
> >> dev_queue_xmit_accel() through to the driver?
> >>
> >> If you pass the metadata through the dev_queue_xmit_accel() handle tx
> >> queue  selection would work using normal mechanisms (xps, select_queue,
> >> cls  hook, etc.). If you wanted to pick some specific queue based on
> >> policy the policy could be loaded into one of those hooks.  
> > 
> > Do you mean without extending how accel is handled by
> > dev_queue_xmit_accel() today?  If my goal is to not have extra HW
> > queues then I don't see how I could mux in the lower dev without extra
> > locking (as I tried to explain two emails ago).  Sorry for being slow
> > here :(
> >   
> 
> Not slow here I think I was overly optimistic...
> 
> Yeh let me try this, roughly the current flow is,
> 
>dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv);
>__dev_queue_xmit(skb, accel_priv);
>netdev_pick_tx(dev, skb, accel_priv);
>   ndo_select_queue(dev, skb, accel_priv, ...);
>[...]
>q->enqueue();
>[...]
>dev_hard_start_xmit();
>[...]
> 
> 
> So in this flow the VFR netdev driver handles its xmit routine by
> calling dev_queue_xmit_accel after setting skb->dev to the physical
> device and passing a cookie via accel that the select_queue() routine
> can use to pick a tx queue. The rest of the stack q->enqueue() and
> friends will ensure that locking and qdisc is handled correctly.
> 
> But accel_priv was lost at queue selection and so its not being passed
> down to the driver so no way to set your descriptor bits or whatever
> needed to push to the VF. I was sort of thinking we could map it from
> the select_queue routine but I can't figure out how to do that either.
> 
> The metadata idea doesn't seem that bad now that I've spent some more
> time going through it. Either that or hijack some field in the skb but
> I think that might be worse than the proposal here.
> 
> I'm trying to think up some other alternative now and will let you know
> if I think of anything clever but got nothing at the moment.

Cool, I'm happy to discuss this further at netdev but it seems like
there is no strong opposition so far?

FWIW in the example I gave I didn't do refcounting on the dst but I
think that's incorrect since we don't have control over lifetime of
redirected/stolen skbs.
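
For reference, the representor-transmit flow sketched in the quoted call
chain corresponds roughly to the code below (struct and field names are
hypothetical; dev_queue_xmit_accel() is the existing entry point): the
representor has no queues of its own, so it retargets the skb at the parent
PF and passes a per-port cookie that queue selection can use; the open
question in this thread is how to also get that cookie (or a metadata_dst)
down to the PF's xmit routine.

/* Hypothetical VF-representor xmit: no HW queues of its own, so hand
 * the skb to the parent PF with a cookie identifying the port.
 */
struct vfr_priv {
	struct net_device *pf_dev;	/* physical function netdev */
	u32 port_id;			/* VF/port this netdev represents */
};

static netdev_tx_t vfr_ndo_start_xmit(struct sk_buff *skb,
				      struct net_device *dev)
{
	struct vfr_priv *priv = netdev_priv(dev);

	skb->dev = priv->pf_dev;
	/* The cookie reaches ndo_select_queue() via netdev_pick_tx() but is
	 * currently dropped before dev_hard_start_xmit(), which is the gap
	 * the metadata_dst proposal tries to close.
	 */
	return dev_queue_xmit_accel(skb, priv);
}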


RE: [PATCH 3/3] net: fec: align IP header in hardware

2016-09-29 Thread David Laight
From: Eric Nelson
> Sent: 28 September 2016 18:15
> On 09/28/2016 09:42 AM, David Laight wrote:
> > From: Eric Nelson
> >> Sent: 26 September 2016 19:40
> >> Hi David,
> >>
> >> On 09/26/2016 02:26 AM, David Laight wrote:
> >>> From: Eric Nelson
>  Sent: 24 September 2016 15:42
>  The FEC receive accelerator (RACC) supports shifting the data payload of
>  received packets by 16-bits, which aligns the payload (IP header) on a
>  4-byte boundary, which is, if not required, at least strongly suggested
>  by the Linux networking layer.
> >>> ...
>  +/* align IP header */
>  +val |= FEC_RACC_SHIFT16;
> >>>
> >>> I can't help feeling that there needs to be corresponding
> >>> changes to increase the buffer size by 2 (maybe for large mtu)
> >>> and to discard two bytes from the frame length.
> >>>
> >>
> >> In the normal case, the fec driver over-allocates all receive packets to
> >> be of size FEC_ENET_RX_FRSIZE (2048) minus the value of rx_align,
> >> which is either 0x0f (ARM) or 0x03 (PPC).
> >>
> >> If the frame length is less than rx_copybreak (typically 256), then
> >> the frame length from the receive buffer descriptor is used to
> >> control the allocation size for a copied buffer, and this will include
> >> the two bytes of padding if RACC_SHIFT16 is set.
> >>
> >>> If probably ought to be predicated on NET_IP_ALIGN as well.
> >>>
> >> Can you elaborate?
> >
> > From reading this it seems that the effect of FEC_RACC_SHIFT16 is to
> > add two bytes of 'junk' to the start of every receive frame.
> >
> 
> That's right. Two bytes of junk between the MAC header and the
> IP header.
> 
> > In the 'copybreak' case the new skb would need to be 2 bytes shorter
> > than the length reported by the hardware, and the data copied from
> > 2 bytes into the dma buffer.
> >
> 
> As it stands, the skb allocated by the copybreak routine will include
> the two bytes of padding, and the call to skb_pull_inline will ignore
> them.

Ok, I didn't see that call being added by this patch.

> > The extra 2 bytes also mean the that maximum mtu that can be received
> > into a buffer is two bytes less.
> >
> 
> Right, but I think the max is already high enough that this isn't a
> problem.
> 
> > If someone sets the mtu to (say) 9k for jumbo frames this might matter.
> > Even with fixed 2048 byte buffers it reduces the maximum value the mtu
> > can be set to by 2.
> >
> 
> As far as I can tell, the fec driver doesn't support jumbo frames, and
> the max frame length is currently hard-coded at PKT_MAXBUF_SIZE (1522).
> 
> This is well within the 2048-byte allocation, even with optional headers
> for VLAN etc.

Hmm...
That (probably) means all the skbs the driver allocates are actually 4k.
It would be much better to reduce the size so that the entire skb
(with packet buffer) is less than 2k.

> > Now if NET_IP_ALIGN is zero then it is fine for the rx frame to start
> > on a 4n boundary, and the skb are likely to be allocated that way.
> > In this case you don't want to extra two bytes of 'junk'.
> >
> NET_IP_ALIGN is defaulting to 2 by the conditional in skbuff.h

Even though it is always currently set is isn't really ideal to have
a driver that breaks if it isn't set.
This could easily happen at some point in the future if the ethernet
logic is put with a different cpu.


> > OTOH if NET_IP_ALIGN is 2 then you need to 'fiddle' things so that
> > the data is dma'd to offset -2 in the skb and then ensure that the
> > end of frame is set correctly.
> >
> 
> That's what the RACC SHIFT16 bit does.

No, that causes the ethernet controller to add 2 bytes to the frame.
You then need to change the dma target address to match.
Otherwise if a new version of the silicon stops ignoring the low
address bits, the frame will be misaligned in the buffer.

The receive frame length will also (probably) be 2 larger than the
actual frame - so you need to set the end point correctly as well.
IP will probably ignore the 2 bytes of pad I think you are generating.

> The FEC hardware isn't capable of DMA'ing to an un-aligned address.
> On ARM, it requires 64-bit alignment, but suggests 128-bit alignment.
> 
> On other (PPC?) architectures, it requires 32-bit alignment. This is
> handled by the rx_align field.

That isn't entirely relevant.
If the kernel is being built with NET_IP_ALIGN set to 0 you should
align the destination mac address on a 4n boundary
(Or rather the skb are likely to be allocated that way).
If it causes misaligned memory reads later on that is a different problem.
The MAC driver has aligned the frames as it was told to.

David
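
The "predicated on NET_IP_ALIGN" suggestion earlier in the thread amounts to
something like the sketch below (illustration only, not the submitted patch):
request the hardware 2-byte shift only when the stack actually wants the IP
header offset.

/* Sketch: enable the 16-bit receive shift only on architectures that
 * ask for the 2-byte IP-header padding.  Where NET_IP_ALIGN is 0 the
 * frame can start on a 4-byte boundary and the pad is just overhead.
 */
#if NET_IP_ALIGN
	val |= FEC_RACC_SHIFT16;	/* hardware inserts the 2-byte pad */
#endif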




[PATCH net-next 0/6] rxrpc: Fixes and adjustments

2016-09-29 Thread David Howells

This set of patches contains some fixes and adjustments:

 (1) Connections for exclusive calls are being reused because the check to
 work out whether to set RXRPC_CONN_DONT_REUSE is checking where the
 parameters will be copied to (but haven't yet).

 (2) Make Tx loss-injection go through the normal return, so the state gets
 set correctly for what the code thinks it has done.

 Note lost Tx packets in the tx_data trace rather than the skb
 tracepoint.

 (3) Activate channels according to the current state from within the
 channel_lock to avoid someone changing it on us.

 (4) Reduce the local endpoint's services list to a single pointer as we
 don't allow service AF_RXRPC sockets to share UDP ports with other
 AF_RXRPC sockets, so there can't be more than one element in the list.

 (5) Request more ACKs in slow-start mode to help monitor the state driving
 the window configuration.

 (6) Display the serial number of the packet being ACK'd rather than the
 ACK packet's own serial number in the congestion trace as this can be
 related to other entries in the trace.

The patches can be found here also:


http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite

Tagged thusly:

git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
rxrpc-rewrite-20160929

David
---
David Howells (6):
  rxrpc: Fix exclusive client connections
  rxrpc: Make Tx loss-injection go through normal return and adjust tracing
  rxrpc: When activating client conn channels, do state check inside lock
  rxrpc: Reduce the rxrpc_local::services list to a pointer
  rxrpc: Request more ACKs in slow-start mode
  rxrpc: Note serial number being ACK'd in the congestion management trace


 include/trace/events/rxrpc.h |6 --
 net/rxrpc/af_rxrpc.c |   21 -
 net/rxrpc/ar-internal.h  |6 ++
 net/rxrpc/call_accept.c  |8 
 net/rxrpc/call_event.c   |2 +-
 net/rxrpc/conn_client.c  |   38 ++
 net/rxrpc/input.c|8 
 net/rxrpc/local_object.c |3 +--
 net/rxrpc/misc.c |1 -
 net/rxrpc/output.c   |   18 ++
 net/rxrpc/security.c |8 
 net/rxrpc/sendmsg.c  |2 +-
 net/rxrpc/skbuff.c   |   11 +++
 13 files changed, 68 insertions(+), 64 deletions(-)



Re: UDP wierdness around skb_copy_and_csum_datagram_msg()

2016-09-29 Thread Christian Lamparter
On Wednesday, September 28, 2016 7:20:39 PM CEST Jay Smith wrote:
> Actually, on a little more searching of this list's archives, I think
> that this discussion:  https://patchwork.kernel.org/patch/9260733/ is
> about exactly the same issue I've found, except from the TCP side. I'm
> cc'ing a few of the participants from that discussion.
> 
> So is the patch proposed there (copying and restoring the entire
> iov_iter in skb_copy_and_csum_datagram_msg()) being considered as a
> fix?

From Alan's post:

"My ugly patch fixes this in the most obvious way: make a local copy of
msg->msg_iter before the call to skb_copy_and_csum_datagram(), and copy
it back if the checksum is bad, just before goto csum_error;"

IMHO this meant that the patch is a proof of concept for his problem.

> If not, would an alternate one that concealed the save-and-restore logic
> inside iov_iter.c be more acceptable? I'd be happy to produce whatever's
> needed, or yield to someone with stronger feelings about where it should
> go...
Al Viro identified more inconsistencies within the error-paths that deal
with EFAULT in the whole area (in and around skb_copy_and_csum_datagram()).

As far as I can tell the original discussion about the data corruption
issue went off on a tangent and it is stuck in figuring out "How to handle
the errors in tcp_copy_to_iovec()".

As for fixing the issue: I'm happy to test and review patches. 
The trouble is that nobody seems to be able to produce them...

Regards,
Christian
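
The save-and-restore idea quoted from Alan's post maps roughly to the sketch
below (a proof-of-concept shape, not a merged fix; the helper name and the
simplified error handling are assumptions): struct iov_iter is copied by
value, so the receive path can rewind the user-space iterator when the
checksum turns out to be bad.

/* Proof-of-concept shape only: keep a copy of the iterator so a bad
 * checksum does not leave msg->msg_iter advanced past data the caller
 * never received.
 */
static int copy_and_csum_or_rewind(struct sk_buff *skb, struct msghdr *msg,
				   int hlen)
{
	struct iov_iter save = msg->msg_iter;	/* cheap, copied by value */
	int err;

	err = skb_copy_and_csum_datagram_msg(skb, hlen, msg);
	if (err)
		msg->msg_iter = save;	/* rewind before the csum_error path */

	return err;
}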


[PATCH net-next 1/6] rxrpc: Fix exclusive client connections

2016-09-29 Thread David Howells
Exclusive connections are currently reusable (which they shouldn't be)
because rxrpc_alloc_client_connection() checks the exclusive flag in the
rxrpc_connection struct before it's initialised from the function
parameters.  This means that the DONT_REUSE flag doesn't get set.

Fix this by checking the function parameters for the exclusive flag.

Signed-off-by: David Howells 
---

 net/rxrpc/conn_client.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rxrpc/conn_client.c b/net/rxrpc/conn_client.c
index c76a125df891..f5ee8bfa5bef 100644
--- a/net/rxrpc/conn_client.c
+++ b/net/rxrpc/conn_client.c
@@ -200,7 +200,7 @@ rxrpc_alloc_client_connection(struct rxrpc_conn_parameters 
*cp, gfp_t gfp)
}
 
atomic_set(>usage, 1);
-   if (conn->params.exclusive)
+   if (cp->exclusive)
__set_bit(RXRPC_CONN_DONT_REUSE, >flags);
 
conn->params= *cp;



Re: next-20160929 build: 2 failures 4 warnings (next-20160929)

2016-09-29 Thread Arnd Bergmann
On Thursday 29 September 2016, Vishwanath Pai wrote:
> I have sent a patch for this a couple of days ago to netdev, it hasn't
> made it to net-next yet. Here's the latest one:
> 
> [PATCH net-next v3] netfilter: xt_hashlimit: Fix link error in 32bit
> arch because of 64bit division
> 
> This should fix the link error.

I also did a patch (not submitted yet), but my solution used 32-bit
math for the version 1 case. I think that would be better so we
don't slow down 32-bit architectures too much (div_u64_u64
is very slow).

Arnd
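
For context on the link error and the "32-bit math" remark: a plain '/' on
u64 operands makes gcc emit a call to a libgcc helper the kernel does not
provide on 32-bit architectures, and the 64/64 helpers that do exist are
slow there.  A small illustration of the usual alternatives (function and
variable names hypothetical):

#include <linux/math64.h>

/* Illustration only: pick the cheapest division the value range allows. */
static u64 credits_per_interval(u64 total_credits, u64 interval_us)
{
	if (interval_us <= U32_MAX)	/* common case fits in 32 bits */
		return div_u64(total_credits, (u32)interval_us);

	return div64_u64(total_credits, interval_us);	/* slower 64/64 path */
}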


Re: [PATCH net-next 1/2] openvswitch: mpls: set network header correctly on key extract

2016-09-29 Thread pravin shelar
On Thu, Sep 29, 2016 at 12:19 PM, Jiri Benc  wrote:
> After the 48d2ab609b6b ("net: mpls: Fixups for GSO"), MPLS handling in
> openvswitch was changed to have network header pointing to the start of the
> MPLS headers and inner_network_header pointing after the MPLS headers.
>
> However, key_extract was missed by the mentioned commit, causing incorrect
> headers to be set when a MPLS packet just enters the bridge or after it is
> recirculated.
>
> Fixes: 48d2ab609b6b ("net: mpls: Fixups for GSO")
> Signed-off-by: Jiri Benc 

Acked-by: Pravin B Shelar 


Re: UDP wierdness around skb_copy_and_csum_datagram_msg()

2016-09-29 Thread Eric Dumazet
On Fri, 2016-09-30 at 01:28 +0200, Christian Lamparter wrote:
> On Wednesday, September 28, 2016 7:20:39 PM CEST Jay Smith wrote:
> > Actually, on a little more searching of this list's archives, I think
> > that this discussion:  https://patchwork.kernel.org/patch/9260733/ is
> > about exactly the same issue I've found, except from the TCP side. I'm
> > cc'ing a few of the participants from that discussion.
> > 
> > So is the patch proposed there (copying and restoring the entire
> > iov_iter in skb_copy_and_csum_datagram_msg()) being considered as a
> > fix?
> 
> From Alan's post:
> 
> "My ugly patch fixes this in the most obvious way: make a local copy of
> msg->msg_iter before the call to skb_copy_and_csum_datagram(), and copy
> it back if the checksum is bad, just before goto csum_error;"
> 
> IMHO this meant that the patch is a proof of concept for his problem.
> 
> > If not, would an alternate one that concealed the save-and-restore logic
> > inside iov_iter.c be more acceptable? I'd be happy to produce whatever's
> > needed, or yield to someone with stronger feelings about where it should
> > go...
> Al Viro identified more inconsistencies within the error-paths that deal
> with EFAULT in the whole area (in and around skb_copy_and_csum_datagram()).
> 
> As far as I can tell the original discussion about the data corruption
> issue went off on a tangent and it is stuck in figuring out "How to handle
> the errors in tcp_copy_to_iovec()".
> 
> As for fixing the issue: I'm happy to test and review patches. 
> The trouble is that nobody seems to be able to produce them...
> 

This is doable with a bit of fault injection I believe.

And "ethtool -K eth0 rx off gro off lro off"  to let the TCP receiver
compute the checksum itself.




