Re: [PATCH v2 net-next 7/7] tcp: make tcp_sendmsg() aware of socket backlog

2016-04-28 Thread Alexei Starovoitov

On 4/28/16 10:05 PM, Eric Dumazet wrote:
> On Thu, 2016-04-28 at 21:43 -0700, Alexei Starovoitov wrote:
>> I don't understand the logic completely, but isn't it
>> safer to do 'goto wait_for_memory;' here if we happened
>> to hit this in the middle of the loop?
>
> Well, the wait_for_memory pushes data, and could early return to user
> space with short writes (non blocking IO). This would break things...

I see. Right. My only concern was about restarting the loop
and msg_data_left(), since it's really hard to follow iov_iter logic.
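
For reference, msg_data_left() is a thin wrapper around the iov_iter byte
count (include/linux/socket.h), so restarting the loop simply re-reads how
much user data remains:

static inline size_t msg_data_left(struct msghdr *msg)
{
	return iov_iter_count(&msg->msg_iter);
}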



Re: [PATCH v2 net-next 7/7] tcp: make tcp_sendmsg() aware of socket backlog

2016-04-28 Thread Eric Dumazet
On Thu, 2016-04-28 at 21:43 -0700, Alexei Starovoitov wrote:

> 
> I don't understand the logic completely, but isn't it
> safer to do 'goto wait_for_memory;' here if we happened
> to hit this in the middle of the loop?

Well, the wait_for_memory pushes data, and could early return to user
space with short writes (non blocking IO). This would break things...

After processing backlog, tcp_send_mss() needs to be called again,
and we also need to check sk_err and sk_shutdown. A goto looks fine to
me.
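
For reference, this is the shape the restart path takes in the patch (the
hunk is quoted in full further down this digest):

restart:
	mss_now = tcp_send_mss(sk, &size_goal, flags);

	err = -EPIPE;
	if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
		goto out_err;
	...
		if (sk_flush_backlog(sk))
			goto restart;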

> Also does it make sense to rename __release_sock to
> something like ___sk_flush_backlog, since that's
> what it's doing and not doing any 'release' ?

Well, I guess it could be renamed, but this has been named like that for
decades ? Why change it now, while this patch does not touch it ?





Re: [RFC PATCH V2 2/2] vhost: device IOTLB API

2016-04-28 Thread Jason Wang


On 04/29/2016 09:12 AM, Jason Wang wrote:
> On 04/28/2016 10:43 PM, Michael S. Tsirkin wrote:
>> On Thu, Apr 28, 2016 at 02:37:16PM +0800, Jason Wang wrote:
>>> On 04/27/2016 07:45 PM, Michael S. Tsirkin wrote:
>>>> On Fri, Mar 25, 2016 at 10:34:34AM +0800, Jason Wang wrote:
>>>>> This patch tries to implement a device IOTLB for vhost. This could
>>>>> be used for co-operation with the userspace (qemu) implementation
>>>>> of DMA remapping.
>>>>>
>>>>> The idea is simple. When vhost meets an IOTLB miss, it will request
>>>>> the assistance of userspace to do the translation, this is done
>>>>> through:
>>>>>
>>>>> - Fill the translation request in a preset userspace address (This
>>>>>   address is set through ioctl VHOST_SET_IOTLB_REQUEST_ENTRY).
>>>>> - Notify userspace through eventfd (This eventfd was set through
>>>>>   ioctl VHOST_SET_IOTLB_FD).
>>>> Why use an eventfd for this?
>>> The aim is to implement the API all through ioctls.
>>>
>>>> We use them for interrupts because
>>>> that happens to be what kvm wants, but here - why don't we
>>>> just add a generic support for reading out events
>>>> on the vhost fd itself?
>>> I've considered this approach, but what are the advantages of this? I
>>> mean it looks like all other ioctls could be done through vhost fd
>>> reading/writing too.
>> read/write have a non-blocking flag.
>>
>> It's not useful for other ioctls but it's useful here.
>>
> Ok, this looks better.
>
>>>>> - device IOTLB were started and stopped through VHOST_RUN_IOTLB ioctl
>>>>>
>>>>> When userspace finishes the translation, it will update the vhost
>>>>> IOTLB through VHOST_UPDATE_IOTLB ioctl. Userspace is also in charge
>>>>> of snooping the IOTLB invalidation of IOMMU IOTLB and use
>>>>> VHOST_UPDATE_IOTLB to invalidate the possible entry in vhost.
>>>> There's one problem here, and that is that VQs still do not undergo
>>>> translation.  In theory VQ could be mapped in such a way
>>>> that it's not contiguous in userspace memory.
>>> I'm not sure I get the issue, current vhost API support setting
>>> desc_user_addr, used_user_addr and avail_user_addr independently. So
>>> looks ok? If not, looks not a problem to device IOTLB API itself.
>> The problem is that addresses are all HVA.
>>
>> Without an iommu, we ask for them to be contiguous and
>> since bus address == GPA, this means contiguous GPA =>
>> contiguous HVA. With an IOMMU you can map contiguous
>> bus address but non contiguous GPA and non contiguous HVA.
> Yes, so the issue is we should not reuse VHOST_SET_VRING_ADDR and should
> invent a new ioctl to set the bus addr (guest iova), then access the VQ
> through the device IOTLB too.

Note that userspace has checked for this and falls back to userspace if
it detects non-contiguous GPA. Considering this happens rarely, I'm not
sure we should handle it.

>
>> Another concern: what if guest changes the GPA while keeping bus address
>> constant? Normal devices will work because they only use
>> bus addresses, but virtio will break.
> If we access the VQ through the device IOTLB too, this could be solved.
>

I don't see a reason why a guest would want to change the GPA during
DMA. Even if it can, it needs lots of other synchronization.


Re: [PATCH v2 net-next 7/7] tcp: make tcp_sendmsg() aware of socket backlog

2016-04-28 Thread Alexei Starovoitov

On 4/28/16 8:10 PM, Eric Dumazet wrote:

Large sendmsg()/write() hold socket lock for the duration of the call,
unless sk->sk_sndbuf limit is hit. This is bad because incoming packets
are parked into socket backlog for a long time.
Critical decisions like fast retransmit might be delayed.
Receivers have to maintain a big out of order queue with additional cpu
overhead, and also possible stalls in TX once windows are full.

Bidirectional flows are particularly hurt since the backlog can become
quite big if the copy from user space triggers IO (page faults)

Some applications learnt to use sendmsg() (or sendmmsg()) with small
chunks to avoid this issue.

Kernel should know better, right ?

Add a generic sk_flush_backlog() helper and use it right
before a new skb is allocated. Typically we put 64KB of payload
per skb (unless MSG_EOR is requested) and checking socket backlog
every 64KB gives good results.

As a matter of fact, tests with TSO/GSO disabled give very nice
results, as we manage to keep a small write queue and smaller
perceived rtt.

Note that sk_flush_backlog() maintains socket ownership,
so is not equivalent to a {release_sock(sk); lock_sock(sk);},
to ensure implicit atomicity rules that sendmsg() was
giving to (possibly buggy) applications.

In this simple implementation, I chose to not call tcp_release_cb(),
but we might consider this later.

Signed-off-by: Eric Dumazet 
Cc: Soheil Hassas Yeganeh 
Cc: Alexei Starovoitov 
Cc: Marcelo Ricardo Leitner 
---
  include/net/sock.h | 11 +++
  net/core/sock.c    |  7 +++
  net/ipv4/tcp.c |  8 ++--
  3 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 3df778ccaa82..1dbb1f9f7c1b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -926,6 +926,17 @@ void sk_stream_kill_queues(struct sock *sk);
  void sk_set_memalloc(struct sock *sk);
  void sk_clear_memalloc(struct sock *sk);

+void __sk_flush_backlog(struct sock *sk);
+
+static inline bool sk_flush_backlog(struct sock *sk)
+{
+   if (unlikely(READ_ONCE(sk->sk_backlog.tail))) {
+   __sk_flush_backlog(sk);
+   return true;
+   }
+   return false;
+}
+
  int sk_wait_data(struct sock *sk, long *timeo, const struct sk_buff *skb);

  struct request_sock_ops;
diff --git a/net/core/sock.c b/net/core/sock.c
index 70744dbb6c3f..f615e9391170 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2048,6 +2048,13 @@ static void __release_sock(struct sock *sk)
sk->sk_backlog.len = 0;
  }

+void __sk_flush_backlog(struct sock *sk)
+{
+   spin_lock_bh(&sk->sk_lock.slock);
+   __release_sock(sk);
+   spin_unlock_bh(&sk->sk_lock.slock);
+}
+
  /**
   * sk_wait_data - wait for data to arrive at sk_receive_queue
   * @sk:sock to wait on
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4787f86ae64c..b945c2b046c5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1136,11 +1136,12 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
/* This should be in poll */
sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);

-   mss_now = tcp_send_mss(sk, &size_goal, flags);
-
/* Ok commence sending. */
copied = 0;

+restart:
+   mss_now = tcp_send_mss(sk, &size_goal, flags);
+
err = -EPIPE;
if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
goto out_err;
@@ -1166,6 +1167,9 @@ new_segment:
if (!sk_stream_memory_free(sk))
goto wait_for_sndbuf;

+   if (sk_flush_backlog(sk))
+   goto restart;


I don't understand the logic completely, but isn't it
safer to do 'goto wait_for_memory;' here if we happened
to hit this in the middle of the loop?
Also does it make sense to rename __release_sock to
something like ___sk_flush_backlog, since that's
what it's doing and not doing any 'release' ?

Ack for patches 2 and 6. Great improvement!



[PATCH v2 net-next 6/7] net: do not block BH while processing socket backlog

2016-04-28 Thread Eric Dumazet
Socket backlog processing is a major latency source.

With current TCP socket sk_rcvbuf limits, I have sampled __release_sock()
holding cpu for more than 5 ms, and packets being dropped by the NIC
once ring buffer is filled.

All users are now ready to be called from process context,
we can unblock BH and let interrupts be serviced faster.

cond_resched_softirq() could be removed, as it has no more user.

Signed-off-by: Eric Dumazet 
---
 net/core/sock.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index e16a5db853c6..70744dbb6c3f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2019,33 +2019,27 @@ static void __release_sock(struct sock *sk)
__releases(&sk->sk_lock.slock)
__acquires(&sk->sk_lock.slock)
 {
-   struct sk_buff *skb = sk->sk_backlog.head;
+   struct sk_buff *skb, *next;
 
-   do {
+   while ((skb = sk->sk_backlog.head) != NULL) {
sk->sk_backlog.head = sk->sk_backlog.tail = NULL;
-   bh_unlock_sock(sk);
 
-   do {
-   struct sk_buff *next = skb->next;
+   spin_unlock_bh(&sk->sk_lock.slock);
 
+   do {
+   next = skb->next;
prefetch(next);
WARN_ON_ONCE(skb_dst_is_noref(skb));
skb->next = NULL;
sk_backlog_rcv(sk, skb);
 
-   /*
-* We are in process context here with softirqs
-* disabled, use cond_resched_softirq() to preempt.
-* This is safe to do because we've taken the backlog
-* queue private:
-*/
-   cond_resched_softirq();
+   cond_resched();
 
skb = next;
} while (skb != NULL);
 
-   bh_lock_sock(sk);
-   } while ((skb = sk->sk_backlog.head) != NULL);
+   spin_lock_bh(&sk->sk_lock.slock);
+   }
 
/*
 * Doing the zeroing here guarantee we can not loop forever
-- 
2.8.0.rc3.226.g39d4020



[PATCH v2 net-next 7/7] tcp: make tcp_sendmsg() aware of socket backlog

2016-04-28 Thread Eric Dumazet
Large sendmsg()/write() hold socket lock for the duration of the call,
unless sk->sk_sndbuf limit is hit. This is bad because incoming packets
are parked into socket backlog for a long time.
Critical decisions like fast retransmit might be delayed.
Receivers have to maintain a big out of order queue with additional cpu
overhead, and also possible stalls in TX once windows are full.

Bidirectional flows are particularly hurt since the backlog can become
quite big if the copy from user space triggers IO (page faults)

Some applications learnt to use sendmsg() (or sendmmsg()) with small
chunks to avoid this issue.

Kernel should know better, right ?

Add a generic sk_flush_backlog() helper and use it right
before a new skb is allocated. Typically we put 64KB of payload
per skb (unless MSG_EOR is requested) and checking socket backlog
every 64KB gives good results.

As a matter of fact, tests with TSO/GSO disabled give very nice
results, as we manage to keep a small write queue and smaller
perceived rtt.

Note that sk_flush_backlog() maintains socket ownership,
so is not equivalent to a {release_sock(sk); lock_sock(sk);},
to ensure implicit atomicity rules that sendmsg() was
giving to (possibly buggy) applications.

In this simple implementation, I chose to not call tcp_release_cb(),
but we might consider this later.

Signed-off-by: Eric Dumazet 
Cc: Soheil Hassas Yeganeh 
Cc: Alexei Starovoitov 
Cc: Marcelo Ricardo Leitner 
---
 include/net/sock.h | 11 +++
 net/core/sock.c    |  7 +++
 net/ipv4/tcp.c |  8 ++--
 3 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 3df778ccaa82..1dbb1f9f7c1b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -926,6 +926,17 @@ void sk_stream_kill_queues(struct sock *sk);
 void sk_set_memalloc(struct sock *sk);
 void sk_clear_memalloc(struct sock *sk);
 
+void __sk_flush_backlog(struct sock *sk);
+
+static inline bool sk_flush_backlog(struct sock *sk)
+{
+   if (unlikely(READ_ONCE(sk->sk_backlog.tail))) {
+   __sk_flush_backlog(sk);
+   return true;
+   }
+   return false;
+}
+
 int sk_wait_data(struct sock *sk, long *timeo, const struct sk_buff *skb);
 
 struct request_sock_ops;
diff --git a/net/core/sock.c b/net/core/sock.c
index 70744dbb6c3f..f615e9391170 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2048,6 +2048,13 @@ static void __release_sock(struct sock *sk)
sk->sk_backlog.len = 0;
 }
 
+void __sk_flush_backlog(struct sock *sk)
+{
+   spin_lock_bh(&sk->sk_lock.slock);
+   __release_sock(sk);
+   spin_unlock_bh(&sk->sk_lock.slock);
+}
+
 /**
  * sk_wait_data - wait for data to arrive at sk_receive_queue
  * @sk:sock to wait on
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4787f86ae64c..b945c2b046c5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1136,11 +1136,12 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
/* This should be in poll */
sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
 
-   mss_now = tcp_send_mss(sk, &size_goal, flags);
-
/* Ok commence sending. */
copied = 0;
 
+restart:
+   mss_now = tcp_send_mss(sk, &size_goal, flags);
+
err = -EPIPE;
if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
goto out_err;
@@ -1166,6 +1167,9 @@ new_segment:
if (!sk_stream_memory_free(sk))
goto wait_for_sndbuf;
 
+   if (sk_flush_backlog(sk))
+   goto restart;
+
skb = sk_stream_alloc_skb(sk,
  select_size(sk, sg),
  sk->sk_allocation,
-- 
2.8.0.rc3.226.g39d4020



[PATCH v2 net-next 4/7] udp: prepare for non BH masking at backlog processing

2016-04-28 Thread Eric Dumazet
UDP uses the generic socket backlog code, and this will soon
be changed to not disable BH when protocol is called back.

We need to use appropriate SNMP accessors.

Signed-off-by: Eric Dumazet 
---
 net/ipv4/udp.c | 4 ++--
 net/ipv6/udp.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 093284c5c03b..f67f52ba4809 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1514,9 +1514,9 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 
/* Note that an ENOMEM error is charged twice */
if (rc == -ENOMEM)
-   __UDP_INC_STATS(sock_net(sk), UDP_MIB_RCVBUFERRORS,
+   UDP_INC_STATS(sock_net(sk), UDP_MIB_RCVBUFERRORS,
is_udplite);
-   __UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
+   UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
kfree_skb(skb);
trace_udp_fail_queue_rcv_skb(rc, sk);
return -1;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 1ba5a74ac18f..f911c63f79e6 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -570,9 +570,9 @@ static int __udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 
/* Note that an ENOMEM error is charged twice */
if (rc == -ENOMEM)
-   __UDP6_INC_STATS(sock_net(sk),
+   UDP6_INC_STATS(sock_net(sk),
 UDP_MIB_RCVBUFERRORS, is_udplite);
-   __UDP6_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
+   UDP6_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
kfree_skb(skb);
return -1;
}
-- 
2.8.0.rc3.226.g39d4020



[PATCH v2 net-next 3/7] dccp: do not assume DCCP code is non preemptible

2016-04-28 Thread Eric Dumazet
DCCP uses the generic backlog code, and this will soon
be changed to not disable BH when protocol is called back.

Signed-off-by: Eric Dumazet 
---
 net/dccp/input.c   | 2 +-
 net/dccp/ipv4.c| 4 ++--
 net/dccp/ipv6.c| 4 ++--
 net/dccp/options.c | 2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/dccp/input.c b/net/dccp/input.c
index 2437ecc13b82..ba347184bda9 100644
--- a/net/dccp/input.c
+++ b/net/dccp/input.c
@@ -359,7 +359,7 @@ send_sync:
goto discard;
}
 
-   __DCCP_INC_STATS(DCCP_MIB_INERRS);
+   DCCP_INC_STATS(DCCP_MIB_INERRS);
 discard:
__kfree_skb(skb);
return 0;
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index a8164272e0f4..5c7e413a3ae4 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -533,8 +533,8 @@ static void dccp_v4_ctl_send_reset(const struct sock *sk, struct sk_buff *rxskb)
bh_unlock_sock(ctl_sk);
 
if (net_xmit_eval(err) == 0) {
-   __DCCP_INC_STATS(DCCP_MIB_OUTSEGS);
-   __DCCP_INC_STATS(DCCP_MIB_OUTRSTS);
+   DCCP_INC_STATS(DCCP_MIB_OUTSEGS);
+   DCCP_INC_STATS(DCCP_MIB_OUTRSTS);
}
 out:
 dst_release(dst);
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 0f4eb4ea57a5..d176f4e66369 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -277,8 +277,8 @@ static void dccp_v6_ctl_send_reset(const struct sock *sk, struct sk_buff *rxskb)
if (!IS_ERR(dst)) {
skb_dst_set(skb, dst);
ip6_xmit(ctl_sk, skb, &fl6, NULL, 0);
-   __DCCP_INC_STATS(DCCP_MIB_OUTSEGS);
-   __DCCP_INC_STATS(DCCP_MIB_OUTRSTS);
+   DCCP_INC_STATS(DCCP_MIB_OUTSEGS);
+   DCCP_INC_STATS(DCCP_MIB_OUTRSTS);
return;
}
 
diff --git a/net/dccp/options.c b/net/dccp/options.c
index b82b7ee9a1d2..74d29c56c367 100644
--- a/net/dccp/options.c
+++ b/net/dccp/options.c
@@ -253,7 +253,7 @@ out_nonsensical_length:
return 0;
 
 out_invalid_option:
-   __DCCP_INC_STATS(DCCP_MIB_INVALIDOPT);
+   DCCP_INC_STATS(DCCP_MIB_INVALIDOPT);
rc = DCCP_RESET_CODE_OPTION_ERROR;
 out_featneg_failed:
DCCP_WARN("DCCP(%p): Option %d (len=%d) error=%u\n", sk, opt, len, rc);
-- 
2.8.0.rc3.226.g39d4020



[PATCH v2 net-next 5/7] sctp: prepare for socket backlog behavior change

2016-04-28 Thread Eric Dumazet
sctp_inq_push() will soon be called without BH being blocked
when generic socket code flushes the socket backlog.

It is very possible SCTP can be converted to not rely on BH,
but this needs to be done by SCTP experts.

Signed-off-by: Eric Dumazet 
---
 net/sctp/inqueue.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/sctp/inqueue.c b/net/sctp/inqueue.c
index b335ffcef0b9..9d87bba0ff1d 100644
--- a/net/sctp/inqueue.c
+++ b/net/sctp/inqueue.c
@@ -89,10 +89,12 @@ void sctp_inq_push(struct sctp_inq *q, struct sctp_chunk *chunk)
 * Eventually, we should clean up inqueue to not rely
 * on the BH related data structures.
 */
+   local_bh_disable();
list_add_tail(&chunk->list, &q->in_chunk_list);
if (chunk->asoc)
chunk->asoc->stats.ipackets++;
q->immediate.func(&q->immediate);
+   local_bh_enable();
 }
 
 /* Peek at the next chunk on the inqeue. */
-- 
2.8.0.rc3.226.g39d4020



[PATCH v2 net-next 2/7] tcp: do not block bh during prequeue processing

2016-04-28 Thread Eric Dumazet
AFAIK, nothing in current TCP stack absolutely wants BH
being disabled once socket is owned by a thread running in
process context.

As mentioned in my prior patch ("tcp: give prequeue mode some care"),
processing a batch of packets might take time, better not block BH
at all.

Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp.c   |  4 
 net/ipv4/tcp_input.c | 30 ++
 2 files changed, 2 insertions(+), 32 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index b24c6ed4a04f..4787f86ae64c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1449,12 +1449,8 @@ static void tcp_prequeue_process(struct sock *sk)
 
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPPREQUEUED);
 
-   /* RX process wants to run with disabled BHs, though it is not
-* necessary */
-   local_bh_disable();
while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
sk_backlog_rcv(sk, skb);
-   local_bh_enable();
 
/* Clear memory counter. */
tp->ucopy.memory = 0;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ac85fb42a5a2..6171f92be090 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4611,14 +4611,12 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 
__set_current_state(TASK_RUNNING);
 
-   local_bh_enable();
if (!skb_copy_datagram_msg(skb, 0, tp->ucopy.msg, chunk)) {
tp->ucopy.len -= chunk;
tp->copied_seq += chunk;
eaten = (chunk == skb->len);
tcp_rcv_space_adjust(sk);
}
-   local_bh_disable();
}
 
if (eaten <= 0) {
@@ -5134,7 +5132,6 @@ static int tcp_copy_to_iovec(struct sock *sk, struct sk_buff *skb, int hlen)
int chunk = skb->len - hlen;
int err;
 
-   local_bh_enable();
if (skb_csum_unnecessary(skb))
err = skb_copy_datagram_msg(skb, hlen, tp->ucopy.msg, chunk);
else
@@ -5146,32 +5143,9 @@ static int tcp_copy_to_iovec(struct sock *sk, struct sk_buff *skb, int hlen)
tcp_rcv_space_adjust(sk);
}
 
-   local_bh_disable();
return err;
 }
 
-static __sum16 __tcp_checksum_complete_user(struct sock *sk,
-   struct sk_buff *skb)
-{
-   __sum16 result;
-
-   if (sock_owned_by_user(sk)) {
-   local_bh_enable();
-   result = __tcp_checksum_complete(skb);
-   local_bh_disable();
-   } else {
-   result = __tcp_checksum_complete(skb);
-   }
-   return result;
-}
-
-static inline bool tcp_checksum_complete_user(struct sock *sk,
-struct sk_buff *skb)
-{
-   return !skb_csum_unnecessary(skb) &&
-  __tcp_checksum_complete_user(sk, skb);
-}
-
 /* Does PAWS and seqno based validation of an incoming segment, flags will
  * play significant role here.
  */
@@ -5386,7 +5360,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
}
}
if (!eaten) {
-   if (tcp_checksum_complete_user(sk, skb))
+   if (tcp_checksum_complete(skb))
goto csum_error;
 
if ((int)skb->truesize > sk->sk_forward_alloc)
@@ -5430,7 +5404,7 @@ no_ack:
}
 
 slow_path:
-   if (len < (th->doff << 2) || tcp_checksum_complete_user(sk, skb))
+   if (len < (th->doff << 2) || tcp_checksum_complete(skb))
goto csum_error;
 
if (!th->ack && !th->rst && !th->syn)
-- 
2.8.0.rc3.226.g39d4020



[PATCH v2 net-next 0/7] net: make TCP preemptible

2016-04-28 Thread Eric Dumazet
Most of TCP stack assumed it was running from BH handler.

This is great for most things, as TCP behavior is very sensitive
to scheduling artifacts.

However, the prequeue and backlog processing are problematic,
as they need to be flushed with BH being blocked.

To cope with modern needs, TCP sockets have big sk_rcvbuf values,
in the order of 16 MB, and soon 32 MB.
This means that backlog can hold thousands of packets, and things
like TCP coalescing or collapsing on this amount of packets can
lead to insane latency spikes, since BH are blocked for too long.

It is time to make UDP/TCP stacks preemptible.

Note that fast path still runs from BH handler.

v2:
Added "tcp: make tcp_sendmsg() aware of socket backlog"
to reduce latency problems of large sends.

Eric Dumazet (7):
  tcp: do not assume TCP code is non preemptible
  tcp: do not block bh during prequeue processing
  dccp: do not assume DCCP code is non preemptible
  udp: prepare for non BH masking at backlog processing
  sctp: prepare for socket backlog behavior change
  net: do not block BH while processing socket backlog
  tcp: make tcp_sendmsg() aware of socket backlog

 include/net/sock.h   |  11 +
 net/core/sock.c  |  29 +--
 net/dccp/input.c |   2 +-
 net/dccp/ipv4.c  |   4 +-
 net/dccp/ipv6.c  |   4 +-
 net/dccp/options.c   |   2 +-
 net/ipv4/tcp.c   |  14 +++---
 net/ipv4/tcp_cdg.c   |  20 
 net/ipv4/tcp_cubic.c |  20 
 net/ipv4/tcp_fastopen.c  |  12 ++---
 net/ipv4/tcp_input.c | 126 +++
 net/ipv4/tcp_ipv4.c  |  14 --
 net/ipv4/tcp_minisocks.c |   2 +-
 net/ipv4/tcp_output.c|  11 ++---
 net/ipv4/tcp_recovery.c  |   4 +-
 net/ipv4/tcp_timer.c |  10 ++--
 net/ipv4/udp.c   |   4 +-
 net/ipv6/tcp_ipv6.c  |  12 ++---
 net/ipv6/udp.c   |   4 +-
 net/sctp/inqueue.c   |   2 +
 20 files changed, 150 insertions(+), 157 deletions(-)

-- 
2.8.0.rc3.226.g39d4020



[PATCH v2 net-next 1/7] tcp: do not assume TCP code is non preemptible

2016-04-28 Thread Eric Dumazet
We want to make TCP stack preemptible, as draining prequeue
and backlog queues can take lot of time.

Many SNMP updates were assuming that BH (and preemption) was disabled.

Need to convert some __NET_INC_STATS() calls to NET_INC_STATS()
and some __TCP_INC_STATS() to TCP_INC_STATS()

Before using this_cpu_ptr(net->ipv4.tcp_sk) in tcp_v4_send_reset()
and tcp_v4_send_ack(), we add an explicit preempt disabled section.
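
(A sketch of the pattern the paragraph above describes -- not the exact
hunk: the per-cpu control socket must be read with preemption off so the
task cannot migrate CPUs between this_cpu_ptr() and its use.)

	preempt_disable();
	ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
			      skb, &TCP_SKB_CB(skb)->header.h4.opt,
			      ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
			      &arg, arg.iov[0].iov_len);
	__TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
	preempt_enable();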

Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp.c   |  2 +-
 net/ipv4/tcp_cdg.c   | 20 +-
 net/ipv4/tcp_cubic.c | 20 +-
 net/ipv4/tcp_fastopen.c  | 12 +++---
 net/ipv4/tcp_input.c | 96 
 net/ipv4/tcp_ipv4.c  | 14 ---
 net/ipv4/tcp_minisocks.c |  2 +-
 net/ipv4/tcp_output.c| 11 +++---
 net/ipv4/tcp_recovery.c  |  4 +-
 net/ipv4/tcp_timer.c | 10 +++--
 net/ipv6/tcp_ipv6.c  | 12 +++---
 11 files changed, 104 insertions(+), 99 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index cb4d1cabb42c..b24c6ed4a04f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3095,7 +3095,7 @@ void tcp_done(struct sock *sk)
struct request_sock *req = tcp_sk(sk)->fastopen_rsk;
 
if (sk->sk_state == TCP_SYN_SENT || sk->sk_state == TCP_SYN_RECV)
-   __TCP_INC_STATS(sock_net(sk), TCP_MIB_ATTEMPTFAILS);
+   TCP_INC_STATS(sock_net(sk), TCP_MIB_ATTEMPTFAILS);
 
tcp_set_state(sk, TCP_CLOSE);
tcp_clear_xmit_timers(sk);
diff --git a/net/ipv4/tcp_cdg.c b/net/ipv4/tcp_cdg.c
index 3c00208c37f4..4e3007845888 100644
--- a/net/ipv4/tcp_cdg.c
+++ b/net/ipv4/tcp_cdg.c
@@ -155,11 +155,11 @@ static void tcp_cdg_hystart_update(struct sock *sk)
 
ca->last_ack = now_us;
if (after(now_us, ca->round_start + base_owd)) {
-   __NET_INC_STATS(sock_net(sk),
-   LINUX_MIB_TCPHYSTARTTRAINDETECT);
-   __NET_ADD_STATS(sock_net(sk),
-   LINUX_MIB_TCPHYSTARTTRAINCWND,
-   tp->snd_cwnd);
+   NET_INC_STATS(sock_net(sk),
+ LINUX_MIB_TCPHYSTARTTRAINDETECT);
+   NET_ADD_STATS(sock_net(sk),
+ LINUX_MIB_TCPHYSTARTTRAINCWND,
+ tp->snd_cwnd);
tp->snd_ssthresh = tp->snd_cwnd;
return;
}
@@ -174,11 +174,11 @@ static void tcp_cdg_hystart_update(struct sock *sk)
 125U);
 
if (ca->rtt.min > thresh) {
-   __NET_INC_STATS(sock_net(sk),
-   LINUX_MIB_TCPHYSTARTDELAYDETECT);
-   __NET_ADD_STATS(sock_net(sk),
-   LINUX_MIB_TCPHYSTARTDELAYCWND,
-   tp->snd_cwnd);
+   NET_INC_STATS(sock_net(sk),
+ LINUX_MIB_TCPHYSTARTDELAYDETECT);
+   NET_ADD_STATS(sock_net(sk),
+ LINUX_MIB_TCPHYSTARTDELAYCWND,
+ tp->snd_cwnd);
tp->snd_ssthresh = tp->snd_cwnd;
}
}
diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index 59155af9de5d..0ce946e395e1 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -402,11 +402,11 @@ static void hystart_update(struct sock *sk, u32 delay)
ca->last_ack = now;
if ((s32)(now - ca->round_start) > ca->delay_min >> 4) {
ca->found |= HYSTART_ACK_TRAIN;
-   __NET_INC_STATS(sock_net(sk),
-   LINUX_MIB_TCPHYSTARTTRAINDETECT);
-   __NET_ADD_STATS(sock_net(sk),
-   LINUX_MIB_TCPHYSTARTTRAINCWND,
-   tp->snd_cwnd);
+   NET_INC_STATS(sock_net(sk),
+ LINUX_MIB_TCPHYSTARTTRAINDETECT);
+   NET_ADD_STATS(sock_net(sk),
+ LINUX_MIB_TCPHYSTARTTRAINCWND,
+ tp->snd_cwnd);
tp->snd_ssthresh = tp->snd_cwnd;
}
}
@@ -423,11 +423,11 @@ static void hystart_update(struct sock *sk, u32 delay)
if (ca->curr_rtt > ca->delay_min +

Re: [PATCH] mdio_bus: Fix MDIO bus scanning in __mdiobus_register()

2016-04-28 Thread Marek Vasut
On 04/29/2016 03:49 AM, Florian Fainelli wrote:
> Le 28/04/2016 18:09, Marek Vasut a écrit :
>> Since commit b74766a0a0feeef5c779709cc5d109451c0d5b17 in linux-next,
>> ( phylib: don't return NULL from get_phy_device() ), get_phy_device()
>> will return ERR_PTR(-ENODEV) instead of NULL if the PHY device ID is
>> all ones.
>>
>> This causes problems with the stmmac driver and likely some other drivers
>> which call mdiobus_register(). I triggered this bug on SoCFPGA MCVEVK
>> board with linux-next 20160427 and 20160428. In case of the stmmac, if
>> there is no PHY node specified in the DT for the stmmac block, the stmmac
>> driver ( drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c function
>> stmmac_mdio_register() ) will call mdiobus_register() , which will
>> register the MDIO bus and probe for the PHY.
>>
>> The mdiobus_register() resp. __mdiobus_register() iterates over all of
>> the addresses on the MDIO bus and calls mdiobus_scan() for each of them,
>> which invokes get_phy_device(). Before the aforementioned patch, the
>> mdiobus_scan() would return NULL if no PHY was found on a given address
>> and mdiobus_register() would continue and try the next PHY address. Now,
>> mdiobus_scan() returns ERR_PTR(-ENODEV), which is caught by the
>> 'if (IS_ERR(phydev))' condition and the loop exits immediately if the
>> PHY address does not contain a PHY.
>>
>> Repair this by explicitly checking for the ERR_PTR(-ENODEV) and if this
>> error comes around, continue with the next PHY address.
>>
>> Signed-off-by: Marek Vasut <ma...@denx.de>
>> Cc: Arnd Bergmann <a...@arndb.de>
>> Cc: David S. Miller <da...@davemloft.net>
>> Cc: Dinh Nguyen <dingu...@opensource.altera.com>
>> Cc: Florian Fainelli <f.faine...@gmail.com>
>> Cc: Sergei Shtylyov <sergei.shtyl...@cogentembedded.com>
> 
> Acked-by: Florian Fainelli <f.faine...@gmail.com>
> 
> I had an exact same patch posted yesterday but not formally like you
> did, thanks!

Ah, my google-fu must be weak tonight. Thanks!

>> ---
>>  drivers/net/phy/mdio_bus.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> NOTE: I don't quite like this explicit check, but I don't have a better
>> idea now.
>>
>> diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
>> index 499003ee..388f992 100644
>> --- a/drivers/net/phy/mdio_bus.c
>> +++ b/drivers/net/phy/mdio_bus.c
>> @@ -333,7 +333,7 @@ int __mdiobus_register(struct mii_bus *bus, struct module *owner)
>>  struct phy_device *phydev;
>>  
>>  phydev = mdiobus_scan(bus, i);
>> -if (IS_ERR(phydev)) {
>> +if (IS_ERR(phydev) && (PTR_ERR(phydev) != -ENODEV)) {
>>  err = PTR_ERR(phydev);
>>  goto error;
>>  }
>>
> 
> 


-- 
Best regards,
Marek Vasut


Re: [RFC PATCH 2/5] mlx5: Add support for UDP tunnel segmentation with outer checksum offload

2016-04-28 Thread Alexander Duyck
On Thu, Apr 28, 2016 at 6:18 PM, Matthew Finlay  wrote:
>
>
>
>
>
>>>
>>> The mlx5 hardware requires that the outer UDP checksum is not set when
>>> offloading encapsulated packets.
>>
>>The Intel documentation said the same thing.  That was due to the fact
>>that the hardware didn't compute the outer UDP header checksum.  I
>>suspect the Mellanox hardware has the same issue.  Also I have tested
>>on a ConnectX-4 board with the latest firmware and what I am seeing is
>>that with my patches applied the outer checksum is being correctly
>>applied for segmentation offloads.
>>
>>My thought is that the hardware appears to ignore the UDP
>>checksum so if it is non-zero you cannot guarantee the checksum would
>>be correct on the last frame if it is a different size than the rest
>>of the segments.  In the case of these patches that issue has been
>>resolved as I have precomputed the UDP checksum for the outer UDP
>>header and all of the segments will be the same length so there should
>>be no variation in the UDP checksum of the outer header.  Unless you
>>can tell me exactly the reason why we cannot provide the outer UDP
>>checksum I would assume that the reason is due to the fact that the
>>hardware doesn't compute it so you cannot handle a fragment on the end
>>which is resolved already via GSO_PARTIAL.
>
> I will check internally and verify there are no unforeseen issues with 
> setting the outer UDP checksum in this scenario.

Thanks.  Any idea how long that should take?  I know I was getting an
auto-reply about people being out until May 1st due to a holiday, so I
am just wondering if we should have Dave drop this patch set and I
submit a v2 when you can get me the feedback next week, or if we run
with the patches as-is for now and be prepared to revert if anything
should come up.

- Alex
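
For readers unfamiliar with the trick under discussion: with GSO partial,
the outer UDP checksum field is seeded with the ones'-complement checksum
of the pseudo-header, so every equal-length segment carries the same
(correct) value. A minimal userspace sketch of that seeding computation
(illustrative only -- the kernel uses helpers such as csum_tcpudp_magic()
for this, and byte-order handling is elided by assuming host-order inputs):

#include <stdint.h>
#include <netinet/in.h>

static uint16_t udp_pseudo_csum(uint32_t saddr, uint32_t daddr, uint16_t len)
{
	uint32_t sum = 0;

	sum += (saddr >> 16) + (saddr & 0xffff);	/* source address */
	sum += (daddr >> 16) + (daddr & 0xffff);	/* destination address */
	sum += IPPROTO_UDP;				/* zero byte + protocol */
	sum += len;					/* UDP length */
	while (sum >> 16)				/* fold the carries */
		sum = (sum >> 16) + (sum & 0xffff);
	return (uint16_t)~sum;
}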


Re: [PATCH] mdio_bus: Fix MDIO bus scanning in __mdiobus_register()

2016-04-28 Thread Florian Fainelli
Le 28/04/2016 18:09, Marek Vasut a écrit :
> Since commit b74766a0a0feeef5c779709cc5d109451c0d5b17 in linux-next,
> ( phylib: don't return NULL from get_phy_device() ), get_phy_device()
> will return ERR_PTR(-ENODEV) instead of NULL if the PHY device ID is
> all ones.
> 
> This causes problems with the stmmac driver and likely some other drivers
> which call mdiobus_register(). I triggered this bug on SoCFPGA MCVEVK
> board with linux-next 20160427 and 20160428. In case of the stmmac, if
> there is no PHY node specified in the DT for the stmmac block, the stmmac
> driver ( drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c function
> stmmac_mdio_register() ) will call mdiobus_register() , which will
> register the MDIO bus and probe for the PHY.
> 
> The mdiobus_register() resp. __mdiobus_register() iterates over all of
> the addresses on the MDIO bus and calls mdiobus_scan() for each of them,
> which invokes get_phy_device(). Before the aforementioned patch, the
> mdiobus_scan() would return NULL if no PHY was found on a given address
> and mdiobus_register() would continue and try the next PHY address. Now,
> mdiobus_scan() returns ERR_PTR(-ENODEV), which is caught by the
> 'if (IS_ERR(phydev))' condition and the loop exits immediately if the
> PHY address does not contain a PHY.
> 
> Repair this by explicitly checking for the ERR_PTR(-ENODEV) and if this
> error comes around, continue with the next PHY address.
> 
> Signed-off-by: Marek Vasut <ma...@denx.de>
> Cc: Arnd Bergmann <a...@arndb.de>
> Cc: David S. Miller <da...@davemloft.net>
> Cc: Dinh Nguyen <dingu...@opensource.altera.com>
> Cc: Florian Fainelli <f.faine...@gmail.com>
> Cc: Sergei Shtylyov <sergei.shtyl...@cogentembedded.com>

Acked-by: Florian Fainelli <f.faine...@gmail.com>

I had an exact same patch posted yesterday but not formally like you
did, thanks!

> ---
>  drivers/net/phy/mdio_bus.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> NOTE: I don't quite like this explicit check, but I don't have a better
> idea now.
> 
> diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
> index 499003ee..388f992 100644
> --- a/drivers/net/phy/mdio_bus.c
> +++ b/drivers/net/phy/mdio_bus.c
> @@ -333,7 +333,7 @@ int __mdiobus_register(struct mii_bus *bus, struct module *owner)
>   struct phy_device *phydev;
>  
>   phydev = mdiobus_scan(bus, i);
> - if (IS_ERR(phydev)) {
> + if (IS_ERR(phydev) && (PTR_ERR(phydev) != -ENODEV)) {
>   err = PTR_ERR(phydev);
>   goto error;
>   }
> 


-- 
Florian


[PATCH] tipc: Only process unicast on intended node

2016-04-28 Thread Hamish Martin
We have observed complete lock up of broadcast-link transmission due to
unacknowledged packets never being removed from the 'transmq' queue. This
is traced to nodes having their ack field set beyond the sequence number
of packets that have actually been transmitted to them.
Consider an example where node 1 has sent 10 packets to node 2 on a
link and node 3 has sent 20 packets to node 2 on another link. We
see examples of an ack from node 2 destined for node 3 being treated as
an ack from node 2 at node 1. This leads to the ack on the node 1 to node
2 link being increased to 20 even though we have only sent 10 packets.
When node 1 does get around to sending further packets, none of the
packets with sequence numbers less than 21 are actually removed from the
transmq.
To resolve this we reinstate some code lost in commit d999297c3dbb ("tipc:
reduce locking scope during packet reception") which ensures that only
messages destined for the receiving node are processed by that node. This
prevents the sequence numbers from getting out of sync and resolves the
packet leakage, thereby resolving the broadcast-link transmission
lock-ups we observed.

Signed-off-by: Hamish Martin 
Reviewed-by: Chris Packham 
Reviewed-by: John Thompson 
---
 net/tipc/node.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/net/tipc/node.c b/net/tipc/node.c
index ace178fd3850..e5dda495d4b6 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -1460,6 +1460,11 @@ void tipc_rcv(struct net *net, struct sk_buff *skb, struct tipc_bearer *b)
return tipc_node_bc_rcv(net, skb, bearer_id);
}
 
+   /* Discard unicast link messages destined for another node */
+   if (unlikely(!msg_short(hdr) &&
+    (msg_destnode(hdr) != tipc_own_addr(net))))
+   goto discard;
+
/* Locate neighboring node that sent packet */
n = tipc_node_find(net, msg_prevnode(hdr));
if (unlikely(!n))
-- 
2.8.1



Re: [RFC PATCH 2/5] mlx5: Add support for UDP tunnel segmentation with outer checksum offload

2016-04-28 Thread Matthew Finlay





>>
>> The mlx5 hardware requires that the outer UDP checksum is not set when
>> offloading encapsulated packets.
>
>The Intel documentation said the same thing.  That was due to the fact
>that the hardware didn't compute the outer UDP header checksum.  I
>suspect the Mellanox hardware has the same issue.  Also I have tested
>on a ConnectX-4 board with the latest firmware and what I am seeing is
>that with my patches applied the outer checksum is being correctly
>applied for segmentation offloads.
>
>My thought is that the hardware appears to ignore the UDP
>checksum so if it is non-zero you cannot guarantee the checksum would
>be correct on the last frame if it is a different size than the rest
>of the segments.  In the case of these patches that issue has been
>resolved as I have precomputed the UDP checksum for the outer UDP
>header and all of the segments will be the same length so there should
>be no variation in the UDP checksum of the outer header.  Unless you
>can tell me exactly the reason why we cannot provide the outer UDP
>checksum I would assume that the reason is due to the fact that the
>hardware doesn't compute it so you cannot handle a fragment on the end
>which is resolved already via GSO_PARTIAL.

I will check internally and verify there are no unforeseen issues with setting 
the outer UDP checksum in this scenario.

>
>- Alex


[PATCH net-next] net: dsa: mv88e6xxx: replace ds with ps where possible

2016-04-28 Thread Vivien Didelot
From: Andrew Lunn 

The dsa_switch structure ds is actually needed in very few places,
mostly during setup of the switch. The private structure ps is however
needed nearly everywhere. Pass ps, not ds internally.

[vd: rebased Andrew's patch.]

Signed-off-by: Andrew Lunn 
Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6123.c |  14 +-
 drivers/net/dsa/mv88e6131.c |  22 +-
 drivers/net/dsa/mv88e6171.c |  14 +-
 drivers/net/dsa/mv88e6352.c |  24 +-
 drivers/net/dsa/mv88e6xxx.c | 917 ++--
 drivers/net/dsa/mv88e6xxx.h |  14 +-
 6 files changed, 511 insertions(+), 494 deletions(-)

diff --git a/drivers/net/dsa/mv88e6123.c b/drivers/net/dsa/mv88e6123.c
index 534ebc8..5535a42 100644
--- a/drivers/net/dsa/mv88e6123.c
+++ b/drivers/net/dsa/mv88e6123.c
@@ -50,6 +50,7 @@ static const char *mv88e6123_drv_probe(struct device *dsa_dev,
 
 static int mv88e6123_setup_global(struct dsa_switch *ds)
 {
+   struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
u32 upstream_port = dsa_upstream_port(ds);
int ret;
u32 reg;
@@ -62,7 +63,7 @@ static int mv88e6123_setup_global(struct dsa_switch *ds)
 * external PHYs to poll), don't discard packets with
 * excessive collisions, and mask all interrupt sources.
 */
-   ret = mv88e6xxx_reg_write(ds, REG_GLOBAL, GLOBAL_CONTROL, 0x);
+   ret = mv88e6xxx_reg_write(ps, REG_GLOBAL, GLOBAL_CONTROL, 0x);
if (ret)
return ret;
 
@@ -73,26 +74,29 @@ static int mv88e6123_setup_global(struct dsa_switch *ds)
reg = upstream_port << GLOBAL_MONITOR_CONTROL_INGRESS_SHIFT |
upstream_port << GLOBAL_MONITOR_CONTROL_EGRESS_SHIFT |
upstream_port << GLOBAL_MONITOR_CONTROL_ARP_SHIFT;
-   ret = mv88e6xxx_reg_write(ds, REG_GLOBAL, GLOBAL_MONITOR_CONTROL, reg);
+   ret = mv88e6xxx_reg_write(ps, REG_GLOBAL, GLOBAL_MONITOR_CONTROL, reg);
if (ret)
return ret;
 
/* Disable remote management for now, and set the switch's
 * DSA device number.
 */
-   return mv88e6xxx_reg_write(ds, REG_GLOBAL, GLOBAL_CONTROL_2,
+   return mv88e6xxx_reg_write(ps, REG_GLOBAL, GLOBAL_CONTROL_2,
   ds->index & 0x1f);
 }
 
 static int mv88e6123_setup(struct dsa_switch *ds)
 {
+   struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
int ret;
 
-   ret = mv88e6xxx_setup_common(ds);
+   ps->ds = ds;
+
+   ret = mv88e6xxx_setup_common(ps);
if (ret < 0)
return ret;
 
-   ret = mv88e6xxx_switch_reset(ds, false);
+   ret = mv88e6xxx_switch_reset(ps, false);
if (ret < 0)
return ret;
 
diff --git a/drivers/net/dsa/mv88e6131.c b/drivers/net/dsa/mv88e6131.c
index c3eb9a8..357ab79 100644
--- a/drivers/net/dsa/mv88e6131.c
+++ b/drivers/net/dsa/mv88e6131.c
@@ -56,6 +56,7 @@ static const char *mv88e6131_drv_probe(struct device *dsa_dev,
 
 static int mv88e6131_setup_global(struct dsa_switch *ds)
 {
+   struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
u32 upstream_port = dsa_upstream_port(ds);
int ret;
u32 reg;
@@ -69,14 +70,14 @@ static int mv88e6131_setup_global(struct dsa_switch *ds)
 * to arbitrate between packet queues, set the maximum frame
 * size to 1632, and mask all interrupt sources.
 */
-   ret = mv88e6xxx_reg_write(ds, REG_GLOBAL, GLOBAL_CONTROL,
+   ret = mv88e6xxx_reg_write(ps, REG_GLOBAL, GLOBAL_CONTROL,
  GLOBAL_CONTROL_PPU_ENABLE |
  GLOBAL_CONTROL_MAX_FRAME_1632);
if (ret)
return ret;
 
/* Set the VLAN ethertype to 0x8100. */
-   ret = mv88e6xxx_reg_write(ds, REG_GLOBAL, GLOBAL_CORE_TAG_TYPE, 0x8100);
+   ret = mv88e6xxx_reg_write(ps, REG_GLOBAL, GLOBAL_CORE_TAG_TYPE, 0x8100);
if (ret)
return ret;
 
@@ -87,7 +88,7 @@ static int mv88e6131_setup_global(struct dsa_switch *ds)
reg = upstream_port << GLOBAL_MONITOR_CONTROL_INGRESS_SHIFT |
upstream_port << GLOBAL_MONITOR_CONTROL_EGRESS_SHIFT |
GLOBAL_MONITOR_CONTROL_ARP_DISABLED;
-   ret = mv88e6xxx_reg_write(ds, REG_GLOBAL, GLOBAL_MONITOR_CONTROL, reg);
+   ret = mv88e6xxx_reg_write(ps, REG_GLOBAL, GLOBAL_MONITOR_CONTROL, reg);
if (ret)
return ret;
 
@@ -96,11 +97,11 @@ static int mv88e6131_setup_global(struct dsa_switch *ds)
 * DSA device number.
 */
if (ds->dst->pd->nr_chips > 1)
-   ret = mv88e6xxx_reg_write(ds, REG_GLOBAL, GLOBAL_CONTROL_2,
+   ret = mv88e6xxx_reg_write(ps, REG_GLOBAL, GLOBAL_CONTROL_2,
  GLOBAL_CONTROL_2_MULTIPLE_CASCADE |
  (ds->index & 0x1f));
else
-   ret 

Re: [PATCHv2] netem: Segment GSO packets on enqueue.

2016-04-28 Thread Neil Horman
On Thu, Apr 28, 2016 at 01:58:53PM -0700, Eric Dumazet wrote:
> On Thu, 2016-04-28 at 16:09 -0400, Neil Horman wrote:
> > This was recently reported to me, and reproduced on the latest net kernel, when
> > attempting to run netperf from a host that had a netem qdisc attached to the
> > egress interface:
> 
> >  
> > -   return NET_XMIT_SUCCESS;
> > +finish_segs:
> > +   while (segs) {
> > +   skb2 = segs->next;
> > +   segs->next = NULL;
> > +   qdisc_skb_cb(segs)->pkt_len = segs->len;
> > +   rc = qdisc_enqueue(segs, sch);
> > +   if (rc != NET_XMIT_SUCCESS) {
> > +   if (net_xmit_drop_count(rc))
> > +   qdisc_qstats_drop(sch);
> > +   }
> > +   segs = skb2;
> > +   }
> > +   return rc;
> >  }
> 
> It seems you missed the qdisc_tree_reduce_backlog() call ?
> 
Crap, yes, sorry.  I did a last minute modification to move the segment
requeuing to netem_enqueue and inadvertently removed it; I'll repost shortly.

Best
Neil

> 
> 
> 
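
For reference, the direction a respin would likely take (a sketch only:
the nb/len/prev_len bookkeeping is assumed here, and
qdisc_tree_reduce_backlog() is the in-tree helper Eric refers to; passing
it a negative count grows the backlog to match the now-segmented skb):

finish_segs:
	if (segs) {
		while (segs) {
			skb2 = segs->next;
			segs->next = NULL;
			qdisc_skb_cb(segs)->pkt_len = segs->len;
			len += segs->len;
			rc = qdisc_enqueue(segs, sch);
			if (rc != NET_XMIT_SUCCESS) {
				if (net_xmit_drop_count(rc))
					qdisc_qstats_drop(sch);
			} else {
				nb++;
			}
			segs = skb2;
		}
		sch->q.qlen += nb;
		if (nb > 1)
			qdisc_tree_reduce_backlog(sch, 1 - nb, prev_len - len);
	}
	return rc;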


Re: [RFC PATCH V2 2/2] vhost: device IOTLB API

2016-04-28 Thread Jason Wang


On 04/28/2016 10:43 PM, Michael S. Tsirkin wrote:
> On Thu, Apr 28, 2016 at 02:37:16PM +0800, Jason Wang wrote:
>>
>> On 04/27/2016 07:45 PM, Michael S. Tsirkin wrote:
>>> On Fri, Mar 25, 2016 at 10:34:34AM +0800, Jason Wang wrote:
>>>> This patch tries to implement a device IOTLB for vhost. This could be
>>>> used for co-operation with the userspace (qemu) implementation of DMA
>>>> remapping.
>>>>
>>>> The idea is simple. When vhost meets an IOTLB miss, it will request
>>>> the assistance of userspace to do the translation, this is done
>>>> through:
>>>>
>>>> - Fill the translation request in a preset userspace address (This
>>>>   address is set through ioctl VHOST_SET_IOTLB_REQUEST_ENTRY).
>>>> - Notify userspace through eventfd (This eventfd was set through ioctl
>>>>   VHOST_SET_IOTLB_FD).
>>> Why use an eventfd for this?
>> The aim is to implement the API all through ioctls.
>>
>>>  We use them for interrupts because
>>> that happens to be what kvm wants, but here - why don't we
>>> just add a generic support for reading out events
>>> on the vhost fd itself?
>> I've considered this approach, but what are the advantages of this? I
>> mean it looks like all other ioctls could be done through vhost fd
>> reading/writing too.
> read/write have a non-blocking flag.
>
> It's not useful for other ioctls but it's useful here.
>

Ok, this looks better.

>>>> - device IOTLB were started and stopped through VHOST_RUN_IOTLB ioctl
>>>>
>>>> When userspace finishes the translation, it will update the vhost
>>>> IOTLB through VHOST_UPDATE_IOTLB ioctl. Userspace is also in charge of
>>>> snooping the IOTLB invalidation of IOMMU IOTLB and use
>>>> VHOST_UPDATE_IOTLB to invalidate the possible entry in vhost.
>>> There's one problem here, and that is that VQs still do not undergo
>>> translation.  In theory VQ could be mapped in such a way
>>> that it's not contiguous in userspace memory.
>> I'm not sure I get the issue, current vhost API support setting
>> desc_user_addr, used_user_addr and avail_user_addr independently. So
>> looks ok? If not, looks not a problem to device IOTLB API itself.
> The problem is that addresses are all HVA.
>
> Without an iommu, we ask for them to be contiguous and
> since bus address == GPA, this means contiguous GPA =>
> contiguous HVA. With an IOMMU you can map contiguous
> bus address but non contiguous GPA and non contiguous HVA.

Yes, so the issue is we should not reuse VHOST_SET_VRING_ADDR and should
invent a new ioctl to set the bus addr (guest iova), then access the VQ
through the device IOTLB too.

>
> Another concern: what if guest changes the GPA while keeping bus address
> constant? Normal devices will work because they only use
> bus addresses, but virtio will break.

If we access the VQ through the device IOTLB too, this could be solved.

>
>
>
>>>
>>>> Signed-off-by: Jason Wang 
>>> What limits amount of entries that kernel keeps around?
>> It depends on guest working set I think. Looking at
>> http://dpdk.org/doc/guides/linux_gsg/sys_reqs.html:
>>
>> - For 2MB page size in guest, it suggests hugepages=1024
>> - For 1GB page size, it suggests a hugepages=4
>>
>> So I choose 2048 to make sure it can cover this.
> 4K page size is rather common, too.

I assume hugepages are used widely, and there's a note in the above link:

"For 64-bit applications, it is recommended to use 1 GB hugepages if the
platform supports them."

For the 4K case, the TLB hit rate will be very low for a large working
set even in a physical environment. Not sure we should care; if we want,
we can probably cache more translations in userspace's device IOTLB
implementation.
implementation.

>
>>> Do we want at least a mod parameter for this?
>> Maybe.
>>
>>>> ---
>>>>  drivers/vhost/net.c|   6 +-
>>>>  drivers/vhost/vhost.c  | 301 
>>>> +++++---
>>>>  drivers/vhost/vhost.h  |  17 ++-
>>>>  fs/eventfd.c   |   3 +-
>>>>  include/uapi/linux/vhost.h |  35 ++
>>>>  5 files changed, 320 insertions(+), 42 deletions(-)
>>>>
>> [...]
>>
>>>> +struct vhost_iotlb_entry {
>>>> +  __u64 iova;
>>>> +  __u64 size;
>>>> +  __u64 userspace_addr;
>>> Alignment requirements?
>> The API does not require any alignment. Will add a comment for this.
>>
>>>> +  struct {
>>>> +#define VHOST_ACCESS_RO  0x1
>>>> +#define VHOST_ACCESS_WO  0x2
>>>> +#define VHOST_ACCESS_RW  0x3
>>>> +  __u8  perm;
>>>> +#define VHOST_IOTLB_MISS   1
>>>> +#define VHOST_IOTLB_UPDATE 2
>>>> +#define VHOST_IOTLB_INVALIDATE 3
>>>> +  __u8  type;
>>>> +#define VHOST_IOTLB_INVALID0x1
>>>> +#define VHOST_IOTLB_VALID  0x2
>>>> +  __u8  valid;
>>> why do we need this flag?
>> Useless, will remove.
>>
>>>> +  __u8  u8_padding;
>>>> +  __u32 padding;
>>>> +  } flags;
>>>> +};
>>>> +
>>>> +struct vhost_vring_iotlb_entry {
>>>> +  unsigned int index;
>>>> +  __u64 userspace_addr;
>>>> +};
>>>> +
>>>>  struct vhost_memory_region {
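
A rough sketch of the userspace side of this flow, to make the API above
concrete (hypothetical code: the ioctls and struct layout are from this RFC
and are not in mainline headers, and translate() stands in for qemu's
vIOMMU lookup):

static void serve_iotlb(int vhost_fd, int iotlb_fd,
			struct vhost_iotlb_entry *req)
{
	uint64_t cnt;

	/* req points at the shared entry registered with
	 * VHOST_SET_IOTLB_REQUEST_ENTRY; misses are signalled on the
	 * eventfd registered with VHOST_SET_IOTLB_FD.
	 */
	while (read(iotlb_fd, &cnt, sizeof(cnt)) == sizeof(cnt)) {
		if (req->flags.type != VHOST_IOTLB_MISS)
			continue;

		struct vhost_iotlb_entry reply = {
			.iova           = req->iova,
			.size           = 4096,	/* one page, for example */
			.userspace_addr = translate(req->iova),
		};
		reply.flags.perm = VHOST_ACCESS_RW;
		reply.flags.type = VHOST_IOTLB_UPDATE;

		/* Push the translation back into the vhost device IOTLB. */
		ioctl(vhost_fd, VHOST_UPDATE_IOTLB, &reply);
	}
}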

[PATCH] mdio_bus: Fix MDIO bus scanning in __mdiobus_register()

2016-04-28 Thread Marek Vasut
Since commit b74766a0a0feeef5c779709cc5d109451c0d5b17 in linux-next,
( phylib: don't return NULL from get_phy_device() ), get_phy_device()
will return ERR_PTR(-ENODEV) instead of NULL if the PHY device ID is
all ones.

This causes problems with the stmmac driver and likely some other drivers
which call mdiobus_register(). I triggered this bug on SoCFPGA MCVEVK
board with linux-next 20160427 and 20160428. In case of the stmmac, if
there is no PHY node specified in the DT for the stmmac block, the stmmac
driver ( drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c function
stmmac_mdio_register() ) will call mdiobus_register() , which will
register the MDIO bus and probe for the PHY.

The mdiobus_register() resp. __mdiobus_register() iterates over all of
the addresses on the MDIO bus and calls mdiobus_scan() for each of them,
which invokes get_phy_device(). Before the aforementioned patch, the
mdiobus_scan() would return NULL if no PHY was found on a given address
and mdiobus_register() would continue and try the next PHY address. Now,
mdiobus_scan() returns ERR_PTR(-ENODEV), which is caught by the
'if (IS_ERR(phydev))' condition and the loop exits immediately if the
PHY address does not contain a PHY.

Repair this by explicitly checking for the ERR_PTR(-ENODEV) and if this
error comes around, continue with the next PHY address.

Signed-off-by: Marek Vasut <ma...@denx.de>
Cc: Arnd Bergmann <a...@arndb.de>
Cc: David S. Miller <da...@davemloft.net>
Cc: Dinh Nguyen <dingu...@opensource.altera.com>
Cc: Florian Fainelli <f.faine...@gmail.com>
Cc: Sergei Shtylyov <sergei.shtyl...@cogentembedded.com>
---
 drivers/net/phy/mdio_bus.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

NOTE: I don't quite like this explicit check, but I don't have a better idea now.

diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
index 499003ee..388f992 100644
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c
@@ -333,7 +333,7 @@ int __mdiobus_register(struct mii_bus *bus, struct module *owner)
struct phy_device *phydev;
 
phydev = mdiobus_scan(bus, i);
-   if (IS_ERR(phydev)) {
+   if (IS_ERR(phydev) && (PTR_ERR(phydev) != -ENODEV)) {
err = PTR_ERR(phydev);
goto error;
}
-- 
2.7.0



Re: [PATCH net-next 1/6] tcp: do not assume TCP code is non preemptible

2016-04-28 Thread Eric Dumazet
On Wed, 2016-04-27 at 22:25 -0700, Eric Dumazet wrote:
> We want to make TCP stack preemptible, as draining prequeue
> and backlog queues can take lot of time.
> 
> Many SNMP updates were assuming that BH (and preemption) was disabled.
> 
> Need to convert some __NET_INC_STATS() calls to NET_INC_STATS()
> and some __TCP_INC_STATS() to TCP_INC_STATS()
> 
> Before using this_cpu_ptr(net->ipv4.tcp_sk) in tcp_v4_send_reset()
> and tcp_v4_send_ack(), we add an explicit preempt disabled section.
> 
> Signed-off-by: Eric Dumazet 
> ---

I'll send a V2 including the following changes I missed:

I'll also include the sendmsg() latency breaker.

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0509a685d90c..25d527922b18 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2698,7 +2698,7 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff 
*skb, int segs)
tp->retrans_stamp = tcp_skb_timestamp(skb);
 
} else if (err != -EBUSY) {
-   __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL);
+   NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL);
}
 
if (tp->undo_retrans < 0)
@@ -2822,7 +2822,7 @@ begin_fwd:
if (tcp_retransmit_skb(sk, skb, segs))
return;
 
-   __NET_INC_STATS(sock_net(sk), mib_idx);
+   NET_INC_STATS(sock_net(sk), mib_idx);
 
if (tcp_in_cwnd_reduction(sk))
tp->prr_out += tcp_skb_pcount(skb);





[PATCH net-next 1/3] qed: add infrastructure for device self tests.

2016-04-28 Thread Sudarsana Reddy Kalluru
This patch adds the functionality and APIs needed for selftests.
It adds the ability to configure the link-mode which is required for the
implementation of loopback tests. It adds the APIs for clock test,
register test, interrupt test and memory test.

Signed-off-by: Sudarsana Reddy Kalluru 
Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/qlogic/qed/Makefile  |  3 +-
 drivers/net/ethernet/qlogic/qed/qed_hsi.h | 13 
 drivers/net/ethernet/qlogic/qed/qed_main.c| 28 +
 drivers/net/ethernet/qlogic/qed/qed_mcp.c | 42 +
 drivers/net/ethernet/qlogic/qed/qed_mcp.h | 22 +++
 drivers/net/ethernet/qlogic/qed/qed_selftest.c| 76 +++
 drivers/net/ethernet/qlogic/qed/qed_selftest.h| 40 
 drivers/net/ethernet/qlogic/qed/qed_sp.h  | 10 +++
 drivers/net/ethernet/qlogic/qed/qed_sp_commands.c | 21 +++
 include/linux/qed/qed_if.h| 47 ++
 10 files changed, 301 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/qlogic/qed/qed_selftest.c
 create mode 100644 drivers/net/ethernet/qlogic/qed/qed_selftest.h

diff --git a/drivers/net/ethernet/qlogic/qed/Makefile 
b/drivers/net/ethernet/qlogic/qed/Makefile
index 5c2fd57..aafa669 100644
--- a/drivers/net/ethernet/qlogic/qed/Makefile
+++ b/drivers/net/ethernet/qlogic/qed/Makefile
@@ -1,4 +1,5 @@
 obj-$(CONFIG_QED) := qed.o
 
 qed-y := qed_cxt.o qed_dev.o qed_hw.o qed_init_fw_funcs.o qed_init_ops.o \
-qed_int.o qed_main.o qed_mcp.o qed_sp_commands.o qed_spq.o qed_l2.o
+qed_int.o qed_main.o qed_mcp.o qed_sp_commands.o qed_spq.o qed_l2.o \
+qed_selftest.o
diff --git a/drivers/net/ethernet/qlogic/qed/qed_hsi.h 
b/drivers/net/ethernet/qlogic/qed/qed_hsi.h
index 5aa78a9..c4fae71 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_hsi.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_hsi.h
@@ -3857,6 +3857,7 @@ struct public_drv_mb {
 #define DRV_MSG_CODE_PHY_CORE_WRITE 0x000e
 #define DRV_MSG_CODE_SET_VERSION0x000f
 
+#define DRV_MSG_CODE_BIST_TEST  0x001e
 #define DRV_MSG_CODE_SET_LED_MODE   0x0020
 
 #define DRV_MSG_SEQ_NUMBER_MASK 0x
@@ -3914,6 +3915,18 @@ struct public_drv_mb {
 #define DRV_MB_PARAM_SET_LED_MODE_ON0x1
 #define DRV_MB_PARAM_SET_LED_MODE_OFF   0x2
 
+#define DRV_MB_PARAM_BIST_UNKNOWN_TEST  0
+#define DRV_MB_PARAM_BIST_REGISTER_TEST 1
+#define DRV_MB_PARAM_BIST_CLOCK_TEST2
+
+#define DRV_MB_PARAM_BIST_RC_UNKNOWN0
+#define DRV_MB_PARAM_BIST_RC_PASSED 1
+#define DRV_MB_PARAM_BIST_RC_FAILED 2
+#define DRV_MB_PARAM_BIST_RC_INVALID_PARAMETER  3
+
+#define DRV_MB_PARAM_BIST_TEST_INDEX_SHIFT  0
+#define DRV_MB_PARAM_BIST_TEST_INDEX_MASK   0x00FF
+
u32 fw_mb_header;
 #define FW_MSG_CODE_MASK0x
 #define FW_MSG_CODE_DRV_LOAD_ENGINE 0x1010
diff --git a/drivers/net/ethernet/qlogic/qed/qed_main.c 
b/drivers/net/ethernet/qlogic/qed/qed_main.c
index 1918b83..1b758bd 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_main.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_main.c
@@ -28,6 +28,7 @@
 #include "qed_dev_api.h"
 #include "qed_mcp.h"
 #include "qed_hw.h"
+#include "qed_selftest.h"
 
 static char version[] =
"QLogic FastLinQ 4 Core Module qed " DRV_MODULE_VERSION "\n";
@@ -976,6 +977,25 @@ static int qed_set_link(struct qed_dev *cdev,
else
link_params->pause.forced_tx = false;
}
+   if (params->override_flags & QED_LINK_OVERRIDE_LOOPBACK_MODE) {
+   switch (params->loopback_mode) {
+   case QED_LINK_LOOPBACK_INT_PHY:
+   link_params->loopback_mode = PMM_LOOPBACK_INT_PHY;
+   break;
+   case QED_LINK_LOOPBACK_EXT_PHY:
+   link_params->loopback_mode = PMM_LOOPBACK_EXT_PHY;
+   break;
+   case QED_LINK_LOOPBACK_EXT:
+   link_params->loopback_mode = PMM_LOOPBACK_EXT;
+   break;
+   case QED_LINK_LOOPBACK_MAC:
+   link_params->loopback_mode = PMM_LOOPBACK_MAC;
+   break;
+   default:
+   link_params->loopback_mode = PMM_LOOPBACK_NONE;
+   break;
+   }
+   }
 
rc = qed_mcp_set_link(hwfn, ptt, params->link_up);
 
@@ -1182,7 +1202,15 @@ static int qed_set_led(struct qed_dev *cdev, enum qed_led_mode mode)
return status;
 }
 
+struct qed_selftest_ops qed_selftest_ops_pass = {
+   .selftest_memory = &qed_selftest_memory,
+   .selftest_interrupt = &qed_selftest_interrupt,
+   .selftest_register = &qed_selftest_register,
+   .selftest_clock = &qed_selftest_clock,

[PATCH net-next 2/3] qede: add support for selftests.

2016-04-28 Thread Sudarsana Reddy Kalluru
This patch adds the qede ethtool support for the following tests:
- interrupt test
- memory test
- register test
- clock test

Signed-off-by: Sudarsana Reddy Kalluru 
Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/qlogic/qede/qede_ethtool.c | 56 -
 1 file changed, 55 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c 
b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
index f1dd25a..e25a05b 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
@@ -125,6 +125,21 @@ static const char qede_private_arr[QEDE_PRI_FLAG_LEN][ETH_GSTRING_LEN] = {
"Coupled-Function",
 };
 
+enum qede_ethtool_tests {
+   QEDE_ETHTOOL_INTERRUPT_TEST,
+   QEDE_ETHTOOL_MEMORY_TEST,
+   QEDE_ETHTOOL_REGISTER_TEST,
+   QEDE_ETHTOOL_CLOCK_TEST,
+   QEDE_ETHTOOL_TEST_MAX
+};
+
+static const char qede_tests_str_arr[QEDE_ETHTOOL_TEST_MAX][ETH_GSTRING_LEN] = {
+   "Interrupt (online)\t",
+   "Memory (online)\t\t",
+   "Register (online)\t",
+   "Clock (online)\t\t",
+};
+
 static void qede_get_strings_stats(struct qede_dev *edev, u8 *buf)
 {
int i, j, k;
@@ -152,6 +167,10 @@ static void qede_get_strings(struct net_device *dev, u32 
stringset, u8 *buf)
memcpy(buf, qede_private_arr,
   ETH_GSTRING_LEN * QEDE_PRI_FLAG_LEN);
break;
+   case ETH_SS_TEST:
+   memcpy(buf, qede_tests_str_arr,
+  ETH_GSTRING_LEN * QEDE_ETHTOOL_TEST_MAX);
+   break;
default:
DP_VERBOSE(edev, QED_MSG_DEBUG,
   "Unsupported stringset 0x%08x\n", stringset);
@@ -192,7 +211,8 @@ static int qede_get_sset_count(struct net_device *dev, int 
stringset)
return num_stats + QEDE_NUM_RQSTATS;
case ETH_SS_PRIV_FLAGS:
return QEDE_PRI_FLAG_LEN;
-
+   case ETH_SS_TEST:
+   return QEDE_ETHTOOL_TEST_MAX;
default:
DP_VERBOSE(edev, QED_MSG_DEBUG,
   "Unsupported stringset 0x%08x\n", stringset);
@@ -827,6 +847,39 @@ static int qede_set_rxfh(struct net_device *dev, const u32 
*indir,
return 0;
 }
 
+static void qede_self_test(struct net_device *dev,
+  struct ethtool_test *etest, u64 *buf)
+{
+   struct qede_dev *edev = netdev_priv(dev);
+
+   DP_VERBOSE(edev, QED_MSG_DEBUG,
+  "Self-test command parameters: offline = %d, external_lb = 
%d\n",
+  (etest->flags & ETH_TEST_FL_OFFLINE),
+  (etest->flags & ETH_TEST_FL_EXTERNAL_LB) >> 2);
+
+   memset(buf, 0, sizeof(u64) * QEDE_ETHTOOL_TEST_MAX);
+
+   if (edev->ops->common->selftest->selftest_interrupt(edev->cdev)) {
+   buf[QEDE_ETHTOOL_INTERRUPT_TEST] = 1;
+   etest->flags |= ETH_TEST_FL_FAILED;
+   }
+
+   if (edev->ops->common->selftest->selftest_memory(edev->cdev)) {
+   buf[QEDE_ETHTOOL_MEMORY_TEST] = 1;
+   etest->flags |= ETH_TEST_FL_FAILED;
+   }
+
+   if (edev->ops->common->selftest->selftest_register(edev->cdev)) {
+   buf[QEDE_ETHTOOL_REGISTER_TEST] = 1;
+   etest->flags |= ETH_TEST_FL_FAILED;
+   }
+
+   if (edev->ops->common->selftest->selftest_clock(edev->cdev)) {
+   buf[QEDE_ETHTOOL_CLOCK_TEST] = 1;
+   etest->flags |= ETH_TEST_FL_FAILED;
+   }
+}
+
 static const struct ethtool_ops qede_ethtool_ops = {
.get_settings = qede_get_settings,
.set_settings = qede_set_settings,
@@ -852,6 +905,7 @@ static const struct ethtool_ops qede_ethtool_ops = {
.set_rxfh = qede_set_rxfh,
.get_channels = qede_get_channels,
.set_channels = qede_set_channels,
+   .self_test = qede_self_test,
 };
 
 void qede_set_ethtool_ops(struct net_device *dev)
-- 
1.8.3.1



[PATCH net-next 3/3] qede: add implementation for internal loopback test.

2016-04-28 Thread Sudarsana Reddy Kalluru
This patch adds the qede implementation for the internal loopback test.

Signed-off-by: Sudarsana Reddy Kalluru 
Signed-off-by: Yuval Mintz 
Signed-off-by: Manish Chopra 
---
 drivers/net/ethernet/qlogic/qede/qede.h |   4 +
 drivers/net/ethernet/qlogic/qede/qede_ethtool.c | 234 
 drivers/net/ethernet/qlogic/qede/qede_main.c|   8 +-
 3 files changed, 242 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede.h 
b/drivers/net/ethernet/qlogic/qede/qede.h
index a687e7a..ff3ac0c 100644
--- a/drivers/net/ethernet/qlogic/qede/qede.h
+++ b/drivers/net/ethernet/qlogic/qede/qede.h
@@ -308,6 +308,10 @@ void qede_reload(struct qede_dev *edev,
 union qede_reload_args *args);
 int qede_change_mtu(struct net_device *dev, int new_mtu);
 void qede_fill_by_demand_stats(struct qede_dev *edev);
+bool qede_has_rx_work(struct qede_rx_queue *rxq);
+int qede_txq_has_work(struct qede_tx_queue *txq);
+void qede_recycle_rx_bd_ring(struct qede_rx_queue *rxq, struct qede_dev *edev,
+u8 count);
 
 #define RX_RING_SIZE_POW   13
 #define RX_RING_SIZE   ((u16)BIT(RX_RING_SIZE_POW))
diff --git a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c 
b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
index e25a05b..0d04f16 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -27,6 +28,9 @@
 #define QEDE_RQSTAT_STRING(stat_name) (#stat_name)
 #define QEDE_RQSTAT(stat_name) \
 {QEDE_RQSTAT_OFFSET(stat_name), QEDE_RQSTAT_STRING(stat_name)}
+
+#define QEDE_SELFTEST_POLL_COUNT 100
+
 static const struct {
u64 offset;
char string[ETH_GSTRING_LEN];
@@ -126,6 +130,7 @@ static const char 
qede_private_arr[QEDE_PRI_FLAG_LEN][ETH_GSTRING_LEN] = {
 };
 
 enum qede_ethtool_tests {
+   QEDE_ETHTOOL_INT_LOOPBACK,
QEDE_ETHTOOL_INTERRUPT_TEST,
QEDE_ETHTOOL_MEMORY_TEST,
QEDE_ETHTOOL_REGISTER_TEST,
@@ -134,6 +139,7 @@ enum qede_ethtool_tests {
 };
 
 static const char qede_tests_str_arr[QEDE_ETHTOOL_TEST_MAX][ETH_GSTRING_LEN] = 
{
+   "Internal loopback (offline)",
"Interrupt (online)\t",
"Memory (online)\t\t",
"Register (online)\t",
@@ -847,6 +853,226 @@ static int qede_set_rxfh(struct net_device *dev, const 
u32 *indir,
return 0;
 }
 
+/* This function enables the interrupt generation and the NAPI on the device */
+static void qede_netif_start(struct qede_dev *edev)
+{
+   int i;
+
+   if (!netif_running(edev->ndev))
+   return;
+
+   for_each_rss(i) {
+   /* Update and reenable interrupts */
+   qed_sb_ack(edev->fp_array[i].sb_info, IGU_INT_ENABLE, 1);
+   napi_enable(&edev->fp_array[i].napi);
+   }
+}
+
+/* This function disables the NAPI and the interrupt generation on the device 
*/
+static void qede_netif_stop(struct qede_dev *edev)
+{
+   int i;
+
+   for_each_rss(i) {
+   napi_disable(&edev->fp_array[i].napi);
+   /* Disable interrupts */
+   qed_sb_ack(edev->fp_array[i].sb_info, IGU_INT_DISABLE, 0);
+   }
+}
+
+static int qede_selftest_transmit_traffic(struct qede_dev *edev,
+ struct sk_buff *skb)
+{
+   struct qede_tx_queue *txq = &edev->fp_array[0].txqs[0];
+   struct eth_tx_1st_bd *first_bd;
+   dma_addr_t mapping;
+   int i, idx, val;
+
+   /* Fill the entry in the SW ring and the BDs in the FW ring */
+   idx = txq->sw_tx_prod & NUM_TX_BDS_MAX;
+   txq->sw_tx_ring[idx].skb = skb;
+   first_bd = qed_chain_produce(&txq->tx_pbl);
+   memset(first_bd, 0, sizeof(*first_bd));
+   val = 1 << ETH_TX_1ST_BD_FLAGS_START_BD_SHIFT;
+   first_bd->data.bd_flags.bitfields = val;
+
+   /* Map skb linear data for DMA and set in the first BD */
+   mapping = dma_map_single(&edev->pdev->dev, skb->data,
+skb_headlen(skb), DMA_TO_DEVICE);
+   if (unlikely(dma_mapping_error(&edev->pdev->dev, mapping))) {
+   DP_NOTICE(edev, "SKB mapping failed\n");
+   return -ENOMEM;
+   }
+   BD_SET_UNMAP_ADDR_LEN(first_bd, mapping, skb_headlen(skb));
+
+   /* update the first BD with the actual num BDs */
+   first_bd->data.nbds = 1;
+   txq->sw_tx_prod++;
+   /* 'next page' entries are counted in the producer value */
+   val = cpu_to_le16(qed_chain_get_prod_idx(&txq->tx_pbl));
+   txq->tx_db.data.bd_prod = val;
+
+   /* wmb makes sure that the BDs data is updated before updating the
+* producer, otherwise FW may read old data from the BDs.
+*/
+   wmb();
+   barrier();
+   writel(txq->tx_db.raw, txq->doorbell_addr);
+
+   /* mmiowb is needed to synchronize doorbell 

[PATCH net-next 0/3] qed/qede: ethtool selftests support.

2016-04-28 Thread Sudarsana Reddy Kalluru
Hi David,

This series adds the driver support for the following selftests:
1. Register test
2. Memory test
3. Clock test
4. Interrupt test
5. Internal loopback test
Patch (1) adds the qed driver infrastructure for selftests. Patches (2) and
(3) add qede driver support for ethtool selftests.

Please consider applying this series to "net-next".

Thanks,
Sudarsana

Sudarsana Reddy Kalluru (3):
  qed: add infrastructure for device self tests.
  qede: add support for selftests.
  qede: add implementation for internal loopback test.

 drivers/net/ethernet/qlogic/qed/Makefile  |   3 +-
 drivers/net/ethernet/qlogic/qed/qed_hsi.h |  13 +
 drivers/net/ethernet/qlogic/qed/qed_main.c|  28 +++
 drivers/net/ethernet/qlogic/qed/qed_mcp.c |  42 
 drivers/net/ethernet/qlogic/qed/qed_mcp.h |  22 ++
 drivers/net/ethernet/qlogic/qed/qed_selftest.c|  76 ++
 drivers/net/ethernet/qlogic/qed/qed_selftest.h|  40 +++
 drivers/net/ethernet/qlogic/qed/qed_sp.h  |  10 +
 drivers/net/ethernet/qlogic/qed/qed_sp_commands.c |  21 ++
 drivers/net/ethernet/qlogic/qede/qede.h   |   4 +
 drivers/net/ethernet/qlogic/qede/qede_ethtool.c   | 290 +-
 drivers/net/ethernet/qlogic/qede/qede_main.c  |   8 +-
 include/linux/qed/qed_if.h|  47 
 13 files changed, 598 insertions(+), 6 deletions(-)
 create mode 100644 drivers/net/ethernet/qlogic/qed/qed_selftest.c
 create mode 100644 drivers/net/ethernet/qlogic/qed/qed_selftest.h

-- 
1.8.3.1



Re: [PATCH v2] net: macb: do not scan PHYs manually

2016-04-28 Thread Josh Cartwright
On Thu, Apr 28, 2016 at 11:23:15PM +0200, Andrew Lunn wrote:
> On Thu, Apr 28, 2016 at 04:03:57PM -0500, Josh Cartwright wrote:
> > On Thu, Apr 28, 2016 at 08:59:32PM +0200, Andrew Lunn wrote:
> > > On Thu, Apr 28, 2016 at 01:55:27PM -0500, Nathan Sullivan wrote:
> > > > On Thu, Apr 28, 2016 at 08:43:03PM +0200, Andrew Lunn wrote:
> > > > > > I agree that is a valid fix for AT91, however it won't solve our 
> > > > > > problem, since
> > > > > > we have no children on the second ethernet MAC in our devices' 
> > > > > > device trees. I'm
> > > > > > starting to feel like our second MAC shouldn't even really register 
> > > > > > the MDIO bus
> > > > > > since it isn't being used - maybe adding a DT property to not have 
> > > > > > a bus is a
> > > > > > better option?
> > > > > 
> > > > > status = "disabled"
> > > > > 
> > > > > would be the unusual way.
> > > > > 
> > > > >   Andrew
> > > > 
> > > > Oh, sorry, I meant we use both MACs on Zynq, however the PHYs are on 
> > > > the MDIO
> > > > bus of the first MAC.  So, the second MAC is used for ethernet but not 
> > > > for MDIO,
> > > > and so it does not have any PHYs under its DT node.  It would be nice 
> > > > if there
> > > > were a way to tell macb not to bother with MDIO for the second MAC, 
> > > > since that's
> > > > handled by the first MAC.
> > > 
> > > Yes, exactly, add support for status = "disabled" in the mdio node.
> > 
> > Unfortunately, the 'macb' doesn't have a "mdio node", or alternatively:
> > the node representing the mdio bus is the same node which represents the
> > macb instance itself.  Setting 'status = "disabled"' on this node will
> > just prevent the probing of the macb instance.
> 
> :-(
> 
> It is very common to have an mdio node within the MAC node, for example 
> imx6sx-sdb.dtsi

Okay, I think that makes sense.  I think, then, perhaps the solution to
our problem is to:

  1. Modify the macb driver to support an 'mdio' node. (And adjust the
 binding document accordingly).  If the node is found, it's used for
 of_mdiobus_register() w/o any of the manual scan madness.
  2. For backwards compatibility, in the case where an 'mdio' node does
 not exist, leave the existing behavior the way it is now
 (of_mdiobus_register() followed by manual scan) [perhaps warn of
 deprecation as well?]
  3. Update binding docs to reflect the above.

In this way, for our usecase, the 'status = "disabled"' in the newly
created 'mdio' node isn't necessary.  It's sufficient for the node to
exist and be empty.
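
For illustration only, a rough sketch of what step 1 could look like in
the macb MDIO setup path (a hedged sketch; the helper name and placement
are ours, not existing macb code):

	/* Sketch: prefer a dedicated "mdio" child node when present,
	 * otherwise fall back to today's behaviour for old device trees.
	 */
	static int macb_mdiobus_register(struct macb *bp)
	{
		struct device_node *np = bp->pdev->dev.of_node;
		struct device_node *mdio_np;
		int ret;

		mdio_np = of_get_child_by_name(np, "mdio");
		if (mdio_np) {
			/* New binding: register only what the node lists,
			 * no manual scan.
			 */
			ret = of_mdiobus_register(bp->mii_bus, mdio_np);
			of_node_put(mdio_np);
			return ret;
		}

		/* Legacy binding: keep the current behaviour (register
		 * plus manual scan) and warn about the deprecation.
		 */
		dev_warn(&bp->pdev->dev,
			 "deprecated DT: no \"mdio\" subnode found\n");
		return of_mdiobus_register(bp->mii_bus, np);
	}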

> &fec1 {
> pinctrl-names = "default";
> pinctrl-0 = <&pinctrl_enet1>;
> phy-supply = <&reg_enet_3v3>;
> phy-mode = "rgmii";
> phy-handle = <&ethphy1>;
> status = "okay";
> 
> mdio {
> #address-cells = <1>;
> #size-cells = <0>;
> 
> ethphy1: ethernet-phy@1 {
> reg = <1>;
> };
> 
> ethphy2: ethernet-phy@2 {
> reg = <2>;
> };
> };
> };
> 
> &fec2 {
> pinctrl-names = "default";
> pinctrl-0 = <&pinctrl_enet2>;
> phy-mode = "rgmii";
> phy-handle = <&ethphy2>;
> status = "okay";
> };
> 
> This even has the two phys on one bus, as you described...

Yep...looks nearly exactly the same case.

Thanks,
  Josh


[PATCH net-next 1/1] tipc: set 'active' state correctly for first established link

2016-04-28 Thread Jon Maloy
When we are displaying statistics for the first link established between
two peers, it will always be presented as STANDBY, although in reality
it is ACTIVE.

This happens because we forget to set the 'active' flag in the link
instance at the moment it is established. Although this is a bug, it only
has impact on the presentation view of the link, not on its actual
functionality.

Signed-off-by: Jon Maloy 
---
 net/tipc/node.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/tipc/node.c b/net/tipc/node.c
index 68d9f7b..c299156 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -554,6 +554,7 @@ static void __tipc_node_link_up(struct tipc_node *n, int 
bearer_id,
*slot1 = bearer_id;
tipc_node_fsm_evt(n, SELF_ESTABL_CONTACT_EVT);
n->action_flags |= TIPC_NOTIFY_NODE_UP;
+   tipc_link_set_active(nl, true);
tipc_bcast_add_peer(n->net, nl, xmitq);
return;
}
-- 
1.9.1



[PATCH net-next] ila: ipv6/ila: fix nlsize calculation for lwtunnel

2016-04-28 Thread Tom Herbert
The handler 'ila_fill_encap_info' adds two attributes: ILA_ATTR_LOCATOR
and ILA_ATTR_CSUM_MODE.

Also, do nla_put_u8 instead of nla_put_u64 for ILA_ATTR_CSUM_MODE.

Fixes: 65d7ab8de582 ("net: Identifier Locator Addressing module")
Reported-by: Nicolas Dichtel 
Signed-off-by: Tom Herbert 
---
 net/ipv6/ila/ila_lwt.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/ila/ila_lwt.c b/net/ipv6/ila/ila_lwt.c
index 4985e1a..7788090 100644
--- a/net/ipv6/ila/ila_lwt.c
+++ b/net/ipv6/ila/ila_lwt.c
@@ -133,7 +133,7 @@ static int ila_fill_encap_info(struct sk_buff *skb,
	if (nla_put_u64_64bit(skb, ILA_ATTR_LOCATOR, (__force u64)p->locator.v64,
  ILA_ATTR_PAD))
goto nla_put_failure;
-   if (nla_put_u64(skb, ILA_ATTR_CSUM_MODE, (__force u8)p->csum_mode))
+   if (nla_put_u8(skb, ILA_ATTR_CSUM_MODE, (__force u8)p->csum_mode))
goto nla_put_failure;
 
return 0;
@@ -144,8 +144,12 @@ nla_put_failure:
 
 static int ila_encap_nlsize(struct lwtunnel_state *lwtstate)
 {
-   /* No encapsulation overhead */
-   return 0;
+   return
+   /* ILA_ATTR_LOCATOR */
+   nla_total_size(sizeof(u64)) +
+   /* ILA_ATTR_CSUM_MODE */
+   nla_total_size(sizeof(u8)) +
+   0;
 }
 
 static int ila_encap_cmp(struct lwtunnel_state *a, struct lwtunnel_state *b)
-- 
2.8.0.rc2



Re: pull request [net]: batman-adv-0160426

2016-04-28 Thread Antonio Quartulli
On Thu, Apr 28, 2016 at 04:43:51PM -0400, David Miller wrote:
> > Patch 2 and 3 have no "Fixes:" tag because the offending commits date
> > back to when batman-adv was not yet officially in the net tree.
> 
> This is not correct.  Instead, in the future, you should provide a
> Fixes: tag that indicates the commit that merged batman-adv into the
> upstream tree initially.

makes sense. Thanks for the suggestion!

-- 
Antonio Quartulli


signature.asc
Description: Digital signature


Re: [PATCH v3 net] soreuseport: Fix TCP listener hash collision

2016-04-28 Thread Eric Dumazet
On Thu, 2016-04-28 at 19:24 -0400, Craig Gallek wrote:
> From: Craig Gallek 
> 
> I forgot to include a check for listener port equality when deciding
> if two sockets should belong to the same reuseport group.  This was
> not caught previously because it's only necessary when two listening
> sockets for the same user happen to hash to the same listener bucket.
> The same error does not exist in the UDP path.
> 
> Fixes: c125e80b8868("soreuseport: fast reuseport TCP socket selection")
> Signed-off-by: Craig Gallek 
> ---

Thanks Craig

Acked-by: Eric Dumazet 




Re: [PATCH net-next] ravb: Remove rx buffer ALIGN

2016-04-28 Thread Simon Horman
Hi Sergei, Hi Kaneko-san,

On Tue, Apr 26, 2016 at 10:14:41PM +0300, Sergei Shtylyov wrote:
> Hello.
> 
> On 04/24/2016 07:16 PM, Yoshihiro Kaneko wrote:
> 
> >From: Kazuya Mizuguchi 
> >
> >Aligning the reception data size is not required.
> 
>OK, the gen 2/3 manuals indeed don't require this. I assume the patch has
> been tested...

This morning I tested this patch applied on net-next using the
r8a7795/salvator-x  (Gen-3). My test was to boot to a user-space prompt
using NFS root which was successful. I can run further tests on this setup
if it would be useful.

Unfortunately I do not have access to hardware to allow me to test this
on Gen-2.

> >Signed-off-by: Kazuya Mizuguchi 
> >Signed-off-by: Yoshihiro Kaneko 

Tested-by: Simon Horman 

>I have a few comments though...

[...]


[PATCH net-next v2] of: of_mdio: Check if MDIO bus controller is available

2016-04-28 Thread Florian Fainelli
Add a check whether the 'struct device_node' pointer passed to
of_mdiobus_register() is an available (aka enabled) node in the Device
Tree.

The rationale for doing this is cases where an Ethernet MAC provides an
MDIO bus controller and node, and an additional Ethernet MAC might be
connecting its PHY/switches to that first MDIO bus controller, while
still embedding one internally which is therefore marked as "disabled".

Instead of sprinkling checks like these in callers of
of_mdiobus_register(), do this in a central location.
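
As a hedged usage sketch (hypothetical caller, not part of this patch),
a MAC driver whose internal controller may be disabled could then do:

	/* -ENODEV now simply means the node is marked
	 * status = "disabled" and the PHYs live on another bus.
	 */
	err = of_mdiobus_register(bus, np);
	if (err == -ENODEV)
		return 0;	/* nothing to register for this MAC */
	if (err)
		return err;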

Reviewed-by: Andrew Lunn 
Signed-off-by: Florian Fainelli 
---
Changes in v2:
- utilize -ENODEV instead of -EINVAL
- added Andrew's reviewed-by tag

 drivers/of/of_mdio.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/of/of_mdio.c b/drivers/of/of_mdio.c
index b622b33dbf93..e051e1b57609 100644
--- a/drivers/of/of_mdio.c
+++ b/drivers/of/of_mdio.c
@@ -209,6 +209,10 @@ int of_mdiobus_register(struct mii_bus *mdio, struct 
device_node *np)
bool scanphys = false;
int addr, rc;
 
+   /* Do not continue if the node is disabled */
+   if (!of_device_is_available(np))
+   return -ENODEV;
+
/* Mask out all PHYs from auto probing.  Instead the PHYs listed in
 * the device tree are populated after the bus has been registered */
mdio->phy_mask = ~0;
-- 
2.1.0



[PATCH v3 net] soreuseport: Fix TCP listener hash collision

2016-04-28 Thread Craig Gallek
From: Craig Gallek 

I forgot to include a check for listener port equality when deciding
if two sockets should belong to the same reuseport group.  This was
not caught previously because it's only necessary when two listening
sockets for the same user happen to hash to the same listener bucket.
The same error does not exist in the UDP path.
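
For illustration, a hypothetical userspace sketch of the triggering setup
(two SO_REUSEPORT listeners of the same user on *different* ports, which
may land in the same listener hash bucket; names are ours):

	#include <netinet/in.h>
	#include <sys/socket.h>

	static int reuseport_listen(unsigned short port)
	{
		int one = 1;
		int fd = socket(AF_INET, SOCK_STREAM, 0);
		struct sockaddr_in a = {
			.sin_family = AF_INET,
			.sin_port = htons(port),
			.sin_addr.s_addr = htonl(INADDR_ANY),
		};

		setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
		bind(fd, (struct sockaddr *)&a, sizeof(a));
		return listen(fd, 16) ? -1 : fd;
	}

	/* Before this fix, reuseport_listen(5001) and reuseport_listen(5002)
	 * could wrongly join one reuseport group whenever both hashed into
	 * the same listener bucket.
	 */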

Fixes: c125e80b8868("soreuseport: fast reuseport TCP socket selection")
Signed-off-by: Craig Gallek 
---
v3 Changes
  - Eric pointed out that the net namespace check isn't necessary when
comparing bucket pointers.  They can not be equal across namespaces.
v2 Changes
  - Suggestions from Eric Dumazet to include network namespace equality
check and to avoid a dereference by simply checking inet_bind_bucket
pointer equality.
---
 net/ipv4/inet_hashtables.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index bc68eced0105..0d9e9d7bb029 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -470,6 +470,7 @@ static int inet_reuseport_add_sock(struct sock *sk,
 const struct sock *sk2,
 bool match_wildcard))
 {
+   struct inet_bind_bucket *tb = inet_csk(sk)->icsk_bind_hash;
struct sock *sk2;
struct hlist_nulls_node *node;
kuid_t uid = sock_i_uid(sk);
@@ -479,6 +480,7 @@ static int inet_reuseport_add_sock(struct sock *sk,
sk2->sk_family == sk->sk_family &&
ipv6_only_sock(sk2) == ipv6_only_sock(sk) &&
sk2->sk_bound_dev_if == sk->sk_bound_dev_if &&
+   inet_csk(sk2)->icsk_bind_hash == tb &&
sk2->sk_reuseport && uid_eq(uid, sock_i_uid(sk2)) &&
saddr_same(sk, sk2, false))
return reuseport_add_sock(sk, sk2);
-- 
2.8.0.rc3.226.g39d4020



Re: [PATCH net-next] of: of_mdio: Check if MDIO bus controller is available

2016-04-28 Thread Andrew Lunn
> Fair enough, I will submit something after re-spining this patch to use
> -ENODEV, which I agree is a better return code. Did you want me to
> remove that blurb from the commit message?

Blurb looks good. More blurb is better than less...

  Andrew


Re: [PATCH] net: davinci_mdio: Set of_node in the mdio bus

2016-04-28 Thread Andrew Lunn
> Is there another way to make the of_mdio_find_bus() call able to
> find the davinci mdio bus?

I missed the first post, and i cannot find it in the archive. Can you
explain what your problem is please.

So long as you call of_mdiobus_register() passing the correct device
node, it should all work.

  Andrew


Re: [PATCH net] RDMA/nes: don't leak skb if carrier down

2016-04-28 Thread Doug Ledford
On 4/28/2016 2:20 PM, David Miller wrote:
> From: Florian Westphal 
> Date: Sun, 24 Apr 2016 22:18:59 +0200
> 
>> Alternatively one could free the skb, OTOH I don't think this test is
>> useful so just remove it.
>>
>> Cc: 
>> Signed-off-by: Florian Westphal 
>> ---
>>  Noticed this while working on the TX_LOCKED removal.
> 
> Assuming Doug will take this.

Thanks for mentioning this ;-)

They had sent to netdev and cc:ed linux-rdma, so I took that to mean
they intended for you to take it.  I'll pick it up for my upcoming pull
request.




signature.asc
Description: OpenPGP digital signature


Re: [PATCH net-next] of: of_mdio: Check if MDIO bus controller is available

2016-04-28 Thread Florian Fainelli
On 28/04/16 15:12, Andrew Lunn wrote:
> On Thu, Apr 28, 2016 at 02:55:10PM -0700, Florian Fainelli wrote:
>> Add a check whether the 'struct device_node' pointer passed to
>> of_mdiobus_register() is an available (aka enabled) node in the Device
>> Tree.
>>
>> Rationale for doing this are cases where an Ethernet MAC provides a MDIO
>> bus controller and node, and an additional Ethernet MAC might be
>> connecting its PHY/switches to that first MDIO bus controller, while
>> still embedding one internally which is therefore marked as "disabled".
>>
>> Instead of sprinkling checks like these in callers of
>> of_mdiobus_register(), do this in a central location.
> 
> I think this discussion has shown there is no documented best
> practices for MDIO bus drivers and how PHYs nodes are placed within
> device tree. Maybe you could document the generic MDIO binding, both
> as integrated into a MAC device node, and as a separate device?

Fair enough, I will submit something after re-spining this patch to use
-ENODEV, which I agree is a better return code. Did you want me to
remove that blurb from the commit message?
-- 
Florian


[PATCH net-next 03/12] net/mlx5: Introduce modify flow rule destination

2016-04-28 Thread Saeed Mahameed
From: Maor Gottlieb 

This API is used for modifying the flow rule destination.
This is needed for modifying the pointed flow table by the
traffic type classifier rules to point on the aRFS tables.

Signed-off-by: Maor Gottlieb 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c |4 ++--
 include/linux/mlx5/fs.h   |3 +++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index 89cce97..bb2c1cd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -615,8 +615,8 @@ static int update_root_ft_create(struct mlx5_flow_table 
*ft, struct fs_prio
return err;
 }
 
-static int mlx5_modify_rule_destination(struct mlx5_flow_rule *rule,
-   struct mlx5_flow_destination *dest)
+int mlx5_modify_rule_destination(struct mlx5_flow_rule *rule,
+struct mlx5_flow_destination *dest)
 {
struct mlx5_flow_table *ft;
struct mlx5_flow_group *fg;
diff --git a/include/linux/mlx5/fs.h b/include/linux/mlx5/fs.h
index 8dec550..28a5b66 100644
--- a/include/linux/mlx5/fs.h
+++ b/include/linux/mlx5/fs.h
@@ -113,4 +113,7 @@ mlx5_add_flow_rule(struct mlx5_flow_table *ft,
   struct mlx5_flow_destination *dest);
 void mlx5_del_flow_rule(struct mlx5_flow_rule *fr);
 
+int mlx5_modify_rule_destination(struct mlx5_flow_rule *rule,
+struct mlx5_flow_destination *dest);
+
 #endif
-- 
1.7.1



[PATCH net-next 02/12] net/mlx5e: Direct TIR per RQ

2016-04-28 Thread Saeed Mahameed
From: Tariq Toukan 

Introduce new TIRs for direct access per RQ.
Now we have 2 available kinds of TIRs:
- indirect TIR per traffic type, each points to one RQT (RSS RQT)
  same as before.
- New direct TIR per RQ, each points to RQT with a size of one
  that forwards packets to that RQ only.

The driver will open max channels (num cores) direct TIRs by default;
they will be filled with the actual RQs once channels are allocated.

Needed for downstream aRFS and ethtool direct steering functionalities.

Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |   21 +-
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |9 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_fs.c|4 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  295 
 4 files changed, 191 insertions(+), 138 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index bbc01a4..5c8e98c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -385,14 +385,7 @@ enum mlx5e_traffic_types {
MLX5E_TT_IPV6,
MLX5E_TT_ANY,
MLX5E_NUM_TT,
-};
-
-#define IS_HASHING_TT(tt) (tt != MLX5E_TT_ANY)
-
-enum mlx5e_rqt_ix {
-   MLX5E_INDIRECTION_RQT,
-   MLX5E_SINGLE_RQ_RQT,
-   MLX5E_NUM_RQT,
+   MLX5E_NUM_INDIR_TIRS = MLX5E_TT_ANY,
 };
 
 struct mlx5e_eth_addr_info {
@@ -453,6 +446,11 @@ struct mlx5e_flow_tables {
struct mlx5e_flow_table main;
 };
 
+struct mlx5e_direct_tir {
+   u32  tirn;
+   u32  rqtn;
+};
+
 struct mlx5e_priv {
/* priv data path fields - start */
struct mlx5e_sq**txq_to_sq_map;
@@ -470,8 +468,9 @@ struct mlx5e_priv {
 
struct mlx5e_channel **channel;
	u32 tisn[MLX5E_MAX_NUM_TC];
-   u32 rqtn[MLX5E_NUM_RQT];
-   u32 tirn[MLX5E_NUM_TT];
+   u32 indir_rqtn;
+   u32 indir_tirn[MLX5E_NUM_INDIR_TIRS];
+   struct mlx5e_direct_tir direct_tir[MLX5E_MAX_NUM_CHANNELS];
 
struct mlx5e_flow_tables   fts;
struct mlx5e_eth_addr_db   eth_addr;
@@ -578,7 +577,7 @@ void mlx5e_disable_vlan_filter(struct mlx5e_priv *priv);
 
 int mlx5e_modify_rqs_vsd(struct mlx5e_priv *priv, bool vsd);
 
-int mlx5e_redirect_rqt(struct mlx5e_priv *priv, enum mlx5e_rqt_ix rqt_ix);
+int mlx5e_redirect_rqt(struct mlx5e_priv *priv, u32 rqtn, int sz, int ix);
 void mlx5e_build_tir_ctx_hash(void *tirc, struct mlx5e_priv *priv);
 
 int mlx5e_open_locked(struct net_device *netdev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index a06958a..498d407 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -826,9 +826,8 @@ static void mlx5e_modify_tirs_hash(struct mlx5e_priv *priv, 
void *in, int inlen)
MLX5_SET(modify_tir_in, in, bitmask.hash, 1);
mlx5e_build_tir_ctx_hash(tirc, priv);
 
-   for (i = 0; i < MLX5E_NUM_TT; i++)
-   if (IS_HASHING_TT(i))
-   mlx5_core_modify_tir(mdev, priv->tirn[i], in, inlen);
+   for (i = 0; i < MLX5E_NUM_INDIR_TIRS; i++)
+   mlx5_core_modify_tir(mdev, priv->indir_tirn[i], in, inlen);
 }
 
 static int mlx5e_set_rxfh(struct net_device *dev, const u32 *indir,
@@ -850,9 +849,11 @@ static int mlx5e_set_rxfh(struct net_device *dev, const 
u32 *indir,
	mutex_lock(&priv->state_lock);
 
if (indir) {
+   u32 rqtn = priv->indir_rqtn;
+
memcpy(priv->params.indirection_rqt, indir,
   sizeof(priv->params.indirection_rqt));
-   mlx5e_redirect_rqt(priv, MLX5E_INDIRECTION_RQT);
+   mlx5e_redirect_rqt(priv, rqtn, MLX5E_INDIR_RQT_SIZE, 0);
}
 
if (key)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
index d00a242..4df49e6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
@@ -247,7 +247,7 @@ static int __mlx5e_add_eth_addr_rule(struct mlx5e_priv 
*priv,
   outer_headers.dmac_47_16);
u8 *mv_dmac = MLX5_ADDR_OF(fte_match_param, mv,
   outer_headers.dmac_47_16);
-   u32 *tirn = priv->tirn;
+   u32 *tirn = priv->indir_tirn;
u32 tt_vec;
int err = 0;
 
@@ -274,7 +274,7 @@ static int __mlx5e_add_eth_addr_rule(struct mlx5e_priv 
*priv,
 
if (tt_vec & BIT(MLX5E_TT_ANY)) {
	rule_p = &ai->ft_rule[MLX5E_TT_ANY];
-   dest.tir_num = tirn[MLX5E_TT_ANY];

[PATCH net-next 08/12] net/mlx5e: Split the main flow steering table

2016-04-28 Thread Saeed Mahameed
From: Maor Gottlieb 

Currently, the main flow table is used for two purposes:
one is to do mac filtering and the other is to classify
the packet l3-l4 header in order to steer the packet to
the right RSS TIR.

This design is very complex: for each configured mac address we
have to add eleven rules (one rule for each traffic type), and the
same applies if the device is put into promiscuous/allmulti mode.
This scheme isn't scalable for future features like aRFS.

In order to simplify it, the main flow table is split to two flow
tables:
1. l2 table - filter the packet dmac address, if there is a match
we forward to the ttc flow table.

2. TTC (Traffic Type Classifier) table - classify the traffic
type of the packet and steer the packet to the right TIR.

In this new design, when a new mac address is added, the driver adds
only one flow rule instead of eleven.

Signed-off-by: Maor Gottlieb 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   42 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_fs.c   |  979 +
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |2 +-
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c |2 +-
 4 files changed, 462 insertions(+), 563 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 02b9644..2c9879c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -399,31 +399,18 @@ struct mlx5e_vxlan_db {
struct radix_tree_root  tree;
 };
 
-struct mlx5e_eth_addr_info {
+struct mlx5e_l2_rule {
u8  addr[ETH_ALEN + 2];
-   u32 tt_vec;
-   struct mlx5_flow_rule *ft_rule[MLX5E_NUM_TT];
+   struct mlx5_flow_rule *rule;
 };
 
-#define MLX5E_ETH_ADDR_HASH_SIZE (1 << BITS_PER_BYTE)
-
 struct mlx5e_flow_table {
int num_groups;
struct mlx5_flow_table *t;
struct mlx5_flow_group **g;
 };
 
-struct mlx5e_main_table {
-   struct mlx5e_flow_tableft;
-   struct hlist_head  netdev_uc[MLX5E_ETH_ADDR_HASH_SIZE];
-   struct hlist_head  netdev_mc[MLX5E_ETH_ADDR_HASH_SIZE];
-   struct mlx5e_eth_addr_info broadcast;
-   struct mlx5e_eth_addr_info allmulti;
-   struct mlx5e_eth_addr_info promisc;
-   bool   broadcast_enabled;
-   bool   allmulti_enabled;
-   bool   promisc_enabled;
-};
+#define MLX5E_L2_ADDR_HASH_SIZE BIT(BITS_PER_BYTE)
 
 struct mlx5e_tc_table {
struct mlx5_flow_table  *t;
@@ -441,11 +428,30 @@ struct mlx5e_vlan_table {
bool  filter_disabled;
 };
 
+struct mlx5e_l2_table {
+   struct mlx5e_flow_table ft;
+   struct hlist_head  netdev_uc[MLX5E_L2_ADDR_HASH_SIZE];
+   struct hlist_head  netdev_mc[MLX5E_L2_ADDR_HASH_SIZE];
+   struct mlx5e_l2_rule   broadcast;
+   struct mlx5e_l2_rule   allmulti;
+   struct mlx5e_l2_rule   promisc;
+   bool   broadcast_enabled;
+   bool   allmulti_enabled;
+   bool   promisc_enabled;
+};
+
+/* L3/L4 traffic type classifier */
+struct mlx5e_ttc_table {
+   struct mlx5e_flow_table  ft;
+   struct mlx5_flow_rule*rules[MLX5E_NUM_TT];
+};
+
 struct mlx5e_flow_steering {
struct mlx5_flow_namespace  *ns;
struct mlx5e_tc_table   tc;
struct mlx5e_vlan_table vlan;
-   struct mlx5e_main_table main;
+   struct mlx5e_l2_table   l2;
+   struct mlx5e_ttc_table  ttc;
 };
 
 struct mlx5e_direct_tir {
@@ -563,7 +569,7 @@ void mlx5e_update_stats(struct mlx5e_priv *priv);
 
 int mlx5e_create_flow_steering(struct mlx5e_priv *priv);
 void mlx5e_destroy_flow_steering(struct mlx5e_priv *priv);
-void mlx5e_init_eth_addr(struct mlx5e_priv *priv);
+void mlx5e_init_l2_addr(struct mlx5e_priv *priv);
 void mlx5e_set_rx_mode_work(struct work_struct *work);
 
 void mlx5e_fill_hwstamp(struct mlx5e_tstamp *clock, u64 timestamp,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
index 3ee35b0..6e353b3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
@@ -37,9 +37,16 @@
 #include 
 #include "en.h"
 
+static int mlx5e_add_l2_flow_rule(struct mlx5e_priv *priv,
+ struct mlx5e_l2_rule *ai, int type);
+static void mlx5e_del_l2_flow_rule(struct mlx5e_priv *priv,
+  struct mlx5e_l2_rule *ai);
+
+/* NIC prio FTS */
 enum {
MLX5E_VLAN_FT_LEVEL = 0,
-   MLX5E_MAIN_FT_LEVEL
+   MLX5E_L2_FT_LEVEL,
+   MLX5E_TTC_FT_LEVEL
 };
 
 #define MLX5_SET_CFG(p, f, v) MLX5_SET(create_flow_group_in, p, f, v)
@@ -63,21 +70,21 @@ enum {
MLX5E_ACTION_DEL  = 2,
 };

[PATCH net-next 11/12] net/mlx5e: Add accelerated RFS support

2016-04-28 Thread Saeed Mahameed
From: Maor Gottlieb 

Implement ndo_rx_flow_steer ndo.
A new flow steering rule will be composed from the
skb 4-tuple and added to the hardware aRFS flow table.

Each rule is stored in an internal hash table; if such an
skb 4-tuple rule already exists, we update the corresponding
hardware steering rule with the new destination.

For garbage collection, rps_may_expire_flow will be
invoked for a limited number of old rules upon any
ndo_rx_flow_steer invocation.
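
A minimal sketch of such an expiry pass, reusing the fields and iteration
macro introduced in this patch (the quota constant and the function name
are illustrative assumptions, and rps_may_expire_flow() is only available
under CONFIG_RFS_ACCEL):

	#define ARFS_EXPIRY_QUOTA 60	/* assumed per-pass bound */

	static void arfs_expire_rules(struct mlx5e_priv *priv)
	{
		HLIST_HEAD(del_list);
		struct arfs_rule *rule;
		struct hlist_node *htmp;
		int quota = 0;
		int i, j;

		spin_lock_bh(&priv->fs.arfs.arfs_lock);
		mlx5e_for_each_arfs_rule(rule, htmp,
					 priv->fs.arfs.arfs_tables, i, j) {
			/* ask the stack whether the flow is still in use */
			if (!work_pending(&rule->arfs_work) &&
			    rps_may_expire_flow(priv->netdev, rule->rxq,
						rule->flow_id,
						rule->filter_id)) {
				hlist_del_init(&rule->hlist);
				hlist_add_head(&rule->hlist, &del_list);
				if (quota++ > ARFS_EXPIRY_QUOTA)
					break;
			}
		}
		spin_unlock_bh(&priv->fs.arfs.arfs_lock);

		/* delete the HW rules outside the spinlock */
		hlist_for_each_entry_safe(rule, htmp, &del_list, hlist) {
			if (rule->rule)
				mlx5_del_flow_rule(rule->rule);
			hlist_del(&rule->hlist);
			kfree(rule);
		}
	}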

Signed-off-by: Maor Gottlieb 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   10 +
 drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c |  427 -
 2 files changed, 436 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 999e058..21c3841 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -448,9 +448,12 @@ struct mlx5e_ttc_table {
struct mlx5_flow_rule*rules[MLX5E_NUM_TT];
 };
 
+#define ARFS_HASH_SHIFT BITS_PER_BYTE
+#define ARFS_HASH_SIZE BIT(BITS_PER_BYTE)
 struct arfs_table {
struct mlx5e_flow_table  ft;
struct mlx5_flow_rule*default_rule;
+   struct hlist_head rules_hash[ARFS_HASH_SIZE];
 };
 
 enum  arfs_type {
@@ -463,6 +466,11 @@ enum  arfs_type {
 
 struct mlx5e_arfs_tables {
struct arfs_table arfs_tables[ARFS_NUM_TYPES];
+   /* Protect aRFS rules list */
+   spinlock_t arfs_lock;
+   struct list_head   rules;
+   int last_filter_id;
+   struct workqueue_struct *wq;
 };
 
 /* NIC prio FTS */
@@ -685,6 +693,8 @@ static inline void mlx5e_arfs_destroy_tables(struct 
mlx5e_priv *priv) {}
 #else
 int mlx5e_arfs_create_tables(struct mlx5e_priv *priv);
 void mlx5e_arfs_destroy_tables(struct mlx5e_priv *priv);
+int mlx5e_rx_flow_steer(struct net_device *dev, const struct sk_buff *skb,
+   u16 rxq_index, u32 flow_id);
 #endif
 
 u16 mlx5e_get_max_inline_cap(struct mlx5_core_dev *mdev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c
index cd50419..e54fbc1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c
@@ -30,8 +30,47 @@
  * SOFTWARE.
  */
 
-#include "en.h"
+#include 
 #include 
+#include 
+#include 
+#include "en.h"
+
+struct arfs_tuple {
+   __be16 etype;
+   u8 ip_proto;
+   union {
+   __be32 src_ipv4;
+   struct in6_addr src_ipv6;
+   };
+   union {
+   __be32 dst_ipv4;
+   struct in6_addr dst_ipv6;
+   };
+   __be16 src_port;
+   __be16 dst_port;
+};
+
+struct arfs_rule {
+   struct mlx5e_priv   *priv;
+   struct work_struct  arfs_work;
+   struct mlx5_flow_rule   *rule;
+   struct hlist_node   hlist;
+   int rxq;
+   /* Flow ID passed to ndo_rx_flow_steer */
+   int flow_id;
+   /* Filter ID returned by ndo_rx_flow_steer */
+   int filter_id;
+   struct arfs_tuple   tuple;
+};
+
+#define mlx5e_for_each_arfs_rule(hn, tmp, arfs_tables, i, j) \
+   for (i = 0; i < ARFS_NUM_TYPES; i++) \
+   mlx5e_for_each_hash_arfs_rule(hn, tmp, \
+ arfs_tables[i].rules_hash, j)
+
+#define mlx5e_for_each_hash_arfs_rule(hn, tmp, hash, j) \
+   for (j = 0; j < ARFS_HASH_SIZE; j++) \
+   hlist_for_each_entry_safe(hn, tmp, &hash[j], hlist)
 
 static void arfs_destroy_table(struct arfs_table *arfs_t)
 {
@@ -39,12 +78,17 @@ static void arfs_destroy_table(struct arfs_table *arfs_t)
	mlx5e_destroy_flow_table(&arfs_t->ft);
 }
 
+static void arfs_del_rules(struct mlx5e_priv *priv);
+
 void mlx5e_arfs_destroy_tables(struct mlx5e_priv *priv)
 {
int i;
 
if (!(priv->netdev->hw_features & NETIF_F_NTUPLE))
return;
+
+   arfs_del_rules(priv);
+   destroy_workqueue(priv->fs.arfs.wq);
for (i = 0; i < ARFS_NUM_TYPES; i++) {
if (!IS_ERR_OR_NULL(priv->fs.arfs.arfs_tables[i].ft.t))
			arfs_destroy_table(&priv->fs.arfs.arfs_tables[i]);
@@ -239,6 +283,12 @@ int mlx5e_arfs_create_tables(struct mlx5e_priv *priv)
if (!(priv->netdev->hw_features & NETIF_F_NTUPLE))
return 0;
 
+   spin_lock_init(&priv->fs.arfs.arfs_lock);
+   INIT_LIST_HEAD(&priv->fs.arfs.rules);
+   priv->fs.arfs.wq = create_singlethread_workqueue("mlx5e_arfs");
+   if (!priv->fs.arfs.wq)
+   return -ENOMEM;
+
for (i = 0; i < ARFS_NUM_TYPES; i++) {
err = arfs_create_table(priv, i);
if (err)
@@ -249,3 +299,378 @@ err:
mlx5e_arfs_destroy_tables(priv);
return err;
 }
+
+#define 

[PATCH net-next 12/12] net/mlx5e: Enabling aRFS mechanism

2016-04-28 Thread Saeed Mahameed
From: Maor Gottlieb 

Accelerated RFS requires that ntuple filtering is enabled via
ethtool and that the driver supports ndo_rx_flow_steer.
When ntuple filtering is enabled, we modify the l3_l4 ttc
rules to point at the aRFS flow tables, and when the filtering
is disabled, we modify the l3_l4 ttc rules to point at the RSS
TIRs.

Signed-off-by: Maor Gottlieb 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |   12 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c  |   77 +++-
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |   15 
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   25 +++
 4 files changed, 127 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 21c3841..34523c4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -690,9 +690,21 @@ static inline int mlx5e_arfs_create_tables(struct 
mlx5e_priv *priv)
 }
 
 static inline void mlx5e_arfs_destroy_tables(struct mlx5e_priv *priv) {}
+
+static inline int mlx5e_arfs_enable(struct mlx5e_priv *priv)
+{
+   return -ENOTSUPP;
+}
+
+static inline int mlx5e_arfs_disable(struct mlx5e_priv *priv)
+{
+   return -ENOTSUPP;
+}
 #else
 int mlx5e_arfs_create_tables(struct mlx5e_priv *priv);
 void mlx5e_arfs_destroy_tables(struct mlx5e_priv *priv);
+int mlx5e_arfs_enable(struct mlx5e_priv *priv);
+int mlx5e_arfs_disable(struct mlx5e_priv *priv);
 int mlx5e_rx_flow_steer(struct net_device *dev, const struct sk_buff *skb,
u16 rxq_index, u32 flow_id);
 #endif
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c
index e54fbc1..b4ae0fe 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c
@@ -72,14 +72,87 @@ struct arfs_rule {
for (j = 0; j < ARFS_HASH_SIZE; j++) \
hlist_for_each_entry_safe(hn, tmp, [j], hlist)
 
+static enum mlx5e_traffic_types arfs_get_tt(enum arfs_type type)
+{
+   switch (type) {
+   case ARFS_IPV4_TCP:
+   return MLX5E_TT_IPV4_TCP;
+   case ARFS_IPV4_UDP:
+   return MLX5E_TT_IPV4_UDP;
+   case ARFS_IPV6_TCP:
+   return MLX5E_TT_IPV6_TCP;
+   case ARFS_IPV6_UDP:
+   return MLX5E_TT_IPV6_UDP;
+   default:
+   return -EINVAL;
+   }
+}
+
+static int arfs_disable(struct mlx5e_priv *priv)
+{
+   struct mlx5_flow_destination dest;
+   u32 *tirn = priv->indir_tirn;
+   int err = 0;
+   int tt;
+   int i;
+
+   dest.type = MLX5_FLOW_DESTINATION_TYPE_TIR;
+   for (i = 0; i < ARFS_NUM_TYPES; i++) {
+   dest.tir_num = tirn[i];
+   tt = arfs_get_tt(i);
+   /* Modify ttc rules destination to bypass the aRFS tables*/
+   err = mlx5_modify_rule_destination(priv->fs.ttc.rules[tt],
+  &dest);
+   if (err) {
+   netdev_err(priv->netdev,
+  "%s: modify ttc destination failed\n",
+  __func__);
+   return err;
+   }
+   }
+   return 0;
+}
+
+static void arfs_del_rules(struct mlx5e_priv *priv);
+
+int mlx5e_arfs_disable(struct mlx5e_priv *priv)
+{
+   arfs_del_rules(priv);
+
+   return arfs_disable(priv);
+}
+
+int mlx5e_arfs_enable(struct mlx5e_priv *priv)
+{
+   struct mlx5_flow_destination dest;
+   int err = 0;
+   int tt;
+   int i;
+
+   dest.type = MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE;
+   for (i = 0; i < ARFS_NUM_TYPES; i++) {
+   dest.ft = priv->fs.arfs.arfs_tables[i].ft.t;
+   tt = arfs_get_tt(i);
+   /* Modify ttc rules destination to point on the aRFS FTs */
+   err = mlx5_modify_rule_destination(priv->fs.ttc.rules[tt],
+  &dest);
+   if (err) {
+   netdev_err(priv->netdev,
+  "%s: modify ttc destination failed err=%d\n",
+  __func__, err);
+   arfs_disable(priv);
+   return err;
+   }
+   }
+   return 0;
+}
+
 static void arfs_destroy_table(struct arfs_table *arfs_t)
 {
mlx5_del_flow_rule(arfs_t->default_rule);
mlx5e_destroy_flow_table(_t->ft);
 }
 
-static void arfs_del_rules(struct mlx5e_priv *priv);
-
 void mlx5e_arfs_destroy_tables(struct mlx5e_priv *priv)
 {
int i;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 498d407..534d99e 100644
--- 

[PATCH net-next 00/12] Mellanox 100G mlx5 ethernet aRFS support

2016-04-28 Thread Saeed Mahameed
Hi Dave,

This series adds accelerated RFS support for the mlx5e driver.
I have added one patch not related to aRFS that fixes the rtnl_lock
warning the mlx5 driver has been getting since b7aade15485a ('vxlan: break
dependency with netdev drivers').

aRFS support in details:

A direct TIR per RQ is now required in order to have the essential building 
blocks
for aRFS.  Today the driver has one direct TIR that forwards traffic to RQ[0] 
(core 0),
and one indirect TIR for RSS indirection table.  For that we've added one 
direct TIR
per RQ, e.g.: TIR[i] -> RQ[i] (core i).

Publicize Modify flow rule destination and reveal it in flow steering API, to 
have the 
ability to dynamically modify the destination TIR(core) for aRFS rules from the 
ethernet driver.

Initializing CPU reverse mapping to notify upper layer on internal receive 
queue cpu
mappings.

Some design refactoring was done for the mlx5e ethernet driver flow tables
and flow steering API.
Now the caller of create_flow_table can choose the level of the flow table;
this way we will create the mlx5e flow tables in reversed order and connect
them as we go: we create flow table[i+1] before flow table[i] to be able to
set flow table[i+1] as a destination of flow table[i] once flow table[i] is
created.
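
For illustration only, a compressed sketch of that reverse-order creation
(error handling elided; create_ft() and add_miss_rule() are placeholder
names, not the actual flow steering API calls):

	/* Sketch: build the deepest table first so each earlier table
	 * can already point at its successor when it is created.
	 */
	ttc_ft  = create_ft(ns, MLX5E_TTC_FT_LEVEL);	/* created first */
	l2_ft   = create_ft(ns, MLX5E_L2_FT_LEVEL);
	add_miss_rule(l2_ft, ttc_ft);			/* l2 miss -> ttc */
	vlan_ft = create_ft(ns, MLX5E_VLAN_FT_LEVEL);
	add_miss_rule(vlan_ft, l2_ft);			/* vlan -> l2 */
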
also we have split the main flow table in the following manner:
- From before: RX packet had to visit two flow tables until it is delivered 
to its receive queue:
RX packet -> vlan filter flow table -> main flow table.
> vlan filter will check that the packet vlan field is allowed.
> main flow will check if the dest mac is allowed and will check the 
l3/l4 headers to 
retrieve the RSS hash for steering the packet into its final receive 
queue.

- Now main flow table is split into l2 dst mac steering table and ttc 
(traffic type classifier) table:
RX packet -> vlan filter -> l2 table -> ttc table
> vlan filter - same as before
> L2 filter - filter packets according to their destination mac address
> ttc table - classify packet headers for RSS steering
- L3/L4 classification rules to steer the packet according to their
headers hash
- in case none of the rules applies, the packet is steered to
RQ[0]

After the above refactoring, all that is left to do is to create the aRFS
flow table, which will manage the aRFS steering rules that forward traffic
to the desired RQ (core), and to connect the ttc table rules' destinations
to the aRFS flow table.

In case of a miss, the aRFS flow table will deliver the traffic to the
core that the original ttc hash would have chosen.

The TTC table is not initialized and enabled until the user explicitly asks
for it, i.e. sets NETIF_F_NTUPLE to ON.  This way there is no need for the
ttc table to forward traffic to the aRFS table unless required.
When it is set back to OFF, the aRFS flow table is disabled and disconnected.

Thanks,
Saeed

Maor Gottlieb (10):
  net/mlx5: Introduce modify flow rule destination
  net/mlx5: Set number of allowed levels in priority
  net/mlx5: Add user chosen levels when allocating flow tables
  net/mlx5: Support different attributes for priorities in namespace
  net/mlx5e: Refactor mlx5e flow steering structs
  net/mlx5e: Split the main flow steering table
  net/mlx5: Initializing CPU reverse mapping
  net/mlx5e: Create aRFS flow tables
  net/mlx5e: Add accelerated RFS support
  net/mlx5e: Enabling aRFS mechanism

Matthew Finlay (1):
  net/mlx5e: Call vxlan_get_rx_port() with rtnl lock

Tariq Toukan (1):
  net/mlx5e: Direct TIR per RQ

 drivers/infiniband/hw/mlx5/main.c  |3 +-
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |1 +
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  171 +++-
 drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c  |  749 ++
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |   24 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_fs.c| 1060 +---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  346 ---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c|   46 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.h|2 +-
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  |2 +-
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c  |  149 ++--
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.h  |2 +-
 drivers/net/ethernet/mellanox/mlx5/core/main.c |   18 +
 include/linux/mlx5/driver.h|3 +
 include/linux/mlx5/fs.h|9 +-
 15 files changed, 1736 insertions(+), 849 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c



[PATCH net-next 01/12] net/mlx5e: Call vxlan_get_rx_port() with rtnl lock

2016-04-28 Thread Saeed Mahameed
From: Matthew Finlay 

Hold the rtnl lock when calling vxlan_get_rx_port().

Fixes: b7aade15485a ("vxlan: break dependency with netdev drivers")
Signed-off-by: Matthew Finlay 
Reported-by: Alexander Duyck 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 8484ac4..8ffaf5c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2936,8 +2936,11 @@ static void *mlx5e_create_netdev(struct mlx5_core_dev 
*mdev)
goto err_tc_cleanup;
}
 
-   if (mlx5e_vxlan_allowed(mdev))
+   if (mlx5e_vxlan_allowed(mdev)) {
+   rtnl_lock();
vxlan_get_rx_port(netdev);
+   rtnl_unlock();
+   }
 
mlx5e_enable_async_events(priv);
	schedule_work(&priv->set_rx_mode_work);
-- 
1.7.1



[PATCH net-next 09/12] net/mlx5: Initializing CPU reverse mapping

2016-04-28 Thread Saeed Mahameed
From: Maor Gottlieb 

Allocate a CPU rmap and add an entry for each IRQ.
The CPU rmap is used in aRFS to get the RX queue number
of the RX completion interrupts.

Signed-off-by: Maor Gottlieb 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |3 +++
 drivers/net/ethernet/mellanox/mlx5/core/main.c|   18 ++
 include/linux/mlx5/driver.h   |3 +++
 3 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 07c596d..a2b3297 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -1689,6 +1689,9 @@ int mlx5e_open_locked(struct net_device *netdev)
mlx5e_redirect_rqts(priv);
mlx5e_update_carrier(priv);
mlx5e_timestamp_init(priv);
+#ifdef CONFIG_RFS_ACCEL
+   priv->netdev->rx_cpu_rmap = priv->mdev->rmap;
+#endif
 
	schedule_delayed_work(&priv->update_stats_work, 0);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 6892746..6feef7f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -48,6 +48,9 @@
 #include 
 #include 
 #include 
+#ifdef CONFIG_RFS_ACCEL
+#include 
+#endif
 #include "mlx5_core.h"
 #include "fs_core.h"
 #ifdef CONFIG_MLX5_CORE_EN
@@ -665,6 +668,12 @@ static void free_comp_eqs(struct mlx5_core_dev *dev)
	struct mlx5_eq_table *table = &dev->priv.eq_table;
struct mlx5_eq *eq, *n;
 
+#ifdef CONFIG_RFS_ACCEL
+   if (dev->rmap) {
+   free_irq_cpu_rmap(dev->rmap);
+   dev->rmap = NULL;
+   }
+#endif
	spin_lock(&table->lock);
	list_for_each_entry_safe(eq, n, &table->comp_eqs_list, list) {
		list_del(&eq->list);
@@ -691,6 +700,11 @@ static int alloc_comp_eqs(struct mlx5_core_dev *dev)
INIT_LIST_HEAD(>comp_eqs_list);
ncomp_vec = table->num_comp_vectors;
nent = MLX5_COMP_EQ_SIZE;
+#ifdef CONFIG_RFS_ACCEL
+   dev->rmap = alloc_irq_cpu_rmap(ncomp_vec);
+   if (!dev->rmap)
+   return -ENOMEM;
+#endif
for (i = 0; i < ncomp_vec; i++) {
eq = kzalloc(sizeof(*eq), GFP_KERNEL);
if (!eq) {
@@ -698,6 +712,10 @@ static int alloc_comp_eqs(struct mlx5_core_dev *dev)
goto clean;
}
 
+#ifdef CONFIG_RFS_ACCEL
+   irq_cpu_rmap_add(dev->rmap,
+dev->priv.msix_arr[i + MLX5_EQ_VEC_COMP_BASE].vector);
+#endif
snprintf(name, MLX5_MAX_IRQ_NAME, "mlx5_comp%d", i);
err = mlx5_create_map_eq(dev, eq,
 i + MLX5_EQ_VEC_COMP_BASE, nent, 0,
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 96a428d..d552944 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -560,6 +560,9 @@ struct mlx5_core_dev {
struct mlx5_profile *profile;
	atomic_t num_qps;
u32 issi;
+#ifdef CONFIG_RFS_ACCEL
+   struct cpu_rmap *rmap;
+#endif
 };
 
 struct mlx5_db {
-- 
1.7.1



[PATCH net-next 06/12] net/mlx5: Support different attributes for priorities in namespace

2016-04-28 Thread Saeed Mahameed
From: Maor Gottlieb 

Currently, a namespace could only be initialized
with priorities that have the same attributes.
Add support for initializing a namespace with priorities
that have different attributes (e.g. a different number of levels).

Signed-off-by: Maor Gottlieb 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c |   31 +
 1 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index ca55d7e..2b82293 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -74,9 +74,10 @@
 #define BY_PASS_MIN_LEVEL (KERNEL_MIN_LEVEL + MLX5_BY_PASS_NUM_PRIOS +\
   LEFTOVERS_NUM_PRIOS)
 
-#define KERNEL_NUM_LEVELS 3
-#define KERNEL_NUM_PRIOS 2
-#define KERNEL_MIN_LEVEL 2
+#define KERNEL_NIC_PRIO_NUM_LEVELS 2
+#define KERNEL_NIC_NUM_PRIOS 1
+/* One more level for tc */
+#define KERNEL_MIN_LEVEL (KERNEL_NIC_PRIO_NUM_LEVELS + 1)
 
 #define ANCHOR_NUM_LEVELS 1
 #define ANCHOR_NUM_PRIOS 1
@@ -106,8 +107,9 @@ static struct init_tree_node {
 ADD_NS(ADD_MULTIPLE_PRIO(MLX5_BY_PASS_NUM_PRIOS,
  BY_PASS_PRIO_NUM_LEVELS))),
ADD_PRIO(0, KERNEL_MIN_LEVEL, 0, {},
-ADD_NS(ADD_MULTIPLE_PRIO(KERNEL_NUM_PRIOS,
- KERNEL_NUM_LEVELS))),
+ADD_NS(ADD_MULTIPLE_PRIO(1, 1),
+   ADD_MULTIPLE_PRIO(KERNEL_NIC_NUM_PRIOS,
+ KERNEL_NIC_PRIO_NUM_LEVELS))),
ADD_PRIO(0, BY_PASS_MIN_LEVEL, 0,
 
FS_REQUIRED_CAPS(FS_CAP(flow_table_properties_nic_receive.flow_modify_en),
  
FS_CAP(flow_table_properties_nic_receive.modify_root),
@@ -1375,14 +1377,14 @@ static struct mlx5_flow_namespace 
*fs_create_namespace(struct fs_prio *prio)
return ns;
 }
 
-static int create_leaf_prios(struct mlx5_flow_namespace *ns, struct 
init_tree_node
-*prio_metadata)
+static int create_leaf_prios(struct mlx5_flow_namespace *ns, int prio,
+struct init_tree_node *prio_metadata)
 {
struct fs_prio *fs_prio;
int i;
 
for (i = 0; i < prio_metadata->num_leaf_prios; i++) {
-   fs_prio = fs_create_prio(ns, i, prio_metadata->num_levels);
+   fs_prio = fs_create_prio(ns, prio++, prio_metadata->num_levels);
if (IS_ERR(fs_prio))
return PTR_ERR(fs_prio);
}
@@ -1409,7 +1411,7 @@ static int init_root_tree_recursive(struct mlx5_core_dev 
*dev,
struct init_tree_node *init_node,
struct fs_node *fs_parent_node,
struct init_tree_node *init_parent_node,
-   int index)
+   int prio)
 {
int max_ft_level = MLX5_CAP_FLOWTABLE(dev,
  flow_table_properties_nic_receive.
@@ -1427,8 +1429,8 @@ static int init_root_tree_recursive(struct mlx5_core_dev 
*dev,
 
fs_get_obj(fs_ns, fs_parent_node);
if (init_node->num_leaf_prios)
-   return create_leaf_prios(fs_ns, init_node);
-   fs_prio = fs_create_prio(fs_ns, index, init_node->num_levels);
+   return create_leaf_prios(fs_ns, prio, init_node);
+   fs_prio = fs_create_prio(fs_ns, prio, init_node->num_levels);
if (IS_ERR(fs_prio))
return PTR_ERR(fs_prio);
	base = &fs_prio->node;
@@ -1441,11 +1443,16 @@ static int init_root_tree_recursive(struct 
mlx5_core_dev *dev,
} else {
return -EINVAL;
}
+   prio = 0;
for (i = 0; i < init_node->ar_size; i++) {
		err = init_root_tree_recursive(dev, &init_node->children[i],
-  base, init_node, i);
+  base, init_node, prio);
if (err)
return err;
+   if (init_node->children[i].type == FS_TYPE_PRIO &&
+   init_node->children[i].num_leaf_prios) {
+   prio += init_node->children[i].num_leaf_prios;
+   }
}
 
return 0;
-- 
1.7.1



[PATCH net-next 04/12] net/mlx5: Set number of allowed levels in priority

2016-04-28 Thread Saeed Mahameed
From: Maor Gottlieb 

Refactor the flow steering namespace creation
by changing the name num_fts to num_levels.
When a new flow table is created, the driver assigns a new level
to this flow table, therefore the meaning is equivalent.
Since downstream patches will introduce the ability to create more
than one flow table per level, the name num_fts is no
longer accurate.

Signed-off-by: Maor Gottlieb 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c |   61 +++--
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.h |2 +-
 2 files changed, 33 insertions(+), 30 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index bb2c1cd..cfb35c3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -40,18 +40,18 @@
 #define INIT_TREE_NODE_ARRAY_SIZE(...) (sizeof((struct 
init_tree_node[]){__VA_ARGS__}) /\
 sizeof(struct init_tree_node))
 
-#define ADD_PRIO(num_prios_val, min_level_val, max_ft_val, caps_val,\
+#define ADD_PRIO(num_prios_val, min_level_val, num_levels_val, caps_val,\
 ...) {.type = FS_TYPE_PRIO,\
.min_ft_level = min_level_val,\
-   .max_ft = max_ft_val,\
+   .num_levels = num_levels_val,\
.num_leaf_prios = num_prios_val,\
.caps = caps_val,\
.children = (struct init_tree_node[]) {__VA_ARGS__},\
.ar_size = INIT_TREE_NODE_ARRAY_SIZE(__VA_ARGS__) \
 }
 
-#define ADD_MULTIPLE_PRIO(num_prios_val, max_ft_val, ...)\
-   ADD_PRIO(num_prios_val, 0, max_ft_val, {},\
+#define ADD_MULTIPLE_PRIO(num_prios_val, num_levels_val, ...)\
+   ADD_PRIO(num_prios_val, 0, num_levels_val, {},\
 __VA_ARGS__)\
 
 #define ADD_NS(...) {.type = FS_TYPE_NAMESPACE,\
@@ -67,17 +67,18 @@
 #define FS_REQUIRED_CAPS(...) {.arr_sz = INIT_CAPS_ARRAY_SIZE(__VA_ARGS__), \
   .caps = (long[]) {__VA_ARGS__} }
 
-#define LEFTOVERS_MAX_FT 1
+#define LEFTOVERS_NUM_LEVELS 1
 #define LEFTOVERS_NUM_PRIOS 1
-#define BY_PASS_PRIO_MAX_FT 1
-#define BY_PASS_MIN_LEVEL (KENREL_MIN_LEVEL + MLX5_BY_PASS_NUM_PRIOS +\
-  LEFTOVERS_MAX_FT)
 
-#define KERNEL_MAX_FT 3
+#define BY_PASS_PRIO_NUM_LEVELS 1
+#define BY_PASS_MIN_LEVEL (KERNEL_MIN_LEVEL + MLX5_BY_PASS_NUM_PRIOS +\
+  LEFTOVERS_NUM_PRIOS)
+
+#define KERNEL_NUM_LEVELS 3
 #define KERNEL_NUM_PRIOS 2
-#define KENREL_MIN_LEVEL 2
+#define KERNEL_MIN_LEVEL 2
 
-#define ANCHOR_MAX_FT 1
+#define ANCHOR_NUM_LEVELS 1
 #define ANCHOR_NUM_PRIOS 1
 #define ANCHOR_MIN_LEVEL (BY_PASS_MIN_LEVEL + 1)
 struct node_caps {
@@ -92,7 +93,7 @@ static struct init_tree_node {
int min_ft_level;
int num_leaf_prios;
int prio;
-   int max_ft;
+   int num_levels;
 } root_fs = {
.type = FS_TYPE_NAMESPACE,
.ar_size = 4,
@@ -102,17 +103,19 @@ static struct init_tree_node {
  
FS_CAP(flow_table_properties_nic_receive.modify_root),
  
FS_CAP(flow_table_properties_nic_receive.identified_miss_table_mode),
  
FS_CAP(flow_table_properties_nic_receive.flow_table_modify)),
-ADD_NS(ADD_MULTIPLE_PRIO(MLX5_BY_PASS_NUM_PRIOS, 
BY_PASS_PRIO_MAX_FT))),
-   ADD_PRIO(0, KENREL_MIN_LEVEL, 0, {},
-ADD_NS(ADD_MULTIPLE_PRIO(KERNEL_NUM_PRIOS, 
KERNEL_MAX_FT))),
+ADD_NS(ADD_MULTIPLE_PRIO(MLX5_BY_PASS_NUM_PRIOS,
+ BY_PASS_PRIO_NUM_LEVELS))),
+   ADD_PRIO(0, KERNEL_MIN_LEVEL, 0, {},
+ADD_NS(ADD_MULTIPLE_PRIO(KERNEL_NUM_PRIOS,
+ KERNEL_NUM_LEVELS))),
ADD_PRIO(0, BY_PASS_MIN_LEVEL, 0,
 
FS_REQUIRED_CAPS(FS_CAP(flow_table_properties_nic_receive.flow_modify_en),
  
FS_CAP(flow_table_properties_nic_receive.modify_root),
  
FS_CAP(flow_table_properties_nic_receive.identified_miss_table_mode),
  
FS_CAP(flow_table_properties_nic_receive.flow_table_modify)),
-ADD_NS(ADD_MULTIPLE_PRIO(LEFTOVERS_NUM_PRIOS, 
LEFTOVERS_MAX_FT))),
+ADD_NS(ADD_MULTIPLE_PRIO(LEFTOVERS_NUM_PRIOS, 
LEFTOVERS_NUM_LEVELS))),
ADD_PRIO(0, ANCHOR_MIN_LEVEL, 0, {},
-ADD_NS(ADD_MULTIPLE_PRIO(ANCHOR_NUM_PRIOS, 
ANCHOR_MAX_FT))),
+ADD_NS(ADD_MULTIPLE_PRIO(ANCHOR_NUM_PRIOS, 
ANCHOR_NUM_LEVELS))),
}
 };
 
@@ -716,7 +719,7 @@ struct mlx5_flow_table *mlx5_create_flow_table(struct 
mlx5_flow_namespace *ns,
  

[PATCH net-next 07/12] net/mlx5e: Refactor mlx5e flow steering structs

2016-04-28 Thread Saeed Mahameed
From: Maor Gottlieb 

Slightly refactor and re-order the flow steering structs,
tables and databases for better self-containment and
flexibility to add more future steering phases
(tables/rules/databases), e.g. aRFS.

Changes:
1. Move the vlan DB and address DB into their table structs.
2. Rename steering table structs to unique format: mlx5e_*_table,
e.g: mlx5e_vlan_table.

Signed-off-by: Maor Gottlieb 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   73 +
 drivers/net/ethernet/mellanox/mlx5/core/en_fs.c   |  186 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |8 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c   |   45 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.h   |2 +-
 5 files changed, 160 insertions(+), 154 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 5c8e98c..02b9644 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -388,6 +388,17 @@ enum mlx5e_traffic_types {
MLX5E_NUM_INDIR_TIRS = MLX5E_TT_ANY,
 };
 
+enum {
+   MLX5E_STATE_ASYNC_EVENTS_ENABLE,
+   MLX5E_STATE_OPENED,
+   MLX5E_STATE_DESTROYING,
+};
+
+struct mlx5e_vxlan_db {
+   spinlock_t  lock; /* protect vxlan table */
+   struct radix_tree_root  tree;
+};
+
 struct mlx5e_eth_addr_info {
u8  addr[ETH_ALEN + 2];
u32 tt_vec;
@@ -396,7 +407,14 @@ struct mlx5e_eth_addr_info {
 
 #define MLX5E_ETH_ADDR_HASH_SIZE (1 << BITS_PER_BYTE)
 
-struct mlx5e_eth_addr_db {
+struct mlx5e_flow_table {
+   int num_groups;
+   struct mlx5_flow_table *t;
+   struct mlx5_flow_group **g;
+};
+
+struct mlx5e_main_table {
+   struct mlx5e_flow_tableft;
struct hlist_head  netdev_uc[MLX5E_ETH_ADDR_HASH_SIZE];
struct hlist_head  netdev_mc[MLX5E_ETH_ADDR_HASH_SIZE];
struct mlx5e_eth_addr_info broadcast;
@@ -407,13 +425,15 @@ struct mlx5e_eth_addr_db {
bool   promisc_enabled;
 };
 
-enum {
-   MLX5E_STATE_ASYNC_EVENTS_ENABLE,
-   MLX5E_STATE_OPENED,
-   MLX5E_STATE_DESTROYING,
+struct mlx5e_tc_table {
+   struct mlx5_flow_table  *t;
+
+   struct rhashtable_paramsht_params;
+   struct rhashtable   ht;
 };
 
-struct mlx5e_vlan_db {
+struct mlx5e_vlan_table {
+   struct mlx5e_flow_table ft;
unsigned long active_vlans[BITS_TO_LONGS(VLAN_N_VID)];
struct mlx5_flow_rule   *active_vlans_rule[VLAN_N_VID];
struct mlx5_flow_rule   *untagged_rule;
@@ -421,29 +441,11 @@ struct mlx5e_vlan_db {
bool  filter_disabled;
 };
 
-struct mlx5e_vxlan_db {
-   spinlock_t  lock; /* protect vxlan table */
-   struct radix_tree_root  tree;
-};
-
-struct mlx5e_flow_table {
-   int num_groups;
-   struct mlx5_flow_table  *t;
-   struct mlx5_flow_group  **g;
-};
-
-struct mlx5e_tc_flow_table {
-   struct mlx5_flow_table  *t;
-
-   struct rhashtable_paramsht_params;
-   struct rhashtable   ht;
-};
-
-struct mlx5e_flow_tables {
-   struct mlx5_flow_namespace  *ns;
-   struct mlx5e_tc_flow_table  tc;
-   struct mlx5e_flow_table vlan;
-   struct mlx5e_flow_table main;
+struct mlx5e_flow_steering {
+   struct mlx5_flow_namespace  *ns;
+   struct mlx5e_tc_table   tc;
+   struct mlx5e_vlan_table vlan;
+   struct mlx5e_main_table main;
 };
 
 struct mlx5e_direct_tir {
@@ -451,6 +453,11 @@ struct mlx5e_direct_tir {
u32  rqtn;
 };
 
+enum {
+   MLX5E_TC_PRIO = 0,
+   MLX5E_NIC_PRIO
+};
+
 struct mlx5e_priv {
/* priv data path fields - start */
struct mlx5e_sq**txq_to_sq_map;
@@ -472,9 +479,7 @@ struct mlx5e_priv {
u32indir_tirn[MLX5E_NUM_INDIR_TIRS];
struct mlx5e_direct_tirdirect_tir[MLX5E_MAX_NUM_CHANNELS];
 
-   struct mlx5e_flow_tables   fts;
-   struct mlx5e_eth_addr_db   eth_addr;
-   struct mlx5e_vlan_db   vlan;
+   struct mlx5e_flow_steering fs;
struct mlx5e_vxlan_db  vxlan;
 
struct mlx5e_paramsparams;
@@ -556,8 +561,8 @@ struct mlx5_cqe64 *mlx5e_get_cqe(struct mlx5e_cq *cq);
 
 void mlx5e_update_stats(struct mlx5e_priv *priv);
 
-int mlx5e_create_flow_tables(struct mlx5e_priv *priv);
-void mlx5e_destroy_flow_tables(struct mlx5e_priv *priv);
+int mlx5e_create_flow_steering(struct mlx5e_priv *priv);
+void mlx5e_destroy_flow_steering(struct mlx5e_priv *priv);
 void mlx5e_init_eth_addr(struct mlx5e_priv *priv);
 void mlx5e_set_rx_mode_work(struct work_struct *work);
 
diff --git 

[PATCH net-next 05/12] net/mlx5: Add user chosen levels when allocating flow tables

2016-04-28 Thread Saeed Mahameed
From: Maor Gottlieb 

Currently, consumers of the flow steering infrastructure can't
choose their own flow table levels and are limited to one
flow table per level. This just wastes levels.
Instead, we introduce here the possibility to use multiple
flow tables in a level. The user is free to connect these
flow tables, while following the rule that FTEs in an FT of
level x can only point to FTs of level y where y > x.
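
For illustration, a minimal sketch of what the new 'level' argument
allows (ns, prio and num_entries are assumed to exist in the caller;
this is not part of the patch):

	struct mlx5_flow_table *ft_a, *ft_b;

	/* two flow tables under the same priority, chained by level */
	ft_a = mlx5_create_flow_table(ns, prio, num_entries, 0); /* level 0 */
	ft_b = mlx5_create_flow_table(ns, prio, num_entries, 1); /* level 1 */

	/* FTEs in ft_a may now forward to ft_b (1 > 0), never the reverse */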

In addition, this patch switches the create/destroy order
of the NIC flow tables (vlan and main).

Signed-off-by: Maor Gottlieb 
Signed-off-by: Saeed Mahameed 
---
 drivers/infiniband/hw/mlx5/main.c |3 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_fs.c   |   30 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c   |3 +-
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c |2 +-
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c |   66 +---
 include/linux/mlx5/fs.h   |6 +-
 6 files changed, 70 insertions(+), 40 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 99eb1c1..3ff663c 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -1438,7 +1438,8 @@ static struct mlx5_ib_flow_prio *get_flow_table(struct 
mlx5_ib_dev *dev,
if (!ft) {
ft = mlx5_create_auto_grouped_flow_table(ns, priority,
 num_entries,
-num_groups);
+num_groups,
+0);
 
if (!IS_ERR(ft)) {
prio->refcount = 0;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
index 4df49e6..d61171a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
@@ -37,6 +37,11 @@
 #include 
 #include "en.h"
 
+enum {
+   MLX5E_VLAN_FT_LEVEL = 0,
+   MLX5E_MAIN_FT_LEVEL
+};
+
 #define MLX5_SET_CFG(p, f, v) MLX5_SET(create_flow_group_in, p, f, v)
 
 enum {
@@ -1041,7 +1046,8 @@ static int mlx5e_create_main_flow_table(struct mlx5e_priv 
*priv)
int err;
 
ft->num_groups = 0;
-   ft->t = mlx5_create_flow_table(priv->fts.ns, 1, MLX5E_MAIN_TABLE_SIZE);
+   ft->t = mlx5_create_flow_table(priv->fts.ns, 1, MLX5E_MAIN_TABLE_SIZE,
+  MLX5E_MAIN_FT_LEVEL);
 
if (IS_ERR(ft->t)) {
err = PTR_ERR(ft->t);
@@ -1150,7 +1156,8 @@ static int mlx5e_create_vlan_flow_table(struct mlx5e_priv 
*priv)
int err;
 
ft->num_groups = 0;
-   ft->t = mlx5_create_flow_table(priv->fts.ns, 1, MLX5E_VLAN_TABLE_SIZE);
+   ft->t = mlx5_create_flow_table(priv->fts.ns, 1, MLX5E_VLAN_TABLE_SIZE,
+  MLX5E_VLAN_FT_LEVEL);
 
if (IS_ERR(ft->t)) {
err = PTR_ERR(ft->t);
@@ -1167,11 +1174,16 @@ static int mlx5e_create_vlan_flow_table(struct 
mlx5e_priv *priv)
if (err)
goto err_free_g;
 
+   err = mlx5e_add_vlan_rule(priv, MLX5E_VLAN_RULE_TYPE_UNTAGGED, 0);
+   if (err)
+   goto err_destroy_vlan_flow_groups;
+
return 0;
 
+err_destroy_vlan_flow_groups:
+   mlx5e_destroy_groups(ft);
 err_free_g:
kfree(ft->g);
-
 err_destroy_vlan_flow_table:
mlx5_destroy_flow_table(ft->t);
ft->t = NULL;
@@ -1194,15 +1206,11 @@ int mlx5e_create_flow_tables(struct mlx5e_priv *priv)
if (!priv->fts.ns)
return -EINVAL;
 
-   err = mlx5e_create_vlan_flow_table(priv);
-   if (err)
-   return err;
-
err = mlx5e_create_main_flow_table(priv);
if (err)
-   goto err_destroy_vlan_flow_table;
+   return err;
 
-   err = mlx5e_add_vlan_rule(priv, MLX5E_VLAN_RULE_TYPE_UNTAGGED, 0);
+   err = mlx5e_create_vlan_flow_table(priv);
if (err)
goto err_destroy_main_flow_table;
 
@@ -1210,8 +1218,6 @@ int mlx5e_create_flow_tables(struct mlx5e_priv *priv)
 
 err_destroy_main_flow_table:
mlx5e_destroy_main_flow_table(priv);
-err_destroy_vlan_flow_table:
-   mlx5e_destroy_vlan_flow_table(priv);
 
return err;
 }
@@ -1219,6 +1225,6 @@ err_destroy_vlan_flow_table:
 void mlx5e_destroy_flow_tables(struct mlx5e_priv *priv)
 {
mlx5e_del_vlan_rule(priv, MLX5E_VLAN_RULE_TYPE_UNTAGGED, 0);
-   mlx5e_destroy_main_flow_table(priv);
mlx5e_destroy_vlan_flow_table(priv);
+   mlx5e_destroy_main_flow_table(priv);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index b3de09f..2137387 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ 

[PATCH net-next 10/12] net/mlx5e: Create aRFS flow tables

2016-04-28 Thread Saeed Mahameed
From: Maor Gottlieb 

Create the following four flow tables for aRFS usage:
1. IPv4 TCP - filtering 4-tuple of IPv4 TCP packets.
2. IPv6 TCP - filtering 4-tuple of IPv6 TCP packets.
3. IPv4 UDP - filtering 4-tuple of IPv4 UDP packets.
4. IPv6 UDP - filtering 4-tuple of IPv6 UDP packets.

Each flow table has two flow groups: one for the 4-tuple
filtering (full match) and the other containing a wildcard (*) rule
used as the miss rule.

A full-match rule means a hit for aRFS, and the packet will be
forwarded to the dedicated RQ/core; miss-rule packets will be
forwarded to the default RSS hashing.
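
For reference, a rough sketch of what such a miss rule amounts to with
the flow steering API of this series (ft, the zeroed match_c/match_v
buffers and the RSS tirn are assumed to be set up by the caller; this
is not the literal patch code):

	struct mlx5_flow_rule *rule;
	struct mlx5_flow_destination dest;

	dest.type = MLX5_FLOW_DESTINATION_TYPE_TIR;
	dest.tir_num = tirn;			/* default RSS TIR */

	/* empty match criteria == wildcard, i.e. the miss rule */
	rule = mlx5_add_flow_rule(ft, 0 /* match_criteria_enable */,
				  match_c, match_v,
				  MLX5_FLOW_CONTEXT_ACTION_FWD_DEST,
				  MLX5_FS_DEFAULT_FLOW_TAG, &dest);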

Signed-off-by: Maor Gottlieb 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/Makefile  |1 +
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   41 
 drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c |  251 +
 drivers/net/ethernet/mellanox/mlx5/core/en_fs.c   |   23 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |8 +-
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c |3 +-
 6 files changed, 313 insertions(+), 14 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile 
b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 4fc45ee..679e18f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -9,3 +9,4 @@ mlx5_core-$(CONFIG_MLX5_CORE_EN) += wq.o eswitch.o \
en_txrx.o en_clock.o vxlan.o en_tc.o
 
 mlx5_core-$(CONFIG_MLX5_CORE_EN_DCB) +=  en_dcbnl.o
+mlx5_core-$(CONFIG_RFS_ACCEL) +=  en_arfs.o
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 2c9879c..999e058 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -48,6 +48,8 @@
 #include "mlx5_core.h"
 #include "en_stats.h"
 
+#define MLX5_SET_CFG(p, f, v) MLX5_SET(create_flow_group_in, p, f, v)
+
 #define MLX5E_MAX_NUM_TC   8
 
 #define MLX5E_PARAMS_MINIMUM_LOG_SQ_SIZE0x6
@@ -446,12 +448,38 @@ struct mlx5e_ttc_table {
struct mlx5_flow_rule*rules[MLX5E_NUM_TT];
 };
 
+struct arfs_table {
+   struct mlx5e_flow_table  ft;
+   struct mlx5_flow_rule*default_rule;
+};
+
+enum  arfs_type {
+   ARFS_IPV4_TCP,
+   ARFS_IPV6_TCP,
+   ARFS_IPV4_UDP,
+   ARFS_IPV6_UDP,
+   ARFS_NUM_TYPES,
+};
+
+struct mlx5e_arfs_tables {
+   struct arfs_table arfs_tables[ARFS_NUM_TYPES];
+};
+
+/* NIC prio FTS */
+enum {
+   MLX5E_VLAN_FT_LEVEL = 0,
+   MLX5E_L2_FT_LEVEL,
+   MLX5E_TTC_FT_LEVEL,
+   MLX5E_ARFS_FT_LEVEL
+};
+
 struct mlx5e_flow_steering {
struct mlx5_flow_namespace  *ns;
struct mlx5e_tc_table   tc;
struct mlx5e_vlan_table vlan;
struct mlx5e_l2_table   l2;
struct mlx5e_ttc_table  ttc;
+   struct mlx5e_arfs_tablesarfs;
 };
 
 struct mlx5e_direct_tir {
@@ -570,6 +598,7 @@ void mlx5e_update_stats(struct mlx5e_priv *priv);
 int mlx5e_create_flow_steering(struct mlx5e_priv *priv);
 void mlx5e_destroy_flow_steering(struct mlx5e_priv *priv);
 void mlx5e_init_l2_addr(struct mlx5e_priv *priv);
+void mlx5e_destroy_flow_table(struct mlx5e_flow_table *ft);
 void mlx5e_set_rx_mode_work(struct work_struct *work);
 
 void mlx5e_fill_hwstamp(struct mlx5e_tstamp *clock, u64 timestamp,
@@ -646,6 +675,18 @@ extern const struct dcbnl_rtnl_ops mlx5e_dcbnl_ops;
 int mlx5e_dcbnl_ieee_setets_core(struct mlx5e_priv *priv, struct ieee_ets 
*ets);
 #endif
 
+#ifndef CONFIG_RFS_ACCEL
+static inline int mlx5e_arfs_create_tables(struct mlx5e_priv *priv)
+{
+   return 0;
+}
+
+static inline void mlx5e_arfs_destroy_tables(struct mlx5e_priv *priv) {}
+#else
+int mlx5e_arfs_create_tables(struct mlx5e_priv *priv);
+void mlx5e_arfs_destroy_tables(struct mlx5e_priv *priv);
+#endif
+
 u16 mlx5e_get_max_inline_cap(struct mlx5_core_dev *mdev);
 
 #endif /* __MLX5_EN_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c
new file mode 100644
index 000..cd50419
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c
@@ -0,0 +1,251 @@
+/*
+ * Copyright (c) 2016, Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this 

Re: [RFC PATCH 2/5] mlx5: Add support for UDP tunnel segmentation with outer checksum offload

2016-04-28 Thread Matthew Finlay





On 4/20/16, 11:06 AM, "Alexander Duyck"  wrote:

>On Wed, Apr 20, 2016 at 10:40 AM, Saeed Mahameed
> wrote:
>> On Tue, Apr 19, 2016 at 10:06 PM, Alexander Duyck  
>> wrote:
>>> This patch assumes that the mlx5 hardware will ignore existing IPv4/v6
>>> header fields for length and checksum as well as the length and checksum
>>> fields for outer UDP headers.
>>>
>>> I have no means of testing this as I do not have any mlx5 hardware but
>>> thought I would submit it as an RFC to see if anyone out there wants to
>>> test this and see if this does in fact enable this functionality allowing
>>> us to to segment UDP tunneled frames that have an outer checksum.
>>>
>>> Signed-off-by: Alexander Duyck 
>>> ---
>>>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c |7 ++-
>>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
>>> b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>> index e0adb604f461..57d8da796d50 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>> @@ -2390,13 +2390,18 @@ static void mlx5e_build_netdev(struct net_device 
>>> *netdev)
>>> netdev->hw_features  |= NETIF_F_HW_VLAN_CTAG_FILTER;
>>>
>>> if (mlx5e_vxlan_allowed(mdev)) {
>>> -   netdev->hw_features |= NETIF_F_GSO_UDP_TUNNEL;
>>> +   netdev->hw_features |= NETIF_F_GSO_UDP_TUNNEL |
>>> +  NETIF_F_GSO_UDP_TUNNEL_CSUM |
>>> +  NETIF_F_GSO_PARTIAL;
>>> netdev->hw_enc_features |= NETIF_F_IP_CSUM;
>>> netdev->hw_enc_features |= NETIF_F_RXCSUM;
>>> netdev->hw_enc_features |= NETIF_F_TSO;
>>> netdev->hw_enc_features |= NETIF_F_TSO6;
>>> netdev->hw_enc_features |= NETIF_F_RXHASH;
>>> netdev->hw_enc_features |= NETIF_F_GSO_UDP_TUNNEL;
>>> +   netdev->hw_enc_features |= NETIF_F_GSO_UDP_TUNNEL_CSUM |
>>> +  NETIF_F_GSO_PARTIAL;
>>> +   netdev->gso_partial_features = NETIF_F_GSO_UDP_TUNNEL_CSUM;
>>> }
>>>
>>> netdev->features  = netdev->hw_features;
>>>
>>
>> Hi Alex,
>>
>> Adding Matt, VxLAN feature owner from Mellanox,
>> Matt please correct me if am wrong, but We already tested GSO VxLAN
>> and we saw the TCP/IP checksum offloads for both inner and outer
>> headers handled by the hardware.
>>
>> And looking at mlx5e_sq_xmit:
>>
>> if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
>> eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
>> if (skb->encapsulation) {
>> eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
>> MLX5_ETH_WQE_L4_INNER_CSUM;
>> sq->stats.csum_offload_inner++;
>> } else {
>> eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
>> }
>>
>> We enable inner/outer hardware checksumming unconditionally without
>> looking at the features Alex is suggesting in this patch,
>> Alex, can you elaborate more on the meaning of those features ? and
>> why would it work for us without declaring them ?
>
>Well right now the feature list exposed by the device indicates that
>TSO is not used if a VxLAN tunnel has a checksum in an outer header.
>Since that is not exposed currently that is completely offloaded in
>software via GSO.

The mlx5 hardware requires that the outer UDP checksum not be set when
offloading encapsulated packets.

>
>What the GSO partial does is allow us to treat GSO for tunnels with
>checksum like it is GSO for tunnels without checksum by precomputing
>the UDP checksum as though the frame had already been segmented and
>restricting us to an even multiple of MSS bytes that are to be segmented
>between all the frames.  One side effect though is that all of the IP
>and UDP header fields are also precomputed, but from what I can tell
>it looks like the values that would be changed by a change in length
>are ignored or overwritten by the hardware and driver anyway.
>
>- Alex


[PATCH RFT v2 2/2] macb: kill PHY reset code

2016-04-28 Thread Sergei Shtylyov
With  the 'phylib' now  being aware of  the "reset-gpios" PHY node property,
there should be no need to frob the PHY reset in this  driver anymore...

Signed-off-by: Sergei Shtylyov 

---
 drivers/net/ethernet/cadence/macb.c |   17 -
 drivers/net/ethernet/cadence/macb.h |1 -
 2 files changed, 18 deletions(-)

Index: net-next/drivers/net/ethernet/cadence/macb.c
===
--- net-next.orig/drivers/net/ethernet/cadence/macb.c
+++ net-next/drivers/net/ethernet/cadence/macb.c
@@ -2884,7 +2884,6 @@ static int macb_probe(struct platform_de
  = macb_clk_init;
int (*init)(struct platform_device *) = macb_init;
struct device_node *np = pdev->dev.of_node;
-   struct device_node *phy_node;
const struct macb_config *macb_config = NULL;
struct clk *pclk, *hclk = NULL, *tx_clk = NULL;
unsigned int queue_mask, num_queues;
@@ -2977,18 +2976,6 @@ static int macb_probe(struct platform_de
else
macb_get_hwaddr(bp);
 
-   /* Power up the PHY if there is a GPIO reset */
-   phy_node =  of_get_next_available_child(np, NULL);
-   if (phy_node) {
-   int gpio = of_get_named_gpio(phy_node, "reset-gpios", 0);
-
-   if (gpio_is_valid(gpio)) {
-   bp->reset_gpio = gpio_to_desc(gpio);
-   gpiod_direction_output(bp->reset_gpio, 1);
-   }
-   }
-   of_node_put(phy_node);
-
err = of_get_phy_mode(np);
if (err < 0) {
pdata = dev_get_platdata(&pdev->dev);
@@ -3054,10 +3041,6 @@ static int macb_remove(struct platform_d
mdiobus_unregister(bp->mii_bus);
mdiobus_free(bp->mii_bus);
 
-   /* Shutdown the PHY if there is a GPIO reset */
-   if (bp->reset_gpio)
-   gpiod_set_value(bp->reset_gpio, 0);
-
unregister_netdev(dev);
clk_disable_unprepare(bp->tx_clk);
clk_disable_unprepare(bp->hclk);
Index: net-next/drivers/net/ethernet/cadence/macb.h
===
--- net-next.orig/drivers/net/ethernet/cadence/macb.h
+++ net-next/drivers/net/ethernet/cadence/macb.h
@@ -832,7 +832,6 @@ struct macb {
unsigned intdma_burst_length;
 
phy_interface_t phy_interface;
-   struct gpio_desc*reset_gpio;
 
/* AT91RM9200 transmit */
struct sk_buff *skb;/* holds skb until xmit 
interrupt completes */



[PATCH RFT 1/2] phylib: add device reset GPIO support

2016-04-28 Thread Sergei Shtylyov
The PHY devices sometimes do have their reset signal (maybe even power
supply?) tied to some GPIO and sometimes it also does happen that a boot
loader does not leave it deasserted. So far this issue has been attacked
from (as I believe) a wrong angle: by teaching the MAC driver to manipulate
the GPIO in question; that solution, when applied to the device trees, led
to adding the PHY reset GPIO properties to the MAC device node, with one
exception: Cadence MACB driver which could handle the "reset-gpios" prop
in a PHY device subnode. I believe that the correct approach is to teach
the 'phylib' to get the MDIO device reset GPIO from the device tree node
corresponding to this device -- which this patch is doing...
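
In device tree terms the idea is simply (hypothetical fragment, the
GPIO controller and pin number are made up for the example):

	ethernet-phy@1 {
		reg = <1>;
		reset-gpios = <&gpio2 7 GPIO_ACTIVE_LOW>;
	};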

Note that I had to modify the AT803x PHY driver as it would stop working
otherwise, since it made use of the reset GPIO for its own purposes...

Signed-off-by: Sergei Shtylyov 

---
Changes in version 2:
- reformatted the changelog;
- resolved rejects, refreshed the patch.

 Documentation/devicetree/bindings/net/phy.txt |2 +
 drivers/net/phy/at803x.c  |   19 ++
 drivers/net/phy/mdio_bus.c|4 +++
 drivers/net/phy/mdio_device.c |   27 +++--
 drivers/net/phy/phy_device.c  |   33 --
 drivers/of/of_mdio.c  |   16 
 include/linux/mdio.h  |3 ++
 include/linux/phy.h   |5 +++
 8 files changed, 89 insertions(+), 20 deletions(-)

Index: net-next/Documentation/devicetree/bindings/net/phy.txt
===
--- net-next.orig/Documentation/devicetree/bindings/net/phy.txt
+++ net-next/Documentation/devicetree/bindings/net/phy.txt
@@ -35,6 +35,8 @@ Optional Properties:
 - broken-turn-around: If set, indicates the PHY device does not correctly
   release the turn around line low at the end of a MDIO transaction.
 
+- reset-gpios: The GPIO phandle and specifier for the PHY reset signal.
+
 Example:
 
 ethernet-phy@0 {
Index: net-next/drivers/net/phy/at803x.c
===
--- net-next.orig/drivers/net/phy/at803x.c
+++ net-next/drivers/net/phy/at803x.c
@@ -65,7 +65,6 @@ MODULE_LICENSE("GPL");
 
 struct at803x_priv {
bool phy_reset:1;
-   struct gpio_desc *gpiod_reset;
 };
 
 struct at803x_context {
@@ -271,22 +270,10 @@ static int at803x_probe(struct phy_devic
 {
struct device *dev = &phydev->mdio.dev;
struct at803x_priv *priv;
-   struct gpio_desc *gpiod_reset;
 
priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
if (!priv)
return -ENOMEM;
-
-   if (phydev->drv->phy_id != ATH8030_PHY_ID)
-   goto does_not_require_reset_workaround;
-
-   gpiod_reset = devm_gpiod_get_optional(dev, "reset", GPIOD_OUT_LOW);
-   if (IS_ERR(gpiod_reset))
-   return PTR_ERR(gpiod_reset);
-
-   priv->gpiod_reset = gpiod_reset;
-
-does_not_require_reset_workaround:
phydev->priv = priv;
 
return 0;
@@ -361,14 +348,14 @@ static void at803x_link_change_notify(st
 */
if (phydev->drv->phy_id == ATH8030_PHY_ID) {
if (phydev->state == PHY_NOLINK) {
-   if (priv->gpiod_reset && !priv->phy_reset) {
+   if (phydev->mdio.reset && !priv->phy_reset) {
struct at803x_context context;
 
at803x_context_save(phydev, &context);
 
-   gpiod_set_value(priv->gpiod_reset, 1);
+   phy_device_reset(phydev, 1);
msleep(1);
-   gpiod_set_value(priv->gpiod_reset, 0);
+   phy_device_reset(phydev, 0);
msleep(1);
 
at803x_context_restore(phydev, &context);
Index: net-next/drivers/net/phy/mdio_bus.c
===
--- net-next.orig/drivers/net/phy/mdio_bus.c
+++ net-next/drivers/net/phy/mdio_bus.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -371,6 +372,9 @@ void mdiobus_unregister(struct mii_bus *
if (!mdiodev)
continue;
 
+   if (mdiodev->reset)
+   gpiod_put(mdiodev->reset);
+
mdiodev->device_remove(mdiodev);
mdiodev->device_free(mdiodev);
}
Index: net-next/drivers/net/phy/mdio_device.c
===
--- net-next.orig/drivers/net/phy/mdio_device.c
+++ net-next/drivers/net/phy/mdio_device.c
@@ -12,6 +12,8 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
 #include 
+#include 

Re: [PATCH] net: davinci_mdio: Set of_node in the mdio bus

2016-04-28 Thread J.D. Schroeder
On 04/28/2016 02:44 PM, David Miller wrote:
>> --- a/drivers/net/ethernet/ti/davinci_mdio.c
>> +++ b/drivers/net/ethernet/ti/davinci_mdio.c
>> @@ -343,6 +343,7 @@ static int davinci_mdio_probe(struct platform_device 
>> *pdev)
>>  if (davinci_mdio_probe_dt(>pdata, pdev))
>>  data->pdata = default_pdata;
>>  snprintf(data->bus->id, MII_BUS_ID_SIZE, "%s", pdev->name);
>> +data->bus->dev.of_node = dev->of_node;
>>  } else {
>>  data->pdata = pdata ? (*pdata) : default_pdata;
>>  snprintf(data->bus->id, MII_BUS_ID_SIZE, "%s-%x",
> 
> You can't do this.
> 
> First of all, of_node objects are reference counted.  So even if this was a
> legal thing to do you would have to drop the reference to the existing of_node
> pointer and gain a reference to dev->of_node.
> 
> But even more importantly, it is the job of the bus driver to set that
> bus->dev.of_node correctly, you should never override it in a driver like
> this.

David, thanks for your review. I understand your point about the reference 
count.

One thing to note is that bus->dev.of_node is always NULL for the
davinci mdio bus when going through this path. I'm not trying to
override it. I'm trying to make
sure it has a way to find the davinci mdio bus. Do you see the problem I'm
trying to solve?

Is there another way to make the of_mdio_find_bus() call able to
find the davinci mdio bus?

Thanks,
JD



Re: [PATCH net-next] of: of_mdio: Check if MDIO bus controller is available

2016-04-28 Thread Andrew Lunn
On Thu, Apr 28, 2016 at 02:55:10PM -0700, Florian Fainelli wrote:
> Add a check whether the 'struct device_node' pointer passed to
> of_mdiobus_register() is an available (aka enabled) node in the Device
> Tree.
> 
> The rationale for doing this is the case where an Ethernet MAC provides an MDIO
> bus controller and node, and an additional Ethernet MAC might be
> connecting its PHY/switches to that first MDIO bus controller, while
> still embedding one internally which is therefore marked as "disabled".
> 
> Instead of sprinkling checks like these in callers of
> of_mdiobus_register(), do this in a central location.

I think this discussion has shown there are no documented best
practices for MDIO bus drivers and how PHY nodes are placed within
device tree. Maybe you could document the generic MDIO binding, both
as integrated into a MAC device node, and as a separate device?

   Andrew


[PATCH RFT v2 0/2] Teach phylib hard-resetting devices

2016-04-28 Thread Sergei Shtylyov
Hello.

   Here's the set of 2 patches against DaveM's 'net-next.git' repo. They add to
the 'phylib' support for resetting devices via GPIO and do some clean up after
doing that...

[1/2] phylib: add device reset GPIO support
[2/2] macb: kill PHY reset code

MBR, Sergei



Re: [PATCH net-next] of: of_mdio: Check if MDIO bus controller is available

2016-04-28 Thread Andrew Lunn
On Thu, Apr 28, 2016 at 02:55:10PM -0700, Florian Fainelli wrote:
> Add a check whether the 'struct device_node' pointer passed to
> of_mdiobus_register() is an available (aka enabled) node in the Device
> Tree.
> 
> The rationale for doing this is the case where an Ethernet MAC provides an MDIO
> bus controller and node, and an additional Ethernet MAC might be
> connecting its PHY/switches to that first MDIO bus controller, while
> still embedding one internally which is therefore marked as "disabled".
> 
> Instead of sprinkling checks like these in callers of
> of_mdiobus_register(), do this in a central location.
> 
> Signed-off-by: Florian Fainelli 
>
> ---
>  drivers/of/of_mdio.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/of/of_mdio.c b/drivers/of/of_mdio.c
> index b622b33dbf93..2f497790be1b 100644
> --- a/drivers/of/of_mdio.c
> +++ b/drivers/of/of_mdio.c
> @@ -209,6 +209,10 @@ int of_mdiobus_register(struct mii_bus *mdio, struct 
> device_node *np)
>   bool scanphys = false;
>   int addr, rc;
>  
> + /* Do not continue if the node is disabled */
> + if (!of_device_is_available(np))
> + return -EINVAL;

Could be bike shedding, but would ENODEV be better?

Some callers are going to have to look at the return value and decide
if it is a fatal error, and fail the whole probe, or a non-fatal error
and they should keep going. ENODEV seems less fatal...

Other than that,

Reviewed-by: Andrew Lunn 

Andrew


Re: [RFC PATCH 2/5] mlx5: Add support for UDP tunnel segmentation with outer checksum offload

2016-04-28 Thread Alexander Duyck
On Thu, Apr 28, 2016 at 2:43 PM, Matthew Finlay  wrote:
>
>
>
>
>
> On 4/20/16, 11:06 AM, "Alexander Duyck"  wrote:
>
>>On Wed, Apr 20, 2016 at 10:40 AM, Saeed Mahameed
>> wrote:
>>> On Tue, Apr 19, 2016 at 10:06 PM, Alexander Duyck  
>>> wrote:
 This patch assumes that the mlx5 hardware will ignore existing IPv4/v6
 header fields for length and checksum as well as the length and checksum
 fields for outer UDP headers.

 I have no means of testing this as I do not have any mlx5 hardware but
 thought I would submit it as an RFC to see if anyone out there wants to
 test this and see if this does in fact enable this functionality allowing
 us to to segment UDP tunneled frames that have an outer checksum.

 Signed-off-by: Alexander Duyck 
 ---
  drivers/net/ethernet/mellanox/mlx5/core/en_main.c |7 ++-
  1 file changed, 6 insertions(+), 1 deletion(-)

 diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
 b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
 index e0adb604f461..57d8da796d50 100644
 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
 +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
 @@ -2390,13 +2390,18 @@ static void mlx5e_build_netdev(struct net_device 
 *netdev)
 netdev->hw_features  |= NETIF_F_HW_VLAN_CTAG_FILTER;

 if (mlx5e_vxlan_allowed(mdev)) {
 -   netdev->hw_features |= NETIF_F_GSO_UDP_TUNNEL;
 +   netdev->hw_features |= NETIF_F_GSO_UDP_TUNNEL |
 +  NETIF_F_GSO_UDP_TUNNEL_CSUM |
 +  NETIF_F_GSO_PARTIAL;
 netdev->hw_enc_features |= NETIF_F_IP_CSUM;
 netdev->hw_enc_features |= NETIF_F_RXCSUM;
 netdev->hw_enc_features |= NETIF_F_TSO;
 netdev->hw_enc_features |= NETIF_F_TSO6;
 netdev->hw_enc_features |= NETIF_F_RXHASH;
 netdev->hw_enc_features |= NETIF_F_GSO_UDP_TUNNEL;
 +   netdev->hw_enc_features |= NETIF_F_GSO_UDP_TUNNEL_CSUM |
 +  NETIF_F_GSO_PARTIAL;
 +   netdev->gso_partial_features = NETIF_F_GSO_UDP_TUNNEL_CSUM;
 }

 netdev->features  = netdev->hw_features;

>>>
>>> Hi Alex,
>>>
>>> Adding Matt, VxLAN feature owner from Mellanox,
>>> Matt please correct me if am wrong, but We already tested GSO VxLAN
>>> and we saw the TCP/IP checksum offloads for both inner and outer
>>> headers handled by the hardware.
>>>
>>> And looking at mlx5e_sq_xmit:
>>>
>>> if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
>>> eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
>>> if (skb->encapsulation) {
>>> eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
>>> MLX5_ETH_WQE_L4_INNER_CSUM;
>>> sq->stats.csum_offload_inner++;
>>> } else {
>>> eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
>>> }
>>>
>>> We enable inner/outer hardware checksumming unconditionally without
>>> looking at the features Alex is suggesting in this patch,
>>> Alex, can you elaborate more on the meaning of those features ? and
>>> why would it work for us without declaring them ?
>>
>>Well right now the feature list exposed by the device indicates that
>>TSO is not used if a VxLAN tunnel has a checksum in an outer header.
>>Since that is not exposed currently that is completely offloaded in
>>software via GSO.
>
> The mlx5 hardware requires that the outer UDP checksum not be set when
> offloading encapsulated packets.

The Intel documentation said the same thing.  That was due to the fact
that the hardware didn't compute the outer UDP header checksum.  I
suspect the Mellanox hardware has the same issue.  Also I have tested
on a ConnectX-4 board with the latest firmware and what I am seeing is
that with my patches applied the outer checksum is being correctly
applied for segmentation offloads.

My thought is that the hardware appears to ignore the UDP
checksum, so if it is non-zero you cannot guarantee the checksum would
be correct on the last frame if it is a different size than the rest
of the segments.  In the case of these patches that issue has been
resolved as I have precomputed the UDP checksum for the outer UDP
header, and all of the segments will be the same length so there should
be no variation in the UDP checksum of the outer header.  Unless you
can tell me exactly the reason why we cannot provide the outer UDP
checksum, I would assume that the reason is that the hardware doesn't
compute it, so you cannot handle a fragment on the end, which is
resolved already via GSO_PARTIAL.
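
For reference, the precomputation boils down to roughly this for the
IPv4 case (sketch; 'len' is the fixed per-segment outer UDP length and
udp_v4_check() is the pseudo-header checksum helper from
include/net/udp.h):

	uh->check = ~udp_v4_check(len, iph->saddr, iph->daddr, 0);

Since all segments end up the same length, that one value is valid for
every segment the hardware emits.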

- Alex


Re: [PATCH v2 net] soreuseport: Fix TCP listener hash collision

2016-04-28 Thread Craig Gallek
On Thu, Apr 28, 2016 at 5:59 PM, Eric Dumazet  wrote:
> On Thu, 2016-04-28 at 17:07 -0400, Craig Gallek wrote:
>> From: Craig Gallek 
>>
>> I forgot to include a check for listener port equality when deciding
>> if two sockets should belong to the same reuseport group.  This was
>> not caught previously because it's only necessary when two listening
>> sockets for the same user happen to hash to the same listener bucket.
>> This change also includes a check for network namespace equality.
>> The same error does not exist in the UDP path.
>>
>> Fixes: c125e80b8868("soreuseport: fast reuseport TCP socket selection")
>> Signed-off-by: Craig Gallek 
>> ---
>> v2 Changes
>>   - Suggestions from Eric Dumazet to include network namespace equality
> check and to avoid a dereference by simply checking inet_bind_bucket
>> pointer equality.
>> ---
>>  net/ipv4/inet_hashtables.c | 6 +-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
>> index bc68eced0105..5c5658268d5e 100644
>> --- a/net/ipv4/inet_hashtables.c
>> +++ b/net/ipv4/inet_hashtables.c
>> @@ -470,15 +470,19 @@ static int inet_reuseport_add_sock(struct sock *sk,
>>const struct sock *sk2,
>>bool match_wildcard))
>>  {
>> + struct inet_bind_bucket *tb = inet_csk(sk)->icsk_bind_hash;
>> + struct net *net = sock_net(sk);
>>   struct sock *sk2;
>>   struct hlist_nulls_node *node;
>>   kuid_t uid = sock_i_uid(sk);
>>
>>   sk_nulls_for_each_rcu(sk2, node, &ilb->head) {
>> - if (sk2 != sk &&
>> + if (net_eq(sock_net(sk2), net) &&
>> + sk2 != sk &&
>>   sk2->sk_family == sk->sk_family &&
>>   ipv6_only_sock(sk2) == ipv6_only_sock(sk) &&
>>   sk2->sk_bound_dev_if == sk->sk_bound_dev_if &&
>> + inet_csk(sk2)->icsk_bind_hash == tb &&
>>   sk2->sk_reuseport && uid_eq(uid, sock_i_uid(sk2)) &&
>>   saddr_same(sk, sk2, false))
>>   return reuseport_add_sock(sk, sk2);
>
> Note that I suggested to only use "inet_csk(sk2)->icsk_bind_hash == tb"
>
> If test is true, it means that sockets share same name space and same
> port ;)
>
> Therefore the added net_eq(sock_net(sk2), net) test is redundant.
Thanks for the quick review Eric, sorry I misread :\  I'll send a v3...

> No strong opinion, as this patch works, and this is not fast path
> anyway.
>
> Acked-by: Eric Dumazet 
>
>


Re: [PATCH net-next] tcp: give prequeue mode some care

2016-04-28 Thread Eric Dumazet
On Thu, 2016-04-28 at 17:15 -0400, David Miller wrote:

> There was a conflict due to the stats macro renaming, but that was trivial
> to resolve so I did it.
> 
> Applied, thanks Eric.
Ah great, I was preparing a V2, you were fast David.

Thanks




Re: [PATCH v2 net] soreuseport: Fix TCP listener hash collision

2016-04-28 Thread Eric Dumazet
On Thu, 2016-04-28 at 17:07 -0400, Craig Gallek wrote:
> From: Craig Gallek 
> 
> I forgot to include a check for listener port equality when deciding
> if two sockets should belong to the same reuseport group.  This was
> not caught previously because it's only necessary when two listening
> sockets for the same user happen to hash to the same listener bucket.
> This change also includes a check for network namespace equality.
> The same error does not exist in the UDP path.
> 
> Fixes: c125e80b8868("soreuseport: fast reuseport TCP socket selection")
> Signed-off-by: Craig Gallek 
> ---
> v2 Changes
>   - Suggestions from Eric Dumazet to include network namespace equality
> check and to avoid a dreference by simply checking inet_bind_bucket
> pointer equality.
> ---
>  net/ipv4/inet_hashtables.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index bc68eced0105..5c5658268d5e 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -470,15 +470,19 @@ static int inet_reuseport_add_sock(struct sock *sk,
>const struct sock *sk2,
>bool match_wildcard))
>  {
> + struct inet_bind_bucket *tb = inet_csk(sk)->icsk_bind_hash;
> + struct net *net = sock_net(sk);
>   struct sock *sk2;
>   struct hlist_nulls_node *node;
>   kuid_t uid = sock_i_uid(sk);
>  
>   sk_nulls_for_each_rcu(sk2, node, &ilb->head) {
> - if (sk2 != sk &&
> + if (net_eq(sock_net(sk2), net) &&
> + sk2 != sk &&
>   sk2->sk_family == sk->sk_family &&
>   ipv6_only_sock(sk2) == ipv6_only_sock(sk) &&
>   sk2->sk_bound_dev_if == sk->sk_bound_dev_if &&
> + inet_csk(sk2)->icsk_bind_hash == tb &&
>   sk2->sk_reuseport && uid_eq(uid, sock_i_uid(sk2)) &&
>   saddr_same(sk, sk2, false))
>   return reuseport_add_sock(sk, sk2);

Note that I suggested to only use "inet_csk(sk2)->icsk_bind_hash == tb"

If test is true, it means that sockets share same name space and same
port ;)

Therefore the added net_eq(sock_net(sk2), net) test is redundant.

No strong opinion, as this patch works, and this is not fast path
anyway.

Acked-by: Eric Dumazet 




[PATCH net-next] of: of_mdio: Check if MDIO bus controller is available

2016-04-28 Thread Florian Fainelli
Add a check whether the 'struct device_node' pointer passed to
of_mdiobus_register() is an available (aka enabled) node in the Device
Tree.

Rationale for doing this are cases where an Ethernet MAC provides a MDIO
bus controller and node, and an additional Ethernet MAC might be
connecting its PHY/switches to that first MDIO bus controller, while
still embedding one internally which is therefore marked as "disabled".

Instead of sprinkling checks like these in callers of
of_mdiobus_register(), do this in a central location.
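
A hypothetical DT fragment for the case described above (labels made
up for the example):

	&eth0 {
		mdio {
			phy0: ethernet-phy@1 {
				reg = <1>;
			};
		};
	};

	&eth1 {
		phy-handle = <&phy0>;	/* PHY sits on eth0's bus */

		mdio {
			status = "disabled";	/* internal, unused */
		};
	};

With this patch, of_mdiobus_register() called on eth1's disabled mdio
node returns -EINVAL instead of registering an empty bus.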

Signed-off-by: Florian Fainelli 
---
 drivers/of/of_mdio.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/of/of_mdio.c b/drivers/of/of_mdio.c
index b622b33dbf93..2f497790be1b 100644
--- a/drivers/of/of_mdio.c
+++ b/drivers/of/of_mdio.c
@@ -209,6 +209,10 @@ int of_mdiobus_register(struct mii_bus *mdio, struct 
device_node *np)
bool scanphys = false;
int addr, rc;
 
+   /* Do not continue if the node is disabled */
+   if (!of_device_is_available(np))
+   return -EINVAL;
+
/* Mask out all PHYs from auto probing.  Instead the PHYs listed in
 * the device tree are populated after the bus has been registered */
mdio->phy_mask = ~0;
-- 
2.1.0



Re: [PATCH] net: davinci_mdio: Set of_node in the mdio bus

2016-04-28 Thread David Miller
From: "J.D. Schroeder" 
Date: Thu, 28 Apr 2016 16:39:36 -0500

> On 04/28/2016 02:44 PM, David Miller wrote:
>>> --- a/drivers/net/ethernet/ti/davinci_mdio.c
>>> +++ b/drivers/net/ethernet/ti/davinci_mdio.c
>>> @@ -343,6 +343,7 @@ static int davinci_mdio_probe(struct platform_device 
>>> *pdev)
>>> if (davinci_mdio_probe_dt(>pdata, pdev))
>>> data->pdata = default_pdata;
>>> snprintf(data->bus->id, MII_BUS_ID_SIZE, "%s", pdev->name);
>>> +   data->bus->dev.of_node = dev->of_node;
>>> } else {
>>> data->pdata = pdata ? (*pdata) : default_pdata;
>>> snprintf(data->bus->id, MII_BUS_ID_SIZE, "%s-%x",
>> 
>> You can't do this.
>> 
>> First of all, of_node objects are reference counted.  So even if this was a
>> legal thing to do you would have to drop the reference to the existing 
>> of_node
>> pointer and gain a reference to dev->of_node.
>> 
>> But even more importantly, it is the job of the bus driver to set that
>> bus->dev.of_node correctly, you should never override it in a driver like
>> this.
> 
> David, thanks for your review. I understand your point about the reference 
> count.
> 
> One thing to note is that bus->dev.of_node is always NULL for the davinci
> mdio bus when going through this path. I'm not trying to override it. I'm trying to make
> sure it has a way to find the davinci mdio bus. Do you see the problem I'm
> trying to solve?
> 
> Is there another way to make the of_mdio_find_bus() call able to
> find the davinci mdio bus?

You should reference the device which actually has an OF node attached to it,
rather than pretending that the MDIO bus does.

Don't fake this stuff, reference the proper device to get the OF node.

Thanks.
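
(For reference, the usual pattern is to let the OF helper attach the
node at registration time -- sketch only, davinci specifics assumed:

	ret = of_mdiobus_register(data->bus, dev->of_node);

of_mdiobus_register() sets the bus device's of_node itself, which is
what of_mdio_find_bus() later matches against.)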


Re: [PATCH net 0/3] bpf: fix several bugs

2016-04-28 Thread David Miller
From: Alexei Starovoitov 
Date: Wed, 27 Apr 2016 18:56:19 -0700

> First two patches address bugs found by Jann Horn.
> Last patch is a minor samples fix spotted during the testing.

Series applied and queued up for -stable, thanks.


Re: [PATCH net v3 0/5] drivers: net: cpsw: phy-handle fixes

2016-04-28 Thread David Miller
From: "David Rivshin (Allworx)" 
Date: Wed, 27 Apr 2016 21:10:03 -0400

> This series fixes a number of related issues around using phy-handle
> properties in cpsw emac nodes.

Series applied, thanks.


Re: [PATCH next v2] ipvlan: Fix failure path in dev registration during link creation

2016-04-28 Thread David Miller
From: Mahesh Bandewar 
Date: Wed, 27 Apr 2016 14:59:27 -0700

> From: Mahesh Bandewar 
> 
> When newlink creation fails at device-registration, the port->count
> is decremented twice. Francesco Ruggeri (frugg...@arista.com) found
> this issue in Macvlan and the same exists in IPvlan driver too.
> 
> While fixing this issue I noticed another issue of missing unregister
> in case of failure, so adding it to the fix which is similar to the
> macvlan fix by Francesco in commit 308379607548 ("macvlan: fix failure
> during registration v3")
> 
> Reported-by: Francesco Ruggeri 
> Signed-off-by: Mahesh Bandewar 

Applied, thanks.


Re: [PATCH v2] net: macb: do not scan PHYs manually

2016-04-28 Thread Andrew Lunn
On Thu, Apr 28, 2016 at 04:03:57PM -0500, Josh Cartwright wrote:
> On Thu, Apr 28, 2016 at 08:59:32PM +0200, Andrew Lunn wrote:
> > On Thu, Apr 28, 2016 at 01:55:27PM -0500, Nathan Sullivan wrote:
> > > On Thu, Apr 28, 2016 at 08:43:03PM +0200, Andrew Lunn wrote:
> > > > > I agree that is a valid fix for AT91, however it won't solve our 
> > > > > problem, since
> > > > > we have no children on the second ethernet MAC in our devices' device 
> > > > > trees. I'm
> > > > > starting to feel like our second MAC shouldn't even really register 
> > > > > the MDIO bus
> > > > > since it isn't being used - maybe adding a DT property to not have a 
> > > > > bus is a
> > > > > better option?
> > > > 
> > > > status = "disabled"
> > > > 
> > > > would be the usual way.
> > > > 
> > > >   Andrew
> > > 
> > > Oh, sorry, I meant we use both MACs on Zynq, however the PHYs are on the 
> > > MDIO
> > > bus of the first MAC.  So, the second MAC is used for ethernet but not 
> > > for MDIO,
> > > and so it does not have any PHYs under its DT node.  It would be nice if 
> > > there
> > > were a way to tell macb not to bother with MDIO for the second MAC, since 
> > > that's
> > > handled by the first MAC.
> > 
> > Yes, exactly, add support for status = "disabled" in the mdio node.
> 
> Unfortunately, the 'macb' doesn't have a "mdio node", or alternatively:
> the node representing the mdio bus is the same node which represents the
> macb instance itself.  Setting 'status = "disabled"' on this node will
> just prevent the probing of the macb instance.

:-(

It is very common to have an mdio node within the MAC node, for example 
imx6sx-sdb.dtsi

&fec1 {
pinctrl-names = "default";
pinctrl-0 = <&pinctrl_enet1>;
phy-supply = <&reg_enet_3v3>;
phy-mode = "rgmii";
phy-handle = <&ethphy1>;
status = "okay";

mdio {
#address-cells = <1>;
#size-cells = <0>;

ethphy1: ethernet-phy@1 {
reg = <1>;
};

ethphy2: ethernet-phy@2 {
reg = <2>;
};
};
};

&fec2 {
pinctrl-names = "default";
pinctrl-0 = <&pinctrl_enet2>;
phy-mode = "rgmii";
phy-handle = <&ethphy2>;
status = "okay";
};

This even has the two phys on one bus, as you described...

 Andrew




Re: [PATCH net-next #2 1/1] pch_gbe: replace private tx ring lock with common netif_tx_lock

2016-04-28 Thread David Miller
From: Francois Romieu 
Date: Wed, 27 Apr 2016 23:29:44 +0200

> pch_gbe_tx_ring.tx_lock is only used in the hard_xmit handler and
> in the transmit completion reaper called from NAPI context.
> 
> Compile-tested only. Potential victims Cced.
> 
> Someone more knowledgeable may check if pch_gbe_tx_queue could
> have some use for a mmiowb.
> 
> Signed-off-by: Francois Romieu 
> Cc: Darren Hart 
> Cc: Andy Cress 
> Cc: br...@fossetcon.org
> 
> ---
>  Includes Nikolay's fix.

Applied, thank you.


Re: [PATCH v3 2/2] net: Add Qualcomm IPC router

2016-04-28 Thread David Miller
From: Bjorn Andersson 
Date: Wed, 27 Apr 2016 12:13:03 -0700

> From: Courtney Cavin 
> 
> Add an implementation of Qualcomm's IPC router protocol, used to
> communicate with service providing remote processors.
> 
> Signed-off-by: Courtney Cavin 
> Signed-off-by: Bjorn Andersson 
> [bjorn: Cope with 0 being a valid node id and implement RTM_NEWADDR]
> Signed-off-by: Bjorn Andersson 
> ---
> 
> Changes since v2:
> - Altered Kconfig dependency for QRTR_SMD to be compile testable
> 
> Changes since v1:
> - Made node 0 (normally the Qualcomm modem) a valid node
> - Implemented RTM_NEWADDR for specifying the local node id

Please adjust this so that CONFIG_QRTR can be modular.
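
(In Kconfig terms that roughly means switching the symbol to tristate
-- sketch:

config QRTR
	tristate "Qualcomm IPC Router support"

so the protocol can be built as a module, =m, rather than only
built in.)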


Re: [PATCH net-next 0/6] net: make TCP preemptible

2016-04-28 Thread Marcelo Ricardo Leitner
On Wed, Apr 27, 2016 at 10:25:46PM -0700, Eric Dumazet wrote:
> Most of TCP stack assumed it was running from BH handler.
> 
> This is great for most things, as TCP behavior is very sensitive
> to scheduling artifacts.
> 
> However, the prequeue and backlog processing are problematic,
> as they need to be flushed with BH being blocked.
> 
> To cope with modern needs, TCP sockets have big sk_rcvbuf values,
> in the order of 16 MB.
> This means that backlog can hold thousands of packets, and things
> like TCP coalescing or collapsing on this amount of packets can
> lead to insane latency spikes, since BH are blocked for too long.

And due to that, it may potentially lead to packet drops on NIC ring
buffers.  Great, thanks Eric.

> It is time to make UDP/TCP stacks preemptible.
> 
> Note that fast path still runs from BH handler.
> 
> Eric Dumazet (6):
>   tcp: do not assume TCP code is non preemptible
>   tcp: do not block bh during prequeue processing
>   dccp: do not assume DCCP code is non preemptible
>   udp: prepare for non BH masking at backlog processing
>   sctp: prepare for socket backlog behavior change
>   net: do not block BH while processing socket backlog
> 
>  net/core/sock.c  |  22 +++--
>  net/dccp/input.c |   2 +-
>  net/dccp/ipv4.c  |   4 +-
>  net/dccp/ipv6.c  |   4 +-
>  net/dccp/options.c   |   2 +-
>  net/ipv4/tcp.c   |   6 +--
>  net/ipv4/tcp_cdg.c   |  20 
>  net/ipv4/tcp_cubic.c |  20 
>  net/ipv4/tcp_fastopen.c  |  12 ++---
>  net/ipv4/tcp_input.c | 126 
> +++
>  net/ipv4/tcp_ipv4.c  |  14 --
>  net/ipv4/tcp_minisocks.c |   2 +-
>  net/ipv4/tcp_output.c|   7 ++-
>  net/ipv4/tcp_recovery.c  |   4 +-
>  net/ipv4/tcp_timer.c |  10 ++--
>  net/ipv4/udp.c   |   4 +-
>  net/ipv6/tcp_ipv6.c  |  12 ++---
>  net/ipv6/udp.c   |   4 +-
>  net/sctp/inqueue.c   |   2 +
>  19 files changed, 124 insertions(+), 153 deletions(-)
> 
> -- 
> 2.8.0.rc3.226.g39d4020
> 


Re: [PATCH net-next] tcp: give prequeue mode some care

2016-04-28 Thread David Miller
From: Eric Dumazet 
Date: Wed, 27 Apr 2016 10:12:25 -0700

> From: Eric Dumazet 
> 
> TCP prequeue goal is to defer processing of incoming packets
> to user space thread currently blocked in a recvmsg() system call.
> 
> Intent is to spend less time processing these packets on behalf
> of softirq handler, as softirq handler is unfair to normal process
> scheduler decisions, as it might interrupt threads that do not
> even use networking.
> 
> Current prequeue implementation has following issues :
> 
> 1) It only checks size of the prequeue against sk_rcvbuf
> 
>It was fine 15 years ago when sk_rcvbuf was in the 64KB vicinity.
>But we now have ~8MB values to cope with modern networking needs.
>We have to add sk_rmem_alloc in the equation, since out of order
>packets can definitely use up to sk_rcvbuf memory themselves.
> 
> 2) Even with a fixed memory truesize check, prequeue can be filled
>by thousands of packets. When prequeue needs to be flushed, either
>from sofirq context (in tcp_prequeue() or timer code), or process
>context (in tcp_prequeue_process()), this adds a latency spike
>which is often not desirable.
>I added a fixed limit of 32 packets, as this translated to a max
>flush time of 60 us on my test hosts.
> 
>Also note that all packets in prequeue are not accounted for tcp_mem,
>since they are not charged against sk_forward_alloc at this point.
>This is probably not a big deal.
> 
> Note that this might increase LINUX_MIB_TCPPREQUEUEDROPPED counts,
> which is misnamed, as packets are not dropped at all, but rather pushed
> to the stack (where they can be either consumed or dropped)
> 
> Signed-off-by: Eric Dumazet 

There was a conflict due to the stats macro renaming, but that was trivial
to resolve so I did it.

Applied, thanks Eric.
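
For context, the limit described in (2) boils down to a test along
these lines (sketch based on the field names used by the 4.6-era TCP
code; not a verbatim quote of the patch):

	if (skb_queue_len(&tp->ucopy.prequeue) >= 32 ||
	    tp->ucopy.memory + atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
		/* flush: push the queued skbs back to the stack */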


Re: [PATCH net-next] net: dsa: Provide CPU port statistics to master netdev

2016-04-28 Thread David Miller
From: Florian Fainelli 
Date: Wed, 27 Apr 2016 11:45:14 -0700

> This patch overloads the DSA master netdev, aka CPU Ethernet MAC to also
> include switch-side statistics, which is useful for debugging purposes,
> when the switch is not properly connected to the Ethernet MAC (duplex
> mismatch, (RG)MII electrical issues etc.).
> 
> We accomplish this by retaining the original copy of the master netdev's
> ethtool_ops, and just overload the 3 operations we care about:
> get_sset_count, get_strings and get_ethtool_stats so as to intercept
> these calls and call into the original master_netdev ethtool_ops, plus
> our own.
> 
> We take this approach as opposed to providing a set of DSA helper
> functions that would retrieve the CPU port's statistics, because the
> entire purpose of DSA is to allow unmodified Ethernet MAC drivers to be
> used as CPU conduit interfaces, therefore, statistics overlay in such
> drivers would simply not scale.
> 
> The new ethtool -S  output would therefore look like this now:
>  statistics
> p<2 digits cpu port number>_
> 
> Signed-off-by: Florian Fainelli 

Applied, thanks Florian.
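
The overlay described above has roughly this shape (sketch; the
dsa_cpu_* names are illustrative, not necessarily those in the patch):

	static struct ethtool_ops dsa_cpu_ethtool_ops;

	dsa_cpu_ethtool_ops = *master->ethtool_ops;	/* keep original ops */
	dsa_cpu_ethtool_ops.get_sset_count = dsa_cpu_get_sset_count;
	dsa_cpu_ethtool_ops.get_strings = dsa_cpu_get_strings;
	dsa_cpu_ethtool_ops.get_ethtool_stats = dsa_cpu_get_ethtool_stats;
	master->ethtool_ops = &dsa_cpu_ethtool_ops;

where each overloaded op first calls the saved original and then
appends the switch-side counters.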


Re: [PATCH] MAINTAINERS: net: Change maintainer for GRETH 10/100/1G Ethernet MAC device driver

2016-04-28 Thread David Miller
From: Andreas Larsson 
Date: Wed, 27 Apr 2016 16:46:10 +0200

> Signed-off-by: Andreas Larsson 

Applied.


Re: [PATCH v2 1/2] net: nps_enet: Sync access to packet sent flag

2016-04-28 Thread David Miller
From: Elad Kanfi 
Date: Wed, 27 Apr 2016 16:18:29 +0300

> From: Elad Kanfi 
> 
> Below is a description of a possible problematic
> sequence. CPU-A is sending a frame and CPU-B handles
> the interrupt that indicates the frame was sent. CPU-B
> reads an invalid value of tx_packet_sent.
> 
>   CPU-A   CPU-B
>   -   -
>   nps_enet_send_frame
>   .
>   .
>   tx_packet_sent = true
>   order HW to start tx
>   .
>   .
>   HW complete tx
>   --> get tx complete interrupt
>   .
>   .
>   if(tx_packet_sent == true)
> 
>   end memory transaction
>   (tx_packet_sent actually
>written)
> 
> Problem solution:
> 
> Add a memory barrier after setting tx_packet_sent,
> in order to make sure that it is written before
> the packet is sent.
> 
> Signed-off-by: Elad Kanfi 
> 
> Acked-by: Noam Camus 

Please address the feedback about memory barrier pairing.

Also, for both patches, do not put empty lines between the various
tags at the end of the commit message.  They should all be together
in one continuous group.
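
The pairing being asked for looks roughly like this (sketch; names are
taken from the changelog, not from the actual driver source):

	/* CPU-A, xmit path */
	priv->tx_packet_sent = true;
	smp_wmb();		/* publish the flag before kicking the HW */
	/* ... ring the tx doorbell ... */

	/* CPU-B, tx-done interrupt path */
	smp_rmb();		/* pairs with the smp_wmb() above */
	if (priv->tx_packet_sent)
		/* ... handle the completion ... */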


Re: [PATCH net] gre: reject GUE and FOU in collect metadata mode

2016-04-28 Thread David Miller
From: Jiri Benc 
Date: Wed, 27 Apr 2016 14:08:01 +0200

> The collect metadata mode does not support GUE nor FOU. This might be
> implemented later; until then, we should reject such config.
> 
> I think this is okay to be changed. It's unlikely anyone has such
> configuration (as it doesn't work anyway) and we may need a way to
> distinguish whether it's supported or not by the kernel later.
> 
> For backwards compatibility with iproute2, it's not possible to just check
> the attribute presence (iproute2 always includes the attribute), the actual
> value has to be checked, too.
> 
> Fixes: 2e15ea390e6f4 ("ip_gre: Add support to collect tunnel metadata.")
> Signed-off-by: Jiri Benc 
> ---
> Discovered this only after I already sent v3 of the previous gre set.
> Submitting this patch on its own, it's an indepent fix anyway (though fixing
> the same commit).

Applied, thank you.


Re: [PATCH v3 0/2] pegasus: correct buffer & packet sizes

2016-04-28 Thread David Miller
From: Petko Manolov 
Date: Wed, 27 Apr 2016 14:24:48 +0300

> As noticed by Lincoln Ramsay  some old (usb 1.1) Pegasus
> based devices may actually return more bytes than the amount specified
> in the datasheet.  That would not be a problem if the allocated space for the SKB was
> equal to the parameter passed to usb_fill_bulk_urb().  Some poor bugger (i
> really hope it was not me, but 'git blame' is useless in this case, so anyway)
> decided to add '+ 8' to the buffer length parameter.  Sometimes the usb 
> transfer
> overflows and corrupts the socket structure, leading to kernel panic.
> 
> The above doesn't seem to happen for newer (Pegasus2 based) devices which did
> help this bug to hide for so long.
> 
> The new default is to not include the CRC at the end of each received
> packet.  So far the CRC has been ignored, so including it makes no
> sense in the first place.
> 
> The patch is against v4.6-rc5 and was tested on ADM8515 device by transferring
> multiple gigabytes of data over a couple of days without any complaints from 
> the
> kernel.  Please apply it to whatever net tree you deem fit.
> 
> Changes since v1:
> 
>  - split the patch in two parts;
>  - corrected the subject lines;
> 
> Changes since v2:
> 
>  - do not append CRC by default (based on a discussion with Johannes Berg);

Series applied, thanks.
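
The underlying invariant is simply that the URB length handed to
usb_fill_bulk_urb() must not exceed the skb allocation -- roughly
(sketch, not the literal driver code):

	skb = dev_alloc_skb(PEGASUS_MTU);
	...
	usb_fill_bulk_urb(pegasus->rx_urb, pegasus->usb,
			  usb_rcvbulkpipe(pegasus->usb, 1),
			  skb->data, PEGASUS_MTU,	/* no stray "+ 8" */
			  read_bulk_callback, pegasus);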


[PATCH v2 net] soreuseport: Fix TCP listener hash collision

2016-04-28 Thread Craig Gallek
From: Craig Gallek 

I forgot to include a check for listener port equality when deciding
if two sockets should belong to the same reuseport group.  This was
not caught previously because it's only necessary when two listening
sockets for the same user happen to hash to the same listener bucket.
This change also includes a check for network namespace equality.
The same error does not exist in the UDP path.
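
(For reference, the inet_bind_bucket compared below is keyed by both
namespace and port -- excerpt from include/net/inet_hashtables.h:

	struct inet_bind_bucket {
		possible_net_t		ib_net;
		unsigned short		port;
		/* ... */
	};

so pointer equality implies that both match.)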

Fixes: c125e80b8868("soreuseport: fast reuseport TCP socket selection")
Signed-off-by: Craig Gallek 
---
v2 Changes
  - Suggestions from Eric Dumazet to include network namespace equality
check and to avoid a dereference by simply checking inet_bind_bucket
pointer equality.
---
 net/ipv4/inet_hashtables.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index bc68eced0105..5c5658268d5e 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -470,15 +470,19 @@ static int inet_reuseport_add_sock(struct sock *sk,
 const struct sock *sk2,
 bool match_wildcard))
 {
+   struct inet_bind_bucket *tb = inet_csk(sk)->icsk_bind_hash;
+   struct net *net = sock_net(sk);
struct sock *sk2;
struct hlist_nulls_node *node;
kuid_t uid = sock_i_uid(sk);
 
sk_nulls_for_each_rcu(sk2, node, &ilb->head) {
-   if (sk2 != sk &&
+   if (net_eq(sock_net(sk2), net) &&
+   sk2 != sk &&
sk2->sk_family == sk->sk_family &&
ipv6_only_sock(sk2) == ipv6_only_sock(sk) &&
sk2->sk_bound_dev_if == sk->sk_bound_dev_if &&
+   inet_csk(sk2)->icsk_bind_hash == tb &&
sk2->sk_reuseport && uid_eq(uid, sock_i_uid(sk2)) &&
saddr_same(sk, sk2, false))
return reuseport_add_sock(sk, sk2);
-- 
2.8.0.rc3.226.g39d4020



Re: [PATCH v2] net: macb: do not scan PHYs manually

2016-04-28 Thread Josh Cartwright
On Thu, Apr 28, 2016 at 08:59:32PM +0200, Andrew Lunn wrote:
> On Thu, Apr 28, 2016 at 01:55:27PM -0500, Nathan Sullivan wrote:
> > On Thu, Apr 28, 2016 at 08:43:03PM +0200, Andrew Lunn wrote:
> > > > I agree that is a valid fix for AT91, however it won't solve our 
> > > > problem, since
> > > > we have no children on the second ethernet MAC in our devices' device 
> > > > trees. I'm
> > > > starting to feel like our second MAC shouldn't even really register the 
> > > > MDIO bus
> > > > since it isn't being used - maybe adding a DT property to not have a 
> > > > bus is a
> > > > better option?
> > > 
> > > status = "disabled"
> > > 
> > > would be the usual way.
> > > 
> > >   Andrew
> > 
> > Oh, sorry, I meant we use both MACs on Zynq, however the PHYs are on the 
> > MDIO
> > bus of the first MAC.  So, the second MAC is used for ethernet but not for 
> > MDIO,
> > and so it does not have any PHYs under its DT node.  It would be nice if 
> > there
> > were a way to tell macb not to bother with MDIO for the second MAC, since 
> > that's
> > handled by the first MAC.
> 
> Yes, exactly, add support for status = "disabled" in the mdio node.

Unfortunately, the 'macb' doesn't have a "mdio node", or alternatively:
the node representing the mdio bus is the same node which represents the
macb instance itself.  Setting 'status = "disabled"' on this node will
just prevent the probing of the macb instance.

  Josh
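
For reference, on bindings where the MDIO bus is a separate child node of the
MAC, the status = "disabled" approach would look roughly like this (node names
and addresses are hypothetical; as Josh notes, macb has no such subnode today):

	gem1: ethernet@e000c000 {
		compatible = "cdns,zynq-gem";
		/* ... MAC resources ... */

		mdio {
			/* PHYs for this MAC live on the other MAC's bus */
			status = "disabled";
		};
	};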




Re: [PATCH] fq: split out backlog update logic

2016-04-28 Thread David Miller
From: Michal Kazior 
Date: Wed, 27 Apr 2016 12:59:13 +0200

> mac80211 (which will be the first user of the
> fq.h) recently started to support software A-MSDU
> aggregation. It glues skbuffs together into a
> single one so the backlog accounting needs to be
> more fine-grained.
> 
> To avoid backlog sorting logic duplication split
> it up for re-use.
> 
> Signed-off-by: Michal Kazior 

Applied, thanks.


Re: [PATCH net v3 0/2] gre: fix lwtunnel support

2016-04-28 Thread David Miller
From: Jiri Benc 
Date: Wed, 27 Apr 2016 11:29:05 +0200

> This patchset fixes a few bugs in ipgre metadata mode implementation.
> 
> As an example, in this setup:
> 
> ip a a 192.168.1.1/24 dev eth0
> ip l a gre1 type gre external
> ip l s gre1 up
> ip a a 192.168.99.1/24 dev gre1
> ip r a 192.168.99.2/32 encap ip dst 192.168.1.2 ttl 10 dev gre1
> ping 192.168.99.2
> 
> the traffic does not go through before this patchset and does as expected
> with it applied.
> 
> v3: Back to v1 in order not to break existing users. Dropped patch 3, will
> be fixed in iproute2 instead.
> v2: Rejecting invalid configuration, added patch 3, dropped patch for
> ETH_P_TEB (will target net-next).

Series applied, thanks Jiri.


Re: [PATCH v6 2/6] Documentation: Bindings: Add STM32 DWMAC glue

2016-04-28 Thread Rob Herring
On Mon, Apr 25, 2016 at 01:53:58PM +0200, Alexandre TORGUE wrote:
> Signed-off-by: Alexandre TORGUE 

Acked-by: Rob Herring 


Re: [PATCHv2] netem: Segment GSO packets on enqueue.

2016-04-28 Thread Eric Dumazet
On Thu, 2016-04-28 at 16:09 -0400, Neil Horman wrote:
> This was recently reported to me, and reproduced on the latest net kernel, 
> when
> attempting to run netperf from a host that had a netem qdisc attached to the
> egress interface:

>  
> - return NET_XMIT_SUCCESS;
> +finish_segs:
> + while (segs) {
> + skb2 = segs->next;
> + segs->next = NULL;
> + qdisc_skb_cb(segs)->pkt_len = segs->len;
> + rc = qdisc_enqueue(segs, sch);
> + if (rc != NET_XMIT_SUCCESS) {
> + if (net_xmit_drop_count(rc))
> + qdisc_qstats_drop(sch);
> + }
> + segs = skb2;
> + }
> + return rc;
>  }

It seems you missed the qdisc_tree_reduce_backlog() call ?
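
For reference, the accounting Eric is pointing at would sit at the end of the
segment loop, roughly like this (a sketch, assuming nb segments totalling len
bytes were enqueued in place of one GSO skb of prev_len bytes; not Neil's
posted code):

	sch->q.qlen += nb;
	if (nb > 1)
		/* the parent qdiscs accounted one enqueue of prev_len bytes;
		 * tell them it actually became nb packets of len bytes
		 */
		qdisc_tree_reduce_backlog(sch, 1 - nb, prev_len - len);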





Re: pull-request: mac80211 2016-04-27

2016-04-28 Thread David Miller
From: Johannes Berg 
Date: Wed, 27 Apr 2016 10:57:26 +0200

> While writing some new code yesterday, I found and fixed a per-CPU memory
> leak, this pull request has just a single patch addressing that.
> 
> Let me know if there's any problem.

Pulled, thanks a lot.


Re: [patch] tipc: remove an unnecessary NULL check

2016-04-28 Thread David Miller
From: Dan Carpenter 
Date: Wed, 27 Apr 2016 11:05:28 +0300

> This is never called with a NULL "buf" and anyway, we dereference 's' on
> the lines before so it would Oops before we reach the check.
> 
> Signed-off-by: Dan Carpenter 
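
The anti-pattern being removed, in miniature (illustrative, not the tipc
function itself):

	len = s->len;	/* 's' is already dereferenced here...          */
	if (!s)		/* ...so this NULL check can never trigger: a   */
		return;	/* NULL 's' would have Oopsed on the line above */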

Applied to net-next, thanks.


Re: [PATCH net-next 0/5] stmmac: dwmac-socfpga refactor+cleanup

2016-04-28 Thread David Miller
From: Joachim Eastwood 
Date: Tue, 26 Apr 2016 23:24:54 +0200

> Couple of heads-up here:
>  1. This patch set depend on Marek's "Remove re-registration of
> reset controller" patch [1] which is not in net-next yet.
> Without that patch this set will not apply!
> 
>  2. The first patch changes the prototype of a couple of
> functions used in Alexandre's "add Ethernet glue logic for
> stm32 chip" patch [2] and will cause build failures for
> dwmac-stm32.c if not fixed up!
> If Alexandre's patch set is applied first I will gladly
> rebase my patch set to account for his driver as well.
 ...
> Dave: Please let me know if you have any preferred way of
>   handling this.

You could cherry pick the patch in #1 and add it to this patch set,
in fact please respin this series that way.

For #2 it'll get sorted based upon who gets applied first.


Re: [PATCH net] soreuseport: Fix TCP listener hash collision

2016-04-28 Thread Eric Dumazet
On Thu, 2016-04-28 at 16:11 -0400, Craig Gallek wrote:
> From: Craig Gallek 
> 
> I forgot to include a check for listener port equality when deciding
> if two sockets should belong to the same reuseport group.  This was
> not caught previously because it's only necessary when two listening
> sockets for the same user happen to hash to the same listener bucket.
> The same error does not exist in the UDP path.
> 
> Fixes: c125e80b8868 ("soreuseport: fast reuseport TCP socket selection")
> Signed-off-by: Craig Gallek 
> ---
>  net/ipv4/inet_hashtables.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index bc68eced0105..326d26c7a9e6 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -470,6 +470,7 @@ static int inet_reuseport_add_sock(struct sock *sk,
>const struct sock *sk2,
>bool match_wildcard))
>  {
> + struct inet_bind_bucket *tb = inet_csk(sk)->icsk_bind_hash;
>   struct sock *sk2;
>   struct hlist_nulls_node *node;
>   kuid_t uid = sock_i_uid(sk);
> @@ -479,6 +480,7 @@ static int inet_reuseport_add_sock(struct sock *sk,
>   sk2->sk_family == sk->sk_family &&
>   ipv6_only_sock(sk2) == ipv6_only_sock(sk) &&
>   sk2->sk_bound_dev_if == sk->sk_bound_dev_if &&
> + inet_csk(sk2)->icsk_bind_hash->port == tb->port &&
>   sk2->sk_reuseport && uid_eq(uid, sock_i_uid(sk2)) &&
>   saddr_same(sk, sk2, false))
>   return reuseport_add_sock(sk, sk2);


Not sure it is network namespace ready ?

I would simply compare the tb pointer itself, and not deref it to get
the port.
  ...  inet_csk(sk2)->icsk_bind_hash == tb 
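
Comparing the tb pointers works because a bind bucket is already unique per
(namespace, port) pair; a simplified view of the structure (several fields
elided):

	struct inet_bind_bucket {
		possible_net_t	  ib_net;	/* namespace of the binding */
		unsigned short	  port;		/* local port number */
		/* ... reuse flags, hash linkage ... */
		struct hlist_head owners;	/* sockets bound to net+port */
	};

So inet_csk(sk2)->icsk_bind_hash == tb implies equal port and equal namespace
in a single comparison, with no dereference.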




Re: [PATCH] net: phy: at803x: only the AT8030 needs a hardware reset on link change

2016-04-28 Thread David Miller
From: Timur Tabi 
Date: Tue, 26 Apr 2016 12:44:18 -0500

> Commit 13a56b44 ("at803x: Add support for hardware reset") added a
> work-around for a hardware bug on the AT8030.  However, the work-around
> was being called for all 803x PHYs, even those that don't need it.
> Function at803x_link_change_notify() checks to make sure that it only
> resets the PHY on the 8030, but it makes more sense to not call that
> function at all if it isn't needed.
> 
> Signed-off-by: Timur Tabi 

Applied, thanks.


Re: [PATCH v2] net/mlx5e: avoid stack overflow in mlx5e_open_channels

2016-04-28 Thread David Miller
From: Arnd Bergmann 
Date: Tue, 26 Apr 2016 17:52:33 +0200

> struct mlx5e_channel_param is a large structure that is allocated
> on the stack of mlx5e_open_channels, and with a recent change
> it has grown beyond the warning size for the maximum stack
> that a single function should use:
> 
> mellanox/mlx5/core/en_main.c: In function 'mlx5e_open_channels':
> mellanox/mlx5/core/en_main.c:1325:1: error: the frame size of 1072 bytes is 
> larger than 1024 bytes [-Werror=frame-larger-than=]
> 
> The function is already using dynamic allocation and is not in
> a fast path, so the easiest workaround is to use another kzalloc
> for allocating the channel parameters.
> 
> Signed-off-by: Arnd Bergmann 
> Fixes: d3c9bc2743dc ("net/mlx5e: Added ICO SQs")
> ---
> v2: move allocation back into caller, as suggested by Saeed Mahameed
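
The shape of the fix, sketched (simplified from the actual hunk):

	struct mlx5e_channel_param *cparam;

	/* ~1 KB of parameters: heap-allocate instead of using the stack */
	cparam = kzalloc(sizeof(*cparam), GFP_KERNEL);
	if (!cparam)
		return -ENOMEM;

	/* ... build each channel from *cparam as before ... */

	kfree(cparam);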

Applied, thanks Arnd.


Re: [PATCH v3 0/2] sctp: delay calls to sk_data_ready() as much as possible

2016-04-28 Thread marcelo.leitner
On Thu, Apr 14, 2016 at 05:19:00PM -0300, marcelo.leit...@gmail.com wrote:
> On Thu, Apr 14, 2016 at 04:03:51PM -0400, Neil Horman wrote:
> > On Thu, Apr 14, 2016 at 02:59:16PM -0400, David Miller wrote:
> > > From: Marcelo Ricardo Leitner 
> > > Date: Thu, 14 Apr 2016 14:00:49 -0300
> > > 
> > > > Em 14-04-2016 10:03, Neil Horman escreveu:
> > > >> On Wed, Apr 13, 2016 at 11:05:32PM -0400, David Miller wrote:
> > > >>> From: Marcelo Ricardo Leitner 
> > > >>> Date: Fri,  8 Apr 2016 16:41:26 -0300
> > > >>>
> > >  1st patch is a preparation for the 2nd. The idea is to not call
> > >  ->sk_data_ready() for every data chunk processed while processing
> > >  packets but only once before releasing the socket.
> > > 
> > >  v2: patchset re-checked, small changelog fixes
> > >  v3: on patch 2, make use of local vars to make it more readable
> > > >>>
> > > >>> Applied to net-next, but isn't this reduced overhead coming at the
> > > >>> expense of latency?  What if that lower latency is important to the
> > > >>> application and/or consumer?
> > > >> Thats a fair point, but I'd make the counter argument that, as it
> > > >> currently
> > > >> stands, any latency introduced (or removed), is an artifact of our
> > > >> implementation rather than a designed feature of it.  That is to say,
> > > >> we make no
> > > >> guarantees at the application level regarding how long it takes to
> > > >> signal data
> > > >> readiness from the time we get data off the wire, so I would rather see
> > > >> our
> > > >> throughput raised if we can, as thats been sctp's more pressing
> > > >> achilles heel.
> > > >>
> > > >>
> > > >> That's not to say I wouldn't like to enable lower latency, but I'd
> > > >> rather have this now,
> > > >> and start pondering how to design that in.  Perhaps we can convert the
> > > >> pending
> > > >> flag to a counter to count the number of events we enqueue, and call
> > > >> sk_data_ready every time we reach a sysctl-defined threshold.
> > > > 
> > > > That and also that there is no chance of the application reading the
> > > > first chunks before all current ToDo's are performed by either the bh
> > > > or backlog handlers for that packet. Socket lock won't be cycled in
> > > > between chunks so the application is going to wait all the processing
> > > > one way or another.
> > > 
> > > But it takes time to signal the wakeup to the remote cpu the process
> > > was running on, schedule out the current process on that cpu (if it
> > > has in fact lost it's timeslice), and then finally look at the socket
> > > queue.
> > > 
> > > Of course this is all assuming the process was sleeping in the first
> > > place, either in recv or more likely poll.
> > > 
> > > I really think signalling early helps performance.
> > > 
> > 
> > Early, yes, often, not so much :).  Perhaps what would be advantageous 
> > would be
> > to signal at the start of a set of enqueues, rather than at the end.  That 
> > would
> > be equivalent in terms of not signaling more than needed, but would 
> > eliminate
> > the signaling on every chunk.   Perhaps what you could do Marcelo would be 
> > to
> > change the sense of the signal_ready flag to be a has_signaled flag.  e.g. 
> > call
> > sk_data_ready in ulp_event_tail like we used to, but only if the 
> > has_signaled
> > flag isn't set, then set the flag, and clear it at the end of the command
> > interpreter.
> > 
> > That would be a best of both worlds solution, as long as theres no chance of
> > race with user space reading from the socket before we were done enqueuing 
> > (i.e.
> > you have to guarantee that the socket lock stays held, which I think we do).
> 
> That is my feeling too. Will work on it. Thanks :-)
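
Neil's has-signaled idea, sketched in code (the field name is hypothetical;
the patch Marcelo tested below may differ):

	/* per-event enqueue path (e.g. the ulpqueue tail): signal once */
	if (!sp->data_ready_signalled) {
		sp->data_ready_signalled = 1;
		sk->sk_data_ready(sk);
	}

	/* end of the command interpreter: re-arm for the next burst */
	sp->data_ready_signalled = 0;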

I did the change and tested it on real machines, all set up for performance.
I couldn't spot any difference between both implementations.

Set RSS and queue irq affinity for one cpu and used taskset to run netperf
and another app I wrote on another cpu. It hits the socket backlog quite
often but still does direct processing every now and then.

With current state, netperf, scenario above. Results of perf sched
record for the CPUs in use, reported by perf sched latency:

  Task           |  Runtime ms | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
  netserver:3205 |     .490 ms |       10 | avg:    0.003 ms | max:    0.004 ms | max at:  69087.753356 s

another run
  netserver:3483 |     .412 ms |       15 | avg:    0.003 ms | max:    0.004 ms | max at:  69194.749814 s

With the patch below, same test:
  netserver:2643 |    1.110 ms |       14 | avg:    0.003 ms | max:    0.004 ms | max at:    172.006315 s

another run:
  netserver:2698 |    1.049 ms |       15 | avg:    0.003 ms | max:    0.004 ms | max at:    368.061672 s

I'll be happy to do more tests if you have any suggestions on how/what
to test.

---8<---
 
 

Re: pull request [net]: batman-adv-0160426

2016-04-28 Thread David Miller
From: Antonio Quartulli 
Date: Tue, 26 Apr 2016 11:27:14 +0800

> In this patchset you can find the following fixes:

Pulled, even though there were some typos in the commit messages.

> Patch 2 and 3 have no "Fixes:" tag because the offending commits date
> back to when batman-adv was not yet officially in the net tree.

This is not correct.  Instead, in the future, you should provide a
Fixes: tag that indicates the commit that merged batman-adv into the
upstream tree initially.

Thanks.


Re: [PATCH net-next V2] tuntap: calculate rps hash only when needed

2016-04-28 Thread David Miller
From: Jason Wang 
Date: Mon, 25 Apr 2016 23:13:42 -0400

> There's no need to calculate the rps hash if it was not enabled. So this
> patch exports rps_needed and checks it before trying to get the rps
> hash. Tests (using pktgen to inject packets into the guest) show this can
> improve pps by about 13% (when rps is disabled).
> 
> Before:
> ~115 pps
> After:
> ~130 pps
> 
> Cc: Michael S. Tsirkin 
> Signed-off-by: Jason Wang 
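
The check in question, roughly (a sketch; the exact variable and call site in
tun.c are assumptions):

	u32 rxhash = 0;
#ifdef CONFIG_RPS
	/* rps_needed is a static key: the branch is patched out entirely
	 * while RPS is unused, so no hash is computed in that case
	 */
	if (static_key_false(&rps_needed))
		rxhash = skb_get_hash(skb);
#endif
	/* ... */
	tun_flow_update(tun, rxhash, tfile);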

Applied, thanks Jason.


Re: [PATCH] ps3_gelic: fix memcpy parameter

2016-04-28 Thread David Miller
From: Christophe JAILLET 
Date: Tue, 26 Apr 2016 04:33:43 +0200

> The size allocated for target->hwinfo and the number of bytes copied in it
> should be consistent.
> 
> Signed-off-by: Christophe JAILLET 
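
The general pattern of such a fix (names are illustrative, not from the gelic
driver):

	size_t len = be16_to_cpu(src->size);	/* hypothetical length field */

	dst = kmalloc(len, GFP_KERNEL);
	if (dst)
		memcpy(dst, src, len);		/* one expression sizes both */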

Applied, thanks.


Re: [PATCH 1/2 net] lan78xx: fix statistics counter error

2016-04-28 Thread David Miller
From: 
Date: Mon, 25 Apr 2016 22:22:32 +

> From: Woojung Huh 
> 
> Fix rx_bytes, tx_bytes and tx_frames error in netdev.stats.
> - rx_bytes counted bytes excluding size of struct ethhdr.
> - tx_packets didn't count multiple packets in a single urb
> - tx_bytes included 8 bytes of extra commands.
> 
> Signed-off-by: Woojung Huh 

Applied.


Re: [PATCH 2/2 net] lan78xx: workaround of forced 100 Full/Half duplex mode error

2016-04-28 Thread David Miller
From: 
Date: Mon, 25 Apr 2016 22:22:36 +

> From: Woojung Huh 
> 
> In forced 100 Full/Half duplex mode, the chip may fail to set the mode
> correctly when the cable is switched between a long (~50+ m) and a short one.
> As a workaround, set the speed to 10 before setting it to 100 in forced
> 100 F/H mode.
> 
> Signed-off-by: Woojung Huh 

Applied.


Re: [PATCH] net: dsa: mv88e6xxx: fix uninitialized error return

2016-04-28 Thread David Miller
From: Colin King 
Date: Mon, 25 Apr 2016 23:11:22 +0100

> From: Colin Ian King 
> 
> The error return variable err is not initialized, and there is a possibility
> that err is never assigned, causing mv88e6xxx_port_bridge_join() to
> return a garbage error status. Fix this by initializing err
> to 0.
> 
> Signed-off-by: Colin Ian King 
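
The bug class in miniature (illustrative, not the driver code):

	int err;			/* uninitialized */

	for (i = 0; i < n; i++)
		if (port_wants_join(i))	/* hypothetical condition */
			err = do_join(i);

	return err;	/* garbage if the loop never ran; fix: int err = 0; */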

Applied.


Re: [PATCH v4 net-next 0/3] tcp: Make use of MSG_EOR in tcp_sendmsg

2016-04-28 Thread David Miller
From: Martin KaFai Lau 
Date: Mon, 25 Apr 2016 14:44:47 -0700

 ...
> One potential use case is to use MSG_EOR with
> SOF_TIMESTAMPING_TX_ACK to get a more accurate
> TCP ack timestamping on application protocol with
> multiple outgoing response messages (e.g. HTTP2).
> 
> One of our use case is at the webserver.  The webserver tracks
> the HTTP2 response latency by measuring when the webserver sends
> the first byte to the socket till the TCP ACK of the last byte
> is received.  In the cases where we don't have client side
> measurement, measuring from the server side is the only option.
> In the cases we have the client side measurement, the server side
> data can also be used to justify/cross-check-with the client
> side data.
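
A rough userspace sketch of that measurement (error handling elided; the flag
choice is an assumption, not taken from the series):

	#include <linux/net_tstamp.h>
	#include <sys/socket.h>

	void send_last_chunk(int fd, const void *resp, size_t resp_len)
	{
		/* request ACK timestamps; OPT_ID tags them with byte offsets */
		int val = SOF_TIMESTAMPING_TX_ACK | SOF_TIMESTAMPING_OPT_ID;

		setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));

		/* MSG_EOR closes the record: later sends are not coalesced
		 * into this skb, so the ACK timestamp corresponds to this
		 * response's last byte.
		 */
		send(fd, resp, resp_len, MSG_EOR);

		/* the timestamp is then read back from the socket error
		 * queue via recvmsg(..., MSG_ERRQUEUE)
		 */
	}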

Looks good, series applied, thanks!


[PATCH net] soreuseport: Fix TCP listener hash collision

2016-04-28 Thread Craig Gallek
From: Craig Gallek 

I forgot to include a check for listener port equality when deciding
if two sockets should belong to the same reuseport group.  This was
not caught previously because it's only necessary when two listening
sockets for the same user happen to hash to the same listener bucket.
The same error does not exist in the UDP path.

Fixes: c125e80b8868 ("soreuseport: fast reuseport TCP socket selection")
Signed-off-by: Craig Gallek 
---
 net/ipv4/inet_hashtables.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index bc68eced0105..326d26c7a9e6 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -470,6 +470,7 @@ static int inet_reuseport_add_sock(struct sock *sk,
 const struct sock *sk2,
 bool match_wildcard))
 {
+   struct inet_bind_bucket *tb = inet_csk(sk)->icsk_bind_hash;
struct sock *sk2;
struct hlist_nulls_node *node;
kuid_t uid = sock_i_uid(sk);
@@ -479,6 +480,7 @@ static int inet_reuseport_add_sock(struct sock *sk,
sk2->sk_family == sk->sk_family &&
ipv6_only_sock(sk2) == ipv6_only_sock(sk) &&
sk2->sk_bound_dev_if == sk->sk_bound_dev_if &&
+   inet_csk(sk2)->icsk_bind_hash->port == tb->port &&
sk2->sk_reuseport && uid_eq(uid, sock_i_uid(sk2)) &&
saddr_same(sk, sk2, false))
return reuseport_add_sock(sk, sk2);
-- 
2.8.0.rc3.226.g39d4020



Re: [PATCH v2] net: macb: do not scan PHYs manually

2016-04-28 Thread Andrew Lunn
On Thu, Apr 28, 2016 at 01:03:15PM -0700, Florian Fainelli wrote:
> On 28/04/16 11:59, Andrew Lunn wrote:
> > On Thu, Apr 28, 2016 at 01:55:27PM -0500, Nathan Sullivan wrote:
> >> On Thu, Apr 28, 2016 at 08:43:03PM +0200, Andrew Lunn wrote:
>  I agree that is a valid fix for AT91, however it won't solve our 
>  problem, since
>  we have no children on the second ethernet MAC in our devices' device 
>  trees. I'm
>  starting to feel like our second MAC shouldn't even really register the 
>  MDIO bus
>  since it isn't being used - maybe adding a DT property to not have a bus 
>  is a
>  better option?
> >>>
> >>> status = "disabled"
> >>>
> >>> would be the usual way.
> >>>
> >>>   Andrew
> >>
> >> Oh, sorry, I meant we use both MACs on Zynq, however the PHYs are on the 
> >> MDIO
> >> bus of the first MAC.  So, the second MAC is used for ethernet but not for 
> >> MDIO,
> >> and so it does not have any PHYs under its DT node.  It would be nice if 
> >> there
> >> were a way to tell macb not to bother with MDIO for the second MAC, since 
> >> that's
> >> handled by the first MAC.
> > 
> > Yes, exactly, add support for status = "disabled" in the mdio node.
> 
> Something like that, just so we do not have to sprinkle tests all over
> the place:
> 
> diff --git a/drivers/of/of_mdio.c b/drivers/of/of_mdio.c
> index b622b33dbf93..2f497790be1b 100644
> --- a/drivers/of/of_mdio.c
> +++ b/drivers/of/of_mdio.c
> @@ -209,6 +209,10 @@ int of_mdiobus_register(struct mii_bus *mdio, struct device_node *np)
> bool scanphys = false;
> int addr, rc;
> 
> +   /* Do not continue if the node is disabled */
> +   if (!of_device_is_available(np))
> +   return -EINVAL;
> +
> /* Mask out all PHYs from auto probing.  Instead the PHYs listed in
>  * the device tree are populated after the bus has been registered */
> mdio->phy_mask = ~0;

Yes, that looks good.

 Andrew


[PATCHv2] netem: Segment GSO packets on enqueue.

2016-04-28 Thread Neil Horman
This was recently reported to me, and reproduced on the latest net kernel, when
attempting to run netperf from a host that had a netem qdisc attached to the
egress interface:

[  788.073771] [ cut here ]
[  788.096716] WARNING: at net/core/dev.c:2253 skb_warn_bad_offload+0xcd/0xda()
[  788.129521] bnx2: caps=(0x0001801949b3, 0x) len=2962
data_len=0 gso_size=1448 gso_type=1 ip_summed=3
[  788.182150] Modules linked in: sch_netem kvm_amd kvm crc32_pclmul ipmi_ssif
ghash_clmulni_intel sp5100_tco amd64_edac_mod aesni_intel lrw gf128mul
glue_helper ablk_helper edac_mce_amd cryptd pcspkr sg edac_core hpilo ipmi_si
i2c_piix4 k10temp fam15h_power hpwdt ipmi_msghandler shpchp acpi_power_meter
pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c
sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt
i2c_algo_bit drm_kms_helper ahci ata_generic pata_acpi ttm libahci
crct10dif_pclmul pata_atiixp tg3 libata crct10dif_common drm crc32c_intel ptp
serio_raw bnx2 r8169 hpsa pps_core i2c_core mii dm_mirror dm_region_hash dm_log
dm_mod
[  788.465294] CPU: 16 PID: 0 Comm: swapper/16 Tainted: GW
   3.10.0-327.el7.x86_64 #1
[  788.511521] Hardware name: HP ProLiant DL385p Gen8, BIOS A28 12/17/2012
[  788.542260]  880437c036b8 f7afc56532a53db9 880437c03670
816351f1
[  788.576332]  880437c036a8 8107b200 880633e74200
880231674000
[  788.611943]  0001 0003 
880437c03710
[  788.647241] Call Trace:
[  788.658817][] dump_stack+0x19/0x1b
[  788.686193]  [] warn_slowpath_common+0x70/0xb0
[  788.713803]  [] warn_slowpath_fmt+0x5c/0x80
[  788.741314]  [] ? ___ratelimit+0x93/0x100
[  788.767018]  [] skb_warn_bad_offload+0xcd/0xda
[  788.796117]  [] skb_checksum_help+0x17c/0x190
[  788.823392]  [] netem_enqueue+0x741/0x7c0 [sch_netem]
[  788.854487]  [] dev_queue_xmit+0x2a8/0x570
[  788.880870]  [] ip_finish_output+0x53d/0x7d0
...

The problem occurs because netem is not prepared to handle GSO packets (as it
uses skb_checksum_help in its enqueue path, which cannot manipulate these
frames).

The solution, I think, is to simply segment the skb in a similar fashion to the
way we do in __dev_queue_xmit (via validate_xmit_skb), except here we always
segment, instead of only when the interface needs us to do it.  This allows
netem to properly drop/mangle/pass/etc the correct percentages of frames as per
its qdisc configuration, and avoid failing its checksum operations.

Tested successfully by myself on the latest net kernel, to which this applies.

---
Change Notes:
V2) As per request from Eric Dumazet, I rewrote this to limit the need to
segment the skb. Instead of doing so unilaterally, we now only do so when the
netem qdisc determines that a packet must be corrupted, thus avoiding
the failure in skb_checksum_help.  This still leaves open concerns with
statistical measurements made on GSO packets being dropped or reordered (i.e.
they are counted as a single packet rather than multiple packets), but I'd
rather fix the immediate problem before we go rewriting everything to fix that
larger issue.

Signed-off-by: Neil Horman 
CC: Jamal Hadi Salim 
CC: "David S. Miller" 
CC: ne...@lists.linux-foundation.org
CC: eric.duma...@gmail.com
---
 net/sched/sch_netem.c | 51 ---
 1 file changed, 48 insertions(+), 3 deletions(-)

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 9640bb3..7cde5d3 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -395,6 +395,25 @@ static void tfifo_enqueue(struct sk_buff *nskb, struct Qdisc *sch)
sch->q.qlen++;
 }
 
+/* netem can't properly corrupt a megapacket (like we get from GSO), so instead
+ * when we statistically choose to corrupt one, we instead segment it, returning
+ * the first packet to be corrupted, and re-enqueue the remaining frames
+ */
+static struct sk_buff* netem_segment(struct sk_buff *skb, struct Qdisc *sch)
+{
+   struct sk_buff *segs;
+   netdev_features_t features = netif_skb_features(skb);
+
+   segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
+
+   if (IS_ERR_OR_NULL(segs)) {
+   qdisc_reshape_fail(skb, sch);
+   return NULL;
+   }
+   consume_skb(skb);
+   return segs;
+}
+
 /*
  * Insert one skb into qdisc.
  * Note: parent depends on return value to account for queue length.
@@ -407,7 +426,9 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)
/* We don't fill cb now as skb_unshare() may invalidate it */
struct netem_skb_cb *cb;
struct sk_buff *skb2;
+   struct sk_buff *segs = NULL;
int count = 1;
+   int rc = NET_XMIT_SUCCESS;
 
/* Random duplication */
	if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor))
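
The diff continues beyond this point; the corruption hook it leads into looks
roughly like this (a sketch, not the posted hunk):

	/* if this packet is chosen for corruption and is GSO, split it
	 * first so skb_checksum_help() only ever sees linear packets
	 */
	if (q->corrupt && q->corrupt >= get_crandom(&q->corrupt_cor)) {
		if (skb_is_gso(skb)) {
			segs = netem_segment(skb, sch);
			if (!segs)
				return NET_XMIT_DROP;
			skb = segs;
			segs = segs->next;
			skb->next = NULL;
		}
		/* ... corrupt skb, then fall through to finish_segs ... */
	}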
