Re: [PATCH] IPIP tunnel performance improvement

2016-02-26 Thread zhao ya


Yes, I did, but it had no effect.

What I want to ask is: why is David's patch not used?

Thanks.



Cong Wang said, at 2/27/2016 2:29 PM:
> On Fri, Feb 26, 2016 at 8:40 PM, zhao ya  wrote:
>> From: Zhao Ya 
>> Date: Sat, 27 Feb 2016 10:06:44 +0800
>> Subject: [PATCH] IPIP tunnel performance improvement
>>
>> Bypass the per-packet neighbour creation logic when using a
>> point-to-point or loopback device.
>>
>> Recently, in our tests, we met a performance problem.
>> When a large number of packets with different target IP addresses pass
>> through an ipip tunnel, PPS decreases sharply.
>>
>> The output of perf top is as follows; __write_lock_failed is first:
>>   - 5.89% [kernel]  [k] __write_lock_failed
>>- __write_lock_failed
>>- _raw_write_lock_bh
>>- __neigh_create
>>- ip_finish_output
>>- ip_output
>>- ip_local_out
>>
>> The neighbour subsystem will create a neighbour object for each target
>> when using a point-to-point device. When massive amounts of packets with
>> different target IP addresses are transmitted through a point-to-point
>> device, these packets will hit a bottleneck at write_lock_bh(&tbl->lock)
>> after creating the neighbour object and then inserting it into a
>> hash-table at the same time.
>>
>> This patch corrects it. Only one or a few neighbour objects will be
>> created when massive amounts of packets with different target IP
>> addresses pass through an ipip tunnel.
>>
>> As the result, performance will be improved.
> 
> Well, you just basically revert another bug fix:
> 
> commit 0bb4087cbec0ef74fd416789d6aad67957063057
> Author: David S. Miller 
> Date:   Fri Jul 20 16:00:53 2012 -0700
> 
> ipv4: Fix neigh lookup keying over loopback/point-to-point devices.
> 
> We were using a special key "0" for all loopback and point-to-point
> device neigh lookups under ipv4, but we wouldn't use that special
> key for the neigh creation.
> 
> So basically we'd make a new neigh at each and every lookup :-)
> 
> This special case to use only one neigh for these device types
> is of dubious value, so just remove it entirely.
> 
> Reported-by: Eric Dumazet 
> Signed-off-by: David S. Miller 
> 
> which would bring the neigh entries counting problem back...
> 
> Did you try to tune the neigh gc parameters for your case?
> 
> Thanks.
> 



Re: [PATCH] IPIP tunnel performance improvement

2016-02-26 Thread Cong Wang
On Fri, Feb 26, 2016 at 8:40 PM, zhao ya  wrote:
> From: Zhao Ya 
> Date: Sat, 27 Feb 2016 10:06:44 +0800
> Subject: [PATCH] IPIP tunnel performance improvement
>
> Bypass the per-packet neighbour creation logic when using a
> point-to-point or loopback device.
>
> Recently, in our tests, we met a performance problem.
> When a large number of packets with different target IP addresses pass
> through an ipip tunnel, PPS decreases sharply.
>
> The output of perf top is as follows; __write_lock_failed is first:
>   - 5.89% [kernel]  [k] __write_lock_failed
>- __write_lock_failed
>- _raw_write_lock_bh
>- __neigh_create
>- ip_finish_output
>- ip_output
>- ip_local_out
>
> The neighbour subsystem will create a neighbour object for each target
> when using a point-to-point device. When massive amounts of packets with
> different target IP addresses are transmitted through a point-to-point
> device, these packets will hit a bottleneck at write_lock_bh(&tbl->lock)
> after creating the neighbour object and then inserting it into a
> hash-table at the same time.
>
> This patch corrects it. Only one or a few neighbour objects will be
> created when massive amounts of packets with different target IP
> addresses pass through an ipip tunnel.
>
> As the result, performance will be improved.

Well, you just basically revert another bug fix:

commit 0bb4087cbec0ef74fd416789d6aad67957063057
Author: David S. Miller 
Date:   Fri Jul 20 16:00:53 2012 -0700

ipv4: Fix neigh lookup keying over loopback/point-to-point devices.

We were using a special key "0" for all loopback and point-to-point
device neigh lookups under ipv4, but we wouldn't use that special
key for the neigh creation.

So basically we'd make a new neigh at each and every lookup :-)

This special case to use only one neigh for these device types
is of dubious value, so just remove it entirely.

Reported-by: Eric Dumazet 
Signed-off-by: David S. Miller 

which would bring the neigh entries counting problem back...

Did you try to tune the neigh gc parameters for your case?

Thanks.
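
The suggestion above refers to the ARP/neighbour garbage-collection sysctls under `net.ipv4.neigh.default`. A hedged sketch of loosening them for a workload that churns many per-destination entries (the values are illustrative assumptions, not tuning advice from this thread):

```shell
# Raise the soft/hard entry limits so gc does not thrash while
# many short-lived neigh entries exist at once.
sysctl -w net.ipv4.neigh.default.gc_thresh1=4096
sysctl -w net.ipv4.neigh.default.gc_thresh2=8192
sysctl -w net.ipv4.neigh.default.gc_thresh3=16384
# Run the periodic gc less often and keep stale entries longer.
sysctl -w net.ipv4.neigh.default.gc_interval=60
sysctl -w net.ipv4.neigh.default.gc_stale_time=120
```

These need root and take effect immediately; per-interface overrides live under `net.ipv4.neigh.<dev>`.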


[net-next][PATCH 05/13] RDS: IB: Re-organise ibmr code

2016-02-26 Thread Santosh Shilimkar
No functional changes. This is in preparation for adding
fastreg memory registration support.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/Makefile  |   2 +-
 net/rds/ib.c  |  37 +++---
 net/rds/ib.h  |  25 +---
 net/rds/ib_fmr.c  | 217 +++
 net/rds/ib_mr.h   | 109 
 net/rds/ib_rdma.c | 379 +++---
 6 files changed, 422 insertions(+), 347 deletions(-)
 create mode 100644 net/rds/ib_fmr.c
 create mode 100644 net/rds/ib_mr.h

diff --git a/net/rds/Makefile b/net/rds/Makefile
index 19e5485..bcf5591 100644
--- a/net/rds/Makefile
+++ b/net/rds/Makefile
@@ -6,7 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o 
message.o   \
 obj-$(CONFIG_RDS_RDMA) += rds_rdma.o
 rds_rdma-y :=  rdma_transport.o \
ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
-   ib_sysctl.o ib_rdma.o
+   ib_sysctl.o ib_rdma.o ib_fmr.o
 
 
 obj-$(CONFIG_RDS_TCP) += rds_tcp.o
diff --git a/net/rds/ib.c b/net/rds/ib.c
index 9481d55..bb32cb9 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -42,15 +42,16 @@
 
 #include "rds.h"
 #include "ib.h"
+#include "ib_mr.h"
 
-unsigned int rds_ib_fmr_1m_pool_size = RDS_FMR_1M_POOL_SIZE;
-unsigned int rds_ib_fmr_8k_pool_size = RDS_FMR_8K_POOL_SIZE;
+unsigned int rds_ib_mr_1m_pool_size = RDS_MR_1M_POOL_SIZE;
+unsigned int rds_ib_mr_8k_pool_size = RDS_MR_8K_POOL_SIZE;
 unsigned int rds_ib_retry_count = RDS_IB_DEFAULT_RETRY_COUNT;
 
-module_param(rds_ib_fmr_1m_pool_size, int, 0444);
-MODULE_PARM_DESC(rds_ib_fmr_1m_pool_size, " Max number of 1M fmr per HCA");
-module_param(rds_ib_fmr_8k_pool_size, int, 0444);
-MODULE_PARM_DESC(rds_ib_fmr_8k_pool_size, " Max number of 8K fmr per HCA");
+module_param(rds_ib_mr_1m_pool_size, int, 0444);
+MODULE_PARM_DESC(rds_ib_mr_1m_pool_size, " Max number of 1M mr per HCA");
+module_param(rds_ib_mr_8k_pool_size, int, 0444);
+MODULE_PARM_DESC(rds_ib_mr_8k_pool_size, " Max number of 8K mr per HCA");
 module_param(rds_ib_retry_count, int, 0444);
 MODULE_PARM_DESC(rds_ib_retry_count, " Number of hw retries before reporting 
an error");
 
@@ -140,13 +141,13 @@ static void rds_ib_add_one(struct ib_device *device)
rds_ibdev->max_sge = min(device->attrs.max_sge, RDS_IB_MAX_SGE);
 
rds_ibdev->fmr_max_remaps = device->attrs.max_map_per_fmr?: 32;
-   rds_ibdev->max_1m_fmrs = device->attrs.max_mr ?
+   rds_ibdev->max_1m_mrs = device->attrs.max_mr ?
min_t(unsigned int, (device->attrs.max_mr / 2),
- rds_ib_fmr_1m_pool_size) : rds_ib_fmr_1m_pool_size;
+ rds_ib_mr_1m_pool_size) : rds_ib_mr_1m_pool_size;
 
-   rds_ibdev->max_8k_fmrs = device->attrs.max_mr ?
+   rds_ibdev->max_8k_mrs = device->attrs.max_mr ?
min_t(unsigned int, ((device->attrs.max_mr / 2) * 
RDS_MR_8K_SCALE),
- rds_ib_fmr_8k_pool_size) : rds_ib_fmr_8k_pool_size;
+ rds_ib_mr_8k_pool_size) : rds_ib_mr_8k_pool_size;
 
rds_ibdev->max_initiator_depth = device->attrs.max_qp_init_rd_atom;
rds_ibdev->max_responder_resources = device->attrs.max_qp_rd_atom;
@@ -172,10 +173,10 @@ static void rds_ib_add_one(struct ib_device *device)
goto put_dev;
}
 
-   rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, 
fmr_max_remaps = %d, max_1m_fmrs = %d, max_8k_fmrs = %d\n",
+   rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, 
fmr_max_remaps = %d, max_1m_mrs = %d, max_8k_mrs = %d\n",
 device->attrs.max_fmr, rds_ibdev->max_wrs, rds_ibdev->max_sge,
-rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_fmrs,
-rds_ibdev->max_8k_fmrs);
+rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_mrs,
+rds_ibdev->max_8k_mrs);
 
	INIT_LIST_HEAD(&rds_ibdev->ipaddr_list);
	INIT_LIST_HEAD(&rds_ibdev->conn_list);
@@ -364,7 +365,7 @@ void rds_ib_exit(void)
rds_ib_sysctl_exit();
rds_ib_recv_exit();
	rds_trans_unregister(&rds_ib_transport);
-   rds_ib_fmr_exit();
+   rds_ib_mr_exit();
 }
 
 struct rds_transport rds_ib_transport = {
@@ -400,13 +401,13 @@ int rds_ib_init(void)
 
INIT_LIST_HEAD(_ib_devices);
 
-   ret = rds_ib_fmr_init();
+   ret = rds_ib_mr_init();
if (ret)
goto out;
 
	ret = ib_register_client(&rds_ib_client);
if (ret)
-   goto out_fmr_exit;
+   goto out_mr_exit;
 
ret = rds_ib_sysctl_init();
if (ret)
@@ -430,8 +431,8 @@ out_sysctl:
rds_ib_sysctl_exit();
 out_ibreg:
rds_ib_unregister_client();
-out_fmr_exit:
-   rds_ib_fmr_exit();
+out_mr_exit:
+   rds_ib_mr_exit();
 out:
return ret;
 }
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 09cd8e3..c88cb22 100644
--- 

[net-next][PATCH 01/13] RDS: Drop stale iWARP RDMA transport

2016-02-26 Thread Santosh Shilimkar
RDS iWARP support code has become stale and non-testable. As
indicated earlier, I am dropping support for it.

If new iWARP user(s) show up in future, we can adapt the RDS IB
transport for the special RDMA READ sink case. iWARP needs an MR
for the RDMA READ sink.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 Documentation/networking/rds.txt |   4 +-
 net/rds/Kconfig  |   7 +-
 net/rds/Makefile |   4 +-
 net/rds/iw.c | 312 -
 net/rds/iw.h | 398 
 net/rds/iw_cm.c  | 769 --
 net/rds/iw_rdma.c| 837 -
 net/rds/iw_recv.c| 904 
 net/rds/iw_ring.c| 169 ---
 net/rds/iw_send.c| 981 ---
 net/rds/iw_stats.c   |  95 
 net/rds/iw_sysctl.c  | 123 -
 net/rds/rdma_transport.c |  13 +-
 net/rds/rdma_transport.h |   5 -
 14 files changed, 7 insertions(+), 4614 deletions(-)
 delete mode 100644 net/rds/iw.c
 delete mode 100644 net/rds/iw.h
 delete mode 100644 net/rds/iw_cm.c
 delete mode 100644 net/rds/iw_rdma.c
 delete mode 100644 net/rds/iw_recv.c
 delete mode 100644 net/rds/iw_ring.c
 delete mode 100644 net/rds/iw_send.c
 delete mode 100644 net/rds/iw_stats.c
 delete mode 100644 net/rds/iw_sysctl.c

diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt
index e1a3d59..9d219d8 100644
--- a/Documentation/networking/rds.txt
+++ b/Documentation/networking/rds.txt
@@ -19,9 +19,7 @@ to N*N if you use a connection-oriented socket transport like 
TCP.
 
 RDS is not Infiniband-specific; it was designed to support different
 transports.  The current implementation used to support RDS over TCP as well
-as IB. Work is in progress to support RDS over iWARP, and using DCE to
-guarantee no dropped packets on Ethernet, it may be possible to use RDS over
-UDP in the future.
+as IB.
 
 The high-level semantics of RDS from the application's point of view are
 
diff --git a/net/rds/Kconfig b/net/rds/Kconfig
index f2c670b..bffde4b 100644
--- a/net/rds/Kconfig
+++ b/net/rds/Kconfig
@@ -4,14 +4,13 @@ config RDS
depends on INET
---help---
  The RDS (Reliable Datagram Sockets) protocol provides reliable,
- sequenced delivery of datagrams over Infiniband, iWARP,
- or TCP.
+ sequenced delivery of datagrams over Infiniband or TCP.
 
 config RDS_RDMA
-   tristate "RDS over Infiniband and iWARP"
+   tristate "RDS over Infiniband"
depends on RDS && INFINIBAND && INFINIBAND_ADDR_TRANS
---help---
- Allow RDS to use Infiniband and iWARP as a transport.
+ Allow RDS to use Infiniband as a transport.
  This transport supports RDMA operations.
 
 config RDS_TCP
diff --git a/net/rds/Makefile b/net/rds/Makefile
index 56d3f60..19e5485 100644
--- a/net/rds/Makefile
+++ b/net/rds/Makefile
@@ -6,9 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o 
message.o   \
 obj-$(CONFIG_RDS_RDMA) += rds_rdma.o
 rds_rdma-y :=  rdma_transport.o \
ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
-   ib_sysctl.o ib_rdma.o \
-   iw.o iw_cm.o iw_recv.o iw_ring.o iw_send.o iw_stats.o \
-   iw_sysctl.o iw_rdma.o
+   ib_sysctl.o ib_rdma.o
 
 
 obj-$(CONFIG_RDS_TCP) += rds_tcp.o
diff --git a/net/rds/iw.c b/net/rds/iw.c
deleted file mode 100644
index f4a9fff..000
diff --git a/net/rds/iw.h b/net/rds/iw.h
deleted file mode 100644
index 5af01d1..000
diff --git a/net/rds/iw_cm.c b/net/rds/iw_cm.c
deleted file mode 100644
index aea4c91..000
diff --git a/net/rds/iw_rdma.c b/net/rds/iw_rdma.c
deleted file mode 100644
index b09a40c..000
diff --git a/net/rds/iw_recv.c b/net/rds/iw_recv.c
deleted file mode 100644
index a66d179..000
diff --git a/net/rds/iw_ring.c b/net/rds/iw_ring.c
deleted file mode 100644
index da8e3b6..000
diff --git a/net/rds/iw_send.c b/net/rds/iw_send.c
deleted file mode 100644
index e20bd50..000
diff --git a/net/rds/iw_stats.c b/net/rds/iw_stats.c
deleted file mode 100644
index 5fe67f6..000
diff --git a/net/rds/iw_sysctl.c b/net/rds/iw_sysctl.c
deleted file mode 100644
index 139239d..000
diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 9c1fed8..4f4b3d8 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -49,9 +49,7 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
rdsdebug("conn %p id %p handling event %u (%s)\n", conn, cm_id,
 event->event, rdma_event_msg(event->event));
 
-   if (cm_id->device->node_type == RDMA_NODE_RNIC)
-   trans = &rds_iw_transport;
-   else
+

[net-next][PATCH 02/13] RDS: Add support for SO_TIMESTAMP for incoming messages

2016-02-26 Thread Santosh Shilimkar
SO_TIMESTAMP generates a time stamp for each incoming RDS message.
A user app can enable it by using the SO_TIMESTAMP setsockopt() at
SOL_SOCKET level. CMSG data of cmsg type SO_TIMESTAMP contains the
time stamp in struct timeval format.

Reviewed-by: Sowmini Varadhan 
Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/af_rds.c | 26 ++
 net/rds/rds.h|  1 +
 net/rds/recv.c   | 20 ++--
 3 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index b5476aeb..6beaeb1 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -277,6 +277,27 @@ static int rds_set_transport(struct rds_sock *rs, char 
__user *optval,
return rs->rs_transport ? 0 : -ENOPROTOOPT;
 }
 
+static int rds_enable_recvtstamp(struct sock *sk, char __user *optval,
+int optlen)
+{
+   int val, valbool;
+
+   if (optlen != sizeof(int))
+   return -EFAULT;
+
+   if (get_user(val, (int __user *)optval))
+   return -EFAULT;
+
+   valbool = val ? 1 : 0;
+
+   if (valbool)
+   sock_set_flag(sk, SOCK_RCVTSTAMP);
+   else
+   sock_reset_flag(sk, SOCK_RCVTSTAMP);
+
+   return 0;
+}
+
 static int rds_setsockopt(struct socket *sock, int level, int optname,
  char __user *optval, unsigned int optlen)
 {
@@ -312,6 +333,11 @@ static int rds_setsockopt(struct socket *sock, int level, 
int optname,
ret = rds_set_transport(rs, optval, optlen);
release_sock(sock->sk);
break;
+   case SO_TIMESTAMP:
+   lock_sock(sock->sk);
+   ret = rds_enable_recvtstamp(sock->sk, optval, optlen);
+   release_sock(sock->sk);
+   break;
default:
ret = -ENOPROTOOPT;
}
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 0e2797b..80256b0 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -222,6 +222,7 @@ struct rds_incoming {
__be32  i_saddr;
 
rds_rdma_cookie_t   i_rdma_cookie;
+   struct timeval  i_rx_tstamp;
 };
 
 struct rds_mr {
diff --git a/net/rds/recv.c b/net/rds/recv.c
index a00462b..c0be1ec 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -35,6 +35,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "rds.h"
 
@@ -46,6 +48,8 @@ void rds_inc_init(struct rds_incoming *inc, struct 
rds_connection *conn,
inc->i_conn = conn;
inc->i_saddr = saddr;
inc->i_rdma_cookie = 0;
+   inc->i_rx_tstamp.tv_sec = 0;
+   inc->i_rx_tstamp.tv_usec = 0;
 }
 EXPORT_SYMBOL_GPL(rds_inc_init);
 
@@ -228,6 +232,8 @@ void rds_recv_incoming(struct rds_connection *conn, __be32 
saddr, __be32 daddr,
rds_recv_rcvbuf_delta(rs, sk, inc->i_conn->c_lcong,
  be32_to_cpu(inc->i_hdr.h_len),
  inc->i_hdr.h_dport);
+   if (sock_flag(sk, SOCK_RCVTSTAMP))
+   do_gettimeofday(&inc->i_rx_tstamp);
rds_inc_addref(inc);
	list_add_tail(&inc->i_item, &rs->rs_recv_queue);
__rds_wake_sk_sleep(sk);
@@ -381,7 +387,8 @@ static int rds_notify_cong(struct rds_sock *rs, struct 
msghdr *msghdr)
 /*
  * Receive any control messages.
  */
-static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg)
+static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg,
+struct rds_sock *rs)
 {
int ret = 0;
 
@@ -392,6 +399,15 @@ static int rds_cmsg_recv(struct rds_incoming *inc, struct 
msghdr *msg)
return ret;
}
 
+   if ((inc->i_rx_tstamp.tv_sec != 0) &&
+   sock_flag(rds_rs_to_sk(rs), SOCK_RCVTSTAMP)) {
+   ret = put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMP,
+  sizeof(struct timeval),
+  &inc->i_rx_tstamp);
+   if (ret)
+   return ret;
+   }
+
return 0;
 }
 
@@ -474,7 +490,7 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, 
size_t size,
msg->msg_flags |= MSG_TRUNC;
}
 
-   if (rds_cmsg_recv(inc, msg)) {
+   if (rds_cmsg_recv(inc, msg, rs)) {
ret = -EFAULT;
goto out;
}
-- 
1.9.1



[net-next][PATCH 03/13] MAINTAINERS: update RDS entry

2016-02-26 Thread Santosh Shilimkar
Acked-by: Chien Yen 
Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 MAINTAINERS | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 355e1c8..9d79bea 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9081,10 +9081,14 @@ S:  Maintained
 F: drivers/net/ethernet/rdc/r6040.c
 
 RDS - RELIABLE DATAGRAM SOCKETS
-M: Chien Yen 
+M: Santosh Shilimkar 
+L: netdev@vger.kernel.org
+L: linux-r...@vger.kernel.org
 L: rds-de...@oss.oracle.com (moderated for non-subscribers)
+W: https://oss.oracle.com/projects/rds/
 S: Supported
 F: net/rds/
+F: Documentation/networking/rds.txt
 
 READ-COPY UPDATE (RCU)
 M: "Paul E. McKenney" 
-- 
1.9.1



[net-next][PATCH 00/13] RDS: Major clean-up with couple of new features for 4.6

2016-02-26 Thread Santosh Shilimkar
Series is generated against net-next but also applies cleanly against
Linus's tip. The diffstat looks a bit scary since almost ~4K lines of
code are being removed.

Brief summary of the series:

- Drop the stale iWARP support:
RDS iWARP support code has become stale and non-testable for
some time.  As discussed and agreed earlier on the list [1], I am
dropping its support for good. If new iWARP user(s) show up in future,
the plan is to adapt the existing IB RDMA with a special sink case.
- RDS gets SO_TIMESTAMP support
- Long due RDS maintainer entry gets updated
- Some RDS IB code refactoring towards new FastReg Memory registration (FRMR)
- Lastly the initial support for FRMR

RDS IB RDMA performance with FRMR is not yet as good as with FMR, and I
do have some patches in progress to address that. But they are not
ready for 4.6, so I left them out of this series.

Also, I am keeping an eye on the new CQ API adaptations that other ULPs
are doing, and will try to adapt RDS likewise, most likely in the 4.7
timeframe.

Entire patchset is available below git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git 
for_4.6/net-next/rds

Feedback/comments welcome !!

Santosh Shilimkar (12):
  RDS: Drop stale iWARP RDMA transport
  RDS: Add support for SO_TIMESTAMP for incoming messages
  MAINTAINERS: update RDS entry
  RDS: IB: Remove the RDS_IB_SEND_OP dependency
  RDS: IB: Re-organise ibmr code
  RDS: IB: create struct rds_ib_fmr
  RDS: IB: move FMR code to its own file
  RDS: IB: add connection info to ibmr
  RDS: IB: handle the RDMA CM time wait event
  RDS: IB: add mr reused stats
  RDS: IB: add Fastreg MR (FRMR) detection support
  RDS: IB: allocate extra space on queues for FRMR support

Avinash Repaka (1):
  RDS: IB: Support Fastreg MR (FRMR) memory registration mode

 Documentation/networking/rds.txt |   4 +-
 MAINTAINERS  |   6 +-
 net/rds/Kconfig  |   7 +-
 net/rds/Makefile |   4 +-
 net/rds/af_rds.c |  26 ++
 net/rds/ib.c |  51 +-
 net/rds/ib.h |  37 +-
 net/rds/ib_cm.c  |  59 ++-
 net/rds/ib_fmr.c | 248 ++
 net/rds/ib_frmr.c| 376 +++
 net/rds/ib_mr.h  | 148 ++
 net/rds/ib_rdma.c| 492 ++--
 net/rds/ib_send.c|   6 +-
 net/rds/ib_stats.c   |   2 +
 net/rds/iw.c | 312 -
 net/rds/iw.h | 398 
 net/rds/iw_cm.c  | 769 --
 net/rds/iw_rdma.c| 837 -
 net/rds/iw_recv.c| 904 
 net/rds/iw_ring.c| 169 ---
 net/rds/iw_send.c| 981 ---
 net/rds/iw_stats.c   |  95 
 net/rds/iw_sysctl.c  | 123 -
 net/rds/rdma_transport.c |  21 +-
 net/rds/rdma_transport.h |   5 -
 net/rds/rds.h|   1 +
 net/rds/recv.c   |  20 +-
 27 files changed, 1068 insertions(+), 5033 deletions(-)
 create mode 100644 net/rds/ib_fmr.c
 create mode 100644 net/rds/ib_frmr.c
 create mode 100644 net/rds/ib_mr.h
 delete mode 100644 net/rds/iw.c
 delete mode 100644 net/rds/iw.h
 delete mode 100644 net/rds/iw_cm.c
 delete mode 100644 net/rds/iw_rdma.c
 delete mode 100644 net/rds/iw_recv.c
 delete mode 100644 net/rds/iw_ring.c
 delete mode 100644 net/rds/iw_send.c
 delete mode 100644 net/rds/iw_stats.c
 delete mode 100644 net/rds/iw_sysctl.c


Regards,
Santosh

[1] http://www.spinics.net/lists/linux-rdma/msg30769.html

-- 
1.9.1



[net-next][PATCH 04/13] RDS: IB: Remove the RDS_IB_SEND_OP dependency

2016-02-26 Thread Santosh Shilimkar
This helps to combine asynchronous fastreg MR completion handler
with send completion handler.

No functional change.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h  |  1 -
 net/rds/ib_cm.c   | 42 +++---
 net/rds/ib_send.c |  6 ++
 3 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index b3fdebb..09cd8e3 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -28,7 +28,6 @@
 #define RDS_IB_RECYCLE_BATCH_COUNT 32
 
 #define RDS_IB_WC_MAX  32
-#define RDS_IB_SEND_OP BIT_ULL(63)
 
 extern struct rw_semaphore rds_ib_devices_lock;
 extern struct list_head rds_ib_devices;
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index da5a7fb..7f68abc 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -236,12 +236,10 @@ static void rds_ib_cq_comp_handler_recv(struct ib_cq *cq, 
void *context)
	tasklet_schedule(&ic->i_recv_tasklet);
 }
 
-static void poll_cq(struct rds_ib_connection *ic, struct ib_cq *cq,
-   struct ib_wc *wcs,
-   struct rds_ib_ack_state *ack_state)
+static void poll_scq(struct rds_ib_connection *ic, struct ib_cq *cq,
+struct ib_wc *wcs)
 {
-   int nr;
-   int i;
+   int nr, i;
struct ib_wc *wc;
 
while ((nr = ib_poll_cq(cq, RDS_IB_WC_MAX, wcs)) > 0) {
@@ -251,10 +249,7 @@ static void poll_cq(struct rds_ib_connection *ic, struct 
ib_cq *cq,
 (unsigned long long)wc->wr_id, wc->status,
 wc->byte_len, be32_to_cpu(wc->ex.imm_data));
 
-   if (wc->wr_id & RDS_IB_SEND_OP)
-   rds_ib_send_cqe_handler(ic, wc);
-   else
-   rds_ib_recv_cqe_handler(ic, wc, ack_state);
+   rds_ib_send_cqe_handler(ic, wc);
}
}
 }
@@ -263,14 +258,12 @@ static void rds_ib_tasklet_fn_send(unsigned long data)
 {
struct rds_ib_connection *ic = (struct rds_ib_connection *)data;
struct rds_connection *conn = ic->conn;
-   struct rds_ib_ack_state state;
 
rds_ib_stats_inc(s_ib_tasklet_call);
 
-   memset(&state, 0, sizeof(state));
-   poll_cq(ic, ic->i_send_cq, ic->i_send_wc, &state);
+   poll_scq(ic, ic->i_send_cq, ic->i_send_wc);
	ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP);
-   poll_cq(ic, ic->i_send_cq, ic->i_send_wc, &state);
+   poll_scq(ic, ic->i_send_cq, ic->i_send_wc);
 
if (rds_conn_up(conn) &&
	    (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags) ||
@@ -278,6 +271,25 @@ static void rds_ib_tasklet_fn_send(unsigned long data)
rds_send_xmit(ic->conn);
 }
 
+static void poll_rcq(struct rds_ib_connection *ic, struct ib_cq *cq,
+struct ib_wc *wcs,
+struct rds_ib_ack_state *ack_state)
+{
+   int nr, i;
+   struct ib_wc *wc;
+
+   while ((nr = ib_poll_cq(cq, RDS_IB_WC_MAX, wcs)) > 0) {
+   for (i = 0; i < nr; i++) {
+   wc = wcs + i;
+   rdsdebug("wc wr_id 0x%llx status %u byte_len %u 
imm_data %u\n",
+(unsigned long long)wc->wr_id, wc->status,
+wc->byte_len, be32_to_cpu(wc->ex.imm_data));
+
+   rds_ib_recv_cqe_handler(ic, wc, ack_state);
+   }
+   }
+}
+
 static void rds_ib_tasklet_fn_recv(unsigned long data)
 {
struct rds_ib_connection *ic = (struct rds_ib_connection *)data;
@@ -291,9 +303,9 @@ static void rds_ib_tasklet_fn_recv(unsigned long data)
rds_ib_stats_inc(s_ib_tasklet_call);
 
	memset(&state, 0, sizeof(state));
-   poll_cq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
+   poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
	ib_req_notify_cq(ic->i_recv_cq, IB_CQ_SOLICITED);
-   poll_cq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
+   poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
 
if (state.ack_next_valid)
rds_ib_set_ack(ic, state.ack_next, state.ack_required);
diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index eac30bf..f27d2c8 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -195,7 +195,7 @@ void rds_ib_send_init_ring(struct rds_ib_connection *ic)
 
send->s_op = NULL;
 
-   send->s_wr.wr_id = i | RDS_IB_SEND_OP;
+   send->s_wr.wr_id = i;
send->s_wr.sg_list = send->s_sge;
send->s_wr.ex.imm_data = 0;
 
@@ -263,9 +263,7 @@ void rds_ib_send_cqe_handler(struct rds_ib_connection *ic, 
struct ib_wc *wc)
 
	oldest = rds_ib_ring_oldest(&ic->i_send_ring);
 
-   completed = rds_ib_ring_completed(&ic->i_send_ring,
- (wc->wr_id & ~RDS_IB_SEND_OP),
- oldest);
+   completed = 

[net-next][PATCH 08/13] RDS: IB: add connection info to ibmr

2016-02-26 Thread Santosh Shilimkar
Preparatory patch for FRMR support. From the connection info,
we can retrieve the cm_id, which contains the qp handle needed
for work request posting.

We also need to drop the RDS connection on QP error states,
where the connection handle becomes useful.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_mr.h | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index f5c1fcb..add7725 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -50,18 +50,19 @@ struct rds_ib_fmr {
 
 /* This is stored as mr->r_trans_private. */
 struct rds_ib_mr {
-   struct rds_ib_device*device;
-   struct rds_ib_mr_pool   *pool;
+   struct rds_ib_device*device;
+   struct rds_ib_mr_pool   *pool;
+   struct rds_ib_connection*ic;
 
-   struct llist_node   llnode;
+   struct llist_node   llnode;
 
/* unmap_list is for freeing */
-   struct list_headunmap_list;
-   unsigned intremap_count;
+   struct list_headunmap_list;
+   unsigned intremap_count;
 
-   struct scatterlist  *sg;
-   unsigned intsg_len;
-   int sg_dma_len;
+   struct scatterlist  *sg;
+   unsigned intsg_len;
+   int sg_dma_len;
 
union {
struct rds_ib_fmr   fmr;
-- 
1.9.1



[net-next][PATCH 11/13] RDS: IB: add Fastreg MR (FRMR) detection support

2016-02-26 Thread Santosh Shilimkar
Discover Fast Memory Registration (FRMR) support using the IB device
capability flag IB_DEVICE_MEM_MGT_EXTENSIONS. A given HCA might support
just FRMR, just FMR, or both. In case both MR types are supported,
FMR is used by default. Using the module parameter 'prefer_frmr',
the user can choose the preferred MR method for RDS. Of course, the
module parameter has no effect if the HCA supports only FRMR
or only FMR.

Default MR is still kept as FMR, in line with what everyone else
is doing. The default will be changed to FRMR once RDS
performance with FRMR is comparable with FMR. Work on that
is in progress.
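
For completeness, a sketch of how the knob would be used, assuming the module is built as rds_rdma (per the Makefile earlier in this series); illustrative only:

```shell
# Prefer FRMR over FMR when the HCA supports both. The parameter
# is registered with mode 0444, so it is read-only after load.
modprobe rds_rdma prefer_frmr=1
cat /sys/module/rds_rdma/parameters/prefer_frmr
```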

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.c| 14 ++
 net/rds/ib.h|  4 
 net/rds/ib_mr.h |  1 +
 3 files changed, 19 insertions(+)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index bb32cb9..68c94b0 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -47,6 +47,7 @@
 unsigned int rds_ib_mr_1m_pool_size = RDS_MR_1M_POOL_SIZE;
 unsigned int rds_ib_mr_8k_pool_size = RDS_MR_8K_POOL_SIZE;
 unsigned int rds_ib_retry_count = RDS_IB_DEFAULT_RETRY_COUNT;
+bool prefer_frmr;
 
 module_param(rds_ib_mr_1m_pool_size, int, 0444);
 MODULE_PARM_DESC(rds_ib_mr_1m_pool_size, " Max number of 1M mr per HCA");
@@ -54,6 +55,8 @@ module_param(rds_ib_mr_8k_pool_size, int, 0444);
 MODULE_PARM_DESC(rds_ib_mr_8k_pool_size, " Max number of 8K mr per HCA");
 module_param(rds_ib_retry_count, int, 0444);
 MODULE_PARM_DESC(rds_ib_retry_count, " Number of hw retries before reporting 
an error");
+module_param(prefer_frmr, bool, 0444);
+MODULE_PARM_DESC(prefer_frmr, "Preferred MR method if both FMR and FRMR 
supported");
 
 /*
  * we have a clumsy combination of RCU and a rwsem protecting this list
@@ -140,6 +143,13 @@ static void rds_ib_add_one(struct ib_device *device)
rds_ibdev->max_wrs = device->attrs.max_qp_wr;
rds_ibdev->max_sge = min(device->attrs.max_sge, RDS_IB_MAX_SGE);
 
+   rds_ibdev->has_fr = (device->attrs.device_cap_flags &
+ IB_DEVICE_MEM_MGT_EXTENSIONS);
+   rds_ibdev->has_fmr = (device->alloc_fmr && device->dealloc_fmr &&
+   device->map_phys_fmr && device->unmap_fmr);
+   rds_ibdev->use_fastreg = (rds_ibdev->has_fr &&
+(!rds_ibdev->has_fmr || prefer_frmr));
+
rds_ibdev->fmr_max_remaps = device->attrs.max_map_per_fmr?: 32;
rds_ibdev->max_1m_mrs = device->attrs.max_mr ?
min_t(unsigned int, (device->attrs.max_mr / 2),
@@ -178,6 +188,10 @@ static void rds_ib_add_one(struct ib_device *device)
 rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_mrs,
 rds_ibdev->max_8k_mrs);
 
+   pr_info("RDS/IB: %s: %s supported and preferred\n",
+   device->name,
+   rds_ibdev->use_fastreg ? "FRMR" : "FMR");
+
	INIT_LIST_HEAD(&rds_ibdev->ipaddr_list);
	INIT_LIST_HEAD(&rds_ibdev->conn_list);
 
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 62fe7d5..c5eddc2 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -200,6 +200,10 @@ struct rds_ib_device {
struct list_headconn_list;
struct ib_device*dev;
struct ib_pd*pd;
+   boolhas_fmr;
+   boolhas_fr;
+   booluse_fastreg;
+
unsigned intmax_mrs;
struct rds_ib_mr_pool   *mr_1m_pool;
struct rds_ib_mr_pool   *mr_8k_pool;
diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index add7725..2f9b9c3 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -93,6 +93,7 @@ struct rds_ib_mr_pool {
 extern struct workqueue_struct *rds_ib_mr_wq;
 extern unsigned int rds_ib_mr_1m_pool_size;
 extern unsigned int rds_ib_mr_8k_pool_size;
+extern bool prefer_frmr;
 
 struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *rds_dev,
 int npages);
-- 
1.9.1



[net-next][PATCH 12/13] RDS: IB: allocate extra space on queues for FRMR support

2016-02-26 Thread Santosh Shilimkar
Fastreg MR (FRMR) memory registration and invalidation make use
of the work request and completion queues for their operation. This
patch allocates extra queue space for these operations.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h|  4 
 net/rds/ib_cm.c | 16 
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index c5eddc2..eeb0d6c 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -14,6 +14,7 @@
 
 #define RDS_IB_DEFAULT_RECV_WR 1024
 #define RDS_IB_DEFAULT_SEND_WR 256
+#define RDS_IB_DEFAULT_FR_WR   512
 
 #define RDS_IB_DEFAULT_RETRY_COUNT 2
 
@@ -122,6 +123,9 @@ struct rds_ib_connection {
struct ib_wci_send_wc[RDS_IB_WC_MAX];
struct ib_wci_recv_wc[RDS_IB_WC_MAX];
 
+   /* To control the number of wrs from fastreg */
+   atomic_ti_fastreg_wrs;
+
/* interrupt handling */
struct tasklet_struct   i_send_tasklet;
struct tasklet_struct   i_recv_tasklet;
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 7f68abc..83f4673 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -363,7 +363,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
struct ib_qp_init_attr attr;
struct ib_cq_init_attr cq_attr = {};
struct rds_ib_device *rds_ibdev;
-   int ret;
+   int ret, fr_queue_space;
 
/*
 * It's normal to see a null device if an incoming connection races
@@ -373,6 +373,12 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
if (!rds_ibdev)
return -EOPNOTSUPP;
 
+   /* The fr_queue_space is currently set to 512, to add extra space on
+* completion queue and send queue. This extra space is used for FRMR
+* registration and invalidation work requests
+*/
+   fr_queue_space = (rds_ibdev->use_fastreg ? RDS_IB_DEFAULT_FR_WR : 0);
+
/* add the conn now so that connection establishment has the dev */
rds_ib_add_conn(rds_ibdev, conn);
 
@@ -384,7 +390,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
/* Protection domain and memory range */
ic->i_pd = rds_ibdev->pd;
 
-   cq_attr.cqe = ic->i_send_ring.w_nr + 1;
+   cq_attr.cqe = ic->i_send_ring.w_nr + fr_queue_space + 1;
 
ic->i_send_cq = ib_create_cq(dev, rds_ib_cq_comp_handler_send,
 rds_ib_cq_event_handler, conn,
@@ -424,7 +430,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
attr.event_handler = rds_ib_qp_event_handler;
attr.qp_context = conn;
/* + 1 to allow for the single ack message */
-   attr.cap.max_send_wr = ic->i_send_ring.w_nr + 1;
+   attr.cap.max_send_wr = ic->i_send_ring.w_nr + fr_queue_space + 1;
attr.cap.max_recv_wr = ic->i_recv_ring.w_nr + 1;
attr.cap.max_send_sge = rds_ibdev->max_sge;
attr.cap.max_recv_sge = RDS_IB_RECV_SGE;
@@ -432,6 +438,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
attr.qp_type = IB_QPT_RC;
attr.send_cq = ic->i_send_cq;
attr.recv_cq = ic->i_recv_cq;
+	atomic_set(&ic->i_fastreg_wrs, RDS_IB_DEFAULT_FR_WR);
 
/*
 * XXX this can fail if max_*_wr is too large?  Are we supposed
@@ -751,7 +758,8 @@ void rds_ib_conn_shutdown(struct rds_connection *conn)
 */
	wait_event(rds_ib_ring_empty_wait,
		   rds_ib_ring_empty(&ic->i_recv_ring) &&
-		   (atomic_read(&ic->i_signaled_sends) == 0));
+		   (atomic_read(&ic->i_signaled_sends) == 0) &&
+		   (atomic_read(&ic->i_fastreg_wrs) == RDS_IB_DEFAULT_FR_WR));
	tasklet_kill(&ic->i_send_tasklet);
	tasklet_kill(&ic->i_recv_tasklet);
 
-- 
1.9.1



[net-next][PATCH 10/13] RDS: IB: add mr reused stats

2016-02-26 Thread Santosh Shilimkar
Add MR reuse statistics to RDS IB transport.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h   | 2 ++
 net/rds/ib_rdma.c  | 7 ++-
 net/rds/ib_stats.c | 2 ++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index c88cb22..62fe7d5 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -259,6 +259,8 @@ struct rds_ib_statistics {
uint64_ts_ib_rdma_mr_1m_pool_flush;
uint64_ts_ib_rdma_mr_1m_pool_wait;
uint64_ts_ib_rdma_mr_1m_pool_depleted;
+   uint64_ts_ib_rdma_mr_8k_reused;
+   uint64_ts_ib_rdma_mr_1m_reused;
uint64_ts_ib_atomic_cswp;
uint64_ts_ib_atomic_fadd;
 };
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index 20ff191..00e9064 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -188,8 +188,13 @@ struct rds_ib_mr *rds_ib_reuse_mr(struct rds_ib_mr_pool *pool)
	flag = this_cpu_ptr(&clean_list_grace);
	set_bit(CLEAN_LIST_BUSY_BIT, flag);
	ret = llist_del_first(&pool->clean_list);
-   if (ret)
+   if (ret) {
ibmr = llist_entry(ret, struct rds_ib_mr, llnode);
+   if (pool->pool_type == RDS_IB_MR_8K_POOL)
+   rds_ib_stats_inc(s_ib_rdma_mr_8k_reused);
+   else
+   rds_ib_stats_inc(s_ib_rdma_mr_1m_reused);
+   }
 
clear_bit(CLEAN_LIST_BUSY_BIT, flag);
preempt_enable();
diff --git a/net/rds/ib_stats.c b/net/rds/ib_stats.c
index d77e044..7e78dca 100644
--- a/net/rds/ib_stats.c
+++ b/net/rds/ib_stats.c
@@ -73,6 +73,8 @@ static const char *const rds_ib_stat_names[] = {
"ib_rdma_mr_1m_pool_flush",
"ib_rdma_mr_1m_pool_wait",
"ib_rdma_mr_1m_pool_depleted",
+   "ib_rdma_mr_8k_reused",
+   "ib_rdma_mr_1m_reused",
"ib_atomic_cswp",
"ib_atomic_fadd",
 };
-- 
1.9.1



[net-next][PATCH 09/13] RDS: IB: handle the RDMA CM time wait event

2016-02-26 Thread Santosh Shilimkar
Drop the RDS connection on RDMA_CM_EVENT_TIMEWAIT_EXIT so that
it can reconnect and resume.

While testing fastreg, this error happened in a couple of tests but
was going unnoticed.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/rdma_transport.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 4f4b3d8..7220beb 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -117,6 +117,14 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
rds_conn_drop(conn);
break;
 
+   case RDMA_CM_EVENT_TIMEWAIT_EXIT:
+   if (conn) {
+			pr_info("RDS: RDMA_CM_EVENT_TIMEWAIT_EXIT event: dropping connection %pI4->%pI4\n",
+				&conn->c_laddr, &conn->c_faddr);
+   rds_conn_drop(conn);
+   }
+   break;
+
default:
/* things like device disconnect? */
printk(KERN_ERR "RDS: unknown event %u (%s)!\n",
-- 
1.9.1



[net-next][PATCH 13/13] RDS: IB: Support Fastreg MR (FRMR) memory registration mode

2016-02-26 Thread Santosh Shilimkar
From: Avinash Repaka 

Fastreg MR (FRMR) is another method by which one can register memory
with an HCA. Some of the newer HCAs support only fastreg MR mode, so we
need to add support for it in RDS to keep RDS functional on them.

Some of the older HCAs support both FMR and FRMR modes, so to try out
FRMR on older HCAs one can use the module parameter 'prefer_frmr'.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Avinash Repaka 
Signed-off-by: Santosh Shilimkar 
---
RDS IB RDMA performance with FRMR is not yet as good as FMR and I do have
some patches in progress to address that. But they are not ready for 4.6
so I left them out of this series. 

 net/rds/Makefile  |   2 +-
 net/rds/ib.h  |   1 +
 net/rds/ib_cm.c   |   7 +-
 net/rds/ib_frmr.c | 376 ++
 net/rds/ib_mr.h   |  24 
 net/rds/ib_rdma.c |  17 ++-
 6 files changed, 422 insertions(+), 5 deletions(-)
 create mode 100644 net/rds/ib_frmr.c

diff --git a/net/rds/Makefile b/net/rds/Makefile
index bcf5591..0e72bec 100644
--- a/net/rds/Makefile
+++ b/net/rds/Makefile
@@ -6,7 +6,7 @@ rds-y := af_rds.o bind.o cong.o connection.o info.o message.o \
 obj-$(CONFIG_RDS_RDMA) += rds_rdma.o
 rds_rdma-y :=  rdma_transport.o \
ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
-   ib_sysctl.o ib_rdma.o ib_fmr.o
+   ib_sysctl.o ib_rdma.o ib_fmr.o ib_frmr.o
 
 
 obj-$(CONFIG_RDS_TCP) += rds_tcp.o
diff --git a/net/rds/ib.h b/net/rds/ib.h
index eeb0d6c..627fb79 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -349,6 +349,7 @@ int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, 
__be32 ipaddr);
 void rds_ib_add_conn(struct rds_ib_device *rds_ibdev, struct rds_connection 
*conn);
 void rds_ib_remove_conn(struct rds_ib_device *rds_ibdev, struct rds_connection 
*conn);
 void rds_ib_destroy_nodev_conns(void);
+void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc);
 
 /* ib_recv.c */
 int rds_ib_recv_init(void);
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 83f4673..8764970 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -249,7 +249,12 @@ static void poll_scq(struct rds_ib_connection *ic, struct ib_cq *cq,
 (unsigned long long)wc->wr_id, wc->status,
 wc->byte_len, be32_to_cpu(wc->ex.imm_data));
 
-   rds_ib_send_cqe_handler(ic, wc);
+   if (wc->wr_id <= ic->i_send_ring.w_nr ||
+   wc->wr_id == RDS_IB_ACK_WR_ID)
+   rds_ib_send_cqe_handler(ic, wc);
+   else
+   rds_ib_mr_cqe_handler(ic, wc);
+
}
}
 }
diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
new file mode 100644
index 000..a86de13
--- /dev/null
+++ b/net/rds/ib_frmr.c
@@ -0,0 +1,376 @@
+/*
+ * Copyright (c) 2016 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include "ib_mr.h"
+
+static struct rds_ib_mr *rds_ib_alloc_frmr(struct rds_ib_device *rds_ibdev,
+  int npages)
+{
+   struct rds_ib_mr_pool *pool;
+   struct rds_ib_mr *ibmr = NULL;
+   struct rds_ib_frmr *frmr;
+   int err = 0;
+
+   if (npages <= RDS_MR_8K_MSG_SIZE)
+   pool = rds_ibdev->mr_8k_pool;
+   else
+   pool = rds_ibdev->mr_1m_pool;
+
+   ibmr = rds_ib_try_reuse_ibmr(pool);
+   if (ibmr)
+   

[net-next][PATCH 07/13] RDS: IB: move FMR code to its own file

2016-02-26 Thread Santosh Shilimkar
No functional change.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_fmr.c  | 126 +-
 net/rds/ib_mr.h   |   6 +++
 net/rds/ib_rdma.c | 105 ++---
 3 files changed, 133 insertions(+), 104 deletions(-)

diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c
index 74f2c21..4fe8f4f 100644
--- a/net/rds/ib_fmr.c
+++ b/net/rds/ib_fmr.c
@@ -37,61 +37,16 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages)
struct rds_ib_mr_pool *pool;
struct rds_ib_mr *ibmr = NULL;
struct rds_ib_fmr *fmr;
-   int err = 0, iter = 0;
+   int err = 0;
 
if (npages <= RDS_MR_8K_MSG_SIZE)
pool = rds_ibdev->mr_8k_pool;
else
pool = rds_ibdev->mr_1m_pool;
 
-	if (atomic_read(&pool->dirty_count) >= pool->max_items / 10)
-		queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
-
-	/* Switch pools if one of the pool is reaching upper limit */
-	if (atomic_read(&pool->dirty_count) >= pool->max_items * 9 / 10) {
-		if (pool->pool_type == RDS_IB_MR_8K_POOL)
-			pool = rds_ibdev->mr_1m_pool;
-		else
-			pool = rds_ibdev->mr_8k_pool;
-	}
-
-   while (1) {
-   ibmr = rds_ib_reuse_mr(pool);
-   if (ibmr)
-   return ibmr;
-
-   /* No clean MRs - now we have the choice of either
-* allocating a fresh MR up to the limit imposed by the
-* driver, or flush any dirty unused MRs.
-* We try to avoid stalling in the send path if possible,
-* so we allocate as long as we're allowed to.
-*
-* We're fussy with enforcing the FMR limit, though. If the
-* driver tells us we can't use more than N fmrs, we shouldn't
-* start arguing with it
-*/
-		if (atomic_inc_return(&pool->item_count) <= pool->max_items)
-			break;
-
-		atomic_dec(&pool->item_count);
-
-   if (++iter > 2) {
-   if (pool->pool_type == RDS_IB_MR_8K_POOL)
-   rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_depleted);
-   else
-   rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_depleted);
-   return ERR_PTR(-EAGAIN);
-   }
-
-   /* We do have some empty MRs. Flush them out. */
-   if (pool->pool_type == RDS_IB_MR_8K_POOL)
-   rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_wait);
-   else
-   rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_wait);
-		rds_ib_flush_mr_pool(pool, 0, &ibmr);
-   if (ibmr)
-   return ibmr;
-   }
+   ibmr = rds_ib_try_reuse_ibmr(pool);
+   if (ibmr)
+   return ibmr;
 
ibmr = kzalloc_node(sizeof(*ibmr), GFP_KERNEL,
rdsibdev_to_node(rds_ibdev));
@@ -218,3 +173,76 @@ out:
 
return ret;
 }
+
+struct rds_ib_mr *rds_ib_reg_fmr(struct rds_ib_device *rds_ibdev,
+struct scatterlist *sg,
+unsigned long nents,
+u32 *key)
+{
+   struct rds_ib_mr *ibmr = NULL;
+   struct rds_ib_fmr *fmr;
+   int ret;
+
+   ibmr = rds_ib_alloc_fmr(rds_ibdev, nents);
+   if (IS_ERR(ibmr))
+   return ibmr;
+
+   ibmr->device = rds_ibdev;
+	fmr = &ibmr->u.fmr;
+   ret = rds_ib_map_fmr(rds_ibdev, ibmr, sg, nents);
+   if (ret == 0)
+   *key = fmr->fmr->rkey;
+   else
+   rds_ib_free_mr(ibmr, 0);
+
+   return ibmr;
+}
+
+void rds_ib_unreg_fmr(struct list_head *list, unsigned int *nfreed,
+ unsigned long *unpinned, unsigned int goal)
+{
+   struct rds_ib_mr *ibmr, *next;
+   struct rds_ib_fmr *fmr;
+   LIST_HEAD(fmr_list);
+   int ret = 0;
+   unsigned int freed = *nfreed;
+
+	/* String all ib_mr's onto one list and hand them to ib_unmap_fmr */
+	list_for_each_entry(ibmr, list, unmap_list) {
+		fmr = &ibmr->u.fmr;
+		list_add(&fmr->fmr->list, &fmr_list);
+   }
+
+	ret = ib_unmap_fmr(&fmr_list);
+   if (ret)
+   pr_warn("RDS/IB: FMR invalidation failed (err=%d)\n", ret);
+
+   /* Now we can destroy the DMA mapping and unpin any pages */
+   list_for_each_entry_safe(ibmr, next, list, unmap_list) {
+		fmr = &ibmr->u.fmr;
+   *unpinned += ibmr->sg_len;
+   __rds_ib_teardown_mr(ibmr);
+   if (freed < goal ||
+   ibmr->remap_count >= ibmr->pool->fmr_attr.max_maps) {
+   if (ibmr->pool->pool_type 

[net-next][PATCH 06/13] RDS: IB: create struct rds_ib_fmr

2016-02-26 Thread Santosh Shilimkar
Keep FMR-related fields in their own struct. The fastreg MR structure
will be added to the union.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_fmr.c  | 17 ++---
 net/rds/ib_mr.h   | 11 +--
 net/rds/ib_rdma.c | 14 ++
 3 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c
index d4f200d..74f2c21 100644
--- a/net/rds/ib_fmr.c
+++ b/net/rds/ib_fmr.c
@@ -36,6 +36,7 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages)
 {
struct rds_ib_mr_pool *pool;
struct rds_ib_mr *ibmr = NULL;
+   struct rds_ib_fmr *fmr;
int err = 0, iter = 0;
 
if (npages <= RDS_MR_8K_MSG_SIZE)
@@ -99,15 +100,16 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages)
goto out_no_cigar;
}
 
-   ibmr->fmr = ib_alloc_fmr(rds_ibdev->pd,
+	fmr = &ibmr->u.fmr;
+	fmr->fmr = ib_alloc_fmr(rds_ibdev->pd,
			(IB_ACCESS_LOCAL_WRITE |
			 IB_ACCESS_REMOTE_READ |
			 IB_ACCESS_REMOTE_WRITE |
			 IB_ACCESS_REMOTE_ATOMIC),
			&pool->fmr_attr);
-   if (IS_ERR(ibmr->fmr)) {
-   err = PTR_ERR(ibmr->fmr);
-   ibmr->fmr = NULL;
+   if (IS_ERR(fmr->fmr)) {
+   err = PTR_ERR(fmr->fmr);
+   fmr->fmr = NULL;
pr_warn("RDS/IB: %s failed (err=%d)\n", __func__, err);
goto out_no_cigar;
}
@@ -122,8 +124,8 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages)
 
 out_no_cigar:
if (ibmr) {
-   if (ibmr->fmr)
-   ib_dealloc_fmr(ibmr->fmr);
+   if (fmr->fmr)
+   ib_dealloc_fmr(fmr->fmr);
kfree(ibmr);
}
	atomic_dec(&pool->item_count);
@@ -134,6 +136,7 @@ int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct rds_ib_mr *ibmr,
   struct scatterlist *sg, unsigned int nents)
 {
struct ib_device *dev = rds_ibdev->dev;
+	struct rds_ib_fmr *fmr = &ibmr->u.fmr;
struct scatterlist *scat = sg;
u64 io_addr = 0;
u64 *dma_pages;
@@ -190,7 +193,7 @@ int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct rds_ib_mr *ibmr,
(dma_addr & PAGE_MASK) + j;
}
 
-   ret = ib_map_phys_fmr(ibmr->fmr, dma_pages, page_cnt, io_addr);
+   ret = ib_map_phys_fmr(fmr->fmr, dma_pages, page_cnt, io_addr);
if (ret)
goto out;
 
diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index d88724f..309ad59 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -43,11 +43,15 @@
 #define RDS_MR_8K_SCALE(256 / (RDS_MR_8K_MSG_SIZE + 1))
 #define RDS_MR_8K_POOL_SIZE(RDS_MR_8K_SCALE * (8192 / 2))
 
+struct rds_ib_fmr {
+   struct ib_fmr   *fmr;
+   u64 *dma;
+};
+
 /* This is stored as mr->r_trans_private. */
 struct rds_ib_mr {
struct rds_ib_device*device;
struct rds_ib_mr_pool   *pool;
-   struct ib_fmr   *fmr;
 
struct llist_node   llnode;
 
@@ -57,8 +61,11 @@ struct rds_ib_mr {
 
struct scatterlist  *sg;
unsigned intsg_len;
-   u64 *dma;
int sg_dma_len;
+
+   union {
+   struct rds_ib_fmr   fmr;
+   } u;
 };
 
 /* Our own little MR pool */
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index c594519..9e608d9 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -334,6 +334,7 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
 int free_all, struct rds_ib_mr **ibmr_ret)
 {
struct rds_ib_mr *ibmr, *next;
+   struct rds_ib_fmr *fmr;
struct llist_node *clean_nodes;
struct llist_node *clean_tail;
LIST_HEAD(unmap_list);
@@ -395,8 +396,10 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
goto out;
 
/* String all ib_mr's onto one list and hand them to ib_unmap_fmr */
-	list_for_each_entry(ibmr, &unmap_list, unmap_list)
-		list_add(&ibmr->fmr->list, &fmr_list);
+	list_for_each_entry(ibmr, &unmap_list, unmap_list) {
+		fmr = &ibmr->u.fmr;
+		list_add(&fmr->fmr->list, &fmr_list);
+   }
 
	ret = ib_unmap_fmr(&fmr_list);
if (ret)
@@ -405,6 +408,7 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
/* Now we can destroy the DMA mapping and unpin any pages */
list_for_each_entry_safe(ibmr, next, _list, unmap_list) {
unpinned += ibmr->sg_len;
+		fmr = &ibmr->u.fmr;
__rds_ib_teardown_mr(ibmr);
if (nfreed < free_goal ||
ibmr->remap_count >= pool->fmr_attr.max_maps) {
@@ 

Re: [PATCH v4 net-next] net: Implement fast csum_partial for x86_64

2016-02-26 Thread Linus Torvalds
On Fri, Feb 26, 2016 at 2:52 PM, Alexander Duyck
 wrote:
>
> I'm still not a fan of the unaligned reads.  They may be okay but it
> just seems like we are going run into corner cases all over the place
> where this ends up biting us.

No.

Unaligned reads are not just "ok".

The fact is, not doing unaligned reads is just stupid.

Guys, the RISC people tried the whole "only do aligned crap". It was a
mistake. It's stupid. It's wrong.

Every single successful remaining RISC architecture learnt from their
mistakes. That should tell you something.

It should tell you that the people who tried to teach you that
unaligned reads were bad were charlatans.

It's *much* better to do unaligned reads in software and let hardware
sort it out than to try to actively avoid them.

On x86, unaligned reads have never even been expensive (except back in
the dark days when people started doing vector extensions and got them
wrong - those people learnt their lesson too).

And on other architectures, that historically got this wrong (ARM got
it *really* wrong originally), sanity eventually prevailed. So there
isn't a single relevant architecture left where it would make sense to
do extra work in order to only do aligned reads.

The whole "unaligned reads are bad" ship sailed long ago, and it sank.
Let it be.

Linus


Re: [PATCH] IPIP tunnel performance improvement

2016-02-26 Thread zhao ya

BTW, before kernel version 3.5, the source code contained this logic.
In 2.6.32, for example, the arp_bind_neighbour function has the
following logic:

	__be32 nexthop = ((struct rtable *)dst)->rt_gateway;
	if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
		nexthop = 0;
	n = __neigh_lookup_errno(
	...

zhao ya said, at 2/27/2016 12:40 PM:
> From: Zhao Ya 
> Date: Sat, 27 Feb 2016 10:06:44 +0800
> Subject: [PATCH] IPIP tunnel performance improvement
> 
> Bypass the per-packet neighbour-creation logic when using a
> point-to-point or loopback device.
> 
> Recently, in our tests, we met a performance problem.
> When a large number of packets with different target IP addresses go
> through an ipip tunnel, PPS decreases sharply.
> 
> The output of perf top is as follows; __write_lock_failed is first:
>   - 5.89% [kernel]  [k] __write_lock_failed
>    - __write_lock_failed
>    - _raw_write_lock_bh
>    - __neigh_create
>    - ip_finish_output
>    - ip_output
>    - ip_local_out
> 
> The neighbour subsystem will create a neighbour object for each target
> when using a point-to-point device. When massive amounts of packets
> with different target IP addresses are to be xmitted through a
> point-to-point device, these packets will hit a bottleneck at
> write_lock_bh(&tbl->lock) after creating the neighbour object and then
> inserting it into a hash table at the same time.
> 
> This patch corrects that. Only one or a small number of neighbour
> objects will be created when massive amounts of packets with different
> target IP addresses go through an ipip tunnel.
> 
> As a result, performance will be improved.
> 
> 
> Signed-off-by: Zhao Ya 
> Signed-off-by: Zhaoya 
> ---
>  net/ipv4/ip_output.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 64878ef..d7c0594 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -202,6 +202,8 @@ static int ip_finish_output2(struct net *net, struct sock 
> *sk, struct sk_buff *s
>  
>   rcu_read_lock_bh();
>   nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
> + if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
> + nexthop = 0;
>   neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
>   if (unlikely(!neigh))
>   neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
> 
> 


Re: [PATCH] appletalk: Pass IP-over-DDP packets through when 'ipddp0' interface is not present

2016-02-26 Thread Adam Seering
On Thu, 2016-02-25 at 19:46 -0500, Adam Seering wrote:
> On Thu, 2016-02-25 at 14:33 -0500, David Miller wrote:
> > From: Adam Seering 
> > Date: Tue, 23 Feb 2016 09:19:13 -0500
> > 
> > > Let userspace programs transmit and receive raw IP-over-DDP
> > > packets
> > > with a kernel where "ipddp" was compiled as a module but is not
> > loaded
> > > (so no "ipddp0" network interface is exposed).  This makes the
> > "module
> > > is not loaded" behavior match the "module was never compiled"
> > behavior.
> > > 
> > > Signed-off-by: Adam Seering 
> > 
> > I think a better approach is to somehow autoload the module.
> 
> Could you elaborate?  Specifically: the kernel currently suppresses
> packets on behalf of the module even after the module is unloaded. 
>  How
> would autoloading the module help with that?

Re-reading this thread -- perhaps I didn't explain the problem well. 
 Let me elaborate.  Apologies if this is obvious to folks here:

I want my userspace program to send and receive DDP packets that
encapsulate IP traffic.

Problem:  On some kernel builds, these DDP packets are never delivered
to the DDP socket opened by my program.

The "ipddp" module is supposed to prevent those packets from being
delivered to DDP sockets when it is loaded -- it handles them itself. 
 Ok, that's fine; I just want to unload that module, right?

Wrong!  Unloading the module is not sufficient.  I have to re-compile
the kernel with the module disabled completely.  (No other config
options; simply setting the module to not build does the trick.)
whose sole purpose is to handle it.  If not, unload it.  This patch
makes that happen.  Thoughts?

Thanks,
Adam




[PATCH] IPIP tunnel performance improvement

2016-02-26 Thread zhao ya
From: Zhao Ya 
Date: Sat, 27 Feb 2016 10:06:44 +0800
Subject: [PATCH] IPIP tunnel performance improvement

Bypass the per-packet neighbour-creation logic when using a
point-to-point or loopback device.

Recently, in our tests, we met a performance problem.
When a large number of packets with different target IP addresses go
through an ipip tunnel, PPS decreases sharply.

The output of perf top is as follows; __write_lock_failed is first:
  - 5.89% [kernel]  [k] __write_lock_failed
   - __write_lock_failed
   - _raw_write_lock_bh
   - __neigh_create
   - ip_finish_output
   - ip_output
   - ip_local_out

The neighbour subsystem will create a neighbour object for each target
when using a point-to-point device. When massive amounts of packets with
different target IP addresses are to be xmitted through a point-to-point
device, these packets will hit a bottleneck at write_lock_bh(&tbl->lock)
after creating the neighbour object and then inserting it into a hash
table at the same time.

This patch corrects that. Only one or a small number of neighbour
objects will be created when massive amounts of packets with different
target IP addresses go through an ipip tunnel.

As a result, performance will be improved.


Signed-off-by: Zhao Ya 
Signed-off-by: Zhaoya 
---
 net/ipv4/ip_output.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 64878ef..d7c0594 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -202,6 +202,8 @@ static int ip_finish_output2(struct net *net, struct sock 
*sk, struct sk_buff *s
 
rcu_read_lock_bh();
nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
+   if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
+   nexthop = 0;
neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
if (unlikely(!neigh))
	neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);




Re: [net-next PATCH v3 1/3] net: sched: consolidate offload decision in cls_u32

2016-02-26 Thread John Fastabend
On 16-02-26 09:39 AM, Cong Wang wrote:
> On Fri, Feb 26, 2016 at 7:53 AM, John Fastabend
>  wrote:
>> The offload decision was originally very basic and tied to if the dev
>> implemented the appropriate ndo op hook. The next step is to allow
>> the user to more flexibly define whether any particular rule should be
>> offloaded or not. In order to have this logic in one function lift
>> the current check into a helper routine tc_should_offload().
>>
>> Signed-off-by: John Fastabend 
>> ---
>>  include/net/pkt_cls.h |5 +
>>  net/sched/cls_u32.c   |8 
>>  2 files changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
>> index 2121df5..e64d20b 100644
>> --- a/include/net/pkt_cls.h
>> +++ b/include/net/pkt_cls.h
>> @@ -392,4 +392,9 @@ struct tc_cls_u32_offload {
>> };
>>  };
>>
>> +static inline bool tc_should_offload(struct net_device *dev)
>> +{
>> +   return dev->netdev_ops->ndo_setup_tc;
>> +}
>> +
> 
> These should be protected by CONFIG_NET_CLS_U32, no?
> 

It's not necessary; it is a completely general function, and I only
lifted it out of cls_u32 so that the cls_flower classifier could
also use it.

I don't see the need off-hand to have it wrapped in an OR'd ifdef
statement where it's (CONFIG_NET_CLS_U32 | CONFIG_NET_CLS_X ...).
Any particular reason you were thinking it should be wrapped in ifdefs?

Thanks for taking a look at the patches.

.John


Re: [PATCH v4 net-next] net: Implement fast csum_partial for x86_64

2016-02-26 Thread Alexander Duyck
On Fri, Feb 26, 2016 at 7:11 PM, Tom Herbert  wrote:
> On Fri, Feb 26, 2016 at 2:52 PM, Alexander Duyck
>  wrote:
>> On Fri, Feb 26, 2016 at 12:03 PM, Tom Herbert  wrote:
>>> This patch implements performant csum_partial for x86_64. The intent is
>>> to speed up checksum calculation, particularly for smaller lengths such
>>> as those that are present when doing skb_postpull_rcsum when getting
>>> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY conversion.
>>>
>>> - v4
>>>- went back to C code with inline assembly for critical routines
>>>- implemented suggestion from Linus to deal with lengths < 8
>>>
>>> Testing:
>>>
>>> Correctness:
>>>
>>> Verified correctness by testing arbitrary length buffer filled with
>>> random data. For each buffer I compared the computed checksum
>>> using the original algorithm for each possible alignment (0-7 bytes).
>>>
>>> Performance:
>>>
>>> Isolating old and new implementation for some common cases:
>>>
>>>          Old      New      %
>>> Len/Aln  nsecs    nsecs    Improv
>>> -------+--------+--------+-------
>>> 1400/0   195.6    181.7    7%   (Big packet)
>>> 40/0     11.4     6.2      45%  (Ipv6 hdr cmn case)
>>> 8/4      7.9      3.2      59%  (UDP, VXLAN in IPv4)
>>> 14/0     8.9      5.9      33%  (Eth hdr)
>>> 14/4     9.2      5.9      35%  (Eth hdr in IPv4)
>>> 14/3     9.6      5.9      38%  (Eth with odd align)
>>> 20/0     9.0      6.2      31%  (IP hdr without options)
>>> 7/1      8.9      4.2      52%  (buffer in one quad)
>>> 100/0    17.4     13.9     20%  (medium-sized pkt)
>>> 100/2    17.8     14.2     20%  (medium-sized pkt w/ alignment)
>>>
>>> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>>>
>>> Also tested on these with similar results:
>>>
>>> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
>>> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>>>
>>> Branch  prediction:
>>>
>>> To test the effects of poor branch prediction in the jump tables I
>>> tested checksum performance with runs for two combinations of length
>>> and alignment. As the baseline I performed the test by doing half of
>>> calls with the first combination, followed by using the second
>>> combination for the second half. In the test case, I interleave the
>>> two combinations so that in every call the length and alignment are
>>> different to defeat the effects of branch prediction. Running several
>>> cases, I did not see any material performance difference between the
>>> two scenarios (perf stat output is below), neither does either case
>>> show a significant number of branch misses.
>>>
>>> Interleave lengths case:
>>>
>>> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>>> ./csum -M new-thrash -I -l 100 -S 24 -a 1 -c 1
>>>
>>>  Performance counter stats for './csum -M new-thrash -I -l 100 -S 24 -a 1 
>>> -c 1' (10 runs):
>>>
>>>  9,556,693,202  instructions   ( +-  0.00% )
>>>  1,176,208,640   branches   
>>>   ( +-  0.00% )
>>> 19,487   branch-misses#0.00% of all 
>>> branches  ( +-  6.07% )
>>>
>>>2.049732539 seconds time elapsed
>>>
>>> Non-interleave case:
>>>
>>> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>>>  ./csum -M new-thrash -l 100 -S 24 -a 1 -c 1
>>>
>>> Performance counter stats for './csum -M new-thrash -l 100 -S 24 -a 1 -c 
>>> 1' (10 runs):
>>>
>>>  9,782,188,310  instructions   ( +-  0.00% )
>>>  1,251,286,958   branches   
>>>   ( +-  0.01% )
>>> 18,950   branch-misses#0.00% of all 
>>> branches  ( +- 12.74% )
>>>
>>>2.271789046 seconds time elapsed
>>>
>>> Signed-off-by: Tom Herbert 
>>> ---
>>>  arch/x86/include/asm/checksum_64.h |  21 
>>>  arch/x86/lib/csum-partial_64.c | 225 
>>> -
>>>  2 files changed, 143 insertions(+), 103 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/checksum_64.h 
>>> b/arch/x86/include/asm/checksum_64.h
>>> index cd00e17..e20c35b 100644
>>> --- a/arch/x86/include/asm/checksum_64.h
>>> +++ b/arch/x86/include/asm/checksum_64.h
>>> @@ -188,6 +188,27 @@ static inline unsigned add32_with_carry(unsigned a, 
>>> unsigned b)
>>> return a;
>>>  }
>>>
>>> +static inline unsigned long add64_with_carry(unsigned long a, unsigned 
>>> long b)
>>> +{
>>> +   asm("addq %2,%0\n\t"
>>> +   "adcq $0,%0"
>>> +   : "=r" (a)
>>> +   : "0" (a), "rm" (b));
>>> +   return a;
>>> +}
>>> +
>>> +static inline unsigned int add32_with_carry3(unsigned int a, unsigned int 
>>> b,
>>> +unsigned int c)
>>> +{
>>> +   

Re: [PATCH V2 11/12] net-next: mediatek: add Kconfig and Makefile

2016-02-26 Thread kbuild test robot
Hi John,

[auto build test ERROR on net/master]
[also build test ERROR on v4.5-rc5 next-20160226]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improving the system]

url:
https://github.com/0day-ci/linux/commits/John-Crispin/net-next-mediatek-add-ethernet-driver/20160226-223245
config: arm64-allmodconfig (attached as .config)
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=arm64 

All error/warnings (new ones prefixed by >>):

   drivers/net/ethernet/mediatek/mtk_eth_soc.c: In function 'mtk_init_fq_dma':
>> drivers/net/ethernet/mediatek/mtk_eth_soc.c:771:22: warning: passing 
>> argument 3 of 'dma_alloc_coherent' from incompatible pointer type
 eth->scratch_ring = dma_alloc_coherent(eth->dev,
 ^
   In file included from drivers/net/ethernet/mediatek/mtk_eth_soc.c:18:0:
   include/linux/dma-mapping.h:396:21: note: expected 'dma_addr_t *' but 
argument is of type 'unsigned int *'
static inline void *dma_alloc_coherent(struct device *dev, size_t size,
^
   drivers/net/ethernet/mediatek/mtk_eth_soc.c: In function 'mtk_probe':
>> drivers/net/ethernet/mediatek/mtk_eth_soc.c:2059:2: warning: ignoring return 
>> value of 'device_reset', declared with attribute warn_unused_result 
>> [-Wunused-result]
 device_reset(>dev);
 ^
--
   drivers/net/ethernet/mediatek/ethtool.c: In function 'mtk_set_settings':
>> drivers/net/ethernet/mediatek/ethtool.c:49:38: error: 'struct phy_device' 
>> has no member named 'addr'
 if (cmd->phy_address != mac->phy_dev->addr) {
 ^
>> drivers/net/ethernet/mediatek/ethtool.c:54:23: error: 'struct mii_bus' has 
>> no member named 'phy_map'
  mac->hw->mii_bus->phy_map[cmd->phy_address]) {
  ^
   drivers/net/ethernet/mediatek/ethtool.c:56:21: error: 'struct mii_bus' has 
no member named 'phy_map'
mac->hw->mii_bus->phy_map[cmd->phy_address];
^

vim +49 drivers/net/ethernet/mediatek/ethtool.c

79b0e682 John Crispin 2016-02-26  43  {
79b0e682 John Crispin 2016-02-26  44struct mtk_mac *mac = netdev_priv(dev);
79b0e682 John Crispin 2016-02-26  45  
79b0e682 John Crispin 2016-02-26  46if (!mac->phy_dev)
79b0e682 John Crispin 2016-02-26  47return -ENODEV;
79b0e682 John Crispin 2016-02-26  48  
79b0e682 John Crispin 2016-02-26 @49        if (cmd->phy_address != mac->phy_dev->addr) {
79b0e682 John Crispin 2016-02-26  50                if (mac->hw->phy->phy_node[cmd->phy_address]) {
79b0e682 John Crispin 2016-02-26  51                        mac->phy_dev = mac->hw->phy->phy[cmd->phy_address];
79b0e682 John Crispin 2016-02-26  52                        mac->phy_flags = MTK_PHY_FLAG_PORT;
79b0e682 John Crispin 2016-02-26  53                } else if (mac->hw->mii_bus &&
79b0e682 John Crispin 2016-02-26 @54                           mac->hw->mii_bus->phy_map[cmd->phy_address]) {
79b0e682 John Crispin 2016-02-26  55                        mac->phy_dev =
79b0e682 John Crispin 2016-02-26  56                                mac->hw->mii_bus->phy_map[cmd->phy_address];
79b0e682 John Crispin 2016-02-26  57                        mac->phy_flags = MTK_PHY_FLAG_ATTACH;

:: The code at line 49 was first introduced by commit
:: 79b0e682b3b2ed2a983b0263c6b8b3af61fdbf8e net-next: mediatek: add the drivers core files

:: TO: John Crispin <blo...@openwrt.org>
:: CC: 0day robot <fengguang...@intel.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


.config.gz
Description: Binary data


Re: [PATCH net-next 7/9] net: dsa: mv88e6xxx: restore VLANTable map control

2016-02-26 Thread Andrew Lunn
On Fri, Feb 26, 2016 at 10:47:38PM +, Kevin Smith wrote:
> Hi Andrew,
> 
> On 02/26/2016 04:35 PM, Andrew Lunn wrote:
> > On Fri, Feb 26, 2016 at 10:12:28PM +, Kevin Smith wrote:
> >> Hi Vivien, Andrew,
> >>
> >> On 02/26/2016 03:37 PM, Vivien Didelot wrote:
> >>> Here, 5 is the CPU port and 6 is a DSA port.
> >>>
> >>> After joining ports 0, 1, 2 in the same bridge, we end up with:
> >>>
> >>> Port  0  1  2  3  4  5  6
> >>> 0   -  *  *  -  -  *  *
> >>> 1   *  -  *  -  -  *  *
> >>> 2   *  *  -  -  -  *  *
> >>> 3   -  -  -  -  -  *  *
> >>> 4   -  -  -  -  -  *  *
> >>> 5   *  *  *  *  *  -  *
> >>> 6   *  *  *  *  *  *  -
> >> The case I am concerned about is if the switch connected over DSA in
> >> this example has a WAN port on it, which can legitimately route to the
> >> CPU on port 5 but should not route to the LAN ports 0, 1, and 2.  Does
> >> this VLAN allow direct communication between the WAN and LAN?  Or is
> >> this prevented by DSA or some other mechanism?
> > A typical WIFI access point with a connection to a cable modem.
> >
> > So in linux you have interfaces like
> >
> > lan0, lan1, lan2, lan3, wan0
> >
> > DSA provides you these interface. And by default they are all
> > separated. There is no path between them. You can consider them as
> > being separate physical ethernet cards, just like all other interfaces
> > in linux.
> >
> > What you would typically do is:
> >
> > brctl addbr br0
> > brctl addif br0 lan0
> > brctl addif br0 lan1
> > brctl addif br0 lan2
> > brctl addif br0 lan3
> >
> > to create a bridge between the lan ports. The linux kernel will then
> > push this bridge configuration down into the hardware, so the switch
> > can forward frames between these ports.
> >
> > The wan port is not part of the bridge, so there is no L2 path to the
> > WAN port. You need to do IP routing on the CPU.
> >
> > Linux takes the stance that switch ports interfaces should act just
> > like any other linux interface and you configure them in the normal
> > linux way.
> >
> >  Andrew
> 
> Thanks for the explanation.  I am a bit befuddled by the combination of 
> all the possible configurations of the switch and how they interact with 
> Linux.  :)  I think I understand what is happening now.

You might also be looking at this the wrong way around. It is best to
think of the switch as a hardware accelerator. It offers functions to
the linux network stack to accelerate part of the linux network
stack. We only push out to the hardware functions it is capable of
accelerating. What it cannot accelerate stays in software. Think of it
as a GPU, but for networking...

  Andrew


Re: [PATCH v4 net-next] net: Implement fast csum_partial for x86_64

2016-02-26 Thread Tom Herbert
On Fri, Feb 26, 2016 at 2:52 PM, Alexander Duyck
 wrote:
> On Fri, Feb 26, 2016 at 12:03 PM, Tom Herbert  wrote:
>> This patch implements performant csum_partial for x86_64. The intent is
>> to speed up checksum calculation, particularly for smaller lengths such
>> as those that are present when doing skb_postpull_rcsum when getting
>> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY conversion.
>>
>> - v4
>>- went back to C code with inline assembly for critical routines
>>- implemented suggestion from Linus to deal with lengths < 8
>>
>> Testing:
>>
>> Correctness:
>>
>> Verified correctness by testing arbitrary length buffer filled with
>> random data. For each buffer I compared the computed checksum
>> using the original algorithm for each possible alignment (0-7 bytes).
>>
>> Performance:
>>
>> Isolating old and new implementation for some common cases:
>>
>>           Old      New      %
>> Len/Aln   nsecs    nsecs    Improv
>> ---------+--------+--------+-------
>> 1400/0    195.6    181.7    7%     (Big packet)
>> 40/0      11.4     6.2      45%    (IPv6 hdr cmn case)
>> 8/4       7.9      3.2      59%    (UDP, VXLAN in IPv4)
>> 14/0      8.9      5.9      33%    (Eth hdr)
>> 14/4      9.2      5.9      35%    (Eth hdr in IPv4)
>> 14/3      9.6      5.9      38%    (Eth with odd align)
>> 20/0      9.0      6.2      31%    (IP hdr without options)
>> 7/1       8.9      4.2      52%    (buffer in one quad)
>> 100/0     17.4     13.9     20%    (medium-sized pkt)
>> 100/2     17.8     14.2     20%    (medium-sized pkt w/ alignment)
>>
>> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>>
>> Also tested on these with similar results:
>>
>> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
>> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>>
>> Branch  prediction:
>>
>> To test the effects of poor branch prediction in the jump tables I
>> tested checksum performance with runs for two combinations of length
>> and alignment. As the baseline I performed the test by doing half of
>> calls with the first combination, followed by using the second
>> combination for the second half. In the test case, I interleave the
>> two combinations so that in every call the length and alignment are
>> different to defeat the effects of branch prediction. Running several
>> cases, I did not see any material performance difference between the
>> two scenarios (perf stat output is below), neither does either case
>> show a significant number of branch misses.
>>
>> Interleave lengths case:
>>
>> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>> ./csum -M new-thrash -I -l 100 -S 24 -a 1 -c 1
>>
>>  Performance counter stats for './csum -M new-thrash -I -l 100 -S 24 -a 1 -c 
>> 1' (10 runs):
>>
>>  9,556,693,202  instructions                                  ( +-  0.00% )
>>  1,176,208,640  branches                                      ( +-  0.00% )
>>         19,487  branch-misses    #  0.00% of all branches     ( +-  6.07% )
>>
>>2.049732539 seconds time elapsed
>>
>> Non-interleave case:
>>
>> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>>  ./csum -M new-thrash -l 100 -S 24 -a 1 -c 1
>>
>> Performance counter stats for './csum -M new-thrash -l 100 -S 24 -a 1 -c 
>> 1' (10 runs):
>>
>>  9,782,188,310  instructions                                  ( +-  0.00% )
>>  1,251,286,958  branches                                      ( +-  0.01% )
>>         18,950  branch-misses    #  0.00% of all branches     ( +- 12.74% )
>>
>>2.271789046 seconds time elapsed
>>
>> Signed-off-by: Tom Herbert 
>> ---
>>  arch/x86/include/asm/checksum_64.h |  21 
>>  arch/x86/lib/csum-partial_64.c | 225 
>> -
>>  2 files changed, 143 insertions(+), 103 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/checksum_64.h 
>> b/arch/x86/include/asm/checksum_64.h
>> index cd00e17..e20c35b 100644
>> --- a/arch/x86/include/asm/checksum_64.h
>> +++ b/arch/x86/include/asm/checksum_64.h
>> @@ -188,6 +188,27 @@ static inline unsigned add32_with_carry(unsigned a, 
>> unsigned b)
>> return a;
>>  }
>>
>> +static inline unsigned long add64_with_carry(unsigned long a, unsigned long 
>> b)
>> +{
>> +   asm("addq %2,%0\n\t"
>> +   "adcq $0,%0"
>> +   : "=r" (a)
>> +   : "0" (a), "rm" (b));
>> +   return a;
>> +}
>> +
>> +static inline unsigned int add32_with_carry3(unsigned int a, unsigned int b,
>> +unsigned int c)
>> +{
>> +   asm("addl %2,%0\n\t"
>> +   "adcl %3,%0\n\t"
>> +   "adcl $0,%0"
>> +   : "=r" (a)
>> +   : "0" (a), "rm" (b), "rm" (c));
>> +
>> +   return a;
>> +}
>> +
>>  #define 
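As a portable illustration of what the add-with-carry primitives above compute, the end-around-carry folding of a wide checksum accumulator can be sketched in plain C. This is only a sketch with illustrative names, not the kernel's implementation (which relies on x86 adc instructions):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 64-bit running sum down to a 16-bit one's-complement
 * checksum, adding carries back in (end-around carry) at each step. */
static uint16_t csum_fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffULL) + (sum >> 32);
	sum = (sum & 0xffffffffULL) + (sum >> 32);	/* absorb carry-out */
	sum = (sum & 0xffffULL) + (sum >> 16);
	sum = (sum & 0xffffULL) + (sum >> 16);
	return (uint16_t)sum;
}

/* Reference 16-bit accumulator over an even-length buffer,
 * little-endian word order, no alignment tricks. */
static uint16_t csum_simple(const uint8_t *buf, size_t len)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint16_t)(buf[i] | (buf[i + 1] << 8));
	return csum_fold64(sum);
}
```

A reference routine like this is also how the patch's correctness testing can be done: compare the optimized result against the naive accumulator over random buffers and alignments.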

Re: [PATCH 3/4] net: ipv4: tcp_probe: Replace timespec with timespec64

2016-02-26 Thread Deepa Dinamani
On Thu, Feb 25, 2016 at 8:31 PM, Arnd Bergmann  wrote:
> On Wednesday 24 February 2016 23:07:10 Deepa Dinamani wrote:
>> TCP probe log timestamps use struct timespec which is
>> not y2038 safe. Even though timespec might be good enough here
>> as it is used to represent delta time, the plan is to get rid
>> of all uses of timespec in the kernel.
>> Replace with struct timespec64 which is y2038 safe.
>>
>> Prints still use unsigned long format and type.
>> This is because long is 64 bit on 64 bit systems and 32 bit on
>> 32 bit systems. Hence, time64_t(64 bit signed number) does not
>> have a specifier that matches on both architectures.
>
> Actually time64_t is always 'long long', but tv_sec is time_t
> (long) instead of time64_t on 64-bit architectures.
>
> Using a %ll format string and a cast to s64 would work as well,
> but as you say above, it's not important here.

You are right. A cast to u64 would work as well.
I missed that, under all current data models, long long has the same
size as long on 64-bit architectures.

I will leave the prints to be in long format.
But, will reword the commit text in v2.

Thanks,
-Deepa
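The format-specifier issue discussed above can be sketched in userspace C (illustrative names, not kernel code): casting the seconds value to long long and printing with %lld is correct whether the underlying tv_sec type is long or long long, on both 32- and 64-bit targets.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the kernel's y2038-safe seconds type. */
typedef int64_t time64_t;

/* Portable printing: cast to long long and use %lld, avoiding any
 * dependence on whether the field is declared long or long long. */
static int format_ts(char *buf, size_t n, time64_t sec, long nsec)
{
	return snprintf(buf, n, "%lld.%09ld", (long long)sec, nsec);
}
```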


ip v6 routing behavior difference between linux 3.4 and linux 3.18

2016-02-26 Thread Ani Sinha
Hi guys,

I am a little puzzled with a behavior difference I see between linux
3.4 and linux 3.18. Here's my setup where the numbers in hex are ipv6
addresses of the interfaces in parenthesis :

fd7a:629f:52a4:fffd::1 (lo0)
  ∣
  ∣
 fd7a:629f:52a4:fffe::1 (vlan_dev1)
 ∣ linux box 2 (unit under test)
 ---
  ∣   linux box1 (Test Driver)
  ∣
 fd7a:629f:52a4:fffe::2 (e0)

Linux box2 is running linux kernel 3.4. Linux box1 is running linux
kernel 3.18.

I am running a small test script on box1 where I try to ping the
loopback interface. Before I do that, I set up a static route for
loopback device lo on box1, something like this :

fd7a:629f:52a4:fffd::1 via fd7a:629f:52a4:fffe::1 dev e0  metric 1024

Then I bring down the real device under the vlan_dev1 interface on
box2. The ping to loopback fails. So far so good.

Now I bring the real device under vlan_dev1 back up. This time, the
ping6 from box1 to lo0 keeps failing with "destination unreachable: no
route". I don't understand why the ping would fail even with a static
route programmed. I have also noticed that when I ping6 vlan_dev1 from
box1 and then ping6 lo0 from box1, the ping6 to lo0 then succeeds.
Alternatively, if I ping6 e0 from box2, then ping6 from box1 to lo0,
it succeeds.

Now as another experiment data point, I run linux kernel 3.4 on box1.
The behavior is slightly different.  The moment I bring back up the
underlying device for vlan_dev1, the pings succeed right away without
any tinkering. I don't understand why there is this subtle difference
in behavior between the two kernels.

Any pointers would be greatly appreciated.

thanks
ani


Re: [PATCH net-next 1/5] vxlan: implement GPE in L2 mode

2016-02-26 Thread Tom Herbert
On Thu, Feb 25, 2016 at 11:48 PM, Jiri Benc  wrote:
> Implement VXLAN-GPE. Only L2 mode (i.e. encapsulated Ethernet frame) is
> supported by this patch.
>
> L3 mode will be added by subsequent patches.
>
> Signed-off-by: Jiri Benc 
> ---
>  drivers/net/vxlan.c  | 68 
> ++--
>  include/net/vxlan.h  | 62 +++-
>  include/uapi/linux/if_link.h |  8 ++
>  3 files changed, 135 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> index 775ddb48388d..c7844bae339d 100644
> --- a/drivers/net/vxlan.c
> +++ b/drivers/net/vxlan.c
> @@ -1192,6 +1192,33 @@ out:
> unparsed->vx_flags &= ~VXLAN_GBP_USED_BITS;
>  }
>
> +static bool vxlan_parse_gpe_hdr(struct vxlanhdr *unparsed,
> +   struct sk_buff *skb, u32 vxflags)
> +{
> +   struct vxlanhdr_gpe *gpe = (struct vxlanhdr_gpe *)unparsed;
> +
> +   /* Need to have Next Protocol set for interfaces in GPE mode. */
> +   if (!gpe->np_applied)
> +   return false;
> +   /* "The initial version is 0. If a receiver does not support the
> +* version indicated it MUST drop the packet.
> +*/
> +   if (gpe->version != 0)
> +   return false;
> +   /* "When the O bit is set to 1, the packet is an OAM packet and OAM
> +* processing MUST occur." However, we don't implement OAM
> +* processing, thus drop the packet.
> +*/
> +   if (gpe->oam_flag)
> +   return false;
> +
> +   if (gpe->next_protocol != VXLAN_GPE_NP_ETHERNET)
> +   return false;
> +
> +   unparsed->vx_flags &= ~VXLAN_GPE_USED_BITS;
> +   return true;
> +}
> +
>  static bool vxlan_set_mac(struct vxlan_dev *vxlan,
>   struct vxlan_sock *vs,
>   struct sk_buff *skb)
> @@ -1307,6 +1334,9 @@ static int vxlan_rcv(struct sock *sk, struct sk_buff 
> *skb)
> /* For backwards compatibility, only allow reserved fields to be
>  * used by VXLAN extensions if explicitly requested.
>  */
> +   if (vs->flags & VXLAN_F_GPE)
> +   if (!vxlan_parse_gpe_hdr(&unparsed, skb, vs->flags))
> +   goto drop;

I don't think this is right. VXLAN-GPE is a separate protocol from
VXLAN; they are not compatible on the wire and don't share flags or
fields (for instance GBP uses bits in VXLAN that hold the next
protocol in VXLAN-GPE). Neither is there a VXLAN_F_GPE flag defined in
VXLAN to differentiate the two. So VXLAN-GPE would be used on a
different port and probably needs its own rcv functions.

Tom

> if (vs->flags & VXLAN_F_REMCSUM_RX)
> if (!vxlan_remcsum(&unparsed, skb, vs->flags))
> goto drop;
> @@ -1685,6 +1715,14 @@ static void vxlan_build_gbp_hdr(struct vxlanhdr *vxh, 
> u32 vxflags,
> gbp->policy_id = htons(md->gbp & VXLAN_GBP_ID_MASK);
>  }
>
> +static void vxlan_build_gpe_hdr(struct vxlanhdr *vxh, u32 vxflags)
> +{
> +   struct vxlanhdr_gpe *gpe = (struct vxlanhdr_gpe *)vxh;
> +
> +   gpe->np_applied = 1;
> +   gpe->next_protocol = VXLAN_GPE_NP_ETHERNET;
> +}
> +
>  static int vxlan_build_skb(struct sk_buff *skb, struct dst_entry *dst,
>int iphdr_len, __be32 vni,
>struct vxlan_metadata *md, u32 vxflags,
> @@ -1744,6 +1782,8 @@ static int vxlan_build_skb(struct sk_buff *skb, struct 
> dst_entry *dst,
>
> if (vxflags & VXLAN_F_GBP)
> vxlan_build_gbp_hdr(vxh, vxflags, md);
> +   if (vxflags & VXLAN_F_GPE)
> +   vxlan_build_gpe_hdr(vxh, vxflags);
>
> skb_set_inner_protocol(skb, htons(ETH_P_TEB));
> return 0;
> @@ -2515,6 +2555,7 @@ static const struct nla_policy 
> vxlan_policy[IFLA_VXLAN_MAX + 1] = {
> [IFLA_VXLAN_REMCSUM_RX] = { .type = NLA_U8 },
> [IFLA_VXLAN_GBP]= { .type = NLA_FLAG, },
> [IFLA_VXLAN_REMCSUM_NOPARTIAL]  = { .type = NLA_FLAG },
> +   [IFLA_VXLAN_GPE_MODE]   = { .type = NLA_U8, },
>  };
>
>  static int vxlan_validate(struct nlattr *tb[], struct nlattr *data[])
> @@ -2714,6 +2755,10 @@ static int vxlan_dev_configure(struct net *src_net, 
> struct net_device *dev,
> __be16 default_port = vxlan->cfg.dst_port;
> struct net_device *lowerdev = NULL;
>
> +   if (((conf->flags & VXLAN_F_LEARN) && (conf->flags & VXLAN_F_GPE)) ||
> +   ((conf->flags & VXLAN_F_GBP) && (conf->flags & VXLAN_F_GPE)))
> +   return -EINVAL;
> +
> vxlan->net = src_net;
>
> dst->remote_vni = conf->vni;
> @@ -2770,8 +2815,12 @@ static int vxlan_dev_configure(struct net *src_net, 
> struct net_device *dev,
> dev->needed_headroom = needed_headroom;
>
> memcpy(&vxlan->cfg, conf, sizeof(*conf));
> -   if (!vxlan->cfg.dst_port)
> -   vxlan->cfg.dst_port = 
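Tom's point above — that the VXLAN-GPE next-protocol field should simply map to an Ethertype for purposes of processing the payload — can be sketched in userspace C. The NP values below follow the VXLAN-GPE draft and are assumptions for illustration, not the kernel's definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Next-protocol values as given in the VXLAN-GPE draft (assumed). */
enum {
	VXLAN_GPE_NP_IPV4	= 0x01,
	VXLAN_GPE_NP_IPV6	= 0x02,
	VXLAN_GPE_NP_ETHERNET	= 0x03,
	VXLAN_GPE_NP_MPLS	= 0x05,
};

/* Map a GPE next-protocol value to the Ethertype that would describe
 * the same payload; 0 means unknown (the packet would be dropped). */
static uint16_t gpe_np_to_ethertype(uint8_t np)
{
	switch (np) {
	case VXLAN_GPE_NP_IPV4:		return 0x0800;	/* ETH_P_IP */
	case VXLAN_GPE_NP_IPV6:		return 0x86DD;	/* ETH_P_IPV6 */
	case VXLAN_GPE_NP_ETHERNET:	return 0x6558;	/* ETH_P_TEB */
	case VXLAN_GPE_NP_MPLS:		return 0x8847;	/* ETH_P_MPLS_UC */
	default:			return 0;	/* unknown: drop */
	}
}
```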

Re: [net-next-2.6 v3 1/3] introduce IFE action

2016-02-26 Thread Cong Wang
On Fri, Feb 26, 2016 at 2:43 PM, Jamal Hadi Salim  wrote:

[...]


Just some quick reviews... ;)


> +#define IFE_TAB_MASK 15
> +
> +static int ife_net_id;
> +static int max_metacnt = IFE_META_MAX + 1;
> +
> +static const struct nla_policy ife_policy[TCA_IFE_MAX + 1] = {
> +   [TCA_IFE_PARMS] = { .len = sizeof(struct tc_ife)},
> +   [TCA_IFE_DMAC] = { .len = ETH_ALEN},
> +   [TCA_IFE_SMAC] = { .len = ETH_ALEN},
> +   [TCA_IFE_TYPE] = { .type = NLA_U16},
> +};
> +
> +/* Caller takes care of presenting data in network order
> +*/
> +int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen, const void 
> *dval)
> +{
> +   u32 *tlv = (u32 *)(skbdata);
> +   u16 totlen = nla_total_size(dlen);  /*alignment + hdr */
> +   char *dptr = (char *)tlv + NLA_HDRLEN;
> +   u32 htlv = attrtype << 16 | totlen;
> +
> +   *tlv = htonl(htlv);
> +   memset(dptr, 0, totlen - NLA_HDRLEN);
> +   memcpy(dptr, dval, dlen);
> +
> +   return totlen;
> +}
> +EXPORT_SYMBOL_GPL(ife_tlv_meta_encode);
> +
> +int ife_get_meta_u32(struct sk_buff *skb, struct tcf_meta_info *mi)
> +{
> +   if (mi->metaval)
> +   return nla_put_u32(skb, mi->metaid, *(u32 *)mi->metaval);
> +   else
> +   return nla_put(skb, mi->metaid, 0, NULL);
> +}
> +EXPORT_SYMBOL_GPL(ife_get_meta_u32);
> +
> +int ife_check_meta_u32(u32 metaval, struct tcf_meta_info *mi)
> +{
> +   if (metaval || mi->metaval)
> +   return 8; /* T+L+V == 2+2+4 */
> +
> +   return 0;
> +}
> +EXPORT_SYMBOL_GPL(ife_check_meta_u32);
> +
> +int ife_encode_meta_u32(u32 metaval, void *skbdata, struct tcf_meta_info *mi)
> +{
> +   u32 edata = metaval;
> +
> +   if (mi->metaval)
> +   edata = *(u32 *)mi->metaval;
> +   else if (metaval)
> +   edata = metaval;
> +
> +   if (!edata) /* will not encode */
> +   return 0;
> +
> +   edata = htonl(edata);
> +   return ife_tlv_meta_encode(skbdata, mi->metaid, 4, &edata);
> +}
> +EXPORT_SYMBOL_GPL(ife_encode_meta_u32);
> +
> +int ife_get_meta_u16(struct sk_buff *skb, struct tcf_meta_info *mi)
> +{
> +   if (mi->metaval)
> +   return nla_put_u16(skb, mi->metaid, *(u16 *)mi->metaval);
> +   else
> +   return nla_put(skb, mi->metaid, 0, NULL);
> +}
> +EXPORT_SYMBOL_GPL(ife_get_meta_u16);
> +
> +int ife_alloc_meta_u32(struct tcf_meta_info *mi, void *metaval)
> +{
> +   mi->metaval = kmemdup(metaval, sizeof(u32), GFP_KERNEL);
> +   if (!mi->metaval)
> +   return -ENOMEM;
> +
> +   return 0;
> +}
> +EXPORT_SYMBOL_GPL(ife_alloc_meta_u32);
> +
> +int ife_alloc_meta_u16(struct tcf_meta_info *mi, void *metaval)
> +{
> +   mi->metaval = kmemdup(metaval, sizeof(u16), GFP_KERNEL);
> +   if (!mi->metaval)
> +   return -ENOMEM;
> +
> +   return 0;
> +}
> +EXPORT_SYMBOL_GPL(ife_alloc_meta_u16);
> +
> +void ife_release_meta_gen(struct tcf_meta_info *mi)
> +{
> +   kfree(mi->metaval);
> +}
> +EXPORT_SYMBOL_GPL(ife_release_meta_gen);
> +
> +int ife_validate_meta_u32(void *val, int len)
> +{
> +   if (len == 4)
> +   return 0;
> +
> +   return -EINVAL;
> +}
> +EXPORT_SYMBOL_GPL(ife_validate_meta_u32);
> +
> +int ife_validate_meta_u16(void *val, int len)
> +{
> +   /* length will include padding */
> +   if (len == NLA_ALIGN(2))
> +   return 0;
> +
> +   return -EINVAL;
> +}
> +EXPORT_SYMBOL_GPL(ife_validate_meta_u16);
> +
> +static LIST_HEAD(ifeoplist);
> +static DEFINE_RWLOCK(ife_mod_lock);
> +
> +struct tcf_meta_ops *find_ife_oplist(u16 metaid)


static?


> +{
> +   struct tcf_meta_ops *o;
> +
> +   read_lock(&ife_mod_lock);
> +   list_for_each_entry(o, &ifeoplist, list) {
> +   if (o->metaid == metaid) {
> +   if (!try_module_get(o->owner))
> +   o = NULL;
> +   read_unlock(&ife_mod_lock);
> +   return o;
> +   }
> +   }
> +   read_unlock(&ife_mod_lock);
> +
> +   return NULL;
> +}
> +
> +int register_ife_op(struct tcf_meta_ops *mops)
> +{
> +   struct tcf_meta_ops *m;
> +
> +   if (!mops->metaid || !mops->metatype || !mops->name ||
> +   !mops->check_presence || !mops->encode || !mops->decode ||
> +   !mops->get || !mops->alloc)
> +   return -EINVAL;
> +
> +   write_lock(&ife_mod_lock);
> +
> +   list_for_each_entry(m, &ifeoplist, list) {
> +   if (m->metaid == mops->metaid ||
> +   (strcmp(mops->name, m->name) == 0)) {
> +   write_unlock(&ife_mod_lock);
> +   return -EEXIST;
> +   }
> +   }
> +
> +   if (!mops->release)
> +   mops->release = ife_release_meta_gen;
> +
> +   list_add_tail(&mops->list, &ifeoplist);
> +   write_unlock(&ife_mod_lock);
> +   return 0;
> +}
> +EXPORT_SYMBOL_GPL(unregister_ife_op);
> +
> +int unregister_ife_op(struct 
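The TLV header that ife_tlv_meta_encode builds in the quoted patch — attribute type in the upper 16 bits, NLA-aligned total length in the lower 16, both in network order, with the value zero-padded to the aligned length — can be mirrored in userspace C for illustration. NLA_HDRLEN and NLA_ALIGN are redefined locally under the assumption of 4-byte netlink alignment:

```c
#include <assert.h>
#include <arpa/inet.h>	/* htonl */
#include <stdint.h>
#include <string.h>

#define NLA_HDRLEN	4U
#define NLA_ALIGN(n)	(((n) + 3U) & ~3U)

/* Userspace mirror of the IFE TLV layout: 32-bit header
 * (type << 16 | total length), then the zero-padded value. */
static int tlv_encode(uint8_t *out, uint16_t type, uint16_t dlen,
		      const void *val)
{
	uint32_t totlen = NLA_ALIGN(NLA_HDRLEN + dlen);	/* hdr + padding */
	uint32_t hdr = htonl(((uint32_t)type << 16) | totlen);

	memcpy(out, &hdr, sizeof(hdr));
	memset(out + NLA_HDRLEN, 0, totlen - NLA_HDRLEN);
	memcpy(out + NLA_HDRLEN, val, dlen);
	return (int)totlen;
}
```

For a u32 value this yields the 8 bytes ("T+L+V == 2+2+4") that ife_check_meta_u32 reserves.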

RE: [PATCH net] bna: fix list corruption

2016-02-26 Thread Rasesh Mody
> From: Ivan Vecera [mailto:ivec...@redhat.com]
> Sent: Friday, February 26, 2016 12:16 AM
> 
> Use list_move_tail() to move a MAC address entry from the list of pending
> entries to the list of active entries. A simple list_add_tail() leaves the
> entry also in the first list, which leads to list corruption.
> 
> Cc: Rasesh Mody 
> Signed-off-by: Ivan Vecera 

Acked-by: Rasesh Mody 

> ---
>  drivers/net/ethernet/brocade/bna/bna_tx_rx.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/brocade/bna/bna_tx_rx.c
> b/drivers/net/ethernet/brocade/bna/bna_tx_rx.c
> index 04b0d16..95bc470 100644
> --- a/drivers/net/ethernet/brocade/bna/bna_tx_rx.c
> +++ b/drivers/net/ethernet/brocade/bna/bna_tx_rx.c
> @@ -987,7 +987,7 @@ bna_rxf_ucast_cfg_apply(struct bna_rxf *rxf)
>   if (!list_empty(&rxf->ucast_pending_add_q)) {
>   mac = list_first_entry(&rxf->ucast_pending_add_q,
>  struct bna_mac, qe);
> - list_add_tail(&mac->qe, &rxf->ucast_active_q);
> + list_move_tail(&mac->qe, &rxf->ucast_active_q);
>   bna_bfi_ucast_req(rxf, mac,
> BFI_ENET_H2I_MAC_UCAST_ADD_REQ);
>   return 1;
>   }
> --
> 2.4.10
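The bug class Ivan fixes is easy to demonstrate with a minimal userspace re-implementation of the kernel's circular doubly-linked list primitives (a sketch, not the kernel's list.h): list_add_tail() alone leaves the node linked into its old list, while list_move_tail() unlinks it first.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *prev, *next; };

static void list_init(struct list_head *h) { h->prev = h->next = h; }

/* Append n before head h; does NOT unlink n from any old list. */
static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

/* The fix: unlink from the old list first, then append. */
static void list_move_tail(struct list_head *n, struct list_head *h)
{
	list_del(n);
	list_add_tail(n, h);
}
```

Calling plain list_add_tail() on an entry still linked into the pending queue leaves the pending queue's head pointing at a node whose links now belong to the active queue — exactly the corruption the patch removes.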



Re: [PATCH net-next 5/5] vxlan: implement GPE in L3 mode

2016-02-26 Thread Tom Herbert
On Fri, Feb 26, 2016 at 2:22 PM, Jesse Gross  wrote:
> On Thu, Feb 25, 2016 at 11:48 PM, Jiri Benc  wrote:
>> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
>> index c2b2b7462731..ee4f7198aa21 100644
>> --- a/include/uapi/linux/if_link.h
>> +++ b/include/uapi/linux/if_link.h
>> @@ -464,6 +464,7 @@ enum {
>>  enum vxlan_gpe_mode {
>> VXLAN_GPE_MODE_DISABLED = 0,
>> VXLAN_GPE_MODE_L2,
>> +   VXLAN_GPE_MODE_L3,
>
> Given that VXLAN_GPE_MODE_L3 will eventually come to be used by NSH,
> MPLS, etc. in addition to IPv4/v6, most of which are not really L3, it
> seems like something along the lines of NO_ARP might be better since
> that's what it really indicates. Once that is in, I don't really see
> the need to explicitly block Ethernet packets from being handled in
> this mode. If they are received, then they can just be handed off to
> the stack - at that point it would look like an extra header, the same
> as if an NSH packet is received.

Agreed, and I don't see why there even needs to be modes. VXLAN-GPE
can carry arbitrary protocols with a next-header field. For Ethernet,
MPLS, IPv4, and IPv6 it should just be a simple mapping of the next
header to Ethertype for purposes of processing the payload.


Re: [PATCH net 1/3] r8169:fix nic sometimes doesn't work after changing the mac address.

2016-02-26 Thread Francois Romieu
Chunhao Lin  :
> When there is no AC power, NIC doesn't work after changing mac address.
> Please refer to following link.
> http://www.spinics.net/lists/netdev/msg356572.html
> 
> This issue is caused by runtime power management. When there is no AC power,
> if we put the NIC down (ifconfig down), the driver is put into runtime
> suspend and the device enters D3 state. During this time the driver cannot
> access hardware registers, so a mac address set now will not take effect.
> After resume, the NIC keeps using the old mac address and the network will
> not work normally.
> 
> In this patch I add a check of the runtime pm state when setting the mac
> address. If the driver is in runtime suspend, I skip writing the mac address
> and apply the new mac address during runtime resume.

Instead of taking the device out of suspended mode to perform the required
action, the driver is moving to a model where 1) said action may be scheduled
to a later time - or result from past time work - and 2) rpm handler must
handle a lot of pm unrelated work.

rtl8169_ethtool_ops.{get_wol, get_regs, get_settings} aren't even fixed
yet (what about the .set_xyz handlers ?).

I can't help thinking that the driver should return to a state where it
stupidly does what it is asked to. No software caching, plain device
access, resume when needed, suspend as "suspend" instead of suspend as
"anticipate whatever may happen to avoid waking up".

-- 
Ueimor
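For reference, the caching pattern under discussion — the one the patch introduces and Francois argues against — looks schematically like this (all names illustrative, not the r8169 driver's):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Schematic only: while runtime-suspended, cache the requested MAC;
 * replay it into the "hardware" on runtime resume. */
struct nic {
	bool suspended;
	uint8_t hw_mac[6];	/* what the device registers hold */
	uint8_t cached_mac[6];	/* last MAC requested by the stack */
};

static void nic_set_mac(struct nic *n, const uint8_t mac[6])
{
	memcpy(n->cached_mac, mac, 6);
	if (!n->suspended)
		memcpy(n->hw_mac, mac, 6);	/* registers reachable */
}

static void nic_runtime_resume(struct nic *n)
{
	n->suspended = false;
	memcpy(n->hw_mac, n->cached_mac, 6);	/* replay deferred write */
}
```

Francois's alternative is the opposite design: resume the device on demand in nic_set_mac() and write the register immediately, keeping no software cache at all.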


Re: [PATCH v4 net-next] net: Implement fast csum_partial for x86_64

2016-02-26 Thread Alexander Duyck
On Fri, Feb 26, 2016 at 12:03 PM, Tom Herbert  wrote:
> This patch implements performant csum_partial for x86_64. The intent is
> to speed up checksum calculation, particularly for smaller lengths such
> as those that are present when doing skb_postpull_rcsum when getting
> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY conversion.
>
> - v4
>- went back to C code with inline assembly for critical routines
>- implemented suggestion from Linus to deal with lengths < 8
>
> Testing:
>
> Correctness:
>
> Verified correctness by testing arbitrary length buffer filled with
> random data. For each buffer I compared the computed checksum
> using the original algorithm for each possible alignment (0-7 bytes).
>
> Performance:
>
> Isolating old and new implementation for some common cases:
>
>           Old      New      %
> Len/Aln   nsecs    nsecs    Improv
> ---------+--------+--------+-------
> 1400/0    195.6    181.7    7%     (Big packet)
> 40/0      11.4     6.2      45%    (IPv6 hdr cmn case)
> 8/4       7.9      3.2      59%    (UDP, VXLAN in IPv4)
> 14/0      8.9      5.9      33%    (Eth hdr)
> 14/4      9.2      5.9      35%    (Eth hdr in IPv4)
> 14/3      9.6      5.9      38%    (Eth with odd align)
> 20/0      9.0      6.2      31%    (IP hdr without options)
> 7/1       8.9      4.2      52%    (buffer in one quad)
> 100/0     17.4     13.9     20%    (medium-sized pkt)
> 100/2     17.8     14.2     20%    (medium-sized pkt w/ alignment)
>
> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>
> Also tested on these with similar results:
>
> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>
> Branch  prediction:
>
> To test the effects of poor branch prediction in the jump tables I
> tested checksum performance with runs for two combinations of length
> and alignment. As the baseline I performed the test by doing half of
> calls with the first combination, followed by using the second
> combination for the second half. In the test case, I interleave the
> two combinations so that in every call the length and alignment are
> different to defeat the effects of branch prediction. Running several
> cases, I did not see any material performance difference between the
> two scenarios (perf stat output is below), neither does either case
> show a significant number of branch misses.
>
> Interleave lengths case:
>
> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
> ./csum -M new-thrash -I -l 100 -S 24 -a 1 -c 1
>
>  Performance counter stats for './csum -M new-thrash -I -l 100 -S 24 -a 1 -c 
> 1' (10 runs):
>
>  9,556,693,202  instructions                                  ( +-  0.00% )
>  1,176,208,640  branches                                      ( +-  0.00% )
>         19,487  branch-misses    #  0.00% of all branches     ( +-  6.07% )
>
>2.049732539 seconds time elapsed
>
> Non-interleave case:
>
> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>  ./csum -M new-thrash -l 100 -S 24 -a 1 -c 1
>
> Performance counter stats for './csum -M new-thrash -l 100 -S 24 -a 1 -c 
> 1' (10 runs):
>
>  9,782,188,310  instructions                                  ( +-  0.00% )
>  1,251,286,958  branches                                      ( +-  0.01% )
>         18,950  branch-misses    #  0.00% of all branches     ( +- 12.74% )
>
>2.271789046 seconds time elapsed
>
> Signed-off-by: Tom Herbert 
> ---
>  arch/x86/include/asm/checksum_64.h |  21 
>  arch/x86/lib/csum-partial_64.c | 225 
> -
>  2 files changed, 143 insertions(+), 103 deletions(-)
>
> diff --git a/arch/x86/include/asm/checksum_64.h 
> b/arch/x86/include/asm/checksum_64.h
> index cd00e17..e20c35b 100644
> --- a/arch/x86/include/asm/checksum_64.h
> +++ b/arch/x86/include/asm/checksum_64.h
> @@ -188,6 +188,27 @@ static inline unsigned add32_with_carry(unsigned a, 
> unsigned b)
> return a;
>  }
>
> +static inline unsigned long add64_with_carry(unsigned long a, unsigned long 
> b)
> +{
> +   asm("addq %2,%0\n\t"
> +   "adcq $0,%0"
> +   : "=r" (a)
> +   : "0" (a), "rm" (b));
> +   return a;
> +}
> +
> +static inline unsigned int add32_with_carry3(unsigned int a, unsigned int b,
> +unsigned int c)
> +{
> +   asm("addl %2,%0\n\t"
> +   "adcl %3,%0\n\t"
> +   "adcl $0,%0"
> +   : "=r" (a)
> +   : "0" (a), "rm" (b), "rm" (c));
> +
> +   return a;
> +}
> +
>  #define HAVE_ARCH_CSUM_ADD
>  static inline __wsum csum_add(__wsum csum, __wsum addend)
>  {
> diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> index 9845371..df82c9b 100644
> --- 

Re: [PATCH net-next 7/9] net: dsa: mv88e6xxx: restore VLANTable map control

2016-02-26 Thread Kevin Smith
Hi Andrew,

On 02/26/2016 04:35 PM, Andrew Lunn wrote:
> On Fri, Feb 26, 2016 at 10:12:28PM +, Kevin Smith wrote:
>> Hi Vivien, Andrew,
>>
>> On 02/26/2016 03:37 PM, Vivien Didelot wrote:
>>> Here, 5 is the CPU port and 6 is a DSA port.
>>>
>>> After joining ports 0, 1, 2 in the same bridge, we end up with:
>>>
>>> Port  0  1  2  3  4  5  6
>>> 0   -  *  *  -  -  *  *
>>> 1   *  -  *  -  -  *  *
>>> 2   *  *  -  -  -  *  *
>>> 3   -  -  -  -  -  *  *
>>> 4   -  -  -  -  -  *  *
>>> 5   *  *  *  *  *  -  *
>>> 6   *  *  *  *  *  *  -
>> The case I am concerned about is if the switch connected over DSA in
>> this example has a WAN port on it, which can legitimately route to the
>> CPU on port 5 but should not route to the LAN ports 0, 1, and 2.  Does
>> this VLAN allow direct communication between the WAN and LAN?  Or is
>> this prevented by DSA or some other mechanism?
> A typical WIFI access point with a connection to a cable modem.
>
> So in linux you have interfaces like
>
> lan0, lan1, lan2, lan3, wan0
>
> DSA provides you these interface. And by default they are all
> separated. There is no path between them. You can consider them as
> being separate physical ethernet cards, just like all other interfaces
> in linux.
>
> What you would typically do is:
>
> brctl addbr br0
> brctl addif br0 lan0
> brctl addif br0 lan1
> brctl addif br0 lan2
> brctl addif br0 lan3
>
> to create a bridge between the lan ports. The linux kernel will then
> push this bridge configuration down into the hardware, so the switch
> can forward frames between these ports.
>
> The wan port is not part of the bridge, so there is no L2 path to the
> WAN port. You need to do IP routing on the CPU.
>
> Linux takes the stance that switch ports interfaces should act just
> like any other linux interface and you configure them in the normal
> linux way.
>
>  Andrew

Thanks for the explanation.  I am a bit befuddled by the combination of 
all the possible configurations of the switch and how they interact with 
Linux.  :)  I think I understand what is happening now.

Kevin


[net-next-2.6 v3 3/3] Support to encoding decoding skb prio on IFE action

2016-02-26 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

Example usage:
xxx: Set the skb priority using skbedit then allow it to be encoded
sudo tc qdisc add dev $ETH root handle 1: prio
sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action skbedit prio 17 \
action ife encode \
allow prio \
dst 02:15:15:15:15:15

Note: You don't need the skbedit action if you are already encoding the
skb priority earlier. A zero skb priority will not be sent.

Alternatively, hard-code a static priority of decimal 33 (unlike skbedit)
and then a mark of 0x12 every time the filter matches:

sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action ife encode \
type 0xDEAD \
use prio 33 \
use mark 0x12 \
dst 02:15:15:15:15:15

Signed-off-by: Jamal Hadi Salim 
---
 net/sched/Kconfig|  5 +++
 net/sched/Makefile   |  1 +
 net/sched/act_meta_skbprio.c | 76 
 3 files changed, 82 insertions(+)
 create mode 100644 net/sched/act_meta_skbprio.c

diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 85854c0..b148302 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -756,6 +756,11 @@ config NET_IFE_SKBMARK
 depends on NET_ACT_IFE
 ---help---
 
+config NET_IFE_SKBPRIO
+tristate "Support to encoding decoding skb prio on IFE action"
+depends on NET_ACT_IFE
+---help---
+
 config NET_CLS_IND
bool "Incoming device classification"
depends on NET_CLS_U32 || NET_CLS_FW
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 3f7a182..84bddb3 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -21,6 +21,7 @@ obj-$(CONFIG_NET_ACT_BPF) += act_bpf.o
 obj-$(CONFIG_NET_ACT_CONNMARK) += act_connmark.o
 obj-$(CONFIG_NET_ACT_IFE)  += act_ife.o
 obj-$(CONFIG_NET_IFE_SKBMARK)  += act_meta_mark.o
+obj-$(CONFIG_NET_IFE_SKBPRIO)  += act_meta_skbprio.o
 obj-$(CONFIG_NET_SCH_FIFO) += sch_fifo.o
 obj-$(CONFIG_NET_SCH_CBQ)  += sch_cbq.o
 obj-$(CONFIG_NET_SCH_HTB)  += sch_htb.o
diff --git a/net/sched/act_meta_skbprio.c b/net/sched/act_meta_skbprio.c
new file mode 100644
index 000..26bf4d8
--- /dev/null
+++ b/net/sched/act_meta_skbprio.c
@@ -0,0 +1,76 @@
+/*
+ * net/sched/act_meta_prio.c IFE skb->priority metadata module
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * copyright Jamal Hadi Salim (2015)
+ *
+*/
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static int skbprio_check(struct sk_buff *skb, struct tcf_meta_info *e)
+{
+   return ife_check_meta_u32(skb->priority, e);
+}
+
+static int skbprio_encode(struct sk_buff *skb, void *skbdata,
+ struct tcf_meta_info *e)
+{
+   u32 ifeprio = skb->priority; /* avoid having to cast skb->priority*/
+
+   return ife_encode_meta_u32(ifeprio, skbdata, e);
+}
+
+static int skbprio_decode(struct sk_buff *skb, void *data, u16 len)
+{
+   u32 ifeprio = *(u32 *)data;
+
+   skb->priority = ntohl(ifeprio);
+   return 0;
+}
+
+static struct tcf_meta_ops ife_prio_ops = {
+   .metaid = IFE_META_PRIO,
+   .metatype = NLA_U32,
+   .name = "skbprio",
+   .synopsis = "skb prio metadata",
+   .check_presence = skbprio_check,
+   .encode = skbprio_encode,
+   .decode = skbprio_decode,
+   .get = ife_get_meta_u32,
+   .alloc = ife_alloc_meta_u32,
+   .owner = THIS_MODULE,
+};
+
+static int __init ifeprio_init_module(void)
+{
+   return register_ife_op(&ife_prio_ops);
+}
+
+static void __exit ifeprio_cleanup_module(void)
+{
+   unregister_ife_op(&ife_prio_ops);
+}
+
+module_init(ifeprio_init_module);
+module_exit(ifeprio_cleanup_module);
+
+MODULE_AUTHOR("Jamal Hadi Salim(2015)");
+MODULE_DESCRIPTION("Inter-FE skb prio metadata action");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_IFE_META(IFE_META_PRIO);
-- 
1.9.1



[net-next-2.6 v3 0/3] net_sched: Add support for IFE action

2016-02-26 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

As agreed at netconf in Seville, here's the patch finally (one year
was just too long to wait for an ethertype; now we are just going to
have the user configure one).
Described in netdev01 paper:
"Distributing Linux Traffic Control Classifier-Action Subsystem"
 Authors: Jamal Hadi Salim and Damascene M. Joachimpillai

The original motivation and deployment of this work was to horizontally
scale packet processing at the scope of a chassis or rack. This means one
could take a tc policy and split it across machines connected over
L2. The paper refers to this as "pipeline stage indexing". Other
use cases which evolved out of the original intent include but are
not limited to carrying OAM information, carrying exception handling
metadata, carrying programmed authentication and authorization information,
encapsulating programmed compliance information, service IDs etc.
Read the referenced paper for more details.

The architecture allows for incremental additions of new metadata
to cover different use cases.
This patch set includes support for basic skb metadatum.
Followup patches will have more examples of metadata and other features.

v3 changes:
Integrate with the new namespace changes 
Remove skbhash and queue mapping metadata (but keep their claim for ids)
Integrate feedback from Cong 
Integrate feedback from Daniel

v2 changes:
Remove module option for an upper bound of metadata
Integrate feedback from Cong 
Integrate feedback from Daniel

Jamal Hadi Salim (3):
  introduce IFE action
  Support to encoding decoding skb mark on IFE action
  Support to encoding decoding skb prio on IFE action

 include/net/tc_act/tc_ife.h|  61 +++
 include/uapi/linux/tc_act/tc_ife.h |  38 ++
 net/sched/Kconfig  |  22 +
 net/sched/Makefile |   3 +
 net/sched/act_ife.c| 883 +
 net/sched/act_meta_mark.c  |  79 
 net/sched/act_meta_skbprio.c   |  76 
 7 files changed, 1162 insertions(+)
 create mode 100644 include/net/tc_act/tc_ife.h
 create mode 100644 include/uapi/linux/tc_act/tc_ife.h
 create mode 100644 net/sched/act_ife.c
 create mode 100644 net/sched/act_meta_mark.c
 create mode 100644 net/sched/act_meta_skbprio.c

-- 
1.9.1



[net-next-2.6 v3 1/3] introduce IFE action

2016-02-26 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

Described in netdev01 paper:
  "Distributing Linux Traffic Control Classifier-Action Subsystem"
   Authors: Jamal Hadi Salim and Damascene M. Joachimpillai

This action allows for a sending side to encapsulate arbitrary metadata
which is decapsulated by the receiving end.
The sender runs in encoding mode and the receiver in decode mode.
Both sender and receiver must specify the same ethertype.
At some point we hope to have a registered ethertype and we'll
then provide a default so the user doesn't have to specify it.
For now we require the user to specify it.

Let's show example usage where we encode ICMP from a sender towards
a receiver with an skbmark of 17; both sender and receiver use
ethertype of 0xdead to interop.

: Let's start with the receiver-side policy config:
xxx: add an ingress qdisc
sudo tc qdisc add dev $ETH ingress

xxx: any packets with ethertype 0xdead will be subjected to ife decoding
xxx: we then restart the classification so we can match on icmp at prio 3
sudo $TC filter add dev $ETH parent : prio 2 protocol 0xdead \
u32 match u32 0 0 flowid 1:1 \
action ife decode reclassify

xxx: on restarting the classification from above, if it was an icmp
xxx: packet, then match it here and continue to the next rule at prio 4
xxx: which will match based on skb mark of 17
sudo tc filter add dev $ETH parent : prio 3 protocol ip \
u32 match ip protocol 1 0xff flowid 1:1 \
action continue

xxx: match on skbmark of 0x11 (decimal 17) and accept
sudo tc filter add dev $ETH parent : prio 4 protocol ip \
handle 0x11 fw flowid 1:1 \
action ok

xxx: Let's show the decoding policy
sudo tc -s filter ls dev $ETH parent : protocol 0xdead
xxx:
filter pref 2 u32
filter pref 2 u32 fh 800: ht divisor 1
filter pref 2 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1  (rule hit 0 success 0)
  match / at 0 (success 0 )
action order 1: ife decode action reclassify
 index 1 ref 1 bind 1 installed 14 sec used 14 sec
 type: 0x0
 Metadata: allow mark allow hash allow prio allow qmap
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
xxx:
Observe that the above lists all metadata it can decode. Typically these
submodules will already be compiled into a monolithic kernel or
loaded as modules.

: Let's show the sender side now ..

xxx: Add an egress qdisc on the sender netdev
sudo tc qdisc add dev $ETH root handle 1: prio
xxx:
xxx: Match all icmp packets to 192.168.122.237/24, then
xxx: tag the packet with skb mark of decimal 17, then
xxx: Encode it with:
xxx:ethertype 0xdead
xxx:add skb->mark to whitelist of metadatum to send
xxx:rewrite target dst MAC address to 02:15:15:15:15:15
xxx:
sudo $TC filter add dev $ETH parent 1: protocol ip prio 10  u32 \
match ip dst 192.168.122.237/24 \
match ip protocol 1 0xff \
flowid 1:2 \
action skbedit mark 17 \
action ife encode \
type 0xDEAD \
allow mark \
dst 02:15:15:15:15:15

xxx: Let's show the encoding policy
sudo tc -s filter ls dev $ETH parent 1: protocol ip
xxx:
filter pref 10 u32
filter pref 10 u32 fh 800: ht divisor 1
filter pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:2  (rule hit 0 success 0)
  match c0a87aed/ at 16 (success 0 )
  match 0001/00ff at 8 (success 0 )

action order 1:  skbedit mark 17
 index 6 ref 1 bind 1
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

action order 2: ife encode action pipe
 index 3 ref 1 bind 1
 dst MAC: 02:15:15:15:15:15 type: 0xDEAD
 Metadata: allow mark
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
xxx:
Test by sending a ping from the sender to the destination.

Signed-off-by: Jamal Hadi Salim 
---
 include/net/tc_act/tc_ife.h|  61 +++
 include/uapi/linux/tc_act/tc_ife.h |  38 ++
 net/sched/Kconfig  |  12 +
 net/sched/Makefile |   1 +
 net/sched/act_ife.c| 883 +
 5 files changed, 995 insertions(+)
 create mode 100644 include/net/tc_act/tc_ife.h
 create mode 100644 include/uapi/linux/tc_act/tc_ife.h
 create mode 100644 net/sched/act_ife.c

diff --git a/include/net/tc_act/tc_ife.h b/include/net/tc_act/tc_ife.h
new file mode 100644
index 000..dc9a09a
--- /dev/null
+++ b/include/net/tc_act/tc_ife.h
@@ -0,0 +1,61 @@
+#ifndef __NET_TC_IFE_H
+#define __NET_TC_IFE_H
+
+#include 
+#include 
+#include 
+#include 
+
+#define IFE_METAHDRLEN 2
+struct tcf_ife_info {
+   struct tcf_common common;
+   u8 eth_dst[ETH_ALEN];
+   u8 eth_src[ETH_ALEN];
+   u16 eth_type;
+   u16 flags;
+   /* list of metaids allowed */
+   struct list_head metalist;
+};
+#define to_ife(a) \
+   container_of(a->priv, 

[net-next-2.6 v3 2/3] Support to encoding decoding skb mark on IFE action

2016-02-26 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

Example usage:
Set the skb mark using skbedit, then allow it to be encoded:

sudo tc qdisc add dev $ETH root handle 1: prio
sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action skbedit mark 17 \
action ife encode \
allow mark \
dst 02:15:15:15:15:15

Note: You don't need the skbedit action if you are already encoding the
skb mark earlier. A zero skb mark, when seen, will not be encoded.

Alternatively, hard-code a static mark of 0x12 every time the filter matches:

sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action ife encode \
type 0xDEAD \
use mark 0x12 \
dst 02:15:15:15:15:15

Signed-off-by: Jamal Hadi Salim 
---
 net/sched/Kconfig |  5 +++
 net/sched/Makefile|  1 +
 net/sched/act_meta_mark.c | 79 +++
 3 files changed, 85 insertions(+)
 create mode 100644 net/sched/act_meta_mark.c

diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 4d48ef5..85854c0 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -751,6 +751,11 @@ config NET_ACT_IFE
  To compile this code as a module, choose M here: the
  module will be called act_ife.
 
+config NET_IFE_SKBMARK
+tristate "Support to encoding decoding skb mark on IFE action"
+depends on NET_ACT_IFE
+---help---
+
 config NET_CLS_IND
bool "Incoming device classification"
depends on NET_CLS_U32 || NET_CLS_FW
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 3d17667..3f7a182 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_NET_ACT_VLAN)+= act_vlan.o
 obj-$(CONFIG_NET_ACT_BPF)  += act_bpf.o
 obj-$(CONFIG_NET_ACT_CONNMARK) += act_connmark.o
 obj-$(CONFIG_NET_ACT_IFE)  += act_ife.o
+obj-$(CONFIG_NET_IFE_SKBMARK)  += act_meta_mark.o
 obj-$(CONFIG_NET_SCH_FIFO) += sch_fifo.o
 obj-$(CONFIG_NET_SCH_CBQ)  += sch_cbq.o
 obj-$(CONFIG_NET_SCH_HTB)  += sch_htb.o
diff --git a/net/sched/act_meta_mark.c b/net/sched/act_meta_mark.c
new file mode 100644
index 000..8289217
--- /dev/null
+++ b/net/sched/act_meta_mark.c
@@ -0,0 +1,79 @@
+/*
+ * net/sched/act_meta_mark.c IFE skb->mark metadata module
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * copyright Jamal Hadi Salim (2015)
+ *
+*/
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static int skbmark_encode(struct sk_buff *skb, void *skbdata,
+ struct tcf_meta_info *e)
+{
+   u32 ifemark = skb->mark;
+
+   return ife_encode_meta_u32(ifemark, skbdata, e);
+}
+
+static int skbmark_decode(struct sk_buff *skb, void *data, u16 len)
+{
+   u32 ifemark = *(u32 *)data;
+
+   skb->mark = ntohl(ifemark);
+   return 0;
+}
+
+static int skbmark_check(struct sk_buff *skb, struct tcf_meta_info *e)
+{
+   return ife_check_meta_u32(skb->mark, e);
+}
+
+static struct tcf_meta_ops ife_skbmark_ops = {
+   .metaid = IFE_META_SKBMARK,
+   .metatype = NLA_U32,
+   .name = "skbmark",
+   .synopsis = "skb mark 32 bit metadata",
+   .check_presence = skbmark_check,
+   .encode = skbmark_encode,
+   .decode = skbmark_decode,
+   .get = ife_get_meta_u32,
+   .alloc = ife_alloc_meta_u32,
+   .release = ife_release_meta_gen,
+   .validate = ife_validate_meta_u32,
+   .owner = THIS_MODULE,
+};
+
+static int __init ifemark_init_module(void)
+{
+   return register_ife_op(&ife_skbmark_ops);
+}
+
+static void __exit ifemark_cleanup_module(void)
+{
+   unregister_ife_op(&ife_skbmark_ops);
+}
+
+module_init(ifemark_init_module);
+module_exit(ifemark_cleanup_module);
+
+MODULE_AUTHOR("Jamal Hadi Salim(2015)");
+MODULE_DESCRIPTION("Inter-FE skb mark metadata module");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_IFE_META(IFE_META_SKBMARK);
-- 
1.9.1



Re: [PATCH net-next 7/9] net: dsa: mv88e6xxx: restore VLANTable map control

2016-02-26 Thread Andrew Lunn
On Fri, Feb 26, 2016 at 10:12:28PM +, Kevin Smith wrote:
> Hi Vivien, Andrew,
> 
> On 02/26/2016 03:37 PM, Vivien Didelot wrote:
> > Here, 5 is the CPU port and 6 is a DSA port.
> >
> > After joining ports 0, 1, 2 in the same bridge, we end up with:
> >
> > Port  0  1  2  3  4  5  6
> >0   -  *  *  -  -  *  *
> >1   *  -  *  -  -  *  *
> >2   *  *  -  -  -  *  *
> >3   -  -  -  -  -  *  *
> >4   -  -  -  -  -  *  *
> >5   *  *  *  *  *  -  *
> >6   *  *  *  *  *  *  -
> The case I am concerned about is if the switch connected over DSA in 
> this example has a WAN port on it, which can legitimately route to the 
> CPU on port 5 but should not route to the LAN ports 0, 1, and 2.  Does 
> this VLAN allow direct communication between the WAN and LAN?  Or is 
> this prevented by DSA or some other mechanism?

Consider a typical WIFI access point with a connection to a cable modem.

So in linux you have interfaces like

lan0, lan1, lan2, lan3, wan0

DSA provides you these interface. And by default they are all
separated. There is no path between them. You can consider them as
being separate physical ethernet cards, just like all other interfaces
in linux.

What you would typically do is:

brctl addbr br0
brctl addif br0 lan0
brctl addif br0 lan1
brctl addif br0 lan2
brctl addif br0 lan3

to create a bridge between the lan ports. The linux kernel will then
push this bridge configuration down into the hardware, so the switch
can forward frames between these ports.

The wan port is not part of the bridge, so there is no L2 path to the
WAN port. You need to do IP routing on the CPU.

Linux takes the stance that switch port interfaces should act just
like any other linux interface and you configure them in the normal
linux way.

Andrew


Re: [PATCH net-next 5/5] vxlan: implement GPE in L3 mode

2016-02-26 Thread Jesse Gross
On Thu, Feb 25, 2016 at 11:48 PM, Jiri Benc  wrote:
> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
> index c2b2b7462731..ee4f7198aa21 100644
> --- a/include/uapi/linux/if_link.h
> +++ b/include/uapi/linux/if_link.h
> @@ -464,6 +464,7 @@ enum {
>  enum vxlan_gpe_mode {
> VXLAN_GPE_MODE_DISABLED = 0,
> VXLAN_GPE_MODE_L2,
> +   VXLAN_GPE_MODE_L3,

Given that VXLAN_GPE_MODE_L3 will eventually come to be used by NSH,
MPLS, etc. in addition to IPv4/v6, most of which are not really L3, it
seems like something along the lines of NO_ARP might be better since
that's what it really indicates. Once that is in, I don't really see
the need to explicitly block Ethernet packets from being handled in
this mode. If they are received, then they can just be handed off to
the stack - at that point it would look like an extra header, the same
as if an NSH packet is received.


Re: [PATCH net-next 7/9] net: dsa: mv88e6xxx: restore VLANTable map control

2016-02-26 Thread Kevin Smith
Hi Vivien, Andrew,

On 02/26/2016 03:37 PM, Vivien Didelot wrote:
> Here, 5 is the CPU port and 6 is a DSA port.
>
> After joining ports 0, 1, 2 in the same bridge, we end up with:
>
> Port  0  1  2  3  4  5  6
>0   -  *  *  -  -  *  *
>1   *  -  *  -  -  *  *
>2   *  *  -  -  -  *  *
>3   -  -  -  -  -  *  *
>4   -  -  -  -  -  *  *
>5   *  *  *  *  *  -  *
>6   *  *  *  *  *  *  -
The case I am concerned about is if the switch connected over DSA in 
this example has a WAN port on it, which can legitimately route to the 
CPU on port 5 but should not route to the LAN ports 0, 1, and 2.  Does 
this VLAN allow direct communication between the WAN and LAN?  Or is 
this prevented by DSA or some other mechanism?

Thanks,
Kevin


Re: [PATCH net-next 7/9] net: dsa: mv88e6xxx: restore VLANTable map control

2016-02-26 Thread Andrew Lunn
On Fri, Feb 26, 2016 at 04:37:39PM -0500, Vivien Didelot wrote:
> Hi Kevin, Andrew,
> 
> Andrew Lunn  writes:
> 
> > On Fri, Feb 26, 2016 at 08:45:28PM +, Kevin Smith wrote:
> >> Hi Vivien,
> >> 
> >> On 02/26/2016 12:16 PM, Vivien Didelot wrote:
> >> > +/* allow CPU port or DSA link(s) to send frames to every port */
> >> > +if (dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port)) {
> >> > +output_ports = mask;
> >> > +} else {
> >
> >> Is this always correct?  Are there situations where a CPU or neighboring 
> >> switch should not be allowed to access another port? (e.g. Figure 6 or 7 
> >> in the 88E6352 functional specification).
> 
> Given Linux expectations (described below by Andrew) I'd say yes, this
> is always correct. But I'd be curious to know if someone has counter
> examples for this.
> 
> > What do these figures show?
> 
> The figure shows the following VLANTable config:
> 
> Port  0  1  2  3  4  5  6
>   0   -  *  *  *  -  -  *
>   1   *  -  *  *  -  -  *
>   2   *  *  -  *  -  -  *
>   3   *  *  *  -  -  -  *
>   4   -  -  -  -  -  *  -
>   5   -  -  -  -  *  -  -
>   6   *  *  *  *  -  -  -
> 
> There are two independent groups: 0, 1, 2, 3, 6 (LAN, 6 is CPU/Router),
> and 4, 5 (4 is WAN and 5 is CPU/Router):

Ah, two CPU interfaces. We don't support that yet.  I do have patches,
but I took a different approach. They just load balance, by some
definition of 'load balance', between the two CPU ports.

   Andrew


Re: [PATCH net-next 7/9] net: dsa: mv88e6xxx: restore VLANTable map control

2016-02-26 Thread Vivien Didelot
Hi Kevin, Andrew,

Andrew Lunn  writes:

> On Fri, Feb 26, 2016 at 08:45:28PM +, Kevin Smith wrote:
>> Hi Vivien,
>> 
>> On 02/26/2016 12:16 PM, Vivien Didelot wrote:
>> > +  /* allow CPU port or DSA link(s) to send frames to every port */
>> > +  if (dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port)) {
>> > +  output_ports = mask;
>> > +  } else {
>
>> Is this always correct?  Are there situations where a CPU or neighboring 
>> switch should not be allowed to access another port? (e.g. Figure 6 or 7 
>> in the 88E6352 functional specification).

Given Linux expectations (described below by Andrew) I'd say yes, this
is always correct. But I'd be curious to know if someone has counter
examples for this.

> What do these figures show?

The figure shows the following VLANTable config:

Port  0  1  2  3  4  5  6
  0   -  *  *  *  -  -  *
  1   *  -  *  *  -  -  *
  2   *  *  -  *  -  -  *
  3   *  *  *  -  -  -  *
  4   -  -  -  -  -  *  -
  5   -  -  -  -  *  -  -
  6   *  *  *  *  -  -  -

There are two independent groups: 0, 1, 2, 3, 6 (LAN, 6 is CPU/Router),
and 4, 5 (4 is WAN and 5 is CPU/Router):

Port #   Port Type VLANTable Setting
0LAN   0x4E
1LAN   0x4D
2LAN   0x4B
3LAN   0x47
4WAN   0x20
5CPU   0x10
6CPU   0x0F

> The CPU port needs to be able to send to each external port. The whole
> DSA concept is that Linux has a netdev per external port, and can send
> frames using the netdev out a specific port. Such frames have a DSA
> header indicating which port they are destined to.  When you have a
> multi chip setup, the frame needs to traverse DSA ports.

This current patch produces the following setup at setup time:

Port  0  1  2  3  4  5  6
  0   -  -  -  -  -  *  *
  1   -  -  -  -  -  *  *
  2   -  -  -  -  -  *  *
  3   -  -  -  -  -  *  *
  4   -  -  -  -  -  *  *
  5   *  *  *  *  *  -  *
  6   *  *  *  *  *  *  -

Here, 5 is the CPU port and 6 is a DSA port.

After joining ports 0, 1, 2 in the same bridge, we end up with:

Port  0  1  2  3  4  5  6
  0   -  *  *  -  -  *  *
  1   *  -  *  -  -  *  *
  2   *  *  -  -  -  *  *
  3   -  -  -  -  -  *  *
  4   -  -  -  -  -  *  *
  5   *  *  *  *  *  -  *
  6   *  *  *  *  *  *  -

Thanks,
-v


[PATCH] net/mlx5e: make VXLAN support conditional

2016-02-26 Thread Arnd Bergmann
VXLAN can be disabled at compile-time or it can be a loadable
module while mlx5 is built-in, which leads to a link error:

drivers/net/built-in.o: In function `mlx5e_create_netdev':
ntb_netdev.c:(.text+0x106de4): undefined reference to `vxlan_get_rx_port'

This avoids the link error and makes the vxlan code optional,
like the other ethernet drivers do as well.

Signed-off-by: Arnd Bergmann 
Fixes: b3f63c3d5e2c ("net/mlx5e: Add netdev support for VXLAN tunneling")
---
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig   |  7 +++
 drivers/net/ethernet/mellanox/mlx5/core/Makefile  |  4 +++-
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  2 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |  3 +++
 drivers/net/ethernet/mellanox/mlx5/core/vxlan.h   | 11 +--
 5 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index 1cf722eba607..f5c3b9465d8d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -31,3 +31,10 @@ config MLX5_CORE_EN_DCB
  This flag is depended on the kernel's DCB support.
 
  If unsure, set to Y
+
+config MLX5_CORE_EN_VXLAN
+   bool "VXLAN offloads Support"
+   default y
+   depends on MLX5_CORE_EN && VXLAN && !(MLX5_CORE=y && VXLAN=m)
+   ---help---
+ Say Y here if you want to use VXLAN offloads in the driver.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 11b592dbf16a..3ecef5f74ccf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -6,6 +6,8 @@ mlx5_core-y :=  main.o cmd.o debugfs.o fw.o eq.o uar.o 
pagealloc.o \
 
 mlx5_core-$(CONFIG_MLX5_CORE_EN) += wq.o eswitch.o \
en_main.o en_fs.o en_ethtool.o en_tx.o en_rx.o \
-   en_txrx.o en_clock.o vxlan.o
+   en_txrx.o en_clock.o
+
+mlx5_core-$(CONFIG_MLX5_CORE_EN_VXLAN) += vxlan.o
 
 mlx5_core-$(CONFIG_MLX5_CORE_EN_DCB) +=  en_dcbnl.o
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 1dca3dcf90f5..18040da1c3a5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -552,7 +552,9 @@ struct mlx5e_priv {
struct mlx5e_flow_tables   fts;
struct mlx5e_eth_addr_db   eth_addr;
struct mlx5e_vlan_db   vlan;
+#ifdef CONFIG_MLX5_CORE_EN_VXLAN
struct mlx5e_vxlan_db  vxlan;
+#endif
 
struct mlx5e_paramsparams;
spinlock_t async_events_spinlock; /* sync hw events */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 0d45f35aee72..44fc4bc35ffd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2116,6 +2116,9 @@ static netdev_features_t mlx5e_vxlan_features_check(struct mlx5e_priv *priv,
u16 proto;
u16 port = 0;
 
+   if (!IS_ENABLED(CONFIG_MLX5_CORE_EN_VXLAN))
+   goto out;
+
switch (vlan_get_protocol(skb)) {
case htons(ETH_P_IP):
proto = ip_hdr(skb)->protocol;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vxlan.h b/drivers/net/ethernet/mellanox/mlx5/core/vxlan.h
index a01685056ab1..8c57861e0f8a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/vxlan.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/vxlan.h
@@ -41,14 +41,21 @@ struct mlx5e_vxlan {
 
 static inline bool mlx5e_vxlan_allowed(struct mlx5_core_dev *mdev)
 {
-   return (MLX5_CAP_ETH(mdev, tunnel_stateless_vxlan) &&
+   return IS_ENABLED(CONFIG_MLX5_CORE_EN_VXLAN) &&
+   (MLX5_CAP_ETH(mdev, tunnel_stateless_vxlan) &&
mlx5_core_is_pf(mdev));
 }
 
+#ifdef CONFIG_MLX5_CORE_EN_VXLAN
 void mlx5e_vxlan_init(struct mlx5e_priv *priv);
+void mlx5e_vxlan_cleanup(struct mlx5e_priv *priv);
+#else
+static inline void mlx5e_vxlan_init(struct mlx5e_priv *priv) {}
+static inline void mlx5e_vxlan_cleanup(struct mlx5e_priv *priv) {}
+#endif
+
 int  mlx5e_vxlan_add_port(struct mlx5e_priv *priv, u16 port);
 void mlx5e_vxlan_del_port(struct mlx5e_priv *priv, u16 port);
 struct mlx5e_vxlan *mlx5e_vxlan_lookup_port(struct mlx5e_priv *priv, u16 port);
-void mlx5e_vxlan_cleanup(struct mlx5e_priv *priv);
 
 #endif /* __MLX5_VXLAN_H__ */
-- 
2.7.0



Re: [PATCH net-next 7/9] net: dsa: mv88e6xxx: restore VLANTable map control

2016-02-26 Thread Andrew Lunn
On Fri, Feb 26, 2016 at 08:45:28PM +, Kevin Smith wrote:
> Hi Vivien,
> 
> On 02/26/2016 12:16 PM, Vivien Didelot wrote:
> > +   /* allow CPU port or DSA link(s) to send frames to every port */
> > +   if (dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port)) {
> > +   output_ports = mask;
> > +   } else {

> Is this always correct?  Are there situations where a CPU or neighboring 
> switch should not be allowed to access another port? (e.g. Figure 6 or 7 
> in the 88E6352 functional specification).

What do these figures show?

The CPU port needs to be able to send to each external port. The whole
DSA concept is that Linux has a netdev per external port, and can send
frames using the netdev out a specific port. Such frames have a DSA
header indicating which port they are destined to.  When you have a
multi chip setup, the frame needs to traverse DSA ports.

  Andrew


Re: [PATCH net-next 7/9] net: dsa: mv88e6xxx: restore VLANTable map control

2016-02-26 Thread Kevin Smith
Hi Vivien,

On 02/26/2016 12:16 PM, Vivien Didelot wrote:
> + /* allow CPU port or DSA link(s) to send frames to every port */
> + if (dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port)) {
> + output_ports = mask;
> + } else {
Is this always correct?  Are there situations where a CPU or neighboring 
switch should not be allowed to access another port? (e.g. Figure 6 or 7 
in the 88E6352 functional specification).

Thanks,
Kevin


Re: [PATCH v4 net-next] net: Implement fast csum_partial for x86_64

2016-02-26 Thread Tom Herbert
On Fri, Feb 26, 2016 at 12:29 PM, Linus Torvalds
 wrote:
> Looks ok to me.
>
> I am left wondering if the code should just do that
>
> add32_with_carry3(sum, result >> 32, result);
>
> in the caller instead - right now pretty much every return point in
> do_csum() effectively does that,  with the exception of
>
>  - the 0-length case, which is presumably not really an issue in real
> life and could as well just return 0
>
>  - the 8-byte case that does two 32-bit loads instead, but could just
> do a single 8-byte load and return it (and then the generic version in
> the caller would do a shift).

Right, it is slightly faster for those two cases to return the result
directly. (csum over 8 bytes might be common with some encapsulation
protocols).

>
> That would simplifiy the code a bit - it wouldn't need to pass in
> "sum" to do_csum at all, and we'd have just a single case of that
> special 3-input carry add..
>
> But I'm certainly ok with it as-is. I'm not sure how performance
> critical the whole csum routine is, but at least now it doesn't
> introduce a lot of new complex asm.
>
Micro-performance optimizations may become more relevant as we
introduce more high performance network paths to the kernel. But the
short term reason for this is to dispel any remaining notion that NIC
HW support for CHECKSUM_UNNECESSARY is somehow better than
CHECKSUM_COMPLETE because pulling up checksums in the protocols is too
expensive.

Tom

> And this version might be reasonable to make generic, so that non-x86
> architectures could use the same approach. That's what we ended up
> doing for the dcache word-at-a-time code too in the end.
>
> Linus


pull-request: mac80211-next 2016-02-26

2016-02-26 Thread Johannes Berg
Hi Dave,

Let's try this again. I backed out some of the rfkill changes
that are buggy and fixed some of that too. I also left out the
one that generated the big discussion, but I still think it's
the saner thing to do rather than requiring userspace to poke
around that much with sysfs when all it wants to do is tell
us what it thinks should be "airplane mode". Anyway, wanted to
get these things in before sorting that out.

Still the ARM patch in here - acked by the relevant people to
fit into the series (rfkill -> ARM -> rfkill with dependencies
that way); and, as Emmanuel reminded me, an iwlwifi patch that
has similar dependency issues and we decided to take through
my tree.

Let me know if there's any problem.

Thanks,
johannes



The following changes since commit 725da8dee445662beea77d3f42c3f4c79f7a7a0e:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2016-01-13 
00:22:13 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next.git 
tags/mac80211-next-for-davem-2016-02-26

for you to fetch changes up to 50ee738d7271fe825e4024cdfa5c5301a871e2c2:

  rfkill: Add documentation about LED triggers (2016-02-24 09:13:12 +0100)


Here's another round of updates for -next:
 * big A-MSDU RX performance improvement (avoid linearize of paged RX)
 * rfkill changes: cleanups, documentation, platform properties
 * basic PBSS support in cfg80211
 * MU-MIMO action frame processing support
 * BlockAck reordering & duplicate detection offload support
 * various cleanups & little fixes


Arnd Bergmann (1):
  mac80211: avoid excessive stack usage in sta_info

Beni Lev (1):
  cfg80211: Add global RRM capability

Bjorn Andersson (1):
  mac80211: Make addr const in SET_IEEE80211_PERM_ADDR()

Bob Copeland (1):
  mac80211: mesh: drop constant field mean_chain_len

Eliad Peller (3):
  mac80211: move TKIP TX IVs to public part of key struct
  iwlwifi: mvm: move TX PN assignment for TKIP to the driver
  mac80211: remove ieee80211_get_key_tx_seq/ieee80211_set_key_tx_seq

Emmanuel Grumbach (1):
  mac80211: limit the A-MSDU Tx based on peer's capabilities

Felix Fietkau (5):
  mac80211: move A-MSDU skb_linearize call to ieee80211_amsdu_to_8023s
  cfg80211: add function for 802.3 conversion with separate output buffer
  cfg80211: add support for non-linear skbs in ieee80211_amsdu_to_8023s
  cfg80211: fix faulty variable initialization in ieee80211_amsdu_to_8023s
  cfg80211: reuse existing page fragments in A-MSDU rx

Geliang Tang (1):
  cfg80211/mac80211: use to_delayed_work

Grzegorz Bajorski (1):
  mac80211: allow drivers to report (non-)monitor frames

Heikki Krogerus (4):
  net: rfkill: add rfkill_find_type function
  net: rfkill: gpio: get the name and type from device property
  ARM: tegra: use build-in device properties with rfkill_gpio
  net: rfkill: gpio: remove rfkill_gpio_platform_data

Henning Rogge (3):
  mac80211: Remove MPP table entries with MPath
  mac80211: let unused MPP table entries timeout
  mac80211: Unify mesh and mpp path removal function

Ilan Peer (1):
  mac80211: Recalc min chandef when station is associated

Johannes Berg (8):
  cfg80211: remove CFG80211_REG_DEBUG
  mac80211: document status.freq restrictions
  mac80211: refactor HT/VHT to chandef code
  mac80211_hwsim: remove shadowing variable
  rfkill: disentangle polling pause and suspend
  mac80211: add RX_FLAG_MACTIME_PLCP_START
  mac80211: always print a message when disconnecting
  mac80211: change ieee80211_rx_reorder_ready() arguments

Jouni Malinen (1):
  mac80211: Interoperability workaround for 80+80 and 160 MHz channels

João Paulo Rechi Vita (10):
  rfkill: use variable instead of duplicating the expression
  rfkill: remove/inline __rfkill_set_hw_state
  rfkill: Remove obsolete "claim" sysfs interface
  rfkill: Update userspace API documentation
  rfkill: Improve documentation language
  rfkill: Remove extra blank line
  rfkill: Point to the correct deprecated doc location
  rfkill: Move "state" sysfs file back to stable
  rfkill: Factor rfkill_global_states[].cur assignments
  rfkill: Add documentation about LED triggers

Lior David (1):
  cfg80211: basic support for PBSS network type

Lorenzo Bianconi (2):
  mac80211: fix wiphy supported_band access
  cfg80211: add radiotap VHT info to rtap_namespace_sizes

Michal Kazior (3):
  mac80211: fix txq queue related crashes
  mac80211: fix unnecessary frame drops in mesh fwding
  mac80211: expose txq queue depth and size to drivers

Ola Olsson (2):
  cfg80211: add more warnings for inconsistent ops
  cfg80211: Fix some linguistics in Kconfig

Sara Sharon (10):
  mac80211: process and save VHT 

Re: [PATCH v4 net-next] net: Implement fast csum_partial for x86_64

2016-02-26 Thread Linus Torvalds
Looks ok to me.

I am left wondering if the code should just do that

add32_with_carry3(sum, result >> 32, result);

in the caller instead - right now pretty much every return point in
do_csum() effectively does that,  with the exception of

 - the 0-length case, which is presumably not really an issue in real
life and could as well just return 0

 - the 8-byte case that does two 32-bit loads instead, but could just
do a single 8-byte load and return it (and then the generic version in
the caller would do a shift).

That would simplify the code a bit - it wouldn't need to pass in
"sum" to do_csum at all, and we'd have just a single case of that
special 3-input carry add.
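For readers following along, the three-input carry add under discussion can be sketched in portable C. This is illustrative only — the actual patch at the end of this digest uses inline assembly, and only the helper name is taken from it:

```c
#include <stdint.h>

/* Portable sketch of a 3-input add with end-around carry: add the three
 * 32-bit values in 64 bits, then fold any carry-outs back into the low
 * 32 bits (twice, since the first fold can itself carry). This matches
 * the addl/adcl/adcl sequence for all reachable inputs. */
static inline uint32_t add32_with_carry3(uint32_t a, uint32_t b, uint32_t c)
{
	uint64_t sum = (uint64_t)a + b + c;	/* at most 34 bits */

	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffffffffu) + (sum >> 32);
	return (uint32_t)sum;
}
```

With something like this, the caller could fold `result >> 32` and `result` into `sum` in one step, as suggested above.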

But I'm certainly ok with it as-is. I'm not sure how performance
critical the whole csum routine is, but at least now it doesn't
introduce a lot of new complex asm.

And this version might be reasonable to make generic, so that non-x86
architectures could use the same approach. That's what we ended up
doing for the dcache word-at-a-time code too in the end.

Linus


Re: [PATCH] net: ezchip: adapt driver to little endian architecture

2016-02-26 Thread David Miller
From: Arnd Bergmann 
Date: Fri, 26 Feb 2016 21:10:31 +0100

> On Friday 26 February 2016 22:05:09 Lada Trimasova wrote:
>> for (i = 0; i < len; i++, reg++) {
>> u32 buf = nps_enet_reg_get(priv, 
>> NPS_ENET_REG_RX_BUF);
>> +   buf = be32_to_cpu(buf);
>> put_unaligned(buf, reg);
>> }
> 
> I think most of the changes can make use of the put_unaligned_be32()
> etc helpers that might also be more efficient.

Agreed.


[PATCH net-next v2 4/4] bridge: mcast: add support for more router port information dumping

2016-02-26 Thread Nikolay Aleksandrov
Allow for more multicast router port information to be dumped, such as
timer and type attributes. For that purpose we need to extend the
MDBA_ROUTER_PORT attribute similarly to how it was done for the mdb
entries recently. The new format is thus:
[MDBA_ROUTER_PORT] = { <- nested attribute
u32 ifindex <- router port ifindex for user-space compatibility
[MDBA_ROUTER_PATTR attributes]
}
This way it remains compatible with older users (they'll simply retrieve
the u32 at the beginning) and new users can parse the remaining
attributes. It would also allow adding future extensions to the router
port without breaking compatibility.
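As an illustration of that compatibility story, here is a hedged userspace-side sketch of reading the new nested format. The struct and helpers are minimal stand-ins written for this example (not the iproute2/libnl API); the layout follows the description above — a raw u32 ifindex first, then MDBA_ROUTER_PATTR attributes:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Minimal stand-in for the netlink attribute header (illustrative). */
struct nlattr {
	uint16_t nla_len;	/* total length, header included */
	uint16_t nla_type;
};
#define NLA_ALIGN(len) (((len) + 3) & ~3)
#define NLA_HDRLEN     ((int)NLA_ALIGN(sizeof(struct nlattr)))

/* Old users: just read the leading u32 ifindex inside MDBA_ROUTER_PORT. */
static uint32_t router_port_ifindex(const struct nlattr *port)
{
	uint32_t ifindex;

	memcpy(&ifindex, (const char *)port + NLA_HDRLEN, sizeof(ifindex));
	return ifindex;
}

/* New users: skip the u32, then walk the nested MDBA_ROUTER_PATTR
 * attributes looking for a given type. */
static const struct nlattr *router_port_attr(const struct nlattr *port,
					     uint16_t type)
{
	int len = port->nla_len - NLA_HDRLEN - (int)sizeof(uint32_t);
	const char *p = (const char *)port + NLA_HDRLEN + sizeof(uint32_t);

	while (len >= NLA_HDRLEN) {
		const struct nlattr *a = (const struct nlattr *)p;

		if (a->nla_type == type)
			return a;
		p += NLA_ALIGN(a->nla_len);
		len -= NLA_ALIGN(a->nla_len);
	}
	return NULL;
}
```

An old consumer that only calls the first helper keeps working unchanged, which is the point of keeping the u32 at the start of the nest.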

Signed-off-by: Nikolay Aleksandrov 
---
v2: new patch, adds the extended router port netlink information

 include/uapi/linux/if_bridge.h | 14 +-
 net/bridge/br_mdb.c| 16 ++--
 2 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/if_bridge.h b/include/uapi/linux/if_bridge.h
index b281d02051cc..af98f6855b7e 100644
--- a/include/uapi/linux/if_bridge.h
+++ b/include/uapi/linux/if_bridge.h
@@ -161,7 +161,10 @@ enum {
  * }
  * }
  * [MDBA_ROUTER] = {
- *[MDBA_ROUTER_PORT]
+ *[MDBA_ROUTER_PORT] = {
+ *u32 ifindex
+ *[MDBA_ROUTER_PATTR attributes]
+ *}
  * }
  */
 enum {
@@ -209,6 +212,15 @@ enum {
 };
 #define MDBA_ROUTER_MAX (__MDBA_ROUTER_MAX - 1)
 
+/* router port attributes */
+enum {
+   MDBA_ROUTER_PATTR_UNSPEC,
+   MDBA_ROUTER_PATTR_TIMER,
+   MDBA_ROUTER_PATTR_TYPE,
+   __MDBA_ROUTER_PATTR_MAX
+};
+#define MDBA_ROUTER_PATTR_MAX (__MDBA_ROUTER_PATTR_MAX - 1)
+
 struct br_port_msg {
__u8  family;
__u32 ifindex;
diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index 73786e2fe065..253bc77eda3b 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -20,7 +20,7 @@ static int br_rports_fill_info(struct sk_buff *skb, struct 
netlink_callback *cb,
 {
struct net_bridge *br = netdev_priv(dev);
struct net_bridge_port *p;
-   struct nlattr *nest;
+   struct nlattr *nest, *port_nest;
 
	if (!br->multicast_router || hlist_empty(&br->router_list))
return 0;
@@ -30,8 +30,20 @@ static int br_rports_fill_info(struct sk_buff *skb, struct 
netlink_callback *cb,
return -EMSGSIZE;
 
	hlist_for_each_entry_rcu(p, &br->router_list, rlist) {
-   if (p && nla_put_u32(skb, MDBA_ROUTER_PORT, p->dev->ifindex))
+   if (!p)
+   continue;
+   port_nest = nla_nest_start(skb, MDBA_ROUTER_PORT);
+   if (!port_nest)
goto fail;
+   if (nla_put_nohdr(skb, sizeof(u32), &p->dev->ifindex) ||
+   nla_put_u32(skb, MDBA_ROUTER_PATTR_TIMER,
+   br_timer_value(&p->multicast_router_timer)) ||
+   nla_put_u8(skb, MDBA_ROUTER_PATTR_TYPE,
+  p->multicast_router)) {
+   nla_nest_cancel(skb, port_nest);
+   goto fail;
+   }
+   nla_nest_end(skb, port_nest);
}
 
nla_nest_end(skb, nest);
-- 
2.4.3



[PATCH net-next v2 3/4] bridge: mcast: add support for temporary port router

2016-02-26 Thread Nikolay Aleksandrov
Add support for a temporary router port which doesn't depend only on the
incoming query. It can be refreshed if set to the same value, which is
a no-op for the rest.

Signed-off-by: Nikolay Aleksandrov 
---
v2: split in two, this only adds the new temp router port type

 include/uapi/linux/if_bridge.h |  1 +
 net/bridge/br_multicast.c  | 21 +++--
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/if_bridge.h b/include/uapi/linux/if_bridge.h
index f2764b739f38..b281d02051cc 100644
--- a/include/uapi/linux/if_bridge.h
+++ b/include/uapi/linux/if_bridge.h
@@ -199,6 +199,7 @@ enum {
MDB_RTR_TYPE_DISABLED,
MDB_RTR_TYPE_TEMP_QUERY,
MDB_RTR_TYPE_PERM,
+   MDB_RTR_TYPE_TEMP
 };
 
 enum {
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index f1140cf5168d..a4c15df2b792 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -759,13 +759,17 @@ static void br_multicast_router_expired(unsigned long 
data)
struct net_bridge *br = port->br;
 
	spin_lock(&br->multicast_lock);
-   if (port->multicast_router != MDB_RTR_TYPE_TEMP_QUERY ||
+   if (port->multicast_router == MDB_RTR_TYPE_DISABLED ||
+   port->multicast_router == MDB_RTR_TYPE_PERM ||
	timer_pending(&port->multicast_router_timer) ||
	hlist_unhashed(&port->rlist))
goto out;
 
	hlist_del_init_rcu(&port->rlist);
br_rtr_notify(br->dev, port, RTM_DELMDB);
+   /* Don't allow timer refresh if the router expired */
+   if (port->multicast_router == MDB_RTR_TYPE_TEMP)
+   port->multicast_router = MDB_RTR_TYPE_TEMP_QUERY;
 
 out:
	spin_unlock(&br->multicast_lock);
@@ -981,6 +985,9 @@ void br_multicast_disable_port(struct net_bridge_port *port)
	if (!hlist_unhashed(&port->rlist)) {
	hlist_del_init_rcu(&port->rlist);
br_rtr_notify(br->dev, port, RTM_DELMDB);
+   /* Don't allow timer refresh if disabling */
+   if (port->multicast_router == MDB_RTR_TYPE_TEMP)
+   port->multicast_router = MDB_RTR_TYPE_TEMP_QUERY;
}
	del_timer(&port->multicast_router_timer);
	del_timer(&port->ip4_own_query.timer);
@@ -1234,7 +1241,8 @@ static void br_multicast_mark_router(struct net_bridge 
*br,
return;
}
 
-   if (port->multicast_router != MDB_RTR_TYPE_TEMP_QUERY)
+   if (port->multicast_router == MDB_RTR_TYPE_DISABLED ||
+   port->multicast_router == MDB_RTR_TYPE_PERM)
return;
 
br_multicast_add_router(br, port);
@@ -1850,10 +1858,15 @@ static void __del_port_router(struct net_bridge_port *p)
 int br_multicast_set_port_router(struct net_bridge_port *p, unsigned long val)
 {
struct net_bridge *br = p->br;
+   unsigned long now = jiffies;
int err = -EINVAL;
 
	spin_lock(&br->multicast_lock);
if (p->multicast_router == val) {
+   /* Refresh the temp router port timer */
+   if (p->multicast_router == MDB_RTR_TYPE_TEMP)
+   mod_timer(&p->multicast_router_timer,
+ now + br->multicast_querier_interval);
err = 0;
goto unlock;
}
@@ -1872,6 +1885,10 @@ int br_multicast_set_port_router(struct net_bridge_port 
*p, unsigned long val)
	del_timer(&p->multicast_router_timer);
br_multicast_add_router(br, p);
break;
+   case MDB_RTR_TYPE_TEMP:
+   p->multicast_router = MDB_RTR_TYPE_TEMP;
+   br_multicast_mark_router(br, p);
+   break;
default:
goto unlock;
}
-- 
2.4.3



[PATCH net-next v2 1/4] bridge: mcast: use names for the different multicast_router types

2016-02-26 Thread Nikolay Aleksandrov
Using raw values makes the code difficult to extend and also to
understand; give them names and do explicit per-option manipulation in
br_multicast_set_port_router.

Signed-off-by: Nikolay Aleksandrov 
---
v2: set multicast_router first

 include/uapi/linux/if_bridge.h |  7 +
 net/bridge/br_multicast.c  | 61 +++---
 2 files changed, 40 insertions(+), 28 deletions(-)

diff --git a/include/uapi/linux/if_bridge.h b/include/uapi/linux/if_bridge.h
index 610932b477c4..f2764b739f38 100644
--- a/include/uapi/linux/if_bridge.h
+++ b/include/uapi/linux/if_bridge.h
@@ -194,6 +194,13 @@ enum {
 };
 #define MDBA_MDB_EATTR_MAX (__MDBA_MDB_EATTR_MAX - 1)
 
+/* multicast router types */
+enum {
+   MDB_RTR_TYPE_DISABLED,
+   MDB_RTR_TYPE_TEMP_QUERY,
+   MDB_RTR_TYPE_PERM,
+};
+
 enum {
MDBA_ROUTER_UNSPEC,
MDBA_ROUTER_PORT,
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 8b6e4249be1b..71c109b0943f 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -759,7 +759,7 @@ static void br_multicast_router_expired(unsigned long data)
struct net_bridge *br = port->br;
 
	spin_lock(&br->multicast_lock);
-   if (port->multicast_router != 1 ||
+   if (port->multicast_router != MDB_RTR_TYPE_TEMP_QUERY ||
	timer_pending(&port->multicast_router_timer) ||
	hlist_unhashed(&port->rlist))
goto out;
@@ -912,7 +912,7 @@ static void br_ip6_multicast_port_query_expired(unsigned 
long data)
 
 void br_multicast_add_port(struct net_bridge_port *port)
 {
-   port->multicast_router = 1;
+   port->multicast_router = MDB_RTR_TYPE_TEMP_QUERY;
 
	setup_timer(&port->multicast_router_timer, br_multicast_router_expired,
(unsigned long)port);
@@ -959,7 +959,8 @@ void br_multicast_enable_port(struct net_bridge_port *port)
 #if IS_ENABLED(CONFIG_IPV6)
	br_multicast_enable(&port->ip6_own_query);
 #endif
-   if (port->multicast_router == 2 && hlist_unhashed(&port->rlist))
+   if (port->multicast_router == MDB_RTR_TYPE_PERM &&
+   hlist_unhashed(&port->rlist))
br_multicast_add_router(br, port);
 
 out:
@@ -1227,13 +1228,13 @@ static void br_multicast_mark_router(struct net_bridge 
*br,
unsigned long now = jiffies;
 
if (!port) {
-   if (br->multicast_router == 1)
+   if (br->multicast_router == MDB_RTR_TYPE_TEMP_QUERY)
	mod_timer(&br->multicast_router_timer,
  now + br->multicast_querier_interval);
return;
}
 
-   if (port->multicast_router != 1)
+   if (port->multicast_router != MDB_RTR_TYPE_TEMP_QUERY)
return;
 
br_multicast_add_router(br, port);
@@ -1713,7 +1714,7 @@ void br_multicast_init(struct net_bridge *br)
br->hash_elasticity = 4;
br->hash_max = 512;
 
-   br->multicast_router = 1;
+   br->multicast_router = MDB_RTR_TYPE_TEMP_QUERY;
br->multicast_querier = 0;
br->multicast_query_use_ifaddr = 0;
br->multicast_last_member_count = 2;
@@ -1823,11 +1824,11 @@ int br_multicast_set_router(struct net_bridge *br, 
unsigned long val)
	spin_lock_bh(&br->multicast_lock);
 
switch (val) {
-   case 0:
-   case 2:
+   case MDB_RTR_TYPE_DISABLED:
+   case MDB_RTR_TYPE_PERM:
	del_timer(&br->multicast_router_timer);
/* fall through */
-   case 1:
+   case MDB_RTR_TYPE_TEMP_QUERY:
br->multicast_router = val;
err = 0;
break;
@@ -1838,6 +1839,14 @@ int br_multicast_set_router(struct net_bridge *br, 
unsigned long val)
return err;
 }
 
+static void __del_port_router(struct net_bridge_port *p)
+{
+   if (hlist_unhashed(&p->rlist))
+   return;
+   hlist_del_init_rcu(&p->rlist);
+   br_rtr_notify(p->br->dev, p, RTM_DELMDB);
+}
+
 int br_multicast_set_port_router(struct net_bridge_port *p, unsigned long val)
 {
struct net_bridge *br = p->br;
@@ -1846,29 +1855,25 @@ int br_multicast_set_port_router(struct net_bridge_port 
*p, unsigned long val)
	spin_lock(&br->multicast_lock);
 
switch (val) {
-   case 0:
-   case 1:
-   case 2:
-   p->multicast_router = val;
-   err = 0;
-
-   if (val < 2 && !hlist_unhashed(&p->rlist)) {
-   hlist_del_init_rcu(&p->rlist);
-   br_rtr_notify(br->dev, p, RTM_DELMDB);
-   }
-
-   if (val == 1)
-   break;
-
+   case MDB_RTR_TYPE_DISABLED:
+   p->multicast_router = MDB_RTR_TYPE_DISABLED;
+   __del_port_router(p);
+   del_timer(&p->multicast_router_timer);
+   break;
+   case MDB_RTR_TYPE_TEMP_QUERY:
+   p->multicast_router = MDB_RTR_TYPE_TEMP_QUERY;
+   

Re: [PATCH 0/4] Convert network timestamps to be y2038 safe

2016-02-26 Thread David Miller
From: Deepa Dinamani 
Date: Wed, 24 Feb 2016 23:07:07 -0800

> Introduction:
> 
> The series is aimed at transitioning network timestamps to being
> y2038 safe.
> All patches can be reviewed and merged independently, except for
> the [PATCH 2/4], which is dependent on the [PATCH 1/4].
> 
> Socket timestamps and ioctl calls will be handled separately.
> 
> Thanks to Arnd Bergmann for discussing solution options with me.
> 
> Solution:
> 
> Data type struct timespec is not y2038 safe.
> Replace timespec with struct timespec64 which is y2038 safe.

Please respin this, moving the helper into net/ipv4/af_inet.c as per
the feedback given.

Thanks.


[PATCH net-next v2 0/4] bridge: mcast: add support for temp router port

2016-02-26 Thread Nikolay Aleksandrov
Hi,
This set adds support for a temporary router port which doesn't depend only
on the incoming queries. It can be refreshed by setting multicast_router to
the same value (3). The first two patches are minor changes that prepare
the code for the third which adds this new type of router port.
In order to be able to dump its information, the mdb router port format
is changed in patch 04 and extended similarly to how the mdb entry
format was extended recently.
The related iproute2 changes will be posted if this is accepted.

v2: set val first and adjust router type later in patch 01, patch 03 was
split in 2

Thanks,
 Nik


Nikolay Aleksandrov (4):
  bridge: mcast: use names for the different multicast_router types
  bridge: mcast: do nothing if port's multicast_router is set to the
same val
  bridge: mcast: add support for temporary port router
  bridge: mcast: add support for more router port information dumping

 include/uapi/linux/if_bridge.h | 22 ++-
 net/bridge/br_mdb.c| 16 +++-
 net/bridge/br_multicast.c  | 83 +++---
 3 files changed, 89 insertions(+), 32 deletions(-)

-- 
2.4.3



[PATCH net-next v2 2/4] bridge: mcast: do nothing if port's multicast_router is set to the same val

2016-02-26 Thread Nikolay Aleksandrov
This is needed for the upcoming temporary port router. There's no point
in going through the logic if the value is the same.

Signed-off-by: Nikolay Aleksandrov 
---
v2: no change

 net/bridge/br_multicast.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 71c109b0943f..f1140cf5168d 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -1853,7 +1853,10 @@ int br_multicast_set_port_router(struct net_bridge_port 
*p, unsigned long val)
int err = -EINVAL;
 
	spin_lock(&br->multicast_lock);
-
+   if (p->multicast_router == val) {
+   err = 0;
+   goto unlock;
+   }
switch (val) {
case MDB_RTR_TYPE_DISABLED:
p->multicast_router = MDB_RTR_TYPE_DISABLED;
-- 
2.4.3



BUG: ixgbe_select_queue: general protection fault in v4.4.3

2016-02-26 Thread Asbjørn Sloth Tønnesen
Hi,

It seems that v4.4.3 doesn't like having lots of VLANs; the fault occurs
shortly after enabling forwarding, with in this case just 350
net_devices defined.

The server is now running a known good v4.3.3. The NIC is a X520-DA2.

The issue is quite reproducible, but I will only have spare hardware to
do further tests on some time next week.


[ 1474.416366] general protection fault:  [#1] SMP 
[ 1474.421399] Modules linked in: xt_nat iptable_nat nf_nat_ipv4 nf_nat 
crct10dif_pclmul crct10dif_common crc32c_intel cryptd 8021q
[ 1474.433211] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.4.3-fbcr1 #144
[ 1474.439746] Hardware name: Supermicro A1SAi/A1SRi, BIOS 1.1a 08/27/2015
[ 1474.446370] task: 880276188000 ti: 88027619 task.ti: 
88027619
[ 1474.453864] RIP: 0010:[]  [] 
__netdev_pick_tx+0x61/0x140
[ 1474.462337] RSP: 0018:88027fd036d8  EFLAGS: 00010286
[ 1474.467657] RAX: 88007dc0ff10 RBX:  RCX: 816698a0
[ 1474.474800] RDX: 000c RSI: 88007ce98c00 RDI: 88007dd44000
[ 1474.481944] RBP: 88027fd03708 R08: 8802580c0af0 R09: 05dc
[ 1474.489093] R10: ec2abb211b00c637 R11: 1b00c63741e7d944 R12: 88007dd44000
[ 1474.496233] R13:  R14:  R15: 0100
[ 1474.503559] FS:  () GS:88027fd0() 
knlGS:
[ 1474.511920] CS:  0010 DS:  ES:  CR0: 8005003b
[ 1474.517854] CR2: 7f2b70dd5095 CR3: 01ad6000 CR4: 001006e0
[ 1474.525131] Stack:
[ 1474.527287]  88025911c400  88007dd44000 
88007ce98c00
[ 1474.535036]   88007dcbca00 88027fd03718 
8152dcca
[ 1474.542800]  88027fd03748 816707ac  
88007dd44000
[ 1474.550554] Call Trace:
[ 1474.553138]   
[ 1474.555077]  [] ixgbe_select_queue+0x1a/0x20
[ 1474.561433]  [] netdev_pick_tx+0x5c/0xd0
[ 1474.567063]  [] __dev_queue_xmit+0x80/0x530
[ 1474.572948]  [] dev_queue_xmit+0xb/0x10
[ 1474.578540]  [] vlan_dev_hard_start_xmit+0x93/0x110 [8021q]
[ 1474.585827]  [] dev_hard_start_xmit+0x256/0x3f0
[ 1474.592055]  [] __dev_queue_xmit+0x4cc/0x530
[ 1474.598080]  [] ? __dev_queue_xmit+0x248/0x530
[ 1474.604224]  [] dev_queue_xmit+0xb/0x10
[ 1474.609762]  [] vlan_dev_hard_start_xmit+0x93/0x110 [8021q]
[ 1474.617088]  [] dev_hard_start_xmit+0x256/0x3f0
[ 1474.623326]  [] ? eth_header+0x25/0xc0
[ 1474.628773]  [] __dev_queue_xmit+0x4cc/0x530
[ 1474.634801]  [] dev_queue_xmit+0xb/0x10
[ 1474.640348]  [] neigh_connected_output+0xc1/0xf0
[ 1474.64]  [] ip_finish_output2+0x122/0x300
[ 1474.652772]  [] ip_do_fragment+0x793/0x890
[ 1474.658571]  [] ? nf_nat_ipv4_fn+0x197/0x1f0 [nf_nat_ipv4]
[ 1474.665757]  [] ? ip_fragment.constprop.49+0x80/0x80
[ 1474.672422]  [] ip_fragment.constprop.49+0x3e/0x80
[ 1474.678922]  [] ip_finish_output+0xbd/0x1e0
[ 1474.684865]  [] NF_HOOK_COND.part.33.constprop.47+0x9/0x10
[ 1474.692112]  [] ip_output+0xb0/0xc0
[ 1474.697310]  [] ? 
__ip_flush_pending_frames.isra.40+0x80/0x80
[ 1474.704886]  [] ip_forward_finish+0x48/0x70
[ 1474.710779]  [] ip_forward+0x3cd/0x450
[ 1474.716228]  [] ? ip_frag_mem+0x40/0x40
[ 1474.721765]  [] ip_rcv_finish+0x8d/0x320
[ 1474.727442]  [] ip_rcv+0x2c3/0x370
[ 1474.732543]  [] ? inet_del_offload+0x40/0x40
[ 1474.738516]  [] __netif_receive_skb_core+0x6f1/0xa40
[ 1474.745232]  [] ? udp4_gro_receive+0x1c2/0x2e0
[ 1474.755275]  [] ? inet_gro_receive+0x18d/0x200
[ 1474.761440]  [] __netif_receive_skb+0x18/0x60
[ 1474.767499]  [] netif_receive_skb_internal+0x28/0x90
[ 1474.774165]  [] napi_gro_receive+0xc3/0xf0
[ 1474.780023]  [] ixgbe_clean_rx_irq+0x507/0x920
[ 1474.786174]  [] ixgbe_poll+0x531/0x8d0
[ 1474.791624]  [] net_rx_action+0x1f2/0x330
[ 1474.797395]  [] __do_softirq+0xa0/0x2b0
[ 1474.802937]  [] irq_exit+0x9e/0xa0
[ 1474.808046]  [] do_IRQ+0x4f/0xd0
[ 1474.812978]  [] common_interrupt+0x82/0x82
[ 1474.818823]   
[ 1474.820766]  [] ? cpuidle_enter_state+0x11e/0x2c0
[ 1474.827547]  [] ? cpuidle_enter_state+0x10c/0x2c0
[ 1474.833951]  [] cpuidle_enter+0x12/0x20
[ 1474.839540]  [] cpu_startup_entry+0x29a/0x300
[ 1474.845605]  [] start_secondary+0xed/0xf0
[ 1474.851318] Code: 87 a8 03 00 00 49 89 fc 48 85 c0 0f 84 da 00 00 00 8b 96 
ac 00 00 00 83 ea 01 48 8d 44 d0 10 4c 8b 38 4d 85 ff 0f 84 c0 00 00 00 <41> 8b 
1f 83 fb 01 0f 84 8d 00 00 00 f6 86 91 00 00 00 30 0f 84 
[ 1474.872330] RIP  [] __netdev_pick_tx+0x61/0x140
[ 1474.878595]  RSP 
[ 1474.882784] ---[ end trace 6a6e1080c88377db ]---
[ 1474.887608] Kernel panic - not syncing: Fatal exception in interrupt



[  217.407517] general protection fault:  [#1] SMP 
[  217.412740] Modules linked in: xt_nat iptable_nat nf_nat_ipv4 nf_nat 
crct10dif_pclmul crct10dif_common crc32c_intel cryptd 8021q
[  217.425240] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.4.3-fbcr1 #144
[  217.431845] Hardware name: Supermicro A1SAi/A1SRi, BIOS 1.1a 08/27/2015
[  217.438537] task: 

Re: [PATCH] net: ezchip: adapt driver to little endian architecture

2016-02-26 Thread Arnd Bergmann
On Friday 26 February 2016 22:05:09 Lada Trimasova wrote:
> 
> @@ -75,6 +86,7 @@ struct nps_enet_rx_ctl {
>  * nr:  Length in bytes of Rx frame loaded by MAC to Rx buffer
>  */
> struct {
> +#ifdef CONFIG_CPU_BIG_ENDIAN
> u32
> __reserved_1:16,
> cr:1,
> @@ -82,6 +94,15 @@ struct nps_enet_rx_ctl {
> crc:1,
> __reserved_2:2,
> nr:11;
> +#else
> +   u32
> +   nr:11,
> +   __reserved_2:2,
> +   crc:1,
> +   er:1,
> +   cr:1,
> +   __reserved_1:16;
> +#endif
> };

A nicer way to do this would be to remove all the bitfields
and use named constants for accessing the fields inside of
a u32 or u64 variable.

The order of the bits in a bit field is implementation specific
and your method might not work on all architectures. Even if the
driver is only meant to run on a single CPU architecture, it's
always better to write portable code.
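A sketch of the mask-and-shift style Arnd suggests, using the bit positions implied by the big-endian layout quoted above (nr = bits 10:0, crc = bit 13, er = bit 14, cr = bit 15); macro and function names here are illustrative, not the driver's:

```c
#include <stdint.h>

/* RX control word fields, per the big-endian bitfield layout quoted
 * above. Names are illustrative stand-ins for driver constants. */
#define RX_CTL_NR_MASK	0x7ffu		/* bits 10:0 */
#define RX_CTL_CRC_BIT	(1u << 13)
#define RX_CTL_ER_BIT	(1u << 14)
#define RX_CTL_CR_BIT	(1u << 15)

static inline unsigned int rx_ctl_nr(uint32_t v)
{
	return v & RX_CTL_NR_MASK;	/* frame length field */
}

static inline int rx_ctl_cr(uint32_t v)
{
	return !!(v & RX_CTL_CR_BIT);	/* 'cr' flag */
}
```

Once the register value has been byte-swapped into CPU order, these accessors behave identically on either endianness, which is the portability win over compiler-laid-out bitfields.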

Arnd


Re: [PATCH] net: ezchip: adapt driver to little endian architecture

2016-02-26 Thread Arnd Bergmann
On Friday 26 February 2016 22:05:09 Lada Trimasova wrote:
> for (i = 0; i < len; i++, reg++) {
> u32 buf = nps_enet_reg_get(priv, NPS_ENET_REG_RX_BUF);
> +   buf = be32_to_cpu(buf);
> put_unaligned(buf, reg);
> }

I think most of the changes can make use of the put_unaligned_be32()
etc helpers that might also be more efficient.

Arnd


Re: [PATCH] net: ndo_fdb_dump should report -EMSGSIZE to rtnl_fdb_dump.

2016-02-26 Thread David Miller
From: MINOURA Makoto / 箕浦 真 
Date: Thu, 25 Feb 2016 14:20:48 +0900

> When the send skbuff reaches the end, nlmsg_put and friends returns
> -EMSGSIZE but it is silently thrown away in ndo_fdb_dump. It is called
> within a for_each_netdev loop and the first fdb entry of a following
> netdev could fit in the remaining skbuff.  This breaks the mechanism
> of cb->args[0] and idx to keep track of the entries that are already
> dumped, which results in missing entries in the bridge fdb show command.
> 
> Signed-off-by: Minoura Makoto 

Good catch, applied, thanks.


[PATCH v4 net-next] net: Implement fast csum_partial for x86_64

2016-02-26 Thread Tom Herbert
This patch implements a performant csum_partial for x86_64. The intent is
to speed up checksum calculation, particularly for smaller lengths such
as those that are present when doing skb_postpull_rcsum when getting
CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY conversion.
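As context for the testing notes below, a naive portable reference for the 16-bit ones'-complement sum that csum_partial accumulates might look like the following. This is an illustrative oracle only — not the patch's code, and it folds to 16 bits while the kernel routine returns an unfolded 32-bit sum:

```c
#include <stddef.h>
#include <stdint.h>

/* Slow reference checksum: 16-bit ones'-complement sum over the buffer
 * (little-endian 16-bit word convention), folded to 16 bits. */
static uint32_t ref_csum_partial(const uint8_t *buf, size_t len, uint32_t sum)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)buf[i] | ((uint32_t)buf[i + 1] << 8);
	if (len & 1)
		sum += buf[len - 1];		/* trailing odd byte */
	while (sum >> 16)			/* end-around carry fold */
		sum = (sum & 0xffff) + (sum >> 16);
	return sum;
}
```

An oracle like this is what "compared the computed checksum using the original algorithm" refers to: any optimized implementation must agree with it for every length and alignment.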

- v4
   - went back to C code with inline assembly for critical routines
   - implemented suggestion from Linus to deal with lengths < 8

Testing:

Correctness:

Verified correctness by testing arbitrary length buffer filled with
random data. For each buffer I compared the computed checksum
using the original algorithm for each possible alignment (0-7 bytes).

Performance:

Isolating old and new implementation for some common cases:

 Old  New %
Len/Aln  nsecsnsecs   Improv
+---++---
1400/0195.6181.7   7% (Big packet)
40/0  11.4 6.2 45%(Ipv6 hdr cmn case)
8/4   7.9  3.2 59%(UDP, VXLAN in IPv4)
14/0  8.9  5.9 33%(Eth hdr)
14/4  9.2  5.9 35%(Eth hdr in IPv4)
14/3  9.6  5.9 38%(Eth with odd align)
20/0  9.0  6.2 31%(IP hdr without options)
7/1   8.9  4.2 52%(buffer in one quad)
100/017.4 13.9 20%(medium-sized pkt)
100/217.8 14.2 20%(medium-sized pkt w/ alignment)

Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz

Also tested on these with similar results:

Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz

Branch  prediction:

To test the effects of poor branch prediction in the jump tables I
tested checksum performance with runs for two combinations of length
and alignment. As the baseline I performed the test by doing half of
calls with the first combination, followed by using the second
combination for the second half. In the test case, I interleave the
two combinations so that in every call the length and alignment are
different to defeat the effects of branch prediction. Running several
cases, I did not see any material performance difference between the
two scenarios (perf stat output is below), neither does either case
show a significant number of branch misses.

Interleave lengths case:

perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
./csum -M new-thrash -I -l 100 -S 24 -a 1 -c 1

 Performance counter stats for './csum -M new-thrash -I -l 100 -S 24 -a 1 -c 
1' (10 runs):

 9,556,693,202  instructions   ( +-  0.00% )
 1,176,208,640   branches   
  ( +-  0.00% )
19,487   branch-misses#0.00% of all branches
  ( +-  6.07% )

   2.049732539 seconds time elapsed

Non-interleave case:

perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
 ./csum -M new-thrash -l 100 -S 24 -a 1 -c 1

Performance counter stats for './csum -M new-thrash -l 100 -S 24 -a 1 -c 
1' (10 runs):

 9,782,188,310  instructions   ( +-  0.00% )
 1,251,286,958   branches   
  ( +-  0.01% )
18,950   branch-misses#0.00% of all branches
  ( +- 12.74% )

   2.271789046 seconds time elapsed

Signed-off-by: Tom Herbert 
---
 arch/x86/include/asm/checksum_64.h |  21 
 arch/x86/lib/csum-partial_64.c | 225 -
 2 files changed, 143 insertions(+), 103 deletions(-)

diff --git a/arch/x86/include/asm/checksum_64.h 
b/arch/x86/include/asm/checksum_64.h
index cd00e17..e20c35b 100644
--- a/arch/x86/include/asm/checksum_64.h
+++ b/arch/x86/include/asm/checksum_64.h
@@ -188,6 +188,27 @@ static inline unsigned add32_with_carry(unsigned a, 
unsigned b)
return a;
 }
 
+static inline unsigned long add64_with_carry(unsigned long a, unsigned long b)
+{
+   asm("addq %2,%0\n\t"
+   "adcq $0,%0"
+   : "=r" (a)
+   : "0" (a), "rm" (b));
+   return a;
+}
+
+static inline unsigned int add32_with_carry3(unsigned int a, unsigned int b,
+unsigned int c)
+{
+   asm("addl %2,%0\n\t"
+   "adcl %3,%0\n\t"
+   "adcl $0,%0"
+   : "=r" (a)
+   : "0" (a), "rm" (b), "rm" (c));
+
+   return a;
+}
+
 #define HAVE_ARCH_CSUM_ADD
 static inline __wsum csum_add(__wsum csum, __wsum addend)
 {
diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index 9845371..df82c9b 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -8,114 +8,66 @@
 #include 
 #include 
 #include 
+#include 
 
-static inline unsigned short from32to16(unsigned a) 
+static inline unsigned long rotate_by8_if_odd(unsigned long sum, bool aligned)
 {
-   unsigned short b = a >> 16; 
-   asm("addw 

Re: [net-next v2 00/15][pull request] 1GbE Intel Wired LAN Driver Updates 2016-02-24

2016-02-26 Thread David Miller
From: Jeff Kirsher 
Date: Wed, 24 Feb 2016 18:14:47 -0800

> This series contains updates to e1000e, igb and igbvf.

Pulled, thanks Jeff.


Re: [PATCH net-next 0/3] bridge: mcast: add support for temp router port

2016-02-26 Thread Nikolay Aleksandrov
On 02/26/2016 07:59 PM, Nikolay Aleksandrov wrote:
> Hi,
> This set adds support for temporary router port which doesn't depend on
> the incoming queries. It can be refreshed by setting multicast_router to
> the same value (3). The first two patches are minor changes that prepare
> the code for the third which adds this new type of router port.
> In order to be able to dump its information the mdb router port format
> is changed and extended similar to how mdb entries format was done
> recently.
> The related iproute2 changes will be posted if this is accepted.
> 
> Thanks,
>  Nik
> 

Self-NAK, spotted a minor issue with the val setting that I'd missed.
Will post v2 after some more testing.

Thanks,
 Nik




Re: [net-next PATCH] GSO: Provide software checksum of tunneled UDP fragmentation offload

2016-02-26 Thread David Miller
From: Alexander Duyck 
Date: Wed, 24 Feb 2016 16:46:21 -0800

> On reviewing the code I realized that GRE and UDP tunnels could cause a
> kernel panic if we used GSO to segment a large UDP frame that was sent
> through the tunnel with an outer checksum and hardware offloads were not
> available.
> 
> In order to correct this we need to update the feature flags that are
> passed to the skb_segment function so that in the event of UDP
> fragmentation being requested for the inner header the segmentation
> function will correctly generate the checksum for the payload if we cannot
> segment the outer header.
> 
> Signed-off-by: Alexander Duyck 

Applied, thanks Alex.


Re: [PATCH net-next v3 0/2] net: l3mdev: Fix source address for unnumbered deployments

2016-02-26 Thread David Miller
From: David Ahern 
Date: Wed, 24 Feb 2016 11:47:01 -0800

> David Lamparter noted a use case where the source address selection fails
> to pick an address from a VRF interface - unnumbered interfaces. The use
> case has the VRF device as the VRF local loopback with addresses and
> interfaces enslaved without an address themselves. e.g,
 ...
> ping to the 10.2.2.2 through the L3 domain:
 ...
> picks up the wrong address -- the one from 'lo' not vrf0. And from tcpdump:
> 12:57:29.449128 IP 9.9.9.9 > 10.2.2.2: ICMP echo request, id 2491, seq 1, 
> length 64
> 
> This patch series changes address selection to only consider devices in
> the same L3 domain and to use the VRF device as the L3 domains loopback.
 ...

Series applied, thanks David.


[PATCH] net: ezchip: adapt driver to little endian architecture

2016-02-26 Thread Lada Trimasova
Since the ezchip network driver was written for the big endian EZChip
platform, it is necessary to add support for little endian architectures.

The first issue is that big endian machines pack bitfields from the
most significant byte to the least, unlike little endian ones.
So this patch provides a reversed order of the bitfields defined in the
header file when "CONFIG_CPU_BIG_ENDIAN" is not defined.

The second issue is that network byte order is big endian.
For example, data on Ethernet is transmitted with the most significant
octet (byte) first. So on a little endian architecture
it is important to swap the data byte order when we read it from a
register. For this we should use the function "be32_to_cpu", as we read
from the peripheral to the CPU.
Then, when we are going to write data to a register, we need to restore
the byte order using the function "cpu_to_be32", as we write from the
CPU to the peripheral.
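The read-side conversion described above can be sketched portably like this (hypothetical helper name; the kernel's real interfaces are be32_to_cpu/cpu_to_be32):

```c
#include <stdint.h>

/* Portable illustration of what be32_to_cpu does: reinterpret four
 * bytes that arrived in network (big-endian) order as a host-order
 * value. Hypothetical name for illustration only. */
static inline uint32_t sketch_be32_to_cpu(uint32_t raw)
{
	const uint8_t *b = (const uint8_t *)&raw;

	/* b[0] is the first byte in memory, i.e. the most significant
	 * octet of the big-endian register value. */
	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}
```

On a big endian CPU this is the identity; on little endian it swaps the bytes, which is exactly why the patch inserts the conversion on every FIFO read and write.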

The last little fix adds a space between a type and a pointer to comply
with the coding style.

Signed-off-by: Lada Trimasova 
Cc: Alexey Brodkin 
Cc: Noam Camus 
Cc: Tal Zilcer 
Cc: Arnd Bergmann 
---
 drivers/net/ethernet/ezchip/nps_enet.c | 15 --
 drivers/net/ethernet/ezchip/nps_enet.h | 99 ++
 2 files changed, 110 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/ezchip/nps_enet.c 
b/drivers/net/ethernet/ezchip/nps_enet.c
index b102668..43cf9d3 100644
--- a/drivers/net/ethernet/ezchip/nps_enet.c
+++ b/drivers/net/ethernet/ezchip/nps_enet.c
@@ -44,11 +44,15 @@ static void nps_enet_read_rx_fifo(struct net_device *ndev,
 
/* In case dst is not aligned we need an intermediate buffer */
if (dst_is_aligned)
-   for (i = 0; i < len; i++, reg++)
+   for (i = 0; i < len; i++, reg++) {
*reg = nps_enet_reg_get(priv, NPS_ENET_REG_RX_BUF);
+   /* In case of LE we need to swap bytes */
+   *reg = be32_to_cpu(*reg);
+   }
else { /* !dst_is_aligned */
for (i = 0; i < len; i++, reg++) {
u32 buf = nps_enet_reg_get(priv, NPS_ENET_REG_RX_BUF);
+   buf = be32_to_cpu(buf);
put_unaligned(buf, reg);
}
}
@@ -56,7 +60,8 @@ static void nps_enet_read_rx_fifo(struct net_device *ndev,
/* copy last bytes (if any) */
if (last) {
u32 buf = nps_enet_reg_get(priv, NPS_ENET_REG_RX_BUF);
-   memcpy((u8*)reg, , last);
+   buf = be32_to_cpu(buf);
+   memcpy((u8 *)reg, , last);
}
 }
 
@@ -368,11 +373,13 @@ static void nps_enet_send_frame(struct net_device *ndev,
/* In case src is not aligned we need an intermediate buffer */
if (src_is_aligned)
for (i = 0; i < len; i++, src++)
-   nps_enet_reg_set(priv, NPS_ENET_REG_TX_BUF, *src);
+   /* Restore endian swapped during register reading */
+   nps_enet_reg_set(priv, NPS_ENET_REG_TX_BUF,
+cpu_to_be32(*src));
else /* !src_is_aligned */
for (i = 0; i < len; i++, src++)
nps_enet_reg_set(priv, NPS_ENET_REG_TX_BUF,
-get_unaligned(src));
+cpu_to_be32(get_unaligned(src)));
 
/* Write the length of the Frame */
tx_ctrl.nt = length;
diff --git a/drivers/net/ethernet/ezchip/nps_enet.h 
b/drivers/net/ethernet/ezchip/nps_enet.h
index 6703674..2d068ad 100644
--- a/drivers/net/ethernet/ezchip/nps_enet.h
+++ b/drivers/net/ethernet/ezchip/nps_enet.h
@@ -52,12 +52,23 @@ struct nps_enet_tx_ctl {
 * nt: Length in bytes of Tx frame loaded to Tx buffer
 */
struct {
+#ifdef CONFIG_CPU_BIG_ENDIAN
u32
__reserved_1:16,
ct:1,
et:1,
__reserved_2:3,
nt:11;
+
+#else
+   u32
+   nt:11,
+   __reserved_2:3,
+   et:1,
+   ct:1,
+   __reserved_1:16;
+
+#endif
};
 
u32 value;
@@ -75,6 +86,7 @@ struct nps_enet_rx_ctl {
 * nr:  Length in bytes of Rx frame loaded by MAC to Rx buffer
 */
struct {
+#ifdef CONFIG_CPU_BIG_ENDIAN
u32
__reserved_1:16,
cr:1,
@@ -82,6 +94,15 @@ struct nps_enet_rx_ctl {
crc:1,
__reserved_2:2,
nr:11;
+#else
+   u32
+   nr:11,
+   __reserved_2:2,
+   

[PATCH net-next 2/3] bridge: mcast: do nothing if port's multicast_router is set to the same val

2016-02-26 Thread Nikolay Aleksandrov
This is needed for the upcoming temporary port router. There's no
point in going through the logic if the value is the same.

Signed-off-by: Nikolay Aleksandrov 
---
 net/bridge/br_multicast.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 015c47dd1364..496f808f9aa1 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -1853,7 +1853,10 @@ int br_multicast_set_port_router(struct net_bridge_port 
*p, unsigned long val)
int err = -EINVAL;
 
spin_lock(>multicast_lock);
-
+   if (p->multicast_router == val) {
+   err = 0;
+   goto unlock;
+   }
switch (val) {
case MDB_RTR_TYPE_DISABLED:
__del_port_router(p);
-- 
2.4.3



[PATCH net-next 1/3] bridge: mcast: use names for the different multicast_router types

2016-02-26 Thread Nikolay Aleksandrov
Using raw values makes the code difficult to extend and to
understand, so give them names and do explicit per-option manipulation
in br_multicast_set_port_router.

Signed-off-by: Nikolay Aleksandrov 
---
 include/uapi/linux/if_bridge.h |  7 +
 net/bridge/br_multicast.c  | 59 ++
 2 files changed, 38 insertions(+), 28 deletions(-)

diff --git a/include/uapi/linux/if_bridge.h b/include/uapi/linux/if_bridge.h
index 610932b477c4..f2764b739f38 100644
--- a/include/uapi/linux/if_bridge.h
+++ b/include/uapi/linux/if_bridge.h
@@ -194,6 +194,13 @@ enum {
 };
 #define MDBA_MDB_EATTR_MAX (__MDBA_MDB_EATTR_MAX - 1)
 
+/* multicast router types */
+enum {
+   MDB_RTR_TYPE_DISABLED,
+   MDB_RTR_TYPE_TEMP_QUERY,
+   MDB_RTR_TYPE_PERM,
+};
+
 enum {
MDBA_ROUTER_UNSPEC,
MDBA_ROUTER_PORT,
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 8b6e4249be1b..015c47dd1364 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -759,7 +759,7 @@ static void br_multicast_router_expired(unsigned long data)
struct net_bridge *br = port->br;
 
spin_lock(>multicast_lock);
-   if (port->multicast_router != 1 ||
+   if (port->multicast_router != MDB_RTR_TYPE_TEMP_QUERY ||
timer_pending(>multicast_router_timer) ||
hlist_unhashed(>rlist))
goto out;
@@ -912,7 +912,7 @@ static void br_ip6_multicast_port_query_expired(unsigned 
long data)
 
 void br_multicast_add_port(struct net_bridge_port *port)
 {
-   port->multicast_router = 1;
+   port->multicast_router = MDB_RTR_TYPE_TEMP_QUERY;
 
setup_timer(>multicast_router_timer, br_multicast_router_expired,
(unsigned long)port);
@@ -959,7 +959,8 @@ void br_multicast_enable_port(struct net_bridge_port *port)
 #if IS_ENABLED(CONFIG_IPV6)
br_multicast_enable(>ip6_own_query);
 #endif
-   if (port->multicast_router == 2 && hlist_unhashed(>rlist))
+   if (port->multicast_router == MDB_RTR_TYPE_PERM &&
+   hlist_unhashed(>rlist))
br_multicast_add_router(br, port);
 
 out:
@@ -1227,13 +1228,13 @@ static void br_multicast_mark_router(struct net_bridge 
*br,
unsigned long now = jiffies;
 
if (!port) {
-   if (br->multicast_router == 1)
+   if (br->multicast_router == MDB_RTR_TYPE_TEMP_QUERY)
mod_timer(>multicast_router_timer,
  now + br->multicast_querier_interval);
return;
}
 
-   if (port->multicast_router != 1)
+   if (port->multicast_router != MDB_RTR_TYPE_TEMP_QUERY)
return;
 
br_multicast_add_router(br, port);
@@ -1713,7 +1714,7 @@ void br_multicast_init(struct net_bridge *br)
br->hash_elasticity = 4;
br->hash_max = 512;
 
-   br->multicast_router = 1;
+   br->multicast_router = MDB_RTR_TYPE_TEMP_QUERY;
br->multicast_querier = 0;
br->multicast_query_use_ifaddr = 0;
br->multicast_last_member_count = 2;
@@ -1823,11 +1824,11 @@ int br_multicast_set_router(struct net_bridge *br, 
unsigned long val)
spin_lock_bh(>multicast_lock);
 
switch (val) {
-   case 0:
-   case 2:
+   case MDB_RTR_TYPE_DISABLED:
+   case MDB_RTR_TYPE_PERM:
del_timer(>multicast_router_timer);
/* fall through */
-   case 1:
+   case MDB_RTR_TYPE_TEMP_QUERY:
br->multicast_router = val;
err = 0;
break;
@@ -1838,6 +1839,14 @@ int br_multicast_set_router(struct net_bridge *br, 
unsigned long val)
return err;
 }
 
+static void __del_port_router(struct net_bridge_port *p)
+{
+   if (hlist_unhashed(>rlist))
+   return;
+   hlist_del_init_rcu(>rlist);
+   br_rtr_notify(p->br->dev, p, RTM_DELMDB);
+}
+
 int br_multicast_set_port_router(struct net_bridge_port *p, unsigned long val)
 {
struct net_bridge *br = p->br;
@@ -1846,29 +1855,23 @@ int br_multicast_set_port_router(struct net_bridge_port 
*p, unsigned long val)
spin_lock(>multicast_lock);
 
switch (val) {
-   case 0:
-   case 1:
-   case 2:
-   p->multicast_router = val;
-   err = 0;
-
-   if (val < 2 && !hlist_unhashed(>rlist)) {
-   hlist_del_init_rcu(>rlist);
-   br_rtr_notify(br->dev, p, RTM_DELMDB);
-   }
-
-   if (val == 1)
-   break;
-
+   case MDB_RTR_TYPE_DISABLED:
+   __del_port_router(p);
+   del_timer(>multicast_router_timer);
+   break;
+   case MDB_RTR_TYPE_TEMP_QUERY:
+   __del_port_router(p);
+   break;
+   case MDB_RTR_TYPE_PERM:
del_timer(>multicast_router_timer);
-
-   if (val == 0)
- 

[PATCH net-next 0/3] bridge: mcast: add support for temp router port

2016-02-26 Thread Nikolay Aleksandrov
Hi,
This set adds support for a temporary router port which doesn't depend
on the incoming queries. It can be refreshed by setting
multicast_router to the same value (3). The first two patches are
minor changes that prepare the code for the third, which adds this new
type of router port.
In order to be able to dump its information, the mdb router port
format is changed and extended, similar to how the mdb entry format
was extended recently.
The related iproute2 changes will be posted if this is accepted.

Thanks,
 Nik

Nikolay Aleksandrov (3):
  bridge: mcast: use names for the different multicast_router types
  bridge: mcast: do nothing if port's multicast_router is set to the
same val
  bridge: mcast: add support for temporary port router

 include/uapi/linux/if_bridge.h | 22 +++-
 net/bridge/br_mdb.c| 16 +++--
 net/bridge/br_multicast.c  | 80 +++---
 3 files changed, 86 insertions(+), 32 deletions(-)

-- 
2.4.3



[PATCH net-next 3/3] bridge: mcast: add support for temporary port router

2016-02-26 Thread Nikolay Aleksandrov
Add support for a temporary router port which doesn't depend on the
incoming query and allow for more port information to be dumped. For
that purpose we need to extend the MDBA_ROUTER_PORT attribute similar to
how it was done for the mdb entries recently. The new format is thus:
[MDBA_ROUTER_PORT] = { <- nested attribute
u32 ifindex <- router port ifindex for user-space compatibility
[MDBA_ROUTER_PATTR attributes]
}
This way it remains compatible with older users (they'll simply
retrieve the u32 at the beginning), while new users can parse the
remaining attributes.

Signed-off-by: Nikolay Aleksandrov 
---
 include/uapi/linux/if_bridge.h | 15 ++-
 net/bridge/br_mdb.c| 16 ++--
 net/bridge/br_multicast.c  | 20 ++--
 3 files changed, 46 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/if_bridge.h b/include/uapi/linux/if_bridge.h
index f2764b739f38..af98f6855b7e 100644
--- a/include/uapi/linux/if_bridge.h
+++ b/include/uapi/linux/if_bridge.h
@@ -161,7 +161,10 @@ enum {
  * }
  * }
  * [MDBA_ROUTER] = {
- *[MDBA_ROUTER_PORT]
+ *[MDBA_ROUTER_PORT] = {
+ *u32 ifindex
+ *[MDBA_ROUTER_PATTR attributes]
+ *}
  * }
  */
 enum {
@@ -199,6 +202,7 @@ enum {
MDB_RTR_TYPE_DISABLED,
MDB_RTR_TYPE_TEMP_QUERY,
MDB_RTR_TYPE_PERM,
+   MDB_RTR_TYPE_TEMP
 };
 
 enum {
@@ -208,6 +212,15 @@ enum {
 };
 #define MDBA_ROUTER_MAX (__MDBA_ROUTER_MAX - 1)
 
+/* router port attributes */
+enum {
+   MDBA_ROUTER_PATTR_UNSPEC,
+   MDBA_ROUTER_PATTR_TIMER,
+   MDBA_ROUTER_PATTR_TYPE,
+   __MDBA_ROUTER_PATTR_MAX
+};
+#define MDBA_ROUTER_PATTR_MAX (__MDBA_ROUTER_PATTR_MAX - 1)
+
 struct br_port_msg {
__u8  family;
__u32 ifindex;
diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index 73786e2fe065..253bc77eda3b 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -20,7 +20,7 @@ static int br_rports_fill_info(struct sk_buff *skb, struct 
netlink_callback *cb,
 {
struct net_bridge *br = netdev_priv(dev);
struct net_bridge_port *p;
-   struct nlattr *nest;
+   struct nlattr *nest, *port_nest;
 
if (!br->multicast_router || hlist_empty(>router_list))
return 0;
@@ -30,8 +30,20 @@ static int br_rports_fill_info(struct sk_buff *skb, struct 
netlink_callback *cb,
return -EMSGSIZE;
 
hlist_for_each_entry_rcu(p, >router_list, rlist) {
-   if (p && nla_put_u32(skb, MDBA_ROUTER_PORT, p->dev->ifindex))
+   if (!p)
+   continue;
+   port_nest = nla_nest_start(skb, MDBA_ROUTER_PORT);
+   if (!port_nest)
goto fail;
+   if (nla_put_nohdr(skb, sizeof(u32), >dev->ifindex) ||
+   nla_put_u32(skb, MDBA_ROUTER_PATTR_TIMER,
+   br_timer_value(>multicast_router_timer)) ||
+   nla_put_u8(skb, MDBA_ROUTER_PATTR_TYPE,
+  p->multicast_router)) {
+   nla_nest_cancel(skb, port_nest);
+   goto fail;
+   }
+   nla_nest_end(skb, port_nest);
}
 
nla_nest_end(skb, nest);
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 496f808f9aa1..0fb5061c2ad4 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -759,13 +759,17 @@ static void br_multicast_router_expired(unsigned long 
data)
struct net_bridge *br = port->br;
 
spin_lock(>multicast_lock);
-   if (port->multicast_router != MDB_RTR_TYPE_TEMP_QUERY ||
+   if (port->multicast_router == MDB_RTR_TYPE_DISABLED ||
+   port->multicast_router == MDB_RTR_TYPE_PERM ||
timer_pending(>multicast_router_timer) ||
hlist_unhashed(>rlist))
goto out;
 
hlist_del_init_rcu(>rlist);
br_rtr_notify(br->dev, port, RTM_DELMDB);
+   /* Don't allow timer refresh if the router expired */
+   if (port->multicast_router == MDB_RTR_TYPE_TEMP)
+   port->multicast_router = MDB_RTR_TYPE_TEMP_QUERY;
 
 out:
spin_unlock(>multicast_lock);
@@ -981,6 +985,9 @@ void br_multicast_disable_port(struct net_bridge_port *port)
if (!hlist_unhashed(>rlist)) {
hlist_del_init_rcu(>rlist);
br_rtr_notify(br->dev, port, RTM_DELMDB);
+   /* Don't allow timer refresh if disabling */
+   if (port->multicast_router == MDB_RTR_TYPE_TEMP)
+   port->multicast_router = MDB_RTR_TYPE_TEMP_QUERY;
}
del_timer(>multicast_router_timer);
del_timer(>ip4_own_query.timer);
@@ -1234,7 +1241,8 @@ static void br_multicast_mark_router(struct net_bridge 
*br,
return;
}
 
-   if (port->multicast_router != MDB_RTR_TYPE_TEMP_QUERY)
+   if (port->multicast_router == 

Re: [RFC/RFT] mac80211: implement fq_codel for software queuing

2016-02-26 Thread Michal Kazior
On 26 February 2016 at 17:48, Felix Fietkau  wrote:
[...]
>> diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
>> index af584f7cdd63..f42f898cb8b5 100644
>> --- a/net/mac80211/tx.c
>> +++ b/net/mac80211/tx.c
>> + [...]
>> +static void ieee80211_txq_enqueue(struct ieee80211_local *local,
>> +   struct txq_info *txqi,
>> +   struct sk_buff *skb)
>> +{
>> + struct ieee80211_fq *fq = >fq;
>> + struct ieee80211_hw *hw = >hw;
>> + struct txq_flow *flow;
>> + struct txq_flow *i;
>> + size_t idx = fq_hash(fq, skb);
>> +
>> + flow = >flows[idx];
>> +
>> + if (flow->txqi)
>> + flow = >flow;
> I'm not sure I understand this part correctly, but shouldn't that be:
> if (flow->txqi && flow->txqi != txqi)

You're correct. Good catch, thanks!


Michał


Re: [PATCH RFC 0/3] intermediate representation for jit and cls_u32 conversion

2016-02-26 Thread Alexei Starovoitov
On Fri, Feb 26, 2016 at 05:19:48PM +0100, Pablo Neira Ayuso wrote:
> 
> Good, I'm all for reaching those numbers, we can optimize the generic
> IR if this ever becomes the bottleneck.

The 'generic IR' got mentioned a hundred times in this thread,
but what was proposed is not generic. It doesn't even
fully fit u32. Here is why:

> This structure contains a protocol description (defined by struct
> net_ir_proto_desc) that is the initial node of the protocol graph that
> describes the protocol translation. This initial node starts from lower
> supported layer as base (eg. link-layer) then describing the upper
> protocols up to the transport protocols through the following structure:
> 
>  struct net_ir_proto_desc {
>enum net_ir_payload_bases   base;
>u32 protonum;
>int (*jit)(struct net_ir_jit_ctx 
> *ctx,
>   const struct 
> net_ir_expr *expr,
>   void *data);
>const struct net_ir_proto_desc  *protocols[];
>  };

The above representation has a builtin concept of protocols, whereas
u32 is protocol agnostic and fits this particular Intel NIC better.

>  struct net_ir_jit_desc {
>enum net_ir_payload_bases   base;
>const struct net_ir_proto_desc  *proto_desc;
>int (*verdict)(struct 
> net_ir_jit_ctx *ctx,
>   enum 
> net_ir_stmt_verdict verdict,
>   void *data);
>  };

imo the above is a misuse of the JIT abbreviation.
Typically JIT means compiling to machine code that can be executed
directly. Converting one representation to another is not really JIT.
Also IR stands for _intermediate_ representation. It is a transitional
state while a compiler converts a high level language into machine code.
In this case the proposed format is a protocol specific syntax tree,
so it probably should be called as such.

imo the HW guys should be free to pick whatever representation
we have today and offload it. If u32 is convenient and fits the HW
architecture better, the driver should take the u32 tree and
map it to HW (which is what was done already). When/if another
HW comes along with a similar HW architecture we can generalize
and reuse the u32->ixgbe code. And it should be done by developers
who actually have the HW and can test on it. Trying to 'generalize'
the u32->ixgbe code without a 2nd HW is not going to be successful.



Re: [PATCH V2 03/12] net-next: mediatek: add embedded switch driver (ESW)

2016-02-26 Thread Florian Fainelli
On 26/02/16 07:24, John Crispin wrote:
> 
> Hi,
> 
> would the series be ok if we just dropped those parts and then have a
> driver in the kernel that won't do much with the out of tree patches?
> 
> the problem here is that on one side people complain about vendors not
> sending code upstream. Once they start being good citizens and provide
> funding to send stuff upstream, the feedback tends to be very bad, as
> seen here.

I agree with David here, the feedback from Andrew is very constructive;
you just don't like the feedback you are being given, which is a
different thing. You can't always get a 12-patch series adding a new
driver accepted on the second try; look at all the recent submissions
that occurred, it took 5, 6, 7, maybe more submissions until things
were in a shape where they could be merged. If for your next submission
you get the feedback that switchdev/DSA is deprecated and something new
needs to be used, then I would agree that feedback is not acceptable; I
doubt this will be the case unless we wait another 10 years to get
these patches out.

> we are planning on doing a DSA driver but one step at a time. This
> kind of feedback will inevitably lead to vendors having second
> thoughts about upstream contributions.

If you are planning on a DSA driver, which sounds like a good plan, then
maybe drop the integrated switch parts for now, keep it as a local set
of patches for your testing, and just get the basic CPU Ethernet MAC
driver to work for data movement, so that part gets in, and later on,
when your DSA driver is ready, that's one less thing to take care of.
They ultimately are logically separated drivers if you use DSA, a
little less so if you use switchdev.
-- 
Florian


[PATCH 1/2] phy: micrel: Ensure interrupts are reenabled on resume

2016-02-26 Thread Alexandre Belloni
At least on the ksz8081, when getting back from power down, interrupts
are disabled. Ensure they are re-enabled if they were previously enabled.

This fixes resume, which has been failing on the Xplained boards from
Atmel since 321beec5047a (net: phy: Use interrupts when available in
NOLINK state).

Fixes: 321beec5047a (net: phy: Use interrupts when available in NOLINK state)
Signed-off-by: Alexandre Belloni 
---
 drivers/net/phy/micrel.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/micrel.c b/drivers/net/phy/micrel.c
index 03833dbfca67..a5e265b2bbfb 100644
--- a/drivers/net/phy/micrel.c
+++ b/drivers/net/phy/micrel.c
@@ -635,6 +635,21 @@ static void kszphy_get_stats(struct phy_device *phydev,
data[i] = kszphy_get_stat(phydev, i);
 }
 
+static int kszphy_resume(struct phy_device *phydev)
+{
+   int value;
+
+   mutex_lock(>lock);
+
+   value = phy_read(phydev, MII_BMCR);
+   phy_write(phydev, MII_BMCR, value & ~BMCR_PDOWN);
+
+   kszphy_config_intr(phydev);
+   mutex_unlock(>lock);
+
+   return 0;
+}
+
 static int kszphy_probe(struct phy_device *phydev)
 {
const struct kszphy_type *type = phydev->drv->driver_data;
@@ -844,7 +859,7 @@ static struct phy_driver ksphy_driver[] = {
.get_strings= kszphy_get_strings,
.get_stats  = kszphy_get_stats,
.suspend= genphy_suspend,
-   .resume = genphy_resume,
+   .resume = kszphy_resume,
 }, {
.phy_id = PHY_ID_KSZ8061,
.name   = "Micrel KSZ8061",
-- 
2.7.0



[PATCH net-next 3/9] net: dsa: mv88e6xxx: extract single FDB dump

2016-02-26 Thread Vivien Didelot
Move out the code which dumps a single FDB to its own function.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx.c | 79 ++---
 1 file changed, 46 insertions(+), 33 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index e9e9922..6329516 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -1895,6 +1895,47 @@ static int _mv88e6xxx_atu_getnext(struct dsa_switch *ds, 
u16 fid,
return 0;
 }
 
+static int _mv88e6xxx_port_fdb_dump_one(struct dsa_switch *ds, u16 fid, u16 
vid,
+   int port,
+   struct switchdev_obj_port_fdb *fdb,
+   int (*cb)(struct switchdev_obj *obj))
+{
+   struct mv88e6xxx_atu_entry addr = {
+   .mac = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+   };
+   int err;
+
+   err = _mv88e6xxx_atu_mac_write(ds, addr.mac);
+   if (err)
+   return err;
+
+   do {
+   err = _mv88e6xxx_atu_getnext(ds, fid, );
+   if (err)
+   break;
+
+   if (addr.state == GLOBAL_ATU_DATA_STATE_UNUSED)
+   break;
+
+   if (!addr.trunk && addr.portv_trunkid & BIT(port)) {
+   bool is_static = addr.state ==
+   (is_multicast_ether_addr(addr.mac) ?
+GLOBAL_ATU_DATA_STATE_MC_STATIC :
+GLOBAL_ATU_DATA_STATE_UC_STATIC);
+
+   fdb->vid = vid;
+   ether_addr_copy(fdb->addr, addr.mac);
+   fdb->ndm_state = is_static ? NUD_NOARP : NUD_REACHABLE;
+
+   err = cb(>obj);
+   if (err)
+   break;
+   }
+   } while (!is_broadcast_ether_addr(addr.mac));
+
+   return err;
+}
+
 int mv88e6xxx_port_fdb_dump(struct dsa_switch *ds, int port,
struct switchdev_obj_port_fdb *fdb,
int (*cb)(struct switchdev_obj *obj))
@@ -1907,51 +1948,23 @@ int mv88e6xxx_port_fdb_dump(struct dsa_switch *ds, int 
port,
 
mutex_lock(>smi_mutex);
 
+   /* Dump VLANs' Filtering Information Databases */
err = _mv88e6xxx_vtu_vid_write(ds, vlan.vid);
if (err)
goto unlock;
 
do {
-   struct mv88e6xxx_atu_entry addr = {
-   .mac = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
-   };
-
err = _mv88e6xxx_vtu_getnext(ds, );
if (err)
-   goto unlock;
+   break;
 
if (!vlan.valid)
break;
 
-   err = _mv88e6xxx_atu_mac_write(ds, addr.mac);
+   err = _mv88e6xxx_port_fdb_dump_one(ds, vlan.fid, vlan.vid, port,
+  fdb, cb);
if (err)
-   goto unlock;
-
-   do {
-   err = _mv88e6xxx_atu_getnext(ds, vlan.fid, );
-   if (err)
-   goto unlock;
-
-   if (addr.state == GLOBAL_ATU_DATA_STATE_UNUSED)
-   break;
-
-   if (!addr.trunk && addr.portv_trunkid & BIT(port)) {
-   bool is_static = addr.state ==
-   (is_multicast_ether_addr(addr.mac) ?
-GLOBAL_ATU_DATA_STATE_MC_STATIC :
-GLOBAL_ATU_DATA_STATE_UC_STATIC);
-
-   fdb->vid = vlan.vid;
-   ether_addr_copy(fdb->addr, addr.mac);
-   fdb->ndm_state = is_static ? NUD_NOARP :
-   NUD_REACHABLE;
-
-   err = cb(>obj);
-   if (err)
-   goto unlock;
-   }
-   } while (!is_broadcast_ether_addr(addr.mac));
-
+   break;
} while (vlan.vid < GLOBAL_VTU_VID_MASK);
 
 unlock:
-- 
2.7.1



[PATCH net-next 2/9] net: dsa: mv88e6xxx: extract single VLAN retrieval

2016-02-26 Thread Vivien Didelot
Rename _mv88e6xxx_vlan_init to _mv88e6xxx_vtu_new, eventually called
from a new _mv88e6xxx_vtu_get function, which abstracts the VTU GetNext
VID-1 trick used to retrieve a single entry.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx.c | 55 -
 1 file changed, 35 insertions(+), 20 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index d98dc63..e9e9922 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -1458,8 +1458,8 @@ loadpurge:
return _mv88e6xxx_vtu_cmd(ds, GLOBAL_VTU_OP_STU_LOAD_PURGE);
 }
 
-static int _mv88e6xxx_vlan_init(struct dsa_switch *ds, u16 vid,
-   struct mv88e6xxx_vtu_stu_entry *entry)
+static int _mv88e6xxx_vtu_new(struct dsa_switch *ds, u16 vid,
+ struct mv88e6xxx_vtu_stu_entry *entry)
 {
struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
struct mv88e6xxx_vtu_stu_entry vlan = {
@@ -1509,6 +1509,35 @@ static int _mv88e6xxx_vlan_init(struct dsa_switch *ds, 
u16 vid,
return 0;
 }
 
+static int _mv88e6xxx_vtu_get(struct dsa_switch *ds, u16 vid,
+ struct mv88e6xxx_vtu_stu_entry *entry, bool creat)
+{
+   int err;
+
+   if (!vid)
+   return -EINVAL;
+
+   err = _mv88e6xxx_vtu_vid_write(ds, vid - 1);
+   if (err)
+   return err;
+
+   err = _mv88e6xxx_vtu_getnext(ds, entry);
+   if (err)
+   return err;
+
+   if (entry->vid != vid || !entry->valid) {
+   if (!creat)
+   return -EOPNOTSUPP;
+   /* -ENOENT would've been more appropriate, but switchdev expects
+* -EOPNOTSUPP to inform bridge about an eventual software VLAN.
+*/
+
+   err = _mv88e6xxx_vtu_new(ds, vid, entry);
+   }
+
+   return err;
+}
+
 static int mv88e6xxx_port_check_hw_vlan(struct dsa_switch *ds, int port,
u16 vid_begin, u16 vid_end)
 {
@@ -1593,20 +1622,10 @@ static int _mv88e6xxx_port_vlan_add(struct dsa_switch 
*ds, int port, u16 vid,
struct mv88e6xxx_vtu_stu_entry vlan;
int err;
 
-   err = _mv88e6xxx_vtu_vid_write(ds, vid - 1);
-   if (err)
-   return err;
-
-   err = _mv88e6xxx_vtu_getnext(ds, );
+   err = _mv88e6xxx_vtu_get(ds, vid, , true);
if (err)
return err;
 
-   if (vlan.vid != vid || !vlan.valid) {
-   err = _mv88e6xxx_vlan_init(ds, vid, );
-   if (err)
-   return err;
-   }
-
vlan.data[port] = untagged ?
GLOBAL_VTU_DATA_MEMBER_TAG_UNTAGGED :
GLOBAL_VTU_DATA_MEMBER_TAG_TAGGED;
@@ -1647,16 +1666,12 @@ static int _mv88e6xxx_port_vlan_del(struct dsa_switch 
*ds, int port, u16 vid)
struct mv88e6xxx_vtu_stu_entry vlan;
int i, err;
 
-   err = _mv88e6xxx_vtu_vid_write(ds, vid - 1);
-   if (err)
-   return err;
-
-   err = _mv88e6xxx_vtu_getnext(ds, );
+   err = _mv88e6xxx_vtu_get(ds, vid, , false);
if (err)
return err;
 
-   if (vlan.vid != vid || !vlan.valid ||
-   vlan.data[port] == GLOBAL_VTU_DATA_MEMBER_TAG_NON_MEMBER)
+   /* Tell switchdev if this VLAN is handled in software */
+   if (vlan.data[port] == GLOBAL_VTU_DATA_MEMBER_TAG_NON_MEMBER)
return -EOPNOTSUPP;
 
vlan.data[port] = GLOBAL_VTU_DATA_MEMBER_TAG_NON_MEMBER;
-- 
2.7.1



[PATCH net-next 7/9] net: dsa: mv88e6xxx: restore VLANTable map control

2016-02-26 Thread Vivien Didelot
The In Chip Port Based VLAN Table contains bits used to restrict which
output ports this input port can send frames to.

With VLAN filtering enabled, these tables work in conjunction with
the VLAN Table Unit to allow egressing frames.

In order to remove the current dependency to BRIDGE_VLAN_FILTERING for
basic hardware bridging to work, it is necessary to restore a fine
control of each port's VLANTable, on setup and when a port joins or
leaves a bridge.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx.c | 54 +++--
 1 file changed, 47 insertions(+), 7 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index 0f16911..7f3036b 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -1087,12 +1087,32 @@ abort:
return ret;
 }
 
-static int _mv88e6xxx_port_vlan_map_set(struct dsa_switch *ds, int port,
-   u16 output_ports)
+static int _mv88e6xxx_port_based_vlan_map(struct dsa_switch *ds, int port)
 {
struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
+   struct net_device *bridge = ps->ports[port].bridge_dev;
const u16 mask = (1 << ps->num_ports) - 1;
+   u16 output_ports = 0;
int reg;
+   int i;
+
+   /* allow CPU port or DSA link(s) to send frames to every port */
+   if (dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port)) {
+   output_ports = mask;
+   } else {
+   for (i = 0; i < ps->num_ports; ++i) {
+   /* allow sending frames to every group member */
+   if (bridge && ps->ports[i].bridge_dev == bridge)
+   output_ports |= BIT(i);
+
+   /* allow sending frames to CPU port and DSA link(s) */
+   if (dsa_is_cpu_port(ds, i) || dsa_is_dsa_port(ds, i))
+   output_ports |= BIT(i);
+   }
+   }
+
+   /* prevent frames from going back out of the port they came in on */
+   output_ports &= ~BIT(port);
 
reg = _mv88e6xxx_reg_read(ds, REG_PORT(port), PORT_BASE_VLAN);
if (reg < 0)
@@ -2114,7 +2134,17 @@ int mv88e6xxx_port_bridge_join(struct dsa_switch *ds, 
int port,
if (err)
goto unlock;
 
+   /* Assign the bridge and remap each port's VLANTable */
ps->ports[port].bridge_dev = bridge;
+
+   for (i = 0; i < ps->num_ports; ++i) {
+   if (ps->ports[i].bridge_dev == bridge) {
+   err = _mv88e6xxx_port_based_vlan_map(ds, i);
+   if (err)
+   break;
+   }
+   }
+
 unlock:
mutex_unlock(>smi_mutex);
 
@@ -2124,8 +2154,9 @@ unlock:
 int mv88e6xxx_port_bridge_leave(struct dsa_switch *ds, int port)
 {
struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
+   struct net_device *bridge = ps->ports[port].bridge_dev;
u16 fid;
-   int err;
+   int i, err;
 
mutex_lock(>smi_mutex);
 
@@ -2138,7 +2169,17 @@ int mv88e6xxx_port_bridge_leave(struct dsa_switch *ds, 
int port)
if (err)
goto unlock;
 
+   /* Unassign the bridge and remap each port's VLANTable */
ps->ports[port].bridge_dev = NULL;
+
+   for (i = 0; i < ps->num_ports; ++i) {
+   if (i == port || ps->ports[i].bridge_dev == bridge) {
+   err = _mv88e6xxx_port_based_vlan_map(ds, i);
+   if (err)
+   break;
+   }
+   }
+
 unlock:
mutex_unlock(>smi_mutex);
 
@@ -2402,15 +2443,14 @@ static int mv88e6xxx_setup_port(struct dsa_switch *ds, 
int port)
goto abort;
 
/* Port based VLAN map: give each port its own address
-* database, and allow every port to egress frames on all other ports.
+* database, and allow bidirectional communication between the
+* CPU and DSA port(s), and the other ports.
 */
ret = _mv88e6xxx_port_fid_set(ds, port, port + 1);
if (ret)
goto abort;
 
-   reg = BIT(ps->num_ports) - 1; /* all ports */
-   reg &= ~BIT(port); /* except itself */
-   ret = _mv88e6xxx_port_vlan_map_set(ds, port, reg);
+   ret = _mv88e6xxx_port_based_vlan_map(ds, port);
if (ret)
goto abort;
 
-- 
2.7.1



[PATCH net-next 4/9] net: dsa: mv88e6xxx: assign dynamic FDB to VLANs

2016-02-26 Thread Vivien Didelot
Add a _mv88e6xxx_fid_new function which returns the lowest available
FID and flushes its database. Call it when preparing a new VTU entry.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx.c | 56 +
 drivers/net/dsa/mv88e6xxx.h |  2 ++
 2 files changed, 49 insertions(+), 9 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index 6329516..b4b2f05 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -1458,6 +1458,41 @@ loadpurge:
return _mv88e6xxx_vtu_cmd(ds, GLOBAL_VTU_OP_STU_LOAD_PURGE);
 }
 
+static int _mv88e6xxx_fid_new(struct dsa_switch *ds, u16 *fid)
+{
+   DECLARE_BITMAP(fid_bitmap, MV88E6XXX_N_FID);
+   struct mv88e6xxx_vtu_stu_entry vlan;
+   int err;
+
+   bitmap_zero(fid_bitmap, MV88E6XXX_N_FID);
+
+   /* Set every FID bit used by the VLAN entries */
+   err = _mv88e6xxx_vtu_vid_write(ds, GLOBAL_VTU_VID_MASK);
+   if (err)
+   return err;
+
+   do {
+   err = _mv88e6xxx_vtu_getnext(ds, );
+   if (err)
+   return err;
+
+   if (!vlan.valid)
+   break;
+
+   set_bit(vlan.fid, fid_bitmap);
+   } while (vlan.vid < GLOBAL_VTU_VID_MASK);
+
+   /* The reset value 0x000 is used to indicate that multiple address
+* databases are not needed. Return the next positive available.
+*/
+   *fid = find_next_zero_bit(fid_bitmap, MV88E6XXX_N_FID, 1);
+   if (unlikely(*fid == MV88E6XXX_N_FID))
+   return -ENOSPC;
+
+   /* Clear the database */
+   return _mv88e6xxx_atu_flush(ds, *fid, true);
+}
+
 static int _mv88e6xxx_vtu_new(struct dsa_switch *ds, u16 vid,
  struct mv88e6xxx_vtu_stu_entry *entry)
 {
@@ -1465,9 +1500,12 @@ static int _mv88e6xxx_vtu_new(struct dsa_switch *ds, u16 
vid,
struct mv88e6xxx_vtu_stu_entry vlan = {
.valid = true,
.vid = vid,
-   .fid = vid, /* We use one FID per VLAN */
};
-   int i;
+   int i, err;
+
+   err = _mv88e6xxx_fid_new(ds, );
+   if (err)
+   return err;
 
/* exclude all ports except the CPU and DSA ports */
for (i = 0; i < ps->num_ports; ++i)
@@ -1478,7 +1516,6 @@ static int _mv88e6xxx_vtu_new(struct dsa_switch *ds, u16 
vid,
if (mv88e6xxx_6097_family(ds) || mv88e6xxx_6165_family(ds) ||
mv88e6xxx_6351_family(ds) || mv88e6xxx_6352_family(ds)) {
struct mv88e6xxx_vtu_stu_entry vstp;
-   int err;
 
/* Adding a VTU entry requires a valid STU entry. As VSTP is not
 * implemented, only one STU entry is needed to cover all VTU
@@ -1498,11 +1535,6 @@ static int _mv88e6xxx_vtu_new(struct dsa_switch *ds, u16 
vid,
if (err)
return err;
}
-
-   /* Clear all MAC addresses from the new database */
-   err = _mv88e6xxx_atu_flush(ds, vlan.fid, true);
-   if (err)
-   return err;
}
 
*entry = vlan;
@@ -1789,8 +1821,14 @@ static int _mv88e6xxx_port_fdb_load(struct dsa_switch *ds, int port,
u8 state)
 {
struct mv88e6xxx_atu_entry entry = { 0 };
+   struct mv88e6xxx_vtu_stu_entry vlan;
+   int err;
+
+   err = _mv88e6xxx_vtu_get(ds, vid, &vlan, false);
+   if (err)
+   return err;
 
-   entry.fid = vid; /* We use one FID per VLAN */
+   entry.fid = vlan.fid;
entry.state = state;
ether_addr_copy(entry.mac, addr);
if (state != GLOBAL_ATU_DATA_STATE_UNUSED) {
diff --git a/drivers/net/dsa/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx.h
index 6a30bda..9df331e 100644
--- a/drivers/net/dsa/mv88e6xxx.h
+++ b/drivers/net/dsa/mv88e6xxx.h
@@ -355,6 +355,8 @@
 #define GLOBAL2_QOS_WEIGHT 0x1c
 #define GLOBAL2_MISC   0x1d
 
+#define MV88E6XXX_N_FID	4096
+
 struct mv88e6xxx_switch_id {
u16 id;
char *name;
-- 
2.7.1



[PATCH net-next 6/9] net: dsa: mv88e6xxx: assign dynamic FDB to bridges

2016-02-26 Thread Vivien Didelot
Give a new bridge a fresh FDB, assign it to its members, and restore a
fresh FDB to a port leaving a bridge.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx.c | 41 +++--
 1 file changed, 39 insertions(+), 2 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index 0f06488..0f16911 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -2093,19 +2093,56 @@ int mv88e6xxx_port_bridge_join(struct dsa_switch *ds, int port,
   struct net_device *bridge)
 {
struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
+   u16 fid;
+   int i, err;
+
+   mutex_lock(&ps->smi_mutex);
+
+   /* Get or create the bridge FID and assign it to the port */
+   for (i = 0; i < ps->num_ports; ++i)
+   if (ps->ports[i].bridge_dev == bridge)
+   break;
+
+   if (i < ps->num_ports)
+   err = _mv88e6xxx_port_fid_get(ds, i, &fid);
+   else
+   err = _mv88e6xxx_fid_new(ds, &fid);
+   if (err)
+   goto unlock;
+
+   err = _mv88e6xxx_port_fid_set(ds, port, fid);
+   if (err)
+   goto unlock;
 
ps->ports[port].bridge_dev = bridge;
+unlock:
+   mutex_unlock(&ps->smi_mutex);
 
-   return 0;
+   return err;
 }
 
 int mv88e6xxx_port_bridge_leave(struct dsa_switch *ds, int port)
 {
struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
+   u16 fid;
+   int err;
+
+   mutex_lock(&ps->smi_mutex);
+
+   /* Give the port a fresh Filtering Information Database */
+   err = _mv88e6xxx_fid_new(ds, &fid);
+   if (err)
+   goto unlock;
+
+   err = _mv88e6xxx_port_fid_set(ds, port, fid);
+   if (err)
+   goto unlock;
 
ps->ports[port].bridge_dev = NULL;
+unlock:
+   mutex_unlock(&ps->smi_mutex);
 
-   return 0;
+   return err;
 }
 
 static int mv88e6xxx_setup_port_default_vlan(struct dsa_switch *ds, int port)
-- 
2.7.1



[PATCH net-next 1/9] net: dsa: support VLAN filtering switchdev attr

2016-02-26 Thread Vivien Didelot
When a user explicitly requests VLAN filtering with something like:

# echo 1 > /sys/class/net//bridge/vlan_filtering

Switchdev propagates a SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING port
attribute.

Add support for it in the DSA layer with a new port_vlan_filtering
function to let drivers toggle 802.1Q filtering on user demand.

Signed-off-by: Vivien Didelot 
---
 include/net/dsa.h |  2 ++
 net/dsa/slave.c   | 21 +
 2 files changed, 23 insertions(+)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 3dd5486..26c0a3f 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -305,6 +305,8 @@ struct dsa_switch_driver {
/*
 * VLAN support
 */
+   int (*port_vlan_filtering)(struct dsa_switch *ds, int port,
+  bool vlan_filtering);
int (*port_vlan_prepare)(struct dsa_switch *ds, int port,
 const struct switchdev_obj_port_vlan *vlan,
 struct switchdev_trans *trans);
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index cde2923..27bf03d 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -317,6 +317,24 @@ static int dsa_slave_stp_update(struct net_device *dev, u8 state)
return ret;
 }
 
+static int dsa_slave_vlan_filtering(struct net_device *dev,
+   const struct switchdev_attr *attr,
+   struct switchdev_trans *trans)
+{
+   struct dsa_slave_priv *p = netdev_priv(dev);
+   struct dsa_switch *ds = p->parent;
+
+   /* bridge skips -EOPNOTSUPP, so skip the prepare phase */
+   if (switchdev_trans_ph_prepare(trans))
+   return 0;
+
+   if (ds->drv->port_vlan_filtering)
+   return ds->drv->port_vlan_filtering(ds, p->port,
+   attr->u.vlan_filtering);
+
+   return 0;
+}
+
 static int dsa_slave_port_attr_set(struct net_device *dev,
   const struct switchdev_attr *attr,
   struct switchdev_trans *trans)
@@ -333,6 +351,9 @@ static int dsa_slave_port_attr_set(struct net_device *dev,
ret = ds->drv->port_stp_update(ds, p->port,
   attr->u.stp_state);
break;
+   case SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING:
+   ret = dsa_slave_vlan_filtering(dev, attr, trans);
+   break;
default:
ret = -EOPNOTSUPP;
break;
-- 
2.7.1



[PATCH net-next 0/9] net: dsa: mv88e6xxx: implement VLAN filtering

2016-02-26 Thread Vivien Didelot
This patchset fixes hardware bridging for non-802.1Q-aware systems.

The mv88e6xxx DSA driver currently depends on CONFIG_VLAN_8021Q and
CONFIG_BRIDGE_VLAN_FILTERING enabled for correct bridging between switch ports.

Patch 1/9 adds support for the VLAN filtering switchdev attribute in DSA.

Patches 2/9 and 3/9 add helper functions for the following patches.

Patches 4/9 to 6/9 assign dynamic address databases to VLANs, ports, and
bridge groups (the lowest available FID is cleared and assigned), and thus
restore support for per-port FDB operations.

Patches 7/9 to 9/9 refine port isolation and set up 802.1Q on user demand.

With this patchset, ports get correctly bridged and the driver behaves as
expected, with or without 802.1Q support.

With CONFIG_VLAN_8021Q enabled, setting a default PVID to the bridge correctly
propagates the corresponding VLAN, in addition to the hardware bridging:

# echo 42 > /sys/class/net//bridge/default_pvid

But considering CONFIG_BRIDGE_VLAN_FILTERING enabled, the hardware VLAN
filtering is enabled on all bridge members only when the user requests it:

# echo 1 > /sys/class/net//bridge/vlan_filtering

Vivien Didelot (9):
  net: dsa: support VLAN filtering switchdev attr
  net: dsa: mv88e6xxx: extract single VLAN retrieval
  net: dsa: mv88e6xxx: extract single FDB dump
  net: dsa: mv88e6xxx: assign dynamic FDB to VLANs
  net: dsa: mv88e6xxx: assign default FDB to ports
  net: dsa: mv88e6xxx: assign dynamic FDB to bridges
  net: dsa: mv88e6xxx: restore VLANTable map control
  net: dsa: mv88e6xxx: remove reserved VLANs
  net: dsa: mv88e6xxx: support VLAN filtering

 drivers/net/dsa/mv88e6171.c |   1 +
 drivers/net/dsa/mv88e6352.c |   1 +
 drivers/net/dsa/mv88e6xxx.c | 441 ++--
 drivers/net/dsa/mv88e6xxx.h |   6 +
 include/net/dsa.h   |   2 +
 net/dsa/slave.c |  21 +++
 6 files changed, 370 insertions(+), 102 deletions(-)

-- 
2.7.1



[PATCH net-next 8/9] net: dsa: mv88e6xxx: remove reserved VLANs

2016-02-26 Thread Vivien Didelot
Now that port isolation is correctly configured when joining or leaving
a bridge, there is no need to rely on reserved VLANs to isolate
unbridged ports anymore. Thus remove them, and disable 802.1Q on setup.

This restores the expected behavior of hardware bridging for systems
without 802.1Q or VLAN filtering enabled.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx.c | 33 +++--
 1 file changed, 3 insertions(+), 30 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index 7f3036b..27a19dc 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -1718,10 +1718,6 @@ int mv88e6xxx_port_vlan_prepare(struct dsa_switch *ds, int port,
 {
int err;
 
-   /* We reserve a few VLANs to isolate unbridged ports */
-   if (vlan->vid_end >= 4000)
-   return -EOPNOTSUPP;
-
/* If the requested port doesn't belong to the same bridge as the VLAN
 * members, do not support it (yet) and fallback to software VLAN.
 */
@@ -1819,7 +1815,6 @@ int mv88e6xxx_port_vlan_del(struct dsa_switch *ds, int port,
const struct switchdev_obj_port_vlan *vlan)
 {
struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
-   const u16 defpvid = 4000 + ds->index * DSA_MAX_PORTS + port;
u16 pvid, vid;
int err = 0;
 
@@ -1835,8 +1830,7 @@ int mv88e6xxx_port_vlan_del(struct dsa_switch *ds, int port,
goto unlock;
 
if (vid == pvid) {
-   /* restore reserved VLAN ID */
-   err = _mv88e6xxx_port_pvid_set(ds, port, defpvid);
+   err = _mv88e6xxx_port_pvid_set(ds, port, 0);
if (err)
goto unlock;
}
@@ -2186,20 +2180,6 @@ unlock:
return err;
 }
 
-static int mv88e6xxx_setup_port_default_vlan(struct dsa_switch *ds, int port)
-{
-   struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
-   const u16 pvid = 4000 + ds->index * DSA_MAX_PORTS + port;
-   int err;
-
-   mutex_lock(>smi_mutex);
-   err = _mv88e6xxx_port_vlan_add(ds, port, pvid, true);
-   if (!err)
-   err = _mv88e6xxx_port_pvid_set(ds, port, pvid);
-   mutex_unlock(>smi_mutex);
-   return err;
-}
-
 static void mv88e6xxx_bridge_work(struct work_struct *work)
 {
struct mv88e6xxx_priv_state *ps;
@@ -2320,7 +2300,7 @@ static int mv88e6xxx_setup_port(struct dsa_switch *ds, int port)
}
 
/* Port Control 2: don't force a good FCS, set the maximum frame size to
-* 10240 bytes, enable secure 802.1q tags, don't discard tagged or
+* 10240 bytes, disable 802.1q tags checking, don't discard tagged or
 * untagged frames on this port, do a destination address lookup on all
 * received packets as usual, disable ARP mirroring and don't send a
 * copy of all transmitted/received frames on this port to the CPU.
@@ -2345,7 +2325,7 @@ static int mv88e6xxx_setup_port(struct dsa_switch *ds, int port)
reg |= PORT_CONTROL_2_FORWARD_UNKNOWN;
}
 
-   reg |= PORT_CONTROL_2_8021Q_SECURE;
+   reg |= PORT_CONTROL_2_8021Q_DISABLED;
 
if (reg) {
ret = _mv88e6xxx_reg_write(ds, REG_PORT(port),
@@ -2474,13 +2454,6 @@ int mv88e6xxx_setup_ports(struct dsa_switch *ds)
ret = mv88e6xxx_setup_port(ds, i);
if (ret < 0)
return ret;
-
-   if (dsa_is_cpu_port(ds, i) || dsa_is_dsa_port(ds, i))
-   continue;
-
-   ret = mv88e6xxx_setup_port_default_vlan(ds, i);
-   if (ret < 0)
-   return ret;
}
return 0;
 }
-- 
2.7.1



[PATCH 0/2] phy: micrel: fix issues with interrupt on atmel boards

2016-02-26 Thread Alexandre Belloni
Hi,

Since the phy is not polled anymore, there were issues getting a link on the
sama5d* xplained boards.

I'm not too sure about where those fixes should go and I'm wondering whether the
first one should be made generic.

For the second one, I found the PHY_HAS_MAGICANEG flag that is not used and I'm
wondering whether this is related to that kind of issue. I had a quick look at
the history and couldn't find its use.

Alexandre Belloni (2):
  phy: micrel: Ensure interrupts are reenabled on resume
  phy: micrel: Disable auto negotiation on startup

 drivers/net/phy/micrel.c | 28 +++-
 1 file changed, 27 insertions(+), 1 deletion(-)

-- 
2.7.0



[PATCH net-next 5/9] net: dsa: mv88e6xxx: assign default FDB to ports

2016-02-26 Thread Vivien Didelot
Restore per-port FDB. Assign them on setup, allow adding and deleting
addresses into them, and dump them.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx.c | 96 +
 drivers/net/dsa/mv88e6xxx.h |  2 +
 2 files changed, 91 insertions(+), 7 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index b4b2f05..0f06488 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -1458,14 +1458,82 @@ loadpurge:
return _mv88e6xxx_vtu_cmd(ds, GLOBAL_VTU_OP_STU_LOAD_PURGE);
 }
 
+static int _mv88e6xxx_port_fid(struct dsa_switch *ds, int port, u16 *new,
+  u16 *old)
+{
+   u16 fid;
+   int ret;
+
+   /* Port's default FID bits 3:0 are located in reg 0x06, offset 12 */
+   ret = _mv88e6xxx_reg_read(ds, REG_PORT(port), PORT_BASE_VLAN);
+   if (ret < 0)
+   return ret;
+
+   fid = (ret & PORT_BASE_VLAN_FID_3_0_MASK) >> 12;
+
+   if (new) {
+   ret &= ~PORT_BASE_VLAN_FID_3_0_MASK;
+   ret |= (*new << 12) & PORT_BASE_VLAN_FID_3_0_MASK;
+
+   ret = _mv88e6xxx_reg_write(ds, REG_PORT(port), PORT_BASE_VLAN,
+  ret);
+   if (ret < 0)
+   return ret;
+   }
+
+   /* Port's default FID bits 11:4 are located in reg 0x05, offset 0 */
+   ret = _mv88e6xxx_reg_read(ds, REG_PORT(port), PORT_CONTROL_1);
+   if (ret < 0)
+   return ret;
+
+   fid |= (ret & PORT_CONTROL_1_FID_11_4_MASK) << 4;
+
+   if (new) {
+   ret &= ~PORT_CONTROL_1_FID_11_4_MASK;
+   ret |= (*new >> 4) & PORT_CONTROL_1_FID_11_4_MASK;
+
+   ret = _mv88e6xxx_reg_write(ds, REG_PORT(port), PORT_CONTROL_1,
+  ret);
+   if (ret < 0)
+   return ret;
+
+   netdev_dbg(ds->ports[port], "FID %d (was %d)\n", *new, fid);
+   }
+
+   if (old)
+   *old = fid;
+
+   return 0;
+}
+
+static int _mv88e6xxx_port_fid_get(struct dsa_switch *ds, int port, u16 *fid)
+{
+   return _mv88e6xxx_port_fid(ds, port, NULL, fid);
+}
+
+static int _mv88e6xxx_port_fid_set(struct dsa_switch *ds, int port, u16 fid)
+{
+   return _mv88e6xxx_port_fid(ds, port, &fid, NULL);
+}
+
 static int _mv88e6xxx_fid_new(struct dsa_switch *ds, u16 *fid)
 {
+   struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
DECLARE_BITMAP(fid_bitmap, MV88E6XXX_N_FID);
struct mv88e6xxx_vtu_stu_entry vlan;
-   int err;
+   int i, err;
 
bitmap_zero(fid_bitmap, MV88E6XXX_N_FID);
 
+   /* Set every FID bit used by the (un)bridged ports */
+   for (i = 0; i < ps->num_ports; ++i) {
+   err = _mv88e6xxx_port_fid_get(ds, i, fid);
+   if (err)
+   return err;
+
+   set_bit(*fid, fid_bitmap);
+   }
+
/* Set every FID bit used by the VLAN entries */
err = _mv88e6xxx_vtu_vid_write(ds, GLOBAL_VTU_VID_MASK);
if (err)
@@ -1824,7 +1892,11 @@ static int _mv88e6xxx_port_fdb_load(struct dsa_switch *ds, int port,
struct mv88e6xxx_vtu_stu_entry vlan;
int err;
 
-   err = _mv88e6xxx_vtu_get(ds, vid, &vlan, false);
+   /* Null VLAN ID corresponds to the port private database */
+   if (vid == 0)
+   err = _mv88e6xxx_port_fid_get(ds, port, );
+   else
+   err = _mv88e6xxx_vtu_get(ds, vid, &vlan, false);
if (err)
return err;
 
@@ -1843,10 +1915,6 @@ int mv88e6xxx_port_fdb_prepare(struct dsa_switch *ds, int port,
   const struct switchdev_obj_port_fdb *fdb,
   struct switchdev_trans *trans)
 {
-   /* We don't use per-port FDB */
-   if (fdb->vid == 0)
-   return -EOPNOTSUPP;
-
/* We don't need any dynamic resource from the kernel (yet),
 * so skip the prepare phase.
 */
@@ -1982,10 +2050,20 @@ int mv88e6xxx_port_fdb_dump(struct dsa_switch *ds, int port,
struct mv88e6xxx_vtu_stu_entry vlan = {
.vid = GLOBAL_VTU_VID_MASK, /* all ones */
};
+   u16 fid;
int err;
 
	mutex_lock(&ps->smi_mutex);
 
+   /* Dump port's default Filtering Information Database (VLAN ID 0) */
+   err = _mv88e6xxx_port_fid_get(ds, port, &fid);
+   if (err)
+   goto unlock;
+
+   err = _mv88e6xxx_port_fdb_dump_one(ds, fid, 0, port, fdb, cb);
+   if (err)
+   goto unlock;
+
/* Dump VLANs' Filtering Information Databases */
err = _mv88e6xxx_vtu_vid_write(ds, vlan.vid);
if (err)
@@ -2286,9 +2364,13 @@ static int mv88e6xxx_setup_port(struct dsa_switch *ds, int port)
if (ret)
goto abort;
 
-   /* Port based VLAN map: do not give each port its own address
+ 

[PATCH net-next 9/9] net: dsa: mv88e6xxx: support VLAN filtering

2016-02-26 Thread Vivien Didelot
Implement port_vlan_filtering in the driver to toggle the related port
802.1Q mode between DISABLED and SECURE, on user request.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6171.c |  1 +
 drivers/net/dsa/mv88e6352.c |  1 +
 drivers/net/dsa/mv88e6xxx.c | 39 +++
 drivers/net/dsa/mv88e6xxx.h |  2 ++
 4 files changed, 43 insertions(+)

diff --git a/drivers/net/dsa/mv88e6171.c b/drivers/net/dsa/mv88e6171.c
index dd1ebaf..d72ccbd 100644
--- a/drivers/net/dsa/mv88e6171.c
+++ b/drivers/net/dsa/mv88e6171.c
@@ -106,6 +106,7 @@ struct dsa_switch_driver mv88e6171_switch_driver = {
.port_join_bridge   = mv88e6xxx_port_bridge_join,
.port_leave_bridge  = mv88e6xxx_port_bridge_leave,
.port_stp_update= mv88e6xxx_port_stp_update,
+   .port_vlan_filtering= mv88e6xxx_port_vlan_filtering,
.port_vlan_prepare  = mv88e6xxx_port_vlan_prepare,
.port_vlan_add  = mv88e6xxx_port_vlan_add,
.port_vlan_del  = mv88e6xxx_port_vlan_del,
diff --git a/drivers/net/dsa/mv88e6352.c b/drivers/net/dsa/mv88e6352.c
index bbca36a..a41fa50 100644
--- a/drivers/net/dsa/mv88e6352.c
+++ b/drivers/net/dsa/mv88e6352.c
@@ -327,6 +327,7 @@ struct dsa_switch_driver mv88e6352_switch_driver = {
.port_join_bridge   = mv88e6xxx_port_bridge_join,
.port_leave_bridge  = mv88e6xxx_port_bridge_leave,
.port_stp_update= mv88e6xxx_port_stp_update,
+   .port_vlan_filtering= mv88e6xxx_port_vlan_filtering,
.port_vlan_prepare  = mv88e6xxx_port_vlan_prepare,
.port_vlan_add  = mv88e6xxx_port_vlan_add,
.port_vlan_del  = mv88e6xxx_port_vlan_del,
diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index 27a19dc..d11c9d5 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -1712,6 +1712,45 @@ unlock:
return err;
 }
 
+static const char * const mv88e6xxx_port_8021q_mode_names[] = {
+   [PORT_CONTROL_2_8021Q_DISABLED] = "Disabled",
+   [PORT_CONTROL_2_8021Q_FALLBACK] = "Fallback",
+   [PORT_CONTROL_2_8021Q_CHECK] = "Check",
+   [PORT_CONTROL_2_8021Q_SECURE] = "Secure",
+};
+
+int mv88e6xxx_port_vlan_filtering(struct dsa_switch *ds, int port,
+ bool vlan_filtering)
+{
+   struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
+   u16 old, new = vlan_filtering ? PORT_CONTROL_2_8021Q_SECURE :
+   PORT_CONTROL_2_8021Q_DISABLED;
+   int ret;
+
+   mutex_lock(&ps->smi_mutex);
+
+   ret = _mv88e6xxx_reg_read(ds, REG_PORT(port), PORT_CONTROL_2);
+   if (ret < 0)
+   goto unlock;
+
+   old = ret & PORT_CONTROL_2_8021Q_MASK;
+
+   ret &= ~PORT_CONTROL_2_8021Q_MASK;
+   ret |= new & PORT_CONTROL_2_8021Q_MASK;
+
+   ret = _mv88e6xxx_reg_write(ds, REG_PORT(port), PORT_CONTROL_2, ret);
+   if (ret < 0)
+   goto unlock;
+
+   netdev_dbg(ds->ports[port], "802.1Q Mode: %s (was %s)\n",
+  mv88e6xxx_port_8021q_mode_names[new],
+  mv88e6xxx_port_8021q_mode_names[old]);
+unlock:
+   mutex_unlock(&ps->smi_mutex);
+
+   return ret;
+}
+
 int mv88e6xxx_port_vlan_prepare(struct dsa_switch *ds, int port,
const struct switchdev_obj_port_vlan *vlan,
struct switchdev_trans *trans)
diff --git a/drivers/net/dsa/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx.h
index 85a4166..d7b088d 100644
--- a/drivers/net/dsa/mv88e6xxx.h
+++ b/drivers/net/dsa/mv88e6xxx.h
@@ -490,6 +490,8 @@ int mv88e6xxx_port_bridge_join(struct dsa_switch *ds, int 
port,
   struct net_device *bridge);
 int mv88e6xxx_port_bridge_leave(struct dsa_switch *ds, int port);
 int mv88e6xxx_port_stp_update(struct dsa_switch *ds, int port, u8 state);
+int mv88e6xxx_port_vlan_filtering(struct dsa_switch *ds, int port,
+ bool vlan_filtering);
 int mv88e6xxx_port_vlan_prepare(struct dsa_switch *ds, int port,
const struct switchdev_obj_port_vlan *vlan,
struct switchdev_trans *trans);
-- 
2.7.1



[PATCH 2/2] phy: micrel: Disable auto negotiation on startup

2016-02-26 Thread Alexandre Belloni
Disable auto negotiation on init to properly detect an already plugged
cable at boot.

At boot, when the phy is started, it is in the PHY_UP state.
However, if a cable is plugged at boot, because auto negotiation is already
enabled at the time we get the first interrupt, the phy is already running.
But the state machine then switches from PHY_UP to PHY_AN and calls
phy_start_aneg(). phy_start_aneg() will not do anything because aneg is
already enabled on the phy. It will then wait for a interrupt before going
further. This interrupt will never happen unless the cable is unplugged and
then replugged.

It was working properly before 321beec5047a (net: phy: Use interrupts when
available in NOLINK state) because switching to NOLINK meant starting
polling the phy, even if IRQ were enabled.

Fixes: 321beec5047a (net: phy: Use interrupts when available in NOLINK state)
Signed-off-by: Alexandre Belloni 
---
 drivers/net/phy/micrel.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/drivers/net/phy/micrel.c b/drivers/net/phy/micrel.c
index a5e265b2bbfb..dc85f7095e51 100644
--- a/drivers/net/phy/micrel.c
+++ b/drivers/net/phy/micrel.c
@@ -297,6 +297,17 @@ static int kszphy_config_init(struct phy_device *phydev)
if (priv->led_mode >= 0)
kszphy_setup_led(phydev, type->led_mode_reg, priv->led_mode);
 
+   if (phy_interrupt_is_valid(phydev)) {
+   int ctl = phy_read(phydev, MII_BMCR);
+
+   if (ctl < 0)
+   return ctl;
+
+   ret = phy_write(phydev, MII_BMCR, ctl & ~BMCR_ANENABLE);
+   if (ret < 0)
+   return ret;
+   }
+
return 0;
 }
 
-- 
2.7.0



Re: [PATCHv2 08/10] rfkill: Use switch to demux userspace operations

2016-02-26 Thread Jouni Malinen
On Mon, Feb 22, 2016 at 11:36:39AM -0500, João Paulo Rechi Vita wrote:
> Using a switch to handle different ev.op values in rfkill_fop_write()
> makes the code easier to extend, as out-of-range values can always be
> handled by the default case.

This breaks rfkill.. There are automated test scripts for testing this
area (and most of Wi-Fi for that matter). It would be nice if these were
used for changes before they get contributed upstream..

http://buildbot.w1.fi/hwsim/

This specific commit broke all the rfkill_* test cases because of
following:

> diff --git a/net/rfkill/core.c b/net/rfkill/core.c
> @@ -1199,29 +1200,32 @@ static ssize_t rfkill_fop_write(struct file *file, const char __user *buf,
> - list_for_each_entry(rfkill, &rfkill_list, node) {
> - if (rfkill->idx != ev.idx && ev.op != RFKILL_OP_CHANGE_ALL)
> - continue;
> -
> - if (rfkill->type != ev.type && ev.type != RFKILL_TYPE_ALL)
> - continue;

Note that RFKILL_TYPE_ALL here..

> + list_for_each_entry(rfkill, &rfkill_list, node)
> + if (rfkill->type == ev.type ||
> + ev.type == RFKILL_TYPE_ALL)
> + rfkill_set_block(rfkill, ev.soft);

It was included for RFKILL_OP_CHANGE_ALL.

> + case RFKILL_OP_CHANGE:
> + list_for_each_entry(rfkill, &rfkill_list, node)
> + if (rfkill->idx == ev.idx && rfkill->type == ev.type)
> + rfkill_set_block(rfkill, ev.soft);

but not for RFKILL_OP_CHANGE..

This needs following to work:


diff --git a/net/rfkill/core.c b/net/rfkill/core.c
index 59ff92d..c4bbd19 100644
--- a/net/rfkill/core.c
+++ b/net/rfkill/core.c
@@ -1239,7 +1239,9 @@ static ssize_t rfkill_fop_write(struct file *file, const char __user *buf,
break;
case RFKILL_OP_CHANGE:
		list_for_each_entry(rfkill, &rfkill_list, node)
-   if (rfkill->idx == ev.idx && rfkill->type == ev.type)
+   if (rfkill->idx == ev.idx &&
+   (rfkill->type == ev.type ||
+ev.type == RFKILL_TYPE_ALL))
rfkill_set_block(rfkill, ev.soft);
ret = 0;
break;
 
-- 
Jouni Malinen                                        PGP id EFC895FA


Re: [PATCH v3 1/4] net: ethernet: dwmac: add Ethernet glue logic for stm32 chip

2016-02-26 Thread Joachim Eastwood
Hi Alexandre,

When people comment on your patch please CC them on the next version.

On 26 February 2016 at 11:51, Alexandre TORGUE
 wrote:
> stm324xx family chips support Synopsys MAC 3.510 IP.
> This patch adds settings for logical glue logic:
> -clocks
> -mode selection MII or RMII.
>
> Signed-off-by: Alexandre TORGUE 

Driver looks good now, thanks.

Reviewed-by: Joachim Eastwood 


regards,
Joachim Eastwood


Re: [PATCH v3 2/4] Documentation: Bindings: Add STM32 DWMAC glue

2016-02-26 Thread Joachim Eastwood
Hi Alexandre,

On 26 February 2016 at 11:51, Alexandre TORGUE
 wrote:
> Signed-off-by: Alexandre TORGUE 
>
> diff --git a/Documentation/devicetree/bindings/net/stm32-dwmac.txt b/Documentation/devicetree/bindings/net/stm32-dwmac.txt
> new file mode 100644
> index 000..67fceda
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/stm32-dwmac.txt
> @@ -0,0 +1,40 @@
> +STMicroelectronics STM32 / MCU DWMAC glue layer controller
> +
> +This file documents platform glue layer for stmmac.
> +Please see stmmac.txt for the other unchanged properties.
> +
> +The device node has following properties.
> +
> +Required properties:
> +- compatible:  Should be "st,stm32-dwmac" to select glue and
> +  "snps,dwmac-3.50a" to select IP version.
> +- clocks: Should contain the MAC main clock
> +- clock-names: Should contain the clock names "stmmaceth".
> +- st,syscon : Should be phandle/offset pair. The phandle to the syscon node
> +  which encompasses the glue register, and the offset of the control
> +  register.
> +
> +Optional properties:
> +- clocks: Could contain:
> +   - the tx clock,
> +   - the rx clock
> +- clock-names: Could contain the clock names "tx-clk", "rx-clk"
> +
> +Example:
> +
> +   ethernet0: dwmac@40028000 {
> +   compatible = "st,stm32-dwmac", "snps,dwmac-3.50a";
> +   status = "disabled";
> +   reg = <0x40028000 0x8000>;
> +   reg-names = "stmmaceth";
> +   interrupts = <0 61 0>, <0 62 0>;
> +   interrupt-names = "macirq", "eth_wake_irq";
> +   clock-names = "stmmaceth", "tx-clk", "rx-clk";
> +   clocks = < 0 25>, < 0 26>, < 0 27>;
> +   st,syscon = < 0x4>;
> +   snps,pbl = <8>;
> +   snps,mixed-burst;
> +   dma-ranges;
> +   };

Looks just like any other dwmac-driver binding so:

Acked-by: Joachim Eastwood 


regards,
Joachim Eastwood


[PATCH net] ppp: lock ppp->flags in ppp_read() and ppp_poll()

2016-02-26 Thread Guillaume Nault
ppp_read() and ppp_poll() can be called concurrently with ppp_ioctl().
In this case, ppp_ioctl() might call ppp_ccp_closed(), which may update
ppp->flags while ppp_read() or ppp_poll() is reading it.
The update done by ppp_ccp_closed() isn't atomic due to the bit mask
operation ('ppp->flags &= ~(SC_CCP_OPEN | SC_CCP_UP)'), so concurrent
readers might get transient values.
Reading incorrect ppp->flags may disturb the 'ppp->flags & SC_LOOP_TRAFFIC'
test in ppp_read() and ppp_poll(), which in turn can lead to improper
decision on whether the PPP unit file is ready for reading or not.

Since ppp_ccp_closed() is protected by the Rx and Tx locks (with
ppp_lock()), taking the Rx lock is enough for ppp_read() and ppp_poll()
to guarantee that ppp_ccp_closed() won't update ppp->flags
concurrently.

The same reasoning applies to ppp->n_channels. The 'n_channels' field
can also be written to concurrently by ppp_ioctl() (through
ppp_connect_channel() or ppp_disconnect_channel()). These writes aren't
atomic (simple increment/decrement), but are protected by both the Rx
and Tx locks (like in the ppp->flags case). So holding the Rx lock
before reading ppp->n_channels also prevents concurrent writes.

Signed-off-by: Guillaume Nault 
---

This was patch #2 of the 'ppp: fix locking issues related to ppp_ioctl()'
series. I haven't kept the extra locking of ppp->flags in
ppp_ioctl(PPPIOCGFLAGS), which was added in the original series,
because the ppp_mutex lock ensures we can't enter the PPPIOCSFLAGS case
concurrently.
This is still quite theoretical issue as I've never observed the error
in practice.

 drivers/net/ppp/ppp_generic.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c
index fc8ad00..e8a5936 100644
--- a/drivers/net/ppp/ppp_generic.c
+++ b/drivers/net/ppp/ppp_generic.c
@@ -443,9 +443,14 @@ static ssize_t ppp_read(struct file *file, char __user *buf,
 * network traffic (demand mode).
 */
struct ppp *ppp = PF_TO_PPP(pf);
+
+   ppp_recv_lock(ppp);
if (ppp->n_channels == 0 &&
-   (ppp->flags & SC_LOOP_TRAFFIC) == 0)
+   (ppp->flags & SC_LOOP_TRAFFIC) == 0) {
+   ppp_recv_unlock(ppp);
break;
+   }
+   ppp_recv_unlock(ppp);
}
ret = -EAGAIN;
if (file->f_flags & O_NONBLOCK)
@@ -532,9 +537,12 @@ static unsigned int ppp_poll(struct file *file, poll_table *wait)
else if (pf->kind == INTERFACE) {
/* see comment in ppp_read */
struct ppp *ppp = PF_TO_PPP(pf);
+
+   ppp_recv_lock(ppp);
if (ppp->n_channels == 0 &&
(ppp->flags & SC_LOOP_TRAFFIC) == 0)
mask |= POLLIN | POLLRDNORM;
+   ppp_recv_unlock(ppp);
}
 
return mask;
-- 
2.7.0



Re: Sending short raw packets using sendmsg() broke

2016-02-26 Thread David Miller
From: Willem de Bruijn 
Date: Fri, 26 Feb 2016 12:33:13 -0500

> Right. The simplest, if hacky, fix is to add something along the lines of
> 
>   static unsigned short netdev_min_hard_header_len(struct net_device *dev)
>   {
>   if (unlikely(dev->type == ARPHRD_AX25))
> return AX25_KISS_HEADER_LEN;
>   else
> return dev->hard_header_len;
>   }
> 
> Depending on how the variable encoding scheme works, a basic min
> length check may still produce buggy headers that confuse the stack or
> driver. I need to read up on AX25. If so, then extending header_ops
> with an optional validate() function is a more generic approach of
> checking header sanity.

I suspect we will need some kind of header ops for this.


Re: [PATCH V2 03/12] net-next: mediatek: add embedded switch driver (ESW)

2016-02-26 Thread David Miller
From: Andrew Lunn 
Date: Fri, 26 Feb 2016 18:05:45 +0100

> I think it is great a vendor is providing funding to get code
> upstream. However, that code needs to conform with current kernel
> architecture and design philosophy.
> 
> We as a community also need to be consistent. We have recently push
> back on Microchip with there LAN9352 who want to do something very
> similar, introduce the MAC and a very dumb switch driver. They are now
> looking at what it means to do a DSA driver. There is also talk of
> writing a DSA driver for the ks8995 family.
> 
> As David said recently, a year ago this probably would of been
> accepted. But now, switches need to be DSA or switchdev.

+1


Re: [PATCH V2 03/12] net-next: mediatek: add embedded switch driver (ESW)

2016-02-26 Thread David Miller
From: Felix Fietkau 
Date: Fri, 26 Feb 2016 17:25:38 +0100

> In my opinion, leaving this part out does not make much sense and
> neither does deferring the entire patch series until we have a
> switchdev/DSA capable driver. This is just a starting point, which will
> be turned into a proper driver with the right APIs later.

I disagree, and we want people to concentrate on writing proper
switchdev/DSA drivers.

People like Andrew have offered to help in any way possible to make
this as easy as possible, so please take this seriously.

I would have accepted your arguments a year ago when we didn't have
the right infrastructure, but now we do and there is no real excuse
to submit partial or bastardized drivers for these kinds of hardware
anymore.

Thanks in advance for your understanding, and I plan to stand
very firm on this.


Re: [PATCH RFC 0/3] intermediate representation for jit and cls_u32 conversion

2016-02-26 Thread David Miller
From: Pablo Neira Ayuso 
Date: Fri, 26 Feb 2016 17:19:48 +0100

> I see no reason to have as many hooks as frontends to start with. If
> you find limitations with the IR that are unfixable for any of the
> existing frontends in the future, then we can add direct hook as final
> solution.

I see no problem with adding many hooks, one for each class of things
we'd like to offload.  Stuff neading IR vs. stuff that does not.

And IR is "unfixable" for the latter case in that it will always be by
definition pure overhead if the cards can do this stuff directly, and
they can.

I do not encourage anything, in any way whatsoever, to try and genericize
all of this stuff into a generic framework.  That is wasted work in my
opinion.

You find an IR useful for nftables offloads, great!  But I do not see it
being useful nor desirable for u32, flower, et al.

Thanks.


Re: [net-next PATCH v3 1/3] net: sched: consolidate offload decision in cls_u32

2016-02-26 Thread Cong Wang
On Fri, Feb 26, 2016 at 7:53 AM, John Fastabend
 wrote:
> The offload decision was originally very basic and tied to if the dev
> implemented the appropriate ndo op hook. The next step is to allow
> the user to more flexibly define if any paticular rule should be
> offloaded or not. In order to have this logic in one function lift
> the current check into a helper routine tc_should_offload().
>
> Signed-off-by: John Fastabend 
> ---
>  include/net/pkt_cls.h |5 +
>  net/sched/cls_u32.c   |8 
>  2 files changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
> index 2121df5..e64d20b 100644
> --- a/include/net/pkt_cls.h
> +++ b/include/net/pkt_cls.h
> @@ -392,4 +392,9 @@ struct tc_cls_u32_offload {
> };
>  };
>
> +static inline bool tc_should_offload(struct net_device *dev)
> +{
> +   return dev->netdev_ops->ndo_setup_tc;
> +}
> +

These should be protected by CONFIG_NET_CLS_U32, no?


Re: [PATCH RFC 3/3] net: convert tc_u32 to use the intermediate representation

2016-02-26 Thread David Miller
From: Pablo Neira Ayuso 
Date: Fri, 26 Feb 2016 17:02:22 +0100

> Just because you want to prematurely micro-optimize this thing by
> saving a little extra code that runs from the control-plane path.

I don't think that's what he is doing at all.

We have classes of classifier etc. offloads that need IR, and we have
those that don't.

There is nothing wrong with making this distinction and making our
design based upon that observation.


Re: [PATCH V2 03/12] net-next: mediatek: add embedded switch driver (ESW)

2016-02-26 Thread David Miller
From: John Crispin 
Date: Fri, 26 Feb 2016 16:24:47 +0100

> The problem here is that, on one side, people complain about vendors not
> sending code upstream. Once they start being good citizens and provide
> funding to send stuff upstream, the feedback tends to be very bad, as
> seen here.

The feedback is not bad; on the contrary, it is very positive, and people
like Andrew want to help people like you write proper switch drivers.

If you were ignored, or rejected purely on the grounds of coding style
issues, that would be "very bad" feedback.


Re: [PATCH V2 03/12] net-next: mediatek: add embedded switch driver (ESW)

2016-02-26 Thread David Miller
From: Andrew Lunn 
Date: Fri, 26 Feb 2016 16:18:13 +0100

> On Fri, Feb 26, 2016 at 03:21:35PM +0100, John Crispin wrote:
>> The ESW is found in many of the old 100 Mbit MIPS-based SoCs. It has 5
>> external ports, 1 CPU port, and 1 further port that the internal HW
>> offloading engine connects to.
>> 
>> This driver is very basic and only provides basic init and IRQ support.
>> The SoC and switch core both have support for a special tag, making DSA
>> support possible.
> 
> Hi Crispin
> 
> There was recently a discussion about adding switches without using
> DSA or switchdev. It was pretty much decided we would not accept such
> drivers.
> 
> Sorry

+1


Re: Sending short raw packets using sendmsg() broke

2016-02-26 Thread David Miller
From: Alan Cox 
Date: Fri, 26 Feb 2016 14:44:34 +

> On Thu, 2016-02-25 at 15:26 -0500, David Miller wrote:
>> From: Heikki Hannikainen 
>> Date: Thu, 25 Feb 2016 21:36:07 +0200 (EET)
>> 
>> > Commit 9c7077622dd9174 added a check, ll_header_truncated(), which
>> > requires that a packet transmitted using sendmsg() with PF_PACKET,
>> > SOCK_RAW must be longer than dev->hard_header_len.
>> 
>> Fixed by:
>> 
>> commit 880621c2605b82eb5af91a2c94223df6f5a3fb64
>> Author: Martin Blumenstingl 
>> Date:   Sun Nov 22 17:46:09 2015 +0100
>> 
>> packet: Allow packets with only a header (but no payload)
> 
> In the AX.25 case the header is variable-length, so this still doesn't
> fix the regression as far as I can see.

Then can you suggest a way to ensure that the user has given us a fully
specified link header?  Perhaps we can have a netdev_ops callback for
this, which variable-length-header technologies can implement.
