Re: [PATCH/RFC v2 net-next 3/4] ravb: Document binding for r8a7795 SoC

2015-09-30 Thread Sergei Shtylyov

Hello.

On 09/14/2015 03:42 AM, Simon Horman wrote:

   Sorry for delayed reply, I thought I'd already replied to this. :-/


From: Kazuya Mizuguchi 

This patch updates the ravb binding to support the r8a7795 SoC by:
- Adding a compat string for the new hardware
- Adding 25 named interrupts to binding for the new SoC;
   older SoCs continue to use a single multiplexed interrupt

The example is also updated to reflect the r8a7795 as this is the
more complex case.

Based on work by Kazuya Mizuguchi and others.

Signed-off-by: Simon Horman 

---

v2
* First post; broken out of a driver update patch
* As discussed with Geert Uytterhoeven and Sergei Shtylyov
   - Binding: Make all interrupts mandatory as named-interrupts of
 the form ch%u
---
  .../devicetree/bindings/net/renesas,ravb.txt   | 65 +++---
  1 file changed, 58 insertions(+), 7 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/renesas,ravb.txt 
b/Documentation/devicetree/bindings/net/renesas,ravb.txt
index 1fd8831437bf..6c360f993d33 100644
--- a/Documentation/devicetree/bindings/net/renesas,ravb.txt
+++ b/Documentation/devicetree/bindings/net/renesas,ravb.txt

[...]

@@ -27,13 +33,46 @@ Optional properties:
  Example:

ethernet@e680 {
-   compatible = "renesas,etheravb-r8a7790";
-   reg = <0 0xe680 0 0x800>, <0 0xee0e8000 0 0x4000>;
+   compatible = "renesas,etheravb-r8a7795";
+   reg = <0 0xe680 0 0x800>, <0 0xe6a0 0 0x1>;
interrupt-parent = <>;
-   interrupts = <0 163 IRQ_TYPE_LEVEL_HIGH>;
-   clocks = <_clks R8A7790_CLK_ETHERAVB>;
-   phy-mode = "rmii";
+   interrupts = ,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+;
+   interrupt-names = "ch0", "ch1", "ch2", "ch3",
+ "ch4", "ch5", "ch6", "ch7",
+ "ch8", "ch9", "ch10", "ch11",
+ "ch12", "ch13", "ch14", "ch15",
+ "ch16", "ch17", "ch18", "ch19",
+ "ch20", "ch21", "ch22", "ch23",
+ "ch24";


>> To me, these names don't look very helpful. You could as well omit them
>> and use platform_get_irq() with the channel #.

> These names reflect the hardware; which is the aim of DT.

   Indeed (I've looked into the manuals by now). They just look poorly
chosen. :-)

> As I believe you pointed out earlier it is preferred to use named
> interrupts when there is more than one. Do I misunderstand the situation
> there?

   Yes.

> If you have a positive contribution to make regarding better names then
> I am all ears.

   I liked your "tx", "rx" variant better...

MBR, Sergei

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
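
As a rough illustration of the point argued in the ravb thread above, the difference between looking an interrupt up by a name of the form ch%u and using the bare channel number can be modelled in user-space C. This is only a sketch under stated assumptions: irq_names, irq_index_by_name() and irq_index_for_channel() are hypothetical names made up for the example, and none of this is the ravb driver's code or the kernel's platform_get_irq() API.

```c
#include <stdio.h>
#include <string.h>

#define NUM_IRQS 25	/* the patch describes 25 named interrupts */

/* Hypothetical stand-in for the DT "interrupt-names" property. */
static const char *const irq_names[NUM_IRQS] = {
	"ch0",  "ch1",  "ch2",  "ch3",  "ch4",  "ch5",  "ch6",  "ch7",
	"ch8",  "ch9",  "ch10", "ch11", "ch12", "ch13", "ch14", "ch15",
	"ch16", "ch17", "ch18", "ch19", "ch20", "ch21", "ch22", "ch23",
	"ch24",
};

/* Look an interrupt up by name; returns its index, or -1 if absent. */
static int irq_index_by_name(const char *name)
{
	for (int i = 0; i < NUM_IRQS; i++)
		if (strcmp(irq_names[i], name) == 0)
			return i;
	return -1;
}

/* Build "ch<N>" and resolve it, as a consumer of ch%u names might. */
static int irq_index_for_channel(unsigned int ch)
{
	char name[16];

	snprintf(name, sizeof(name), "ch%u", ch);
	return irq_index_by_name(name);
}
```

When the names only encode the position in the list, the lookup by name returns the channel number that was passed in, which is the objection raised in the thread; names such as "tx" or "rx" would carry information the index does not.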


Re: [PATCH v3 2/5] seccomp: add the concept of a seccomp filter FD

2015-09-30 Thread Tycho Andersen
On Wed, Sep 30, 2015 at 11:27:34AM -0700, Andy Lutomirski wrote:
> On Wed, Sep 30, 2015 at 11:13 AM, Tycho Andersen
>  wrote:
> > This patch introduces the concept of a seccomp fd, with a similar interface
> > and usage to ebpf fds. Initially, one is allowed to create, install, and
> > dump these fds. Any manipulation of seccomp fds requires users to be root
> > in their own user namespace, matching the checks done for
> > SECCOMP_SET_MODE_FILTER.
> >
> > Installing a filterfd has some gotchas, though. Andy mentioned previously
> > that we should restrict installation to filter fds whose parent is already
> > in the filter tree. This doesn't quite work in the case of created seccomp
> > fds, since once you install a filter fd, you can't install any other filter
> > fd since it has no parent and there is no way to "pre-chain" filters before
> > installing them.
> 
> ISTM, if we like the seccomp fd approach, we should have them be
> created with a parent already set.  IOW the default should be that
> their parent is the creator's seccomp fd and, if needed, creators
> could specify a different parent.

Allowing people doing SECCOMP_FD_NEW to specify a parent fd would
work. Then we can disallow installing a seccomp fd if its parent is
not the current filter, and get rid of the whole mess with prev
locking and all that.

Tycho
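
The rule proposed above (a filter is created with a parent, and may only be installed if that parent is the task's current filter) can be sketched in a few lines of user-space C. This is a simplified model under assumptions: struct filter, struct task, filter_init() and filter_install() are hypothetical names, and the -EINVAL return value is a guess for illustration rather than what the patch set does.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified filter object; not the kernel's structure. */
struct filter {
	struct filter *parent;
};

struct task {
	struct filter *current_filter;	/* most recently installed, or NULL */
};

/* Create a filter; the parent defaults to the creator's current filter. */
static void filter_init(struct filter *f, const struct task *creator,
			struct filter *parent_override)
{
	f->parent = parent_override ? parent_override : creator->current_filter;
}

/* Install only if the filter chains directly onto the current filter. */
static int filter_install(struct task *t, struct filter *f)
{
	if (f->parent != t->current_filter)
		return -EINVAL;	/* assumed error code */
	t->current_filter = f;
	return 0;
}
```

With the parent fixed at creation time, the install path needs only a pointer comparison, which is why the thread expects the "prev" locking to become unnecessary.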


[PATCH v2 06/14] RDS: use rds_send_xmit() state instead of RDS_LL_SEND_FULL

2015-09-30 Thread Santosh Shilimkar
In the transport-independent rds_sendmsg(), we shouldn't make decisions
based on RDS_LL_SEND_FULL, which is used to manage the ring for RDMA-based
transports. We can safely issue rds_send_xmit() and then use its
return value to decide on deferred work. This will also fix
the scenario where at times we are seeing connections stuck with
the LL_SEND_FULL bit getting set and never cleared.

We kick krdsd any time we see -ENOMEM or -EAGAIN from the
ring allocation code.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/send.c| 10 ++
 net/rds/threads.c |  2 ++
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/rds/send.c b/net/rds/send.c
index f1e709c..9d8b52d 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -1122,8 +1122,9 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, 
size_t payload_len)
 */
rds_stats_inc(s_send_queued);
 
-   if (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags))
-   rds_send_xmit(conn);
+   ret = rds_send_xmit(conn);
+   if (ret == -ENOMEM || ret == -EAGAIN)
+   queue_delayed_work(rds_wq, &conn->c_send_w, 1);
 
rds_message_put(rm);
return payload_len;
@@ -1179,8 +1180,9 @@ rds_send_pong(struct rds_connection *conn, __be16 dport)
rds_stats_inc(s_send_queued);
rds_stats_inc(s_send_pong);
 
-   if (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags))
-   queue_delayed_work(rds_wq, &conn->c_send_w, 0);
+   ret = rds_send_xmit(conn);
+   if (ret == -ENOMEM || ret == -EAGAIN)
+   queue_delayed_work(rds_wq, &conn->c_send_w, 1);
 
rds_message_put(rm);
return 0;
diff --git a/net/rds/threads.c b/net/rds/threads.c
index dc2402e..454aa6d 100644
--- a/net/rds/threads.c
+++ b/net/rds/threads.c
@@ -162,7 +162,9 @@ void rds_send_worker(struct work_struct *work)
int ret;
 
if (rds_conn_state(conn) == RDS_CONN_UP) {
+   clear_bit(RDS_LL_SEND_FULL, &conn->c_flags);
ret = rds_send_xmit(conn);
+   cond_resched();
rdsdebug("conn %p ret %d\n", conn, ret);
switch (ret) {
case -EAGAIN:
-- 
1.9.1
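
The control flow this patch describes (always attempt the transmit, and schedule deferred work only when the result is -ENOMEM or -EAGAIN) can be sketched outside the kernel as follows. The names struct conn, fake_send_xmit(), fake_queue_delayed_work() and send_and_maybe_defer() are hypothetical stand-ins for this example and are not RDS functions.

```c
#include <errno.h>

/* Hypothetical stand-in for a connection and its deferred-work state. */
struct conn {
	int xmit_result;	/* what the fake transmit path will return */
	int work_queued;	/* how many times deferred work was scheduled */
};

/* Stand-in for rds_send_xmit(): just reports the preset result. */
static int fake_send_xmit(struct conn *c)
{
	return c->xmit_result;
}

/* Stand-in for queue_delayed_work(): counts the kicks. */
static void fake_queue_delayed_work(struct conn *c)
{
	c->work_queued++;
}

/* Always try to transmit; defer to the worker only on transient failure. */
static int send_and_maybe_defer(struct conn *c)
{
	int ret = fake_send_xmit(c);

	if (ret == -ENOMEM || ret == -EAGAIN)
		fake_queue_delayed_work(c);
	return ret;
}
```

The decision depends only on the return value of the transmit attempt, so a stale "ring full" flag can no longer stop the send path from making progress.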



[PATCH v2 07/14] RDS: IB: ack more receive completions to improve performance

2015-09-30 Thread Santosh Shilimkar
For better performance, we split the receive completion IRQ handler. That
lets us acknowledge several WCE events in one call. We also limit the WC
to max 32 to avoid latency. Acknowledging several completions in one call
instead of several calls each time will provide better performance since
fewer mutual exclusion locks are taken.

In the next patch, send completion is also split, and it re-uses poll_cq();
hence the code is moved to ib_cm.c.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h   |  28 +--
 net/rds/ib_cm.c|  70 ++-
 net/rds/ib_recv.c  | 136 +++--
 net/rds/ib_stats.c |   3 +-
 4 files changed, 132 insertions(+), 105 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index f1fd5ff..727759b 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -24,6 +24,8 @@
 
 #define RDS_IB_RECYCLE_BATCH_COUNT 32
 
+#define RDS_IB_WC_MAX  32
+
 extern struct rw_semaphore rds_ib_devices_lock;
 extern struct list_head rds_ib_devices;
 
@@ -89,6 +91,20 @@ struct rds_ib_work_ring {
atomic_t w_free_ctr;
 };
 
+/* Rings are posted with all the allocations they'll need to queue the
+ * incoming message to the receiving socket so this can't fail.
+ * All fragments start with a header, so we can make sure we're not receiving
+ * garbage, and we can tell a small 8 byte fragment from an ACK frame.
+ */
+struct rds_ib_ack_state {
+   u64 ack_next;
+   u64 ack_recv;
+   unsigned int ack_required:1;
+   unsigned int ack_next_valid:1;
+   unsigned int ack_recv_valid:1;
+};
+
+
 struct rds_ib_device;
 
 struct rds_ib_connection {
@@ -102,6 +118,10 @@ struct rds_ib_connection {
struct ib_pd *i_pd;
struct ib_cq *i_send_cq;
struct ib_cq *i_recv_cq;
+   struct ib_wc i_recv_wc[RDS_IB_WC_MAX];
+
+   /* interrupt handling */
+   struct tasklet_struct   i_recv_tasklet;
 
/* tx */
struct rds_ib_work_ring i_send_ring;
@@ -112,7 +132,6 @@ struct rds_ib_connection {
atomic_t i_signaled_sends;
 
/* rx */
-   struct tasklet_struct   i_recv_tasklet;
struct mutex i_recv_mutex;
struct rds_ib_work_ring i_recv_ring;
struct rds_ib_incoming  *i_ibinc;
@@ -199,13 +218,14 @@ struct rds_ib_statistics {
uint64_t s_ib_connect_raced;
uint64_t s_ib_listen_closed_stale;
uint64_t s_ib_tx_cq_call;
+   uint64_t s_ib_evt_handler_call;
+   uint64_t s_ib_tasklet_call;
uint64_t s_ib_tx_cq_event;
uint64_t s_ib_tx_ring_full;
uint64_t s_ib_tx_throttle;
uint64_t s_ib_tx_sg_mapping_failure;
uint64_t s_ib_tx_stalled;
uint64_t s_ib_tx_credit_updates;
-   uint64_t s_ib_rx_cq_call;
uint64_t s_ib_rx_cq_event;
uint64_t s_ib_rx_ring_empty;
uint64_t s_ib_rx_refill_from_cq;
@@ -324,7 +344,8 @@ void rds_ib_recv_free_caches(struct rds_ib_connection *ic);
 void rds_ib_recv_refill(struct rds_connection *conn, int prefill, gfp_t gfp);
 void rds_ib_inc_free(struct rds_incoming *inc);
 int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to);
-void rds_ib_recv_cq_comp_handler(struct ib_cq *cq, void *context);
+void rds_ib_recv_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc,
+struct rds_ib_ack_state *state);
 void rds_ib_recv_tasklet_fn(unsigned long data);
 void rds_ib_recv_init_ring(struct rds_ib_connection *ic);
 void rds_ib_recv_clear_ring(struct rds_ib_connection *ic);
@@ -332,6 +353,7 @@ void rds_ib_recv_init_ack(struct rds_ib_connection *ic);
 void rds_ib_attempt_ack(struct rds_ib_connection *ic);
 void rds_ib_ack_send_complete(struct rds_ib_connection *ic);
 u64 rds_ib_piggyb_ack(struct rds_ib_connection *ic);
+void rds_ib_set_ack(struct rds_ib_connection *ic, u64 seq, int ack_required);
 
 /* ib_ring.c */
 void rds_ib_ring_init(struct rds_ib_work_ring *ring, u32 nr);
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 9043f5c..28e0979 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -216,6 +216,72 @@ static void rds_ib_cq_event_handler(struct ib_event 
*event, void *data)
 event->event, ib_event_msg(event->event), data);
 }
 
+/* Plucking the oldest entry from the ring can be done concurrently with
+ * the thread refilling the ring.  Each ring operation is protected by
+ * spinlocks and the transient state of refilling doesn't change the
+ * recording of which entry is oldest.
+ *
+ * This relies on IB only calling one cq comp_handler for each cq so that
+ * there will only be one caller of rds_recv_incoming() per RDS connection.
+ */
+static void 

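The batching idea in the commit message above (fetch up to 32 work completions per poll call instead of one at a time) can be illustrated with a toy completion queue. This is a user-space sketch under assumptions: struct fake_cq, fake_poll_cq() and drain_cq() are hypothetical names and do not reproduce the patch's poll_cq() or the verbs API.

```c
#define WC_MAX 32	/* same cap as RDS_IB_WC_MAX in the patch */

/* Toy completion queue: only tracks how many completions are waiting. */
struct fake_cq {
	int pending;
};

/* Stand-in for a poll call: hands back at most 'max' completions. */
static int fake_poll_cq(struct fake_cq *cq, int max)
{
	int n = cq->pending < max ? cq->pending : max;

	cq->pending -= n;
	return n;
}

/*
 * Drain the queue in batches of up to WC_MAX.
 * Returns the number of non-empty poll calls; *handled gets the total.
 */
static int drain_cq(struct fake_cq *cq, int *handled)
{
	int calls = 0;
	int n;

	*handled = 0;
	while ((n = fake_poll_cq(cq, WC_MAX)) > 0) {
		calls++;
		*handled += n;
	}
	return calls;
}
```

For 70 queued completions this takes three non-empty polls (32, 32 and 6) rather than 70, which is the saving the commit message refers to.
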
[PATCH v2 01/14] RDS: use kfree_rcu in rds_ib_remove_ipaddr

2015-09-30 Thread Santosh Shilimkar
synchronize_rcu() unnecessarily slows down the socket shutdown
path. It is used only to kfree() the IP addresses in rds_ib_remove_ipaddr(),
which is a perfect use case for kfree_rcu().

So let's use that to gain some speedup.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h  | 1 +
 net/rds/ib_rdma.c | 6 ++
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index aae60fd..f1fd5ff 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -164,6 +164,7 @@ struct rds_ib_connection {
 struct rds_ib_ipaddr {
struct list_head list;
__be32  ipaddr;
+   struct rcu_head rcu;
 };
 
 struct rds_ib_device {
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index 251d1ce..872f523 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -159,10 +159,8 @@ static void rds_ib_remove_ipaddr(struct rds_ib_device 
*rds_ibdev, __be32 ipaddr)
}
spin_unlock_irq(&rds_ibdev->spinlock);
 
-   if (to_free) {
-   synchronize_rcu();
-   kfree(to_free);
-   }
+   if (to_free)
+   kfree_rcu(to_free, rcu);
 }
 
 int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr)
-- 
1.9.1
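
To show why the patch adds a struct rcu_head member to the object, here is a toy user-space model of deferred freeing: the caller queues the object and returns immediately, and the memory is released later. It is not real RCU and makes no concurrency guarantees; struct toy_rcu_head, toy_defer_free() and toy_grace_period_elapsed() are hypothetical names invented for this sketch.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy callback head; the real struct rcu_head lives in the kernel. */
struct toy_rcu_head {
	struct toy_rcu_head *next;
	size_t offset;	/* offset of this head inside the enclosing object */
};

struct ipaddr_entry {
	unsigned int ipaddr;
	struct toy_rcu_head rcu;	/* embedded head, as in the patch */
};

static struct toy_rcu_head *pending;	/* objects waiting to be freed */

/* Queue an object for a later free instead of blocking the caller. */
static void toy_defer_free(struct toy_rcu_head *head, size_t offset)
{
	head->offset = offset;
	head->next = pending;
	pending = head;
}

/* Stand-in for the end of a grace period; returns how many were freed. */
static int toy_grace_period_elapsed(void)
{
	int freed = 0;

	while (pending) {
		struct toy_rcu_head *head = pending;

		pending = head->next;
		free((char *)head - head->offset);	/* back to object start */
		freed++;
	}
	return freed;
}
```

The embedded head is what lets the object be linked onto a pending list without any extra allocation, so the shutdown path does not have to wait the way it does with synchronize_rcu().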



[PATCH v2 10/14] RDS: IB: fix the rds_ib_fmr_wq kick call

2015-09-30 Thread Santosh Shilimkar
The RDS IB MR pool has its own workqueue, 'rds_ib_fmr_wq', so we need
to use queue_delayed_work() to kick the work. This was hurting
performance since pool maintenance was triggered less often
from other paths.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_rdma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index 872f523..b6644fa 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -319,7 +319,7 @@ static struct rds_ib_mr *rds_ib_alloc_fmr(struct 
rds_ib_device *rds_ibdev)
int err = 0, iter = 0;
 
if (atomic_read(&pool->dirty_count) >= pool->max_items / 10)
-   schedule_delayed_work(&pool->flush_worker, 10);
+   queue_delayed_work(rds_ib_fmr_wq, &pool->flush_worker, 10);
 
while (1) {
ibmr = rds_ib_reuse_fmr(pool);
-- 
1.9.1



[PATCH v2 08/14] RDS: IB: split send completion handling and do batch ack

2015-09-30 Thread Santosh Shilimkar
Similar to what we did with receive CQ completion handling, we split
the transmit completion handler so that it lets us implement batched
work completion handling.

We re-use the poll_cq() routine and make use of RDS_IB_SEND_OP to
distinguish send from receive completion event handler invocations.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h   |   6 ++-
 net/rds/ib_cm.c|  45 --
 net/rds/ib_send.c  | 110 +
 net/rds/ib_stats.c |   1 -
 net/rds/send.c |   1 +
 5 files changed, 98 insertions(+), 65 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index 727759b..3a8cd31 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -25,6 +25,7 @@
 #define RDS_IB_RECYCLE_BATCH_COUNT 32
 
 #define RDS_IB_WC_MAX  32
+#define RDS_IB_SEND_OP BIT_ULL(63)
 
 extern struct rw_semaphore rds_ib_devices_lock;
 extern struct list_head rds_ib_devices;
@@ -118,9 +119,11 @@ struct rds_ib_connection {
struct ib_pd *i_pd;
struct ib_cq *i_send_cq;
struct ib_cq *i_recv_cq;
+   struct ib_wc i_send_wc[RDS_IB_WC_MAX];
struct ib_wc i_recv_wc[RDS_IB_WC_MAX];
 
/* interrupt handling */
+   struct tasklet_struct   i_send_tasklet;
struct tasklet_struct   i_recv_tasklet;
 
/* tx */
@@ -217,7 +220,6 @@ struct rds_ib_device {
 struct rds_ib_statistics {
uint64_t s_ib_connect_raced;
uint64_t s_ib_listen_closed_stale;
-   uint64_t s_ib_tx_cq_call;
uint64_t s_ib_evt_handler_call;
uint64_t s_ib_tasklet_call;
uint64_t s_ib_tx_cq_event;
@@ -371,7 +373,7 @@ extern wait_queue_head_t rds_ib_ring_empty_wait;
 void rds_ib_xmit_complete(struct rds_connection *conn);
 int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm,
unsigned int hdr_off, unsigned int sg, unsigned int off);
-void rds_ib_send_cq_comp_handler(struct ib_cq *cq, void *context);
+void rds_ib_send_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc);
 void rds_ib_send_init_ring(struct rds_ib_connection *ic);
 void rds_ib_send_clear_ring(struct rds_ib_connection *ic);
 int rds_ib_xmit_rdma(struct rds_connection *conn, struct rm_rdma_op *op);
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 28e0979..8f51d0d 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -250,11 +250,34 @@ static void poll_cq(struct rds_ib_connection *ic, struct 
ib_cq *cq,
rdsdebug("wc wr_id 0x%llx status %u byte_len %u 
imm_data %u\n",
 (unsigned long long)wc->wr_id, wc->status,
 wc->byte_len, be32_to_cpu(wc->ex.imm_data));
-   rds_ib_recv_cqe_handler(ic, wc, ack_state);
+
+   if (wc->wr_id & RDS_IB_SEND_OP)
+   rds_ib_send_cqe_handler(ic, wc);
+   else
+   rds_ib_recv_cqe_handler(ic, wc, ack_state);
}
}
 }
 
+static void rds_ib_tasklet_fn_send(unsigned long data)
+{
+   struct rds_ib_connection *ic = (struct rds_ib_connection *)data;
+   struct rds_connection *conn = ic->conn;
+   struct rds_ib_ack_state state;
+
+   rds_ib_stats_inc(s_ib_tasklet_call);
+
+   memset(&state, 0, sizeof(state));
+   poll_cq(ic, ic->i_send_cq, ic->i_send_wc, &state);
+   ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP);
+   poll_cq(ic, ic->i_send_cq, ic->i_send_wc, &state);
+
+   if (rds_conn_up(conn) &&
+   (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags) ||
+   test_bit(0, &conn->c_map_queued)))
+   rds_send_xmit(ic->conn);
+}
+
 static void rds_ib_tasklet_fn_recv(unsigned long data)
 {
struct rds_ib_connection *ic = (struct rds_ib_connection *)data;
@@ -304,6 +327,18 @@ static void rds_ib_qp_event_handler(struct ib_event 
*event, void *data)
}
 }
 
+static void rds_ib_cq_comp_handler_send(struct ib_cq *cq, void *context)
+{
+   struct rds_connection *conn = context;
+   struct rds_ib_connection *ic = conn->c_transport_data;
+
+   rdsdebug("conn %p cq %p\n", conn, cq);
+
+   rds_ib_stats_inc(s_ib_evt_handler_call);
+
+   tasklet_schedule(&ic->i_send_tasklet);
+}
+
 /*
  * This needs to be very careful to not leave IS_ERR pointers around for
  * cleanup to trip over.
@@ -337,7 +372,8 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
ic->i_pd = rds_ibdev->pd;
 
cq_attr.cqe = ic->i_send_ring.w_nr + 1;
-   ic->i_send_cq = ib_create_cq(dev, rds_ib_send_cq_comp_handler,
+
+   ic->i_send_cq = ib_create_cq(dev, rds_ib_cq_comp_handler_send,
 rds_ib_cq_event_handler, conn,
 &cq_attr);
if (IS_ERR(ic->i_send_cq)) 

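The dispatch on RDS_IB_SEND_OP shown in poll_cq() above relies on reserving the top bit of the 64-bit work-request id to mark send completions. A small self-contained sketch of that bit-tagging technique follows. The helper names are hypothetical, and how the real patch sets the bit when posting sends is not shown in this excerpt, so tag_send_wr_id() is an assumption.

```c
#include <stdint.h>

/* Mirrors RDS_IB_SEND_OP, defined in the patch as BIT_ULL(63). */
#define SEND_OP_FLAG (UINT64_C(1) << 63)

enum cqe_kind { CQE_RECV = 0, CQE_SEND = 1 };

/* Assumed: mark a work-request id as belonging to the send side. */
static uint64_t tag_send_wr_id(uint64_t wr_id)
{
	return wr_id | SEND_OP_FLAG;
}

/* Decide which handler a completion belongs to from its id alone. */
static enum cqe_kind classify_wr_id(uint64_t wr_id)
{
	return (wr_id & SEND_OP_FLAG) ? CQE_SEND : CQE_RECV;
}

/* Recover the original id by masking the tag bit off again. */
static uint64_t strip_tag(uint64_t wr_id)
{
	return wr_id & ~SEND_OP_FLAG;
}
```

Because the tag travels inside the id, one polling routine can serve both completion queues without an extra argument saying which side it is handling.
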
[PATCH v2 04/14] RDS: Use per-bucket rw lock for bind hash-table

2015-09-30 Thread Santosh Shilimkar
One global lock protecting a hash table with 1024 buckets isn't
efficient, and it shows up on massive systems with truckloads
of RDS sockets serving multiple databases. The
perf data clearly highlights the contention on the rw
lock in these massive workloads.

When the contention gets worse, the code gets into a state where
it decides to back off on the lock. So while it has disabled interrupts,
it sits and backs off on this lock get. This causes the system to
become sluggish and eventually all sorts of bad things happen.

The simple fix is to move the lock into the hash bucket and
use a per-bucket lock to improve scalability.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/af_rds.c |  2 ++
 net/rds/bind.c   | 47 ---
 net/rds/rds.h|  1 +
 3 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index dc08766..384ea1e 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -582,6 +582,8 @@ static int rds_init(void)
 {
int ret;
 
+   rds_bind_lock_init();
+
ret = rds_conn_init();
if (ret)
goto out;
diff --git a/net/rds/bind.c b/net/rds/bind.c
index 166c605..bc6b93e 100644
--- a/net/rds/bind.c
+++ b/net/rds/bind.c
@@ -38,22 +38,27 @@
 #include 
 #include "rds.h"
 
+struct bind_bucket {
+   rwlock_t lock;
+   struct hlist_head   head;
+};
+
 #define BIND_HASH_SIZE 1024
-static struct hlist_head bind_hash_table[BIND_HASH_SIZE];
-static DEFINE_RWLOCK(rds_bind_lock);
+static struct bind_bucket bind_hash_table[BIND_HASH_SIZE];
 
-static struct hlist_head *hash_to_bucket(__be32 addr, __be16 port)
+static struct bind_bucket *hash_to_bucket(__be32 addr, __be16 port)
 {
return bind_hash_table + (jhash_2words((u32)addr, (u32)port, 0) &
  (BIND_HASH_SIZE - 1));
 }
 
 /* must hold either read or write lock (write lock for insert != NULL) */
-static struct rds_sock *rds_bind_lookup(__be32 addr, __be16 port,
+static struct rds_sock *rds_bind_lookup(struct bind_bucket *bucket,
+   __be32 addr, __be16 port,
struct rds_sock *insert)
 {
struct rds_sock *rs;
-   struct hlist_head *head = hash_to_bucket(addr, port);
+   struct hlist_head *head = &bucket->head;
u64 cmp;
u64 needle = ((u64)be32_to_cpu(addr) << 32) | be16_to_cpu(port);
 
@@ -91,10 +96,11 @@ struct rds_sock *rds_find_bound(__be32 addr, __be16 port)
 {
struct rds_sock *rs;
unsigned long flags;
+   struct bind_bucket *bucket = hash_to_bucket(addr, port);
 
-   read_lock_irqsave(&rds_bind_lock, flags);
-   rs = rds_bind_lookup(addr, port, NULL);
-   read_unlock_irqrestore(&rds_bind_lock, flags);
+   read_lock_irqsave(&bucket->lock, flags);
+   rs = rds_bind_lookup(bucket, addr, port, NULL);
+   read_unlock_irqrestore(&bucket->lock, flags);
 
if (rs && sock_flag(rds_rs_to_sk(rs), SOCK_DEAD)) {
rds_sock_put(rs);
@@ -113,6 +119,7 @@ static int rds_add_bound(struct rds_sock *rs, __be32 addr, 
__be16 *port)
unsigned long flags;
int ret = -EADDRINUSE;
u16 rover, last;
+   struct bind_bucket *bucket;
 
if (*port != 0) {
rover = be16_to_cpu(*port);
@@ -122,13 +129,15 @@ static int rds_add_bound(struct rds_sock *rs, __be32 
addr, __be16 *port)
last = rover - 1;
}
 
-   write_lock_irqsave(&rds_bind_lock, flags);
-
do {
struct rds_sock *rrs;
if (rover == 0)
rover++;
-   rrs = rds_bind_lookup(addr, cpu_to_be16(rover), rs);
+
+   bucket = hash_to_bucket(addr, cpu_to_be16(rover));
+   write_lock_irqsave(&bucket->lock, flags);
+   rrs = rds_bind_lookup(bucket, addr, cpu_to_be16(rover), rs);
+   write_unlock_irqrestore(&bucket->lock, flags);
if (!rrs) {
*port = rs->rs_bound_port;
ret = 0;
@@ -140,16 +149,16 @@ static int rds_add_bound(struct rds_sock *rs, __be32 
addr, __be16 *port)
}
} while (rover++ != last);
 
-   write_unlock_irqrestore(&rds_bind_lock, flags);
-
return ret;
 }
 
 void rds_remove_bound(struct rds_sock *rs)
 {
unsigned long flags;
+   struct bind_bucket *bucket =
+   hash_to_bucket(rs->rs_bound_addr, rs->rs_bound_port);
 
-   write_lock_irqsave(&rds_bind_lock, flags);
+   write_lock_irqsave(&bucket->lock, flags);
 
if (rs->rs_bound_addr) {
rdsdebug("rs %p unbinding from %pI4:%d\n",
@@ -161,7 +170,7 @@ void rds_remove_bound(struct rds_sock *rs)
rs->rs_bound_addr = 0;
}
 
-   write_unlock_irqrestore(&rds_bind_lock, flags);
+   write_unlock_irqrestore(&bucket->lock, flags);
 }
 
 int rds_bind(struct 

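The per-bucket locking described in this patch can be sketched in portable C11. Two substitutions are made here and should be read as assumptions: a tiny atomic_flag spinlock stands in for the kernel's rwlock_t, and a trivial XOR hash stands in for jhash_2words(). The names struct bucket, table_init(), bind_one() and count_bound() are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

#define HASH_SIZE 1024	/* same bucket count as BIND_HASH_SIZE */

struct bucket {
	atomic_flag lock;	/* protects only this bucket */
	int nr_bound;		/* placeholder for the bucket's chain */
};

static struct bucket table[HASH_SIZE];

static void table_init(void)
{
	for (int i = 0; i < HASH_SIZE; i++) {
		atomic_flag_clear(&table[i].lock);
		table[i].nr_bound = 0;
	}
}

static void bucket_lock(struct bucket *b)
{
	while (atomic_flag_test_and_set_explicit(&b->lock,
						 memory_order_acquire))
		;	/* spin until the holder releases this bucket */
}

static void bucket_unlock(struct bucket *b)
{
	atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

/* Placeholder hash; the patch hashes (addr, port) with jhash_2words(). */
static struct bucket *hash_to_bucket(uint32_t addr, uint16_t port)
{
	return &table[(addr ^ port) & (HASH_SIZE - 1)];
}

static void bind_one(uint32_t addr, uint16_t port)
{
	struct bucket *b = hash_to_bucket(addr, port);

	bucket_lock(b);
	b->nr_bound++;
	bucket_unlock(b);
}

static int count_bound(uint32_t addr, uint16_t port)
{
	struct bucket *b = hash_to_bucket(addr, port);
	int n;

	bucket_lock(b);
	n = b->nr_bound;
	bucket_unlock(b);
	return n;
}
```

Operations that hash to different buckets never touch the same lock, so contention is limited to sockets that actually collide instead of being shared by the whole table.
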
Re: [patch net-next] switchdev: bring back switchdev_obj and use it as a generic object param

2015-09-30 Thread Vivien Didelot
Hi Jiri,

On Sep. Wednesday 30 (40) 06:00 PM, Jiri Pirko wrote:
> From: Jiri Pirko 
> 
> Replace "void *obj" with a generic structure. Introduce couple of
> helpers along that.
> 
> Signed-off-by: Jiri Pirko 
> ---
>  drivers/net/ethernet/rocker/rocker.c | 41 +--
>  include/net/switchdev.h  | 42 
> ++--
>  net/bridge/br_fdb.c  |  2 +-
>  net/bridge/br_vlan.c |  6 --
>  net/dsa/slave.c  | 35 ++
>  net/switchdev/switchdev.c| 40 ++
>  6 files changed, 104 insertions(+), 62 deletions(-)
> 
> diff --git a/drivers/net/ethernet/rocker/rocker.c 
> b/drivers/net/ethernet/rocker/rocker.c
> index 9773f5b..1236835 100644
> --- a/drivers/net/ethernet/rocker/rocker.c
> +++ b/drivers/net/ethernet/rocker/rocker.c
> @@ -4437,7 +4437,8 @@ static int rocker_port_fdb_add(struct rocker_port 
> *rocker_port,
>  }
>  
>  static int rocker_port_obj_add(struct net_device *dev,
> -enum switchdev_obj_id id, const void *obj,
> +enum switchdev_obj_id id,
> +const struct switchdev_obj *obj,
>  struct switchdev_trans *trans)
>  {
>   struct rocker_port *rocker_port = netdev_priv(dev);
> @@ -4446,16 +4447,18 @@ static int rocker_port_obj_add(struct net_device *dev,
>  
>   switch (id) {
>   case SWITCHDEV_OBJ_PORT_VLAN:
> - err = rocker_port_vlans_add(rocker_port, trans, obj);
> + err = rocker_port_vlans_add(rocker_port, trans,
> + SWITCHDEV_OBJ_VLAN(obj));
>   break;
>   case SWITCHDEV_OBJ_IPV4_FIB:
> - fib4 = obj;
> + fib4 = SWITCHDEV_OBJ_IPV4_FIB(obj);
>   err = rocker_port_fib_ipv4(rocker_port, trans,
>  htonl(fib4->dst), fib4->dst_len,
>  fib4->fi, fib4->tb_id, 0);
>   break;
>   case SWITCHDEV_OBJ_PORT_FDB:
> - err = rocker_port_fdb_add(rocker_port, trans, obj);
> + err = rocker_port_fdb_add(rocker_port, trans,
> +   SWITCHDEV_OBJ_FDB(obj));
>   break;
>   default:
>   err = -EOPNOTSUPP;
> @@ -4508,7 +4511,8 @@ static int rocker_port_fdb_del(struct rocker_port 
> *rocker_port,
>  }
>  
>  static int rocker_port_obj_del(struct net_device *dev,
> -enum switchdev_obj_id id, const void *obj)
> +enum switchdev_obj_id id,
> +const struct switchdev_obj *obj)
>  {
>   struct rocker_port *rocker_port = netdev_priv(dev);
>   const struct switchdev_obj_ipv4_fib *fib4;
> @@ -4516,17 +4520,19 @@ static int rocker_port_obj_del(struct net_device *dev,
>  
>   switch (id) {
>   case SWITCHDEV_OBJ_PORT_VLAN:
> - err = rocker_port_vlans_del(rocker_port, obj);
> + err = rocker_port_vlans_del(rocker_port,
> + SWITCHDEV_OBJ_VLAN(obj));
>   break;
>   case SWITCHDEV_OBJ_IPV4_FIB:
> - fib4 = obj;
> + fib4 = SWITCHDEV_OBJ_IPV4_FIB(obj);
>   err = rocker_port_fib_ipv4(rocker_port, NULL,
>  htonl(fib4->dst), fib4->dst_len,
>  fib4->fi, fib4->tb_id,
>  ROCKER_OP_FLAG_REMOVE);
>   break;
>   case SWITCHDEV_OBJ_PORT_FDB:
> - err = rocker_port_fdb_del(rocker_port, NULL, obj);
> + err = rocker_port_fdb_del(rocker_port, NULL,
> +   SWITCHDEV_OBJ_FDB(obj));
>   break;
>   default:
>   err = -EOPNOTSUPP;
> @@ -4538,7 +4544,7 @@ static int rocker_port_obj_del(struct net_device *dev,
>  
>  static int rocker_port_fdb_dump(const struct rocker_port *rocker_port,
>   struct switchdev_obj_fdb *fdb,
> - int (*cb)(void *obj))
> + switchdev_obj_dump_cb_t *cb)
>  {
>   struct rocker *rocker = rocker_port->rocker;
>   struct rocker_fdb_tbl_entry *found;
> @@ -4555,7 +4561,7 @@ static int rocker_port_fdb_dump(const struct 
> rocker_port *rocker_port,
>   fdb->ndm_state = NUD_REACHABLE;
>   fdb->vid = rocker_port_vlan_to_vid(rocker_port,
>  found->key.vlan_id);
> - err = cb(fdb);
> + err = cb(&fdb->obj);
>   if (err)
>   break;
>   }
> @@ -4566,7 +4572,7 @@ static int rocker_port_fdb_dump(const struct 
> rocker_port *rocker_port,
>  
>  static int rocker_port_vlan_dump(const 

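The quoted patch replaces "void *obj" with a generic structure embedded in each specific object, plus helper macros to get back to the specific type. One common way to build such helpers is the container_of pattern, sketched below in user-space C. The excerpt does not show how SWITCHDEV_OBJ_VLAN() and friends are defined, so treating them as container_of wrappers is an assumption, and struct obj, struct obj_vlan, the id field, vid_begin/vid_end and OBJ_VLAN() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Generic part shared by every object type (contents are hypothetical). */
struct obj {
	int id;
};

/* A specific object type that embeds the generic part. */
struct obj_vlan {
	struct obj obj;
	uint16_t vid_begin;
	uint16_t vid_end;
};

/* Step back from a member pointer to the structure that contains it. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Typed accessor in the spirit of the patch's helper macros. */
#define OBJ_VLAN(o) container_of(o, struct obj_vlan, obj)

/* A callback that receives only the generic pointer. */
static int vlan_range_len(struct obj *o)
{
	struct obj_vlan *vlan = OBJ_VLAN(o);

	return vlan->vid_end - vlan->vid_begin + 1;
}
```

Compared with a bare void pointer, the callback signature now names a real type, and each conversion to a specific object is spelled out at the call site.
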
checkpoint/restore of seccomp filters v3

2015-09-30 Thread Tycho Andersen
Hi all,

Here's a re-worked set for c/r of seccomp filters which keeps around the
original bpf program passed to the kernel instead of trying to dump the
ebpf version. There are various comments/questions in the individual patch
notes.

I'm not sure this needs to go via net-next any more, as the impact in net/
is fairly minimal, and it seems more seccomp heavy. As such, this set is
based on seccomp/tip.

Thoughts welcome,

Tycho

P.S. Man page patches to come once we agree on the API :)



[RFT v3] geneve: implement support for IPv6-based tunnels

2015-09-30 Thread John W. Linville
Signed-off-by: John W. Linville 
---
v3: 
- declare geneve_remote_unspec as static

v2:
- do not require remote address for tx on metadata tunnels
- pass correct sockaddr family to udp_tun_rx_dst in geneve_rx
- accommodate both ipv4 and ipv6 sockets open on same tunnel
- move declaration of geneve_get_dst for aesthetic purposes

 drivers/net/geneve.c | 430 ---
 include/uapi/linux/if_link.h |   1 +
 2 files changed, 368 insertions(+), 63 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 8f5c02eed47d..291d3d7754a8 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -46,16 +46,28 @@ struct geneve_net {
 
 static int geneve_net_id;
 
+union geneve_addr {
+   struct sockaddr_in sin;
+   struct sockaddr_in6 sin6;
+   struct sockaddr sa;
+};
+
+static union geneve_addr geneve_remote_unspec = { .sa.sa_family = AF_UNSPEC, };
+
+#define GENEVE_F_IPV6  0x0001
+
 /* Pseudo network device */
 struct geneve_dev {
struct hlist_node  hlist;   /* vni hash table */
struct net *net;/* netns for packet i/o */
struct net_device  *dev;/* netdev for geneve tunnel */
-   struct geneve_sock *sock;   /* socket used for geneve tunnel */
+   struct geneve_sock *sock4;  /* IPv4 socket used for geneve tunnel */
+   struct geneve_sock *sock6;  /* IPv6 socket used for geneve tunnel */
u8 vni[3];  /* virtual network ID for tunnel */
u8 ttl; /* TTL override */
u8 tos; /* TOS override */
-   struct sockaddr_in remote;  /* IPv4 address for link partner */
+   u32 flags;   /* GENEVE_F_* above */
+   union geneve_addr  remote;  /* IP address for link partner */
struct list_head   next;/* geneve's per namespace list */
__be16 dst_port;
bool   collect_md;
@@ -103,11 +115,32 @@ static struct geneve_dev *geneve_lookup(struct 
geneve_sock *gs,
vni_list_head = &gs->vni_list[hash];
hlist_for_each_entry_rcu(geneve, vni_list_head, hlist) {
if (!memcmp(vni, geneve->vni, sizeof(geneve->vni)) &&
-   addr == geneve->remote.sin_addr.s_addr)
+   addr == geneve->remote.sin.sin_addr.s_addr)
+   return geneve;
+   }
+   return NULL;
+}
+
+#if IS_ENABLED(CONFIG_IPV6)
+static struct geneve_dev *geneve6_lookup(struct geneve_sock *gs,
+struct in6_addr addr6, u8 vni[])
+{
+   struct hlist_head *vni_list_head;
+   struct geneve_dev *geneve;
+   __u32 hash;
+
+   /* Find the device for this VNI */
+   hash = geneve_net_vni_hash(vni);
+   vni_list_head = &gs->vni_list[hash];
+   hlist_for_each_entry_rcu(geneve, vni_list_head, hlist) {
+   if (!memcmp(vni, geneve->vni, sizeof(geneve->vni)) &&
+   !memcmp(&addr6, &geneve->remote.sin6.sin6_addr,
+   sizeof(addr6)))
return geneve;
}
return NULL;
 }
+#endif
 
 static inline struct genevehdr *geneve_hdr(const struct sk_buff *skb)
 {
@@ -121,24 +154,47 @@ static void geneve_rx(struct geneve_sock *gs, struct 
sk_buff *skb)
struct metadata_dst *tun_dst = NULL;
struct geneve_dev *geneve = NULL;
struct pcpu_sw_netstats *stats;
-   struct iphdr *iph;
-   u8 *vni;
+   struct iphdr *iph = NULL;
__be32 addr;
-   int err;
+   static u8 zero_vni[3];
+   u8 *vni;
+   int err = 0;
+   sa_family_t sa_family;
+#if IS_ENABLED(CONFIG_IPV6)
+   struct ipv6hdr *ip6h = NULL;
+   struct in6_addr addr6;
+   static struct in6_addr zero_addr6;
+#endif
+
+   sa_family = gs->sock->sk->sk_family;
 
-   iph = ip_hdr(skb); /* outer IP header... */
+   if (sa_family == AF_INET) {
+   iph = ip_hdr(skb); /* outer IP header... */
 
-   if (gs->collect_md) {
-   static u8 zero_vni[3];
+   if (gs->collect_md) {
+   vni = zero_vni;
+   addr = 0;
+   } else {
+   vni = gnvh->vni;
 
-   vni = zero_vni;
-   addr = 0;
-   } else {
-   vni = gnvh->vni;
-   addr = iph->saddr;
-   }
+   addr = iph->saddr;
+   }
+
+   geneve = geneve_lookup(gs, addr, vni);
+   } else if (sa_family == AF_INET6) {
+   ip6h = ipv6_hdr(skb); /* outer IPv6 header... */
+
+   if (gs->collect_md) {
+   vni = zero_vni;
+   addr6 = zero_addr6;
+   } else {
+   vni = gnvh->vni;
+
+   addr6 = ip6h->saddr;
+   }
 
-   geneve = geneve_lookup(gs, addr, 

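The union geneve_addr introduced by this patch lets one field hold either an IPv4 or an IPv6 remote, with the address family deciding which member is valid. A user-space sketch of that idea, using the standard socket address types, is shown below. union tunnel_addr and tunnel_addr_equal() are hypothetical names, and the comparison rules (an unknown family never matches) are a choice made for this example, not the driver's behaviour.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Same shape as the patch's union geneve_addr. */
union tunnel_addr {
	struct sockaddr_in sin;
	struct sockaddr_in6 sin6;
	struct sockaddr sa;
};

/* Compare two remotes; the family selects which member is meaningful. */
static int tunnel_addr_equal(const union tunnel_addr *a,
			     const union tunnel_addr *b)
{
	if (a->sa.sa_family != b->sa.sa_family)
		return 0;

	switch (a->sa.sa_family) {
	case AF_INET:
		return a->sin.sin_addr.s_addr == b->sin.sin_addr.s_addr;
	case AF_INET6:
		return memcmp(&a->sin6.sin6_addr, &b->sin6.sin6_addr,
			      sizeof(struct in6_addr)) == 0;
	default:
		return 0;	/* unknown or unspecified family never matches */
	}
}
```

Reading sa.sa_family through the generic member works because all three socket address types start with the family field, which is what makes a single union practical here.
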
Re: [PATCH v3 2/5] seccomp: add the concept of a seccomp filter FD

2015-09-30 Thread Andy Lutomirski
On Wed, Sep 30, 2015 at 11:36 AM, Tycho Andersen
 wrote:
> On Wed, Sep 30, 2015 at 11:27:34AM -0700, Andy Lutomirski wrote:
>> On Wed, Sep 30, 2015 at 11:13 AM, Tycho Andersen
>>  wrote:
>> > This patch introduces the concept of a seccomp fd, with a similar interface
>> > and usage to ebpf fds. Initially, one is allowed to create, install, and
>> > dump these fds. Any manipulation of seccomp fds requires users to be root
>> > in their own user namespace, matching the checks done for
>> > SECCOMP_SET_MODE_FILTER.
>> >
>> > Installing a filterfd has some gotchas, though. Andy mentioned previously
>> > that we should restrict installation to filter fds whose parent is already
>> > in the filter tree. This doesn't quite work in the case of created seccomp
>> > fds, since once you install a filter fd, you can't install any other filter
>> > fd since it has no parent and there is no way to "pre-chain" filters before
>> > installing them.
>>
>> ISTM, if we like the seccomp fd approach, we should have them be
>> created with a parent already set.  IOW the default should be that
>> their parent is the creator's seccomp fd and, if needed, creators
>> could specify a different parent.
>
> Allowing people doing SECCOMP_FD_NEW to specify a parent fd would
> work. Then we can disallow installing a seccomp fd if its parent is
> not the current filter, and get rid of the whole mess with prev
> locking and all that.
>

Yes, please.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 14/14] RDS: IB: split mr pool to improve 8K messages performance

2015-09-30 Thread Santosh Shilimkar
8K message sizes are a pretty important use case for current RDS
workloads, so we make provision to have 8K MRs available from the pool.
Based on the number of SGs in the RDS message, we pick a pool to use.

Also, to make sure that we don't underutilise MRs when, say, 8K messages
are dominating, which could lead to the 8K pool being exhausted, we fall
back to the 1M pool until the 8K pool recovers for use.

This helps to at least push ~55 kB/s bidirectional data which
is a nice improvement.
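
The pool choice described above can be sketched in plain userspace C. This is only an illustration under stated assumptions: struct mr_pool, pick_pool() and the SMALL_MSG_MAX_SG threshold are hypothetical names and values, not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an MR pool: it only tracks free entries. */
struct mr_pool {
    const char *name;
    unsigned int free_mrs;
};

/* Assumed threshold: messages with at most this many SG entries are "8K". */
#define SMALL_MSG_MAX_SG 2

/*
 * Pick a pool by SG count. Small messages prefer the 8K pool and fall
 * back to the 1M pool when the 8K pool is exhausted. Returns NULL when
 * no pool can satisfy the request.
 */
struct mr_pool *pick_pool(struct mr_pool *pool_8k, struct mr_pool *pool_1m,
                          unsigned int nr_sg)
{
    assert(pool_8k && pool_1m);

    if (nr_sg <= SMALL_MSG_MAX_SG && pool_8k->free_mrs > 0)
        return pool_8k;
    if (pool_1m->free_mrs > 0)
        return pool_1m;
    return NULL;
}
```

Large messages never borrow from the 8K pool in this sketch, since an 8K MR could not map them.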

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.c   |  47 +
 net/rds/ib.h   |  43 ---
 net/rds/ib_rdma.c  | 101 +
 net/rds/ib_stats.c |  18 ++
 4 files changed, 147 insertions(+), 62 deletions(-)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index 883813a..a833ab7 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -43,14 +43,14 @@
 #include "rds.h"
 #include "ib.h"
 
-static unsigned int fmr_pool_size = RDS_FMR_POOL_SIZE;
-unsigned int fmr_message_size = RDS_FMR_SIZE + 1; /* +1 allows for unaligned MRs */
+unsigned int rds_ib_fmr_1m_pool_size = RDS_FMR_1M_POOL_SIZE;
+unsigned int rds_ib_fmr_8k_pool_size = RDS_FMR_8K_POOL_SIZE;
 unsigned int rds_ib_retry_count = RDS_IB_DEFAULT_RETRY_COUNT;
 
-module_param(fmr_pool_size, int, 0444);
-MODULE_PARM_DESC(fmr_pool_size, " Max number of fmr per HCA");
-module_param(fmr_message_size, int, 0444);
-MODULE_PARM_DESC(fmr_message_size, " Max size of a RDMA transfer");
+module_param(rds_ib_fmr_1m_pool_size, int, 0444);
+MODULE_PARM_DESC(rds_ib_fmr_1m_pool_size, " Max number of 1M fmr per HCA");
+module_param(rds_ib_fmr_8k_pool_size, int, 0444);
+MODULE_PARM_DESC(rds_ib_fmr_8k_pool_size, " Max number of 8K fmr per HCA");
 module_param(rds_ib_retry_count, int, 0444);
 MODULE_PARM_DESC(rds_ib_retry_count, " Number of hw retries before reporting an error");
 
@@ -97,8 +97,10 @@ static void rds_ib_dev_free(struct work_struct *work)
struct rds_ib_device *rds_ibdev = container_of(work,
struct rds_ib_device, free_work);
 
-   if (rds_ibdev->mr_pool)
-   rds_ib_destroy_mr_pool(rds_ibdev->mr_pool);
+   if (rds_ibdev->mr_8k_pool)
+   rds_ib_destroy_mr_pool(rds_ibdev->mr_8k_pool);
+   if (rds_ibdev->mr_1m_pool)
+   rds_ib_destroy_mr_pool(rds_ibdev->mr_1m_pool);
if (rds_ibdev->pd)
ib_dealloc_pd(rds_ibdev->pd);
 
@@ -148,9 +150,13 @@ static void rds_ib_add_one(struct ib_device *device)
rds_ibdev->max_sge = min(dev_attr->max_sge, RDS_IB_MAX_SGE);
 
rds_ibdev->fmr_max_remaps = dev_attr->max_map_per_fmr?: 32;
-   rds_ibdev->max_fmrs = dev_attr->max_mr ?
-   min_t(unsigned int, dev_attr->max_mr, fmr_pool_size) :
-   fmr_pool_size;
+   rds_ibdev->max_1m_fmrs = dev_attr->max_mr ?
+   min_t(unsigned int, (dev_attr->max_mr / 2),
+ rds_ib_fmr_1m_pool_size) : rds_ib_fmr_1m_pool_size;
+
+   rds_ibdev->max_8k_fmrs = dev_attr->max_mr ?
+   min_t(unsigned int, ((dev_attr->max_mr / 2) * RDS_MR_8K_SCALE),
+ rds_ib_fmr_8k_pool_size) : rds_ib_fmr_8k_pool_size;
 
rds_ibdev->max_initiator_depth = dev_attr->max_qp_init_rd_atom;
rds_ibdev->max_responder_resources = dev_attr->max_qp_rd_atom;
@@ -162,12 +168,25 @@ static void rds_ib_add_one(struct ib_device *device)
goto put_dev;
}
 
-   rds_ibdev->mr_pool = rds_ib_create_mr_pool(rds_ibdev);
-   if (IS_ERR(rds_ibdev->mr_pool)) {
-   rds_ibdev->mr_pool = NULL;
+   rds_ibdev->mr_1m_pool =
+   rds_ib_create_mr_pool(rds_ibdev, RDS_IB_MR_1M_POOL);
+   if (IS_ERR(rds_ibdev->mr_1m_pool)) {
+   rds_ibdev->mr_1m_pool = NULL;
goto put_dev;
}
 
+   rds_ibdev->mr_8k_pool =
+   rds_ib_create_mr_pool(rds_ibdev, RDS_IB_MR_8K_POOL);
+   if (IS_ERR(rds_ibdev->mr_8k_pool)) {
+   rds_ibdev->mr_8k_pool = NULL;
+   goto put_dev;
+   }
+
+   rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, fmr_max_remaps = %d, max_1m_fmrs = %d, max_8k_fmrs = %d\n",
+dev_attr->max_fmr, rds_ibdev->max_wrs, rds_ibdev->max_sge,
+rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_fmrs,
+rds_ibdev->max_8k_fmrs);
+
INIT_LIST_HEAD(&rds_ibdev->ipaddr_list);
INIT_LIST_HEAD(&rds_ibdev->conn_list);
 
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 3a8cd31..f17d095 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -9,8 +9,11 @@
 #include "rds.h"
 #include "rdma_transport.h"
 
-#define RDS_FMR_SIZE   256
-#define RDS_FMR_POOL_SIZE  8192
+#define RDS_FMR_1M_POOL_SIZE   (8192 / 2)
+#define RDS_FMR_1M_MSG_SIZE    256
+#define RDS_FMR_8K_MSG_SIZE   

[PATCH net-next 4/5] bridge: vlan: fix possible null ptr derefs on port init and deinit

2015-09-30 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

When a new port is being added we need to make vlgrp available after
rhashtable has been initialized and when removing a port we need to
flush the vlans and free the resources after we're sure no one can use
the port, i.e. after it's removed from the port list and synchronize_rcu
is executed.
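
The "initialize fully, then publish" ordering can be illustrated with C11 atomics in userspace. This is a minimal sketch under assumptions: struct vlan_group, struct port and both helpers are hypothetical stand-ins, and the release/acquire pair only approximates what the kernel does with its own barriers and RCU.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct net_bridge_vlan_group. */
struct vlan_group {
    int table_ready;          /* set once the (imaginary) hash table is set up */
    unsigned int num_vlans;
};

struct port {
    _Atomic(struct vlan_group *) vlgrp;   /* NULL until fully initialised */
};

/* Build the group privately and publish the pointer only as the last step. */
int port_vlan_init(struct port *p)
{
    struct vlan_group *vg = calloc(1, sizeof(*vg));

    if (!vg)
        return -1;
    vg->table_ready = 1;      /* all initialisation happens before publishing */
    atomic_store_explicit(&p->vlgrp, vg, memory_order_release);
    return 0;
}

/* Readers must tolerate a port whose group has not been published yet. */
unsigned int port_num_vlans(struct port *p)
{
    struct vlan_group *vg =
        atomic_load_explicit(&p->vlgrp, memory_order_acquire);

    return vg ? vg->num_vlans : 0;
}
```

A reader that observes a non-NULL pointer is then guaranteed to see the initialised contents, and a reader that observes NULL simply treats the port as having no vlans.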

Signed-off-by: Nikolay Aleksandrov 
---
 net/bridge/br_if.c   |  3 ++-
 net/bridge/br_vlan.c | 16 ++--
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 45e4757c6fd2..934cae9fa317 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -248,7 +248,6 @@ static void del_nbp(struct net_bridge_port *p)
 
list_del_rcu(&p->list);
 
-   nbp_vlan_flush(p);
br_fdb_delete_by_port(br, p, 0, 1);
nbp_update_port_count(br);
 
@@ -257,6 +256,8 @@ static void del_nbp(struct net_bridge_port *p)
dev->priv_flags &= ~IFF_BRIDGE_PORT;
 
netdev_rx_handler_unregister(dev);
+   /* use the synchronize_rcu done by netdev_rx_handler_unregister */
+   nbp_vlan_flush(p);
 
br_multicast_del_port(p);
 
diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index 90ac4b0c55c1..7e9d60a402e2 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -854,16 +854,20 @@ err_rhtbl:
 
 int nbp_vlan_init(struct net_bridge_port *p)
 {
+   struct net_bridge_vlan_group *vg;
int ret = -ENOMEM;
 
-   p->vlgrp = kzalloc(sizeof(struct net_bridge_vlan_group), GFP_KERNEL);
-   if (!p->vlgrp)
+   vg = kzalloc(sizeof(struct net_bridge_vlan_group), GFP_KERNEL);
+   if (!vg)
goto out;
 
-   ret = rhashtable_init(&p->vlgrp->vlan_hash, &br_vlan_rht_params);
+   ret = rhashtable_init(&vg->vlan_hash, &br_vlan_rht_params);
if (ret)
goto err_rhtbl;
-   INIT_LIST_HEAD(&p->vlgrp->vlan_list);
+   INIT_LIST_HEAD(&vg->vlan_list);
+   /* Make sure everything's committed before publishing vg */
+   smp_wmb();
+   p->vlgrp = vg;
if (p->br->default_pvid) {
ret = nbp_vlan_add(p, p->br->default_pvid,
   BRIDGE_VLAN_INFO_PVID |
@@ -875,9 +879,9 @@ out:
return ret;
 
 err_vlan_add:
-   rhashtable_destroy(&p->vlgrp->vlan_hash);
+   rhashtable_destroy(&vg->vlan_hash);
 err_rhtbl:
-   kfree(p->vlgrp);
+   kfree(vg);
 
goto out;
 }
-- 
2.4.3



[PATCH v3 1/5] seccomp: save the original filter

2015-09-30 Thread Tycho Andersen
In order to implement checkpoint of seccomp filters, we need to keep track
of the original filter as the user gave it to us. Since we're doing this,
we need to also use bpf_prog_destroy to free the struct bpf_progs so we
don't leak this memory.

Signed-off-by: Tycho Andersen 
CC: Kees Cook 
CC: Will Drewry 
CC: Oleg Nesterov 
CC: Andy Lutomirski 
CC: Pavel Emelyanov 
CC: Serge E. Hallyn 
CC: Alexei Starovoitov 
CC: Daniel Borkmann 
---
 include/linux/filter.h |  2 ++
 kernel/seccomp.c   | 24 
 net/core/filter.c  |  4 ++--
 3 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index fa2cab9..6c045ba 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -410,6 +410,8 @@ int bpf_prog_create(struct bpf_prog **pfp, struct sock_fprog_kern *fprog);
 int bpf_prog_create_from_user(struct bpf_prog **pfp, struct sock_fprog *fprog,
  bpf_aux_classic_check_t trans);
 void bpf_prog_destroy(struct bpf_prog *fp);
+int bpf_prog_store_orig_filter(struct bpf_prog *fp,
+  const struct sock_fprog *fprog);
 
 int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
 int sk_attach_bpf(u32 ufd, struct sock *sk);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 5bd4779..09f3769 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -337,6 +337,14 @@ static inline void seccomp_sync_threads(void)
}
 }
 
+static inline void seccomp_filter_free(struct seccomp_filter *filter)
+{
+   if (filter) {
+   bpf_prog_destroy(filter->prog);
+   kfree(filter);
+   }
+}
+
 /**
  * seccomp_prepare_filter: Prepares a seccomp filter for use.
  * @fprog: BPF program to install
@@ -376,6 +384,14 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
return ERR_PTR(ret);
}
 
+   if (config_enabled(CONFIG_CHECKPOINT_RESTORE)) {
+   ret = bpf_prog_store_orig_filter(sfilter->prog, fprog);
+   if (ret < 0) {
+   seccomp_filter_free(sfilter);
+   return ERR_PTR(ret);
+   }
+   }
+
atomic_set(&sfilter->usage, 1);
 
return sfilter;
@@ -466,14 +482,6 @@ void get_seccomp_filter(struct task_struct *tsk)
atomic_inc(&orig->usage);
 }
 
-static inline void seccomp_filter_free(struct seccomp_filter *filter)
-{
-   if (filter) {
-   bpf_prog_free(filter->prog);
-   kfree(filter);
-   }
-}
-
 /* put_seccomp_filter - decrements the ref count of tsk->seccomp.filter */
 void put_seccomp_filter(struct task_struct *tsk)
 {
diff --git a/net/core/filter.c b/net/core/filter.c
index 13079f0..70995dd 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -832,8 +832,8 @@ static int bpf_check_classic(const struct sock_filter *filter,
return -EINVAL;
 }
 
-static int bpf_prog_store_orig_filter(struct bpf_prog *fp,
- const struct sock_fprog *fprog)
+int bpf_prog_store_orig_filter(struct bpf_prog *fp,
+  const struct sock_fprog *fprog)
 {
unsigned int fsize = bpf_classic_proglen(fprog);
struct sock_fprog_kern *fkprog;
-- 
2.5.0



[PATCH net-next 1/5] bridge: vlan: adjust rhashtable initial size and hash locks size

2015-09-30 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

As Stephen pointed out the default initial size is more than we need, so
let's start small (4 elements, thus nelem_hint = 3). Also limit the hash
locks to the number of CPUs as we don't need any write-side scaling and
this looks like the minimum.
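
As a worked check of "nelem_hint = 3 gives 4 elements", here is a small sketch. It assumes the initial table size is computed as nelem_hint * 4 / 3 rounded up to a power of two, which is how I understand rhashtable to size its first table; please verify against lib/rhashtable.c in the target tree. The helper names are hypothetical.

```c
#include <assert.h>

/* Round v up to the next power of two (requires v >= 1). */
unsigned long roundup_pow2(unsigned long v)
{
    unsigned long r = 1;

    assert(v >= 1);
    while (r < v)
        r <<= 1;
    return r;
}

/* Assumed sizing rule: aim for about 75% load, then round up. */
unsigned long initial_buckets(unsigned long nelem_hint)
{
    return roundup_pow2(nelem_hint * 4 / 3);
}
```

Under that assumption a hint of 3 yields 3 * 4 / 3 = 4 buckets, while a hint of 4 would already round up to 8.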

Signed-off-by: Nikolay Aleksandrov 
---
 net/bridge/br_vlan.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index e227164bc3e1..283d012c3d89 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -19,6 +19,8 @@ static const struct rhashtable_params br_vlan_rht_params = {
.head_offset = offsetof(struct net_bridge_vlan, vnode),
.key_offset = offsetof(struct net_bridge_vlan, vid),
.key_len = sizeof(u16),
+   .nelem_hint = 3,
+   .locks_mul = 1,
.max_size = VLAN_N_VID,
.obj_cmpfn = br_vlan_cmp,
.automatic_shrinking = true,
-- 
2.4.3



[PATCH net-next 3/5] bridge: vlan: move pvid inside net_bridge_vlan_group

2015-09-30 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

One obvious way to converge more code (which was also used by the
previous vlan code) is to move pvid inside net_bridge_vlan_group. This
allows us to simplify some and remove other port-specific functions.
Also gives us the ability to simply pass the vlan group and use all of the
contained information.

Signed-off-by: Nikolay Aleksandrov 
---
 net/bridge/br_device.c  |   2 +-
 net/bridge/br_input.c   |   2 +-
 net/bridge/br_netlink.c |  42 +---
 net/bridge/br_private.h |  44 ++---
 net/bridge/br_vlan.c| 103 
 5 files changed, 75 insertions(+), 118 deletions(-)

diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index c915c5b408ea..bdfb9544ca03 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -56,7 +56,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
skb_reset_mac_header(skb);
skb_pull(skb, ETH_HLEN);
 
-   if (!br_allowed_ingress(br, skb, &vid))
+   if (!br_allowed_ingress(br, br_vlan_group(br), skb, &vid))
goto out;
 
if (is_broadcast_ether_addr(dest))
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index e27d0dfd2ee9..f5c5a4500e2f 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -140,7 +140,7 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
if (!p || p->state == BR_STATE_DISABLED)
goto drop;
 
-   if (!nbp_allowed_ingress(p, skb, &vid))
+   if (!br_allowed_ingress(p->br, nbp_vlan_group(p), skb, &vid))
goto out;
 
/* insert into forwarding database after filtering to avoid spoofing */
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index bb8bb7b36f04..c64dcad11662 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -22,17 +22,17 @@
 #include "br_private_stp.h"
 
 static int __get_num_vlan_infos(struct net_bridge_vlan_group *vg,
-   u32 filter_mask,
-   u16 pvid)
+   u32 filter_mask)
 {
struct net_bridge_vlan *v;
u16 vid_range_start = 0, vid_range_end = 0, vid_range_flags = 0;
-   u16 flags;
+   u16 flags, pvid;
int num_vlans = 0;
 
if (!(filter_mask & RTEXT_FILTER_BRVLAN_COMPRESSED))
return 0;
 
+   pvid = br_get_pvid(vg);
/* Count number of vlan infos */
list_for_each_entry(v, &vg->vlan_list, vlist) {
flags = 0;
@@ -74,7 +74,7 @@ initvars:
 }
 
 static int br_get_num_vlan_infos(struct net_bridge_vlan_group *vg,
-u32 filter_mask, u16 pvid)
+u32 filter_mask)
 {
if (!vg)
return 0;
@@ -82,7 +82,7 @@ static int br_get_num_vlan_infos(struct net_bridge_vlan_group *vg,
if (filter_mask & RTEXT_FILTER_BRVLAN)
return vg->num_vlans;
 
-   return __get_num_vlan_infos(vg, filter_mask, pvid);
+   return __get_num_vlan_infos(vg, filter_mask);
 }
 
 static size_t br_get_link_af_size_filtered(const struct net_device *dev,
@@ -92,19 +92,16 @@ static size_t br_get_link_af_size_filtered(const struct net_device *dev,
struct net_bridge_port *p;
struct net_bridge *br;
int num_vlan_infos;
-   u16 pvid = 0;
 
rcu_read_lock();
if (br_port_exists(dev)) {
p = br_port_get_rcu(dev);
vg = nbp_vlan_group(p);
-   pvid = nbp_get_pvid(p);
} else if (dev->priv_flags & IFF_EBRIDGE) {
br = netdev_priv(dev);
vg = br_vlan_group(br);
-   pvid = br_get_pvid(br);
}
-   num_vlan_infos = br_get_num_vlan_infos(vg, filter_mask, pvid);
+   num_vlan_infos = br_get_num_vlan_infos(vg, filter_mask);
rcu_read_unlock();
 
/* Each VLAN is returned in bridge_vlan_info along with flags */
@@ -196,18 +193,18 @@ nla_put_failure:
 }
 
 static int br_fill_ifvlaninfo_compressed(struct sk_buff *skb,
-struct net_bridge_vlan_group *vg,
-u16 pvid)
+struct net_bridge_vlan_group *vg)
 {
struct net_bridge_vlan *v;
u16 vid_range_start = 0, vid_range_end = 0, vid_range_flags = 0;
-   u16 flags;
+   u16 flags, pvid;
int err = 0;
 
/* Pack IFLA_BRIDGE_VLAN_INFO's for every vlan
 * and mark vlan info with begin and end flags
 * if vlaninfo represents a range
 */
+   pvid = br_get_pvid(vg);
list_for_each_entry(v, &vg->vlan_list, vlist) {
flags = 0;
if (!br_vlan_should_use(v))
@@ -251,12 +248,13 @@ initvars:
 }
 
 static int br_fill_ifvlaninfo(struct sk_buff *skb,
- struct 

[PATCH v3 2/5] seccomp: add the concept of a seccomp filter FD

2015-09-30 Thread Tycho Andersen
This patch introduces the concept of a seccomp fd, with a similar interface
and usage to ebpf fds. Initially, one is allowed to create, install, and
dump these fds. Any manipulation of seccomp fds requires users to be root
in their own user namespace, matching the checks done for
SECCOMP_SET_MODE_FILTER.

Installing a filterfd has some gotchas, though. Andy mentioned previously
that we should restrict installation to filter fds whose parent is already
in the filter tree. This doesn't quite work in the case of created seccomp
fds, since once you install a filter fd, you can't install any other filter
fd since it has no parent and there is no way to "pre-chain" filters before
installing them. To work around this, we allow installing filters that have
no parent. If the filter has a parent, we require the current filter to
be an ancestor of it.

I'm not quite sure that the ancestor restriction is correct, since it can
still allow for "re-parenting" of filters, potentially introducing new
filters to a task. However, since these operations are limited to root in
the user ns, perhaps it is ok. There is also some potentially racy
behavior where a task re-parents a filter that another task has installed.

One option to work around this is to keep a bit on struct seccomp_filter to
allow each filter to have its parent set exactly once. (This would still
allow you to install a filter multiple times, as long as the parent was the
same in each case.)
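
The install rule discussed above (accept parentless filters, otherwise require the current filter to be an ancestor) can be sketched as follows. This is a userspace illustration, not the patch's code: struct filter, is_ancestor() and may_install() are hypothetical, and rejecting a parented filter when the task has no filter at all is my assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct seccomp_filter. */
struct filter {
    struct filter *prev;      /* parent in the filter tree, NULL for a root */
};

/* True if 'anc' is 'f' itself or reachable from 'f' through prev links. */
bool is_ancestor(const struct filter *anc, const struct filter *f)
{
    for (; f; f = f->prev)
        if (f == anc)
            return true;
    return false;
}

/*
 * Parentless filters are always accepted. Otherwise the task's current
 * filter must appear somewhere on the new filter's parent chain.
 */
bool may_install(const struct filter *current, const struct filter *newf)
{
    assert(newf);

    if (!newf->prev)
        return true;
    return current && is_ancestor(current, newf->prev);
}
```

The sketch makes the re-parenting concern visible: may_install() only inspects the chain at call time, so nothing here prevents prev from being changed afterwards.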

Signed-off-by: Tycho Andersen 
CC: Kees Cook 
CC: Will Drewry 
CC: Oleg Nesterov 
CC: Andy Lutomirski 
CC: Pavel Emelyanov 
CC: Serge E. Hallyn 
CC: Alexei Starovoitov 
CC: Daniel Borkmann 
---
 include/uapi/linux/seccomp.h |  24 ++
 kernel/seccomp.c | 189 ++-
 2 files changed, 210 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 0f238a4..4ee8770 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -13,10 +13,16 @@
 /* Valid operations for seccomp syscall. */
 #define SECCOMP_SET_MODE_STRICT    0
 #define SECCOMP_SET_MODE_FILTER    1
+#define SECCOMP_FILTER_FD  2
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
 #define SECCOMP_FILTER_FLAG_TSYNC  1
 
+/* Valid commands for SECCOMP_FILTER_FD */
+#define SECCOMP_FD_NEW 0
+#define SECCOMP_FD_INSTALL 1
+#define SECCOMP_FD_DUMP    2
+
 /*
  * All BPF programs must return a 32-bit value.
  * The bottom 16-bits are for optional return data.
@@ -51,4 +57,22 @@ struct seccomp_data {
__u64 args[6];
 };
 
+struct seccomp_fd {
+   __u32   size;
+
+   union {
+   /* SECCOMP_FD_NEW */
+   struct sock_fprog __user *new_prog;
+
+   /* SECCOMP_FD_INSTALL */
+   int install_fd;
+
+   /* SECCOMP_FD_DUMP */
+   struct {
+   int dump_fd;
+   struct sock_filter __user   *insns;
+   };
+   };
+};
+
 #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 09f3769..6f0465c 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -26,6 +26,8 @@
 #endif
 
 #ifdef CONFIG_SECCOMP_FILTER
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -58,6 +60,7 @@ struct seccomp_filter {
atomic_t usage;
struct seccomp_filter *prev;
struct bpf_prog *prog;
+   spinlock_t prev_lock;
 };
 
 /* Limit any path through the tree to 256KB worth of instructions. */
@@ -393,6 +396,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
}
 
atomic_set(&sfilter->usage, 1);
+   sfilter->prev_lock = __SPIN_LOCK_UNLOCKED(&sfilter->prev_lock);
 
return sfilter;
 }
@@ -441,6 +445,7 @@ static long seccomp_attach_filter(unsigned int flags,
struct seccomp_filter *walker;
 
assert_spin_locked(&current->sighand->siglock);
+   assert_spin_locked(&filter->prev_lock);
 
/* Validate resulting filter length. */
total_insns = filter->prog->len;
@@ -482,10 +487,8 @@ void get_seccomp_filter(struct task_struct *tsk)
atomic_inc(&orig->usage);
 }
 
-/* put_seccomp_filter - decrements the ref count of tsk->seccomp.filter */
-void put_seccomp_filter(struct task_struct *tsk)
+static void seccomp_filter_decref(struct seccomp_filter *orig)
 {
-   struct seccomp_filter *orig = tsk->seccomp.filter;
/* Clean up single-reference branches iteratively. */
while (orig && atomic_dec_and_test(&orig->usage)) {
struct seccomp_filter *freeme = orig;
@@ -494,6 +497,12 @@ void put_seccomp_filter(struct task_struct *tsk)
}
 }
 
+/* put_seccomp_filter - decrements the ref count of 

[PATCH v3 4/5] kcmp: add KCMP_FILE_PRIVATE_DATA

2015-09-30 Thread Tycho Andersen
This command allows comparing the underlying private data of two fds. This
is useful e.g. to find out if a seccomp filter is inherited, since struct
seccomp_filter are unique across tasks and are the private_data of seccomp
fds.

Signed-off-by: Tycho Andersen 
CC: Kees Cook 
CC: Will Drewry 
CC: Oleg Nesterov 
CC: Andy Lutomirski 
CC: Pavel Emelyanov 
CC: Serge E. Hallyn 
CC: Alexei Starovoitov 
CC: Daniel Borkmann 
---
 include/uapi/linux/kcmp.h |  1 +
 kernel/kcmp.c | 14 ++
 2 files changed, 15 insertions(+)

diff --git a/include/uapi/linux/kcmp.h b/include/uapi/linux/kcmp.h
index 84df14b..ed389d2 100644
--- a/include/uapi/linux/kcmp.h
+++ b/include/uapi/linux/kcmp.h
@@ -10,6 +10,7 @@ enum kcmp_type {
KCMP_SIGHAND,
KCMP_IO,
KCMP_SYSVSEM,
+   KCMP_FILE_PRIVATE_DATA,
 
KCMP_TYPES,
 };
diff --git a/kernel/kcmp.c b/kernel/kcmp.c
index 0aa69ea..9ae673b 100644
--- a/kernel/kcmp.c
+++ b/kernel/kcmp.c
@@ -165,6 +165,20 @@ SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
ret = -EOPNOTSUPP;
 #endif
break;
+   case KCMP_FILE_PRIVATE_DATA: {
+   struct file *filp1, *filp2;
+
+   filp1 = get_file_raw_ptr(task1, idx1);
+   filp2 = get_file_raw_ptr(task2, idx2);
+
+   if (filp1 && filp2)
+   ret = kcmp_ptr(filp1->private_data,
+  filp2->private_data,
+  KCMP_FILE_PRIVATE_DATA);
+   else
+   ret = -EBADF;
+   break;
+   }
default:
ret = -EINVAL;
break;
-- 
2.5.0



[PATCH net-next 0/5] bridge: vlan: cleanups & fixes

2015-09-30 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

Hi,
This is the first follow-up set. Patch 01 reduces the default rhashtable
size and the number of locks that can be allocated. Patches 02 and 04 fix
possible null pointer dereferences due to the new ordering and
initialization on port add/del, and patch 03 moves the "pvid" member into
the net_bridge_vlan_group struct in order to simplify code (similar to how
it was with the older struct). Patch 05 fixes adding a vlan on a port which
is pvid and doesn't have a global context yet.
Please review carefully, I think this is the first use of rhashtable's
"locks_mul" member in the tree and I'd like to make sure it's correct.
Another thing that needs special attention is the nbp_vlan_flush() move
after the rx_handler unregister.

Cheers,
 Nik


Nikolay Aleksandrov (5):
  bridge: vlan: adjust rhashtable initial size and hash locks size
  bridge: vlan: fix possible null vlgrp deref while registering new port
  bridge: vlan: move pvid inside net_bridge_vlan_group
  bridge: vlan: fix possible null ptr derefs on port init and deinit
  bridge: vlan: don't pass flags when creating context only

 net/bridge/br_device.c  |   2 +-
 net/bridge/br_if.c  |   3 +-
 net/bridge/br_input.c   |   2 +-
 net/bridge/br_netlink.c |  42 +++-
 net/bridge/br_private.h |  44 +
 net/bridge/br_vlan.c| 127 ++--
 6 files changed, 93 insertions(+), 127 deletions(-)

-- 
2.4.3



[PATCH net-next 2/5] bridge: vlan: fix possible null vlgrp deref while registering new port

2015-09-30 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

While a new port is being initialized the rx_handler gets set, but the
vlans get initialized later in br_add_if() and in that window if we
receive a frame with a link-local address we can try to dereference
p->vlgrp in:
br_handle_frame() -> br_handle_local_finish() -> br_should_learn()

Fix this by checking vlgrp before using it.

Signed-off-by: Nikolay Aleksandrov 
---
 net/bridge/br_vlan.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index 283d012c3d89..678d5c41b551 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -476,13 +476,15 @@ bool br_allowed_egress(struct net_bridge_vlan_group *vg,
 /* Called under RCU */
 bool br_should_learn(struct net_bridge_port *p, struct sk_buff *skb, u16 *vid)
 {
+   struct net_bridge_vlan_group *vg;
struct net_bridge *br = p->br;
 
/* If filtering was disabled at input, let it pass. */
if (!br->vlan_enabled)
return true;
 
-   if (!p->vlgrp->num_vlans)
+   vg = p->vlgrp;
+   if (!vg || !vg->num_vlans)
return false;
 
if (!br_vlan_get_tag(skb, vid) && skb->vlan_proto != br->vlan_proto)
-- 
2.4.3



Re: [PATCH v3 2/5] seccomp: add the concept of a seccomp filter FD

2015-09-30 Thread Andy Lutomirski
On Wed, Sep 30, 2015 at 11:13 AM, Tycho Andersen
 wrote:
> This patch introduces the concept of a seccomp fd, with a similar interface
> and usage to ebpf fds. Initially, one is allowed to create, install, and
> dump these fds. Any manipulation of seccomp fds requires users to be root
> in their own user namespace, matching the checks done for
> SECCOMP_SET_MODE_FILTER.
>
> Installing a filterfd has some gotchas, though. Andy mentioned previously
> that we should restrict installation to filter fds whose parent is already
> in the filter tree. This doesn't quite work in the case of created seccomp
> fds, since once you install a filter fd, you can't install any other filter
> fd since it has no parent and there is no way to "pre-chain" filters before
> installing them.

ISTM, if we like the seccomp fd approach, we should have them be
created with a parent already set.  IOW the default should be that
their parent is the creator's seccomp fd and, if needed, creators
could specify a different parent.

--Andy


Re: [PATCH v3 4/5] kcmp: add KCMP_FILE_PRIVATE_DATA

2015-09-30 Thread Andy Lutomirski
On Wed, Sep 30, 2015 at 11:13 AM, Tycho Andersen
 wrote:
> This command allows comparing the underlying private data of two fds. This
> is useful e.g. to find out if a seccomp filter is inherited, since struct
> seccomp_filter are unique across tasks and are the private_data of seccomp
> fds.

This is very implementation-specific and may have nasty ABI
consequences far outside seccomp.  Let's do something specific to
seccomp and/or eBPF.

--Andy


Re: [PATCH v3 4/5] kcmp: add KCMP_FILE_PRIVATE_DATA

2015-09-30 Thread Tycho Andersen
On Wed, Sep 30, 2015 at 11:25:41AM -0700, Andy Lutomirski wrote:
> On Wed, Sep 30, 2015 at 11:13 AM, Tycho Andersen
>  wrote:
> > This command allows comparing the underlying private data of two fds. This
> > is useful e.g. to find out if a seccomp filter is inherited, since struct
> > seccomp_filter are unique across tasks and are the private_data of seccomp
> > fds.
> 
> This is very implementation-specific and may have nasty ABI
> consequences far outside seccomp.  Let's do something specific to
> seccomp and/or eBPF.

We could change the name to a less generic KCMP_SECCOMP_FD or
something, but without some sort of GUID on each struct
seccomp_filter, the implementation would be effectively the same as it
is today. Is that enough, or do we need a GUID?

Tycho


[PATCH v2 13/14] RDS: IB: use max_mr from HCA caps than max_fmr

2015-09-30 Thread Santosh Shilimkar
All HCA drivers seem to populate the max_mr caps, and a few of
them populate both max_mr and max_fmr.

Hence update RDS code to make use of max_mr.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index 2d3f2ab..883813a 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -148,8 +148,8 @@ static void rds_ib_add_one(struct ib_device *device)
rds_ibdev->max_sge = min(dev_attr->max_sge, RDS_IB_MAX_SGE);
 
rds_ibdev->fmr_max_remaps = dev_attr->max_map_per_fmr?: 32;
-   rds_ibdev->max_fmrs = dev_attr->max_fmr ?
-   min_t(unsigned int, dev_attr->max_fmr, fmr_pool_size) :
+   rds_ibdev->max_fmrs = dev_attr->max_mr ?
+   min_t(unsigned int, dev_attr->max_mr, fmr_pool_size) :
fmr_pool_size;
 
rds_ibdev->max_initiator_depth = dev_attr->max_qp_init_rd_atom;
-- 
1.9.1



[PATCH v2 05/14] RDS: defer the over_batch work to send worker

2015-09-30 Thread Santosh Shilimkar
The current process gives up if its send work goes over the batch limit.
The work queue will get kicked to finish off any other requests.
This fixes the remaining condition from commit 443be0e5affe ("RDS: make
sure not to loop forever inside rds_send_xmit").

The restart condition is only for the case where we reached the
over_batch code for some other reason, so we just retry again
before giving up.
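
The decision this change introduces can be reduced to a small pure function. The sketch below is an illustration under assumptions: enum next_step and after_send_pass() are hypothetical names, and must_retry stands for the existing "work is still queued and nobody else took over" check. Only the 1024 limit comes from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define BATCH_LIMIT 1024      /* same limit as in the patch */

enum next_step {
    STEP_DONE,                /* nothing left for this caller to do */
    STEP_RESTART,             /* loop again in the calling context */
    STEP_DEFER                /* hand the remainder to the send worker */
};

/* Decide what to do after one pass over the send queue. */
enum next_step after_send_pass(bool must_retry, unsigned int batch_count)
{
    if (!must_retry)
        return STEP_DONE;
    if (batch_count < BATCH_LIMIT)
        return STEP_RESTART;
    return STEP_DEFER;
}
```

Bounding the restart by the batch count is what keeps a busy sender from looping in the caller's context indefinitely.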

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/send.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/rds/send.c b/net/rds/send.c
index 4df61a5..f1e709c 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -423,7 +423,9 @@ over_batch:
 !list_empty(&conn->c_send_queue)) &&
send_gen == conn->c_send_gen) {
rds_stats_inc(s_send_lock_queue_raced);
-   goto restart;
+   if (batch_count < 1024)
+   goto restart;
+   queue_delayed_work(rds_wq, &conn->c_send_w, 1);
}
}
 out:
-- 
1.9.1



[PATCH v2 03/14] RDS: fix rds_sock reference bug while doing bind

2015-09-30 Thread Santosh Shilimkar
One needs to take an rds socket reference while using it and release it
once done with it. The rds_add_bind() code path does not do that, so
let's fix it.
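
The convention the fix moves to (the lookup hands back its match with a reference already held, and callers drop it when they do not keep the socket) can be sketched like this. It is a single-threaded userspace illustration: struct sock_entry and the helpers are hypothetical, and a plain int stands in for the real atomic reference count and the bind lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rds_sock. */
struct sock_entry {
    unsigned int port;
    int refcnt;
    int dead;
};

void sock_get(struct sock_entry *s)
{
    s->refcnt++;
}

void sock_put(struct sock_entry *s)
{
    assert(s->refcnt > 0);
    s->refcnt--;
}

/* The lookup takes the reference itself, before returning the match. */
struct sock_entry *bind_lookup(struct sock_entry *tbl, size_t n,
                               unsigned int port)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].port == port) {
            sock_get(&tbl[i]);
            return &tbl[i];
        }
    }
    return NULL;
}

/* A caller that rejects the socket must drop the reference it was given. */
struct sock_entry *find_bound(struct sock_entry *tbl, size_t n,
                              unsigned int port)
{
    struct sock_entry *s = bind_lookup(tbl, n, port);

    if (s && s->dead) {
        sock_put(s);
        s = NULL;
    }
    return s;
}
```

Taking the reference inside the lookup closes the window in which the socket could be freed between the match and the caller's own addref.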

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/bind.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/net/rds/bind.c b/net/rds/bind.c
index 01989e2..166c605 100644
--- a/net/rds/bind.c
+++ b/net/rds/bind.c
@@ -61,8 +61,10 @@ static struct rds_sock *rds_bind_lookup(__be32 addr, __be16 port,
cmp = ((u64)be32_to_cpu(rs->rs_bound_addr) << 32) |
  be16_to_cpu(rs->rs_bound_port);
 
-   if (cmp == needle)
+   if (cmp == needle) {
+   rds_sock_addref(rs);
return rs;
+   }
}
 
if (insert) {
@@ -94,10 +96,10 @@ struct rds_sock *rds_find_bound(__be32 addr, __be16 port)
rs = rds_bind_lookup(addr, port, NULL);
read_unlock_irqrestore(&rds_bind_lock, flags);
 
-   if (rs && !sock_flag(rds_rs_to_sk(rs), SOCK_DEAD))
-   rds_sock_addref(rs);
-   else
+   if (rs && sock_flag(rds_rs_to_sk(rs), SOCK_DEAD)) {
+   rds_sock_put(rs);
rs = NULL;
+   }
 
rdsdebug("returning rs %p for %pI4:%u\n", rs, &addr,
ntohs(port));
@@ -123,14 +125,18 @@ static int rds_add_bound(struct rds_sock *rs, __be32 addr, __be16 *port)
write_lock_irqsave(&rds_bind_lock, flags);
 
do {
+   struct rds_sock *rrs;
if (rover == 0)
rover++;
-   if (!rds_bind_lookup(addr, cpu_to_be16(rover), rs)) {
+   rrs = rds_bind_lookup(addr, cpu_to_be16(rover), rs);
+   if (!rrs) {
*port = rs->rs_bound_port;
ret = 0;
rdsdebug("rs %p binding to %pI4:%d\n",
  rs, &addr, (int)ntohs(*port));
break;
+   } else {
+   rds_sock_put(rrs);
}
} while (rover++ != last);
 
-- 
1.9.1
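
The rule this patch enforces, that a successful lookup hands back a counted
reference which the caller must drop, can be sketched in user-space C roughly
as follows. This is only an illustration under stated assumptions:
struct sock_ref, sock_lookup(), sock_put() and try_bind() are hypothetical
stand-ins for the rds_sock helpers, and all locking is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rds_sock: only a key and a refcount. */
struct sock_ref {
	unsigned int port;
	int refcnt;
};

/* Drop a reference previously taken by sock_lookup(). */
static void sock_put(struct sock_ref *s)
{
	s->refcnt--;
}

/*
 * Look up a bound socket by port. On a hit the reference is taken
 * here, inside the lookup, so the caller never sees an uncounted
 * pointer.
 */
static struct sock_ref *sock_lookup(struct sock_ref *tbl, size_t n,
				    unsigned int port)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (tbl[i].port == port) {
			tbl[i].refcnt++;
			return &tbl[i];
		}
	}
	return NULL;
}

/*
 * Same shape as the bind loop in the patch: a hit means the port is
 * taken, and the reference obtained by the lookup has to be released
 * before the next port is tried. Returns the first free port in
 * [first, last], or 0 if none is free.
 */
static unsigned int try_bind(struct sock_ref *tbl, size_t n,
			     unsigned int first, unsigned int last)
{
	unsigned int port;

	for (port = first; port <= last; port++) {
		struct sock_ref *hit = sock_lookup(tbl, n, port);

		if (!hit)
			return port;
		sock_put(hit);
	}
	return 0;
}
```

Without the sock_put() in the loop, every occupied port probed during a bind
would leak one reference on the socket that owns it.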



Re: [RFT v2] geneve: implement support for IPv6-based tunnels

2015-09-30 Thread kbuild test robot
Hi John,

[auto build test results on v4.3-rc3 -- if it's an inappropriate base, please ignore]

reproduce:
  # apt-get install sparse
  make ARCH=x86_64 allmodconfig
  make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> drivers/net/geneve.c:55:19: sparse: symbol 'geneve_remote_unspec' was not declared. Should it be static?

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


Re: [PATCH net-next 5/6] ipv6: Call xfrm6_xlat_addr from ipv6_rcv

2015-09-30 Thread Tom Herbert
On Wed, Sep 30, 2015 at 2:06 AM, Steffen Klassert
 wrote:
> On Tue, Sep 29, 2015 at 03:17:22PM -0700, Tom Herbert wrote:
>> Call before performing NF_HOOK and routing in order to perform address
>> translation in the receive path.
>>
>> Signed-off-by: Tom Herbert 
>> ---
>>  net/ipv6/ip6_input.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
>> index 9075acf..06dac55 100644
>> --- a/net/ipv6/ip6_input.c
>> +++ b/net/ipv6/ip6_input.c
>> @@ -183,6 +183,9 @@ int ipv6_rcv(struct sk_buff *skb, struct net_device 
>> *dev, struct packet_type *pt
>>   /* Must drop socket now because of tproxy. */
>>   skb_orphan(skb);
>>
>> + /* Translate destination address before routing */
>> + xfrm6_xlat_addr(skb);
>> +
>
> This shows that xfrm is not the right place to add this. The existing
> xfrm hooks are located at the same place as your current LWT hooks are.
>
> You could use the existing xfrm hooks similar to xfrm tunnel modes.
> This reinserts the transformed packet back into layer2, but I guess
> this is not what you want.
>
> I'm currently playing with a GRO codepath for IPsec to get the
> packets transformed early. If you can do your address translation
> that early, it could be an option too. This clearly depends on
> enabled GRO at the receiving device, but you would still have
> the LWT hook as a fallback.
>
GRO probably doesn't help here. ILA already works with GRO, and
performing translation for every segment instead of just once for the
GRO packet would be unnecessary overhead. Besides, that still doesn't
address the problem of how to hook in a lookup and translation
function in the data path.

>>   return NF_HOOK(NFPROTO_IPV6, NF_INET_PRE_ROUTING,
>>  net, NULL, skb, dev, NULL,
>>  ip6_rcv_finish);
>
> Or, try to use the netfilter hook that seems to be at the right
> place at least.
>
My original patch did hook into nf so it didn't require any change to
IP data path. The suggested alternatives were to use iptables or nft,
but their overhead is too great for these to be useful as a
performance optimization. The problem is that any additional lookup
added for this purpose only makes sense if it is significantly cheaper
than the cost of doing a route lookup (the part that can be eliminated
by early demux), and needs to have near zero impact on unrelated
traffic.

Tom


Re: [patch net-next v3 02/10] switchdev: introduce transaction item queue for attr_set and obj_add

2015-09-30 Thread Vivien Didelot
Hi all,

On Sep. Friday 25 (39) 11:03 AM, Vivien Didelot wrote:
> On Sep. Thursday 24 (39) 10:55 PM, David Miller wrote:
> > From: Scott Feldman 
> > Date: Thu, 24 Sep 2015 22:29:43 -0700
> > 
> > > I'd rather keep 2-phase not optional, or at least make it somewhat of
> > > a pain for drivers to opt-out of 2-phase.  Forcing the driver to see
> > > both phases means the driver needs to put some code to skip phase 1
> > > (and hopefully has some persistent comment explaining why it's being
> > > skipped).  Something like:
> > > 
> > > /* I'm skipping phase 1 prepare for this operation.  I have infinite
> > >  * hardware resources and I'm not setting any persistent state in the
> > >  * driver or device and I don't need any dynamic resources from the
> > >  * kernel, so it's impossible for me to fail phase 2 commit.  Nothing
> > >  * to prepare, sorry.
> > >  */
> > 
> > I agree with Scott here.
> > 
> > If you can opt out of something, you can not think about it and thus
> > more likely get it wrong.
> > 
> > I can just see a driver not implementing prepare at all and then doing
> > stupid things in commit when they hit some resource limit or whatever,
> > rather than taking care of such issues in prepare.
> 
> OK, I have no experience with stacked devices nor what it actually looks
> like, but I understand that it is a redundant setup where it makes sense
> to ensure that an operation is feasible before programming the hardware.
> 
> I agree with both of you on imposing such a notion on switchdev drivers.
> 
> I was confused with the rtnl lock (from bridge netlink requests) which
> seemed to limit a lot the usage of this prepare phase.
> 
> I don't know the batch mode either, but I can think about a potentially
> powerful usage of the prepare phase in Marvell switches (or any basic
> home router switches), please tell me if the following is feasible:
> 
> Every hardware VLAN I know of is programmed with all port memberships
> in one shot. This is not feasible today with the bridge command. If I
> could bundle in one request the equivalent of ("VID 100: 0u 1u 5t"):
> 
> bridge vlan add master dev swp0 vid 100 pvid untagged
> bridge vlan add master dev swp1 vid 100 pvid untagged
> bridge vlan add master dev swp5 vid 100 # cpu
> 
> In such case the prepare phase could be great to allocate and populate a
> VLAN entry structure (i.e. struct mv88e6xxx_vtu_stu_entry) before
> programming the hardware *just once*. Is that doable?

May I get answers for this? I'd need that in order to suggest a next
step for the prepare phase in DSA drivers.

Thanks,
-v
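
The batching idea asked about above (collect every port membership for a VID
during prepare, then touch the hardware once in commit) can be sketched as
below. This is a minimal user-space model, not driver code: struct vlan_entry,
vlan_prepare_port() and vlan_commit() are hypothetical names, the entry layout
is not the real mv88e6xxx one, and the hardware write is only simulated by a
counter.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PORTS 8

/* Hypothetical per-port membership state for one VLAN. */
enum port_mode { PORT_NON_MEMBER = 0, PORT_UNTAGGED, PORT_TAGGED };

/* Hypothetical VLAN entry assembled in memory before it is programmed. */
struct vlan_entry {
	uint16_t vid;
	enum port_mode ports[MAX_PORTS];
};

/* Counts simulated hardware writes so the effect of batching is visible. */
static int hw_writes;

/*
 * Prepare phase: validate one port membership and record it in the
 * in-memory entry. No hardware access happens here.
 * Returns 0 on success, -1 if the request cannot be satisfied.
 */
static int vlan_prepare_port(struct vlan_entry *e, uint16_t vid, int port,
			     enum port_mode mode)
{
	if (port < 0 || port >= MAX_PORTS || vid != e->vid)
		return -1;
	e->ports[port] = mode;
	return 0;
}

/* Commit phase: program the whole entry in one shot. */
static void vlan_commit(const struct vlan_entry *e)
{
	(void)e;	/* a real driver would load the entry into the switch here */
	hw_writes++;
}
```

With "VID 100: 0u 1u 5t", three prepare calls fill the entry and a single
commit accounts for the only hardware write.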



Re: [PATCH v3 4/5] kcmp: add KCMP_FILE_PRIVATE_DATA

2015-09-30 Thread Tycho Andersen
On Wed, Sep 30, 2015 at 11:47:05AM -0700, Andy Lutomirski wrote:
> On Wed, Sep 30, 2015 at 11:41 AM, Tycho Andersen
>  wrote:
> > On Wed, Sep 30, 2015 at 11:25:41AM -0700, Andy Lutomirski wrote:
> >> On Wed, Sep 30, 2015 at 11:13 AM, Tycho Andersen
> >>  wrote:
> >> > This command allows comparing the underlying private data of two fds. This
> >> > is useful e.g. to find out if a seccomp filter is inherited, since struct
> >> > seccomp_filter are unique across tasks and are the private_data of seccomp
> >> > fds.
> >>
> >> This is very implementation-specific and may have nasty ABI
> >> consequences far outside seccomp.  Let's do something specific to
> >> seccomp and/or eBPF.
> >
> > We could change the name to a less generic KCMP_SECCOMP_FD or
> > something, but without some sort of GUID on each struct
> > seccomp_filter, the implementation would be effectively the same as it
> > is today. Is that enough, or do we need a GUID?
> >
> 
> I don't care about the GUID.  I think we should name it
> KCMP_SECCOMP_FD and make it only work on seccomp fds.

Ok, I can do that.

> Alternatively, we could figure out why KCMP_FILE doesn't do the trick
> and consider fixing it.  IMO it's really too bad that struct file is
> so heavyweight that we can't really just embed one in all kinds of
> structures.

The problem is that KCMP_FILE compares the file objects themselves,
instead of the underlying data. If I ask for a seccomp fd for filter 0
twice, I'll have two different file objects and they won't be equal. I
suppose we could add some special logic inside KCMP_FILE to compare
the underlying data in special cases (seccomp, ebpf, others?), but it
seems cleaner to have a separate command as you described above.

Tycho
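
The distinction drawn above, comparing file objects versus comparing what they
point at, can be shown with a deliberately reduced model. Everything here is an
assumption made for illustration: struct file_model is not the kernel's
struct file, and kcmp itself does not expose raw pointer equality like this.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced model of struct file. */
struct file_model {
	void *private_data;
};

/* What KCMP_FILE effectively asks: are these the same file object? */
static bool same_file(const struct file_model *a, const struct file_model *b)
{
	return a == b;
}

/* What the proposed command asks: do both wrap the same underlying object? */
static bool same_private_data(const struct file_model *a,
			      const struct file_model *b)
{
	return a->private_data == b->private_data;
}
```

Two fds requested for the same filter get distinct file objects, so
same_file() says "different" while same_private_data() says "same".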


[PATCH v2 02/14] RDS: make socket bind/release locking scheme simple and more efficient

2015-09-30 Thread Santosh Shilimkar
The RDS bind and release locking scheme is very inefficient. It
uses RCU for maintaining the bind hash-table, which is great, but
it also needs to hold a spinlock for [add/remove]_bound(). So for the
overall use case, the hash-table concurrent speedup doesn't pay off.
In fact, the blocking nature of synchronize_rcu() makes RDS
socket shutdown too slow, which hurts RDS performance since
connection shutdown and re-connect happen quite often to
maintain the RC part of the protocol.

So we make the locking scheme simpler and more efficient by
replacing spin_locks with reader/writer locks and getting rid
of RCU for the bind hash-table.

In a subsequent patch, we also convert the global lock to a per-bucket
lock to reduce the global lock contention.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/af_rds.c |  6 --
 net/rds/bind.c   | 35 +++
 2 files changed, 15 insertions(+), 26 deletions(-)

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index a2f28a6..dc08766 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -72,13 +72,7 @@ static int rds_release(struct socket *sock)
rds_clear_recv_queue(rs);
rds_cong_remove_socket(rs);
 
-   /*
-* the binding lookup hash uses rcu, we need to
-* make sure we synchronize_rcu before we free our
-* entry
-*/
rds_remove_bound(rs);
-   synchronize_rcu();
 
rds_send_drop_to(rs, NULL);
rds_rdma_drop_keys(rs);
diff --git a/net/rds/bind.c b/net/rds/bind.c
index dd666fb..01989e2 100644
--- a/net/rds/bind.c
+++ b/net/rds/bind.c
@@ -40,7 +40,7 @@
 
 #define BIND_HASH_SIZE 1024
 static struct hlist_head bind_hash_table[BIND_HASH_SIZE];
-static DEFINE_SPINLOCK(rds_bind_lock);
+static DEFINE_RWLOCK(rds_bind_lock);
 
 static struct hlist_head *hash_to_bucket(__be32 addr, __be16 port)
 {
@@ -48,6 +48,7 @@ static struct hlist_head *hash_to_bucket(__be32 addr, __be16 port)
  (BIND_HASH_SIZE - 1));
 }
 
+/* must hold either read or write lock (write lock for insert != NULL) */
 static struct rds_sock *rds_bind_lookup(__be32 addr, __be16 port,
struct rds_sock *insert)
 {
@@ -56,30 +57,24 @@ static struct rds_sock *rds_bind_lookup(__be32 addr, __be16 port,
u64 cmp;
u64 needle = ((u64)be32_to_cpu(addr) << 32) | be16_to_cpu(port);
 
-   rcu_read_lock();
-   hlist_for_each_entry_rcu(rs, head, rs_bound_node) {
+   hlist_for_each_entry(rs, head, rs_bound_node) {
cmp = ((u64)be32_to_cpu(rs->rs_bound_addr) << 32) |
  be16_to_cpu(rs->rs_bound_port);
 
-   if (cmp == needle) {
-   rcu_read_unlock();
+   if (cmp == needle)
return rs;
-   }
}
-   rcu_read_unlock();
 
if (insert) {
/*
 * make sure our addr and port are set before
-* we are added to the list, other people
-* in rcu will find us as soon as the
-* hlist_add_head_rcu is done
+* we are added to the list.
 */
insert->rs_bound_addr = addr;
insert->rs_bound_port = port;
rds_sock_addref(insert);
 
-   hlist_add_head_rcu(&insert->rs_bound_node, head);
+   hlist_add_head(&insert->rs_bound_node, head);
}
return NULL;
 }
@@ -93,8 +88,11 @@ static struct rds_sock *rds_bind_lookup(__be32 addr, __be16 port,
 struct rds_sock *rds_find_bound(__be32 addr, __be16 port)
 {
struct rds_sock *rs;
+   unsigned long flags;
 
+   read_lock_irqsave(&rds_bind_lock, flags);
rs = rds_bind_lookup(addr, port, NULL);
+   read_unlock_irqrestore(&rds_bind_lock, flags);
 
if (rs && !sock_flag(rds_rs_to_sk(rs), SOCK_DEAD))
rds_sock_addref(rs);
@@ -103,6 +101,7 @@ struct rds_sock *rds_find_bound(__be32 addr, __be16 port)
 
rdsdebug("returning rs %p for %pI4:%u\n", rs, &addr,
ntohs(port));
+
return rs;
 }
 
@@ -121,7 +120,7 @@ static int rds_add_bound(struct rds_sock *rs, __be32 addr, __be16 *port)
last = rover - 1;
}
 
-   spin_lock_irqsave(&rds_bind_lock, flags);
+   write_lock_irqsave(&rds_bind_lock, flags);
 
do {
if (rover == 0)
@@ -135,7 +134,7 @@ static int rds_add_bound(struct rds_sock *rs, __be32 addr, __be16 *port)
}
} while (rover++ != last);
 
-   spin_unlock_irqrestore(&rds_bind_lock, flags);
+   write_unlock_irqrestore(&rds_bind_lock, flags);
 
return ret;
 }
@@ -144,19 +143,19 @@ void rds_remove_bound(struct rds_sock *rs)
 {
unsigned long flags;
 
-   spin_lock_irqsave(&rds_bind_lock, flags);
+   write_lock_irqsave(&rds_bind_lock, flags);
 
if (rs->rs_bound_addr) {
rdsdebug("rs %p unbinding 

[PATCH v3 5/5] bpf: save the program the user actually supplied

2015-09-30 Thread Tycho Andersen
In some cases (e.g. seccomp) the program result might be translated from
the original program the user supplied. If we're saving the result for
checkpoint/restore, we should save exactly the program the user initially
supplied.

This causes problems when the translations seccomp makes are not allowed by
bpf_check_classic.

Signed-off-by: Tycho Andersen 
CC: Kees Cook 
CC: Will Drewry 
CC: Oleg Nesterov 
CC: Andy Lutomirski 
CC: Pavel Emelyanov 
CC: Serge E. Hallyn 
CC: Alexei Starovoitov 
CC: Daniel Borkmann 
---
 net/core/filter.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 70995dd..5a4596b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -845,8 +845,7 @@ int bpf_prog_store_orig_filter(struct bpf_prog *fp,
fkprog = fp->orig_prog;
fkprog->len = fprog->len;
 
-   fkprog->filter = kmemdup(fp->insns, fsize,
-GFP_KERNEL | __GFP_NOWARN);
+   fkprog->filter = memdup_user(fprog->filter, fsize);
if (!fkprog->filter) {
kfree(fp->orig_prog);
return -ENOMEM;
-- 
2.5.0



[PATCH v3 3/5] seccomp: add a ptrace command to get seccomp filter fds

2015-09-30 Thread Tycho Andersen
I just picked 40 for the constant out of thin air, but there may be a more
appropriate value for this. Also, we return EINVAL when there is no filter
for the index the user requested, but ptrace also returns EINVAL for
invalid commands, making it slightly awkward to test whether or not the
kernel supports this feature. It can still be done via,

	if (is_in_mode_filter(pid)) {
		int fd;

		fd = ptrace(PTRACE_SECCOMP_GET_FILTER_FD, pid, NULL, 0);
		if (fd < 0 && errno == EINVAL)
			/* not supported */

		...
	}

since being in SECCOMP_MODE_FILTER implies that there is at least one
filter. If there is a more appropriate errno (ESRCH collides as well with
ptrace) to give here that may be better.

Signed-off-by: Tycho Andersen 
CC: Kees Cook 
CC: Will Drewry 
CC: Oleg Nesterov 
CC: Andy Lutomirski 
CC: Pavel Emelyanov 
CC: Serge E. Hallyn 
CC: Alexei Starovoitov 
CC: Daniel Borkmann 
---
 include/linux/seccomp.h |  9 +
 include/uapi/linux/ptrace.h |  2 ++
 kernel/ptrace.c |  4 
 kernel/seccomp.c| 28 
 4 files changed, 43 insertions(+)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index f426503..637d91f 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -95,4 +95,13 @@ static inline void get_seccomp_filter(struct task_struct *tsk)
return;
 }
 #endif /* CONFIG_SECCOMP_FILTER */
+
+#if defined(CONFIG_CHECKPOINT_RESTORE) && defined(CONFIG_SECCOMP_FILTER)
+extern long seccomp_get_filter_fd(struct task_struct *task, long data);
+#else
+static inline long seccomp_get_filter_fd(struct task_struct *task, long data)
+{
+   return -EINVAL;
+}
+#endif /* CONFIG_CHECKPOINT_RESTORE && CONFIG_SECCOMP_FILTER */
 #endif /* _LINUX_SECCOMP_H */
diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index a7a6979..3271f5a 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -23,6 +23,8 @@
 
 #define PTRACE_SYSCALL   24
 
+#define PTRACE_SECCOMP_GET_FILTER_FD 40
+
 /* 0x4200-0x4300 are reserved for architecture-independent additions.  */
 #define PTRACE_SETOPTIONS  0x4200
 #define PTRACE_GETEVENTMSG 0x4201
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 787320d..aede440 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -1016,6 +1016,10 @@ int ptrace_request(struct task_struct *child, long request,
break;
}
 #endif
+
+   case PTRACE_SECCOMP_GET_FILTER_FD:
+   return seccomp_get_filter_fd(child, data);
+
default:
break;
}
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 6f0465c..7275ce0 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1058,3 +1058,31 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
/* prctl interface doesn't have flags, so they are always zero. */
return do_seccomp(op, 0, uargs);
 }
+
+#if defined(CONFIG_CHECKPOINT_RESTORE) && defined(CONFIG_SECCOMP_FILTER)
+long seccomp_get_filter_fd(struct task_struct *task, long n)
+{
+   struct seccomp_filter *filter;
+   long fd;
+
+   if (task->seccomp.mode != SECCOMP_MODE_FILTER)
+   return -EINVAL;
+
+   filter = task->seccomp.filter;
+   while (n > 0 && filter) {
+   filter = filter->prev;
+   n--;
+   }
+
+   if (!filter)
+   return -EINVAL;
+
+   atomic_inc(&filter->usage);
+   fd = anon_inode_getfd("seccomp", _fops, filter,
+ O_RDONLY | O_CLOEXEC);
+   if (fd < 0)
+   seccomp_filter_decref(filter);
+
+   return fd;
+}
+#endif
-- 
2.5.0
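
The index walk in seccomp_get_filter_fd() above can be tried out in user space
with a reduced model of the filter chain. struct filter_model and nth_filter()
are hypothetical names used only for this sketch; returning NULL here
corresponds to the -EINVAL case in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced filter node: newest filter first, linked via prev. */
struct filter_model {
	int id;
	struct filter_model *prev;
};

/*
 * Return the n-th filter counting back from the newest one, or NULL
 * if the chain is shorter than that.
 */
static struct filter_model *nth_filter(struct filter_model *newest, long n)
{
	struct filter_model *f = newest;

	while (n > 0 && f) {
		f = f->prev;
		n--;
	}
	return f;
}
```

Index 0 is therefore the most recently installed filter, and the first index
that yields no filter tells the caller how long the chain is.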



Re: [PATCH v3 4/5] kcmp: add KCMP_FILE_PRIVATE_DATA

2015-09-30 Thread Andy Lutomirski
On Wed, Sep 30, 2015 at 11:55 AM, Tycho Andersen
 wrote:
> On Wed, Sep 30, 2015 at 11:47:05AM -0700, Andy Lutomirski wrote:
>> On Wed, Sep 30, 2015 at 11:41 AM, Tycho Andersen
>>  wrote:
>> > On Wed, Sep 30, 2015 at 11:25:41AM -0700, Andy Lutomirski wrote:
>> >> On Wed, Sep 30, 2015 at 11:13 AM, Tycho Andersen
>> >>  wrote:
>> >> > This command allows comparing the underlying private data of two fds. This
>> >> > is useful e.g. to find out if a seccomp filter is inherited, since struct
>> >> > seccomp_filter are unique across tasks and are the private_data of seccomp
>> >> > fds.
>> >>
>> >> This is very implementation-specific and may have nasty ABI
>> >> consequences far outside seccomp.  Let's do something specific to
>> >> seccomp and/or eBPF.
>> >
>> > We could change the name to a less generic KCMP_SECCOMP_FD or
>> > something, but without some sort of GUID on each struct
>> > seccomp_filter, the implementation would be effectively the same as it
>> > is today. Is that enough, or do we need a GUID?
>> >
>>
>> I don't care about the GUID.  I think we should name it
>> KCMP_SECCOMP_FD and make it only work on seccomp fds.
>
> Ok, I can do that.
>
>> Alternatively, we could figure out why KCMP_FILE doesn't do the trick
>> and consider fixing it.  IMO it's really too bad that struct file is
>> so heavyweight that we can't really just embed one in all kinds of
>> structures.
>
> The problem is that KCMP_FILE compares the file objects themselves,
> instead of the underlying data. If I ask for a seccomp fd for filter 0
> twice, I'll have two different file objects and they won't be equal. I
> suppose we could add some special logic inside KCMP_FILE to compare
> the underlying data in special cases (seccomp, ebpf, others?), but it
> seems cleaner to have a separate command as you described above.
>

What I meant was that maybe we could get the two requests to actually
produce the same struct file.  But that could get very messy
memory-wise.

--Andy


Re: Rate limiting AP bandwidth change messages in ieee80211_config_bw?

2015-09-30 Thread Johannes Berg

> 
> I'm not sure ratelimiting it would even work - it's not *that* high
> frequency? Not really sure though.
> 
> I think we can do either, it's not such a terribly important message as
> far as I can tell.
> 

Seems like Emmanuel would like to see the message stay in some form -
perhaps we should try rate limiting it then? Could you check if that
actually works?

johannes


[PATCH v2 09/14] RDS: IB: handle rds_ibdev release case instead of crashing the kernel

2015-09-30 Thread Santosh Shilimkar
From: Santosh Shilimkar 

Just in case we are still handling the QP receive completion while the
rds_ibdev is released, drop the connection instead of crashing the kernel.

Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_cm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 8f51d0d..2b2370e 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -285,7 +285,8 @@ static void rds_ib_tasklet_fn_recv(unsigned long data)
struct rds_ib_device *rds_ibdev = ic->rds_ibdev;
struct rds_ib_ack_state state;
 
-   BUG_ON(!rds_ibdev);
+   if (!rds_ibdev)
+   rds_conn_drop(conn);
 
rds_ib_stats_inc(s_ib_tasklet_call);
 
-- 
1.9.1



Re: [RFC PATCH 0/3] net: dsa: Complete and fix the dsa unbinding

2015-09-30 Thread Florian Fainelli
On 30/09/15 01:21, Neil Armstrong wrote:
> In order to cleanly unbind the dsa core, either as a module removal,
> or a platform device unbind, switch the allocations to their devm_
> counterparts and complete the destroy functions.
> 
> The last patch is an experimental way to exit the probe when no
> switch is found in the discover process.
> 
> The patches are based on the current net-next.

I looked at the patches and they bring DSA in a better direction. For
future submissions, could you CC people who recently worked on DSA, like
Andrew Lunn, Guenter Roeck, Vivien Didelot and myself? We can typically
give your patches a try fairly quickly.

In case you are seriously considering making DSA a loadable module,
there was an earlier attempt here:

http://comments.gmane.org/gmane.linux.network/345803

Thanks!

> 
> Neil Armstrong (3):
>   net: dsa: Use devm_ prefixed allocations
>   net: dsa: complete dsa_switch_destroy calls
>   net: dsa: exit probe if no switch were found
> 
>  net/dsa/dsa.c | 67 
> ---
>  1 file changed, 60 insertions(+), 7 deletions(-)
> 


-- 
Florian


[PATCH net-next 5/5] bridge: vlan: don't pass flags when creating context only

2015-09-30 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

We should not pass the original flags when creating a context vlan only
because they may contain some flags that change behaviour in the bridge.
The new global context should be with minimal set of flags, so pass 0
and let br_vlan_add() set the master flag only.

Signed-off-by: Nikolay Aleksandrov 
---
 net/bridge/br_vlan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index 7e9d60a402e2..75214a51cf0e 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -197,7 +197,7 @@ static int __vlan_add(struct net_bridge_vlan *v, u16 flags)
masterv = br_vlan_find(br->vlgrp, v->vid);
if (!masterv) {
/* missing global ctx, create it now */
-   err = br_vlan_add(br, v->vid, master_flags);
+   err = br_vlan_add(br, v->vid, 0);
if (err)
goto out_filt;
masterv = br_vlan_find(br->vlgrp, v->vid);
-- 
2.4.3



Re: [PATCH v3 4/5] kcmp: add KCMP_FILE_PRIVATE_DATA

2015-09-30 Thread Andy Lutomirski
On Wed, Sep 30, 2015 at 11:41 AM, Tycho Andersen
 wrote:
> On Wed, Sep 30, 2015 at 11:25:41AM -0700, Andy Lutomirski wrote:
>> On Wed, Sep 30, 2015 at 11:13 AM, Tycho Andersen
>>  wrote:
>> > This command allows comparing the underlying private data of two fds. This
>> > is useful e.g. to find out if a seccomp filter is inherited, since struct
>> > seccomp_filter are unique across tasks and are the private_data of seccomp
>> > fds.
>>
>> This is very implementation-specific and may have nasty ABI
>> consequences far outside seccomp.  Let's do something specific to
>> seccomp and/or eBPF.
>
> We could change the name to a less generic KCMP_SECCOMP_FD or
> something, but without some sort of GUID on each struct
> seccomp_filter, the implementation would be effectively the same as it
> is today. Is that enough, or do we need a GUID?
>

I don't care about the GUID.  I think we should name it
KCMP_SECCOMP_FD and make it only work on seccomp fds.

Alternatively, we could figure out why KCMP_FILE doesn't do the trick
and consider fixing it.  IMO it's really too bad that struct file is
so heavyweight that we can't really just embed one in all kinds of
structures.


--Andy
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 00/14] RDS: connection scalability and performance improvements

2015-09-30 Thread Santosh Shilimkar
[v2]:
Dropped "[PATCH 05/15] RDS: increase size of hash-table to 8K" from
earlier version [1]. I plan to address the hash table scalability using
re-sizable hash tables as suggested by David Laight and David Miller [2]

This series addresses RDS connection bottlenecks on massive workloads and
improves the RDMA performance by almost 3X. RDS TCP also gets a small gain
of about 12%.

RDS is being used in massive systems with high scalability, where several
hundred thousand end points and tens of thousands of local processes
are operating on tens of thousands of sockets. Being RC (reliable connection),
socket bind and release happen very often, and any inefficiency in
bind hash lookups hurts the overall system performance. The RDS bind hash-table
uses a global spin-lock, which is the biggest bottleneck. To make matters worse,
it uses RCU inside the global lock for hash buckets.
This is being addressed by simply using a per-bucket rw lock, which makes the
locking simple and very efficient. The hash table size is still an issue and
I plan to address it by using re-sizable hash tables as suggested on the list.

For the RDS RDMA improvement, the completion handling is revamped so that we
can do batch completions. Both send and receive completion handlers are
split logically to achieve this. With RDS 8K messages being one of the
key use cases, the mr pool is adapted to have 8K mrs along with the default 1M
mrs. While doing this, a few fixes are made and a couple of bottlenecks seen
with rds_sendmsg() are addressed.

The series applies against 4.3-rc1 as well as net-next. It's tested on Oracle
hardware with IB fabric for both bcopy and RDMA modes. RDS TCP is
tested with iXGB NIC. Like last time, the iWARP transport is untested with
these changes. The patchset is also available at the git repo below:

git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git net/rds/4.3-v2

As a side note, the IB HCA driver I used for testing is missing at least 3
important patches upstream that are needed to see the full IB performance,
and I am hoping to get those into mainline.

Santosh Shilimkar (14):
  RDS: use kfree_rcu in rds_ib_remove_ipaddr
  RDS: make socket bind/release locking scheme simple and more efficient
  RDS: fix rds_sock reference bug while doing bind
  RDS: Use per-bucket rw lock for bind hash-table
  RDS: defer the over_batch work to send worker
  RDS: use rds_send_xmit() state instead of RDS_LL_SEND_FULL
  RDS: IB: ack more receive completions to improve performance
  RDS: IB: split send completion handling and do batch ack
  RDS: IB: handle rds_ibdev release case instead of crashing the kernel
  RDS: IB: fix the rds_ib_fmr_wq kick call
  RDS: IB: use already available pool handle from ibmr
  RDS: IB: mark rds_ib_fmr_wq static
  RDS: IB: use max_mr from HCA caps than max_fmr
  RDS: IB: split mr pool to improve 8K messages performance

 net/rds/af_rds.c   |   8 +---
 net/rds/bind.c |  76 ++
 net/rds/ib.c   |  47 --
 net/rds/ib.h   |  78 +++---
 net/rds/ib_cm.c| 114 ++--
 net/rds/ib_rdma.c  | 116 ++---
 net/rds/ib_recv.c  | 136 +++--
 net/rds/ib_send.c  | 110 ---
 net/rds/ib_stats.c |  22 +
 net/rds/rds.h  |   1 +
 net/rds/send.c |  15 --
 net/rds/threads.c  |   2 +
 12 files changed, 445 insertions(+), 280 deletions(-)

-- 
1.9.1

Regards,
Santosh

[1] https://lkml.org/lkml/2015/9/19/384
[2] https://lkml.org/lkml/2015/9/21/828
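
The per-bucket rw lock described in the cover letter above can be sketched in
user space with POSIX rwlocks. This is only a model under stated assumptions:
the table size, struct node, struct bucket and the table_*() helpers are all
hypothetical, and pthread_rwlock_t stands in for the kernel's rwlock_t.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 16	/* power of two; illustrative size only */

struct node {
	uint64_t key;
	struct node *next;
};

/* Each bucket carries its own lock, so unrelated buckets never contend. */
struct bucket {
	pthread_rwlock_t lock;
	struct node *head;
};

static struct bucket table[NBUCKETS];

static void table_init(void)
{
	for (size_t i = 0; i < NBUCKETS; i++) {
		pthread_rwlock_init(&table[i].lock, NULL);
		table[i].head = NULL;
	}
}

static struct bucket *key_to_bucket(uint64_t key)
{
	return &table[key & (NBUCKETS - 1)];
}

/* Insert under the bucket's write lock. Returns 0, or -1 if the key exists. */
static int table_insert(struct node *n)
{
	struct bucket *b = key_to_bucket(n->key);
	struct node *it;
	int ret = 0;

	pthread_rwlock_wrlock(&b->lock);
	for (it = b->head; it; it = it->next) {
		if (it->key == n->key) {
			ret = -1;
			break;
		}
	}
	if (ret == 0) {
		n->next = b->head;
		b->head = n;
	}
	pthread_rwlock_unlock(&b->lock);
	return ret;
}

/*
 * Lookup under the bucket's read lock; many readers may run at once.
 * The returned pointer is not protected once the lock is dropped,
 * which is why a real user would take a reference before unlocking.
 */
static struct node *table_find(uint64_t key)
{
	struct bucket *b = key_to_bucket(key);
	struct node *it;

	pthread_rwlock_rdlock(&b->lock);
	for (it = b->head; it; it = it->next)
		if (it->key == key)
			break;
	pthread_rwlock_unlock(&b->lock);
	return it;
}
```

Writers only exclude readers of the one bucket they modify, so bind and
release on different buckets can proceed in parallel.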





Re: [PATCH v2 0/7] net: mvneta: Switch to per-CPU irq and make rxq_def useful

2015-09-30 Thread Thomas Gleixner
On Wed, 30 Sep 2015, David Miller wrote:
> From: Thomas Gleixner 
> Date: Wed, 30 Sep 2015 16:56:06 +0200 (CEST)
> 
> > On Tue, 29 Sep 2015, David Miller wrote:
> >> From: Gregory CLEMENT 
> >> Date: Fri, 25 Sep 2015 18:09:31 +0200
> >> 
> >> > As stated in the first version: "this patchset reworks the Marvell
> >> > neta driver in order to really support its per-CPU interrupts, instead
> >> > of faking them as SPI, and allow the use of any RX queue instead of
> >> > the hardcoded RX queue 0 that we have currently."
> >> 
> >> Series applied, thanks.
> > 
> > You could have had the courtesy to wait for an ack for the core irq
> > parts at least
> 
> Sorry, my impression was that those parts were already discussed and
> agreed upon.

No problem. I would have preferred to merge them to a separate branch
which you could have pulled so we don't end up with conflicts on
further changes in that area. But it's ok as it is. The patches are
good to go.

Thanks,

tglx



Re: [PATCH wpan-tools 1/2] security: add nl802154 security support

2015-09-30 Thread Alexander Aring
Hi,

On Wed, Sep 30, 2015 at 04:46:30PM +0200, Stefan Schmidt wrote:
> Hello.
> 
> A really huge patch. I will start on it. Not sure I can do a full review in
> one go though.
> 
> On 28/09/15 09:25, Alexander Aring wrote:
> >This patch introduce support for the experimental seucirty support for
> 
> Typo. Security.
> >nl802154. We currently support add/del settings for manipulating
> >security table entries. The dump functionality is a "really" keep it
> 
> is really a
> >short and stupid handling, the dump will printout the printout the right
> 
> dump will printout the right calls to add the entry

ok.

> >add calls which was called to add the entry. This can be used for
> >storing the current security tables by some script. The interface
> >argument is replaced by $WPAN_DEV variable, so it's possible to move one
> >interface configuration to another one.
> >
> >Signed-off-by: Alexander Aring 
> >---
> >  src/Makefile.am |1 +
> >  src/interface.c |  100 +
> >  src/nl802154.h  |  191 ++
> >  src/security.c  | 1118 
> > +++
> >  4 files changed, 1410 insertions(+)
> >  create mode 100644 src/security.c
> >
> >diff --git a/src/Makefile.am b/src/Makefile.am
> >index 2d54576..b2177a2 100644
> >--- a/src/Makefile.am
> >+++ b/src/Makefile.am
> >@@ -9,6 +9,7 @@ iwpan_SOURCES = \
> > interface.c \
> > phy.c \
> > mac.c \
> >+security.c \
> > nl_extras.h \
> > nl802154.h
> >diff --git a/src/interface.c b/src/interface.c
> >index 85d40a8..076e7c3 100644
> >--- a/src/interface.c
> >+++ b/src/interface.c
> >@@ -10,6 +10,7 @@
> >  #include 
> >  #include 
> >+#define CONFIG_IEEE802154_NL802154_EXPERIMENTAL
> >  #include "nl802154.h"
> >  #include "nl_extras.h"
> >  #include "iwpan.h"
> >@@ -226,6 +227,105 @@ static int print_iface_handler(struct nl_msg *msg, 
> >void *arg)
> > if (tb_msg[NL802154_ATTR_ACKREQ_DEFAULT])
> > printf("%s\tackreq_default %d\n", indent, 
> > nla_get_u8(tb_msg[NL802154_ATTR_ACKREQ_DEFAULT]));
> >+if (tb_msg[NL802154_ATTR_SEC_ENABLED])
> >+printf("%s\tsecurity %d\n", indent, 
> >nla_get_u8(tb_msg[NL802154_ATTR_SEC_ENABLED]));
> >+if (tb_msg[NL802154_ATTR_SEC_OUT_LEVEL])
> >+printf("%s\tout_level %d\n", indent, 
> >nla_get_u8(tb_msg[NL802154_ATTR_SEC_OUT_LEVEL]));
> >+if (tb_msg[NL802154_ATTR_SEC_OUT_KEY_ID]) {
> >+struct nlattr *tb_key_id[NL802154_KEY_ID_ATTR_MAX + 1];
> >+static struct nla_policy key_id_policy[NL802154_KEY_ID_ATTR_MAX 
> >+ 1] = {
> >+[NL802154_KEY_ID_ATTR_MODE] = { .type = NLA_U32 },
> >+[NL802154_KEY_ID_ATTR_INDEX] = { .type = NLA_U8 },
> >+[NL802154_KEY_ID_ATTR_IMPLICIT] = { .type = NLA_NESTED 
> >},
> >+[NL802154_KEY_ID_ATTR_SOURCE_SHORT] = { .type = NLA_U32 
> >},
> >+[NL802154_KEY_ID_ATTR_SOURCE_EXTENDED] = { .type = 
> >NLA_U64 },
> >+};
> >+
> >+nla_parse_nested(tb_key_id, NL802154_KEY_ID_ATTR_MAX,
> >+ tb_msg[NL802154_ATTR_SEC_OUT_KEY_ID], 
> >key_id_policy);
> >+printf("%s\tout_key_id\n", indent);
> >+
> >+if (tb_key_id[NL802154_KEY_ID_ATTR_MODE]) {
> >+enum nl802154_key_id_modes key_id_mode;
> >+
> >+key_id_mode = 
> >nla_get_u32(tb_key_id[NL802154_KEY_ID_ATTR_MODE]);
...
> >+enum nl802154_dev_addr_modes {
> >+NL802154_DEV_ADDR_NONE,
> >+__NL802154_DEV_ADDR_INVALID,
> >+NL802154_DEV_ADDR_SHORT,
> >+NL802154_DEV_ADDR_EXTENDED,
> >+
> >+/* keep last */
> >+__NL802154_DEV_ADDR_AFTER_LAST,
> 
> Hmm, why bother with AFTER_LAST here and not just use ADDR_MAX as sentinal
> for this enum? Looks redundant to me.
> 

At first I wanted to stay close to the wireless nl80211 userspace uapi header,
which declares this hidden __FOOBAR enum value in nearly every one of its enum
declarations. See [0]; I simply adapted this convention for nl802154.

The reason is probably that they want some automatic mechanism to increment
the MAX value. It also matters whether you declare an array for a netlink
policy [1] or pass the length argument for parsing [2]; mixing the two up
sometimes results in off-by-one errors.
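
A minimal sketch of the convention described above, using hypothetical
demo_* names rather than the real nl802154 identifiers. The hidden
__AFTER_LAST entry lets MAX track the last real entry automatically. Arrays
indexed by attribute need MAX + 1 slots, while parse-style helpers are
given MAX itself, which is where the off-by-one risk comes from:

```c
#include <stddef.h>

/* Hypothetical attribute enum following the nl80211-style convention. */
enum demo_attr {
	DEMO_ATTR_UNSPEC,
	DEMO_ATTR_MODE,
	DEMO_ATTR_INDEX,

	/* keep last */
	__DEMO_ATTR_AFTER_LAST,
	DEMO_ATTR_MAX = __DEMO_ATTR_AFTER_LAST - 1
};

/* A table indexed by attribute needs MAX + 1 slots. */
static int demo_table[DEMO_ATTR_MAX + 1];

/* A parse-style helper takes the highest valid index, not the array size. */
static int demo_count_valid(const int *table, int maxtype)
{
	int i, n = 0;

	for (i = 0; i <= maxtype; i++)
		if (table[i])
			n++;
	return n;
}
```

Adding a new entry above the "keep last" comment updates DEMO_ATTR_MAX and
the table size without touching any other line.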

...
> >+
> >+static int handle_out_key_id_set(struct nl802154_state *state, struct nl_cb 
> >*cb,
> >+ struct nl_msg *msg, int argc, char **argv,
> >+ enum id_input id)
> >+{
> >+return handle_parse_key_id(msg, NL802154_ATTR_SEC_OUT_KEY_ID, , 
> >);
> >+
> >+}
> >+COMMAND(set, out_key_id,
> >+"<0  <2 |3 >>|"
> >+"<1 >|"
> >+"<2  >|"
> >+"<3  >",
> 
> What are these extra >>| for ?
> 

The numbers are actually the enum values, each of which usually selects some
specific mode, in this case the key_id_mode. Of course each of them has a
proper name, and we should add some helper functions to map these enums to
strings.

The '>' should 

RE: [PATCHv2 net-next 2/4] cxgb4: For T4, don't read the Firmware Mailbox Control register

2015-09-30 Thread Casey Leedom
Hari,

  I think you missed the corresponding change that's needed for the const char 
*owner[] array.  You need to add an "" entry so the index of "4" makes 
sense.

Casey
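
A minimal sketch of the point above, with hypothetical names: if the code
can select index 4, the name table needs a fifth entry, and a bounds check
keeps any other stray index from reading past the array. The label used
for index 4 here is a placeholder, not the string from the driver:

```c
#include <stddef.h>

/* Hypothetical owner names; the fifth label is only a placeholder. */
static const char *const demo_owner[] = {
	"none", "FW", "driver", "unknown",
	"(not read)"	/* index 4: used when the register is not read */
};

#define DEMO_OWNER_COUNT (sizeof(demo_owner) / sizeof(demo_owner[0]))

/* Return a printable name, guarding against out-of-range indices. */
static const char *demo_owner_name(unsigned int i)
{
	return i < DEMO_OWNER_COUNT ? demo_owner[i] : "invalid";
}
```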


From: Hariprasad Shenai [haripra...@chelsio.com]
Sent: Wednesday, September 30, 2015 8:03 AM
To: netdev@vger.kernel.org
Cc: da...@davemloft.net; Casey Leedom; Nirranjan Kirubaharan; Hariprasad S
Subject: [PATCHv2 net-next 2/4] cxgb4: For T4, don't read the Firmware Mailbox 
Control register

T4 doesn't have the Shadow copy of the register which we can read without
side effect. So don't read mbox control register for T4 adapter

Signed-off-by: Hariprasad Shenai 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
index 0a87a32..8001619 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
@@ -1134,12 +1134,20 @@ static int mbox_show(struct seq_file *seq, void *v)
unsigned int mbox = (uintptr_t)seq->private & 7;
struct adapter *adap = seq->private - mbox;
void __iomem *addr = adap->regs + PF_REG(mbox, CIM_PF_MAILBOX_DATA_A);
-   unsigned int ctrl_reg = (is_t4(adap->params.chip)
-? CIM_PF_MAILBOX_CTRL_A
-: CIM_PF_MAILBOX_CTRL_SHADOW_COPY_A);
-   void __iomem *ctrl = adap->regs + PF_REG(mbox, ctrl_reg);

-   i = MBOWNER_G(readl(ctrl));
+   /* For T4 we don't have a shadow copy of the Mailbox Control register.
+* And since reading that real register causes a side effect of
+* granting ownership, we're best off simply not reading it at all.
+*/
+   if (is_t4(adap->params.chip)) {
+   i = 4; /* index of "" */
+   } else {
+   unsigned int ctrl_reg = CIM_PF_MAILBOX_CTRL_SHADOW_COPY_A;
+   void __iomem *ctrl = adap->regs + PF_REG(mbox, ctrl_reg);
+
+   i = MBOWNER_G(readl(ctrl));
+   }
+
seq_printf(seq, "mailbox owned by %s\n\n", owner[i]);

for (i = 0; i < MBOX_LEN; i += 8)
--
2.3.4



Re: [PATCH v3 2/5] seccomp: add the concept of a seccomp filter FD

2015-09-30 Thread kbuild test robot
Hi Tycho,

[auto build test results on v4.3-rc3 -- if it's inappropriate base, please 
ignore]

config: i386-alldefconfig (attached as .config)
reproduce:
  git checkout 9613ae6bf5f111701614acb3eda3123d21a59239
  # save the attached .config to linux build tree
  make ARCH=i386 

All error/warnings (new ones prefixed by >>):

>> kernel/seccomp.c:998:10: error: expected ';', ',' or ')' before 'const'
 const char __user *filter)
 ^
   kernel/seccomp.c: In function 'do_seccomp':
>> kernel/seccomp.c:1016:10: error: implicit declaration of function 
>> 'seccomp_filter_fd' [-Werror=implicit-function-declaration]
  return seccomp_filter_fd(flags, uargs);
 ^
   cc1: some warnings being treated as errors

vim +998 kernel/seccomp.c

   992 const char __user *filter)
   993  {
   994  return -EINVAL;
   995  }
   996  
   997  static inline long seccomp_filter_fd(unsigned int flags
 > 998   const char __user *filter)
   999  {
  1000  return -EINVAL;
  1001  }
  1002  #endif
  1003  
  1004  /* Common entry point for both prctl and syscall. */
  1005  static long do_seccomp(unsigned int op, unsigned int flags,
  1006 const char __user *uargs)
  1007  {
  1008  switch (op) {
  1009  case SECCOMP_SET_MODE_STRICT:
  1010  if (flags != 0 || uargs != NULL)
  1011  return -EINVAL;
  1012  return seccomp_set_mode_strict();
  1013  case SECCOMP_SET_MODE_FILTER:
  1014  return seccomp_set_mode_filter(flags, uargs);
  1015  case SECCOMP_FILTER_FD:
> 1016  return seccomp_filter_fd(flags, uargs);
  1017  default:
  1018  return -EINVAL;
  1019  }

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation
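
For reference, the first error at line 998 looks like a missing comma
between the two parameters of the stub, and the implicit-declaration error
at line 1016 would follow from the stub failing to parse. A userspace
sketch of the corrected stub, assuming that reading is right (the function
name is hypothetical and the kernel-only __user annotation is dropped so
the fragment compiles on its own):

```c
#include <errno.h>
#include <stddef.h>

/* Stub used when the feature is compiled out: always reports -EINVAL. */
static inline long demo_seccomp_filter_fd(unsigned int flags,
					  const char *filter)
{
	(void)flags;	/* unused in the stub */
	(void)filter;	/* unused in the stub */
	return -EINVAL;
}
```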




Re: Rate limiting AP bandwidth change messages in ieee80211_config_bw?

2015-09-30 Thread Johannes Berg
On Wed, 2015-09-30 at 13:02 -0400, Josh Boyer wrote:
> Hi Johannes,
> 
> We've seen a handful of reports that seem to have verbose output from
> the ieee80211_config_bw function in net/mac80211/mlme.c.  It looks
> similar to this:
> 
> [   66.578652] wlp3s0: AP xx:xx:xx:xx:xx changed bandwidth, new config
> is 2437 MHz, width 2 (2447/0 MHz)
> [   68.522437] wlp3s0: AP xx:xx:xx:xx:xx changed bandwidth, new config
> is 2437 MHz, width 1 (2437/0 MHz)

> Essentially, this looks like the AP is changing the bandwidth (and
> only the width) every second or so.  Why it is doing this, I'm not
> sure.  However, this doesn't seem to actually be an error case yet the
> kernel logs are getting spammed with this message.
> 
> I'm wondering if we could either change this message to use sdata_dbg
> instead of sdata_info, or if we could possibly ratelimit it somehow.
> I'd be happy to come up with a patch for either, but I wanted to get
> your feedback on it before I started.  Do you have any objections or
> preference?
> 

I'm not sure ratelimiting it would even work - it's not *that* high
frequency? Not really sure though.

I think we can do either, it's not such a terribly important message as
far as I can tell.

johannes
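
To illustrate the doubt about rate limiting, here is a small userspace
sketch of a window-and-burst limiter. The names are hypothetical and this
is not the kernel's ratelimit implementation. With a window of a few
seconds and a burst of ten, a message arriving every two seconds is never
suppressed:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limiter: at most 'burst' messages per 'interval_ms' window. */
struct demo_ratelimit {
	int64_t interval_ms;
	int burst;
	int64_t window_start_ms;
	int emitted;
};

/* Returns true if a message at time 'now_ms' may be printed. */
static bool demo_ratelimit_allow(struct demo_ratelimit *rl, int64_t now_ms)
{
	/* Start a fresh window once the current one has expired. */
	if (now_ms - rl->window_start_ms >= rl->interval_ms) {
		rl->window_start_ms = now_ms;
		rl->emitted = 0;
	}

	if (rl->emitted < rl->burst) {
		rl->emitted++;
		return true;
	}
	return false;
}
```

Demoting the message to a debug level sidesteps the question, since it
does not depend on how often the AP changes its width.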


[PATCH net-next 2/4] ravb: Provide dev parameter to DMA API

2015-09-30 Thread Simon Horman
From: Kazuya Mizuguchi 

This patch is in preparation for using this driver on arm64 where the
implementation of __dma_alloc_coherent fails if a device parameter is not
provided.

Signed-off-by: Kazuya Mizuguchi 
Signed-off-by: Yoshihiro Shimoda 
Signed-off-by: Masaru Nagai 
[horms: squashed into a single patch]
Signed-off-by: Simon Horman 

---
* [horms]
  I have only tested this on arm64 using r8a7795/salvator-x.

v0 [Kazuya Mizuguchi, Yoshihiro Shimoda, Masaru Nagai]

v1 [Simon Horman]
* Squashed into a single patch

v2 [Simon Horman]
* No change

v4
* No change
---
 drivers/net/ethernet/renesas/ravb_main.c | 38 
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/renesas/ravb_main.c 
b/drivers/net/ethernet/renesas/ravb_main.c
index 450899e9cea2..4ca093d033f8 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -201,7 +201,7 @@ static void ravb_ring_free(struct net_device *ndev, int q)
if (priv->rx_ring[q]) {
ring_size = sizeof(struct ravb_ex_rx_desc) *
(priv->num_rx_ring[q] + 1);
-   dma_free_coherent(NULL, ring_size, priv->rx_ring[q],
+   dma_free_coherent(ndev->dev.parent, ring_size, priv->rx_ring[q],
  priv->rx_desc_dma[q]);
priv->rx_ring[q] = NULL;
}
@@ -209,7 +209,7 @@ static void ravb_ring_free(struct net_device *ndev, int q)
if (priv->tx_ring[q]) {
ring_size = sizeof(struct ravb_tx_desc) *
(priv->num_tx_ring[q] * NUM_TX_DESC + 1);
-   dma_free_coherent(NULL, ring_size, priv->tx_ring[q],
+   dma_free_coherent(ndev->dev.parent, ring_size, priv->tx_ring[q],
  priv->tx_desc_dma[q]);
priv->tx_ring[q] = NULL;
}
@@ -240,13 +240,13 @@ static void ravb_ring_format(struct net_device *ndev, int 
q)
rx_desc = &priv->rx_ring[q][i];
/* The size of the buffer should be on 16-byte boundary. */
rx_desc->ds_cc = cpu_to_le16(ALIGN(PKT_BUF_SZ, 16));
-   dma_addr = dma_map_single(&ndev->dev, priv->rx_skb[q][i]->data,
+   dma_addr = dma_map_single(ndev->dev.parent, 
priv->rx_skb[q][i]->data,
  ALIGN(PKT_BUF_SZ, 16),
  DMA_FROM_DEVICE);
/* We just set the data size to 0 for a failed mapping which
 * should prevent DMA from happening...
 */
-   if (dma_mapping_error(&ndev->dev, dma_addr))
+   if (dma_mapping_error(ndev->dev.parent, dma_addr))
rx_desc->ds_cc = cpu_to_le16(0);
rx_desc->dptr = cpu_to_le32(dma_addr);
rx_desc->die_dt = DT_FEMPTY;
@@ -309,7 +309,7 @@ static int ravb_ring_init(struct net_device *ndev, int q)
 
/* Allocate all RX descriptors. */
ring_size = sizeof(struct ravb_ex_rx_desc) * (priv->num_rx_ring[q] + 1);
-   priv->rx_ring[q] = dma_alloc_coherent(NULL, ring_size,
+   priv->rx_ring[q] = dma_alloc_coherent(ndev->dev.parent, ring_size,
  &priv->rx_desc_dma[q],
  GFP_KERNEL);
if (!priv->rx_ring[q])
@@ -320,7 +320,7 @@ static int ravb_ring_init(struct net_device *ndev, int q)
/* Allocate all TX descriptors. */
ring_size = sizeof(struct ravb_tx_desc) *
(priv->num_tx_ring[q] * NUM_TX_DESC + 1);
-   priv->tx_ring[q] = dma_alloc_coherent(NULL, ring_size,
+   priv->tx_ring[q] = dma_alloc_coherent(ndev->dev.parent, ring_size,
  &priv->tx_desc_dma[q],
  GFP_KERNEL);
if (!priv->tx_ring[q])
@@ -443,7 +443,7 @@ static int ravb_tx_free(struct net_device *ndev, int q)
size = le16_to_cpu(desc->ds_tagl) & TX_DS;
/* Free the original skb. */
if (priv->tx_skb[q][entry / NUM_TX_DESC]) {
-   dma_unmap_single(&ndev->dev, le32_to_cpu(desc->dptr),
+   dma_unmap_single(ndev->dev.parent, 
le32_to_cpu(desc->dptr),
 size, DMA_TO_DEVICE);
/* Last packet descriptor? */
if (entry % NUM_TX_DESC == NUM_TX_DESC - 1) {
@@ -546,7 +546,7 @@ static bool ravb_rx(struct net_device *ndev, int *quota, 
int q)
 
skb = priv->rx_skb[q][entry];
priv->rx_skb[q][entry] = NULL;
-   dma_unmap_single(&ndev->dev, le32_to_cpu(desc->dptr),
+   dma_unmap_single(ndev->dev.parent, 

[PATCH net-next 1/4] phylib: Add phy_set_max_speed helper

2015-09-30 Thread Simon Horman
Add a helper to allow ethernet drivers to limit the speed of a phy
(that they are attached to).

This mainly involves factoring out the business-end of
of_set_phy_supported() and exporting a new symbol.

This code seems to be open coded in several places, in several different
variants.

It is envisaged that this will be used in situations where setting the
"max-speed" property in DT is not appropriate, e.g. because the maximum
speed is not a property of the phy hardware.
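
A userspace sketch of the fall-through pattern the helper is built around.
The feature bits and the function name are hypothetical, not the real
PHY_*_FEATURES masks, and -EINVAL stands in for the kernel-internal
-ENOTSUPP. Each speed case falls through to the slower ones, so capping at
100 enables both the 100 and 10 bits:

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical feature bits; the real PHY_*_FEATURES masks differ. */
#define DEMO_10BT	(1u << 0)
#define DEMO_100BT	(1u << 1)
#define DEMO_1000BT	(1u << 2)

/* Limit the supported mask to 'max_speed' and everything slower. */
static int demo_set_max_speed(uint32_t *supported, unsigned int max_speed)
{
	uint32_t mask = 0;

	switch (max_speed) {
	default:
		return -EINVAL;	/* unknown speed: *supported is untouched */
	case 1000:
		mask |= DEMO_1000BT;
		/* fall through */
	case 100:
		mask |= DEMO_100BT;
		/* fall through */
	case 10:
		mask |= DEMO_10BT;
	}

	*supported = mask;
	return 0;
}
```

One difference from the patch is deliberate: the sketch builds the mask in
a local variable and only stores it on success, so an unknown speed leaves
the caller's value unchanged.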

Signed-off-by: Simon Horman 

---
v2
* First post

v3
* As suggested by Florian Fainelli
  - Do not check for !IS_ENABLED(CONFIG_OF_MDIO) in __set_phy_supported().
    This is already done in of_set_phy_supported() and is not relevant to
    phy_set_max_speed().
  - Return -ENOTSUPP if 'max_speed' is an unknown value
* As suggested by Sergei Shtylyov
  - White-space and comment enhancements.

v4
* No change
---
 drivers/net/phy/phy_device.c | 59 ++--
 include/linux/phy.h  |  1 +
 2 files changed, 41 insertions(+), 19 deletions(-)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index f761288abe66..383389146099 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1239,6 +1239,44 @@ static int gen10g_resume(struct phy_device *phydev)
return 0;
 }
 
+static int __set_phy_supported(struct phy_device *phydev, u32 max_speed)
+{
+   /* The default values for phydev->supported are provided by the PHY
+* driver "features" member, we want to reset to sane defaults first
+* before supporting higher speeds.
+*/
+   phydev->supported &= PHY_DEFAULT_FEATURES;
+
+   switch (max_speed) {
+   default:
+   return -ENOTSUPP;
+   case SPEED_1000:
+   phydev->supported |= PHY_1000BT_FEATURES;
+   /* fall through */
+   case SPEED_100:
+   phydev->supported |= PHY_100BT_FEATURES;
+   /* fall through */
+   case SPEED_10:
+   phydev->supported |= PHY_10BT_FEATURES;
+   }
+
+   return 0;
+}
+
+int phy_set_max_speed(struct phy_device *phydev, u32 max_speed)
+{
+   int err;
+
+   err = __set_phy_supported(phydev, max_speed);
+   if (err)
+   return err;
+
+   phydev->advertising = phydev->supported;
+
+   return 0;
+}
+EXPORT_SYMBOL(phy_set_max_speed);
+
 static void of_set_phy_supported(struct phy_device *phydev)
 {
struct device_node *node = phydev->dev.of_node;
@@ -1250,25 +1288,8 @@ static void of_set_phy_supported(struct phy_device 
*phydev)
if (!node)
return;
 
-   if (!of_property_read_u32(node, "max-speed", &max_speed)) {
-   /* The default values for phydev->supported are provided by the 
PHY
-* driver "features" member, we want to reset to sane defaults 
fist
-* before supporting higher speeds.
-*/
-   phydev->supported &= PHY_DEFAULT_FEATURES;
-
-   switch (max_speed) {
-   default:
-   return;
-
-   case SPEED_1000:
-   phydev->supported |= PHY_1000BT_FEATURES;
-   case SPEED_100:
-   phydev->supported |= PHY_100BT_FEATURES;
-   case SPEED_10:
-   phydev->supported |= PHY_10BT_FEATURES;
-   }
-   }
+   if (!of_property_read_u32(node, "max-speed", &max_speed))
+   __set_phy_supported(phydev, max_speed);
 }
 
 /**
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 4a4e3a092337..4c477e6ece33 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -798,6 +798,7 @@ int phy_mii_ioctl(struct phy_device *phydev, struct ifreq 
*ifr, int cmd);
 int phy_start_interrupts(struct phy_device *phydev);
 void phy_print_status(struct phy_device *phydev);
 void phy_device_free(struct phy_device *phydev);
+int phy_set_max_speed(struct phy_device *phydev, u32 max_speed);
 
 int phy_register_fixup(const char *bus_id, u32 phy_uid, u32 phy_uid_mask,
   int (*run)(struct phy_device *));
-- 
2.1.4



[PATCH net-next 4/4] ravb: Add support for r8a7795 SoC

2015-09-30 Thread Simon Horman
From: Kazuya Mizuguchi 

This patch supports the r8a7795 SoC by:
- Using two interrupts
  + One for E-MAC
  + One for everything else
  + Both can be handled by the existing common interrupt handler, which
affords a simpler update to support the new SoC. In future some
consideration may be given to implementing multiple interrupt handlers
- Limiting the phy speed to 100Mbit/s for the new SoC;
  at this time it is not clear how this restriction may be lifted
  but I hope it will be possible as more information comes to light
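
A userspace sketch of the per-SoC selection described above, with
hypothetical demo_* names: a table maps each compatible string to a chip
generation, and the generation decides how many interrupts to request. The
lookup reports failure explicitly rather than assuming a match exists:

```c
#include <stddef.h>
#include <string.h>

enum demo_chip_id { DEMO_GEN2, DEMO_GEN3 };

struct demo_match {
	const char *compatible;
	enum demo_chip_id chip_id;
};

static const struct demo_match demo_match_table[] = {
	{ "renesas,etheravb-r8a7790", DEMO_GEN2 },
	{ "renesas,etheravb-r8a7794", DEMO_GEN2 },
	{ "renesas,etheravb-r8a7795", DEMO_GEN3 },
	{ NULL, DEMO_GEN2 }	/* sentinel */
};

/* Returns 0 and fills *id on a match, -1 if the string is unknown. */
static int demo_lookup_chip(const char *compatible, enum demo_chip_id *id)
{
	const struct demo_match *m;

	for (m = demo_match_table; m->compatible; m++) {
		if (strcmp(m->compatible, compatible) == 0) {
			*id = m->chip_id;
			return 0;
		}
	}
	return -1;
}

/* Gen3 adds a separate E-MAC interrupt; Gen2 uses one muxed line. */
static int demo_irq_count(enum demo_chip_id id)
{
	return id == DEMO_GEN3 ? 2 : 1;
}
```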

Signed-off-by: Kazuya Mizuguchi 
[horms: reworked]
Signed-off-by: Simon Horman 

---
v0 [Kazuya Mizuguchi]

v1 [Simon Horman]
* Updated patch subject

v2 [Simon Horman]
* Reworked based on extensive feedback from
  Geert Uytterhoeven and Sergei Shtylyov.
* Broke binding update out into separate patch

v3 [Simon Horman]
* Check new return value of phy_set_max_speed()

v4
* No change
---
 drivers/net/ethernet/renesas/ravb.h  |  7 
 drivers/net/ethernet/renesas/ravb_main.c | 63 
 2 files changed, 62 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/renesas/ravb.h 
b/drivers/net/ethernet/renesas/ravb.h
index a157ff6a..0623fff932e4 100644
--- a/drivers/net/ethernet/renesas/ravb.h
+++ b/drivers/net/ethernet/renesas/ravb.h
@@ -766,6 +766,11 @@ struct ravb_ptp {
struct ravb_ptp_perout perout[N_PER_OUT];
 };
 
+enum ravb_chip_id {
+   RCAR_GEN2,
+   RCAR_GEN3,
+};
+
 struct ravb_private {
struct net_device *ndev;
struct platform_device *pdev;
@@ -806,6 +811,8 @@ struct ravb_private {
int msg_enable;
int speed;
int duplex;
+   int emac_irq;
+   enum ravb_chip_id chip_id;
 
unsigned no_avb_link:1;
unsigned avb_link_active_low:1;
diff --git a/drivers/net/ethernet/renesas/ravb_main.c 
b/drivers/net/ethernet/renesas/ravb_main.c
index 4ca093d033f8..8cc5ec5ed19a 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -889,6 +889,22 @@ static int ravb_phy_init(struct net_device *ndev)
return -ENOENT;
}
 
+   /* This driver only supports 10/100Mbit speeds on Gen3
+* at this time.
+*/
+   if (priv->chip_id == RCAR_GEN3) {
+   int err;
+
+   err = phy_set_max_speed(phydev, SPEED_100);
+   if (err) {
+   netdev_err(ndev, "failed to limit PHY to 100Mbit/s\n");
+   phy_disconnect(phydev);
+   return err;
+   }
+
+   netdev_info(ndev, "limited PHY to 100Mbit/s\n");
+   }
+
netdev_info(ndev, "attached PHY %d (IRQ %d) to driver %s\n",
phydev->addr, phydev->irq, phydev->drv->name);
 
@@ -1197,6 +1213,15 @@ static int ravb_open(struct net_device *ndev)
goto out_napi_off;
}
 
+   if (priv->chip_id == RCAR_GEN3) {
+   error = request_irq(priv->emac_irq, ravb_interrupt,
+   IRQF_SHARED, ndev->name, ndev);
+   if (error) {
+   netdev_err(ndev, "cannot request IRQ\n");
+   goto out_free_irq;
+   }
+   }
+
/* Device init */
error = ravb_dmac_init(ndev);
if (error)
@@ -1220,6 +1245,7 @@ out_ptp_stop:
ravb_ptp_stop(ndev);
 out_free_irq:
free_irq(ndev->irq, ndev);
+   free_irq(priv->emac_irq, ndev);
 out_napi_off:
napi_disable(&priv->napi[RAVB_NC]);
napi_disable(&priv->napi[RAVB_BE]);
@@ -1625,10 +1651,20 @@ static int ravb_mdio_release(struct ravb_private *priv)
return 0;
 }
 
+static const struct of_device_id ravb_match_table[] = {
+   { .compatible = "renesas,etheravb-r8a7790", .data = (void *)RCAR_GEN2 },
+   { .compatible = "renesas,etheravb-r8a7794", .data = (void *)RCAR_GEN2 },
+   { .compatible = "renesas,etheravb-r8a7795", .data = (void *)RCAR_GEN3 },
+   { }
+};
+MODULE_DEVICE_TABLE(of, ravb_match_table);
+
 static int ravb_probe(struct platform_device *pdev)
 {
struct device_node *np = pdev->dev.of_node;
+   const struct of_device_id *match;
struct ravb_private *priv;
+   enum ravb_chip_id chip_id;
struct net_device *ndev;
int error, irq, q;
struct resource *res;
@@ -1657,7 +1693,14 @@ static int ravb_probe(struct platform_device *pdev)
/* The Ether-specific entries in the device structure. */
ndev->base_addr = res->start;
ndev->dma = -1;
-   irq = platform_get_irq(pdev, 0);
+
+   match = of_match_device(of_match_ptr(ravb_match_table), &pdev->dev);
+   chip_id = (enum ravb_chip_id)match->data;
+
+   if (chip_id == RCAR_GEN3)
+   irq = platform_get_irq_byname(pdev, "ch22");
+   else
+   irq = platform_get_irq(pdev, 

[PATCH net-next 3/4] ravb: Document binding for r8a7795 SoC

2015-09-30 Thread Simon Horman
From: Kazuya Mizuguchi 

This patch updates the ravb binding to support the r8a7795 SoC by:
- Adding a compat string for the new hardware
- Adding 25 named interrupts to binding for the new SoC;
  older SoCs continue to use a single multiplexed interrupt

The example is also updated to reflect the r8a7795 as this is the
more complex case.

Based on work by Kazuya Mizuguchi and others.

Signed-off-by: Simon Horman 
Acked-by: Geert Uytterhoeven 

---
v2
* First post; broken out of a driver update patch
* As discussed with Geert Uytterhoeven and Sergei Shtylyov
  - Binding: Make all interrupts mandatory as named-interrupts of
the form ch%u

v3
* As suggested by Geert Uytterhoeven
  - Reword description of interrupts and interrupt-names to
make things clearer. It is now based to some extent on
spi-rspi.txt and renesas,usb-dmac.txt.
* As suggested by Sergei Shtylyov
  - Drop phy-reset-gpio from example
* Added power-domains to example

v4
* As suggested by Geert Uytterhoeven
  - grammar fix for interrupt-names description
* Add ack
---
 .../devicetree/bindings/net/renesas,ravb.txt   | 69 +++---
 1 file changed, 62 insertions(+), 7 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/renesas,ravb.txt 
b/Documentation/devicetree/bindings/net/renesas,ravb.txt
index 1fd8831437bf..b486f3f5f6a3 100644
--- a/Documentation/devicetree/bindings/net/renesas,ravb.txt
+++ b/Documentation/devicetree/bindings/net/renesas,ravb.txt
@@ -6,8 +6,12 @@ interface contains.
 Required properties:
 - compatible: "renesas,etheravb-r8a7790" if the device is a part of R8A7790 
SoC.
  "renesas,etheravb-r8a7794" if the device is a part of R8A7794 SoC.
+ "renesas,etheravb-r8a7795" if the device is a part of R8A7795 SoC.
 - reg: offset and length of (1) the register block and (2) the stream buffer.
-- interrupts: interrupt specifier for the sole interrupt.
+- interrupts: A list of interrupt-specifiers, one for each entry in
+ interrupt-names.
+ If interrupt-names is not present, an interrupt specifier
+ for a single muxed interrupt.
 - phy-mode: see ethernet.txt file in the same directory.
 - phy-handle: see ethernet.txt file in the same directory.
 - #address-cells: number of address cells for the MDIO bus, must be equal to 1.
@@ -18,6 +22,12 @@ Required properties:
 Optional properties:
 - interrupt-parent: the phandle for the interrupt controller that services
interrupts for this device.
+- interrupt-names: A list of interrupt names.
+  For the R8A7795 SoC this property is mandatory;
+  it should include one entry per channel, named "ch%u",
+  where %u is the channel number ranging from 0 to 24.
+  For other SoCs this property is optional; if present
+  it should contain "mux" for a single muxed interrupt.
 - pinctrl-names: pin configuration state name ("default").
 - renesas,no-ether-link: boolean, specify when a board does not provide a 
proper
 AVB_LINK signal.
@@ -27,13 +37,46 @@ Optional properties:
 Example:
 
ethernet@e680 {
-   compatible = "renesas,etheravb-r8a7790";
-   reg = <0 0xe680 0 0x800>, <0 0xee0e8000 0 0x4000>;
+   compatible = "renesas,etheravb-r8a7795";
+   reg = <0 0xe680 0 0x800>, <0 0xe6a0 0 0x1>;
interrupt-parent = <>;
-   interrupts = <0 163 IRQ_TYPE_LEVEL_HIGH>;
-   clocks = <_clks R8A7790_CLK_ETHERAVB>;
-   phy-mode = "rmii";
+   interrupts = ,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+;
+   interrupt-names = "ch0", "ch1", "ch2", "ch3",
+ "ch4", "ch5", "ch6", "ch7",
+ "ch8", "ch9", "ch10", "ch11",
+ "ch12", "ch13", "ch14", "ch15",
+ "ch16", "ch17", "ch18", "ch19",
+ "ch20", "ch21", "ch22", "ch23",
+ "ch24";
+   clocks = <_clks R8A7795_CLK_ETHERAVB>;
+   

[PATCH net-next 0/4] ravb: Add support for r8a7795 SoC

2015-09-30 Thread Simon Horman
Dave,

please consider this series for net-next.
It enhances the ravb driver to support the r8a7795 SoC.

Changes:

* Dropped RFC prefix
* Details in changelog of individual patches

Base:

* net-next/master

Availability:

To aid review of this in conjunction with other EtherAVB changes
the following branches are available in my renesas tree on kernel.org.

* me/r8a7795-ravb-driver-v4: this series
* me/r8a7795-ravb-pfc-v2: r8a7795 sh-pfc update for EthernetAVB
* me/r8a7795-ravb-integration-v4: enable EthernetAVB on r8a7795
* me/r8a7795-ravb-driver-and-integration-v4.runtime:
  the above three branches with their runtime dependencies

Kazuya Mizuguchi (3):
  ravb: Provide dev parameter to DMA API
  ravb: Document binding for r8a7795 SoC
  ravb: Add support for r8a7795 SoC

Simon Horman (1):
  phylib: Add phy_set_max_speed helper

 .../devicetree/bindings/net/renesas,ravb.txt   |  69 --
 drivers/net/ethernet/renesas/ravb.h|   7 ++
 drivers/net/ethernet/renesas/ravb_main.c   | 101 +++--
 drivers/net/phy/phy_device.c   |  59 
 include/linux/phy.h|   1 +
 5 files changed, 184 insertions(+), 53 deletions(-)

-- 
2.1.4



Re: [PATCH 1/5] ntp/pps: use timespec64 for hardpps()

2015-09-30 Thread Thomas Gleixner
On Mon, 28 Sep 2015, Arnd Bergmann wrote:

> There is only one user of the hardpps function in the kernel, so
> it makes sense to atomically change it over to using 64-bit
> timestamps for y2038 safety. In the hardpps implementation,
> we also need to change the pps_normtime structure, which is
> similar to struct timespec and also requires a 64-bit
> seconds portion.
> 
> This introduces two temporary variables in pps_kc_event() to
> do the conversion, they will be removed again in the next step,
> which seemed preferable to having a larger patch changing it
> all at the same time.
> 
> Signed-off-by: Arnd Bergmann 

Reviewed-by: Thomas Gleixner 
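
For context on the y2038 motivation: a signed 32-bit seconds counter tops
out at 2147483647, which is 03:14:07 UTC on 19 January 2038. A small
userspace sketch, with hypothetical names rather than the kernel's
timespec64 definitions, of a 64-bit seconds field and a check for whether
a value would still fit the old 32-bit representation:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a timestamp with a 64-bit seconds portion. */
struct demo_timespec64 {
	int64_t tv_sec;
	long tv_nsec;
};

/* True if the seconds value still fits a signed 32-bit counter. */
static bool demo_fits_time32(const struct demo_timespec64 *ts)
{
	return ts->tv_sec >= INT32_MIN && ts->tv_sec <= INT32_MAX;
}
```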


Re: [PATCH 2/5] ntp/pps: replace getnstime_raw_and_real with 64-bit version

2015-09-30 Thread Thomas Gleixner
On Mon, 28 Sep 2015, Arnd Bergmann wrote:

> There is exactly one caller of getnstime_raw_and_real in the kernel,
> which is the pps_get_ts function. This changes the caller and
> the implementation to work on timespec64 types rather than timespec,
> to avoid the time_t overflow on 32-bit architectures.
> 
> For consistency with the other new functions (ktime_get_seconds,
> ktime_get_real_*, ...), I'm renaming the function to
> ktime_get_raw_and_real_ts64.
> 
> We still need to convert from the internal 64-bit type to 32 bit
> types in the caller, but this conversion is now pushed out from
> getnstime_raw_and_real to pps_get_ts. A follow-up patch changes
> the remaining pps code to completely avoid the conversion.
> 
> Signed-off-by: Arnd Bergmann 

Reviewed-by: Thomas Gleixner 


[RFC PATCH 2/3] net: dsa: complete dsa_switch_destroy calls

2015-09-30 Thread Neil Armstrong
When unbinding dsa, complete the dsa_switch_destroy to cleanly
destroy and unregister the net and mdio devices.

Signed-off-by: Neil Armstrong 
---
 net/dsa/dsa.c | 42 ++
 1 file changed, 42 insertions(+)

diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index 98f94c2..0c104af 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "dsa_priv.h"

 char dsa_driver_version[] = "0.1";
@@ -420,10 +421,51 @@ dsa_switch_setup(struct dsa_switch_tree *dst, int index,

 static void dsa_switch_destroy(struct dsa_switch *ds)
 {
+   struct device_node *port_dn;
+   struct phy_device *phydev;
+   struct dsa_chip_data *cd = ds->pd;
+   int port;
+
 #ifdef CONFIG_NET_DSA_HWMON
if (ds->hwmon_dev)
hwmon_device_unregister(ds->hwmon_dev);
 #endif
+
+   /* Disable configuration of the CPU and DSA ports */
+   for (port = 0; port < DSA_MAX_PORTS; port++) {
+   if (!(dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port)))
+   continue;
+
+   port_dn = cd->port_dn[port];
+   if (of_phy_is_fixed_link(port_dn)) {
+   phydev = of_phy_find_device(port_dn);
+   if (phydev) {
+   int addr = phydev->addr;
+   phy_device_free(phydev);
+   of_node_put(port_dn);
+   fixed_phy_del(addr);
+   }
+   }
+   }
+
+   /*
+* Destroy network devices for physical switch ports.
+*/
+   for (port = 0; port < DSA_MAX_PORTS; port++) {
+   if (!(ds->phys_port_mask & (1 << port)))
+   continue;
+
+   if (!ds->ports[port])
+   continue;
+
+   unregister_netdev(ds->ports[port]);
+   free_netdev(ds->ports[port]);
+   }
+
+   /*
+* Do basic unregister.
+*/
+   mdiobus_unregister(ds->slave_mii_bus);
 }

 #ifdef CONFIG_PM_SLEEP
-- 
1.9.1


[RFC PATCH 0/3] net: dsa: Complete and fix the dsa unbinding

2015-09-30 Thread Neil Armstrong
In order to cleanly unbind the dsa core, either as a module removal
or a platform device unbind, switch the allocations to their devm_
counterparts and complete the destroy functions.

The last patch is an experimental way to exit the probe when no
switch is found in the discover process.

The patches are based on the current net-next.

Neil Armstrong (3):
  net: dsa: Use devm_ prefixed allocations
  net: dsa: complete dsa_switch_destroy calls
  net: dsa: exit probe if no switch were found

 net/dsa/dsa.c | 67 ---
 1 file changed, 60 insertions(+), 7 deletions(-)

-- 
1.9.1


[RFC PATCH 1/3] net: dsa: Use devm_ prefixed allocations

2015-09-30 Thread Neil Armstrong
To simplify and prevent memory leakage when unbinding, use
the devm_ memory allocation calls.

Signed-off-by: Neil Armstrong 
---
 net/dsa/dsa.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index c59fa5d..98f94c2 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -305,7 +305,7 @@ static int dsa_switch_setup_one(struct dsa_switch *ds, 
struct device *parent)
if (ret < 0)
goto out;

-   ds->slave_mii_bus = mdiobus_alloc();
+   ds->slave_mii_bus = devm_mdiobus_alloc(parent);
if (ds->slave_mii_bus == NULL) {
ret = -ENOMEM;
goto out;
@@ -400,7 +400,7 @@ dsa_switch_setup(struct dsa_switch_tree *dst, int index,
/*
 * Allocate and initialise switch state.
 */
-   ds = kzalloc(sizeof(*ds) + drv->priv_size, GFP_KERNEL);
+   ds = devm_kzalloc(parent, sizeof(*ds) + drv->priv_size, GFP_KERNEL);
if (ds == NULL)
return ERR_PTR(-ENOMEM);

@@ -883,7 +883,7 @@ static int dsa_probe(struct platform_device *pdev)
goto out;
}

-   dst = kzalloc(sizeof(*dst), GFP_KERNEL);
+   dst = devm_kzalloc(&pdev->dev, sizeof(*dst), GFP_KERNEL);
if (dst == NULL) {
dev_put(dev);
ret = -ENOMEM;
-- 
1.9.1


[RFC PATCH 3/3] net: dsa: exit probe if no switch were found

2015-09-30 Thread Neil Armstrong
If no switch was found in dsa_setup_dst, return -ENODEV and
exit dsa_probe cleanly.

Signed-off-by: Neil Armstrong 
---
 net/dsa/dsa.c | 19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index 0c104af..6ae1ab9 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -844,10 +844,11 @@ static inline void dsa_of_remove(struct device *dev)
 }
 #endif

-static void dsa_setup_dst(struct dsa_switch_tree *dst, struct net_device *dev,
+static int dsa_setup_dst(struct dsa_switch_tree *dst, struct net_device *dev,
  struct device *parent, struct dsa_platform_data *pd)
 {
int i;
+   unsigned configured = 0;

dst->pd = pd;
dst->master_netdev = dev;
@@ -867,9 +868,17 @@ static void dsa_setup_dst(struct dsa_switch_tree *dst, 
struct net_device *dev,
dst->ds[i] = ds;
if (ds->drv->poll_link != NULL)
dst->link_poll_needed = 1;
+
+   ++configured;
}

/*
+* If no switch was found, exit cleanly
+*/
+   if (!configured)
+   return -ENODEV;
+
+   /*
 * If we use a tagging format that doesn't have an ethertype
 * field, make sure that all packets from this point on get
 * sent to the tag format's receive function.
@@ -885,6 +894,8 @@ static void dsa_setup_dst(struct dsa_switch_tree *dst, 
struct net_device *dev,
dst->link_poll_timer.expires = round_jiffies(jiffies + HZ);
add_timer(&dst->link_poll_timer);
}
+
+   return 0;
 }

 static int dsa_probe(struct platform_device *pdev)
@@ -934,9 +945,9 @@ static int dsa_probe(struct platform_device *pdev)

platform_set_drvdata(pdev, dst);

-   dsa_setup_dst(dst, dev, &pdev->dev, pd);
-
-   return 0;
+   ret = dsa_setup_dst(dst, dev, &pdev->dev, pd);
+   if (!ret)
+   return 0;

 out:
dsa_of_remove(&pdev->dev);
-- 
1.9.1


Re: [PATCH RFC 3/7] netfilter: add NF_INET_LOCAL_SOCKET_IN chain type

2015-09-30 Thread Daniel Mack
On 09/30/2015 09:40 AM, Jan Engelhardt wrote:
> 
> On Wednesday 2015-09-30 09:24, Daniel Mack wrote:
>>
>>> Drop?  Makes no sense, else application would not be running in the first
>>> place.
>>
>> Of course you can drop certain packets at this point, depending on other
>> details. Say, for instance, you want to match all packets that are
>> received by a certain task [...]
>> Another use case is accounting. If you want to know how much traffic a
>> certain service or application in your system has caused
> 
> But the sk info would be available in INPUT already, would it not?

No, only for established connections, as those are subject to early
demux which sets skb->sk. For all other packets, netfilter callbacks are
called with skb->sk == NULL.

That's the whole point of this patch set ;)


Daniel



Re: [PATCH RFC 3/7] netfilter: add NF_INET_LOCAL_SOCKET_IN chain type

2015-09-30 Thread Daniel Mack
On 09/29/2015 11:19 PM, Florian Westphal wrote:
> Daniel Mack  wrote:
>> Add a new chain type NF_INET_LOCAL_SOCKET_IN which is run after the
>> input demux is complete and the final destination socket (if any)
>> has been determined.
>>
>> This helps filtering packets based on information stored in the
>> destination socket, such as cgroup controller supplied net class IDs.
> 
> This still seems like the 'x y' problem ("want to do X, think Y is
> correct solution; ask about Y, but that's a strange thing to do").
> 
> There is nothing that this offers over INPUT *except* that sk is
> available.  But there is zero benefit as far as I am concerned --
> why would you want to do any meaningful filtering based on the sk at
> that point...?

Well, INPUT and SOCKET_INPUT are just two different tools that help
solve different classes of problems. INPUT is for filtering all local
traffic, while SOCKET_INPUT is only for traffic that actually has a
listener, and they both make sense in different scenarios.

> Drop?  Makes no sense, else application would not be running in the first
> place.

Of course you can drop certain packets at this point, depending on other
details. Say, for instance, you want to match all packets that are
received by a certain task and that originate from IP addresses of
a specific subnet, and drop the rest. Rather than adding matches to your
global firewall configuration for all the ports that tasks may or may
not listen on, you can just do it on a higher level, from the
perspective of an administrator. If you decide to let your web server
listen on another port as well, no firewall rule configuration change is
needed at all.

Another use case is accounting. If you want to know how much traffic a
certain service or application in your system has caused, you don't want
to match all its ports to firewall rules just in order to get that
information. Instead, you can now derive that information on a
per-application basis. With this patch set, this even works just fine for
multicast listeners, which is something that is currently impossible to
achieve otherwise.

> So the only 'benefit' is that netcls id is available; but
> a) why is that even needed and

It's currently the only way of realizing application-level firewalls,
and it'd be an awesome feature if it actually worked.

> b) is such a huge sledgehammer just for net cgroup accounting
> worth it?

I really don't know if this approach is intrusive enough to make it
qualify as a sledgehammer. I'd like to see some real-world benchmarks and
have proof there is a performance decrease for setups that don't use
such chains.

> Another question is what other strange things come up once we would
> open this door.

So let's discuss the possible drawbacks.

Again, the deal with this new chain type is simple: if there is no local
listener, the rules are not looked at. If you need rules that are
processed either way, put them in LOCAL_IN, as you always did.

>> listening on a specific task, the resulting error code that is sent
>> back to the remote peer can't be controlled with rules in
>> NF_INET_LOCAL_SOCKET_IN chains.
> 
> Right, and that makes this even weirder.

Well, to be more specific: you can only control the resulting error code
that is sent back to the remote peer _if_ there is a local listener. You
can do _anything_ _if_ there is a local listener. This is in line with
the above description and shouldn't cause many surprises for users.

> For deterministic ingress filtering you can only rely on what
> is contained in the packet.

Why so? For deterministic ingress filtering of traffic directed to a
local socket, you can as well rely on information associated with that
socket. And this is what application-level firewall rule sets are all about.


Daniel



Re: [PATCH 3/5] ntp: use timespec64 in sync_cmos_clock

2015-09-30 Thread Thomas Gleixner
On Mon, 28 Sep 2015, Arnd Bergmann wrote:
> The sync_cmos_clock has one use of struct timespec, which we want to
> eventually replace with timespec64 or similar in the kernel. There
> is no way this one can overflow, but the conversion to timespec64
> is trivial and has no other dependencies.
> 
> Signed-off-by: Arnd Bergmann 

Reviewed-by: Thomas Gleixner 


Re: [PATCH net-next 1/2] openvswitch: add tunnel protocol to sw_flow_key

2015-09-30 Thread Jiri Benc
On Tue, 29 Sep 2015 13:41:34 -0700, Pravin Shelar wrote:
> We can rather add a TUNNEL_IPV6 flag to distinguish IPv4 and IPv6
> tunnel keys. This can be stored in ip_tunnel_key.tun_flags.

Not really. This was my original approach, too, but openvswitch is not
the only user of struct ip_tunnel_key, and in the lwtunnel core,
tun_flags are handled in the way that makes this impractical. Most
importantly, the tun_flags value is directly taken from/stored to
LWTUNNEL_IP_FLAGS/LWTUNNEL_IP6_FLAGS netlink attributes in
net/ipv4/ip_tunnel_core.c. This would mean complicated masking, etc.

> That also saves space in flow key.

The field was added to a 2 byte hole in the struct sw_flow_key (leaving
still 1 byte free), thus there's no additional space used.

 Jiri

-- 
Jiri Benc


[PATCH v5 net-next 0/2] ipv4: Hash-based multipath routing

2015-09-30 Thread Peter Nørlund
When the routing cache was removed in 3.6, the IPv4 multipath algorithm changed
from more or less being destination-based into being quasi-random per-packet
scheduling. This increases the risk of out-of-order packets and makes it
impossible to use multipath together with anycast services.

This patch series replaces the old implementation with flow-based load
balancing based on a hash over the source and destination addresses.

Distribution of the hash is done with thresholds as described in RFC 2992.
This reduces the disruption when a path is added or removed and there are
more than two paths.
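
As a rough userspace illustration of the threshold scheme (a sketch, not
the patch itself): each live next hop gets an inclusive upper bound in the
31-bit hash space, proportional to its cumulative weight, and a packet goes
to the first next hop whose bound is not below its hash. The bound
computation follows fib_rebalance() in patch 1/2; struct nexthop,
rebalance() and select_nexthop() are hypothetical names, and the selection
loop is my assumption of how the bounds are consumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical next-hop record for this sketch; not struct fib_nh. */
struct nexthop {
	int weight;		/* configured weight */
	int alive;		/* 0 if the next hop must be skipped */
	int32_t upper_bound;	/* inclusive end of its hash region, -1 if unused */
};

/* Give each live next hop an inclusive upper bound in [0, 2^31 - 1]. */
static void rebalance(struct nexthop *nh, size_t n)
{
	int64_t total = 0, w = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (nh[i].alive)
			total += nh[i].weight;

	for (i = 0; i < n; i++) {
		if (!nh[i].alive || total == 0) {
			nh[i].upper_bound = -1;	/* never matches a hash >= 0 */
			continue;
		}
		w += nh[i].weight;
		/* round(2^31 * w / total) - 1, rounding to nearest */
		nh[i].upper_bound =
			(int32_t)(((INT64_C(1) << 31) * w + total / 2) / total - 1);
	}
}

/* Return the index of the first next hop whose region holds the hash. */
static int select_nexthop(const struct nexthop *nh, size_t n, int32_t hash)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (hash <= nh[i].upper_bound)
			return (int)i;
	return -1;
}
```

With two equal-weight paths the bounds come out as 2^30 - 1 and 2^31 - 1,
so each path owns half of the hash space. A dead next hop gets -1 and is
skipped because the hash is never negative.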

To further the chance of successful usage in conjunction with anycast, ICMP
error packets are hashed over the inner IP addresses. This ensures that PMTU
will work together with anycast or load-balancers such as IPVS.
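
To make the ICMP case concrete, here is a small userspace sketch. It is an
illustration only: flow_hash() is a stand-in mixing function, not the
jhash-based fib_multipath_hash() from the patches, and struct addrs is a
hypothetical type. An ICMP error sent towards a server embeds the offending
packet, whose source is the server and whose destination is the client.
Hashing the inner addresses in reverse order therefore yields the same
value as hashing the client-to-server flow, so the error follows that flow.

```c
#include <stdint.h>

/* Hypothetical pair of IPv4 addresses, as found in an IP header. */
struct addrs {
	uint32_t saddr;
	uint32_t daddr;
};

/* Stand-in 31-bit hash for this sketch; NOT the kernel's jhash. */
static uint32_t flow_hash(uint32_t saddr, uint32_t daddr)
{
	uint32_t h = saddr * 2654435761u;

	h ^= daddr + 0x9e3779b9u + (h << 6) + (h >> 2);
	return h >> 1;
}

/* Hash an ICMP error by its embedded (inner) header, addresses swapped. */
static uint32_t icmp_error_hash(struct addrs inner)
{
	return flow_hash(inner.daddr, inner.saddr);
}
```

The outer source address of the ICMP error (some router on the path) plays
no role here, which is the point: only the flow the error refers to decides
the path.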

Port numbers are not considered since fragments could cause problems with
anycast and IPVS. Relying on the DF-flag for TCP packets is also insufficient,
since ICMP inspection effectively extracts information from the opposite
flow which might have a different state of the DF-flag. This is also why the
RSS hash is not used. These are typically based on the NDIS RSS spec which
mandates TCP support.

Measurements of the additional overhead of a two-path multipath
(ip_mkroute_input excl. __mkroute_input) on a Xeon X3550 (4 cores, 2.66GHz):

Original per-packet: ~394 cycles/packet
L3 hash:  ~76 cycles/packet

Changes in v5:
- Fixed compilation error

Changes in v4:
- Functions take hash directly instead of func ptr
- Added inline hash function
- Added dummy macros to minimize ifdefs
- Use upper 31 bits of hash instead of lower

Changes in v3:
- Multipath algorithm is no longer configurable (always L3)
- Added random seed to hash
- Moved ICMP inspection to isolated function
- Ignore source quench packets (deprecated as per RFC 6633)

Changes in v2:
- Replaced 8-bit xor hash with 31-bit jenkins hash
- Don't scale weights (since 31-bit)
- Avoided unnecessary renaming of variables
- Rely on DF-bit instead of fragment offset when checking for fragmentation
- upper_bound is now inclusive to avoid overflow
- Use a callback to postpone extracting flow information until necessary
- Skipped ICMP inspection entirely with L4 hashing
- Handle newly added sysctl ignore_routes_with_linkdown

Best Regards
 Peter Nørlund


Peter Nørlund (2):
  ipv4: L3 hash-based multipath
  ipv4: ICMP packet inspection for multipath


 include/net/ip_fib.h |   14 -
 include/net/route.h  |   11 +++-
 net/ipv4/fib_semantics.c |  140 ++
 net/ipv4/icmp.c  |   19 +-
 net/ipv4/route.c |   65 ++--
 5 files changed, 173 insertions(+), 76 deletions(-)



[PATCH v5 net-next 2/2] ipv4: ICMP packet inspection for multipath

2015-09-30 Thread Peter Nørlund
From: Peter Nørlund 

ICMP packets are inspected to let them route together with the flow they
belong to, minimizing the chance that a problematic path will affect flows
on other paths, and so that anycast environments can work with ECMP.

Signed-off-by: Peter Nørlund 
---
 include/net/route.h |   11 +-
 net/ipv4/icmp.c |   19 -
 net/ipv4/route.c|   59 +--
 3 files changed, 80 insertions(+), 9 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index f46af25..7d79c05 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -110,7 +111,15 @@ struct in_device;
 int ip_rt_init(void);
 void rt_cache_flush(struct net *net);
 void rt_flush_dev(struct net_device *dev);
-struct rtable *__ip_route_output_key(struct net *, struct flowi4 *flp);
+struct rtable *__ip_route_output_key_hash(struct net *, struct flowi4 *flp,
+ int mp_hash);
+
+static inline struct rtable *__ip_route_output_key(struct net *net,
+  struct flowi4 *flp)
+{
+   return __ip_route_output_key_hash(net, flp, -1);
+}
+
 struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
struct sock *sk);
 struct dst_entry *ipv4_blackhole_route(struct net *net,
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index e5eb8ac..b3a1620 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -440,6 +440,22 @@ out_unlock:
icmp_xmit_unlock(sk);
 }
 
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+
+/* Source and destination is swapped. See ip_multipath_icmp_hash */
+static int icmp_multipath_hash_skb(const struct sk_buff *skb)
+{
+   const struct iphdr *iph = ip_hdr(skb);
+
+   return fib_multipath_hash(iph->daddr, iph->saddr);
+}
+
+#else
+
+#define icmp_multipath_hash_skb(skb) (-1)
+
+#endif
+
 static struct rtable *icmp_route_lookup(struct net *net,
struct flowi4 *fl4,
struct sk_buff *skb_in,
@@ -464,7 +480,8 @@ static struct rtable *icmp_route_lookup(struct net *net,
fl4->flowi4_oif = vrf_master_ifindex(skb_in->dev);
 
security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4));
-   rt = __ip_route_output_key(net, fl4);
+   rt = __ip_route_output_key_hash(net, fl4,
+   icmp_multipath_hash_skb(skb_in));
if (IS_ERR(rt))
return rt;
 
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 64367f3..a2479a4 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1646,6 +1646,48 @@ out:
return err;
 }
 
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+
+/* To make ICMP packets follow the right flow, the multipath hash is
+ * calculated from the inner IP addresses in reverse order.
+ */
+static int ip_multipath_icmp_hash(struct sk_buff *skb)
+{
+   const struct iphdr *outer_iph = ip_hdr(skb);
+   struct icmphdr _icmph;
+   const struct icmphdr *icmph;
+   struct iphdr _inner_iph;
+   const struct iphdr *inner_iph;
+
+   if (unlikely((outer_iph->frag_off & htons(IP_OFFSET)) != 0))
+   goto standard_hash;
+
+   icmph = skb_header_pointer(skb, outer_iph->ihl * 4, sizeof(_icmph),
+  &_icmph);
+   if (!icmph)
+   goto standard_hash;
+
+   if (icmph->type != ICMP_DEST_UNREACH &&
+   icmph->type != ICMP_REDIRECT &&
+   icmph->type != ICMP_TIME_EXCEEDED &&
+   icmph->type != ICMP_PARAMETERPROB) {
+   goto standard_hash;
+   }
+
+   inner_iph = skb_header_pointer(skb,
+  outer_iph->ihl * 4 + sizeof(_icmph),
+  sizeof(_inner_iph), &_inner_iph);
+   if (!inner_iph)
+   goto standard_hash;
+
+   return fib_multipath_hash(inner_iph->daddr, inner_iph->saddr);
+
+standard_hash:
+   return fib_multipath_hash(outer_iph->saddr, outer_iph->daddr);
+}
+
+#endif /* CONFIG_IP_ROUTE_MULTIPATH */
+
 static int ip_mkroute_input(struct sk_buff *skb,
struct fib_result *res,
const struct flowi4 *fl4,
@@ -1656,7 +1698,10 @@ static int ip_mkroute_input(struct sk_buff *skb,
if (res->fi && res->fi->fib_nhs > 1) {
int h;
 
-   h = fib_multipath_hash(saddr, daddr);
+   if (unlikely(ip_hdr(skb)->protocol == IPPROTO_ICMP))
+   h = ip_multipath_icmp_hash(skb);
+   else
+   h = fib_multipath_hash(saddr, daddr);
fib_select_multipath(res, h);
}
 #endif
@@ -2042,7 +2087,8 @@ add:
  * Major route resolver routine.
  */
 
-struct rtable *__ip_route_output_key(struct net *net, struct 

[PATCH v5 net-next 1/2] ipv4: L3 hash-based multipath

2015-09-30 Thread Peter Nørlund
From: Peter Nørlund 

Replaces the per-packet multipath with a hash-based multipath using
source and destination address.

Signed-off-by: Peter Nørlund 
---
 include/net/ip_fib.h |   14 -
 net/ipv4/fib_semantics.c |  140 +-
 net/ipv4/route.c |   16 --
 3 files changed, 98 insertions(+), 72 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 727d6e9..7a51fd8 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -79,7 +79,7 @@ struct fib_nh {
unsigned char   nh_scope;
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
int nh_weight;
-   int nh_power;
+   atomic_t nh_upper_bound;
 #endif
 #ifdef CONFIG_IP_ROUTE_CLASSID
__u32   nh_tclassid;
@@ -118,7 +118,7 @@ struct fib_info {
 #define fib_advmss fib_metrics[RTAX_ADVMSS-1]
int fib_nhs;
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
-   int fib_power;
+   int fib_weight;
 #endif
struct rcu_head rcu;
struct fib_nh   fib_nh[0];
@@ -320,7 +320,15 @@ int ip_fib_check_default(__be32 gw, struct net_device 
*dev);
 int fib_sync_down_dev(struct net_device *dev, unsigned long event);
 int fib_sync_down_addr(struct net *net, __be32 local);
 int fib_sync_up(struct net_device *dev, unsigned int nh_flags);
-void fib_select_multipath(struct fib_result *res);
+
+extern u32 fib_multipath_secret __read_mostly;
+
+static inline int fib_multipath_hash(__be32 saddr, __be32 daddr)
+{
+   return jhash_2words(saddr, daddr, fib_multipath_secret) >> 1;
+}
+
+void fib_select_multipath(struct fib_result *res, int hash);
 
 /* Exported by fib_trie.c */
 void fib_trie_init(void);
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 064bd3c..0c49d2f 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -57,8 +57,7 @@ static unsigned int fib_info_cnt;
 static struct hlist_head fib_info_devhash[DEVINDEX_HASHSIZE];
 
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
-
-static DEFINE_SPINLOCK(fib_multipath_lock);
+u32 fib_multipath_secret __read_mostly;
 
 #define for_nexthops(fi) { \
int nhsel; const struct fib_nh *nh; \
@@ -532,7 +531,67 @@ errout:
return ret;
 }
 
-#endif
+static void fib_rebalance(struct fib_info *fi)
+{
+   int total;
+   int w;
+   struct in_device *in_dev;
+
+   if (fi->fib_nhs < 2)
+   return;
+
+   total = 0;
+   for_nexthops(fi) {
+   if (nh->nh_flags & RTNH_F_DEAD)
+   continue;
+
+   in_dev = __in_dev_get_rcu(nh->nh_dev);
+
+   if (in_dev &&
+   IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
+   nh->nh_flags & RTNH_F_LINKDOWN)
+   continue;
+
+   total += nh->nh_weight;
+   } endfor_nexthops(fi);
+
+   w = 0;
+   change_nexthops(fi) {
+   int upper_bound;
+
+   in_dev = __in_dev_get_rcu(nexthop_nh->nh_dev);
+
+   if (nexthop_nh->nh_flags & RTNH_F_DEAD) {
+   upper_bound = -1;
+   } else if (in_dev &&
+  IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
+  nexthop_nh->nh_flags & RTNH_F_LINKDOWN) {
+   upper_bound = -1;
+   } else {
+   w += nexthop_nh->nh_weight;
+   upper_bound = DIV_ROUND_CLOSEST(2147483648LL * w,
+   total) - 1;
+   }
+
+   atomic_set(&nexthop_nh->nh_upper_bound, upper_bound);
+   } endfor_nexthops(fi);
+
+   net_get_random_once(&fib_multipath_secret,
+   sizeof(fib_multipath_secret));
+}
+
+static inline void fib_add_weight(struct fib_info *fi,
+ const struct fib_nh *nh)
+{
+   fi->fib_weight += nh->nh_weight;
+}
+
+#else /* CONFIG_IP_ROUTE_MULTIPATH */
+
+#define fib_rebalance(fi) do { } while (0)
+#define fib_add_weight(fi, nh) do { } while (0)
+
+#endif /* CONFIG_IP_ROUTE_MULTIPATH */
 
 static int fib_encap_match(struct net *net, u16 encap_type,
   struct nlattr *encap,
@@ -1094,8 +1153,11 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 
change_nexthops(fi) {
fib_info_update_nh_saddr(net, nexthop_nh);
+   fib_add_weight(fi, nexthop_nh);
} endfor_nexthops(fi)
 
+   fib_rebalance(fi);
+
 link_it:
ofi = fib_find_info(fi);
if (ofi) {
@@ -1317,12 +1379,6 @@ int fib_sync_down_dev(struct net_device *dev, unsigned 
long event)
nexthop_nh->nh_flags |= RTNH_F_LINKDOWN;

Re: [PATCH net-next 4/6] xfrm: Add xfrm6 address translation function

2015-09-30 Thread Steffen Klassert
On Tue, Sep 29, 2015 at 04:58:46PM -0600, David Ahern wrote:
> Hi Tom:
> 
> On 9/29/15 4:17 PM, Tom Herbert wrote:
> >This patch adds xfrm6_xlat_addr which is called in the data path
> >to perform address translation (primarily for the receive path). Modules
> >may register their own callback to perform a translation-- this
> >registration is managed by xfrm6_xlat_addr_add and xfrm6_xlat_addr_del.
> >xfrm6_xlat_addr allows translation of addresses for an sk_buff.
> 
> 
> Seems like a stretch to lump this into xfrms. You have a separate
> genl based config as opposed to the netlink xfrm API and you are
> calling the xlat_addr function directly in ip6_rcv as opposed to via
> some policy with dst_ops driven redirection. Why call this a xfrm?

I have to agree here. We have policies and states to do the lookups
and to describe the transformation. Just adding a callback to do this
in a different way does not integrate well into xfrm.


Re: [ovs-dev] [PATCH net-next 1/2] openvswitch: add tunnel protocol to sw_flow_key

2015-09-30 Thread Jiri Benc
On Tue, 29 Sep 2015 19:08:44 -0700, Jesse Gross wrote:
> On Tue, Sep 29, 2015 at 10:52 AM, Jiri Benc  wrote:
> > diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
> > index 5c030a4d7338..03ba070c3256 100644
> > --- a/net/openvswitch/flow_netlink.c
> > +++ b/net/openvswitch/flow_netlink.c
> > @@ -643,6 +643,7 @@ static int ipv4_tun_from_nlattr(const struct nlattr 
> > *attr,
> > }
> >
> > SW_FLOW_KEY_PUT(match, tun_key.tun_flags, tun_flags, is_mask);
> > +   SW_FLOW_KEY_PUT(match, tun_proto, AF_INET, is_mask);
> 
> I don't think this is right in the case of the mask. It will cause the
> mask to be the value AF_INET - instead you want to set the mask to
> be 0xff.

I think you're right, this is a special case. I'll fix it.
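
To spell out why the mask matters, a small userspace sketch (illustration
only; masked_match() is a hypothetical helper and the constants are the
Linux values AF_INET = 2 and AF_INET6 = 10, hard-coded so the example does
not depend on system headers). With masked matching, a field matches when
(packet & mask) == (key & mask). If the mask is the value AF_INET itself,
only bit 1 is compared, and AF_INET6 (binary 1010) has that bit set too.

```c
#include <stdbool.h>
#include <stdint.h>

#define EX_AF_INET	2	/* Linux value of AF_INET */
#define EX_AF_INET6	10	/* Linux value of AF_INET6 */

/* Generic masked comparison of a one-byte field. */
static bool masked_match(uint8_t pkt, uint8_t key, uint8_t mask)
{
	return (pkt & mask) == (key & mask);
}
```

With mask 0x02 an IPv6 tunnel key would match a flow keyed on AF_INET; with
mask 0xff it does not.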

Thanks,

 Jiri

-- 
Jiri Benc


Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket

2015-09-30 Thread Michal Kubecek
On Wed, Sep 30, 2015 at 07:54:29AM +0200, Mathias Krause wrote:
> On 29 September 2015 at 21:09, Jason Baron  wrote:
> > However, if we call connect on socket 's', to connect to a new socket 'o2',
> > we drop the reference on the original socket 'o'. Thus, we can now close socket
> > 'o' without unregistering from epoll. Then, when we either close the ep
> > or unregister 'o', we end up with this list corruption. Thus, this is not a
> > race per se, but can be triggered sequentially.
> 
> Sounds profound, but the reproducer calls connect only once per
> socket. So there is no "connect to a new socket", no?

I believe there is another scenario: 'o' becomes SOCK_DEAD while 's' is
still connected to it. This is detected by 's' in unix_dgram_sendmsg()
so that 's' releases its reference on 'o' and 'o' can be freed. If this
happens before 's' is unregistered, we get use-after-free as 'o' has
never been unregistered. And as the interval between freeing 'o' and
unregistering 's' can be quite long, there is a chance for the memory to
be reused. This is what one of our customers has seen:

[exception RIP: _raw_spin_lock_irqsave+156]
RIP: 8040f5bc  RSP: 8800e929de78  RFLAGS: 00010082
RAX: a32c  RBX: 88003954ab80  RCX: 1000
RDX: f232  RSI: f232  RDI: 88003954ab80
RBP: 5220   R8: dead00100100   R9: 
R10: 7fff1a284960  R11: 0246  R12: 
R13: 8800e929de8c  R14: 000e  R15: 
ORIG_RAX:   CS: 1e030  SS: e02b
 #8 [8800e929de70] _raw_spin_lock_irqsave at 8040f5a9
 #9 [8800e929deb0] remove_wait_queue at 8006ad09
#10 [8800e929ded0] ep_unregister_pollwait at 80170043
#11 [8800e929def0] ep_remove at 80170073
#12 [8800e929df10] sys_epoll_ctl at 80171453
#13 [8800e929df80] system_call_fastpath at 80417553

In this case, crash happened on unregistering 's' which had null peer
(i.e. not reconnected but rather disconnected) but there were still two
items in the list, the other pointing to an unallocated page which has
apparently been modified in between.

IMHO unix_dgram_disconnected() could be the place to handle this issue:
it is called from both places where we disconnect from a peer (dead peer
detection in unix_dgram_sendmsg() and reconnect in unix_dgram_connect())
just before the reference to peer is released. I'm not familiar with the
epoll implementation so I'm still trying to find what exactly needs to
be done to unregister the peer at this moment.

> That bug triggers since commit 3c73419c09 "af_unix: fix 'poll for
> write'/ connected DGRAM sockets". That's v2.6.26-rc7, as noted in the
> reproducer.

Sounds likely as this is the commit that introduced unix_dgram_poll()
with the code which adds the "asymmetric peer" to monitor its queue
state. More precisely, the asymmetricity check has been added by

  ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")

shortly after that.

  Michal Kubecek



Re: [PATCH net-next 2/2] openvswitch: netlink attributes for IPv6 tunneling

2015-09-30 Thread Jiri Benc
On Tue, 29 Sep 2015 20:05:00 -0700, Jesse Gross wrote:
> This appears to me to be a bug in the existing code.
> ovs_tunnel_get_egress_info() as a general mechanism is still in use
> and should work with both the old and new configuration methods.

It's currently used only from the compat layer (the API used by user
space that is unaware of lwtunnels).

I don't understand what it would be good for with lwtunnel based
tunnels. The metadata_dst is created in the validate_and_copy_set_tun
function (net/openvswitch/flow_netlink.c) and used to specify egress
encapsulation metadata. The ovs_tunnel_get_egress_info function is not
needed.

> However, I agree that it doesn't look like it will work currently with
> tunnel devices. I think we need to fix this rather than making it more
> broken.

I'm not making it more broken. We currently (i.e. right now, in the
current net.git) have two APIs for tunnel specification in the ovs
kernel datapath: the old one, which is translated by the compat layer
to create a net_device, and the lwtunnel one, which requires user space
to create a (metadata) tunnel net_device and add it to the datapath.
I'm simply not adding more code to the first, legacy interface, which
seems to be the correct thing to do.

 Jiri

-- 
Jiri Benc


Re: [PATCH 0/5] y2038 conversion for ntp/pps and sfc driver

2015-09-30 Thread Thomas Gleixner
On Tue, 29 Sep 2015, David Miller wrote:
> From: Arnd Bergmann 
> Date: Mon, 28 Sep 2015 22:21:27 +0200
> 
> > When trying to build a kernel with time_t commented out, I found that
> > the ntp subsystem still relies on timespec for its pps handling.
> > 
> > This series addresses this and converts all the code to use timespec64
> > instead, step by step. There is one device driver that interacts with
> > this code directly (rather than only through the ptp subsystem), so
> > I have to convert that driver at the same time.
> > 
> > The patches should ideally stay together as a series, but they do
> > span multiple subsystems, so I'm also looking for the right person
> > to merge them.
> 
> I'm happy with this going via a tree other than mine, and for the

I think it should go via John Stultz's timekeeping tree.

Thanks,

tglx


Re: [PATCH 5/5] net: sfc: avoid using timespec

2015-09-30 Thread Thomas Gleixner
On Mon, 28 Sep 2015, Arnd Bergmann wrote:

> The sfc driver internally uses a time format based on 32-bit (unsigned)
> seconds and 32-bit nanoseconds. This means it will overflow in 2106,
> but the value we pass into it is a signed 32-bit tv_sec that already
> overflows in 2038 to a negative value.
> 
> This patch changes the logic to use the lower 32 bits of the timespec64
> tv_sec in efx_ptp_ns_to_s_ns, which will have the correct value beyond the 
> overflow.
> While this does not change any of the register values, it lets us
> keep using the driver after we deprecate the use of the timespec type
> in the kernel.
> 
> In the efx_ptp_process_times function, the change to use timespec64
> is similar, in that the tv_sec portion is ignored anyway and we only
> care about the nanosecond portion that remains unchanged.
> 
> Signed-off-by: Arnd Bergmann 

Reviewed-by: Thomas Gleixner 
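
To illustrate the 2038 versus 2106 point from the commit message, here is a
small userspace sketch. It is not taken from the sfc driver and the helper
names are hypothetical. It assumes a 64-bit seconds value as in timespec64
and shows that keeping the low 32 bits as an unsigned value stays correct
past 2038 and only wraps after 2^32 seconds, i.e. in 2106.

```c
#include <stdint.h>

/* Hypothetical helper: keep only the low 32 bits of a 64-bit seconds value. */
static uint32_t seconds_low32(int64_t tv_sec)
{
	return (uint32_t)tv_sec;
}

/*
 * What a signed 32-bit tv_sec would hold for the same instant. The
 * conversion of an out-of-range value to int32_t is implementation-defined
 * in C; common compilers wrap it to a negative number.
 */
static int32_t seconds_as_signed32(int64_t tv_sec)
{
	return (int32_t)(uint32_t)tv_sec;
}
```

One second past the signed limit (2^31 seconds), seconds_low32() still
returns 2147483648 while the signed variant has gone negative; at 2^32
seconds the unsigned value finally wraps to 0.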


Re: [PATCH 4/5] ntp/pps: use y2038 safe types in pps_event_time

2015-09-30 Thread Thomas Gleixner
On Mon, 28 Sep 2015, Arnd Bergmann wrote:

> The pps_event_time uses two 'timespec' structures internally, which
> suffer from the y2038 problem. The uses of this structure are
> fairly self-contained in the pps code, so this replaces them all at
> once.
> 
> Unfortunately, this includes the sfc ethernet driver aside from the
> pps subsystem, so we change that one as well. Both touch the
> same data structure, and there probably is no good way to split
> the patch into smaller units.
> 
> Signed-off-by: Arnd Bergmann 

Reviewed-by: Thomas Gleixner 


Re: [PATCH RFC 3/7] netfilter: add NF_INET_LOCAL_SOCKET_IN chain type

2015-09-30 Thread Jan Engelhardt

On Wednesday 2015-09-30 09:24, Daniel Mack wrote:
>
>> Drop?  Makes no sense, else application would not be running in the first
>> place.
>
>Of course you can drop certain packets at this point, depending on other
>details. Say, for instance, you want to match all packets that are
>received by a certain task [...]
>Another use case is accounting. If you want to know how much traffic a
>certain service or application in your system has caused

But the sk info would be available in INPUT already, would it not?


Re: [PATCH net-next 1/2] openvswitch: add tunnel protocol to sw_flow_key

2015-09-30 Thread Pravin Shelar
On Wed, Sep 30, 2015 at 12:09 AM, Jiri Benc  wrote:
> On Tue, 29 Sep 2015 13:41:34 -0700, Pravin Shelar wrote:
>> We can add rather add TUNNEL_IPV6 flag to distinguish IPv4 and IPv6
>> tunnel keys. This can be stored in ip_tunnel_key.tun_flags.
>
> Not really. This was my original approach, too, but openvswitch is not
> the only user of struct ip_tunnel_key, and in the lwtunnel core,
> tun_flags are handled in the way that makes this impractical. Most
> importantly, the tun_flags value is directly taken from/stored to
> LWTUNNEL_IP_FLAGS/LWTUNNEL_IP6_FLAGS netlink attributes in
> net/ipv4/ip_tunnel_core.c. This would mean complicated masking, etc.
>
How is it impractical? Userspace can set a flag for IPv6 tunnel info.
That should be easy.

The IPv6 bit cannot be masked anyway, so I do not see a problem with
masking this flag due to the new bit.

Since this field is exposed to userspace, the TUNNEL_* flags need to be
moved to a uapi header.


>> That also saves space in flow key.
>
> The field was added to a 2 byte hole in the struct sw_flow_key (leaving
> still 1 byte free), thus there's no additional space used.
>
>  Jiri
>
> --
> Jiri Benc


[PATCH v2 net-next 2/3] RDS-TCP: Do not bloat sndbuf/rcvbuf in rds_tcp_tune

2015-09-30 Thread Sowmini Varadhan
Using the value of RDS_TCP_DEFAULT_BUFSIZE (128K)
clobbers efficient use of TSO because it inflates the size_goal
that is computed in tcp_sendmsg/tcp_sendpage and skews packet
latency, and the default values for these parameters actually
result in significantly better performance.

Request-response tests using rds-stress with a packet size of
100K and 16 threads (test parameters -q 10 -a 256 -t16 -d16)
between a single pair of IP addresses achieve a throughput of
6-8 Gbps with this patch. Without this patch, throughput maxes out
at 2-3 Gbps under equivalent conditions on these platforms.

Signed-off-by: Sowmini Varadhan 
---
 net/rds/tcp.c |   16 
 1 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index c42b60b..9d6ddba 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -67,21 +67,13 @@ void rds_tcp_nonagle(struct socket *sock)
set_fs(oldfs);
 }
 
+/* All module specific customizations to the RDS-TCP socket should be done in
+ * rds_tcp_tune() and applied after socket creation. In general these
+ * customizations should be tunable via module_param()
+ */
 void rds_tcp_tune(struct socket *sock)
 {
-   struct sock *sk = sock->sk;
-
rds_tcp_nonagle(sock);
-
-   /*
-* We're trying to saturate gigabit with the default,
-* see svc_sock_setbufsize().
-*/
-   lock_sock(sk);
-   sk->sk_sndbuf = RDS_TCP_DEFAULT_BUFSIZE;
-   sk->sk_rcvbuf = RDS_TCP_DEFAULT_BUFSIZE;
-   sk->sk_userlocks |= SOCK_SNDBUF_LOCK|SOCK_RCVBUF_LOCK;
-   release_sock(sk);
 }
 
 u32 rds_tcp_snd_nxt(struct rds_tcp_connection *tc)
-- 
1.7.1



[PATCH v2 net-next 0/3] RDS: RDS-TCP perf enhancements

2015-09-30 Thread Sowmini Varadhan
A 3-part patchset that (a) improves current RDS-TCP perf
by 2X-3X and (b) refactors earlier robustness code for
better observability/scaling.

Patch 1 is an enhancement of earlier robustness fixes
that had used separate sockets for client and server endpoints to
resolve race conditions. It is possible to have an equivalent
solution that does not use 2 sockets. The benefit of a
single socket solution is that it results in more predictable
and observable behavior for the underlying TCP pipe of an
RDS connection.

Patches 2 and 3 are simple, straightforward perf bug fixes
that align the RDS TCP socket with other parts of the kernel stack.

v2: fix kbuild-test-robot warnings, comments from Sergei Shtylyov
and Santosh Shilimkar.

Sowmini Varadhan (3):
  Use a single TCP socket for both send and receive.
  Do not bloat sndbuf/rcvbuf in rds_tcp_tune
  Set up MSG_MORE and MSG_SENDPAGE_NOTLAST as appropriate in
rds_tcp_xmit

 net/rds/connection.c |   22 ++
 net/rds/rds.h|4 +++-
 net/rds/tcp.c|   16 
 net/rds/tcp_listen.c |   22 +-
 net/rds/tcp_send.c   |8 +++-
 5 files changed, 29 insertions(+), 43 deletions(-)



[PATCH v2 net-next 1/3] RDS: Use a single TCP socket for both send and receive.

2015-09-30 Thread Sowmini Varadhan
Commit f711a6ae062c ("net/rds: RDS-TCP: Always create a new rds_sock
for an incoming connection.") modified rds-tcp so that an incoming SYN
would ignore an existing "client" TCP connection which had the local
port set to the transient port.  The motivation for ignoring the existing
"client" connection in f711a6ae was to avoid race conditions and an
endless duel of reconnect attempts triggered by a restart/abort of one
of the nodes in the TCP connection.

However, having separate sockets for active and passive sides
is avoidable, and the simpler model of a single TCP socket for
both sends and receives of all RDS connections associated with
that TCP socket makes for easier observability. We avoid the race
conditions from f711a6ae by attempting reconnects in rds_conn_shutdown
if, and only if, the (new) c_outgoing bit is set for RDS_TRANS_TCP.
The c_outgoing bit is initialized in __rds_conn_create().
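
As a rough illustration of the reconnect rule described above (a sketch
only, not the kernel code): the enum values and the struct below are
hypothetical stand-ins for conn->c_trans->t_type and the new c_outgoing
bit, and should_queue_reconnect() is an invented helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's transport type constants. */
enum trans_type { TRANS_IB = 0, TRANS_TCP = 2 };

/* Minimal model of the two fields consulted when a connection shuts down. */
struct conn_model {
	enum trans_type t_type;     /* models conn->c_trans->t_type */
	unsigned int c_outgoing:1;  /* set at creation time for the active side */
};

/*
 * Non-TCP transports always requeue a reconnect; a TCP connection
 * requeues only if it was created as the outgoing (active) side.
 */
static bool should_queue_reconnect(const struct conn_model *c)
{
	assert(c != NULL);
	return c->t_type != TRANS_TCP || c->c_outgoing == 1;
}
```

With this rule the passive side of a TCP connection never starts a
reconnect on its own, which is what removes the reconnect duel.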

A side-effect of re-using the client rds_connection for an incoming
SYN is the potential of encountering duelling SYNs, i.e., we
have an outgoing RDS_CONN_CONNECTING socket when we get the incoming
SYN. The logic to arbitrate this criss-crossing SYN exchange in
rds_tcp_accept_one() has been modified to emulate the BGP state
machine: the smaller IP address should back off from the connection attempt.
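
The tie-break described above could be sketched roughly as follows. This
is an assumption-laden userspace illustration: local_backs_off() is a
hypothetical helper, and the exact comparison used in
rds_tcp_accept_one() may differ. Addresses are taken in network byte
order, as __be32 values would be.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Duelling-SYN arbitration sketch: when both peers have connection
 * attempts in flight, the side with the numerically smaller IPv4
 * address gives up its own outgoing attempt.
 * Both arguments are in network byte order.
 */
static bool local_backs_off(uint32_t laddr_be, uint32_t faddr_be)
{
	return ntohl(laddr_be) < ntohl(faddr_be);
}

/* Usage example: 10.0.0.1 versus 10.0.0.2. */
void arbitration_example(void)
{
	uint32_t a = htonl(0x0a000001); /* 10.0.0.1 */
	uint32_t b = htonl(0x0a000002); /* 10.0.0.2 */

	assert(local_backs_off(a, b));  /* smaller address yields */
	assert(!local_backs_off(b, a)); /* larger address keeps its attempt */
}
```

Because both ends evaluate the same comparison on the same address pair,
exactly one of them backs off when the addresses differ.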

Signed-off-by: Sowmini Varadhan 
---
v2: kbuild-test-robot warning around __be32, modify subject line per 
Santosh Shilimkar

 net/rds/connection.c |   22 ++
 net/rds/rds.h|4 +++-
 net/rds/tcp_listen.c |   22 +-
 3 files changed, 18 insertions(+), 30 deletions(-)

diff --git a/net/rds/connection.c b/net/rds/connection.c
index 49adeef..d456403 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -128,10 +128,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
struct rds_transport *loop_trans;
unsigned long flags;
int ret;
-   struct rds_transport *otrans = trans;
 
-   if (!is_outgoing && otrans->t_type == RDS_TRANS_TCP)
-   goto new_conn;
rcu_read_lock();
conn = rds_conn_lookup(net, head, laddr, faddr, trans);
if (conn && conn->c_loopback && conn->c_trans != &rds_loop_transport &&
@@ -147,7 +144,6 @@ static struct rds_connection *__rds_conn_create(struct net *net,
if (conn)
goto out;
 
-new_conn:
conn = kmem_cache_zalloc(rds_conn_slab, gfp);
if (!conn) {
conn = ERR_PTR(-ENOMEM);
@@ -207,6 +203,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 
atomic_set(&conn->c_state, RDS_CONN_DOWN);
conn->c_send_gen = 0;
+   conn->c_outgoing = (is_outgoing ? 1 : 0);
conn->c_reconnect_jiffies = 0;
INIT_DELAYED_WORK(&conn->c_send_w, rds_send_worker);
INIT_DELAYED_WORK(&conn->c_recv_w, rds_recv_worker);
@@ -243,22 +240,13 @@ static struct rds_connection *__rds_conn_create(struct net *net,
/* Creating normal conn */
struct rds_connection *found;
 
-   if (!is_outgoing && otrans->t_type == RDS_TRANS_TCP)
-   found = NULL;
-   else
-   found = rds_conn_lookup(net, head, laddr, faddr, trans);
+   found = rds_conn_lookup(net, head, laddr, faddr, trans);
if (found) {
trans->conn_free(conn->c_transport_data);
kmem_cache_free(rds_conn_slab, conn);
conn = found;
} else {
-   if ((is_outgoing && otrans->t_type == RDS_TRANS_TCP) ||
-   (otrans->t_type != RDS_TRANS_TCP)) {
-   /* Only the active side should be added to
-* reconnect list for TCP.
-*/
-   hlist_add_head_rcu(&conn->c_hash_node, head);
-   }
+   hlist_add_head_rcu(&conn->c_hash_node, head);
rds_cong_add_conn(conn);
rds_conn_count++;
}
@@ -337,7 +325,9 @@ void rds_conn_shutdown(struct rds_connection *conn)
rcu_read_lock();
if (!hlist_unhashed(&conn->c_hash_node)) {
rcu_read_unlock();
-   rds_queue_reconnect(conn);
+   if (conn->c_trans->t_type != RDS_TRANS_TCP ||
+   conn->c_outgoing == 1)
+   rds_queue_reconnect(conn);
} else {
rcu_read_unlock();
}
diff --git a/net/rds/rds.h b/net/rds/rds.h
index afb4048..b4c7ac0 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -86,7 +86,9 @@ struct rds_connection {
struct hlist_node   c_hash_node;
__be32  c_laddr;
__be32  c_faddr;
-   unsigned int    c_loopback:1;
+   unsigned int    c_loopback:1,
+ 

[PATCH v2 net-next 3/3] RDS-TCP: Set up MSG_MORE and MSG_SENDPAGE_NOTLAST as appropriate in rds_tcp_xmit

2015-09-30 Thread Sowmini Varadhan
For the same reasons as commit 2f5338442425 ("tcp: allow splice() to
build full TSO packets") and commit 35f9c09fe9c7 ("tcp: tcp_sendpages()
should call tcp_push() once"), rds_tcp_xmit may have multiple pages to
send, so use MSG_MORE and MSG_SENDPAGE_NOTLAST as hints to
tcp_sendpage().
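
A minimal userspace sketch of the per-page flag selection, under stated
assumptions: page_send_flags() is a hypothetical helper, and
SKETCH_SENDPAGE_NOTLAST is a local stand-in because
MSG_SENDPAGE_NOTLAST is kernel-internal and not exported to userspace
headers. The idea is that every page except the last one carries the
"more data follows" hints.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>

/* Local stand-in for the kernel-internal MSG_SENDPAGE_NOTLAST flag. */
#define SKETCH_SENDPAGE_NOTLAST 0x20000

/*
 * Return the send flags for scatterlist entry 'sg' out of 'nents'.
 * All entries but the last get the "more to come" hints so that TCP
 * can defer the push and build full-sized segments.
 */
static int page_send_flags(size_t sg, size_t nents)
{
	int flags = MSG_DONTWAIT | MSG_NOSIGNAL;

	assert(sg < nents);
	if (sg + 1 < nents)
		flags |= MSG_MORE | SKETCH_SENDPAGE_NOTLAST;
	return flags;
}
```

For a single-entry message the hints are never set, which matches the
"op_nents > 1" condition in the patch.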

Signed-off-by: Sowmini Varadhan 
---
v2: Sergei Shtylyov, Santosh Shilimkar comments (some parens retained for
readability)

 net/rds/tcp_send.c |8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/net/rds/tcp_send.c b/net/rds/tcp_send.c
index 53b17ca..2894e60 100644
--- a/net/rds/tcp_send.c
+++ b/net/rds/tcp_send.c
@@ -83,6 +83,7 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,
struct rds_tcp_connection *tc = conn->c_transport_data;
int done = 0;
int ret = 0;
+   int more;
 
if (hdr_off == 0) {
/*
@@ -116,12 +117,15 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,
goto out;
}
 
+   more = rm->data.op_nents > 1 ? (MSG_MORE | MSG_SENDPAGE_NOTLAST) : 0;
while (sg < rm->data.op_nents) {
+   int flags = MSG_DONTWAIT | MSG_NOSIGNAL | more;
+
ret = tc->t_sock->ops->sendpage(tc->t_sock,
sg_page(&rm->data.op_sg[sg]),
rm->data.op_sg[sg].offset + off,
rm->data.op_sg[sg].length - off,
-   MSG_DONTWAIT|MSG_NOSIGNAL);
+   flags);
rdsdebug("tcp sendpage %p:%u:%u ret %d\n", (void *)sg_page(&rm->data.op_sg[sg]),
 rm->data.op_sg[sg].offset + off, rm->data.op_sg[sg].length - off,
 ret);
@@ -134,6 +138,8 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,
off = 0;
sg++;
}
+   if (sg == rm->data.op_nents - 1)
+   more = 0;
}
 
 out:
-- 
1.7.1



[PATCH net-next 0/6] net: Pass net through ip fragmention

2015-09-30 Thread Eric W. Biederman

This is the next installment of my work to pass struct net through the
output path so the code does not need to guess how to figure out which
network namespace it is in, and ultimately routes can have output
devices in another network namespace.

This round focuses on passing net through ip fragmentation which we seem
to call from about everywhere.  That is the main ip output paths, the
bridge netfilter code, and openvswitch.  This has to happen at once
across the tree as function pointers are involved.

First some prep work is done, then ipv4 and ipv6 are converted and then
temporary helper functions are removed.

The changes are also available against nf-next at:
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/net-next.git master

Eric

Eric W. Biederman (6):
  openvswitch: Pass net into ovs_vport_output
  openvswitch: Pass net into ovs_fragment
  ipv4: Pass struct net through ip_fragment
  ipv6: Pass struct net through ip6_fragment
  bridge: Remove br_nf_push_frag_xmit_sk
  openvswitch: Remove ovs_vport_output_sk

 include/linux/netfilter_ipv6.h  |  4 ++--
 include/net/ip.h|  4 ++--
 include/net/ip6_route.h |  4 ++--
 net/bridge/br_netfilter_hooks.c | 13 
 net/ipv4/ip_output.c| 44 +++--
 net/ipv6/ip6_output.c   | 16 +++
 net/ipv6/xfrm6_output.c | 10 --
 net/openvswitch/actions.c   | 13 ++--
 8 files changed, 52 insertions(+), 56 deletions(-)


Re: [PATCH net-next 2/2] openvswitch: netlink attributes for IPv6 tunneling

2015-09-30 Thread Jesse Gross
On Wed, Sep 30, 2015 at 12:28 AM, Jiri Benc  wrote:
> On Tue, 29 Sep 2015 20:05:00 -0700, Jesse Gross wrote:
>> This appears to me to be a bug in the existing code.
>> ovs_tunnel_get_egress_info() as a general mechanism is still in use
>> and should work with both the old and new configuration methods.
>
> It's currently used only from the compat layer (the API that the user
> space that is unaware of lwtunnels use).

Yes but that is a bug. From the perspective of the intended use of
this function, I don't think there is any difference between compat
and non-compat users.

> I don't understand what it would be good for with lwtunnel based
> tunnels. The metadata_dst is created in the validate_and_copy_set_tun
> function (net/openvswitch/flow_netlink.c) and used to specify egress
> encapsulation metadata. The ovs_tunnel_get_egress_info function is not
> needed.

This function is used to report back information that is the result of
the encapsulation process, such as the UDP source port chosen. Take a
look at net/openvswitch/actions.c:output_userspace(), particularly the
OVS_USERSPACE_ATTR_EGRESS_TUN_PORT case.


Re: [ovs-dev] [PATCH net-next 1/2] openvswitch: add tunnel protocol to sw_flow_key

2015-09-30 Thread Jesse Gross
On Wed, Sep 30, 2015 at 1:13 PM, Pravin Shelar  wrote:
> On Wed, Sep 30, 2015 at 12:09 AM, Jiri Benc  wrote:
>> On Tue, 29 Sep 2015 13:41:34 -0700, Pravin Shelar wrote:
>>> We can add rather add TUNNEL_IPV6 flag to distinguish IPv4 and IPv6
>>> tunnel keys. This can be stored in ip_tunnel_key.tun_flags.
>>
>> Not really. This was my original approach, too, but openvswitch is not
>> the only user of struct ip_tunnel_key, and in the lwtunnel core,
>> tun_flags are handled in the way that makes this impractical. Most
>> importantly, the tun_flags value is directly taken from/stored to
>> LWTUNNEL_IP_FLAGS/LWTUNNEL_IP6_FLAGS netlink attributes in
>> net/ipv4/ip_tunnel_core.c. This would mean complicated masking, etc.
>>
> How is it impractical ? Userspace can set flag for IPv6 tunnel info.
> That should be easy.
>
> IPv6 bit can not be masked anyways so I do not see problem with
> masking this flag due to the new bit.

I think he meant for non-OVS users.

> Since this field is exposed to userspace. TUNNEL_* flags needs to be
> moved to uapi header.

This doesn't really seem all that desirable to me. It's nice to be
able to change these as necessary and in the particular case of IPv6,
it seems like something that the kernel can manage by itself (as is
done in this patch and I think the same strategy would apply
regardless of the particular representation).


Re: [ovs-dev] [PATCH net-next 1/2] openvswitch: add tunnel protocol to sw_flow_key

2015-09-30 Thread Jiri Benc
On Wed, 30 Sep 2015 13:25:12 -0700, Jesse Gross wrote:
> On Wed, Sep 30, 2015 at 1:13 PM, Pravin Shelar  wrote:
> > On Wed, Sep 30, 2015 at 12:09 AM, Jiri Benc  wrote:
> >> On Tue, 29 Sep 2015 13:41:34 -0700, Pravin Shelar wrote:
> >>> We can add rather add TUNNEL_IPV6 flag to distinguish IPv4 and IPv6
> >>> tunnel keys. This can be stored in ip_tunnel_key.tun_flags.
> >>
> >> Not really. This was my original approach, too, but openvswitch is not
> >> the only user of struct ip_tunnel_key, and in the lwtunnel core,
> >> tun_flags are handled in the way that makes this impractical. Most
> >> importantly, the tun_flags value is directly taken from/stored to
> >> LWTUNNEL_IP_FLAGS/LWTUNNEL_IP6_FLAGS netlink attributes in
> >> net/ipv4/ip_tunnel_core.c. This would mean complicated masking, etc.
> >>
> > How is it impractical ? Userspace can set flag for IPv6 tunnel info.
> > That should be easy.
> >
> > IPv6 bit can not be masked anyways so I do not see problem with
> > masking this flag due to the new bit.
> 
> I think he meant for non-OVS users.

Yes, I didn't mean masking in ovs, I meant that we'd need to hide the
bit from other users, for example in net/ipv4/ip_tunnel_core.c.
Currently, the information about ip_tunnel_key protocol is stored
outside the structure. Changing this would mean quite big changes in
the lwtunnel code (or, rather, IP users of lwtunnel) which doesn't seem
worth it just because of ovs. Especially when ovs can store the
information just fine without impact on memory footprint.
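
To make the "no impact on memory footprint" point concrete, here is a
small sketch. The two structs are hypothetical and do not reproduce the
real sw_flow_key layout; they only show that a 1-byte field placed in an
existing 2-byte padding hole leaves the struct size unchanged. It
assumes an ABI where uint32_t is 4-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Before: the compiler pads 2 bytes after 'flags' to align 'id'. */
struct key_before {
	uint16_t flags;
	/* 2-byte hole */
	uint32_t id;
};

/* After: a 1-byte protocol field sits in the hole, 1 byte stays free. */
struct key_after {
	uint16_t flags;
	uint8_t  tun_proto;
	/* 1-byte hole */
	uint32_t id;
};

/* The new field costs nothing: same size, same offset for 'id'. */
static_assert(sizeof(struct key_before) == sizeof(struct key_after),
	      "field must fit in the existing padding hole");
static_assert(offsetof(struct key_before, id) ==
	      offsetof(struct key_after, id),
	      "later fields must not move");
```

Tools such as pahole report these holes for real kernel structs.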

I don't see any real advantage in storing the protocol inside
ip_tunnel_key, this looks like it would be just a change for the change.

> > Since this field is exposed to userspace. TUNNEL_* flags needs to be
> > moved to uapi header.
> 
> This doesn't really seem all that desirable to me. It's nice to be
> able to change these as necessary and in the particular case of IPv6,
> it seems like something that the kernel can manage by itself (as is
> done in this patch and I think the same strategy would apply
> regardless of the particular representation).

User space can set and get those bits in LWTUNNEL_IP_FLAGS netlink
attribute when using lwtunnel+routing rules. It would make sense to
move them to uapi but that's for a different patch(set).

 Jiri

-- 
Jiri Benc


Re: [PATCH net-next 2/2] openvswitch: netlink attributes for IPv6 tunneling

2015-09-30 Thread Jiri Benc
On Wed, 30 Sep 2015 13:18:40 -0700, Jesse Gross wrote:
> This function is used to report back information that is the result of
> the encapsulation process, such as the UDP source port chosen. Take a
> look at net/openvswitch/actions.c:output_userspace(), particularly the
> OVS_USERSPACE_ATTR_EGRESS_TUN_PORT case.

I see. I think it should be addressed separately from this patchset,
though, as the function needs to be completely rewritten even for IPv4
and IPv6 can be handled alongside it.

I'll change the patch description in v2, the current wording is not
correct. I don't think that fixing the bug should be a prerequisite for
this patchset, the problem is already there and this patchset doesn't
change that.

 Jiri

-- 
Jiri Benc


Re: [PATCH net-next 2/2] openvswitch: netlink attributes for IPv6 tunneling

2015-09-30 Thread Jesse Gross
On Wed, Sep 30, 2015 at 2:05 PM, Jiri Benc  wrote:
> On Wed, 30 Sep 2015 13:18:40 -0700, Jesse Gross wrote:
>> This function is used to report back information that is the result of
>> the encapsulation process, such as the UDP source port chosen. Take a
>> look at net/openvswitch/actions.c:output_userspace(), particularly the
>> OVS_USERSPACE_ATTR_EGRESS_TUN_PORT case.
>
> I see. I think it should be addressed separately from this patchset,
> though, as the function needs to be completely rewritten even for IPv4
> and IPv6 can be handled alongside it.
>
> I'll change the patch description in v2, the current wording is not
> correct. I don't think that fixing the bug should be a prerequisite for
> this patchset, the problem is already there and this patchset doesn't
> change that.

Can you at least update the existing code for IPv6 so that this
doesn't introduce another lurking issue when the bug is fixed?


[PATCH] net: usb: asix: Fix crash on skb alloc failure

2015-09-30 Thread David B. Robins
If asix_rx_fixup_internal() fails to allocate rx->ax_skb, it will return
but not clear rx->size. rx points to driver private data. A later call
assumes that nonzero size means ax_skb was allocated and passes a null
ax_skb to skb_put. Changed allocation failure return to clear size first.
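
The invariant behind the fix can be modelled in userspace roughly like
this. It is a sketch under assumptions: struct rx_state, alloc_fn and
rx_begin_frame() are invented names, with 'buf' standing in for
rx->ax_skb and an injectable allocator so the failure path can be
exercised.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Model of the driver-private receive state kept across calls. */
struct rx_state {
	size_t size; /* bytes still expected for the current frame */
	void *buf;   /* stands in for rx->ax_skb */
};

/* Hypothetical allocator hook so an allocation failure can be simulated. */
typedef void *(*alloc_fn)(size_t);

/*
 * Start receiving a frame. On allocation failure the size is cleared
 * before returning, preserving the invariant that later calls rely on:
 * size != 0 implies buf != NULL.
 */
static int rx_begin_frame(struct rx_state *rx, size_t len, alloc_fn alloc)
{
	assert(rx != NULL && alloc != NULL);

	rx->size = len;
	rx->buf = alloc(len);
	if (!rx->buf) {
		rx->size = 0;
		return 0;
	}
	return 1;
}

/* Allocator that always fails, for exercising the error path. */
static void *failing_alloc(size_t n)
{
	(void)n;
	return NULL;
}
```

Without the reset, a later call would see a nonzero size and use the
NULL buffer, which is the crash described above.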

Found while testing a board with AX88772B devices.

Signed-off-by: David B. Robins 
---
 drivers/net/usb/asix_common.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/usb/asix_common.c b/drivers/net/usb/asix_common.c
index 75d6f26..079069a 100644
--- a/drivers/net/usb/asix_common.c
+++ b/drivers/net/usb/asix_common.c
@@ -91,8 +91,10 @@ int asix_rx_fixup_internal(struct usbnet *dev, struct sk_buff *skb,
}
rx->ax_skb = netdev_alloc_skb_ip_align(dev->net,
   rx->size);
-   if (!rx->ax_skb)
+   if (!rx->ax_skb) {
+   rx->size = 0;
return 0;
+   }
}
 
if (rx->size > dev->net->mtu + ETH_HLEN + VLAN_HLEN) {
-- 
1.9.1



[PATCH 0/4] Add support for Broadcom's iProc MDIO and Cygnus Ethernet PHY

2015-09-30 Thread Arun Parameswaran
Hi
This patchset adds support for the iProc MDIO interface and the
Broadcom Cygnus SoC's internal Ethernet PHY.

The internal Ethernet PHY(s) in the Cygnus SoCs are accessed
via the MDIO interface found in most of the iProc based chips.

The patchset also consolidates the common APIs used by the
Broadcom PHYs into a common library. Existing Broadcom PHY
drivers have been modified to use the common library APIs.

The Ethernet driver for the iProc family will be submitted soon,
as will the device tree configurations for the different iProc
family SoCs.

Arun Parameswaran (4):
  dt-bindings: net: Broadcom iProc MDIO bus driver device tree binding
  net: phy: Broadcom iProc MDIO bus driver
  net: phy: Add Broadcom phy library for common interfaces
  net: phy: Broadcom Cygnus internal Ethernet PHY driver

 .../devicetree/bindings/net/brcm,iproc-mdio.txt|  23 +++
 drivers/net/phy/Kconfig|  28 +++
 drivers/net/phy/Makefile   |   3 +
 drivers/net/phy/bcm-cygnus.c   | 162 
 drivers/net/phy/bcm-phy-lib.c  | 209 
 drivers/net/phy/bcm-phy-lib.h  |  37 
 drivers/net/phy/bcm63xx.c  |  38 +---
 drivers/net/phy/bcm7xxx.c  | 127 +++-
 drivers/net/phy/broadcom.c | 149 +-
 drivers/net/phy/mdio-bcm-iproc.c   | 213 +
 include/linux/brcmphy.h|  24 +--
 11 files changed, 757 insertions(+), 256 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/brcm,iproc-mdio.txt
 create mode 100644 drivers/net/phy/bcm-cygnus.c
 create mode 100644 drivers/net/phy/bcm-phy-lib.c
 create mode 100644 drivers/net/phy/bcm-phy-lib.h
 create mode 100644 drivers/net/phy/mdio-bcm-iproc.c

-- 
2.5.2



[PATCH 2/4] net: phy: Broadcom iProc MDIO bus driver

2015-09-30 Thread Arun Parameswaran
This patch adds support for the Broadcom iProc MDIO bus interface.
The MDIO interface can be found in the Broadcom iProc family SoCs.

The MDIO bus is accessed using a combination of command and data
registers. This MDIO driver provides access to the Ethernet GPHYs
connected to the MDIO bus.
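
For illustration, the read command word written to the data register
can be assembled as in the sketch below. The shift and opcode values
are copied from the defines in this patch; iproc_mdio_read_cmd() is a
hypothetical standalone helper, not a function in the driver. The field
layout appears to mirror a Clause 22 MDIO frame (start bits, opcode,
PHY address, register address, turnaround).

```c
#include <assert.h>
#include <stdint.h>

/* Field positions and values as defined in the patch. */
#define MII_DATA_TA_SHIFT 16
#define MII_DATA_TA_VAL   2
#define MII_DATA_RA_SHIFT 18
#define MII_DATA_PA_SHIFT 23
#define MII_DATA_OP_SHIFT 28
#define MII_DATA_OP_READ  2
#define MII_DATA_SB_SHIFT 30

/* Build the 32-bit command for reading register 'reg' of PHY 'phy_id'. */
static uint32_t iproc_mdio_read_cmd(unsigned int phy_id, unsigned int reg)
{
	/* PHY and register addresses are 5-bit fields. */
	assert(phy_id < 32 && reg < 32);

	return ((uint32_t)MII_DATA_TA_VAL << MII_DATA_TA_SHIFT) |
	       ((uint32_t)reg << MII_DATA_RA_SHIFT) |
	       ((uint32_t)phy_id << MII_DATA_PA_SHIFT) |
	       ((uint32_t)1 << MII_DATA_SB_SHIFT) |
	       ((uint32_t)MII_DATA_OP_READ << MII_DATA_OP_SHIFT);
}
```

For example, reading register 2 of PHY 1 yields the command word
0x608a0000.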

Signed-off-by: Arun Parameswaran 
---
 drivers/net/phy/Kconfig  |   9 ++
 drivers/net/phy/Makefile |   1 +
 drivers/net/phy/mdio-bcm-iproc.c | 213 +++
 3 files changed, 223 insertions(+)
 create mode 100644 drivers/net/phy/mdio-bcm-iproc.c

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index c5ad98a..b57f6c2 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -225,6 +225,15 @@ config MDIO_BCM_UNIMAC
  This hardware can be found in the Broadcom GENET Ethernet MAC
  controllers as well as some Broadcom Ethernet switches such as the
  Starfighter 2 switches.
+
+config MDIO_BCM_IPROC
+   tristate "Broadcom iProc MDIO bus controller"
+   depends on ARCH_BCM_IPROC || COMPILE_TEST
+   depends on HAS_IOMEM && OF_MDIO
+   help
+ This module provides a driver for the MDIO busses found in the
+ Broadcom iProc SoC's.
+
 endif # PHYLIB
 
 config MICREL_KS8995MA
diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
index 87f079c..f4e6eb9 100644
--- a/drivers/net/phy/Makefile
+++ b/drivers/net/phy/Makefile
@@ -38,3 +38,4 @@ obj-$(CONFIG_MDIO_SUN4I)  += mdio-sun4i.o
 obj-$(CONFIG_MDIO_MOXART)  += mdio-moxart.o
 obj-$(CONFIG_MDIO_BCM_UNIMAC)  += mdio-bcm-unimac.o
 obj-$(CONFIG_MICROCHIP_PHY)+= microchip.o
+obj-$(CONFIG_MDIO_BCM_IPROC)   += mdio-bcm-iproc.o
diff --git a/drivers/net/phy/mdio-bcm-iproc.c b/drivers/net/phy/mdio-bcm-iproc.c
new file mode 100644
index 000..c0b4e65
--- /dev/null
+++ b/drivers/net/phy/mdio-bcm-iproc.c
@@ -0,0 +1,213 @@
+/*
+ * Copyright (C) 2015 Broadcom Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation version 2.
+ *
+ * This program is distributed "as is" WITHOUT ANY WARRANTY of any
+ * kind, whether express or implied; without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define IPROC_GPHY_MDCDIV    0x1a
+
+#define MII_CTRL_OFFSET      0x000
+
+#define MII_CTRL_DIV_SHIFT   0
+#define MII_CTRL_PRE_SHIFT   7
+#define MII_CTRL_BUSY_SHIFT  8
+
+#define MII_DATA_OFFSET      0x004
+#define MII_DATA_MASK        0x
+#define MII_DATA_TA_SHIFT    16
+#define MII_DATA_TA_VAL      2
+#define MII_DATA_RA_SHIFT    18
+#define MII_DATA_PA_SHIFT    23
+#define MII_DATA_OP_SHIFT    28
+#define MII_DATA_OP_WRITE    1
+#define MII_DATA_OP_READ     2
+#define MII_DATA_SB_SHIFT    30
+
+struct iproc_mdio_priv {
+   struct mii_bus *mii_bus;
+   void __iomem *base;
+};
+
+static inline int iproc_mdio_wait_for_idle(void __iomem *base)
+{
+   u32 val;
+   unsigned int timeout = 1000; /* loop for 1s */
+
+   do {
+   val = readl(base + MII_CTRL_OFFSET);
+   if ((val & BIT(MII_CTRL_BUSY_SHIFT)) == 0)
+   return 0;
+
+   usleep_range(1000, 2000);
+   } while (timeout--);
+
+   return -ETIMEDOUT;
+}
+
+static inline void iproc_mdio_config_clk(void __iomem *base)
+{
+   u32 val;
+
+   val = (IPROC_GPHY_MDCDIV << MII_CTRL_DIV_SHIFT) |
+ BIT(MII_CTRL_PRE_SHIFT);
+   writel(val, base + MII_CTRL_OFFSET);
+}
+
+static int iproc_mdio_read(struct mii_bus *bus, int phy_id, int reg)
+{
+   struct iproc_mdio_priv *priv = bus->priv;
+   u32 cmd;
+   int rc;
+
+   rc = iproc_mdio_wait_for_idle(priv->base);
+   if (rc)
+   return rc;
+
+   iproc_mdio_config_clk(priv->base);
+
+   /* Prepare the read operation */
+   cmd = (MII_DATA_TA_VAL << MII_DATA_TA_SHIFT) |
+   (reg << MII_DATA_RA_SHIFT) |
+   (phy_id << MII_DATA_PA_SHIFT) |
+   BIT(MII_DATA_SB_SHIFT) |
+   (MII_DATA_OP_READ << MII_DATA_OP_SHIFT);
+
+   writel(cmd, priv->base + MII_DATA_OFFSET);
+
+   rc = iproc_mdio_wait_for_idle(priv->base);
+   if (rc)
+   return rc;
+
+   cmd = readl(priv->base + MII_DATA_OFFSET) & MII_DATA_MASK;
+
+   return cmd;
+}
+
+static int iproc_mdio_write(struct mii_bus *bus, int phy_id,
+   int reg, u16 val)
+{
+   struct iproc_mdio_priv *priv = bus->priv;
+   u32 cmd;
+   int rc;
+
+   rc = iproc_mdio_wait_for_idle(priv->base);
+   if (rc)
+   return rc;
+
+   iproc_mdio_config_clk(priv->base);
+
+   /* Prepare the 

Re: [PATCH 3/4] net: phy: Add Broadcom phy library for common interfaces

2015-09-30 Thread kbuild test robot
Hi Arun,

[auto build test results on v4.3-rc3 -- if it's inappropriate base, please 
ignore]

config: i386-randconfig-i1-201539 (attached as .config)
reproduce:
  git checkout 25a633b2114806a7ce7d4f171c4714880e2c721b
  # save the attached .config to linux build tree
  make ARCH=i386 

All error/warnings (new ones prefixed by >>):

>> ERROR: "bcm_phy_config_intr" [drivers/net/phy/broadcom.ko] undefined!
>> ERROR: "bcm_phy_ack_intr" [drivers/net/phy/broadcom.ko] undefined!
>> ERROR: "bcm_phy_read_exp" [drivers/net/phy/broadcom.ko] undefined!
>> ERROR: "bcm_phy_write_exp" [drivers/net/phy/broadcom.ko] undefined!
>> ERROR: "bcm_phy_read_shadow" [drivers/net/phy/broadcom.ko] undefined!
>> ERROR: "bcm_phy_write_shadow" [drivers/net/phy/broadcom.ko] undefined!
>> ERROR: "bcm_phy_enable_apd" [drivers/net/phy/bcm7xxx.ko] undefined!
>> ERROR: "bcm_phy_enable_eee" [drivers/net/phy/bcm7xxx.ko] undefined!
>> ERROR: "bcm_phy_write_misc" [drivers/net/phy/bcm7xxx.ko] undefined!
>> ERROR: "bcm_phy_write_exp" [drivers/net/phy/bcm7xxx.ko] undefined!

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation




Re: [RFC PATCH 1/3] net: dsa: Use devm_ prefixed allocations

2015-09-30 Thread Andrew Lunn
Hi Neil

I tested all three patches on a board with three switches.

1) Normal boot
2) Bad address set for the 3rd switch so that it was not found, so
   causing the probe to fail.

No regressions observed. 

Tested-by: Andrew Lunn 

As Florian said, this is going in the right direction for modular DSA,
but still quite a way to go...

Thanks
Andrew

On Wed, Sep 30, 2015 at 10:21:08AM +0200, Neil Armstrong wrote:
> To simplify and prevent memory leakage when unbinding, use
> the devm_ memory allocation calls.
> 
> Signed-off-by: Neil Armstrong 
> ---
>  net/dsa/dsa.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
> index c59fa5d..98f94c2 100644
> --- a/net/dsa/dsa.c
> +++ b/net/dsa/dsa.c
> @@ -305,7 +305,7 @@ static int dsa_switch_setup_one(struct dsa_switch *ds, struct device *parent)
>   if (ret < 0)
>   goto out;
> 
> - ds->slave_mii_bus = mdiobus_alloc();
> + ds->slave_mii_bus = devm_mdiobus_alloc(parent);
>   if (ds->slave_mii_bus == NULL) {
>   ret = -ENOMEM;
>   goto out;
> @@ -400,7 +400,7 @@ dsa_switch_setup(struct dsa_switch_tree *dst, int index,
>   /*
>* Allocate and initialise switch state.
>*/
> - ds = kzalloc(sizeof(*ds) + drv->priv_size, GFP_KERNEL);
> + ds = devm_kzalloc(parent, sizeof(*ds) + drv->priv_size, GFP_KERNEL);
>   if (ds == NULL)
>   return ERR_PTR(-ENOMEM);
> 
> @@ -883,7 +883,7 @@ static int dsa_probe(struct platform_device *pdev)
>   goto out;
>   }
> 
> - dst = kzalloc(sizeof(*dst), GFP_KERNEL);
> + dst = devm_kzalloc(&pdev->dev, sizeof(*dst), GFP_KERNEL);
>   if (dst == NULL) {
>   dev_put(dev);
>   ret = -ENOMEM;
> -- 
> 1.9.1


[PATCH 1/4] dt-bindings: net: Broadcom iProc MDIO bus driver device tree binding

2015-09-30 Thread Arun Parameswaran
Add device tree binding documentation for the Broadcom iProc MDIO
bus driver.

Signed-off-by: Arun Parameswaran 
---
 .../devicetree/bindings/net/brcm,iproc-mdio.txt| 23 ++
 1 file changed, 23 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/brcm,iproc-mdio.txt

diff --git a/Documentation/devicetree/bindings/net/brcm,iproc-mdio.txt b/Documentation/devicetree/bindings/net/brcm,iproc-mdio.txt
new file mode 100644
index 000..689f97c
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/brcm,iproc-mdio.txt
@@ -0,0 +1,23 @@
+* Broadcom iProc MDIO bus controller
+
+Required properties:
+- compatible: should be "brcm,iproc-mdio"
+- reg: address and length of the register set for the MDIO interface
+- #size-cells: must be 1
+- #address-cells: must be 0
+
+Child nodes of this MDIO bus controller node are standard Ethernet PHY device
+nodes as described in Documentation/devicetree/bindings/net/phy.txt
+
+Example:
+
+mdio@0x18002000 {
+   compatible = "brcm,iproc-mdio";
+   reg = <0x18002000 0x8>;
+   #size-cells = <1>;
+   #address-cells = <0>;
+
+   enet-gphy@0 {
+   reg = <0>;
+   };
+};
-- 
2.5.2



DSA driver - how to glue to a PCI based NIC's mdio?

2015-09-30 Thread Tim Harvey
Greetings,

I'm working on adding DSA support for a PCIe expansion card (designed
by us) that has a common PCIe NIC connected via its MII bus to a Marvell
MV88E6171. Because the NIC is a PCIe device, neither the NIC nor its
MDIO bus has a device-tree representation, but the NIC driver does
register its MDIO bus with Linux.

Does anyone know how I would be able to specify the Ethernet device
and MDIO bus for this add-in card if it has no device-tree handle?

I can write a PHY driver that gets probed by matching the MV88E6171
PHY ID when the NIC's MDIO bus gets registered, yet even then I'm not
clear how to get hold of the netdev pointer given the MDIO bus so as
to build a dsa_chip_data struct to register as a platform device. I'm
also not clear how to verify in this case whether the MV88E6171 is
from 'my' add-in card vs another card.

Perhaps the right approach is to program the NIC's EEPROM on our board
with a PCI_ID/DEVICE_ID of ours, add support for those IDs to the
NIC's driver, and within the NIC's driver create and register a DSA
platform device when our ID is encountered?

Thanks for any advice,

Tim

Tim Harvey - Principal Software Engineer
Gateworks Corporation - http://www.gateworks.com/
3026 S. Higuera St. San Luis Obispo CA 93401
805-781-2000


Re: DSA driver - how to glue to a PCI based NIC's mdio?

2015-09-30 Thread Andrew Lunn
On Wed, Sep 30, 2015 at 01:44:52PM -0700, Tim Harvey wrote:
> Greetings,
> 
> I'm working on adding DSA support for a PCIe expansion card (designed
> by us) that has common PCIe NIC connected via its mii-bus to a Marvell
> MV88E6171. Because the NIC is a PCIe device, it has no device-tree
> representation of its NIC or its mdio bus, but does register its mdio
> bus with Linux.

It is possible to represent PCIe devices in device tree. Take a look
at ePAPR. Is the PCIe host in DT?

> Perhaps the right approach is to program the NIC's EEPROM on our board
> with a PCI_ID/DEVICE_ID of ours, add support for those ID's to the
> NIC's driver, and within the NIC's driver create and register dsa
> platform device when our ID is encountered?

This sounds sensible. But I doubt you can add your DSA platform
information to the NIC's device driver. Better would be to have a
small shim driver which is loaded on your PCI_ID/DEVICE_ID. That would
instantiate the NIC driver, and insert a DSA platform device.

Andrew


Re: DSA driver - how to glue to a PCI based NIC's mdio?

2015-09-30 Thread Tim Harvey
On Wed, Sep 30, 2015 at 2:12 PM, Andrew Lunn  wrote:
> On Wed, Sep 30, 2015 at 01:44:52PM -0700, Tim Harvey wrote:
>> Greetings,
>>
>> I'm working on adding DSA support for a PCIe expansion card (designed
>> by us) that has common PCIe NIC connected via its mii-bus to a Marvell
>> MV88E6171. Because the NIC is a PCIe device, it has no device-tree
>> representation of its NIC or its mdio bus, but does register its mdio
>> bus with Linux.
>
> It is possible to represent PCIe devices in device tree. Take a look
> at ePAPR. Is the PCIe host in DT?

It is possible to represent PCI devices in device-tree, but not in a
dynamic or pluggable fashion: they have to be nested per bus/slot,
which defeats the purpose of dynamic enumeration.

>
>> Perhaps the right approach is to program the NIC's EEPROM on our board
>> with a PCI_ID/DEVICE_ID of ours, add support for those ID's to the
>> NIC's driver, and within the NIC's driver create and register dsa
>> platform device when our ID is encountered?
>
> This sounds sensible. But i doubt you can add your DSA platform
> information to the NIC's device driver. Better would be to have a
> small shim driver which is loaded on your PCI_ID/DEVICE_ID. That would
> instantiate the NIC driver, and insert a DSA platform device.

I was thinking of this as well, but the shim would still need to know
the netdevice that the driver I'm shimming creates, so I can't figure
out a way to do it without touching the PCI driver.

Tim


[PATCH 4/4] net: phy: Broadcom Cygnus internal Ethernet PHY driver

2015-09-30 Thread Arun Parameswaran
Add support for the internal PHYs of the Broadcom Cygnus SoCs.
The PHYs are 1000M/100M/10M capable with support for 'EEE'
and 'APD' (Auto Power Down).

This driver supports the following Broadcom Cygnus SoCs:
 - BCM583XX (BCM58300, BCM58302, BCM58303, BCM58305)
 - BCM113XX (BCM11300, BCM11320, BCM11350, BCM11360)

The PHYs on these SoCs require some workarounds for
stable operation, both at configuration time and
during suspend/resume. This driver applies those
workarounds.

Signed-off-by: Arun Parameswaran 
---
 drivers/net/phy/Kconfig  |  13 
 drivers/net/phy/Makefile |   1 +
 drivers/net/phy/bcm-cygnus.c | 162 +++
 include/linux/brcmphy.h  |   2 +
 4 files changed, 178 insertions(+)
 create mode 100644 drivers/net/phy/bcm-cygnus.c

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index 0c85a01..da979ec 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -79,6 +79,19 @@ config BROADCOM_PHY
  Currently supports the BCM5411, BCM5421, BCM5461, BCM54616S, BCM5464,
  BCM5481 and BCM5482 PHYs.
 
+config BCM_CYGNUS_PHY
+   tristate "Drivers for Broadcom Cygnus SoC internal PHY"
+   depends on ARCH_BCM_CYGNUS || COMPILE_TEST
+   select BCM_NET_PHYLIB
+   select MDIO_BCM_IPROC
+   ---help---
+ This PHY driver is for the 1G internal PHYs of the Broadcom
+ Cygnus Family SoC.
+
+ Currently supports internal PHY's used in the BCM11300,
+ BCM11320, BCM11350, BCM11360, BCM58300, BCM58302,
+ BCM58303 & BCM58305 Broadcom Cygnus SoCs.
+
 config BCM63XX_PHY
tristate "Drivers for Broadcom 63xx SOCs internal PHY"
depends on BCM63XX
diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
index 6932475..7655d47 100644
--- a/drivers/net/phy/Makefile
+++ b/drivers/net/phy/Makefile
@@ -17,6 +17,7 @@ obj-$(CONFIG_BROADCOM_PHY)+= broadcom.o
 obj-$(CONFIG_BCM63XX_PHY)  += bcm63xx.o
 obj-$(CONFIG_BCM7XXX_PHY)  += bcm7xxx.o
 obj-$(CONFIG_BCM87XX_PHY)  += bcm87xx.o
+obj-$(CONFIG_BCM_CYGNUS_PHY)   += bcm-cygnus.o
 obj-$(CONFIG_ICPLUS_PHY)   += icplus.o
 obj-$(CONFIG_REALTEK_PHY)  += realtek.o
 obj-$(CONFIG_LSI_ET1011C_PHY)  += et1011c.o
diff --git a/drivers/net/phy/bcm-cygnus.c b/drivers/net/phy/bcm-cygnus.c
new file mode 100644
index 000..28bab20
--- /dev/null
+++ b/drivers/net/phy/bcm-cygnus.c
@@ -0,0 +1,162 @@
+/*
+ * Copyright (C) 2015 Broadcom Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation version 2.
+ *
+ * This program is distributed "as is" WITHOUT ANY WARRANTY of any
+ * kind, whether express or implied; without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+/* Broadcom Cygnus SoC internal transceivers support. */
+#include "bcm-phy-lib.h"
+#include 
+#include 
+#include 
+#include 
+
+/* Broadcom Cygnus Phy specific registers */
+#define MII_BCM_CORE_BASE1E   0x1E /* Core BASE1E register */
+#define MII_BCM_EXPB0 0xB0 /* EXPB0 register */
+#define MII_BCM_EXPB1 0xB1 /* EXPB1 register */
+
+#define MII_BCM_CYGNUS_AFE_VDAC_ICTRL_0  0x91E5 /* VDAC Control register */
+
+static int bcm_cygnus_afe_config(struct phy_device *phydev)
+{
+   int rc;
+
+   /* ensure smdspclk is enabled */
+   rc = phy_write(phydev, MII_BCM54XX_AUX_CTL, 0x0c30);
+   if (rc < 0)
+   return rc;
+
+   /* AFE_VDAC_ICTRL_0 bit 7:4 Iq=1100 for 1g 10bt, normal modes */
+   rc = bcm_phy_write_misc(phydev, 0x39, 0x01, 0xA7C8);
+   if (rc < 0)
+   return rc;
+
+   /* AFE_HPF_TRIM_OTHERS bit11=1, short cascode enable for all modes*/
+   rc = bcm_phy_write_misc(phydev, 0x3A, 0x00, 0x0803);
+   if (rc < 0)
+   return rc;
+
+   /* AFE_TX_CONFIG_1 bit 7:4 Iq=1100 for test modes */
+   rc = bcm_phy_write_misc(phydev, 0x3A, 0x01, 0xA740);
+   if (rc < 0)
+   return rc;
+
+   /* AFE TEMPSEN_OTHERS rcal_HT, rcal_LT 1 */
+   rc = bcm_phy_write_misc(phydev, 0x3A, 0x03, 0x8400);
+   if (rc < 0)
+   return rc;
+
+   /* AFE_FUTURE_RSV bit 2:0 rccal <2:0>=100 */
+   rc = bcm_phy_write_misc(phydev, 0x3B, 0x00, 0x0004);
+   if (rc < 0)
+   return rc;
+
+   /* Adjust bias current trim to overcome digital offset */
+   rc = phy_write(phydev, MII_BCM_CORE_BASE1E, 0x02);
+   if (rc < 0)
+   return rc;
+
+   /* make rcal=100, since rdb default is 000 */
+   rc = bcm_phy_write_exp(phydev, MII_BCM_EXPB1, 0x10);
+   if (rc < 0)
+   return rc;
+
+   /* CORE_EXPB0, Reset R_CAL/RC_CAL Engine */
+   rc = bcm_phy_write_exp(phydev, MII_BCM_EXPB0, 0x10);
+   if 

[PATCH 3/4] net: phy: Add Broadcom phy library for common interfaces

2015-09-30 Thread Arun Parameswaran
This patch adds the Broadcom PHY library to consolidate common
interfaces shared by Broadcom PHYs.

Move the common interfaces to 'bcm-phy-lib.c' and update
the Broadcom PHY drivers to use the new APIs.

Signed-off-by: Arun Parameswaran 
---
 drivers/net/phy/Kconfig   |   6 ++
 drivers/net/phy/Makefile  |   1 +
 drivers/net/phy/bcm-phy-lib.c | 209 ++
 drivers/net/phy/bcm-phy-lib.h |  37 
 drivers/net/phy/bcm63xx.c |  38 +---
 drivers/net/phy/bcm7xxx.c | 127 ++---
 drivers/net/phy/broadcom.c| 149 +-
 include/linux/brcmphy.h   |  22 +
 8 files changed, 333 insertions(+), 256 deletions(-)
 create mode 100644 drivers/net/phy/bcm-phy-lib.c
 create mode 100644 drivers/net/phy/bcm-phy-lib.h

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index b57f6c2..0c85a01 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -69,8 +69,12 @@ config SMSC_PHY
---help---
  Currently supports the LAN83C185, LAN8187 and LAN8700 PHYs
 
+config BCM_NET_PHYLIB
+   bool
+
 config BROADCOM_PHY
tristate "Drivers for Broadcom PHYs"
+   select BCM_NET_PHYLIB
---help---
  Currently supports the BCM5411, BCM5421, BCM5461, BCM54616S, BCM5464,
  BCM5481 and BCM5482 PHYs.
@@ -78,11 +82,13 @@ config BROADCOM_PHY
 config BCM63XX_PHY
tristate "Drivers for Broadcom 63xx SOCs internal PHY"
depends on BCM63XX
+   select BCM_NET_PHYLIB
---help---
  Currently supports the 6348 and 6358 PHYs.
 
 config BCM7XXX_PHY
tristate "Drivers for Broadcom 7xxx SOCs internal PHYs"
+   select BCM_NET_PHYLIB
---help---
  Currently supports the BCM7366, BCM7439, BCM7445, and
  40nm and 65nm generation of BCM7xxx Set Top Box SoCs.
diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
index f4e6eb9..6932475 100644
--- a/drivers/net/phy/Makefile
+++ b/drivers/net/phy/Makefile
@@ -12,6 +12,7 @@ obj-$(CONFIG_QSEMI_PHY)   += qsemi.o
 obj-$(CONFIG_SMSC_PHY) += smsc.o
 obj-$(CONFIG_TERANETICS_PHY)   += teranetics.o
 obj-$(CONFIG_VITESSE_PHY)  += vitesse.o
+obj-$(CONFIG_BCM_NET_PHYLIB)   += bcm-phy-lib.o
 obj-$(CONFIG_BROADCOM_PHY) += broadcom.o
 obj-$(CONFIG_BCM63XX_PHY)  += bcm63xx.o
 obj-$(CONFIG_BCM7XXX_PHY)  += bcm7xxx.o
diff --git a/drivers/net/phy/bcm-phy-lib.c b/drivers/net/phy/bcm-phy-lib.c
new file mode 100644
index 000..13e161e
--- /dev/null
+++ b/drivers/net/phy/bcm-phy-lib.c
@@ -0,0 +1,209 @@
+/*
+ * Copyright (C) 2015 Broadcom Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation version 2.
+ *
+ * This program is distributed "as is" WITHOUT ANY WARRANTY of any
+ * kind, whether express or implied; without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include "bcm-phy-lib.h"
+#include 
+#include 
+#include 
+#include 
+
+#define MII_BCM_CHANNEL_WIDTH 0x2000
+#define BCM_CL45VEN_EEE_ADV   0x3c
+
+int bcm_phy_write_exp(struct phy_device *phydev, u16 reg, u16 val)
+{
+   int rc;
+
+   rc = phy_write(phydev, MII_BCM54XX_EXP_SEL, reg);
+   if (rc < 0)
+   return rc;
+
+   return phy_write(phydev, MII_BCM54XX_EXP_DATA, val);
+}
+EXPORT_SYMBOL_GPL(bcm_phy_write_exp);
+
+int bcm_phy_read_exp(struct phy_device *phydev, u16 reg)
+{
+   int val;
+
+   val = phy_write(phydev, MII_BCM54XX_EXP_SEL, reg);
+   if (val < 0)
+   return val;
+
+   val = phy_read(phydev, MII_BCM54XX_EXP_DATA);
+
+   /* Restore default value.  It's O.K. if this write fails. */
+   phy_write(phydev, MII_BCM54XX_EXP_SEL, 0);
+
+   return val;
+}
+EXPORT_SYMBOL_GPL(bcm_phy_read_exp);
+
+int bcm_phy_write_misc(struct phy_device *phydev,
+  u16 reg, u16 chl, u16 val)
+{
+   int rc;
+   int tmp;
+
+   rc = phy_write(phydev, MII_BCM54XX_AUX_CTL,
+  MII_BCM54XX_AUXCTL_SHDWSEL_MISC);
+   if (rc < 0)
+   return rc;
+
+   tmp = phy_read(phydev, MII_BCM54XX_AUX_CTL);
+   tmp |= MII_BCM54XX_AUXCTL_ACTL_SMDSP_ENA;
+   rc = phy_write(phydev, MII_BCM54XX_AUX_CTL, tmp);
+   if (rc < 0)
+   return rc;
+
+   tmp = (chl * MII_BCM_CHANNEL_WIDTH) | reg;
+   rc = bcm_phy_write_exp(phydev, tmp, val);
+
+   return rc;
+}
+EXPORT_SYMBOL_GPL(bcm_phy_write_misc);
+
+int bcm_phy_read_misc(struct phy_device *phydev,
+ u16 reg, u16 chl)
+{
+   int rc;
+   int tmp;
+
+   rc = phy_write(phydev, MII_BCM54XX_AUX_CTL,
+  MII_BCM54XX_AUXCTL_SHDWSEL_MISC);
+   if (rc < 0)
+   return rc;
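
(Archive note on the patch above, which is cut off at this point.)
bcm_phy_write_misc() builds the expansion register address from a
register number and a channel with "(chl * MII_BCM_CHANNEL_WIDTH) | reg".
The following userspace sketch only reproduces that arithmetic so the
resulting addresses can be checked by hand. The helper name
bcm_misc_exp_addr() is hypothetical and does not appear in the patch;
the constant value is the one the patch defines.

```c
#include <stdint.h>

/* Value copied from the patch: each channel occupies a 0x2000-wide window. */
#define MII_BCM_CHANNEL_WIDTH 0x2000

/*
 * Hypothetical helper, not part of the patch: mirrors the
 * "tmp = (chl * MII_BCM_CHANNEL_WIDTH) | reg" line in bcm_phy_write_misc().
 */
static uint16_t bcm_misc_exp_addr(uint16_t reg, uint16_t chl)
{
	return (uint16_t)((chl * MII_BCM_CHANNEL_WIDTH) | reg);
}
```

Under this reading, the bcm_phy_write_misc(phydev, 0x39, 0x01, ...) call
in the Cygnus driver would select expansion address 0x2039, and register
0x3A on channel 0x03 would select 0x603A.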

Re: Problem with ICMP rate limiting and redirects

2015-09-30 Thread Eric Dumazet
On Wed, 2015-09-30 at 15:13 -0300, Hugo Vasconcelos Saldanha wrote:
> Hi Eric,
> 
> On Wed, Sep 30, 2015 at 1:42 PM, Eric Dumazet  wrote:
> > On Wed, 2015-09-30 at 13:10 -0300, Hugo Vasconcelos Saldanha wrote:
> >> Hi,
> >>
> >> While updating the kernel from v3.2 to v3.14, I started to see a
> >> different behavior concerning ICMP redirects sent by this updated
> >> server. The network is somewhat configured like this:
> >>
> >>---|firewall|{Internet}
> >> |client|--|
> >>   |
> >>---|router|--|172.16/12 network|
> >>
> >> The client's default gateway is 'firewall', which is the updated
> >> server. It has a static route to 172.16 network by 'router'. If
> >> 'client' wants to talk to a server in that network, 'firewall' sends a
> >> ICMP redirect pointing to router as the gateway.
> >>
> >> This worked fine with v3.2. But after the upgrade, if an ICMP message
> >> that is rate-limited (by the sysctl_icmp_ratelimit mask) is sent to
> >> 'client', ICMP redirects stop being sent to the same client. This
> >> happens, for example, when traceroute'ing from the client to the
> >> server inside the mentioned network. In this situation, a ICMP Time
> >> Exceeded message is sent in response to traceroute's first packet, but
> >> then the following packets never generate any ICMP redirect messages
> >> in 'firewall'.
> >>
> >> Debugging the code, I was able to see that the problem is being caused
> >> by the fact that ip_rt_send_redirect() started to use the inetpeer
> >> cache and the fields used to rate limit ICMP redirects (rate_tokens
> >> and rate_last) are now being shared with the algorithm applied in
> >> inet_peer_xrlim_allow(). This never happened with v3.2 because
> >> apparently inet_peer_xrlim_allow() and ip_rt_send_redirect() used
> >> different inetpeer objects.
> >>
> >> The reason why this breaks the functionality is that, while
> >> inet_peer_xrlim_allow() uses a time bucket, ip_rt_send_redirect() uses
> >> rate_tokens as a packet counter. Not to mention the fact that these
> >> are two completely different policies which should be controlled by
> >> different buckets, counters, flags, etc. Because of this,
> >> ip_rt_redirect_silence, ip_rt_redirect_number and ip_rt_redirect_load
> >> /proc files are broken also.
> >>
> >> The easiest solution would be to create new fields in 'struct
> >> inetpeer' to control ICMP redirects only, but I'm not able to measure
> >> its convenience.
> >>
> >> Any thoughts?
> >>
> >> PS: Apparently, a similar problem was reported here:
> >> http://marc.info/?l=linux-netdev&m=139696540600985
> >>
> >> PS2: I could try to reproduce the problem with the latest code if this
> >> is really necessary.
> >
> > Hmm... Do you have commit
> >
> > 4cdf507d54525842dfd9f6313fdafba039084046
> > ("icmp: add a global rate limitation")
> > in your kernel ?
> >
> 
> No, but i just tested it and problem continues. AFAICT, ICMP redirects
> shouldn't be limited by the logic implemented by that patch, at least
> with default icmp_ratemask. And the algorithm in ip_rt_send_redirect()
> has a different purpose, too.

OK thanks.

I also mentioned that commit as a hint about why relying on inetpeer
might open the door to DDoS attacks.


Note that if we really want to send millions of ICMP messages per
second, we might extend the idea and infrastructure added in commit
04ca6973f7c1a ("ip: make IP identifiers less predictable"):
add a token bucket to the ip_idents hash and no longer rely on
inetpeer.
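
(Archive note.) The interaction described in the quoted report can be
made concrete with a small userspace model. This is only a sketch under
stated assumptions, not the kernel code: struct peer_model and both
function names are hypothetical, time is passed in explicitly instead of
reading jiffies, and the constants are merely chosen to resemble typical
defaults. It shows a time bucket and a packet counter sharing the same
rate_tokens/rate_last pair, as the report says happens once both paths
use the same inetpeer entry.

```c
#include <stdbool.h>

/* All values below are assumptions for the model, not kernel definitions. */
#define HZ               1000UL
#define ICMP_TIMEOUT     (1 * HZ)      /* cost of one rate-limited ICMP message */
#define BURST_FACTOR     6UL           /* bucket holds at most 6 messages */
#define REDIRECT_NUMBER  9UL           /* give up after this many redirects */
#define REDIRECT_LOAD    (HZ / 50)
#define REDIRECT_SILENCE (REDIRECT_LOAD << (REDIRECT_NUMBER + 1))

struct peer_model {
	unsigned long rate_tokens;
	unsigned long rate_last;
};

/* Time bucket: rate_tokens holds elapsed ticks; a message costs `timeout`. */
static bool xrlim_allow(struct peer_model *p, unsigned long now,
			unsigned long timeout)
{
	unsigned long token = p->rate_tokens + (now - p->rate_last);
	bool ok = false;

	p->rate_last = now;
	if (token > BURST_FACTOR * timeout)
		token = BURST_FACTOR * timeout;
	if (token >= timeout) {
		token -= timeout;
		ok = true;
	}
	p->rate_tokens = token;
	return ok;
}

/* Packet counter: rate_tokens counts redirects already sent to the peer. */
static bool redirect_allow(struct peer_model *p, unsigned long now)
{
	/* A long quiet period resets the counter. */
	if (now - p->rate_last > REDIRECT_SILENCE)
		p->rate_tokens = 0;

	/* Too many redirects were ignored: stay silent. */
	if (p->rate_tokens >= REDIRECT_NUMBER) {
		p->rate_last = now;
		return false;
	}

	/* Exponential back-off between successive redirects. */
	if (p->rate_tokens == 0 ||
	    now - p->rate_last > (REDIRECT_LOAD << p->rate_tokens)) {
		p->rate_last = now;
		p->rate_tokens++;
		return true;
	}
	return false;
}
```

In this model, one allowed Time Exceeded message at tick 100000 leaves
rate_tokens at 5000 in the shared structure. A redirect check ten ticks
later reads that value as "5000 redirects already sent" and suppresses
the redirect. With two separate peer_model instances the same sequence
sends the redirect, which is the effect the "new fields" proposal in the
report is after.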






Re: DSA driver - how to glue to a PCI based NIC's mdio?

2015-09-30 Thread Florian Fainelli
On 30/09/15 14:27, Tim Harvey wrote:
> On Wed, Sep 30, 2015 at 2:12 PM, Andrew Lunn  wrote:
>> On Wed, Sep 30, 2015 at 01:44:52PM -0700, Tim Harvey wrote:
>>> Greetings,
>>>
>>> I'm working on adding DSA support for a PCIe expansion card (designed
>>> by us) that has common PCIe NIC connected via its mii-bus to a Marvell
>>> MV88E6171. Because the NIC is a PCIe device, it has no device-tree
>>> representation of its NIC or its mdio bus, but does register its mdio
>>> bus with Linux.
>>
>> It is possible to represent PCIe devices in device tree. Take a look
>> at ePAPR. Is the PCIe host in DT?
> 
> It is possible to represent PCI devices in device-tree however not in
> a dynamic or plug-able fashion - they have to be nested per bus/slot
> which defeats the purpose of dynamic enumeration.

Even though a bus is completely auto-discoverable, if there is
additional information needed to supplement that topology, having things
be represented in Device Tree is typically accepted.

> 
>>
>>> Perhaps the right approach is to program the NIC's EEPROM on our board
>>> with a PCI_ID/DEVICE_ID of ours, add support for those ID's to the
>>> NIC's driver, and within the NIC's driver create and register dsa
>>> platform device when our ID is encountered?
>>
>> This sounds sensible. But i doubt you can add your DSA platform
>> information to the NIC's device driver. Better would be to have a
>> small shim driver which is loaded on your PCI_ID/DEVICE_ID. That would
>> instantiate the NIC driver, and insert a DSA platform device.
> 
> I was thinking of this as well, but then I would still need that shim
> to know the netdevice that the driver I'm shimming creates so I can't
> figure a way to do it without touching the PCI driver.

You can register a network device notifier, and try to extract that
information about this network device you need once you see that device
being registered. As an example, there is a loopback/fake DSA switch
driver here which uses the loopback interface as a parent network device
(NB: this is using the network device name, which is pretty lame, but
that does the job):

https://github.com/ffainelli/linux/commit/67d1db45d17f8cc3b32d7a46c49d5df736cee56c
-- 
Florian
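
(Archive note.) The notifier approach suggested above can be modelled in
plain userspace C. This is a sketch of the pattern only: every type and
function below is hypothetical and stands in for kernel facilities (in
the kernel the corresponding pieces would be register_netdevice_notifier()
and the NETDEV_REGISTER event). The device name "eth1" is an assumption.
The point is that the shim learns the net device pointer without the NIC
driver knowing the shim exists.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's NETDEV_* events and net_device. */
enum netdev_event { EV_REGISTER, EV_UNREGISTER };

struct netdev_model {
	const char *name;
};

typedef void (*netdev_notifier_fn)(enum netdev_event ev,
				   struct netdev_model *dev);

#define MAX_NOTIFIERS 4
static netdev_notifier_fn notifiers[MAX_NOTIFIERS];
static size_t nr_notifiers;

/* Models notifier registration; returns 0 on success, -1 if the table is full. */
static int register_notifier(netdev_notifier_fn fn)
{
	if (nr_notifiers == MAX_NOTIFIERS)
		return -1;
	notifiers[nr_notifiers++] = fn;
	return 0;
}

/* Called from the "NIC driver" side, which knows nothing about the shim. */
static void announce(enum netdev_event ev, struct netdev_model *dev)
{
	for (size_t i = 0; i < nr_notifiers; i++)
		notifiers[i](ev, dev);
}

/* Shim side: remember the one device that should become the DSA master. */
static const char *wanted_name = "eth1";	/* assumed interface name */
static struct netdev_model *dsa_master;

static void shim_event(enum netdev_event ev, struct netdev_model *dev)
{
	if (strcmp(dev->name, wanted_name) != 0)
		return;

	if (ev == EV_REGISTER)
		dsa_master = dev;	/* a real shim would build its DSA platform data here */
	else if (ev == EV_UNREGISTER)
		dsa_master = NULL;
}
```

Matching on the interface name is the weak point, as noted above. A real
shim would more likely compare the new device's parent against the PCI
device it was probed for, which might also answer the "is this switch on
my card" question; that is a suggestion, not something the linked commit
is known to do.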


Re: [PATCH v2 net-next 4/6] net: switchdev: pass callback to dump operation

2015-09-30 Thread Jiri Pirko
Tue, Sep 29, 2015 at 06:07:16PM CEST, vivien.dide...@savoirfairelinux.com wrote:
>Similar to the notifier_call callback of a notifier_block, change the
>function signature of switchdev dump operation to:
>
>int switchdev_port_obj_dump(struct net_device *dev,
>enum switchdev_obj_id id, void *obj,
>int (*cb)(void *obj));
>
>This allows the caller to pass and expect back a specific
>switchdev_obj_* structure instead of the generic switchdev_obj one.
>
>Drivers implementation of dump operation can now expect this specific
>structure and call the callback with it. Drivers have been changed
>accordingly.
>
>Signed-off-by: Vivien Didelot 
>---
> drivers/net/ethernet/rocker/rocker.c | 21 +
> include/net/switchdev.h  |  9 +---
> net/dsa/slave.c  | 26 +++--
> net/switchdev/switchdev.c| 45 ++--
> 4 files changed, 53 insertions(+), 48 deletions(-)
>
>diff --git a/drivers/net/ethernet/rocker/rocker.c b/drivers/net/ethernet/rocker/rocker.c
>index 78fd443..107adb6 100644
>--- a/drivers/net/ethernet/rocker/rocker.c
>+++ b/drivers/net/ethernet/rocker/rocker.c
>@@ -4538,10 +4538,10 @@ static int rocker_port_obj_del(struct net_device *dev,
> }
> 
> static int rocker_port_fdb_dump(const struct rocker_port *rocker_port,
>-  struct switchdev_obj *obj)
>+  struct switchdev_obj_fdb *fdb,
>+  int (*cb)(void *obj))

We should have some typedef for this.
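
(Archive note.) A minimal sketch of what such a typedef could look like,
written as standalone userspace C. The names obj_dump_cb_t,
fdb_entry_model and model_fdb_dump() are hypothetical and are not taken
from the patch; only the callback shape "int (*cb)(void *obj)" is.

```c
#include <stddef.h>

/* Hypothetical typedef name; the patch spells this type out inline. */
typedef int (*obj_dump_cb_t)(void *obj);

/* Stand-in for a specific object type such as switchdev_obj_fdb. */
struct fdb_entry_model {
	int vid;
};

/* Driver-side dump: hand each entry to the callback, stop on first error. */
static int model_fdb_dump(struct fdb_entry_model *entries, size_t n,
			  obj_dump_cb_t cb)
{
	for (size_t i = 0; i < n; i++) {
		int err = cb(&entries[i]);

		if (err)
			return err;
	}
	return 0;
}

/* Caller-side callback: it knows the concrete type behind the void pointer. */
static int vid_sum;

static int sum_vids(void *obj)
{
	struct fdb_entry_model *entry = obj;

	vid_sum += entry->vid;
	return 0;
}
```

With the typedef in place, each dump function signature carries one
short parameter type instead of repeating the full function pointer
declaration.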


[PATCH v5 01/22] net/xen-netback: xenvif_gop_frag_copy: move GSO check out of the loop

2015-09-30 Thread Julien Grall
The skb doesn't change within the function. Therefore it's only
necessary to check if we need GSO once at the beginning.

Signed-off-by: Julien Grall 
Acked-by: Wei Liu 

---
Cc: Ian Campbell 
Cc: netdev@vger.kernel.org

Changes in v4:
- Add Wei's Acked-by

Changes in v2:
- Patch added
---
 drivers/net/xen-netback/netback.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index ec98d43..c4e6c02 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -288,6 +288,13 @@ static void xenvif_gop_frag_copy(struct xenvif_queue *queue, struct sk_buff *skb
unsigned long bytes;
int gso_type = XEN_NETIF_GSO_TYPE_NONE;
 
+   if (skb_is_gso(skb)) {
+   if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4)
+   gso_type = XEN_NETIF_GSO_TYPE_TCPV4;
+   else if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6)
+   gso_type = XEN_NETIF_GSO_TYPE_TCPV6;
+   }
+
/* Data must not cross a page boundary. */
BUG_ON(size + offset > PAGE_SIZE<gso_type & SKB_GSO_TCPV4)
-   gso_type = XEN_NETIF_GSO_TYPE_TCPV4;
-   else if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6)
-   gso_type = XEN_NETIF_GSO_TYPE_TCPV6;
-   }
-
if (*head && ((1 << gso_type) & queue->vif->gso_mask))
queue->rx.req_cons++;
 
-- 
2.1.4



[PATCH] amd-xgbe: fix potential memory leak in xgbe-debugfs

2015-09-30 Thread Geliang Tang
Added kfree() to avoid the memory leak when debugfs_create_dir() fails.

Signed-off-by: Geliang Tang 
---
 drivers/net/ethernet/amd/xgbe/xgbe-debugfs.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-debugfs.c b/drivers/net/ethernet/amd/xgbe/xgbe-debugfs.c
index 2c063b6..66137ff 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-debugfs.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-debugfs.c
@@ -330,6 +330,7 @@ void xgbe_debugfs_init(struct xgbe_prv_data *pdata)
pdata->xgbe_debugfs = debugfs_create_dir(buf, NULL);
if (!pdata->xgbe_debugfs) {
netdev_err(pdata->netdev, "debugfs_create_dir failed\n");
+   kfree(buf);
return;
}
 
-- 
1.9.1
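
(Archive note.) The fix follows the usual rule that every early return
must release what the function has allocated so far. The userspace
sketch below models that rule, assuming (as the patch implies) that buf
is heap-allocated earlier in the function. All names are hypothetical,
the "amd-xgbe-" prefix is only an example string, and the dir_fails
flag stands in for debugfs_create_dir() returning NULL.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int live_allocs;	/* crude leak counter for the demonstration */

/* Allocate and format a directory name; returns NULL on allocation failure. */
static char *make_name(const char *ifname)
{
	size_t len = strlen("amd-xgbe-") + strlen(ifname) + 1;
	char *buf = malloc(len);

	if (buf) {
		snprintf(buf, len, "amd-xgbe-%s", ifname);
		live_allocs++;
	}
	return buf;
}

static void drop_name(char *buf)
{
	if (buf) {
		free(buf);
		live_allocs--;
	}
}

/* Models the init path: returns 0 on success, -1 on any failure. */
static int debugfs_init_model(const char *ifname, int dir_fails)
{
	char *buf = make_name(ifname);

	if (!buf)
		return -1;

	if (dir_fails) {
		drop_name(buf);	/* the release the patch adds on the error path */
		return -1;
	}

	/* ... files would be created under the directory here ... */

	drop_name(buf);		/* normal path releases the name as well */
	return 0;
}
```

Without the drop_name() call on the failure branch, live_allocs would
stay at 1 after a failed call, which is the leak the patch describes.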




Re: [PATCH] net ipv4: use preferred log methods

2015-09-30 Thread Bastian Stender

Hi,

On 09/30/2015 11:20 AM, Bastian Stender wrote:

Replace printk calls with preferred unconditional log method calls to keep
kernel messages clean.

Signed-off-by: Bastian Stender 
---
  net/ipv4/ipconfig.c| 53 +-
  net/ipv4/netfilter/arp_tables.c| 17 +
  net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c |  2 +-
  net/ipv4/netfilter/nf_nat_snmp_basic.c | 31 ---
  4 files changed, 36 insertions(+), 67 deletions(-)


Please ignore my previous patch. I'll test it again and resubmit it.

Thanks for your patience.

Regards,
Bastian Stender

