Re: [PATCH/RFC v2 net-next 3/4] ravb: Document binding for r8a7795 SoC
Hello. On 09/14/2015 03:42 AM, Simon Horman wrote: Sorry for delayed reply, I thought I'd already replied to this. :-/ From: Kazuya MizuguchiThis patch updates the ravb binding to support the r8a7795 SoC by: - Adding a compat string for the new hardware - Adding 25 named interrupts to binding for the new SoC; older SoCs continue to use a single multiplexed interrupt The example is also updated to reflect the r8a7795 as this is the more complex case. Based on work by Kazuya Mizuguchi and others. Signed-off-by: Simon Horman --- v2 * First post; broken out of a driver update patch * As discussed with Geert Uytterhoeven and Sergei Shtylyov - Binding: Make all interrupts mandatory as named-interrupts of the form ch%u --- .../devicetree/bindings/net/renesas,ravb.txt | 65 +++--- 1 file changed, 58 insertions(+), 7 deletions(-) diff --git a/Documentation/devicetree/bindings/net/renesas,ravb.txt b/Documentation/devicetree/bindings/net/renesas,ravb.txt index 1fd8831437bf..6c360f993d33 100644 --- a/Documentation/devicetree/bindings/net/renesas,ravb.txt +++ b/Documentation/devicetree/bindings/net/renesas,ravb.txt [...] @@ -27,13 +33,46 @@ Optional properties: Example: ethernet@e680 { - compatible = "renesas,etheravb-r8a7790"; - reg = <0 0xe680 0 0x800>, <0 0xee0e8000 0 0x4000>; + compatible = "renesas,etheravb-r8a7795"; + reg = <0 0xe680 0 0x800>, <0 0xe6a0 0 0x1>; interrupt-parent = <>; - interrupts = <0 163 IRQ_TYPE_LEVEL_HIGH>; - clocks = <_clks R8A7790_CLK_ETHERAVB>; - phy-mode = "rmii"; + interrupts = , +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +; + interrupt-names = "ch0", "ch1", "ch2", "ch3", + "ch4", "ch5", "ch6", "ch7", + "ch8", "ch9", "ch10", "ch11", + "ch12", "ch13", "ch14", "ch15", + "ch16", "ch17", "ch18", "ch19", + "ch20", "ch21", "ch22", "ch23", + "ch24"; To me, these names don't look very helpful. You could as well omit them and use platform_get_irq() with the channel #. These names reflect the hardware; which is the aim of DT. 
Indeed (I've looked into the manuals by now). They just look poorly chosen. :-)

> As I believe you pointed out earlier it is preferred to use named
> interrupts when there is more than one. Do I misunderstand the
> situation there?

Yes.

> If you have a positive contribution to make regarding better names
> then I am all ears.

I liked your "tx", "rx" variant better...

MBR, Sergei

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 2/5] seccomp: add the concept of a seccomp filter FD
On Wed, Sep 30, 2015 at 11:27:34AM -0700, Andy Lutomirski wrote:
> On Wed, Sep 30, 2015 at 11:13 AM, Tycho Andersen wrote:
> > This patch introduces the concept of a seccomp fd, with a similar interface
> > and usage to ebpf fds. Initially, one is allowed to create, install, and
> > dump these fds. Any manipulation of seccomp fds requires users to be root
> > in their own user namespace, matching the checks done for
> > SECCOMP_SET_MODE_FILTER.
> >
> > Installing a filterfd has some gotchas, though. Andy mentioned previously
> > that we should restrict installation to filter fds whose parent is already
> > in the filter tree. This doesn't quite work in the case of created seccomp
> > fds, since once you install a filter fd, you can't install any other filter
> > fd since it has no parent and there is no way to "pre-chain" filters before
> > installing them.
>
> ISTM, if we like the seccomp fd approach, we should have them be
> created with a parent already set. IOW the default should be that
> their parent is the creator's seccomp fd and, if needed, creators
> could specify a different parent.

Allowing people doing SECCOMP_FD_NEW to specify a parent fd would work.
Then we can disallow installing a seccomp fd if its parent is not the
current filter, and get rid of the whole mess with prev locking and all
that.

Tycho
[PATCH v2 06/14] RDS: use rds_send_xmit() state instead of RDS_LL_SEND_FULL
In the transport-independent rds_sendmsg(), we shouldn't make decisions
based on RDS_LL_SEND_FULL, which is used to manage the ring for
RDMA-based transports. We can safely issue rds_send_xmit() and then use
its return value to decide on deferred work. This will also fix the
scenario where at times we are seeing connections stuck with the
LL_SEND_FULL bit getting set and never cleared.

We kick krdsd any time we see -ENOMEM or -EAGAIN from the ring
allocation code.

Signed-off-by: Santosh Shilimkar
Signed-off-by: Santosh Shilimkar
---
 net/rds/send.c    | 10 ++
 net/rds/threads.c |  2 ++
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/rds/send.c b/net/rds/send.c
index f1e709c..9d8b52d 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -1122,8 +1122,9 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 	 */
 	rds_stats_inc(s_send_queued);
 
-	if (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags))
-		rds_send_xmit(conn);
+	ret = rds_send_xmit(conn);
+	if (ret == -ENOMEM || ret == -EAGAIN)
+		queue_delayed_work(rds_wq, &conn->c_send_w, 1);
 
 	rds_message_put(rm);
 	return payload_len;
@@ -1179,8 +1180,9 @@ rds_send_pong(struct rds_connection *conn, __be16 dport)
 	rds_stats_inc(s_send_queued);
 	rds_stats_inc(s_send_pong);
 
-	if (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags))
-		queue_delayed_work(rds_wq, &conn->c_send_w, 0);
+	ret = rds_send_xmit(conn);
+	if (ret == -ENOMEM || ret == -EAGAIN)
+		queue_delayed_work(rds_wq, &conn->c_send_w, 1);
 
 	rds_message_put(rm);
 	return 0;
diff --git a/net/rds/threads.c b/net/rds/threads.c
index dc2402e..454aa6d 100644
--- a/net/rds/threads.c
+++ b/net/rds/threads.c
@@ -162,7 +162,9 @@ void rds_send_worker(struct work_struct *work)
 	int ret;
 
 	if (rds_conn_state(conn) == RDS_CONN_UP) {
+		clear_bit(RDS_LL_SEND_FULL, &conn->c_flags);
 		ret = rds_send_xmit(conn);
+		cond_resched();
 		rdsdebug("conn %p ret %d\n", conn, ret);
 		switch (ret) {
 		case -EAGAIN:
-- 
1.9.1
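For readers skimming the archive, the decision this patch introduces can be sketched outside the kernel as follows. This is only a userspace model under stated assumptions: struct send_model and model_defer_on_xmit_result() are hypothetical names invented for illustration, and the counter stands in for the call to queue_delayed_work(). The only part taken from the patch is the condition itself (defer on -ENOMEM or -EAGAIN, otherwise do nothing).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model state; not a kernel structure. */
struct send_model {
	int deferred;	/* times deferred work would have been queued */
};

/*
 * Given the return value of a transmit attempt, decide whether to
 * queue deferred work. Only -ENOMEM and -EAGAIN trigger a deferral,
 * matching the condition added by the patch.
 */
static bool model_defer_on_xmit_result(struct send_model *m, int xmit_ret)
{
	assert(m != NULL);
	if (xmit_ret == -ENOMEM || xmit_ret == -EAGAIN) {
		m->deferred++;
		return true;
	}
	return false;
}
```

The point of the design, as the commit message describes it, is that the caller no longer inspects a ring-specific flag; it reacts only to what the transmit path reports.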
[PATCH v2 07/14] RDS: IB: ack more receive completions to improve performance
For better performance, we split the receive completion IRQ handler. That lets us acknowledge several WCE events in one call. We also limit the WC to max 32 to avoid latency. Acknowledging several completions in one call instead of several calls each time will provide better performance since less mutual exclusion locks are being performed. In next patch, send completion is also split which re-uses the poll_cq() and hence the code is moved to ib_cm.c Signed-off-by: Santosh ShilimkarSigned-off-by: Santosh Shilimkar --- net/rds/ib.h | 28 +-- net/rds/ib_cm.c| 70 ++- net/rds/ib_recv.c | 136 +++-- net/rds/ib_stats.c | 3 +- 4 files changed, 132 insertions(+), 105 deletions(-) diff --git a/net/rds/ib.h b/net/rds/ib.h index f1fd5ff..727759b 100644 --- a/net/rds/ib.h +++ b/net/rds/ib.h @@ -24,6 +24,8 @@ #define RDS_IB_RECYCLE_BATCH_COUNT 32 +#define RDS_IB_WC_MAX 32 + extern struct rw_semaphore rds_ib_devices_lock; extern struct list_head rds_ib_devices; @@ -89,6 +91,20 @@ struct rds_ib_work_ring { atomic_tw_free_ctr; }; +/* Rings are posted with all the allocations they'll need to queue the + * incoming message to the receiving socket so this can't fail. + * All fragments start with a header, so we can make sure we're not receiving + * garbage, and we can tell a small 8 byte fragment from an ACK frame. 
+ */ +struct rds_ib_ack_state { + u64 ack_next; + u64 ack_recv; + unsigned intack_required:1; + unsigned intack_next_valid:1; + unsigned intack_recv_valid:1; +}; + + struct rds_ib_device; struct rds_ib_connection { @@ -102,6 +118,10 @@ struct rds_ib_connection { struct ib_pd*i_pd; struct ib_cq*i_send_cq; struct ib_cq*i_recv_cq; + struct ib_wci_recv_wc[RDS_IB_WC_MAX]; + + /* interrupt handling */ + struct tasklet_struct i_recv_tasklet; /* tx */ struct rds_ib_work_ring i_send_ring; @@ -112,7 +132,6 @@ struct rds_ib_connection { atomic_ti_signaled_sends; /* rx */ - struct tasklet_struct i_recv_tasklet; struct mutexi_recv_mutex; struct rds_ib_work_ring i_recv_ring; struct rds_ib_incoming *i_ibinc; @@ -199,13 +218,14 @@ struct rds_ib_statistics { uint64_ts_ib_connect_raced; uint64_ts_ib_listen_closed_stale; uint64_ts_ib_tx_cq_call; + uint64_ts_ib_evt_handler_call; + uint64_ts_ib_tasklet_call; uint64_ts_ib_tx_cq_event; uint64_ts_ib_tx_ring_full; uint64_ts_ib_tx_throttle; uint64_ts_ib_tx_sg_mapping_failure; uint64_ts_ib_tx_stalled; uint64_ts_ib_tx_credit_updates; - uint64_ts_ib_rx_cq_call; uint64_ts_ib_rx_cq_event; uint64_ts_ib_rx_ring_empty; uint64_ts_ib_rx_refill_from_cq; @@ -324,7 +344,8 @@ void rds_ib_recv_free_caches(struct rds_ib_connection *ic); void rds_ib_recv_refill(struct rds_connection *conn, int prefill, gfp_t gfp); void rds_ib_inc_free(struct rds_incoming *inc); int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to); -void rds_ib_recv_cq_comp_handler(struct ib_cq *cq, void *context); +void rds_ib_recv_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc, +struct rds_ib_ack_state *state); void rds_ib_recv_tasklet_fn(unsigned long data); void rds_ib_recv_init_ring(struct rds_ib_connection *ic); void rds_ib_recv_clear_ring(struct rds_ib_connection *ic); @@ -332,6 +353,7 @@ void rds_ib_recv_init_ack(struct rds_ib_connection *ic); void rds_ib_attempt_ack(struct rds_ib_connection *ic); void rds_ib_ack_send_complete(struct 
rds_ib_connection *ic); u64 rds_ib_piggyb_ack(struct rds_ib_connection *ic); +void rds_ib_set_ack(struct rds_ib_connection *ic, u64 seq, int ack_required); /* ib_ring.c */ void rds_ib_ring_init(struct rds_ib_work_ring *ring, u32 nr); diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index 9043f5c..28e0979 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -216,6 +216,72 @@ static void rds_ib_cq_event_handler(struct ib_event *event, void *data) event->event, ib_event_msg(event->event), data); } +/* Plucking the oldest entry from the ring can be done concurrently with + * the thread refilling the ring. Each ring operation is protected by + * spinlocks and the transient state of refilling doesn't change the + * recording of which entry is oldest. + * + * This relies on IB only calling one cq comp_handler for each cq so that + * there will only be one caller of rds_recv_incoming() per RDS connection. + */ +static void
[PATCH v2 01/14] RDS: use kfree_rcu in rds_ib_remove_ipaddr
synchronize_rcu() unnecessarily slows down the socket shutdown path. It
is used only to kfree() the IP addresses in rds_ib_remove_ipaddr(), which
is a perfect use case for kfree_rcu(), so let's use that to gain some
speedup.

Signed-off-by: Santosh Shilimkar
Signed-off-by: Santosh Shilimkar
---
 net/rds/ib.h      | 1 +
 net/rds/ib_rdma.c | 6 ++
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index aae60fd..f1fd5ff 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -164,6 +164,7 @@ struct rds_ib_connection {
 struct rds_ib_ipaddr {
 	struct list_head	list;
 	__be32			ipaddr;
+	struct rcu_head		rcu;
 };
 
 struct rds_ib_device {
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index 251d1ce..872f523 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -159,10 +159,8 @@ static void rds_ib_remove_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr)
 	}
 	spin_unlock_irq(&rds_ibdev->spinlock);
 
-	if (to_free) {
-		synchronize_rcu();
-		kfree(to_free);
-	}
+	if (to_free)
+		kfree_rcu(to_free, rcu);
 }
 
 int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr)
-- 
1.9.1
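The idea behind replacing a blocking wait with a deferred free can be modelled in userspace roughly as below. This is a sketch, not kernel code: struct model_rcu_head, model_kfree_deferred() and model_grace_period_end() are hypothetical names, and the "grace period" is simply whenever the caller chooses to drain the list. It assumes only what the patch shows, namely that the object embeds a small head so the free can be queued and the caller can return immediately.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the head embedded in the object. */
struct model_rcu_head {
	struct model_rcu_head *next;
	size_t offset;		/* offset of this head inside its object */
};

/* Model of an address entry carrying an embedded head. */
struct ipaddr_entry {
	unsigned int ipaddr;
	struct model_rcu_head rcu;
};

/* Objects waiting to be freed once readers are done. */
static struct model_rcu_head *pending;

/* Queue the object for a later free; the caller does not block. */
static void model_kfree_deferred(struct model_rcu_head *head, size_t offset)
{
	assert(head != NULL);
	head->offset = offset;
	head->next = pending;
	pending = head;
}

/* Called when no reader can still hold a reference; frees everything
 * queued so far and returns how many objects were released. */
static int model_grace_period_end(void)
{
	int freed = 0;

	while (pending) {
		struct model_rcu_head *head = pending;

		pending = head->next;
		free((char *)head - head->offset);
		freed++;
	}
	return freed;
}
```

In this model the remove path costs one list insertion, while the wait for readers is paid later and off the shutdown path, which is the speedup the commit message is after.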
[PATCH v2 10/14] RDS: IB: fix the rds_ib_fmr_wq kick call
The RDS IB MR pool has its own workqueue, 'rds_ib_fmr_wq', so we need to
use queue_delayed_work() to kick the work. This was hurting performance,
since pool maintenance was triggered less often from the other path.

Signed-off-by: Santosh Shilimkar
Signed-off-by: Santosh Shilimkar
---
 net/rds/ib_rdma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index 872f523..b6644fa 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -319,7 +319,7 @@ static struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev)
 	int err = 0, iter = 0;
 
 	if (atomic_read(&pool->dirty_count) >= pool->max_items / 10)
-		schedule_delayed_work(&pool->flush_worker, 10);
+		queue_delayed_work(rds_ib_fmr_wq, &pool->flush_worker, 10);
 
 	while (1) {
 		ibmr = rds_ib_reuse_fmr(pool);
-- 
1.9.1
[PATCH v2 08/14] RDS: IB: split send completion handling and do batch ack
Similar to what we did with receive CQ completion handling, we split the transmit completion handler so that it lets us implement batched work completion handling. We re-use the cq_poll routine and makes use of RDS_IB_SEND_OP to identify the send vs receive completion event handler invocation. Signed-off-by: Santosh ShilimkarSigned-off-by: Santosh Shilimkar --- net/rds/ib.h | 6 ++- net/rds/ib_cm.c| 45 -- net/rds/ib_send.c | 110 + net/rds/ib_stats.c | 1 - net/rds/send.c | 1 + 5 files changed, 98 insertions(+), 65 deletions(-) diff --git a/net/rds/ib.h b/net/rds/ib.h index 727759b..3a8cd31 100644 --- a/net/rds/ib.h +++ b/net/rds/ib.h @@ -25,6 +25,7 @@ #define RDS_IB_RECYCLE_BATCH_COUNT 32 #define RDS_IB_WC_MAX 32 +#define RDS_IB_SEND_OP BIT_ULL(63) extern struct rw_semaphore rds_ib_devices_lock; extern struct list_head rds_ib_devices; @@ -118,9 +119,11 @@ struct rds_ib_connection { struct ib_pd*i_pd; struct ib_cq*i_send_cq; struct ib_cq*i_recv_cq; + struct ib_wci_send_wc[RDS_IB_WC_MAX]; struct ib_wci_recv_wc[RDS_IB_WC_MAX]; /* interrupt handling */ + struct tasklet_struct i_send_tasklet; struct tasklet_struct i_recv_tasklet; /* tx */ @@ -217,7 +220,6 @@ struct rds_ib_device { struct rds_ib_statistics { uint64_ts_ib_connect_raced; uint64_ts_ib_listen_closed_stale; - uint64_ts_ib_tx_cq_call; uint64_ts_ib_evt_handler_call; uint64_ts_ib_tasklet_call; uint64_ts_ib_tx_cq_event; @@ -371,7 +373,7 @@ extern wait_queue_head_t rds_ib_ring_empty_wait; void rds_ib_xmit_complete(struct rds_connection *conn); int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm, unsigned int hdr_off, unsigned int sg, unsigned int off); -void rds_ib_send_cq_comp_handler(struct ib_cq *cq, void *context); +void rds_ib_send_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc); void rds_ib_send_init_ring(struct rds_ib_connection *ic); void rds_ib_send_clear_ring(struct rds_ib_connection *ic); int rds_ib_xmit_rdma(struct rds_connection *conn, struct rm_rdma_op *op); diff --git 
a/net/rds/ib_cm.c b/net/rds/ib_cm.c index 28e0979..8f51d0d 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -250,11 +250,34 @@ static void poll_cq(struct rds_ib_connection *ic, struct ib_cq *cq, rdsdebug("wc wr_id 0x%llx status %u byte_len %u imm_data %u\n", (unsigned long long)wc->wr_id, wc->status, wc->byte_len, be32_to_cpu(wc->ex.imm_data)); - rds_ib_recv_cqe_handler(ic, wc, ack_state); + + if (wc->wr_id & RDS_IB_SEND_OP) + rds_ib_send_cqe_handler(ic, wc); + else + rds_ib_recv_cqe_handler(ic, wc, ack_state); } } } +static void rds_ib_tasklet_fn_send(unsigned long data) +{ + struct rds_ib_connection *ic = (struct rds_ib_connection *)data; + struct rds_connection *conn = ic->conn; + struct rds_ib_ack_state state; + + rds_ib_stats_inc(s_ib_tasklet_call); + + memset(, 0, sizeof(state)); + poll_cq(ic, ic->i_send_cq, ic->i_send_wc, ); + ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP); + poll_cq(ic, ic->i_send_cq, ic->i_send_wc, ); + + if (rds_conn_up(conn) && + (!test_bit(RDS_LL_SEND_FULL, >c_flags) || + test_bit(0, >c_map_queued))) + rds_send_xmit(ic->conn); +} + static void rds_ib_tasklet_fn_recv(unsigned long data) { struct rds_ib_connection *ic = (struct rds_ib_connection *)data; @@ -304,6 +327,18 @@ static void rds_ib_qp_event_handler(struct ib_event *event, void *data) } } +static void rds_ib_cq_comp_handler_send(struct ib_cq *cq, void *context) +{ + struct rds_connection *conn = context; + struct rds_ib_connection *ic = conn->c_transport_data; + + rdsdebug("conn %p cq %p\n", conn, cq); + + rds_ib_stats_inc(s_ib_evt_handler_call); + + tasklet_schedule(>i_send_tasklet); +} + /* * This needs to be very careful to not leave IS_ERR pointers around for * cleanup to trip over. 
@@ -337,7 +372,8 @@ static int rds_ib_setup_qp(struct rds_connection *conn) ic->i_pd = rds_ibdev->pd; cq_attr.cqe = ic->i_send_ring.w_nr + 1; - ic->i_send_cq = ib_create_cq(dev, rds_ib_send_cq_comp_handler, + + ic->i_send_cq = ib_create_cq(dev, rds_ib_cq_comp_handler_send, rds_ib_cq_event_handler, conn, _attr); if (IS_ERR(ic->i_send_cq))
[PATCH v2 04/14] RDS: Use per-bucket rw lock for bind hash-table
One global lock protecting hash-tables with 1024 buckets isn't efficient and it shows up in a massive systems with truck loads of RDS sockets serving multiple databases. The perf data clearly highlights the contention on the rw lock in these massive workloads. When the contention gets worse, the code gets into a state where it decides to back off on the lock. So while it has disabled interrupts, it sits and backs off on this lock get. This causes the system to become sluggish and eventually all sorts of bad things happen. The simple fix is to move the lock into the hash bucket and use per-bucket lock to improve the scalability. Signed-off-by: Santosh ShilimkarSigned-off-by: Santosh Shilimkar --- net/rds/af_rds.c | 2 ++ net/rds/bind.c | 47 --- net/rds/rds.h| 1 + 3 files changed, 35 insertions(+), 15 deletions(-) diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c index dc08766..384ea1e 100644 --- a/net/rds/af_rds.c +++ b/net/rds/af_rds.c @@ -582,6 +582,8 @@ static int rds_init(void) { int ret; + rds_bind_lock_init(); + ret = rds_conn_init(); if (ret) goto out; diff --git a/net/rds/bind.c b/net/rds/bind.c index 166c605..bc6b93e 100644 --- a/net/rds/bind.c +++ b/net/rds/bind.c @@ -38,22 +38,27 @@ #include #include "rds.h" +struct bind_bucket { + rwlock_tlock; + struct hlist_head head; +}; + #define BIND_HASH_SIZE 1024 -static struct hlist_head bind_hash_table[BIND_HASH_SIZE]; -static DEFINE_RWLOCK(rds_bind_lock); +static struct bind_bucket bind_hash_table[BIND_HASH_SIZE]; -static struct hlist_head *hash_to_bucket(__be32 addr, __be16 port) +static struct bind_bucket *hash_to_bucket(__be32 addr, __be16 port) { return bind_hash_table + (jhash_2words((u32)addr, (u32)port, 0) & (BIND_HASH_SIZE - 1)); } /* must hold either read or write lock (write lock for insert != NULL) */ -static struct rds_sock *rds_bind_lookup(__be32 addr, __be16 port, +static struct rds_sock *rds_bind_lookup(struct bind_bucket *bucket, + __be32 addr, __be16 port, struct rds_sock *insert) { struct 
rds_sock *rs; - struct hlist_head *head = hash_to_bucket(addr, port); + struct hlist_head *head = >head; u64 cmp; u64 needle = ((u64)be32_to_cpu(addr) << 32) | be16_to_cpu(port); @@ -91,10 +96,11 @@ struct rds_sock *rds_find_bound(__be32 addr, __be16 port) { struct rds_sock *rs; unsigned long flags; + struct bind_bucket *bucket = hash_to_bucket(addr, port); - read_lock_irqsave(_bind_lock, flags); - rs = rds_bind_lookup(addr, port, NULL); - read_unlock_irqrestore(_bind_lock, flags); + read_lock_irqsave(>lock, flags); + rs = rds_bind_lookup(bucket, addr, port, NULL); + read_unlock_irqrestore(>lock, flags); if (rs && sock_flag(rds_rs_to_sk(rs), SOCK_DEAD)) { rds_sock_put(rs); @@ -113,6 +119,7 @@ static int rds_add_bound(struct rds_sock *rs, __be32 addr, __be16 *port) unsigned long flags; int ret = -EADDRINUSE; u16 rover, last; + struct bind_bucket *bucket; if (*port != 0) { rover = be16_to_cpu(*port); @@ -122,13 +129,15 @@ static int rds_add_bound(struct rds_sock *rs, __be32 addr, __be16 *port) last = rover - 1; } - write_lock_irqsave(_bind_lock, flags); - do { struct rds_sock *rrs; if (rover == 0) rover++; - rrs = rds_bind_lookup(addr, cpu_to_be16(rover), rs); + + bucket = hash_to_bucket(addr, cpu_to_be16(rover)); + write_lock_irqsave(>lock, flags); + rrs = rds_bind_lookup(bucket, addr, cpu_to_be16(rover), rs); + write_unlock_irqrestore(>lock, flags); if (!rrs) { *port = rs->rs_bound_port; ret = 0; @@ -140,16 +149,16 @@ static int rds_add_bound(struct rds_sock *rs, __be32 addr, __be16 *port) } } while (rover++ != last); - write_unlock_irqrestore(_bind_lock, flags); - return ret; } void rds_remove_bound(struct rds_sock *rs) { unsigned long flags; + struct bind_bucket *bucket = + hash_to_bucket(rs->rs_bound_addr, rs->rs_bound_port); - write_lock_irqsave(_bind_lock, flags); + write_lock_irqsave(>lock, flags); if (rs->rs_bound_addr) { rdsdebug("rs %p unbinding from %pI4:%d\n", @@ -161,7 +170,7 @@ void rds_remove_bound(struct rds_sock *rs) rs->rs_bound_addr = 0; } - 
write_unlock_irqrestore(_bind_lock, flags); + write_unlock_irqrestore(>lock, flags); } int rds_bind(struct
Re: [patch net-next] switchdev: bring back switchdev_obj and use it as a generic object param
Hi Jiri, On Sep. Wednesday 30 (40) 06:00 PM, Jiri Pirko wrote: > From: Jiri Pirko> > Replace "void *obj" with a generic structure. Introduce couple of > helpers along that. > > Signed-off-by: Jiri Pirko > --- > drivers/net/ethernet/rocker/rocker.c | 41 +-- > include/net/switchdev.h | 42 > ++-- > net/bridge/br_fdb.c | 2 +- > net/bridge/br_vlan.c | 6 -- > net/dsa/slave.c | 35 ++ > net/switchdev/switchdev.c| 40 ++ > 6 files changed, 104 insertions(+), 62 deletions(-) > > diff --git a/drivers/net/ethernet/rocker/rocker.c > b/drivers/net/ethernet/rocker/rocker.c > index 9773f5b..1236835 100644 > --- a/drivers/net/ethernet/rocker/rocker.c > +++ b/drivers/net/ethernet/rocker/rocker.c > @@ -4437,7 +4437,8 @@ static int rocker_port_fdb_add(struct rocker_port > *rocker_port, > } > > static int rocker_port_obj_add(struct net_device *dev, > -enum switchdev_obj_id id, const void *obj, > +enum switchdev_obj_id id, > +const struct switchdev_obj *obj, > struct switchdev_trans *trans) > { > struct rocker_port *rocker_port = netdev_priv(dev); > @@ -4446,16 +4447,18 @@ static int rocker_port_obj_add(struct net_device *dev, > > switch (id) { > case SWITCHDEV_OBJ_PORT_VLAN: > - err = rocker_port_vlans_add(rocker_port, trans, obj); > + err = rocker_port_vlans_add(rocker_port, trans, > + SWITCHDEV_OBJ_VLAN(obj)); > break; > case SWITCHDEV_OBJ_IPV4_FIB: > - fib4 = obj; > + fib4 = SWITCHDEV_OBJ_IPV4_FIB(obj); > err = rocker_port_fib_ipv4(rocker_port, trans, > htonl(fib4->dst), fib4->dst_len, > fib4->fi, fib4->tb_id, 0); > break; > case SWITCHDEV_OBJ_PORT_FDB: > - err = rocker_port_fdb_add(rocker_port, trans, obj); > + err = rocker_port_fdb_add(rocker_port, trans, > + SWITCHDEV_OBJ_FDB(obj)); > break; > default: > err = -EOPNOTSUPP; > @@ -4508,7 +4511,8 @@ static int rocker_port_fdb_del(struct rocker_port > *rocker_port, > } > > static int rocker_port_obj_del(struct net_device *dev, > -enum switchdev_obj_id id, const void *obj) > +enum switchdev_obj_id id, > +const struct switchdev_obj 
*obj) > { > struct rocker_port *rocker_port = netdev_priv(dev); > const struct switchdev_obj_ipv4_fib *fib4; > @@ -4516,17 +4520,19 @@ static int rocker_port_obj_del(struct net_device *dev, > > switch (id) { > case SWITCHDEV_OBJ_PORT_VLAN: > - err = rocker_port_vlans_del(rocker_port, obj); > + err = rocker_port_vlans_del(rocker_port, > + SWITCHDEV_OBJ_VLAN(obj)); > break; > case SWITCHDEV_OBJ_IPV4_FIB: > - fib4 = obj; > + fib4 = SWITCHDEV_OBJ_IPV4_FIB(obj); > err = rocker_port_fib_ipv4(rocker_port, NULL, > htonl(fib4->dst), fib4->dst_len, > fib4->fi, fib4->tb_id, > ROCKER_OP_FLAG_REMOVE); > break; > case SWITCHDEV_OBJ_PORT_FDB: > - err = rocker_port_fdb_del(rocker_port, NULL, obj); > + err = rocker_port_fdb_del(rocker_port, NULL, > + SWITCHDEV_OBJ_FDB(obj)); > break; > default: > err = -EOPNOTSUPP; > @@ -4538,7 +4544,7 @@ static int rocker_port_obj_del(struct net_device *dev, > > static int rocker_port_fdb_dump(const struct rocker_port *rocker_port, > struct switchdev_obj_fdb *fdb, > - int (*cb)(void *obj)) > + switchdev_obj_dump_cb_t *cb) > { > struct rocker *rocker = rocker_port->rocker; > struct rocker_fdb_tbl_entry *found; > @@ -4555,7 +4561,7 @@ static int rocker_port_fdb_dump(const struct > rocker_port *rocker_port, > fdb->ndm_state = NUD_REACHABLE; > fdb->vid = rocker_port_vlan_to_vid(rocker_port, > found->key.vlan_id); > - err = cb(fdb); > + err = cb(>obj); > if (err) > break; > } > @@ -4566,7 +4572,7 @@ static int rocker_port_fdb_dump(const struct > rocker_port *rocker_port, > > static int rocker_port_vlan_dump(const
checkpoint/restore of seccomp filters v3
Hi all,

Here's a re-worked set for c/r of seccomp filters which keeps around the
original bpf program passed to the kernel instead of trying to dump the
ebpf version. There are various comments/questions in the individual
patch notes.

I'm not sure this needs to go via net-next any more, as the impact in
net/ is fairly minimal, and it seems more seccomp heavy. As such, this
set is based on seccomp/tip.

Thoughts welcome,

Tycho

P.S. Man page patches to come once we agree on the API :)
[RFT v3] geneve: implement support for IPv6-based tunnels
Signed-off-by: John W. Linville--- v3: - declare geneve_remote_unspec as static v2: - do not require remote address for tx on metadata tunnels - pass correct sockaddr family to udp_tun_rx_dst in geneve_rx - accommodate both ipv4 and ipv6 sockets open on same tunnel - move declaration of geneve_get_dst for aesthetic purposes drivers/net/geneve.c | 430 --- include/uapi/linux/if_link.h | 1 + 2 files changed, 368 insertions(+), 63 deletions(-) diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c index 8f5c02eed47d..291d3d7754a8 100644 --- a/drivers/net/geneve.c +++ b/drivers/net/geneve.c @@ -46,16 +46,28 @@ struct geneve_net { static int geneve_net_id; +union geneve_addr { + struct sockaddr_in sin; + struct sockaddr_in6 sin6; + struct sockaddr sa; +}; + +static union geneve_addr geneve_remote_unspec = { .sa.sa_family = AF_UNSPEC, }; + +#define GENEVE_F_IPV6 0x0001 + /* Pseudo network device */ struct geneve_dev { struct hlist_node hlist; /* vni hash table */ struct net *net;/* netns for packet i/o */ struct net_device *dev;/* netdev for geneve tunnel */ - struct geneve_sock *sock; /* socket used for geneve tunnel */ + struct geneve_sock *sock4; /* IPv4 socket used for geneve tunnel */ + struct geneve_sock *sock6; /* IPv6 socket used for geneve tunnel */ u8 vni[3]; /* virtual network ID for tunnel */ u8 ttl; /* TTL override */ u8 tos; /* TOS override */ - struct sockaddr_in remote; /* IPv4 address for link partner */ + u32flags; /* GENEVE_F_* above */ + union geneve_addr remote; /* IP address for link partner */ struct list_head next;/* geneve's per namespace list */ __be16 dst_port; bool collect_md; @@ -103,11 +115,32 @@ static struct geneve_dev *geneve_lookup(struct geneve_sock *gs, vni_list_head = >vni_list[hash]; hlist_for_each_entry_rcu(geneve, vni_list_head, hlist) { if (!memcmp(vni, geneve->vni, sizeof(geneve->vni)) && - addr == geneve->remote.sin_addr.s_addr) + addr == geneve->remote.sin.sin_addr.s_addr) + return geneve; + } + return NULL; +} + +#if 
IS_ENABLED(CONFIG_IPV6) +static struct geneve_dev *geneve6_lookup(struct geneve_sock *gs, +struct in6_addr addr6, u8 vni[]) +{ + struct hlist_head *vni_list_head; + struct geneve_dev *geneve; + __u32 hash; + + /* Find the device for this VNI */ + hash = geneve_net_vni_hash(vni); + vni_list_head = >vni_list[hash]; + hlist_for_each_entry_rcu(geneve, vni_list_head, hlist) { + if (!memcmp(vni, geneve->vni, sizeof(geneve->vni)) && + !memcmp(, >remote.sin6.sin6_addr, + sizeof(addr6))) return geneve; } return NULL; } +#endif static inline struct genevehdr *geneve_hdr(const struct sk_buff *skb) { @@ -121,24 +154,47 @@ static void geneve_rx(struct geneve_sock *gs, struct sk_buff *skb) struct metadata_dst *tun_dst = NULL; struct geneve_dev *geneve = NULL; struct pcpu_sw_netstats *stats; - struct iphdr *iph; - u8 *vni; + struct iphdr *iph = NULL; __be32 addr; - int err; + static u8 zero_vni[3]; + u8 *vni; + int err = 0; + sa_family_t sa_family; +#if IS_ENABLED(CONFIG_IPV6) + struct ipv6hdr *ip6h = NULL; + struct in6_addr addr6; + static struct in6_addr zero_addr6; +#endif + + sa_family = gs->sock->sk->sk_family; - iph = ip_hdr(skb); /* outer IP header... */ + if (sa_family == AF_INET) { + iph = ip_hdr(skb); /* outer IP header... */ - if (gs->collect_md) { - static u8 zero_vni[3]; + if (gs->collect_md) { + vni = zero_vni; + addr = 0; + } else { + vni = gnvh->vni; - vni = zero_vni; - addr = 0; - } else { - vni = gnvh->vni; - addr = iph->saddr; - } + addr = iph->saddr; + } + + geneve = geneve_lookup(gs, addr, vni); + } else if (sa_family == AF_INET6) { + ip6h = ipv6_hdr(skb); /* outer IPv6 header... */ + + if (gs->collect_md) { + vni = zero_vni; + addr6 = zero_addr6; + } else { + vni = gnvh->vni; + + addr6 = ip6h->saddr; + } - geneve = geneve_lookup(gs, addr,
Re: [PATCH v3 2/5] seccomp: add the concept of a seccomp filter FD
On Wed, Sep 30, 2015 at 11:36 AM, Tycho Andersen wrote:
> On Wed, Sep 30, 2015 at 11:27:34AM -0700, Andy Lutomirski wrote:
>> On Wed, Sep 30, 2015 at 11:13 AM, Tycho Andersen wrote:
>> > This patch introduces the concept of a seccomp fd, with a similar interface
>> > and usage to ebpf fds. Initially, one is allowed to create, install, and
>> > dump these fds. Any manipulation of seccomp fds requires users to be root
>> > in their own user namespace, matching the checks done for
>> > SECCOMP_SET_MODE_FILTER.
>> >
>> > Installing a filterfd has some gotchas, though. Andy mentioned previously
>> > that we should restrict installation to filter fds whose parent is already
>> > in the filter tree. This doesn't quite work in the case of created seccomp
>> > fds, since once you install a filter fd, you can't install any other filter
>> > fd since it has no parent and there is no way to "pre-chain" filters before
>> > installing them.
>>
>> ISTM, if we like the seccomp fd approach, we should have them be
>> created with a parent already set. IOW the default should be that
>> their parent is the creator's seccomp fd and, if needed, creators
>> could specify a different parent.
>
> Allowing people doing SECCOMP_FD_NEW to specify a parent fd would
> work. Then we can disallow installing a seccomp fd if its parent is
> not the current filter, and get rid of the whole mess with prev
> locking and all that.
>

Yes, please.

--Andy
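The install rule agreed on in this exchange (a seccomp fd may only be installed if its parent is the task's current filter) can be illustrated with a small userspace model. Everything here is hypothetical: struct model_filter, struct model_task and model_install() are invented for this sketch, and returning -EINVAL on a mismatch is an assumption, since the thread does not say which error the real interface would use.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical filter node; parent is fixed when the filter is created. */
struct model_filter {
	struct model_filter *parent;
};

/* Hypothetical task holding the tip of its installed filter chain. */
struct model_task {
	struct model_filter *current_filter;
};

/*
 * Install f only if it was created on top of the filter the task is
 * running right now. The error code is an assumption of this sketch.
 */
static int model_install(struct model_task *t, struct model_filter *f)
{
	assert(t != NULL && f != NULL);
	if (f->parent != t->current_filter)
		return -EINVAL;
	t->current_filter = f;
	return 0;
}
```

Because the parent is set at creation time, filters can be chained before installation, and the install step reduces to one pointer comparison with no need to walk or lock the existing chain.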
[PATCH v2 14/14] RDS: IB: split mr pool to improve 8K messages performance
8K message sizes are a pretty important use case for current RDS
workloads, so we make provision to have 8K MRs available from the pool.
Based on the number of SGs in the RDS message, we pick a pool to use.

Also, to make sure that we don't underutilise MRs when, say, 8K messages
are dominating, which could lead to the 8K pool being exhausted, we fall
back to the 1M pool until the 8K pool recovers for use.

This helps to at least push ~55 kB/s bidirectional data which is
a nice improvement.

Signed-off-by: Santosh Shilimkar
Signed-off-by: Santosh Shilimkar
---
 net/rds/ib.c       |  47 +
 net/rds/ib.h       |  43 ---
 net/rds/ib_rdma.c  | 101 +
 net/rds/ib_stats.c |  18 ++
 4 files changed, 147 insertions(+), 62 deletions(-)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index 883813a..a833ab7 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -43,14 +43,14 @@
 #include "rds.h"
 #include "ib.h"
 
-static unsigned int fmr_pool_size = RDS_FMR_POOL_SIZE;
-unsigned int fmr_message_size = RDS_FMR_SIZE + 1; /* +1 allows for unaligned MRs */
+unsigned int rds_ib_fmr_1m_pool_size = RDS_FMR_1M_POOL_SIZE;
+unsigned int rds_ib_fmr_8k_pool_size = RDS_FMR_8K_POOL_SIZE;
 unsigned int rds_ib_retry_count = RDS_IB_DEFAULT_RETRY_COUNT;
 
-module_param(fmr_pool_size, int, 0444);
-MODULE_PARM_DESC(fmr_pool_size, " Max number of fmr per HCA");
-module_param(fmr_message_size, int, 0444);
-MODULE_PARM_DESC(fmr_message_size, " Max size of a RDMA transfer");
+module_param(rds_ib_fmr_1m_pool_size, int, 0444);
+MODULE_PARM_DESC(rds_ib_fmr_1m_pool_size, " Max number of 1M fmr per HCA");
+module_param(rds_ib_fmr_8k_pool_size, int, 0444);
+MODULE_PARM_DESC(rds_ib_fmr_8k_pool_size, " Max number of 8K fmr per HCA");
 module_param(rds_ib_retry_count, int, 0444);
 MODULE_PARM_DESC(rds_ib_retry_count, " Number of hw retries before reporting an error");
 
@@ -97,8 +97,10 @@ static void rds_ib_dev_free(struct work_struct *work)
 	struct rds_ib_device *rds_ibdev = container_of(work,
 					struct rds_ib_device, free_work);
 
-	if (rds_ibdev->mr_pool)
-		rds_ib_destroy_mr_pool(rds_ibdev->mr_pool);
+	if (rds_ibdev->mr_8k_pool)
+		rds_ib_destroy_mr_pool(rds_ibdev->mr_8k_pool);
+	if (rds_ibdev->mr_1m_pool)
+		rds_ib_destroy_mr_pool(rds_ibdev->mr_1m_pool);
 	if (rds_ibdev->pd)
 		ib_dealloc_pd(rds_ibdev->pd);
 
@@ -148,9 +150,13 @@ static void rds_ib_add_one(struct ib_device *device)
 	rds_ibdev->max_sge = min(dev_attr->max_sge, RDS_IB_MAX_SGE);
 
 	rds_ibdev->fmr_max_remaps = dev_attr->max_map_per_fmr?: 32;
-	rds_ibdev->max_fmrs = dev_attr->max_mr ?
-			min_t(unsigned int, dev_attr->max_mr, fmr_pool_size) :
-			fmr_pool_size;
+	rds_ibdev->max_1m_fmrs = dev_attr->max_mr ?
+		min_t(unsigned int, (dev_attr->max_mr / 2),
+		rds_ib_fmr_1m_pool_size) : rds_ib_fmr_1m_pool_size;
+
+	rds_ibdev->max_8k_fmrs = dev_attr->max_mr ?
+		min_t(unsigned int, ((dev_attr->max_mr / 2) * RDS_MR_8K_SCALE),
+		rds_ib_fmr_8k_pool_size) : rds_ib_fmr_8k_pool_size;
 
 	rds_ibdev->max_initiator_depth = dev_attr->max_qp_init_rd_atom;
 	rds_ibdev->max_responder_resources = dev_attr->max_qp_rd_atom;
@@ -162,12 +168,25 @@ static void rds_ib_add_one(struct ib_device *device)
 		goto put_dev;
 	}
 
-	rds_ibdev->mr_pool = rds_ib_create_mr_pool(rds_ibdev);
-	if (IS_ERR(rds_ibdev->mr_pool)) {
-		rds_ibdev->mr_pool = NULL;
+	rds_ibdev->mr_1m_pool =
+		rds_ib_create_mr_pool(rds_ibdev, RDS_IB_MR_1M_POOL);
+	if (IS_ERR(rds_ibdev->mr_1m_pool)) {
+		rds_ibdev->mr_1m_pool = NULL;
 		goto put_dev;
 	}
 
+	rds_ibdev->mr_8k_pool =
+		rds_ib_create_mr_pool(rds_ibdev, RDS_IB_MR_8K_POOL);
+	if (IS_ERR(rds_ibdev->mr_8k_pool)) {
+		rds_ibdev->mr_8k_pool = NULL;
+		goto put_dev;
+	}
+
+	rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, fmr_max_remaps = %d, max_1m_fmrs = %d, max_8k_fmrs = %d\n",
+		 dev_attr->max_fmr, rds_ibdev->max_wrs, rds_ibdev->max_sge,
+		 rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_fmrs,
+		 rds_ibdev->max_8k_fmrs);
+
 	INIT_LIST_HEAD(&rds_ibdev->ipaddr_list);
 	INIT_LIST_HEAD(&rds_ibdev->conn_list);
 
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 3a8cd31..f17d095 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -9,8 +9,11 @@
 #include "rds.h"
 #include "rdma_transport.h"
 
-#define RDS_FMR_SIZE 256
-#define RDS_FMR_POOL_SIZE 8192
+#define RDS_FMR_1M_POOL_SIZE (8192 / 2)
+#define RDS_FMR_1M_MSG_SIZE 256
+#define RDS_FMR_8K_MSG_SIZE
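The pool choice described in the commit message could be sketched in userspace C roughly as below. This is only an illustration under assumptions: `struct mr_pool`, `pick_pool()` and the threshold `SG_8K_MAX` are hypothetical names and values, not taken from the patch, whose ib_rdma.c changes are not shown above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical threshold: messages with at most this many SG entries
 * are treated as "8K" messages. Not a value from the patch. */
#define SG_8K_MAX 2

/* Hypothetical stand-in for an MR pool: just a counter of free MRs. */
struct mr_pool {
	const char *name;
	unsigned int free_mrs;
};

/*
 * Pick a pool by SG count. Small messages prefer the 8K pool; when it
 * is exhausted we fall back to the 1M pool so that MRs are not left
 * idle. Returns NULL when no MR is available at all.
 */
static struct mr_pool *pick_pool(struct mr_pool *pool_8k,
				 struct mr_pool *pool_1m,
				 unsigned int nr_sg)
{
	assert(pool_8k && pool_1m);

	if (nr_sg <= SG_8K_MAX && pool_8k->free_mrs > 0)
		return pool_8k;
	if (pool_1m->free_mrs > 0)
		return pool_1m;
	return NULL;
}
```

A caller would take an MR from whichever pool is returned and return it to that same pool afterwards, so the 8K pool can recover on its own.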
[PATCH net-next 4/5] bridge: vlan: fix possible null ptr derefs on port init and deinit
From: Nikolay Aleksandrov

When a new port is being added, we need to make vlgrp available only
after the rhashtable has been initialized. When removing a port, we need
to flush the vlans and free the resources only after we're sure no one
can use the port, i.e. after it's removed from the port list and
synchronize_rcu() has been executed.

Signed-off-by: Nikolay Aleksandrov
---
 net/bridge/br_if.c   |  3 ++-
 net/bridge/br_vlan.c | 16 ++--
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 45e4757c6fd2..934cae9fa317 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -248,7 +248,6 @@ static void del_nbp(struct net_bridge_port *p)
 
 	list_del_rcu(&p->list);
 
-	nbp_vlan_flush(p);
 	br_fdb_delete_by_port(br, p, 0, 1);
 	nbp_update_port_count(br);
 
@@ -257,6 +256,8 @@ static void del_nbp(struct net_bridge_port *p)
 	dev->priv_flags &= ~IFF_BRIDGE_PORT;
 
 	netdev_rx_handler_unregister(dev);
+	/* use the synchronize_rcu done by netdev_rx_handler_unregister */
+	nbp_vlan_flush(p);
 
 	br_multicast_del_port(p);
 
diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index 90ac4b0c55c1..7e9d60a402e2 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -854,16 +854,20 @@ err_rhtbl:
 
 int nbp_vlan_init(struct net_bridge_port *p)
 {
+	struct net_bridge_vlan_group *vg;
 	int ret = -ENOMEM;
 
-	p->vlgrp = kzalloc(sizeof(struct net_bridge_vlan_group), GFP_KERNEL);
-	if (!p->vlgrp)
+	vg = kzalloc(sizeof(struct net_bridge_vlan_group), GFP_KERNEL);
+	if (!vg)
 		goto out;
 
-	ret = rhashtable_init(&p->vlgrp->vlan_hash, &br_vlan_rht_params);
+	ret = rhashtable_init(&vg->vlan_hash, &br_vlan_rht_params);
 	if (ret)
 		goto err_rhtbl;
-	INIT_LIST_HEAD(&p->vlgrp->vlan_list);
+	INIT_LIST_HEAD(&vg->vlan_list);
+	/* Make sure everything's committed before publishing vg */
+	smp_wmb();
+	p->vlgrp = vg;
 	if (p->br->default_pvid) {
 		ret = nbp_vlan_add(p, p->br->default_pvid,
 				   BRIDGE_VLAN_INFO_PVID |
@@ -875,9 +879,9 @@ out:
 	return ret;
 
 err_vlan_add:
-	rhashtable_destroy(&p->vlgrp->vlan_hash);
+	rhashtable_destroy(&vg->vlan_hash);
 err_rhtbl:
-	kfree(p->vlgrp);
+	kfree(vg);
 
 	goto out;
 }
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
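The "initialize privately, then publish" ordering that nbp_vlan_init() adopts can be approximated outside the kernel with C11 atomics. This is a rough userspace analogy, not the kernel code: `struct vlan_group`, `struct port`, `port_vlan_init()` and `port_num_vlans()` are hypothetical, and a release store stands in only loosely for smp_wmb() followed by a plain pointer store.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-ins for net_bridge_vlan_group and the port. */
struct vlan_group {
	int table_ready;	/* set once the "hash table" is initialized */
	int num_vlans;
};

struct port {
	_Atomic(struct vlan_group *) vlgrp;	/* NULL until published */
};

/* Writer: fully initialize a private object, then publish it. */
static int port_vlan_init(struct port *p)
{
	struct vlan_group *vg = calloc(1, sizeof(*vg));

	if (!vg)
		return -1;
	vg->table_ready = 1;
	/* Release store: earlier writes to *vg are visible to any reader
	 * that observes the new pointer with an acquire load. */
	atomic_store_explicit(&p->vlgrp, vg, memory_order_release);
	return 0;
}

/* Reader: tolerate a port whose group has not been published yet. */
static int port_num_vlans(struct port *p)
{
	struct vlan_group *vg =
		atomic_load_explicit(&p->vlgrp, memory_order_acquire);

	if (!vg)
		return 0;
	assert(vg->table_ready);	/* never see a half-built group */
	return vg->num_vlans;
}
```

The point of the local `vg` is the same as in the patch: error paths free the private pointer, and readers can only ever see NULL or a completely initialized group.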
[PATCH v3 1/5] seccomp: save the original filter
In order to implement checkpoint of seccomp filters, we need to keep
track of the original filter as the user gave it to us. Since we're
doing this, we also need to use bpf_prog_destroy to free the struct
bpf_progs so we don't leak this memory.

Signed-off-by: Tycho Andersen
CC: Kees Cook
CC: Will Drewry
CC: Oleg Nesterov
CC: Andy Lutomirski
CC: Pavel Emelyanov
CC: Serge E. Hallyn
CC: Alexei Starovoitov
CC: Daniel Borkmann
---
 include/linux/filter.h |  2 ++
 kernel/seccomp.c       | 24
 net/core/filter.c      |  4 ++--
 3 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index fa2cab9..6c045ba 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -410,6 +410,8 @@ int bpf_prog_create(struct bpf_prog **pfp, struct sock_fprog_kern *fprog);
 int bpf_prog_create_from_user(struct bpf_prog **pfp, struct sock_fprog *fprog,
 			      bpf_aux_classic_check_t trans);
 void bpf_prog_destroy(struct bpf_prog *fp);
+int bpf_prog_store_orig_filter(struct bpf_prog *fp,
+			       const struct sock_fprog *fprog);
 
 int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
 int sk_attach_bpf(u32 ufd, struct sock *sk);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 5bd4779..09f3769 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -337,6 +337,14 @@ static inline void seccomp_sync_threads(void)
 	}
 }
 
+static inline void seccomp_filter_free(struct seccomp_filter *filter)
+{
+	if (filter) {
+		bpf_prog_destroy(filter->prog);
+		kfree(filter);
+	}
+}
+
 /**
  * seccomp_prepare_filter: Prepares a seccomp filter for use.
  * @fprog: BPF program to install
@@ -376,6 +384,14 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 		return ERR_PTR(ret);
 	}
 
+	if (config_enabled(CONFIG_CHECKPOINT_RESTORE)) {
+		ret = bpf_prog_store_orig_filter(sfilter->prog, fprog);
+		if (ret < 0) {
+			seccomp_filter_free(sfilter);
+			return ERR_PTR(ret);
+		}
+	}
+
 	atomic_set(&sfilter->usage, 1);
 
 	return sfilter;
@@ -466,14 +482,6 @@ void get_seccomp_filter(struct task_struct *tsk)
 	atomic_inc(&orig->usage);
 }
 
-static inline void seccomp_filter_free(struct seccomp_filter *filter)
-{
-	if (filter) {
-		bpf_prog_free(filter->prog);
-		kfree(filter);
-	}
-}
-
 /* put_seccomp_filter - decrements the ref count of tsk->seccomp.filter */
 void put_seccomp_filter(struct task_struct *tsk)
 {
diff --git a/net/core/filter.c b/net/core/filter.c
index 13079f0..70995dd 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -832,8 +832,8 @@ static int bpf_check_classic(const struct sock_filter *filter,
 	return -EINVAL;
 }
 
-static int bpf_prog_store_orig_filter(struct bpf_prog *fp,
-				      const struct sock_fprog *fprog)
+int bpf_prog_store_orig_filter(struct bpf_prog *fp,
+			       const struct sock_fprog *fprog)
 {
 	unsigned int fsize = bpf_classic_proglen(fprog);
 	struct sock_fprog_kern *fkprog;
-- 
2.5.0
[PATCH net-next 1/5] bridge: vlan: adjust rhashtable initial size and hash locks size
From: Nikolay Aleksandrov

As Stephen pointed out, the default initial size is more than we need,
so let's start small (4 elements, thus nelem_hint = 3). Also limit the
hash locks to the number of CPUs as we don't need any write-side scaling
and this looks like the minimum.

Signed-off-by: Nikolay Aleksandrov
---
 net/bridge/br_vlan.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index e227164bc3e1..283d012c3d89 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -19,6 +19,8 @@ static const struct rhashtable_params br_vlan_rht_params = {
 	.head_offset = offsetof(struct net_bridge_vlan, vnode),
 	.key_offset = offsetof(struct net_bridge_vlan, vid),
 	.key_len = sizeof(u16),
+	.nelem_hint = 3,
+	.locks_mul = 1,
 	.max_size = VLAN_N_VID,
 	.obj_cmpfn = br_vlan_cmp,
 	.automatic_shrinking = true,
-- 
2.4.3
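For readers wondering why nelem_hint = 3 gives 4 elements, the arithmetic can be sketched as follows. The sizing rule used here (round nelem_hint * 4 / 3 up to a power of two, with a floor of min_size) is my recollection of lib/rhashtable.c of that era and should be treated as an assumption to verify; `roundup_pow2()` and `initial_buckets()` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next power of two (requires n >= 1). */
static size_t roundup_pow2(size_t n)
{
	size_t p = 1;

	assert(n >= 1);
	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Assumed sizing rule: roundup_pow_of_two(hint * 4 / 3), but never
 * below min_size. nelem_hint must be at least 1.
 */
static size_t initial_buckets(size_t nelem_hint, size_t min_size)
{
	size_t size = roundup_pow2(nelem_hint * 4 / 3);

	return size > min_size ? size : min_size;
}
```

Under that rule a hint of 3 yields 3 * 4 / 3 = 4 buckets, whereas a hint of 4 would already round up to 8, which matches the "4 elements, thus nelem_hint = 3" remark.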
[PATCH net-next 3/5] bridge: vlan: move pvid inside net_bridge_vlan_group
From: Nikolay Aleksandrov

One obvious way to converge more code (which was also used by the
previous vlan code) is to move pvid inside net_bridge_vlan_group. This
allows us to simplify some and remove other port-specific functions.
It also gives us the ability to simply pass the vlan group and use all
of the contained information.

Signed-off-by: Nikolay Aleksandrov
---
 net/bridge/br_device.c  |   2 +-
 net/bridge/br_input.c   |   2 +-
 net/bridge/br_netlink.c |  42 +---
 net/bridge/br_private.h |  44 ++---
 net/bridge/br_vlan.c    | 103
 5 files changed, 75 insertions(+), 118 deletions(-)

diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index c915c5b408ea..bdfb9544ca03 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -56,7 +56,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
 	skb_reset_mac_header(skb);
 	skb_pull(skb, ETH_HLEN);
 
-	if (!br_allowed_ingress(br, skb, &vid))
+	if (!br_allowed_ingress(br, br_vlan_group(br), skb, &vid))
 		goto out;
 
 	if (is_broadcast_ether_addr(dest))
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index e27d0dfd2ee9..f5c5a4500e2f 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -140,7 +140,7 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
 	if (!p || p->state == BR_STATE_DISABLED)
 		goto drop;
 
-	if (!nbp_allowed_ingress(p, skb, &vid))
+	if (!br_allowed_ingress(p->br, nbp_vlan_group(p), skb, &vid))
 		goto out;
 
 	/* insert into forwarding database after filtering to avoid spoofing */
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index bb8bb7b36f04..c64dcad11662 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -22,17 +22,17 @@
 #include "br_private_stp.h"
 
 static int __get_num_vlan_infos(struct net_bridge_vlan_group *vg,
-				u32 filter_mask,
-				u16 pvid)
+				u32 filter_mask)
 {
 	struct net_bridge_vlan *v;
 	u16 vid_range_start = 0, vid_range_end = 0, vid_range_flags = 0;
-	u16 flags;
+	u16 flags, pvid;
 	int num_vlans = 0;
 
 	if (!(filter_mask & RTEXT_FILTER_BRVLAN_COMPRESSED))
 		return 0;
 
+	pvid = br_get_pvid(vg);
 	/* Count number of vlan infos */
 	list_for_each_entry(v, &vg->vlan_list, vlist) {
 		flags = 0;
@@ -74,7 +74,7 @@ initvars:
 }
 
 static int br_get_num_vlan_infos(struct net_bridge_vlan_group *vg,
-				 u32 filter_mask, u16 pvid)
+				 u32 filter_mask)
 {
 	if (!vg)
 		return 0;
@@ -82,7 +82,7 @@ static int br_get_num_vlan_infos(struct net_bridge_vlan_group *vg,
 	if (filter_mask & RTEXT_FILTER_BRVLAN)
 		return vg->num_vlans;
 
-	return __get_num_vlan_infos(vg, filter_mask, pvid);
+	return __get_num_vlan_infos(vg, filter_mask);
 }
 
 static size_t br_get_link_af_size_filtered(const struct net_device *dev,
@@ -92,19 +92,16 @@ static size_t br_get_link_af_size_filtered(const struct net_device *dev,
 	struct net_bridge_port *p;
 	struct net_bridge *br;
 	int num_vlan_infos;
-	u16 pvid = 0;
 
 	rcu_read_lock();
 	if (br_port_exists(dev)) {
 		p = br_port_get_rcu(dev);
 		vg = nbp_vlan_group(p);
-		pvid = nbp_get_pvid(p);
 	} else if (dev->priv_flags & IFF_EBRIDGE) {
 		br = netdev_priv(dev);
 		vg = br_vlan_group(br);
-		pvid = br_get_pvid(br);
 	}
-	num_vlan_infos = br_get_num_vlan_infos(vg, filter_mask, pvid);
+	num_vlan_infos = br_get_num_vlan_infos(vg, filter_mask);
 	rcu_read_unlock();
 
 	/* Each VLAN is returned in bridge_vlan_info along with flags */
@@ -196,18 +193,18 @@ nla_put_failure:
 }
 
 static int br_fill_ifvlaninfo_compressed(struct sk_buff *skb,
-					 struct net_bridge_vlan_group *vg,
-					 u16 pvid)
+					 struct net_bridge_vlan_group *vg)
 {
 	struct net_bridge_vlan *v;
 	u16 vid_range_start = 0, vid_range_end = 0, vid_range_flags = 0;
-	u16 flags;
+	u16 flags, pvid;
 	int err = 0;
 
 	/* Pack IFLA_BRIDGE_VLAN_INFO's for every vlan
 	 * and mark vlan info with begin and end flags
 	 * if vlaninfo represents a range
 	 */
+	pvid = br_get_pvid(vg);
 	list_for_each_entry(v, &vg->vlan_list, vlist) {
 		flags = 0;
 		if (!br_vlan_should_use(v))
@@ -251,12 +248,13 @@ initvars:
 }
 
 static int br_fill_ifvlaninfo(struct sk_buff *skb,
-			      struct
[PATCH v3 2/5] seccomp: add the concept of a seccomp filter FD
This patch introduces the concept of a seccomp fd, with a similar
interface and usage to ebpf fds. Initially, one is allowed to create,
install, and dump these fds. Any manipulation of seccomp fds requires
users to be root in their own user namespace, matching the checks done
for SECCOMP_SET_MODE_FILTER.

Installing a filterfd has some gotchas, though. Andy mentioned
previously that we should restrict installation to filter fds whose
parent is already in the filter tree. This doesn't quite work in the
case of created seccomp fds, since once you install a filter fd, you
can't install any other filter fd since it has no parent and there is no
way to "pre-chain" filters before installing them.

To work around this, we allow installing filters that have no parent. If
the filter has a parent, we require the current filter tree to be an
ancestor of it. I'm not quite sure that the ancestor restriction is
correct, since it can still allow for "re-parenting" of filters,
potentially introducing new filters to a task. However, since these
operations are limited to root in the user ns, perhaps it is ok.

There is also some potentially racy behavior where a task re-parents a
filter that another task has installed. One option to work around this
is to keep a bit on struct seccomp_filter to allow each filter to have
its parent set exactly once. (This would still allow you to install a
filter multiple times, as long as the parent was the same in each case.)

Signed-off-by: Tycho Andersen
CC: Kees Cook
CC: Will Drewry
CC: Oleg Nesterov
CC: Andy Lutomirski
CC: Pavel Emelyanov
CC: Serge E. Hallyn
CC: Alexei Starovoitov
CC: Daniel Borkmann
---
 include/uapi/linux/seccomp.h |  24 ++
 kernel/seccomp.c             | 189 ++-
 2 files changed, 210 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 0f238a4..4ee8770 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -13,10 +13,16 @@
 /* Valid operations for seccomp syscall. */
 #define SECCOMP_SET_MODE_STRICT	0
 #define SECCOMP_SET_MODE_FILTER	1
+#define SECCOMP_FILTER_FD	2
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
 #define SECCOMP_FILTER_FLAG_TSYNC	1
 
+/* Valid commands for SECCOMP_FILTER_FD */
+#define SECCOMP_FD_NEW		0
+#define SECCOMP_FD_INSTALL	1
+#define SECCOMP_FD_DUMP		2
+
 /*
  * All BPF programs must return a 32-bit value.
  * The bottom 16-bits are for optional return data.
@@ -51,4 +57,22 @@ struct seccomp_data {
 	__u64 args[6];
 };
 
+struct seccomp_fd {
+	__u32 size;
+
+	union {
+		/* SECCOMP_FD_NEW */
+		struct sock_fprog __user *new_prog;
+
+		/* SECCOMP_FD_INSTALL */
+		int install_fd;
+
+		/* SECCOMP_FD_DUMP */
+		struct {
+			int dump_fd;
+			struct sock_filter __user *insns;
+		};
+	};
+};
+
 #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 09f3769..6f0465c 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -26,6 +26,8 @@
 #endif
 
 #ifdef CONFIG_SECCOMP_FILTER
+#include
+#include
 #include
 #include
 #include
@@ -58,6 +60,7 @@ struct seccomp_filter {
 	atomic_t usage;
 	struct seccomp_filter *prev;
 	struct bpf_prog *prog;
+	spinlock_t prev_lock;
 };
 
 /* Limit any path through the tree to 256KB worth of instructions. */
@@ -393,6 +396,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 	}
 
 	atomic_set(&sfilter->usage, 1);
+	sfilter->prev_lock = __SPIN_LOCK_UNLOCKED(&sfilter->prev_lock);
 
 	return sfilter;
 }
@@ -441,6 +445,7 @@ static long seccomp_attach_filter(unsigned int flags,
 	struct seccomp_filter *walker;
 
 	assert_spin_locked(&current->sighand->siglock);
+	assert_spin_locked(&filter->prev_lock);
 
 	/* Validate resulting filter length. */
 	total_insns = filter->prog->len;
@@ -482,10 +487,8 @@ void get_seccomp_filter(struct task_struct *tsk)
 	atomic_inc(&orig->usage);
 }
 
-/* put_seccomp_filter - decrements the ref count of tsk->seccomp.filter */
-void put_seccomp_filter(struct task_struct *tsk)
+static void seccomp_filter_decref(struct seccomp_filter *orig)
 {
-	struct seccomp_filter *orig = tsk->seccomp.filter;
 	/* Clean up single-reference branches iteratively. */
 	while (orig && atomic_dec_and_test(&orig->usage)) {
 		struct seccomp_filter *freeme = orig;
@@ -494,6 +497,12 @@ void put_seccomp_filter(struct task_struct *tsk)
 	}
 }
 
+/* put_seccomp_filter - decrements the ref count of
[PATCH v3 4/5] kcmp: add KCMP_FILE_PRIVATE_DATA
This command allows comparing the underlying private data of two fds.
This is useful e.g. to find out if a seccomp filter is inherited, since
struct seccomp_filter instances are unique across tasks and are the
private_data of seccomp fds.

Signed-off-by: Tycho Andersen
CC: Kees Cook
CC: Will Drewry
CC: Oleg Nesterov
CC: Andy Lutomirski
CC: Pavel Emelyanov
CC: Serge E. Hallyn
CC: Alexei Starovoitov
CC: Daniel Borkmann
---
 include/uapi/linux/kcmp.h |  1 +
 kernel/kcmp.c             | 14 ++
 2 files changed, 15 insertions(+)

diff --git a/include/uapi/linux/kcmp.h b/include/uapi/linux/kcmp.h
index 84df14b..ed389d2 100644
--- a/include/uapi/linux/kcmp.h
+++ b/include/uapi/linux/kcmp.h
@@ -10,6 +10,7 @@ enum kcmp_type {
 	KCMP_SIGHAND,
 	KCMP_IO,
 	KCMP_SYSVSEM,
+	KCMP_FILE_PRIVATE_DATA,
 
 	KCMP_TYPES,
 };
diff --git a/kernel/kcmp.c b/kernel/kcmp.c
index 0aa69ea..9ae673b 100644
--- a/kernel/kcmp.c
+++ b/kernel/kcmp.c
@@ -165,6 +165,20 @@ SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
 		ret = -EOPNOTSUPP;
 #endif
 		break;
+	case KCMP_FILE_PRIVATE_DATA: {
+		struct file *filp1, *filp2;
+
+		filp1 = get_file_raw_ptr(task1, idx1);
+		filp2 = get_file_raw_ptr(task2, idx2);
+
+		if (filp1 && filp2)
+			ret = kcmp_ptr(filp1->private_data,
+				       filp2->private_data,
+				       KCMP_FILE_PRIVATE_DATA);
+		else
+			ret = -EBADF;
+		break;
+	}
 	default:
 		ret = -EINVAL;
 		break;
-- 
2.5.0
[PATCH net-next 0/5] bridge: vlan: cleanups & fixes
From: Nikolay Aleksandrov

Hi,
This is the first follow-up set. Patch 01 reduces the default rhashtable
size and the number of locks that can be allocated. Patches 02 and 04
fix possible null pointer dereferences due to the new ordering and
initialization on port add/del, and patch 03 moves the "pvid" member
into the net_bridge_vlan_group struct in order to simplify the code
(similar to how it was with the older struct). Patch 05 fixes adding a
vlan on a port which is pvid and doesn't have a global context yet.

Please review carefully, I think this is the first use of rhashtable's
"locks_mul" member in the tree and I'd like to make sure it's correct.
Another thing that needs special attention is the nbp_vlan_flush() move
after the rx_handler unregister.

Cheers,
 Nik

Nikolay Aleksandrov (5):
  bridge: vlan: adjust rhashtable initial size and hash locks size
  bridge: vlan: fix possible null vlgrp deref while registering new port
  bridge: vlan: move pvid inside net_bridge_vlan_group
  bridge: vlan: fix possible null ptr derefs on port init and deinit
  bridge: vlan: don't pass flags when creating context only

 net/bridge/br_device.c  |   2 +-
 net/bridge/br_if.c      |   3 +-
 net/bridge/br_input.c   |   2 +-
 net/bridge/br_netlink.c |  42 +++-
 net/bridge/br_private.h |  44 +
 net/bridge/br_vlan.c    | 127 ++--
 6 files changed, 93 insertions(+), 127 deletions(-)

-- 
2.4.3
[PATCH net-next 2/5] bridge: vlan: fix possible null vlgrp deref while registering new port
From: Nikolay Aleksandrov

While a new port is being initialized the rx_handler gets set, but the
vlans get initialized later in br_add_if(). In that window, if we
receive a frame with a link-local address, we can try to dereference
p->vlgrp in:
br_handle_frame() -> br_handle_local_finish() -> br_should_learn()

Fix this by checking vlgrp before using it.

Signed-off-by: Nikolay Aleksandrov
---
 net/bridge/br_vlan.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index 283d012c3d89..678d5c41b551 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -476,13 +476,15 @@ bool br_allowed_egress(struct net_bridge_vlan_group *vg,
 /* Called under RCU */
 bool br_should_learn(struct net_bridge_port *p, struct sk_buff *skb, u16 *vid)
 {
+	struct net_bridge_vlan_group *vg;
 	struct net_bridge *br = p->br;
 
 	/* If filtering was disabled at input, let it pass. */
 	if (!br->vlan_enabled)
 		return true;
 
-	if (!p->vlgrp->num_vlans)
+	vg = p->vlgrp;
+	if (!vg || !vg->num_vlans)
 		return false;
 
 	if (!br_vlan_get_tag(skb, vid) && skb->vlan_proto != br->vlan_proto)
-- 
2.4.3
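The shape of the fix (read the pointer once into a local, then test the local) can be shown with a stripped-down userspace sketch. The structs and `should_learn()` here are hypothetical simplifications; the real br_should_learn() goes on to inspect the skb's vlan tag, which is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced versions of the bridge structures. */
struct vlan_group { unsigned int num_vlans; };
struct bridge { bool vlan_enabled; };
struct port { struct bridge *br; struct vlan_group *vlgrp; };

static bool should_learn(const struct port *p)
{
	const struct vlan_group *vg;

	assert(p && p->br);

	/* If filtering is disabled, let everything pass. */
	if (!p->br->vlan_enabled)
		return true;

	/* The group may not be set up yet while the port is being added,
	 * so load the pointer once and check the local copy. */
	vg = p->vlgrp;
	if (!vg || !vg->num_vlans)
		return false;

	return true;	/* tag checks of the real function omitted */
}
```

A port in the window between rx_handler registration and vlan init corresponds to `vlgrp == NULL` here, and the function then simply refuses to learn instead of dereferencing NULL.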
Re: [PATCH v3 2/5] seccomp: add the concept of a seccomp filter FD
On Wed, Sep 30, 2015 at 11:13 AM, Tycho Andersen wrote:
> This patch introduces the concept of a seccomp fd, with a similar interface
> and usage to ebpf fds. Initially, one is allowed to create, install, and
> dump these fds. Any manipulation of seccomp fds requires users to be root
> in their own user namespace, matching the checks done for
> SECCOMP_SET_MODE_FILTER.
>
> Installing a filterfd has some gotchas, though. Andy mentioned previously
> that we should restrict installation to filter fds whose parent is already
> in the filter tree. This doesn't quite work in the case of created seccomp
> fds, since once you install a filter fd, you can't install any other filter
> fd since it has no parent and there is no way to "pre-chain" filters before
> installing them.

ISTM, if we like the seccomp fd approach, we should have them be
created with a parent already set. IOW the default should be that
their parent is the creator's seccomp fd and, if needed, creators
could specify a different parent.

--Andy
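To make the parent/ancestor discussion concrete, here is a small userspace sketch of the install rule as the patch description words it: a filter with no parent may always be installed, otherwise the task's current filter has to appear in the new filter's parent chain. `struct filter`, `is_ancestor()` and `may_install()` are hypothetical, and rejecting the case where the task has no filter but the new one has a parent is my assumption, since the thread does not spell it out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical minimal filter node: each filter points at its parent. */
struct filter {
	struct filter *prev;
};

/* True if 'ancestor' is 'f' itself or reachable from 'f' via prev. */
static bool is_ancestor(const struct filter *ancestor, const struct filter *f)
{
	for (; f; f = f->prev)
		if (f == ancestor)
			return true;
	return false;
}

/*
 * Install rule sketch: parentless filters are always installable;
 * otherwise the task's current filter 'cur' must be found somewhere in
 * the new filter's parent chain. cur == NULL with a parented filter is
 * rejected (an assumption, see the note above the code).
 */
static bool may_install(const struct filter *cur,
			const struct filter *new_filter)
{
	assert(new_filter);

	if (!new_filter->prev)
		return true;
	return cur && is_ancestor(cur, new_filter->prev);
}
```

With Andy's suggestion of setting the parent at creation time, the first branch would go away and every new filter would carry the creator's filter as `prev` by default.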
Re: [PATCH v3 4/5] kcmp: add KCMP_FILE_PRIVATE_DATA
On Wed, Sep 30, 2015 at 11:13 AM, Tycho Andersen wrote:
> This command allows comparing the underling private data of two fds. This
> is useful e.g. to find out if a seccomp filter is inherited, since struct
> seccomp_filter are unique across tasks and are the private_data seccomp
> fds.

This is very implementation-specific and may have nasty ABI
consequences far outside seccomp. Let's do something specific to
seccomp and/or eBPF.

--Andy
Re: [PATCH v3 4/5] kcmp: add KCMP_FILE_PRIVATE_DATA
On Wed, Sep 30, 2015 at 11:25:41AM -0700, Andy Lutomirski wrote:
> On Wed, Sep 30, 2015 at 11:13 AM, Tycho Andersen wrote:
> > This command allows comparing the underling private data of two fds. This
> > is useful e.g. to find out if a seccomp filter is inherited, since struct
> > seccomp_filter are unique across tasks and are the private_data seccomp
> > fds.
>
> This is very implementation-specific and may have nasty ABI
> consequences far outside seccomp. Let's do something specific to
> seccomp and/or eBPF.

We could change the name to a less generic KCMP_SECCOMP_FD or something,
but without some sort of GUID on each struct seccomp_filter, the
implementation would be effectively the same as it is today. Is that
enough, or do we need a GUID?

Tycho
[PATCH v2 13/14] RDS: IB: use max_mr from HCA caps than max_fmr
All HCA drivers seem to populate the max_mr cap, and a few of them
populate both max_mr and max_fmr.

Hence update the RDS code to make use of max_mr.

Signed-off-by: Santosh Shilimkar
Signed-off-by: Santosh Shilimkar
---
 net/rds/ib.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index 2d3f2ab..883813a 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -148,8 +148,8 @@ static void rds_ib_add_one(struct ib_device *device)
 	rds_ibdev->max_sge = min(dev_attr->max_sge, RDS_IB_MAX_SGE);
 
 	rds_ibdev->fmr_max_remaps = dev_attr->max_map_per_fmr?: 32;
-	rds_ibdev->max_fmrs = dev_attr->max_fmr ?
-			min_t(unsigned int, dev_attr->max_fmr, fmr_pool_size) :
+	rds_ibdev->max_fmrs = dev_attr->max_mr ?
+			min_t(unsigned int, dev_attr->max_mr, fmr_pool_size) :
 			fmr_pool_size;
 
 	rds_ibdev->max_initiator_depth = dev_attr->max_qp_init_rd_atom;
-- 
1.9.1
[PATCH v2 05/14] RDS: defer the over_batch work to send worker
The current process gives up if its send work is over the batch limit.
The work queue will get kicked to finish off any other requests.
This fixes the remainder condition from commit 443be0e5affe
("RDS: make sure not to loop forever inside rds_send_xmit").

The restart condition is only for the case where we reached the
over_batch code for some other reason, so we just retry again
before giving up.

Signed-off-by: Santosh Shilimkar
Signed-off-by: Santosh Shilimkar
---
 net/rds/send.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/rds/send.c b/net/rds/send.c
index 4df61a5..f1e709c 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -423,7 +423,9 @@ over_batch:
 		     !list_empty(&conn->c_send_queue)) &&
 		    send_gen == conn->c_send_gen) {
 			rds_stats_inc(s_send_lock_queue_raced);
-			goto restart;
+			if (batch_count < 1024)
+				goto restart;
+			queue_delayed_work(rds_wq, &conn->c_send_w, 1);
 		}
 	}
 out:
-- 
1.9.1
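The decision made at the over_batch label can be restated as a small pure function, which may make the three outcomes easier to see. This is a userspace sketch with hypothetical names: `work_pending` stands in for the queue and flag checks, `gen_unchanged` for `send_gen == conn->c_send_gen`, and the limit is passed in rather than hard-coded as 1024.

```c
#include <assert.h>
#include <stdbool.h>

enum xmit_action { XMIT_DONE, XMIT_RESTART, XMIT_DEFER };

/*
 * What to do once the send loop reaches its exit check:
 *  - nothing queued, or another sender took over: we are done;
 *  - still under the batch limit: loop again in this context;
 *  - over the limit: hand the rest to the send worker.
 */
static enum xmit_action over_batch_action(bool work_pending,
					  bool gen_unchanged,
					  unsigned int batch_count,
					  unsigned int limit)
{
	assert(limit > 0);

	if (!work_pending || !gen_unchanged)
		return XMIT_DONE;
	if (batch_count < limit)
		return XMIT_RESTART;
	return XMIT_DEFER;
}
```

In the patch, XMIT_RESTART corresponds to `goto restart` and XMIT_DEFER to the `queue_delayed_work()` call, which bounds the time one caller can spend looping on behalf of others.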
[PATCH v2 03/14] RDS: fix rds_sock reference bug while doing bind
One needs to take an rds socket reference while using it and release it
once done with it. The rds_add_bound() code path does not do that, so
let's fix it.

Signed-off-by: Santosh Shilimkar
Signed-off-by: Santosh Shilimkar
---
 net/rds/bind.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/net/rds/bind.c b/net/rds/bind.c
index 01989e2..166c605 100644
--- a/net/rds/bind.c
+++ b/net/rds/bind.c
@@ -61,8 +61,10 @@ static struct rds_sock *rds_bind_lookup(__be32 addr, __be16 port,
 		cmp = ((u64)be32_to_cpu(rs->rs_bound_addr) << 32) |
 		      be16_to_cpu(rs->rs_bound_port);
 
-		if (cmp == needle)
+		if (cmp == needle) {
+			rds_sock_addref(rs);
 			return rs;
+		}
 	}
 
 	if (insert) {
@@ -94,10 +96,10 @@ struct rds_sock *rds_find_bound(__be32 addr, __be16 port)
 	rs = rds_bind_lookup(addr, port, NULL);
 	read_unlock_irqrestore(&rds_bind_lock, flags);
 
-	if (rs && !sock_flag(rds_rs_to_sk(rs), SOCK_DEAD))
-		rds_sock_addref(rs);
-	else
+	if (rs && sock_flag(rds_rs_to_sk(rs), SOCK_DEAD)) {
+		rds_sock_put(rs);
 		rs = NULL;
+	}
 
 	rdsdebug("returning rs %p for %pI4:%u\n", rs, &addr,
 		ntohs(port));
@@ -123,14 +125,18 @@ static int rds_add_bound(struct rds_sock *rs, __be32 addr, __be16 *port)
 
 	write_lock_irqsave(&rds_bind_lock, flags);
 	do {
+		struct rds_sock *rrs;
 		if (rover == 0)
 			rover++;
-		if (!rds_bind_lookup(addr, cpu_to_be16(rover), rs)) {
+		rrs = rds_bind_lookup(addr, cpu_to_be16(rover), rs);
+		if (!rrs) {
 			*port = rs->rs_bound_port;
 			ret = 0;
 			rdsdebug("rs %p binding to %pI4:%d\n",
 			  rs, &addr, (int)ntohs(*port));
 			break;
+		} else {
+			rds_sock_put(rrs);
 		}
 	} while (rover++ != last);
 
-- 
1.9.1
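The convention this patch moves to is "lookup returns a referenced object, and every caller that does not keep it must put it". A toy userspace model follows; it is single-threaded, uses a plain int instead of a real refcount, and all names (`struct sock_ref`, `bind_lookup()`, `find_bound()`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy socket: a bound port, a reference count and a "dead" flag. */
struct sock_ref {
	unsigned int port;
	int refcnt;
	int dead;
};

static void sock_addref(struct sock_ref *s)
{
	s->refcnt++;
}

static void sock_put(struct sock_ref *s)
{
	assert(s->refcnt > 0);
	s->refcnt--;
}

/* Lookup hands back a referenced entry, or NULL if nothing matches. */
static struct sock_ref *bind_lookup(struct sock_ref *tbl, size_t n,
				    unsigned int port)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].port == port) {
			sock_addref(&tbl[i]);
			return &tbl[i];
		}
	}
	return NULL;
}

/* Drop the reference again if the socket turned out to be dead. */
static struct sock_ref *find_bound(struct sock_ref *tbl, size_t n,
				   unsigned int port)
{
	struct sock_ref *s = bind_lookup(tbl, n, port);

	if (s && s->dead) {
		sock_put(s);
		s = NULL;
	}
	return s;
}
```

The rover loop in rds_add_bound() follows the same rule: a non-NULL lookup result only means "port in use", so the reference is dropped straight away.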
Re: [RFT v2] geneve: implement support for IPv6-based tunnels
Hi John,

[auto build test results on v4.3-rc3 -- if it's inappropriate base, please ignore]

reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__

sparse warnings: (new ones prefixed by >>)

>> drivers/net/geneve.c:55:19: sparse: symbol 'geneve_remote_unspec' was not declared. Should it be static?

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Re: [PATCH net-next 5/6] ipv6: Call xfrm6_xlat_addr from ipv6_rcv
On Wed, Sep 30, 2015 at 2:06 AM, Steffen Klassert wrote:
> On Tue, Sep 29, 2015 at 03:17:22PM -0700, Tom Herbert wrote:
>> Call before performing NF_HOOK and routing in order to perform address
>> translation in the receive path.
>>
>> Signed-off-by: Tom Herbert
>> ---
>>  net/ipv6/ip6_input.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
>> index 9075acf..06dac55 100644
>> --- a/net/ipv6/ip6_input.c
>> +++ b/net/ipv6/ip6_input.c
>> @@ -183,6 +183,9 @@ int ipv6_rcv(struct sk_buff *skb, struct net_device
>> *dev, struct packet_type *pt
>>       /* Must drop socket now because of tproxy. */
>>       skb_orphan(skb);
>>
>> +     /* Translate destination address before routing */
>> +     xfrm6_xlat_addr(skb);
>> +
>
> This shows that xfrm is not the right place to add this. The existing
> xfrm hooks are located at the same place as your current LWT hooks are.
>
> You could use the existing xfrm hooks similar to xfrm tunnel modes.
> This reinserts the transformed packet back into layer2, but I guess
> this is not what you want.
>
> I'm currently playing with a GRO codepath for IPsec to get the
> packets transformed early. If you can do your address translation
> that early, it could be an option too. This clearly depends on
> enabled GRO at the receiving device, but you would still have
> the LWT hook as a fallback.
>
GRO probably doesn't help here. ILA already works with GRO, and
performing translation for every segment instead of just once for the
GRO packet would be unnecessary overhead. Besides, that still doesn't
address the problem of how to hook in a lookup and translation
function in the data path.

>>       return NF_HOOK(NFPROTO_IPV6, NF_INET_PRE_ROUTING,
>>                      net, NULL, skb, dev, NULL,
>>                      ip6_rcv_finish);
>
> Or, try to use the netfilter hook that seems to be at the right
> place at least.
>
My original patch did hook into nf so it didn't require any change to
the IP data path. The suggested alternatives were to use iptables or
nft, but their overhead is too great for these to be useful as a
performance optimization. The problem is that any additional lookup
added for this purpose only makes sense if it is significantly cheaper
than the cost of doing a route lookup (the part that can be eliminated
by early demux), and needs to have near zero impact on unrelated
traffic.

Tom
Re: [patch net-next v3 02/10] switchdev: introduce transaction item queue for attr_set and obj_add
Hi all, On Sep. Friday 25 (39) 11:03 AM, Vivien Didelot wrote: > On Sep. Thursday 24 (39) 10:55 PM, David Miller wrote: > > From: Scott Feldman> > Date: Thu, 24 Sep 2015 22:29:43 -0700 > > > > > I'd rather keep 2-phase not optional, or at least make it some what of > > > a pain for drivers to opt-out of 2-phase. Forcing the driver to see > > > both phases means the driver needs to put some code to skip phase 1 > > > (and hopefully has some persistent comment explaining why its being > > > skipped). Something like: > > > > > > /* I'm skipping phase 1 prepare for this operation. I have infinite > > > hardware > > > * resources and I'm not setting any persistent state in the driver or > > > device > > > * and I don't need any dynamic resources from the kernel, so its > > > impossible > > > * for me to fail phase 2 commit. Nothing to prepare, sorry. > > > */ > > > > I agree with Scott here. > > > > If you can opt out of something, you can not think about it and thus > > more likely get it wrong. > > > > I can just see a driver not implementing prepare at all and then doing > > stupid things in commit when they hit some resource limit or whatever, > > rather than taking care of such issues in prepare. > > OK, I have no experience with stacked devices nor what it actually looks > like, but I understand that it is a redundant setup where it makes sense > to ensure that an operation is feasible before programming the hardware. > > I agree with both of you on imposing switchdev drivers such notion. > > I was confused with the rtnl lock (from bridge netlink requests) which > seemed to limit a lot the usage of this prepare phase. > > I don't know the batch mode neither, but I can think about a potentially > powerful usage of the prepare phase in Marvell switches (or any basic > home router switches), please tell me if the following is feasible: > > Every hardware VLANs I know of are programmed with all port membership > in one shot. 
> This is not feasible today with the bridge command. If I
> could bundle in one request the equivalent of ("VID 100: 0u 1u 5t"):
>
> bridge vlan add master dev swp0 vid 100 pvid untagged
> bridge vlan add master dev swp1 vid 100 pvid untagged
> bridge vlan add master dev swp5 vid 100 # cpu
>
> In such case the prepare phase could be great to allocate and populate a
> VLAN entry structure (i.e. struct mv88e6xxx_vtu_stu_entry) before
> programming the hardware *just once*. Is that doable?

May I get answers for this? I'd need that in order to suggest a next
step for the prepare phase in DSA drivers.

Thanks,
-v
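To make the question above concrete, here is a minimal user-space sketch of
the two-phase idea applied to a whole-VLAN update. It is only an
illustration under assumptions: struct vlan_txn, vlan_prepare(),
vlan_commit(), vlan_abort() and the mock hw_table are hypothetical names,
not switchdev or DSA API. Prepare does everything that can fail (validation
and allocation); commit only copies the staged entry into the mock hardware
table, so it has nothing left to fail on.

```c
#include <stdbool.h>
#include <stdlib.h>

#define MAX_PORTS 8	/* hypothetical port count */
#define MAX_VLANS 4	/* hypothetical size of the mock hardware table */

/* One VLAN entry: membership of every port, programmed in one shot. */
struct vlan_entry {
	unsigned short vid;
	bool member[MAX_PORTS];
	bool untagged[MAX_PORTS];
};

/* Mock "hardware" VLAN table. */
struct vlan_entry hw_table[MAX_VLANS];
int hw_used;

/* State carried from the prepare phase to the commit phase. */
struct vlan_txn {
	struct vlan_entry *staged;
};

/* Phase 1: validate the request and allocate the staged entry. */
int vlan_prepare(struct vlan_txn *txn, unsigned short vid,
		 const int *ports, const bool *untagged, int nports)
{
	int i;

	txn->staged = NULL;
	if (vid == 0 || vid > 4094 || hw_used >= MAX_VLANS)
		return -1;
	for (i = 0; i < nports; i++)
		if (ports[i] < 0 || ports[i] >= MAX_PORTS)
			return -1;

	txn->staged = calloc(1, sizeof(*txn->staged));
	if (!txn->staged)
		return -1;

	txn->staged->vid = vid;
	for (i = 0; i < nports; i++) {
		txn->staged->member[ports[i]] = true;
		txn->staged->untagged[ports[i]] = untagged[i];
	}
	return 0;
}

/* Phase 2: program the mock hardware once; no failure path remains. */
void vlan_commit(struct vlan_txn *txn)
{
	hw_table[hw_used++] = *txn->staged;
	free(txn->staged);
	txn->staged = NULL;
}

/* Drop a prepared transaction without touching the mock hardware. */
void vlan_abort(struct vlan_txn *txn)
{
	free(txn->staged);
	txn->staged = NULL;
}
```

With ports {0, 1, 5} and untagged flags {true, true, false} this stages the
"VID 100: 0u 1u 5t" entry from the example and writes it in a single commit.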
Re: [PATCH v3 4/5] kcmp: add KCMP_FILE_PRIVATE_DATA
On Wed, Sep 30, 2015 at 11:47:05AM -0700, Andy Lutomirski wrote:
> On Wed, Sep 30, 2015 at 11:41 AM, Tycho Andersen wrote:
> > On Wed, Sep 30, 2015 at 11:25:41AM -0700, Andy Lutomirski wrote:
> >> On Wed, Sep 30, 2015 at 11:13 AM, Tycho Andersen
> >> wrote:
> >> > This command allows comparing the underling private data of two fds. This
> >> > is useful e.g. to find out if a seccomp filter is inherited, since struct
> >> > seccomp_filter are unique across tasks and are the private_data seccomp
> >> > fds.
> >>
> >> This is very implementation-specific and may have nasty ABI
> >> consequences far outside seccomp. Let's do something specific to
> >> seccomp and/or eBPF.
> >
> > We could change the name to a less generic KCMP_SECCOMP_FD or
> > something, but without some sort of GUID on each struct
> > seccomp_filter, the implementation would be effectively the same as it
> > is today. Is that enough, or do we need a GUID?
> >
>
> I don't care about the GUID. I think we should name it
> KCMP_SECCOMP_FD and make it only work on seccomp fds.

Ok, I can do that.

> Alternatively, we could figure out why KCMP_FILE doesn't do the trick
> and consider fixing it. IMO it's really too bad that struct file is
> so heavyweight that we can't really just embed one in all kinds of
> structures.

The problem is that KCMP_FILE compares the file objects themselves,
instead of the underlying data. If I ask for a seccomp fd for filter 0
twice, I'll have two different file objects and they won't be equal. I
suppose we could add some special logic inside KCMP_FILE to compare
the underlying data in special cases (seccomp, ebpf, others?), but it
seems cleaner to have a separate command as you described above.

Tycho
[PATCH v2 02/14] RDS: make socket bind/release locking scheme simple and more efficient
The RDS bind and release locking scheme is very inefficient. It uses RCU
to maintain the bind hash-table, which is great, but it also needs to hold
a spinlock for [add/remove]_bound(). So for the overall use case, the
concurrent speedup of the hash-table doesn't pay off. In fact, the
blocking nature of synchronize_rcu() makes RDS socket shutdown too slow,
which hurts RDS performance since connection shutdown and re-connect
happen quite often to maintain the RC part of the protocol.

So we make the locking scheme simpler and more efficient by replacing
spin_locks with reader/writer locks and getting rid of RCU for the bind
hash-table.

In a subsequent patch, we also convert the global lock to a per-bucket
lock to reduce global lock contention.

Signed-off-by: Santosh Shilimkar
Signed-off-by: Santosh Shilimkar
---
 net/rds/af_rds.c | 6 --
 net/rds/bind.c | 35 +++
 2 files changed, 15 insertions(+), 26 deletions(-)

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index a2f28a6..dc08766 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -72,13 +72,7 @@ static int rds_release(struct socket *sock)
 	rds_clear_recv_queue(rs);
 	rds_cong_remove_socket(rs);

-	/*
-	 * the binding lookup hash uses rcu, we need to
-	 * make sure we synchronize_rcu before we free our
-	 * entry
-	 */
 	rds_remove_bound(rs);
-	synchronize_rcu();

 	rds_send_drop_to(rs, NULL);
 	rds_rdma_drop_keys(rs);
diff --git a/net/rds/bind.c b/net/rds/bind.c
index dd666fb..01989e2 100644
--- a/net/rds/bind.c
+++ b/net/rds/bind.c
@@ -40,7 +40,7 @@

 #define BIND_HASH_SIZE 1024
 static struct hlist_head bind_hash_table[BIND_HASH_SIZE];
-static DEFINE_SPINLOCK(rds_bind_lock);
+static DEFINE_RWLOCK(rds_bind_lock);

 static struct hlist_head *hash_to_bucket(__be32 addr, __be16 port)
 {
@@ -48,6 +48,7 @@ static struct hlist_head *hash_to_bucket(__be32 addr, __be16 port)
 				  (BIND_HASH_SIZE - 1));
 }

+/* must hold either read or write lock (write lock for insert != NULL) */
 static struct rds_sock *rds_bind_lookup(__be32 addr, __be16 port,
 					struct rds_sock *insert)
 {
@@ -56,30 +57,24 @@ static struct rds_sock *rds_bind_lookup(__be32 addr, __be16 port,
 	u64 cmp;
 	u64 needle = ((u64)be32_to_cpu(addr) << 32) | be16_to_cpu(port);

-	rcu_read_lock();
-	hlist_for_each_entry_rcu(rs, head, rs_bound_node) {
+	hlist_for_each_entry(rs, head, rs_bound_node) {
 		cmp = ((u64)be32_to_cpu(rs->rs_bound_addr) << 32) |
 		      be16_to_cpu(rs->rs_bound_port);

-		if (cmp == needle) {
-			rcu_read_unlock();
+		if (cmp == needle)
 			return rs;
-		}
 	}
-	rcu_read_unlock();

 	if (insert) {
 		/*
 		 * make sure our addr and port are set before
-		 * we are added to the list, other people
-		 * in rcu will find us as soon as the
-		 * hlist_add_head_rcu is done
+		 * we are added to the list.
 		 */
 		insert->rs_bound_addr = addr;
 		insert->rs_bound_port = port;
 		rds_sock_addref(insert);

-		hlist_add_head_rcu(&insert->rs_bound_node, head);
+		hlist_add_head(&insert->rs_bound_node, head);
 	}
 	return NULL;
 }
@@ -93,8 +88,11 @@ static struct rds_sock *rds_bind_lookup(__be32 addr, __be16 port,
 struct rds_sock *rds_find_bound(__be32 addr, __be16 port)
 {
 	struct rds_sock *rs;
+	unsigned long flags;

+	read_lock_irqsave(&rds_bind_lock, flags);
 	rs = rds_bind_lookup(addr, port, NULL);
+	read_unlock_irqrestore(&rds_bind_lock, flags);

 	if (rs && !sock_flag(rds_rs_to_sk(rs), SOCK_DEAD))
 		rds_sock_addref(rs);
@@ -103,6 +101,7 @@ struct rds_sock *rds_find_bound(__be32 addr, __be16 port)

 	rdsdebug("returning rs %p for %pI4:%u\n", rs, &addr,
 		ntohs(port));
+
 	return rs;
 }

@@ -121,7 +120,7 @@ static int rds_add_bound(struct rds_sock *rs, __be32 addr, __be16 *port)
 		last = rover - 1;
 	}

-	spin_lock_irqsave(&rds_bind_lock, flags);
+	write_lock_irqsave(&rds_bind_lock, flags);

 	do {
 		if (rover == 0)
@@ -135,7 +134,7 @@ static int rds_add_bound(struct rds_sock *rs, __be32 addr, __be16 *port)
 		}
 	} while (rover++ != last);

-	spin_unlock_irqrestore(&rds_bind_lock, flags);
+	write_unlock_irqrestore(&rds_bind_lock, flags);

 	return ret;
 }
@@ -144,19 +143,19 @@ void rds_remove_bound(struct rds_sock *rs)
 {
 	unsigned long flags;

-	spin_lock_irqsave(&rds_bind_lock, flags);
+	write_lock_irqsave(&rds_bind_lock, flags);

 	if (rs->rs_bound_addr) {
 		rdsdebug("rs %p unbinding
[PATCH v3 5/5] bpf: save the program the user actually supplied
In some cases (e.g. seccomp) the program result might be translated from the
original program the user supplied. If we're saving the result for
checkpoint/restore, we should save exactly the program the user initially
supplied. Saving the translated program causes problems when the
translations seccomp makes are not allowed by bpf_check_classic.

Signed-off-by: Tycho Andersen
CC: Kees Cook
CC: Will Drewry
CC: Oleg Nesterov
CC: Andy Lutomirski
CC: Pavel Emelyanov
CC: Serge E. Hallyn
CC: Alexei Starovoitov
CC: Daniel Borkmann
---
 net/core/filter.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 70995dd..5a4596b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -845,8 +845,7 @@ int bpf_prog_store_orig_filter(struct bpf_prog *fp,
 	fkprog = fp->orig_prog;
 	fkprog->len = fprog->len;

-	fkprog->filter = kmemdup(fp->insns, fsize,
-				 GFP_KERNEL | __GFP_NOWARN);
+	fkprog->filter = memdup_user(fprog->filter, fsize);
 	if (!fkprog->filter) {
 		kfree(fp->orig_prog);
 		return -ENOMEM;
--
2.5.0
[PATCH v3 3/5] seccomp: add a ptrace command to get seccomp filter fds
I just picked 40 for the constant out of thin air, but there may be a more
appropriate value for this. Also, we return EINVAL when there is no filter for
the index the user requested, but ptrace also returns EINVAL for invalid
commands, making it slightly awkward to test whether or not the kernel supports
this feature. It can still be done via,

if (is_in_mode_filter(pid)) {
	int fd;

	fd = ptrace(PTRACE_SECCOMP_GET_FILTER_FD, pid, NULL, 0);
	if (fd < 0 && errno == EINVAL)
		/* not supported */
	...
}

since being in SECCOMP_MODE_FILTER implies that there is at least one filter.
If there is a more appropriate errno (ESRCH collides as well with ptrace) to
give here that may be better.

Signed-off-by: Tycho Andersen
CC: Kees Cook
CC: Will Drewry
CC: Oleg Nesterov
CC: Andy Lutomirski
CC: Pavel Emelyanov
CC: Serge E. Hallyn
CC: Alexei Starovoitov
CC: Daniel Borkmann
---
 include/linux/seccomp.h | 9 +
 include/uapi/linux/ptrace.h | 2 ++
 kernel/ptrace.c | 4
 kernel/seccomp.c | 28
 4 files changed, 43 insertions(+)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index f426503..637d91f 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -95,4 +95,13 @@ static inline void get_seccomp_filter(struct task_struct *tsk)
 	return;
 }
 #endif /* CONFIG_SECCOMP_FILTER */
+
+#if defined(CONFIG_CHECKPOINT_RESTORE) && defined(CONFIG_SECCOMP_FILTER)
+extern long seccomp_get_filter_fd(struct task_struct *task, long data);
+#else
+static inline long seccomp_get_filter_fd(struct task_struct *task, long data)
+{
+	return -EINVAL;
+}
+#endif /* CONFIG_CHECKPOINT_RESTORE && CONFIG_SECCOMP_FILTER */
 #endif /* _LINUX_SECCOMP_H */
diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index a7a6979..3271f5a 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -23,6 +23,8 @@

 #define PTRACE_SYSCALL 24

+#define PTRACE_SECCOMP_GET_FILTER_FD 40
+
 /* 0x4200-0x4300 are reserved for architecture-independent additions. */
 #define PTRACE_SETOPTIONS 0x4200
 #define PTRACE_GETEVENTMSG 0x4201
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 787320d..aede440 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -1016,6 +1016,10 @@ int ptrace_request(struct task_struct *child, long request,
 		break;
 	}
 #endif
+
+	case PTRACE_SECCOMP_GET_FILTER_FD:
+		return seccomp_get_filter_fd(child, data);
+
 	default:
 		break;
 	}
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 6f0465c..7275ce0 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1058,3 +1058,31 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
 	/* prctl interface doesn't have flags, so they are always zero. */
 	return do_seccomp(op, 0, uargs);
 }
+
+#if defined(CONFIG_CHECKPOINT_RESTORE) && defined(CONFIG_SECCOMP_FILTER)
+long seccomp_get_filter_fd(struct task_struct *task, long n)
+{
+	struct seccomp_filter *filter;
+	long fd;
+
+	if (task->seccomp.mode != SECCOMP_MODE_FILTER)
+		return -EINVAL;
+
+	filter = task->seccomp.filter;
+	while (n > 0 && filter) {
+		filter = filter->prev;
+		n--;
+	}
+
+	if (!filter)
+		return -EINVAL;
+
+	atomic_inc(&filter->usage);
+	fd = anon_inode_getfd("seccomp", _fops, filter,
+			      O_RDONLY | O_CLOEXEC);
+	if (fd < 0)
+		seccomp_filter_decref(filter);
+
+	return fd;
+}
+#endif
--
2.5.0
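The index handling in seccomp_get_filter_fd() can be tried out in user
space with a small model. This is a sketch under assumptions, not kernel
code: struct mock_filter and nth_filter() are hypothetical stand-ins, and
only the prev link of struct seccomp_filter is modeled. Index 0 is the most
recently installed filter, and each step follows prev towards older ones.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct seccomp_filter: only the link is modeled. */
struct mock_filter {
	int id;
	struct mock_filter *prev;	/* next older filter, NULL at the end */
};

/*
 * Return the filter n steps back from the newest one, or NULL when the
 * chain is shorter than that (the case the patch maps to -EINVAL).
 */
struct mock_filter *nth_filter(struct mock_filter *newest, long n)
{
	struct mock_filter *f = newest;

	while (n > 0 && f) {
		f = f->prev;
		n--;
	}
	return f;
}
```

One property of the loop as written: a negative index skips the loop and
yields the newest filter rather than an error, which may or may not be the
intended behaviour.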
Re: [PATCH v3 4/5] kcmp: add KCMP_FILE_PRIVATE_DATA
On Wed, Sep 30, 2015 at 11:55 AM, Tycho Andersen wrote:
> On Wed, Sep 30, 2015 at 11:47:05AM -0700, Andy Lutomirski wrote:
>> On Wed, Sep 30, 2015 at 11:41 AM, Tycho Andersen
>> wrote:
>> > On Wed, Sep 30, 2015 at 11:25:41AM -0700, Andy Lutomirski wrote:
>> >> On Wed, Sep 30, 2015 at 11:13 AM, Tycho Andersen
>> >> wrote:
>> >> > This command allows comparing the underling private data of two fds.
>> >> > This
>> >> > is useful e.g. to find out if a seccomp filter is inherited, since
>> >> > struct
>> >> > seccomp_filter are unique across tasks and are the private_data seccomp
>> >> > fds.
>> >>
>> >> This is very implementation-specific and may have nasty ABI
>> >> consequences far outside seccomp. Let's do something specific to
>> >> seccomp and/or eBPF.
>> >
>> > We could change the name to a less generic KCMP_SECCOMP_FD or
>> > something, but without some sort of GUID on each struct
>> > seccomp_filter, the implementation would be effectively the same as it
>> > is today. Is that enough, or do we need a GUID?
>> >
>>
>> I don't care about the GUID. I think we should name it
>> KCMP_SECCOMP_FD and make it only work on seccomp fds.
>
> Ok, I can do that.
>
>> Alternatively, we could figure out why KCMP_FILE doesn't do the trick
>> and consider fixing it. IMO it's really too bad that struct file is
>> so heavyweight that we can't really just embed one in all kinds of
>> structures.
>
> The problem is that KCMP_FILE compares the file objects themselves,
> instead of the underlying data. If I ask for a seccomp fd for filter 0
> twice, I'll have two different file objects and they won't be equal. I
> suppose we could add some special logic inside KCMP_FILE to compare
> the underlying data in special cases (seccomp, ebpf, others?), but it
> seems cleaner to have a separate command as you described above.
>

What I meant was that maybe we could get the two requests to actually
produce the same struct file. But that could get very messy
memory-wise.

--Andy
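The distinction being discussed in this thread (identity of the file object
versus identity of what it points to) fits in a few lines of user-space C.
This is a toy model with hypothetical types (struct mock_file, struct
mock_seccomp_filter), not the kernel's kcmp implementation: asking twice for
an fd to the same filter gives two distinct file objects that share one
private_data pointer.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for struct seccomp_filter and struct file. */
struct mock_seccomp_filter {
	int id;
};

struct mock_file {
	void *private_data;
};

/* KCMP_FILE-style check: are these the very same file object? */
bool same_file(const struct mock_file *a, const struct mock_file *b)
{
	return a == b;
}

/* Proposed check: do both files refer to the same underlying object? */
bool same_private_data(const struct mock_file *a, const struct mock_file *b)
{
	return a->private_data == b->private_data;
}
```

Two mock files created for the same filter compare unequal under
same_file() and equal under same_private_data(), which is the gap the new
kcmp command is meant to close.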
Re: Rate limiting AP bandwidth change messages in ieee80211_config_bw?
>
> I'm not sure ratelimiting it would even work - it's not *that* high
> frequency? Not really sure though.
>
> I think we can do either, it's not such a terribly important message as
> far as I can tell.
>

Seems like Emmanuel would like to see the message stay in some form -
perhaps we should try rate limiting it then? Could you check if that
actually works?

johannes
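Whether rate limiting helps depends on how the message rate compares with
the limiter's interval and burst, which is easy to reason about with a
small model. The following is a user-space sketch under assumptions (struct
msg_ratelimit and msg_ratelimit_ok() are hypothetical names; in the kernel
one would presumably use an existing helper such as net_ratelimit() or
printk_ratelimited() instead of open-coding this). It allows at most
"burst" messages per "interval" ticks and counts the ones it suppresses.

```c
#include <stdbool.h>

/* Hypothetical limiter state: at most `burst` messages per `interval` ticks. */
struct msg_ratelimit {
	unsigned long interval;		/* window length in ticks */
	unsigned int burst;		/* messages allowed per window */
	unsigned long window_start;	/* tick at which the window opened */
	unsigned int printed;		/* messages let through in this window */
	unsigned int missed;		/* messages suppressed so far */
};

/* Return true if a message may be printed at time `now`. */
bool msg_ratelimit_ok(struct msg_ratelimit *rl, unsigned long now)
{
	/* Open a new window once the current one has expired. */
	if (now - rl->window_start >= rl->interval) {
		rl->window_start = now;
		rl->printed = 0;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}

	rl->missed++;
	return false;
}
```

If bandwidth changes arrive more slowly than burst per interval, nothing is
ever suppressed, which matches the doubt raised above about whether the
message is frequent enough for rate limiting to matter.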
[PATCH v2 09/14] RDS: IB: handle rds_ibdev release case instead of crashing the kernel
From: Santosh Shilimkar

Just in case we are still handling the QP receive completion while the
rds_ibdev is released, drop the connection instead of crashing the kernel.

Signed-off-by: Santosh Shilimkar
---
 net/rds/ib_cm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 8f51d0d..2b2370e 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -285,7 +285,8 @@ static void rds_ib_tasklet_fn_recv(unsigned long data)
 	struct rds_ib_device *rds_ibdev = ic->rds_ibdev;
 	struct rds_ib_ack_state state;

-	BUG_ON(!rds_ibdev);
+	if (!rds_ibdev)
+		rds_conn_drop(conn);

 	rds_ib_stats_inc(s_ib_tasklet_call);

--
1.9.1
Re: [RFC PATCH 0/3] net: dsa: Complete and fix the dsa unbinding
On 30/09/15 01:21, Neil Armstrong wrote:
> In order to cleanly unbind the dsa core, either as a module removal,
> or a platform device unbind, switch the allocation the their devm_
> counterparts and complete the destroy functions.
>
> The last patch is an experimental way to exit the probe when no
> switch is found in the discover process.
>
> The patches are based on the current net-next.

I looked at the patches and they move DSA in a better direction. For
future submissions, could you CC people who recently worked on DSA, like
Andrew Lunn, Guenter Roeck, Vivien Didelot and myself? We can typically
give your patches a try fairly quickly.

In case you are seriously considering making DSA a loadable module,
there was an earlier attempt here:

http://comments.gmane.org/gmane.linux.network/345803

Thanks!

>
> Neil Armstrong (3):
>   net: dsa: Use devm_ prefixed allocations
>   net: dsa: complete dsa_switch_destroy calls
>   net: dsa: exit probe if no switch were found
>
>  net/dsa/dsa.c | 67
> ---
>  1 file changed, 60 insertions(+), 7 deletions(-)
>

--
Florian
[PATCH net-next 5/5] bridge: vlan: don't pass flags when creating context only
From: Nikolay Aleksandrov

We should not pass the original flags when creating a context-only vlan,
because they may contain some flags that change behaviour in the bridge.
The new global context should have a minimal set of flags, so pass 0 and
let br_vlan_add() set the master flag only.

Signed-off-by: Nikolay Aleksandrov
---
 net/bridge/br_vlan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index 7e9d60a402e2..75214a51cf0e 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -197,7 +197,7 @@ static int __vlan_add(struct net_bridge_vlan *v, u16 flags)
 		masterv = br_vlan_find(br->vlgrp, v->vid);
 		if (!masterv) {
 			/* missing global ctx, create it now */
-			err = br_vlan_add(br, v->vid, master_flags);
+			err = br_vlan_add(br, v->vid, 0);
 			if (err)
 				goto out_filt;
 			masterv = br_vlan_find(br->vlgrp, v->vid);
--
2.4.3
Re: [PATCH v3 4/5] kcmp: add KCMP_FILE_PRIVATE_DATA
On Wed, Sep 30, 2015 at 11:41 AM, Tycho Andersen wrote:
> On Wed, Sep 30, 2015 at 11:25:41AM -0700, Andy Lutomirski wrote:
>> On Wed, Sep 30, 2015 at 11:13 AM, Tycho Andersen
>> wrote:
>> > This command allows comparing the underling private data of two fds. This
>> > is useful e.g. to find out if a seccomp filter is inherited, since struct
>> > seccomp_filter are unique across tasks and are the private_data seccomp
>> > fds.
>>
>> This is very implementation-specific and may have nasty ABI
>> consequences far outside seccomp. Let's do something specific to
>> seccomp and/or eBPF.
>
> We could change the name to a less generic KCMP_SECCOMP_FD or
> something, but without some sort of GUID on each struct
> seccomp_filter, the implementation would be effectively the same as it
> is today. Is that enough, or do we need a GUID?
>

I don't care about the GUID. I think we should name it
KCMP_SECCOMP_FD and make it only work on seccomp fds.

Alternatively, we could figure out why KCMP_FILE doesn't do the trick
and consider fixing it. IMO it's really too bad that struct file is
so heavyweight that we can't really just embed one in all kinds of
structures.

--Andy
[PATCH v2 00/14] RDS: connection scalability and performance improvements
[v2]:
Dropped "[PATCH 05/15] RDS: increase size of hash-table to 8K" from
earlier version [1]. I plan to address the hash table scalability using
re-sizable hash tables as suggested by David Laight and David Miller [2]

This series addresses RDS connection bottlenecks on massive workloads and
improves the RDMA performance by almost 3X. RDS TCP also gets a small gain
of about 12%.

RDS is being used in massive systems with high scalability where several
hundred thousand end points and tens of thousands of local processes
are operating in tens of thousands of sockets. Being RC (reliable
connection), socket bind and release happen very often and any
inefficiencies in bind hash lookups hurt the overall system performance.
The RDS bind hash-table uses a global spin-lock, which is the biggest
bottleneck. To make matters worse, it uses RCU inside the global lock for
hash buckets. This is being addressed by simply using a per-bucket rw lock,
which makes the locking simple and very efficient. The hash table size is
still an issue and I plan to address it by using re-sizable hash tables as
suggested on the list.

For RDS RDMA improvement, the completion handling is revamped so that we
can do batch completions. Both send and receive completion handlers are
split logically to achieve the same. RDS 8K messages being one of the
key use cases, the mr pool is adapted to have 8K mrs along with the
default 1M mrs. And while doing this, a few fixes and a couple of
bottlenecks seen with rds_sendmsg() are addressed.

Series applies against 4.3-rc1 as well as net-next. It's tested on Oracle
hardware with IB fabric for both bcopy as well as RDMA mode. RDS TCP is
tested with iXGB NIC. Like last time, iWARP transport is untested with
these changes.
The patchset is also available at the git repo below:

   git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git net/rds/4.3-v2

As a side note, the IB HCA driver I used for testing is missing at least 3
important patches upstream that are needed to see the full-blown IB
performance, and I am hoping to get that in mainline with help of them.

Santosh Shilimkar (14):
  RDS: use kfree_rcu in rds_ib_remove_ipaddr
  RDS: make socket bind/release locking scheme simple and more efficient
  RDS: fix rds_sock reference bug while doing bind
  RDS: Use per-bucket rw lock for bind hash-table
  RDS: defer the over_batch work to send worker
  RDS: use rds_send_xmit() state instead of RDS_LL_SEND_FULL
  RDS: IB: ack more receive completions to improve performance
  RDS: IB: split send completion handling and do batch ack
  RDS: IB: handle rds_ibdev release case instead of crashing the kernel
  RDS: IB: fix the rds_ib_fmr_wq kick call
  RDS: IB: use already available pool handle from ibmr
  RDS: IB: mark rds_ib_fmr_wq static
  RDS: IB: use max_mr from HCA caps than max_fmr
  RDS: IB: split mr pool to improve 8K messages performance

 net/rds/af_rds.c | 8 +---
 net/rds/bind.c | 76 ++
 net/rds/ib.c | 47 --
 net/rds/ib.h | 78 +++---
 net/rds/ib_cm.c | 114 ++--
 net/rds/ib_rdma.c | 116 ++---
 net/rds/ib_recv.c | 136 +++--
 net/rds/ib_send.c | 110 ---
 net/rds/ib_stats.c | 22 +
 net/rds/rds.h | 1 +
 net/rds/send.c | 15 --
 net/rds/threads.c | 2 +
 12 files changed, 445 insertions(+), 280 deletions(-)

--
1.9.1

Regards,
Santosh

[1] https://lkml.org/lkml/2015/9/19/384
[2] https://lkml.org/lkml/2015/9/21/828
Re: [PATCH v2 0/7] net: mvneta: Switch to per-CPU irq and make rxq_def useful
On Wed, 30 Sep 2015, David Miller wrote:
> From: Thomas Gleixner
> Date: Wed, 30 Sep 2015 16:56:06 +0200 (CEST)
>
> > On Tue, 29 Sep 2015, David Miller wrote:
> >> From: Gregory CLEMENT
> >> Date: Fri, 25 Sep 2015 18:09:31 +0200
> >>
> >> > As stated in the first version: "this patchset reworks the Marvell
> >> > neta driver in order to really support its per-CPU interrupts, instead
> >> > of faking them as SPI, and allow the use of any RX queue instead of
> >> > the hardcoded RX queue 0 that we have currently."
> >>
> >> Series applied, thanks.
> >
> > You could have had the courtesy to wait for an ack for the core irq
> > parts at least
>
> Sorry, my impression was that those parts were already discussed and
> agreed upon.

No problem. I would have preferred to merge them to a separate branch
which you could have pulled so we don't end up with conflicts on
further changes in that area. But it's ok as it is. The patches are
good to go.

Thanks,

	tglx
Re: [PATCH wpan-tools 1/2] security: add nl802154 security support
Hi, On Wed, Sep 30, 2015 at 04:46:30PM +0200, Stefan Schmidt wrote: > Hello. > > A really huge patch. I will start on it. Not sure I can do a full review in > one go though. > > On 28/09/15 09:25, Alexander Aring wrote: > >This patch introduce support for the experimental seucirty support for > > Type. Security. > >nl802154. We currently support add/del settings for manipulating > >security table entries. The dump functionality is a "really" keep it > > is really a > >short and stupid handling, the dump will printout the printout the right > > dump will printout the right calls to add the entry ok. > >add calls which was called to add the entry. This can be used for > >storing the current security tables by some script. The interface > >argument is replaced by $WPAN_DEV variable, so it's possible to move one > >interface configuration to another one. > > > >Signed-off-by: Alexander Aring> >--- > > src/Makefile.am |1 + > > src/interface.c | 100 + > > src/nl802154.h | 191 ++ > > src/security.c | 1118 > > +++ > > 4 files changed, 1410 insertions(+) > > create mode 100644 src/security.c > > > >diff --git a/src/Makefile.am b/src/Makefile.am > >index 2d54576..b2177a2 100644 > >--- a/src/Makefile.am > >+++ b/src/Makefile.am > >@@ -9,6 +9,7 @@ iwpan_SOURCES = \ > > interface.c \ > > phy.c \ > > mac.c \ > >+security.c \ > > nl_extras.h \ > > nl802154.h > >diff --git a/src/interface.c b/src/interface.c > >index 85d40a8..076e7c3 100644 > >--- a/src/interface.c > >+++ b/src/interface.c > >@@ -10,6 +10,7 @@ > > #include > > #include > >+#define CONFIG_IEEE802154_NL802154_EXPERIMENTAL > > #include "nl802154.h" > > #include "nl_extras.h" > > #include "iwpan.h" > >@@ -226,6 +227,105 @@ static int print_iface_handler(struct nl_msg *msg, > >void *arg) > > if (tb_msg[NL802154_ATTR_ACKREQ_DEFAULT]) > > printf("%s\tackreq_default %d\n", indent, > > nla_get_u8(tb_msg[NL802154_ATTR_ACKREQ_DEFAULT])); > >+if (tb_msg[NL802154_ATTR_SEC_ENABLED]) > >+printf("%s\tsecurity %d\n", indent, > 
>nla_get_u8(tb_msg[NL802154_ATTR_SEC_ENABLED]));
> >+if (tb_msg[NL802154_ATTR_SEC_OUT_LEVEL])
> >+printf("%s\tout_level %d\n", indent,
> >nla_get_u8(tb_msg[NL802154_ATTR_SEC_OUT_LEVEL]));
> >+if (tb_msg[NL802154_ATTR_SEC_OUT_KEY_ID]) {
> >+struct nlattr *tb_key_id[NL802154_KEY_ID_ATTR_MAX + 1];
> >+static struct nla_policy key_id_policy[NL802154_KEY_ID_ATTR_MAX
> >+ 1] = {
> >+[NL802154_KEY_ID_ATTR_MODE] = { .type = NLA_U32 },
> >+[NL802154_KEY_ID_ATTR_INDEX] = { .type = NLA_U8 },
> >+[NL802154_KEY_ID_ATTR_IMPLICIT] = { .type = NLA_NESTED
> >},
> >+[NL802154_KEY_ID_ATTR_SOURCE_SHORT] = { .type = NLA_U32
> >},
> >+[NL802154_KEY_ID_ATTR_SOURCE_EXTENDED] = { .type =
> >NLA_U64 },
> >+};
> >+
> >+nla_parse_nested(tb_key_id, NL802154_KEY_ID_ATTR_MAX,
> >+ tb_msg[NL802154_ATTR_SEC_OUT_KEY_ID],
> >key_id_policy);
> >+printf("%s\tout_key_id\n", indent);
> >+
> >+if (tb_key_id[NL802154_KEY_ID_ATTR_MODE]) {
> >+enum nl802154_key_id_modes key_id_mode;
> >+
> >+key_id_mode =
> >nla_get_u32(tb_key_id[NL802154_KEY_ID_ATTR_MODE]);

...

> >+enum nl802154_dev_addr_modes {
> >+NL802154_DEV_ADDR_NONE,
> >+__NL802154_DEV_ADDR_INVALID,
> >+NL802154_DEV_ADDR_SHORT,
> >+NL802154_DEV_ADDR_EXTENDED,
> >+
> >+/* keep last */
> >+__NL802154_DEV_ADDR_AFTER_LAST,
>
> Hmm, why bother with AFTER_LAST here and not just use ADDR_MAX as sentinal
> for this enum? Looks redundant to me.
>

At first I wanted to stay close to the wireless nl80211 userspace uapi
header, which declares this hidden __FOOBAR enum value in "mostly" every
one of its enum declarations. See [0]; I simply adopted this convention
for nl802154.

The reason is probably that they want some automatic mechanism to
increment the MAX value. Also, it differs whether you declare an array
for a netlink policy [1] or give the length argument for parsing [2],
which sometimes results in off-by-one errors.

...
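The convention described above can be shown with a small stand-alone
example. The enum and table below are hypothetical (demo_ names, not taken
from nl802154.h or nl80211.h); they only illustrate how the hidden
AFTER_LAST value lets MAX follow automatically when entries are added, and
why arrays need MAX + 1 slots while range checks compare against MAX.

```c
/* Hypothetical enum following the nl80211-style sentinel convention. */
enum demo_addr_modes {
	DEMO_ADDR_NONE,
	DEMO_ADDR_SHORT,
	DEMO_ADDR_EXTENDED,

	/* keep last */
	__DEMO_ADDR_AFTER_LAST,
	DEMO_ADDR_MAX = __DEMO_ADDR_AFTER_LAST - 1
};

/* A table indexed by the enum needs MAX + 1 entries (0 .. MAX). */
static const char *const demo_addr_names[DEMO_ADDR_MAX + 1] = {
	[DEMO_ADDR_NONE] = "none",
	[DEMO_ADDR_SHORT] = "short",
	[DEMO_ADDR_EXTENDED] = "extended",
};

/* A range check, by contrast, compares against MAX itself. */
const char *demo_addr_name(int mode)
{
	if (mode < 0 || mode > DEMO_ADDR_MAX)
		return "invalid";
	return demo_addr_names[mode];
}
```

Adding a new mode before the "keep last" comment moves both
__DEMO_ADDR_AFTER_LAST and DEMO_ADDR_MAX without any other edit, and the
array size follows along.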
> >+ > >+static int handle_out_key_id_set(struct nl802154_state *state, struct nl_cb > >*cb, > >+ struct nl_msg *msg, int argc, char **argv, > >+ enum id_input id) > >+{ > >+return handle_parse_key_id(msg, NL802154_ATTR_SEC_OUT_KEY_ID, , > >); > >+ > >+} > >+COMMAND(set, out_key_id, > >+"<0 <2 |3 >>|" > >+"<1 >|" > >+"<2 >|" > >+"<3 >", > > What are these extra >>| for ? > The numbers are acutally the enums value which is usually some specific mode, in this case the key_id_mode. Of course each of them has a proper name and we should add some helper functions to map these enums to a string. The '>' should
RE: [PATCHv2 net-next 2/4] cxgb4: For T4, don't read the Firmware Mailbox Control register
Hari,

  I think you missed the corresponding change that's needed for the const
char *owner[] array. You need to add an "" entry so the index of "4" makes
sense.

Casey


From: Hariprasad Shenai [haripra...@chelsio.com]
Sent: Wednesday, September 30, 2015 8:03 AM
To: netdev@vger.kernel.org
Cc: da...@davemloft.net; Casey Leedom; Nirranjan Kirubaharan; Hariprasad S
Subject: [PATCHv2 net-next 2/4] cxgb4: For T4, don't read the Firmware Mailbox Control register

T4 doesn't have the Shadow copy of the register which we can read without
side effect. So don't read mbox control register for T4 adapter

Signed-off-by: Hariprasad Shenai
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
index 0a87a32..8001619 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
@@ -1134,12 +1134,20 @@ static int mbox_show(struct seq_file *seq, void *v)
 	unsigned int mbox = (uintptr_t)seq->private & 7;
 	struct adapter *adap = seq->private - mbox;
 	void __iomem *addr = adap->regs + PF_REG(mbox, CIM_PF_MAILBOX_DATA_A);
-	unsigned int ctrl_reg = (is_t4(adap->params.chip)
-				 ? CIM_PF_MAILBOX_CTRL_A
-				 : CIM_PF_MAILBOX_CTRL_SHADOW_COPY_A);
-	void __iomem *ctrl = adap->regs + PF_REG(mbox, ctrl_reg);

-	i = MBOWNER_G(readl(ctrl));
+	/* For T4 we don't have a shadow copy of the Mailbox Control register.
+	 * And since reading that real register causes a side effect of
+	 * granting ownership, we're best of simply not reading it at all.
+	 */
+	if (is_t4(adap->params.chip)) {
+		i = 4; /* index of "" */
+	} else {
+		unsigned int ctrl_reg = CIM_PF_MAILBOX_CTRL_SHADOW_COPY_A;
+		void __iomem *ctrl = adap->regs + PF_REG(mbox, ctrl_reg);
+
+		i = MBOWNER_G(readl(ctrl));
+	}
+
 	seq_printf(seq, "mailbox owned by %s\n\n", owner[i]);

 	for (i = 0; i < MBOX_LEN; i += 8)
--
2.3.4
Re: [PATCH v3 2/5] seccomp: add the concept of a seccomp filter FD
Hi Tycho,

[auto build test results on v4.3-rc3 -- if it's inappropriate base, please ignore]

config: i386-alldefconfig (attached as .config)
reproduce:
  git checkout 9613ae6bf5f111701614acb3eda3123d21a59239
  # save the attached .config to linux build tree
  make ARCH=i386

All error/warnings (new ones prefixed by >>):

>> kernel/seccomp.c:998:10: error: expected ';', ',' or ')' before 'const'
     const char __user *filter)
     ^
   kernel/seccomp.c: In function 'do_seccomp':
>> kernel/seccomp.c:1016:10: error: implicit declaration of function 'seccomp_filter_fd' [-Werror=implicit-function-declaration]
     return seccomp_filter_fd(flags, uargs);
     ^
   cc1: some warnings being treated as errors

vim +998 kernel/seccomp.c

   992          const char __user *filter)
   993  {
   994          return -EINVAL;
   995  }
   996
   997  static inline long seccomp_filter_fd(unsigned int flags
 > 998          const char __user *filter)
   999  {
  1000          return -EINVAL;
  1001  }
  1002  #endif
  1003
  1004  /* Common entry point for both prctl and syscall. */
  1005  static long do_seccomp(unsigned int op, unsigned int flags,
  1006          const char __user *uargs)
  1007  {
  1008          switch (op) {
  1009          case SECCOMP_SET_MODE_STRICT:
  1010                  if (flags != 0 || uargs != NULL)
  1011                          return -EINVAL;
  1012                  return seccomp_set_mode_strict();
  1013          case SECCOMP_SET_MODE_FILTER:
  1014                  return seccomp_set_mode_filter(flags, uargs);
  1015          case SECCOMP_FILTER_FD:
> 1016                  return seccomp_filter_fd(flags, uargs);
  1017          default:
  1018                  return -EINVAL;
  1019          }

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
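Going by the listing in the report, the first error appears to come from a missing comma after "unsigned int flags" in the stub at line 997; the second error would then just be a consequence of the stub failing to parse. A minimal user-space sketch of the corrected stub follows. The "#define __user" and demo_selfcheck() are assumptions added only so that the fragment compiles outside the kernel tree; they are not part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* In the kernel, __user is a sparse annotation. It is defined away here
 * (an assumption of this sketch) so the stub compiles in user space. */
#ifndef __user
#define __user
#endif

/* Stub as quoted in the report, with the comma after "flags" restored. */
static inline long seccomp_filter_fd(unsigned int flags,
				     const char __user *filter)
{
	(void)flags;
	(void)filter;
	return -EINVAL;
}

/* Hypothetical self-check, not part of the patch. */
static void demo_selfcheck(void)
{
	assert(seccomp_filter_fd(0, NULL) == -EINVAL);
}
```

With the comma in place the declaration parses, so the call in do_seccomp() should no longer count as an implicit declaration.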
Re: Rate limiting AP bandwidth change messages in ieee80211_config_bw?
On Wed, 2015-09-30 at 13:02 -0400, Josh Boyer wrote:
> Hi Johannes,
>
> We've seen a handful of reports that seem to have verbose output from
> the ieee80211_config_bw function in net/mac80211/mlme.c. It looks
> similar to this:
>
> [ 66.578652] wlp3s0: AP xx:xx:xx:xx:xx changed bandwidth, new config
> is 2437 MHz, width 2 (2447/0 MHz)
> [ 68.522437] wlp3s0: AP xx:xx:xx:xx:xx changed bandwidth, new config
> is 2437 MHz, width 1 (2437/0 MHz)
>
> Essentially, this looks like the AP is changing the bandwidth (and
> only the width) every second or so. Why it is doing this, I'm not
> sure. However, this doesn't seem to actually be an error case yet the
> kernel logs are getting spammed with this message.
>
> I'm wondering if we could either change this message to use sdata_dbg
> instead of sdata_info, or if we could possibly ratelimit it somehow.
> I'd be happy to come up with a patch for either, but I wanted to get
> your feedback on it before I started. Do you have any objections or
> preference?
>

I'm not sure ratelimiting it would even work - it's not *that* high
frequency? Not really sure though.

I think we can do either, it's not such a terribly important message as
far as I can tell.

johannes
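To make the frequency concern concrete, here is a small user-space sketch of a window-plus-burst rate limiter. It is not the mac80211 or printk code; all demo_ names are hypothetical, and the window of 5 seconds with a burst of 10 is an assumption chosen to resemble the kernel's default printk ratelimit as far as I recall. Under those assumptions, a message arriving every two seconds, as in the log above, would never be suppressed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limiter state: at most "burst" messages per window. */
struct demo_ratelimit {
	uint64_t interval_ms;		/* window length */
	unsigned int burst;		/* messages allowed per window */
	uint64_t window_start_ms;	/* start of the current window */
	unsigned int printed;		/* messages let through in this window */
};

/* Return true if a message arriving at now_ms may be printed. */
static bool demo_ratelimit_allow(struct demo_ratelimit *rl, uint64_t now_ms)
{
	assert(rl != NULL);

	/* Open a fresh window once the current one has expired. */
	if (now_ms - rl->window_start_ms >= rl->interval_ms) {
		rl->window_start_ms = now_ms;
		rl->printed = 0;
	}

	if (rl->printed >= rl->burst)
		return false;

	rl->printed++;
	return true;
}

/* Feed n messages spaced period_ms apart; count how many get through. */
static unsigned int demo_count_allowed(struct demo_ratelimit *rl,
				       uint64_t period_ms, unsigned int n)
{
	unsigned int i, allowed = 0;

	for (i = 0; i < n; i++)
		if (demo_ratelimit_allow(rl, (uint64_t)i * period_ms))
			allowed++;

	return allowed;
}
```

A much shorter burst or a longer window would be needed before such a limiter had any effect on a message of this frequency, which is one argument for the sdata_dbg route instead.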
[PATCH net-next 2/4] ravb: Provide dev parameter to DMA API
From: Kazuya MizuguchiThis patch is in preparation for using this driver on arm64 where the implementation of __dma_alloc_coherent fails if a device parameter is not provided. Signed-off-by: Kazuya Mizuguchi Signed-off-by: Yoshihiro Shimoda Signed-off-by: Masaru Nagai [horms: squashed into a single patch] Signed-off-by: Simon Horman --- * [horms] I have only tested this on arm64 using r8a7795/salvator-x. v0 [Kazuya Mizuguchi, Yoshihiro Shimoda, Masaru Nagai] v1 [Simon Horman] * Squashed into a single patch v2 [Simon Horman] * No change v4 * No change --- drivers/net/ethernet/renesas/ravb_main.c | 38 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c index 450899e9cea2..4ca093d033f8 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c +++ b/drivers/net/ethernet/renesas/ravb_main.c @@ -201,7 +201,7 @@ static void ravb_ring_free(struct net_device *ndev, int q) if (priv->rx_ring[q]) { ring_size = sizeof(struct ravb_ex_rx_desc) * (priv->num_rx_ring[q] + 1); - dma_free_coherent(NULL, ring_size, priv->rx_ring[q], + dma_free_coherent(ndev->dev.parent, ring_size, priv->rx_ring[q], priv->rx_desc_dma[q]); priv->rx_ring[q] = NULL; } @@ -209,7 +209,7 @@ static void ravb_ring_free(struct net_device *ndev, int q) if (priv->tx_ring[q]) { ring_size = sizeof(struct ravb_tx_desc) * (priv->num_tx_ring[q] * NUM_TX_DESC + 1); - dma_free_coherent(NULL, ring_size, priv->tx_ring[q], + dma_free_coherent(ndev->dev.parent, ring_size, priv->tx_ring[q], priv->tx_desc_dma[q]); priv->tx_ring[q] = NULL; } @@ -240,13 +240,13 @@ static void ravb_ring_format(struct net_device *ndev, int q) rx_desc = >rx_ring[q][i]; /* The size of the buffer should be on 16-byte boundary. 
*/ rx_desc->ds_cc = cpu_to_le16(ALIGN(PKT_BUF_SZ, 16)); - dma_addr = dma_map_single(>dev, priv->rx_skb[q][i]->data, + dma_addr = dma_map_single(ndev->dev.parent, priv->rx_skb[q][i]->data, ALIGN(PKT_BUF_SZ, 16), DMA_FROM_DEVICE); /* We just set the data size to 0 for a failed mapping which * should prevent DMA from happening... */ - if (dma_mapping_error(>dev, dma_addr)) + if (dma_mapping_error(ndev->dev.parent, dma_addr)) rx_desc->ds_cc = cpu_to_le16(0); rx_desc->dptr = cpu_to_le32(dma_addr); rx_desc->die_dt = DT_FEMPTY; @@ -309,7 +309,7 @@ static int ravb_ring_init(struct net_device *ndev, int q) /* Allocate all RX descriptors. */ ring_size = sizeof(struct ravb_ex_rx_desc) * (priv->num_rx_ring[q] + 1); - priv->rx_ring[q] = dma_alloc_coherent(NULL, ring_size, + priv->rx_ring[q] = dma_alloc_coherent(ndev->dev.parent, ring_size, >rx_desc_dma[q], GFP_KERNEL); if (!priv->rx_ring[q]) @@ -320,7 +320,7 @@ static int ravb_ring_init(struct net_device *ndev, int q) /* Allocate all TX descriptors. */ ring_size = sizeof(struct ravb_tx_desc) * (priv->num_tx_ring[q] * NUM_TX_DESC + 1); - priv->tx_ring[q] = dma_alloc_coherent(NULL, ring_size, + priv->tx_ring[q] = dma_alloc_coherent(ndev->dev.parent, ring_size, >tx_desc_dma[q], GFP_KERNEL); if (!priv->tx_ring[q]) @@ -443,7 +443,7 @@ static int ravb_tx_free(struct net_device *ndev, int q) size = le16_to_cpu(desc->ds_tagl) & TX_DS; /* Free the original skb. */ if (priv->tx_skb[q][entry / NUM_TX_DESC]) { - dma_unmap_single(>dev, le32_to_cpu(desc->dptr), + dma_unmap_single(ndev->dev.parent, le32_to_cpu(desc->dptr), size, DMA_TO_DEVICE); /* Last packet descriptor? */ if (entry % NUM_TX_DESC == NUM_TX_DESC - 1) { @@ -546,7 +546,7 @@ static bool ravb_rx(struct net_device *ndev, int *quota, int q) skb = priv->rx_skb[q][entry]; priv->rx_skb[q][entry] = NULL; - dma_unmap_single(>dev, le32_to_cpu(desc->dptr), + dma_unmap_single(ndev->dev.parent,
[PATCH net-next 1/4] phylib: Add phy_set_max_speed helper
Add a helper to allow ethernet drivers to limit the speed of a phy (that they are attached to). This mainly involves factoring out the business-end of of_set_phy_supported() and exporting a new symbol.

This code seems to be open coded in several places, in several different variants.

It is envisaged that this will be used in situations where setting the "max-speed" property in DT is not appropriate, e.g. because the maximum speed is not a property of the phy hardware.

Signed-off-by: Simon Horman
---
v2
* First post

v3
* As suggested by Florian Fainelli
  - Do not check for !IS_ENABLED(CONFIG_OF_MDIO) in __set_phy_supported.
    This is already done in of_set_phy_supported() and is not relevant
    to phy_set_max_speed()
  - Return -ENOTSUPP if 'max_speed' is not a known value
* As suggested by Sergei Shtylyov
  - White-space and comment enhancements.

v4
* No change
---
 drivers/net/phy/phy_device.c | 59 ++--
 include/linux/phy.h | 1 +
 2 files changed, 41 insertions(+), 19 deletions(-)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c index f761288abe66..383389146099 100644 --- a/drivers/net/phy/phy_device.c +++ b/drivers/net/phy/phy_device.c @@ -1239,6 +1239,44 @@ static int gen10g_resume(struct phy_device *phydev) return 0; } +static int __set_phy_supported(struct phy_device *phydev, u32 max_speed) +{ + /* The default values for phydev->supported are provided by the PHY +* driver "features" member, we want to reset to sane defaults first +* before supporting higher speeds. 
+*/ + phydev->supported &= PHY_DEFAULT_FEATURES; + + switch (max_speed) { + default: + return -ENOTSUPP; + case SPEED_1000: + phydev->supported |= PHY_1000BT_FEATURES; + /* fall through */ + case SPEED_100: + phydev->supported |= PHY_100BT_FEATURES; + /* fall through */ + case SPEED_10: + phydev->supported |= PHY_10BT_FEATURES; + } + + return 0; +} + +int phy_set_max_speed(struct phy_device *phydev, u32 max_speed) +{ + int err; + + err = __set_phy_supported(phydev, max_speed); + if (err) + return err; + + phydev->advertising = phydev->supported; + + return 0; +} +EXPORT_SYMBOL(phy_set_max_speed); + static void of_set_phy_supported(struct phy_device *phydev) { struct device_node *node = phydev->dev.of_node; @@ -1250,25 +1288,8 @@ static void of_set_phy_supported(struct phy_device *phydev) if (!node) return; - if (!of_property_read_u32(node, "max-speed", _speed)) { - /* The default values for phydev->supported are provided by the PHY -* driver "features" member, we want to reset to sane defaults fist -* before supporting higher speeds. 
-*/ - phydev->supported &= PHY_DEFAULT_FEATURES; - - switch (max_speed) { - default: - return; - - case SPEED_1000: - phydev->supported |= PHY_1000BT_FEATURES; - case SPEED_100: - phydev->supported |= PHY_100BT_FEATURES; - case SPEED_10: - phydev->supported |= PHY_10BT_FEATURES; - } - } + if (!of_property_read_u32(node, "max-speed", _speed)) + __set_phy_supported(phydev, max_speed); } /** diff --git a/include/linux/phy.h b/include/linux/phy.h index 4a4e3a092337..4c477e6ece33 100644 --- a/include/linux/phy.h +++ b/include/linux/phy.h @@ -798,6 +798,7 @@ int phy_mii_ioctl(struct phy_device *phydev, struct ifreq *ifr, int cmd); int phy_start_interrupts(struct phy_device *phydev); void phy_print_status(struct phy_device *phydev); void phy_device_free(struct phy_device *phydev); +int phy_set_max_speed(struct phy_device *phydev, u32 max_speed); int phy_register_fixup(const char *bus_id, u32 phy_uid, u32 phy_uid_mask, int (*run)(struct phy_device *)); -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
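The fall-through switch in __set_phy_supported() can be tried out in user space with the sketch below. It only mirrors the control flow of the patch: the DEMO_ masks, the demo_phy structure and DEMO_ENOTSUPP are hypothetical stand-ins, not the real PHY_*_FEATURES values or struct phy_device from include/linux/phy.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel feature masks. */
#define DEMO_DEFAULT_FEATURES	0x100u
#define DEMO_10BT_FEATURES	0x001u
#define DEMO_100BT_FEATURES	0x002u
#define DEMO_1000BT_FEATURES	0x004u

/* Stand-in for the kernel-internal ENOTSUPP error code. */
#define DEMO_ENOTSUPP		524

enum { DEMO_SPEED_10 = 10, DEMO_SPEED_100 = 100, DEMO_SPEED_1000 = 1000 };

/* Hypothetical, much reduced version of struct phy_device. */
struct demo_phy {
	uint32_t supported;
	uint32_t advertising;
};

static int demo_set_supported(struct demo_phy *phy, uint32_t max_speed)
{
	assert(phy != NULL);

	/* Reset to the defaults first, as the patch does. */
	phy->supported &= DEMO_DEFAULT_FEATURES;

	switch (max_speed) {
	default:
		return -DEMO_ENOTSUPP;
	case DEMO_SPEED_1000:
		phy->supported |= DEMO_1000BT_FEATURES;
		/* fall through */
	case DEMO_SPEED_100:
		phy->supported |= DEMO_100BT_FEATURES;
		/* fall through */
	case DEMO_SPEED_10:
		phy->supported |= DEMO_10BT_FEATURES;
	}

	return 0;
}

static int demo_set_max_speed(struct demo_phy *phy, uint32_t max_speed)
{
	int err = demo_set_supported(phy, max_speed);

	if (err)
		return err;

	/* Advertise exactly what is now supported. */
	phy->advertising = phy->supported;
	return 0;
}
```

Because each case falls through to the slower ones, asking for 100 Mbit/s enables the 100 and 10 Mbit/s modes but not the 1000 Mbit/s mode. Note that in this sketch, as in the patch, an unknown speed still resets "supported" to the defaults before the error is returned.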
[PATCH net-next 4/4] ravb: Add support for r8a7795 SoC
From: Kazuya MizuguchiThis patch supports the r8a7795 SoC by: - Using two interrupts + One for E-MAC + One for everything else + Both can be handled by the existing common interrupt handler, which affords a simpler update to support the new SoC. In future some consideration may be given to implementing multiple interrupt handlers - Limiting the phy speed to 100Mbit/s for the new SoC; at this time it is not clear how this restriction may be lifted but I hope it will be possible as more information comes to light Signed-off-by: Kazuya Mizuguchi [horms: reworked] Signed-off-by: Simon Horman --- v0 [Kazuya Mizuguchi] v1 [Simon Horman] * Updated patch subject v2 [Simon Horman] * Reworked based on extensive feedback from Geert Uytterhoeven and Sergei Shtylyov. * Broke binding update out into separate patch v3 [Simon Horman] * Check new return value of phy_set_max_speed() v4 * No change --- drivers/net/ethernet/renesas/ravb.h | 7 drivers/net/ethernet/renesas/ravb_main.c | 63 2 files changed, 62 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h index a157ff6a..0623fff932e4 100644 --- a/drivers/net/ethernet/renesas/ravb.h +++ b/drivers/net/ethernet/renesas/ravb.h @@ -766,6 +766,11 @@ struct ravb_ptp { struct ravb_ptp_perout perout[N_PER_OUT]; }; +enum ravb_chip_id { + RCAR_GEN2, + RCAR_GEN3, +}; + struct ravb_private { struct net_device *ndev; struct platform_device *pdev; @@ -806,6 +811,8 @@ struct ravb_private { int msg_enable; int speed; int duplex; + int emac_irq; + enum ravb_chip_id chip_id; unsigned no_avb_link:1; unsigned avb_link_active_low:1; diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c index 4ca093d033f8..8cc5ec5ed19a 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c +++ b/drivers/net/ethernet/renesas/ravb_main.c @@ -889,6 +889,22 @@ static int ravb_phy_init(struct net_device *ndev) return -ENOENT; } + /* This driver only support 10/100Mbit 
speeds on Gen3 +* at this time. +*/ + if (priv->chip_id == RCAR_GEN3) { + int err; + + err = phy_set_max_speed(phydev, SPEED_100); + if (err) { + netdev_err(ndev, "failed to limit PHY to 100Mbit/s\n"); + phy_disconnect(phydev); + return err; + } + + netdev_info(ndev, "limited PHY to 100Mbit/s\n"); + } + netdev_info(ndev, "attached PHY %d (IRQ %d) to driver %s\n", phydev->addr, phydev->irq, phydev->drv->name); @@ -1197,6 +1213,15 @@ static int ravb_open(struct net_device *ndev) goto out_napi_off; } + if (priv->chip_id == RCAR_GEN3) { + error = request_irq(priv->emac_irq, ravb_interrupt, + IRQF_SHARED, ndev->name, ndev); + if (error) { + netdev_err(ndev, "cannot request IRQ\n"); + goto out_free_irq; + } + } + /* Device init */ error = ravb_dmac_init(ndev); if (error) @@ -1220,6 +1245,7 @@ out_ptp_stop: ravb_ptp_stop(ndev); out_free_irq: free_irq(ndev->irq, ndev); + free_irq(priv->emac_irq, ndev); out_napi_off: napi_disable(>napi[RAVB_NC]); napi_disable(>napi[RAVB_BE]); @@ -1625,10 +1651,20 @@ static int ravb_mdio_release(struct ravb_private *priv) return 0; } +static const struct of_device_id ravb_match_table[] = { + { .compatible = "renesas,etheravb-r8a7790", .data = (void *)RCAR_GEN2 }, + { .compatible = "renesas,etheravb-r8a7794", .data = (void *)RCAR_GEN2 }, + { .compatible = "renesas,etheravb-r8a7795", .data = (void *)RCAR_GEN3 }, + { } +}; +MODULE_DEVICE_TABLE(of, ravb_match_table); + static int ravb_probe(struct platform_device *pdev) { struct device_node *np = pdev->dev.of_node; + const struct of_device_id *match; struct ravb_private *priv; + enum ravb_chip_id chip_id; struct net_device *ndev; int error, irq, q; struct resource *res; @@ -1657,7 +1693,14 @@ static int ravb_probe(struct platform_device *pdev) /* The Ether-specific entries in the device structure. 
*/ ndev->base_addr = res->start; ndev->dma = -1; - irq = platform_get_irq(pdev, 0); + + match = of_match_device(of_match_ptr(ravb_match_table), >dev); + chip_id = (enum ravb_chip_id)match->data; + + if (chip_id == RCAR_GEN3) + irq = platform_get_irq_byname(pdev, "ch22"); + else + irq = platform_get_irq(pdev,
[PATCH net-next 3/4] ravb: Document binding for r8a7795 SoC
From: Kazuya MizuguchiThis patch updates the ravb binding to support the r8a7795 SoC by: - Adding a compat string for the new hardware - Adding 25 named interrupts to binding for the new SoC; older SoCs continue to use a single multiplexed interrupt The example is also updated to reflect the r8a7795 as this is the more complex case. Based on work by Kazuya Mizuguchi and others. Signed-off-by: Simon Horman Acked-by: Geert Uytterhoeven --- v2 * First post; broken out of a driver update patch * As discussed with Geert Uytterhoeven and Sergei Shtylyov - Binding: Make all interrupts mandatory as named-interrupts of the form ch%u v3 * A suggested by Geert Uytterhoeven - Reword description of interrupts and interrupt-names to make things clearer. It is now based to some extent on spi-rspi.txt and renesas,usb-dmac.txt. * As suggested by Sergei Shtylyov - Drop phy-reset-gpio from example * Added power-domains to example v4 * A suggested by Geert Uytterhoeven - grammar fix for interrupt-names description * Add ack --- .../devicetree/bindings/net/renesas,ravb.txt | 69 +++--- 1 file changed, 62 insertions(+), 7 deletions(-) diff --git a/Documentation/devicetree/bindings/net/renesas,ravb.txt b/Documentation/devicetree/bindings/net/renesas,ravb.txt index 1fd8831437bf..b486f3f5f6a3 100644 --- a/Documentation/devicetree/bindings/net/renesas,ravb.txt +++ b/Documentation/devicetree/bindings/net/renesas,ravb.txt @@ -6,8 +6,12 @@ interface contains. Required properties: - compatible: "renesas,etheravb-r8a7790" if the device is a part of R8A7790 SoC. "renesas,etheravb-r8a7794" if the device is a part of R8A7794 SoC. + "renesas,etheravb-r8a7795" if the device is a part of R8A7795 SoC. - reg: offset and length of (1) the register block and (2) the stream buffer. -- interrupts: interrupt specifier for the sole interrupt. +- interrupts: A list of interrupt-specifiers, one for each entry in + interrupt-names. 
+ If interrupt-names is not present, an interrupt specifier + for a single muxed interrupt. - phy-mode: see ethernet.txt file in the same directory. - phy-handle: see ethernet.txt file in the same directory. - #address-cells: number of address cells for the MDIO bus, must be equal to 1. @@ -18,6 +22,12 @@ Required properties: Optional properties: - interrupt-parent: the phandle for the interrupt controller that services interrupts for this device. +- interrupt-names: A list of interrupt names. + For the R8A7795 SoC this property is mandatory; + it should include one entry per channel, named "ch%u", + where %u is the channel number ranging from 0 to 24. + For other SoCs this property is optional; if present + it should contain "mux" for a single muxed interrupt. - pinctrl-names: pin configuration state name ("default"). - renesas,no-ether-link: boolean, specify when a board does not provide a proper AVB_LINK signal. @@ -27,13 +37,46 @@ Optional properties: Example: ethernet@e680 { - compatible = "renesas,etheravb-r8a7790"; - reg = <0 0xe680 0 0x800>, <0 0xee0e8000 0 0x4000>; + compatible = "renesas,etheravb-r8a7795"; + reg = <0 0xe680 0 0x800>, <0 0xe6a0 0 0x1>; interrupt-parent = <>; - interrupts = <0 163 IRQ_TYPE_LEVEL_HIGH>; - clocks = <_clks R8A7790_CLK_ETHERAVB>; - phy-mode = "rmii"; + interrupts = , +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +; + interrupt-names = "ch0", "ch1", "ch2", "ch3", + "ch4", "ch5", "ch6", "ch7", + "ch8", "ch9", "ch10", "ch11", + "ch12", "ch13", "ch14", "ch15", + "ch16", "ch17", "ch18", "ch19", + "ch20", "ch21", "ch22", "ch23", + "ch24"; + clocks = <_clks R8A7795_CLK_ETHERAVB>; +
[PATCH net-next 0/4] ravb: Add support for r8a7795 SoC
Dave, please consider this series for net-next. It enhances the ravb driver to support the r8a7795 SoC.

Changes:
* Dropped RFC prefix
* Details in changelog of individual patches

Base:
* net-next/master

Availability:

To aid review of this in conjunction with other EtherAVB changes the following branches are available in my renesas tree on kernel.org.

* me/r8a7795-ravb-driver-v4: this series
* me/r8a7795-ravb-pfc-v2: r8a7795 sh-pfc update for EthernetAVB
* me/r8a7795-ravb-integration-v4: enable EthernetAVB on r8a7795
* me/r8a7795-ravb-driver-and-integration-v4.runtime: the above three branches with their runtime dependencies

Kazuya Mizuguchi (3):
  ravb: Provide dev parameter to DMA API
  ravb: Document binding for r8a7795 SoC
  ravb: Add support for r8a7795 SoC

Simon Horman (1):
  phylib: Add phy_set_max_speed helper

 .../devicetree/bindings/net/renesas,ravb.txt | 69 --
 drivers/net/ethernet/renesas/ravb.h| 7 ++
 drivers/net/ethernet/renesas/ravb_main.c | 101 +++--
 drivers/net/phy/phy_device.c | 59
 include/linux/phy.h| 1 +
 5 files changed, 184 insertions(+), 53 deletions(-)

--
2.1.4
Re: [PATCH 1/5] ntp/pps: use timespec64 for hardpps()
On Mon, 28 Sep 2015, Arnd Bergmann wrote:
> There is only one user of the hardpps function in the kernel, so
> it makes sense to atomically change it over to using 64-bit
> timestamps for y2038 safety. In the hardpps implementation,
> we also need to change the pps_normtime structure, which is
> similar to struct timespec and also requires a 64-bit
> seconds portion.
>
> This introduces two temporary variables in pps_kc_event() to
> do the conversion, they will be removed again in the next step,
> which seemed preferable to having a larger patch changing it
> all at the same time.
>
> Signed-off-by: Arnd Bergmann

Reviewed-by: Thomas Gleixner
Re: [PATCH 2/5] ntp/pps: replace getnstime_raw_and_real with 64-bit version
On Mon, 28 Sep 2015, Arnd Bergmann wrote:
> There is exactly one caller of getnstime_raw_and_real in the kernel,
> which is the pps_get_ts function. This changes the caller and
> the implementation to work on timespec64 types rather than timespec,
> to avoid the time_t overflow on 32-bit architectures.
>
> For consistency with the other new functions (ktime_get_seconds,
> ktime_get_real_*, ...), I'm renaming the function to
> ktime_get_raw_and_real_ts64.
>
> We still need to convert from the internal 64-bit type to 32 bit
> types in the caller, but this conversion is now pushed out from
> getnstime_raw_and_real to pps_get_ts. A follow-up patch changes
> the remaining pps code to completely avoid the conversion.
>
> Signed-off-by: Arnd Bergmann

Reviewed-by: Thomas Gleixner
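As an illustration of the overflow being avoided, the following user-space sketch converts a timestamp with 64-bit seconds to one with 32-bit seconds. The demo_ types and the function are hypothetical and are not the kernel's timespec64 helpers; unlike a plain truncating conversion, this sketch reports when the value does not fit, which is a choice made here purely for illustration. The largest value that fits, INT32_MAX seconds after the epoch, corresponds to 2038-01-19 03:14:07 UTC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a timespec with 64-bit seconds. */
struct demo_timespec64 {
	int64_t tv_sec;
	long tv_nsec;
};

/* Hypothetical stand-in for a timespec with 32-bit seconds. */
struct demo_timespec32 {
	int32_t tv_sec;
	long tv_nsec;
};

/* Convert to the 32-bit form; return false if the seconds do not fit. */
static bool demo_to_timespec32(const struct demo_timespec64 *in,
			       struct demo_timespec32 *out)
{
	assert(in != NULL && out != NULL);

	if (in->tv_sec > INT32_MAX || in->tv_sec < INT32_MIN)
		return false;

	out->tv_sec = (int32_t)in->tv_sec;
	out->tv_nsec = in->tv_nsec;
	return true;
}
```

One second past INT32_MAX already fails the check, which is why keeping the 64-bit type all the way through, as the follow-up patch intends, is the safer end state.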
[RFC PATCH 2/3] net: dsa: complete dsa_switch_destroy calls
When unbinding dsa, complete the dsa_switch_destroy to cleanly destroy and unregister the net and mdio devices. Signed-off-by: Neil Armstrong--- net/dsa/dsa.c | 42 ++ 1 file changed, 42 insertions(+) diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c index 98f94c2..0c104af 100644 --- a/net/dsa/dsa.c +++ b/net/dsa/dsa.c @@ -22,6 +22,7 @@ #include #include #include +#include #include "dsa_priv.h" char dsa_driver_version[] = "0.1"; @@ -420,10 +421,51 @@ dsa_switch_setup(struct dsa_switch_tree *dst, int index, static void dsa_switch_destroy(struct dsa_switch *ds) { + struct device_node *port_dn; + struct phy_device *phydev; + struct dsa_chip_data *cd = ds->pd; + int port; + #ifdef CONFIG_NET_DSA_HWMON if (ds->hwmon_dev) hwmon_device_unregister(ds->hwmon_dev); #endif + + /* Disable configuration of the CPU and DSA ports */ + for (port = 0; port < DSA_MAX_PORTS; port++) { + if (!(dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port))) + continue; + + port_dn = cd->port_dn[port]; + if (of_phy_is_fixed_link(port_dn)) { + phydev = of_phy_find_device(port_dn); + if (phydev) { + int addr = phydev->addr; + phy_device_free(phydev); + of_node_put(port_dn); + fixed_phy_del(addr); + } + } + } + + /* +* Destroy network devices for physical switch ports. +*/ + for (port = 0; port < DSA_MAX_PORTS; port++) { + if (!(ds->phys_port_mask & (1 << port))) + continue; + + if (!ds->ports[port]) + continue; + + unregister_netdev(ds->ports[port]); + free_netdev(ds->ports[port]); + } + + /* +* Do basic unregister. +*/ + mdiobus_unregister(ds->slave_mii_bus); } #ifdef CONFIG_PM_SLEEP -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 0/3] net: dsa: Complete and fix the dsa unbinding
In order to cleanly unbind the dsa core, either as a module removal or a platform device unbind, switch the allocations to their devm_ counterparts and complete the destroy functions.

The last patch is an experimental way to exit the probe when no switch is found in the discovery process.

The patches are based on the current net-next.

Neil Armstrong (3):
  net: dsa: Use devm_ prefixed allocations
  net: dsa: complete dsa_switch_destroy calls
  net: dsa: exit probe if no switch were found

 net/dsa/dsa.c | 67 ---
 1 file changed, 60 insertions(+), 7 deletions(-)

--
1.9.1
[RFC PATCH 1/3] net: dsa: Use devm_ prefixed allocations
To simplify and prevent memory leakage when unbinding, use the devm_ memory allocation calls. Signed-off-by: Neil Armstrong--- net/dsa/dsa.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c index c59fa5d..98f94c2 100644 --- a/net/dsa/dsa.c +++ b/net/dsa/dsa.c @@ -305,7 +305,7 @@ static int dsa_switch_setup_one(struct dsa_switch *ds, struct device *parent) if (ret < 0) goto out; - ds->slave_mii_bus = mdiobus_alloc(); + ds->slave_mii_bus = devm_mdiobus_alloc(parent); if (ds->slave_mii_bus == NULL) { ret = -ENOMEM; goto out; @@ -400,7 +400,7 @@ dsa_switch_setup(struct dsa_switch_tree *dst, int index, /* * Allocate and initialise switch state. */ - ds = kzalloc(sizeof(*ds) + drv->priv_size, GFP_KERNEL); + ds = devm_kzalloc(parent, sizeof(*ds) + drv->priv_size, GFP_KERNEL); if (ds == NULL) return ERR_PTR(-ENOMEM); @@ -883,7 +883,7 @@ static int dsa_probe(struct platform_device *pdev) goto out; } - dst = kzalloc(sizeof(*dst), GFP_KERNEL); + dst = devm_kzalloc(>dev, sizeof(*dst), GFP_KERNEL); if (dst == NULL) { dev_put(dev); ret = -ENOMEM; -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 3/3] net: dsa: exit probe if no switch were found
If no switch were found in dsa_setup_dst, return -ENODEV and exit the dsa_probe cleanly. Signed-off-by: Neil Armstrong--- net/dsa/dsa.c | 19 +++ 1 file changed, 15 insertions(+), 4 deletions(-) diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c index 0c104af..6ae1ab9 100644 --- a/net/dsa/dsa.c +++ b/net/dsa/dsa.c @@ -844,10 +844,11 @@ static inline void dsa_of_remove(struct device *dev) } #endif -static void dsa_setup_dst(struct dsa_switch_tree *dst, struct net_device *dev, +static int dsa_setup_dst(struct dsa_switch_tree *dst, struct net_device *dev, struct device *parent, struct dsa_platform_data *pd) { int i; + unsigned configured = 0; dst->pd = pd; dst->master_netdev = dev; @@ -867,9 +868,17 @@ static void dsa_setup_dst(struct dsa_switch_tree *dst, struct net_device *dev, dst->ds[i] = ds; if (ds->drv->poll_link != NULL) dst->link_poll_needed = 1; + + ++configured; } /* +* If no switch was found, exit cleanly +*/ + if (!configured) + return -ENODEV; + + /* * If we use a tagging format that doesn't have an ethertype * field, make sure that all packets from this point on get * sent to the tag format's receive function. @@ -885,6 +894,8 @@ static void dsa_setup_dst(struct dsa_switch_tree *dst, struct net_device *dev, dst->link_poll_timer.expires = round_jiffies(jiffies + HZ); add_timer(>link_poll_timer); } + + return 0; } static int dsa_probe(struct platform_device *pdev) @@ -934,9 +945,9 @@ static int dsa_probe(struct platform_device *pdev) platform_set_drvdata(pdev, dst); - dsa_setup_dst(dst, dev, >dev, pd); - - return 0; + ret = dsa_setup_dst(dst, dev, >dev, pd); + if (!ret) + return 0; out: dsa_of_remove(>dev); -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 3/7] netfilter: add NF_INET_LOCAL_SOCKET_IN chain type
On 09/30/2015 09:40 AM, Jan Engelhardt wrote:
>
> On Wednesday 2015-09-30 09:24, Daniel Mack wrote:
>>
>>> Drop? Makes no sense, else application would not be running in the first
>>> place.
>>
>> Of course you can drop certain packets at this point, depending on other
>> details. Say, for instance, you want to match all packets that are
>> received by a certain task [...]
>> Another use case is accounting. If you want to know how much traffic a
>> certain service or application in your system has caused
>
> But the sk info would be available in INPUT already, would it not?

No, only for established connections, as those are subject to early demux
which sets skb->sk. For all other packets, netfilter callbacks are called
with skb->sk == NULL. That's the whole point of this patch set ;)

Daniel
Re: [PATCH RFC 3/7] netfilter: add NF_INET_LOCAL_SOCKET_IN chain type
On 09/29/2015 11:19 PM, Florian Westphal wrote:
> Daniel Mack wrote:
>> Add a new chain type NF_INET_LOCAL_SOCKET_IN which is ran after the
>> input demux is complete and the final destination socket (if any)
>> has been determined.
>>
>> This helps filtering packets based on information stored in the
>> destination socket, such as cgroup controller supplied net class IDs.
>
> This still seems like the 'x y' problem ("want to do X, think Y is
> correct solution; ask about Y, but thats a strange thing to do").
>
> There is nothing that this offers over INPUT *except* that sk is
> available. But there is zero benefit as far as I am concerned --
> why would you want to do any meaningful filtering based on the sk at
> that point...?

Well, INPUT and SOCKET_INPUT are just two different tools that help solve different classes of problems. INPUT is for filtering all local traffic, while SOCKET_INPUT is only for traffic that actually has a listener, and they both make sense in different scenarios.

> Drop? Makes no sense, else application would not be running in the first
> place.

Of course you can drop certain packets at this point, depending on other details. Say, for instance, you want to match all packets that are received by a certain task and that originate from IP addresses of a specific subnet, and drop the rest. Rather than adding matches to your global firewall configuration for all the ports that tasks may or may not listen on, you can just do it on a higher level, from the perspective of an administrator. If you decide to let your web server listen on another port as well, no firewall rule configuration change is needed at all.

Another use case is accounting. If you want to know how much traffic a certain service or application in your system has caused, you don't want to match all its ports to firewall rules just in order to get that information. Instead, you can now derive that information on a per-application basis. With this patch set, this even works just fine for multicast listeners, which is something that is currently impossible to achieve otherwise.

> So the only 'benefit' is that netcls id is available; but
> a) why is that even needed and

It's currently the only way of realizing application-level firewalls, and it'd be an awesome feature if it actually worked.

> b) is such a huge sledgehammer just for net cgroup accounting
> worth it?

I really don't know if this approach is intrusive enough to make it qualify as a sledgehammer. I'd like to see some real-world benchmarks and have proof there is a performance decrease for setups that don't use such chains.

> Another question is what other strange things come up once we would
> open this door.

So let's discuss the possible drawbacks. Again, the deal with this new chain type is simple: if there is no local listener, the rules are not looked at. If you need rules that are processed either way, put them in LOCAL_IN, as you always did.

>> listening on a specific task, the resulting error code that is sent
>> back to the remote peer can't be controlled with rules in
>> NF_INET_LOCAL_SOCKET_IN chains.
>
> Right, and that makes this even weirder.

Well, to be more specific: you can only control the resulting error code that is sent back to the remote peer _if_ there is a local listener. You can do _anything_ _if_ there is a local listener. This is in line with the above description and shouldn't cause many surprises for users.

> For deterministic ingress filtering you can only rely on what
> is contained in the packet.

Why so? For deterministic ingress filtering of traffic directed to a local socket, you can as well rely on information associated with that socket. And this is what application-level firewall rule sets are all about.

Daniel
Re: [PATCH 3/5] ntp: use timespec64 in sync_cmos_clock
On Mon, 28 Sep 2015, Arnd Bergmann wrote:
> The sync_cmos_clock has one use of struct timespec, which we want to
> eventually replace with timespec64 or similar in the kernel. There
> is no way this one can overflow, but the conversion to timespec64
> is trivial and has no other dependencies.
>
> Signed-off-by: Arnd Bergmann

Reviewed-by: Thomas Gleixner
Re: [PATCH net-next 1/2] openvswitch: add tunnel protocol to sw_flow_key
On Tue, 29 Sep 2015 13:41:34 -0700, Pravin Shelar wrote:
> We can add rather add TUNNEL_IPV6 flag to distinguish IPv4 and IPv6
> tunnel keys. This can be stored in ip_tunnel_key.tun_flags.

Not really. This was my original approach, too, but openvswitch is not
the only user of struct ip_tunnel_key, and in the lwtunnel core,
tun_flags are handled in a way that makes this impractical. Most
importantly, the tun_flags value is directly taken from/stored to
LWTUNNEL_IP_FLAGS/LWTUNNEL_IP6_FLAGS netlink attributes in
net/ipv4/ip_tunnel_core.c. This would mean complicated masking, etc.

> That also saves space in flow key.

The field was added to a 2 byte hole in the struct sw_flow_key (leaving
still 1 byte free), thus there's no additional space used.

 Jiri

--
Jiri Benc
[PATCH v5 net-next 0/2] ipv4: Hash-based multipath routing
When the routing cache was removed in 3.6, the IPv4 multipath algorithm
changed from more or less being destination-based into being quasi-random
per-packet scheduling. This increases the risk of out-of-order packets and
makes it impossible to use multipath together with anycast services.

This patch series replaces the old implementation with flow-based load
balancing based on a hash over the source and destination addresses.

Distribution of the hash is done with thresholds as described in RFC 2992.
This reduces the disruption when a path is added or removed when there are
more than two paths.

To further the chance of successful usage in conjunction with anycast, ICMP
error packets are hashed over the inner IP addresses. This ensures that PMTU
will work together with anycast or load-balancers such as IPVS.

Port numbers are not considered since fragments could cause problems with
anycast and IPVS. Relying on the DF-flag for TCP packets is also
insufficient, since ICMP inspection effectively extracts information from the
opposite flow which might have a different state of the DF-flag. This is also
why the RSS hash is not used. These are typically based on the NDIS RSS spec
which mandates TCP support.

Measurements of the additional overhead of a two-path multipath
(ip_mkroute_input excl. __mkroute_input) on a Xeon X3550 (4 cores, 2.66GHz):

Original per-packet: ~394 cycles/packet
L3 hash:              ~76 cycles/packet

Changes in v5:
- Fixed compilation error

Changes in v4:
- Functions take hash directly instead of func ptr
- Added inline hash function
- Added dummy macros to minimize ifdefs
- Use upper 31 bits of hash instead of lower

Changes in v3:
- Multipath algorithm is no longer configurable (always L3)
- Added random seed to hash
- Moved ICMP inspection to isolated function
- Ignore source quench packets (deprecated as per RFC 6633)

Changes in v2:
- Replaced 8-bit xor hash with 31-bit jenkins hash
- Don't scale weights (since 31-bit)
- Avoided unnecessary renaming of variables
- Rely on DF-bit instead of fragment offset when checking for fragmentation
- upper_bound is now inclusive to avoid overflow
- Use a callback to postpone extracting flow information until necessary
- Skipped ICMP inspection entirely with L4 hashing
- Handle newly added sysctl ignore_routes_with_linkdown

Best Regards
 Peter Nørlund

Peter Nørlund (2):
  ipv4: L3 hash-based multipath
  ipv4: ICMP packet inspection for multipath

 include/net/ip_fib.h     |  14
 include/net/route.h      |  11
 net/ipv4/fib_semantics.c | 140
 net/ipv4/icmp.c          |  19
 net/ipv4/route.c         |  65
 5 files changed, 173 insertions(+), 76 deletions(-)
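A minimal userspace sketch of the hash-threshold idea from RFC 2992 that the cover letter refers to, assuming a 31-bit non-negative hash and inclusive upper bounds as described in the changelog. The names struct nexthop, rebalance() and select_nexthop() are hypothetical and only illustrate the arithmetic; they are not the kernel's fib_nh or fib_select_multipath().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one next hop; not the kernel's struct fib_nh. */
struct nexthop {
	int weight;		/* configured weight, >= 1 */
	int alive;		/* 0 if the path is dead or its link is down */
	int32_t upper_bound;	/* inclusive end of this hop's hash region, -1 if unused */
};

/* Split the 31-bit hash space [0, 2^31 - 1] into regions proportional to weight. */
static void rebalance(struct nexthop *nh, size_t n)
{
	int64_t total = 0, w = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (nh[i].alive)
			total += nh[i].weight;

	for (i = 0; i < n; i++) {
		if (!nh[i].alive || total == 0) {
			nh[i].upper_bound = -1;	/* a non-negative hash never matches */
			continue;
		}
		w += nh[i].weight;
		/* Round to nearest, then subtract 1 so the bound is inclusive. */
		nh[i].upper_bound =
			(int32_t)((INT64_C(2147483648) * w + total / 2) / total - 1);
	}
}

/* Return the index of the first hop whose region contains the hash, or -1. */
static int select_nexthop(const struct nexthop *nh, size_t n, int32_t hash)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (hash <= nh[i].upper_bound)
			return (int)i;
	return -1;
}
```

Because each hop owns one contiguous region, removing a hop only moves the flows whose hashes fall near the shifted boundaries, which is the reduced disruption the cover letter mentions.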
[PATCH v5 net-next 2/2] ipv4: ICMP packet inspection for multipath
From: Peter NørlundICMP packets are inspected to let them route together with the flow they belong to, minimizing the chance that a problematic path will affect flows on other paths, and so that anycast environments can work with ECMP. Signed-off-by: Peter Nørlund --- include/net/route.h | 11 +- net/ipv4/icmp.c | 19 - net/ipv4/route.c| 59 +-- 3 files changed, 80 insertions(+), 9 deletions(-) diff --git a/include/net/route.h b/include/net/route.h index f46af25..7d79c05 100644 --- a/include/net/route.h +++ b/include/net/route.h @@ -28,6 +28,7 @@ #include #include #include +#include #include #include #include @@ -110,7 +111,15 @@ struct in_device; int ip_rt_init(void); void rt_cache_flush(struct net *net); void rt_flush_dev(struct net_device *dev); -struct rtable *__ip_route_output_key(struct net *, struct flowi4 *flp); +struct rtable *__ip_route_output_key_hash(struct net *, struct flowi4 *flp, + int mp_hash); + +static inline struct rtable *__ip_route_output_key(struct net *net, + struct flowi4 *flp) +{ + return __ip_route_output_key_hash(net, flp, -1); +} + struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp, struct sock *sk); struct dst_entry *ipv4_blackhole_route(struct net *net, diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c index e5eb8ac..b3a1620 100644 --- a/net/ipv4/icmp.c +++ b/net/ipv4/icmp.c @@ -440,6 +440,22 @@ out_unlock: icmp_xmit_unlock(sk); } +#ifdef CONFIG_IP_ROUTE_MULTIPATH + +/* Source and destination is swapped. 
See ip_multipath_icmp_hash */ +static int icmp_multipath_hash_skb(const struct sk_buff *skb) +{ + const struct iphdr *iph = ip_hdr(skb); + + return fib_multipath_hash(iph->daddr, iph->saddr); +} + +#else + +#define icmp_multipath_hash_skb(skb) (-1) + +#endif + static struct rtable *icmp_route_lookup(struct net *net, struct flowi4 *fl4, struct sk_buff *skb_in, @@ -464,7 +480,8 @@ static struct rtable *icmp_route_lookup(struct net *net, fl4->flowi4_oif = vrf_master_ifindex(skb_in->dev); security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4)); - rt = __ip_route_output_key(net, fl4); + rt = __ip_route_output_key_hash(net, fl4, + icmp_multipath_hash_skb(skb_in)); if (IS_ERR(rt)) return rt; diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 64367f3..a2479a4 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -1646,6 +1646,48 @@ out: return err; } +#ifdef CONFIG_IP_ROUTE_MULTIPATH + +/* To make ICMP packets follow the right flow, the multipath hash is + * calculated from the inner IP addresses in reverse order. 
+ */ +static int ip_multipath_icmp_hash(struct sk_buff *skb) +{ + const struct iphdr *outer_iph = ip_hdr(skb); + struct icmphdr _icmph; + const struct icmphdr *icmph; + struct iphdr _inner_iph; + const struct iphdr *inner_iph; + + if (unlikely((outer_iph->frag_off & htons(IP_OFFSET)) != 0)) + goto standard_hash; + + icmph = skb_header_pointer(skb, outer_iph->ihl * 4, sizeof(_icmph), + &_icmph); + if (!icmph) + goto standard_hash; + + if (icmph->type != ICMP_DEST_UNREACH && + icmph->type != ICMP_REDIRECT && + icmph->type != ICMP_TIME_EXCEEDED && + icmph->type != ICMP_PARAMETERPROB) { + goto standard_hash; + } + + inner_iph = skb_header_pointer(skb, + outer_iph->ihl * 4 + sizeof(_icmph), + sizeof(_inner_iph), &_inner_iph); + if (!inner_iph) + goto standard_hash; + + return fib_multipath_hash(inner_iph->daddr, inner_iph->saddr); + +standard_hash: + return fib_multipath_hash(outer_iph->saddr, outer_iph->daddr); +} + +#endif /* CONFIG_IP_ROUTE_MULTIPATH */ + static int ip_mkroute_input(struct sk_buff *skb, struct fib_result *res, const struct flowi4 *fl4, @@ -1656,7 +1698,10 @@ static int ip_mkroute_input(struct sk_buff *skb, if (res->fi && res->fi->fib_nhs > 1) { int h; - h = fib_multipath_hash(saddr, daddr); + if (unlikely(ip_hdr(skb)->protocol == IPPROTO_ICMP)) + h = ip_multipath_icmp_hash(skb); + else + h = fib_multipath_hash(saddr, daddr); fib_select_multipath(res, h); } #endif @@ -2042,7 +2087,8 @@ add: * Major route resolver routine. */ -struct rtable *__ip_route_output_key(struct net *net, struct
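The address selection in ip_multipath_icmp_hash() can be illustrated outside the kernel. The following is a rough sketch under stated assumptions: it works on a raw IPv4 packet in a byte buffer, uses the standard IPv4 and ICMP header offsets, and the names struct flow_addrs and icmp_hash_addrs() are hypothetical. It only picks the address pair that would be hashed; it does not compute the hash itself.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical result type: the address pair that would be fed to the hash. */
struct flow_addrs {
	uint32_t saddr;	/* network byte order */
	uint32_t daddr;	/* network byte order */
};

/* ICMP error types that embed the offending packet's IP header. */
static int icmp_type_is_error(uint8_t type)
{
	return type == 3 ||	/* destination unreachable */
	       type == 5 ||	/* redirect */
	       type == 11 ||	/* time exceeded */
	       type == 12;	/* parameter problem */
}

static struct flow_addrs icmp_hash_addrs(const uint8_t *pkt, size_t len)
{
	struct flow_addrs fa = { 0, 0 };
	size_t ihl;

	if (len < 20)
		return fa;

	/* Default: outer source and destination, in packet order. */
	memcpy(&fa.saddr, pkt + 12, 4);
	memcpy(&fa.daddr, pkt + 16, 4);

	ihl = (size_t)(pkt[0] & 0x0f) * 4;
	if (ihl < 20)
		return fa;
	if (pkt[9] != 1)				/* protocol is not ICMP */
		return fa;
	if ((pkt[6] & 0x1f) != 0 || pkt[7] != 0)	/* non-first fragment */
		return fa;
	if (len < ihl + 8 + 20)				/* no room for ICMP + inner header */
		return fa;
	if (!icmp_type_is_error(pkt[ihl]))
		return fa;

	/* The inner header describes the opposite direction, so swap. */
	memcpy(&fa.saddr, pkt + ihl + 8 + 16, 4);	/* inner daddr */
	memcpy(&fa.daddr, pkt + ihl + 8 + 12, 4);	/* inner saddr */
	return fa;
}
```

Swapping the inner addresses means an ICMP error travelling back towards a sender lands on the same hash value as the flow that triggered it.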
[PATCH v5 net-next 1/2] ipv4: L3 hash-based multipath
From: Peter NørlundReplaces the per-packet multipath with a hash-based multipath using source and destination address. Signed-off-by: Peter Nørlund --- include/net/ip_fib.h | 14 - net/ipv4/fib_semantics.c | 140 +- net/ipv4/route.c | 16 -- 3 files changed, 98 insertions(+), 72 deletions(-) diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index 727d6e9..7a51fd8 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -79,7 +79,7 @@ struct fib_nh { unsigned char nh_scope; #ifdef CONFIG_IP_ROUTE_MULTIPATH int nh_weight; - int nh_power; + atomic_tnh_upper_bound; #endif #ifdef CONFIG_IP_ROUTE_CLASSID __u32 nh_tclassid; @@ -118,7 +118,7 @@ struct fib_info { #define fib_advmss fib_metrics[RTAX_ADVMSS-1] int fib_nhs; #ifdef CONFIG_IP_ROUTE_MULTIPATH - int fib_power; + int fib_weight; #endif struct rcu_head rcu; struct fib_nh fib_nh[0]; @@ -320,7 +320,15 @@ int ip_fib_check_default(__be32 gw, struct net_device *dev); int fib_sync_down_dev(struct net_device *dev, unsigned long event); int fib_sync_down_addr(struct net *net, __be32 local); int fib_sync_up(struct net_device *dev, unsigned int nh_flags); -void fib_select_multipath(struct fib_result *res); + +extern u32 fib_multipath_secret __read_mostly; + +static inline int fib_multipath_hash(__be32 saddr, __be32 daddr) +{ + return jhash_2words(saddr, daddr, fib_multipath_secret) >> 1; +} + +void fib_select_multipath(struct fib_result *res, int hash); /* Exported by fib_trie.c */ void fib_trie_init(void); diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index 064bd3c..0c49d2f 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -57,8 +57,7 @@ static unsigned int fib_info_cnt; static struct hlist_head fib_info_devhash[DEVINDEX_HASHSIZE]; #ifdef CONFIG_IP_ROUTE_MULTIPATH - -static DEFINE_SPINLOCK(fib_multipath_lock); +u32 fib_multipath_secret __read_mostly; #define for_nexthops(fi) { \ int nhsel; const struct fib_nh *nh; \ @@ -532,7 +531,67 @@ errout: return ret; } -#endif +static 
void fib_rebalance(struct fib_info *fi) +{ + int total; + int w; + struct in_device *in_dev; + + if (fi->fib_nhs < 2) + return; + + total = 0; + for_nexthops(fi) { + if (nh->nh_flags & RTNH_F_DEAD) + continue; + + in_dev = __in_dev_get_rcu(nh->nh_dev); + + if (in_dev && + IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) && + nh->nh_flags & RTNH_F_LINKDOWN) + continue; + + total += nh->nh_weight; + } endfor_nexthops(fi); + + w = 0; + change_nexthops(fi) { + int upper_bound; + + in_dev = __in_dev_get_rcu(nexthop_nh->nh_dev); + + if (nexthop_nh->nh_flags & RTNH_F_DEAD) { + upper_bound = -1; + } else if (in_dev && + IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) && + nexthop_nh->nh_flags & RTNH_F_LINKDOWN) { + upper_bound = -1; + } else { + w += nexthop_nh->nh_weight; + upper_bound = DIV_ROUND_CLOSEST(2147483648LL * w, + total) - 1; + } + + atomic_set(_nh->nh_upper_bound, upper_bound); + } endfor_nexthops(fi); + + net_get_random_once(_multipath_secret, + sizeof(fib_multipath_secret)); +} + +static inline void fib_add_weight(struct fib_info *fi, + const struct fib_nh *nh) +{ + fi->fib_weight += nh->nh_weight; +} + +#else /* CONFIG_IP_ROUTE_MULTIPATH */ + +#define fib_rebalance(fi) do { } while (0) +#define fib_add_weight(fi, nh) do { } while (0) + +#endif /* CONFIG_IP_ROUTE_MULTIPATH */ static int fib_encap_match(struct net *net, u16 encap_type, struct nlattr *encap, @@ -1094,8 +1153,11 @@ struct fib_info *fib_create_info(struct fib_config *cfg) change_nexthops(fi) { fib_info_update_nh_saddr(net, nexthop_nh); + fib_add_weight(fi, nexthop_nh); } endfor_nexthops(fi) + fib_rebalance(fi); + link_it: ofi = fib_find_info(fi); if (ofi) { @@ -1317,12 +1379,6 @@ int fib_sync_down_dev(struct net_device *dev, unsigned long event) nexthop_nh->nh_flags |= RTNH_F_LINKDOWN;
Re: [PATCH net-next 4/6] xfrm: Add xfrm6 address translation function
On Tue, Sep 29, 2015 at 04:58:46PM -0600, David Ahern wrote:
> Hi Tom:
>
> On 9/29/15 4:17 PM, Tom Herbert wrote:
> >This patch adds xfrm6_xlat_addr which is called in the data path
> >to perform address translation (primarily for the receive path). Modules
> >may register their own callback to perform a translation-- this
> >registration is managed by xfrm6_xlat_addr_add and xfrm6_xlat_addr_del.
> >xfrm6_xlat_addr allows translation of addresses for an sk_buff.
> >
> Seems like a stretch to lump this into xfrms. You have a separate
> genl based config as opposed to the netlink xfrm API and you are
> calling the xlat_addr function directly in ip6_rcv as opposed to via
> some policy with dst_ops driven redirection. Why call this a xfrm?

I have to agree here. We have policies and states to do the lookups
and to describe the transformation. Just adding a callback to do this
in a different way does not integrate well into xfrm.
Re: [ovs-dev] [PATCH net-next 1/2] openvswitch: add tunnel protocol to sw_flow_key
On Tue, 29 Sep 2015 19:08:44 -0700, Jesse Gross wrote:
> On Tue, Sep 29, 2015 at 10:52 AM, Jiri Benc wrote:
> > diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
> > index 5c030a4d7338..03ba070c3256 100644
> > --- a/net/openvswitch/flow_netlink.c
> > +++ b/net/openvswitch/flow_netlink.c
> > @@ -643,6 +643,7 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
> >  	}
> >
> >  	SW_FLOW_KEY_PUT(match, tun_key.tun_flags, tun_flags, is_mask);
> > +	SW_FLOW_KEY_PUT(match, tun_proto, AF_INET, is_mask);
>
> I don't think this is right in the case of the mask. It will cause the
> the mask to be the value AF_INET - instead you want to set the mask to
> be 0xff.

I think you're right, this is a special case. I'll fix it.

Thanks,

 Jiri

--
Jiri Benc
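To illustrate why the mask for tun_proto should be all ones rather than the value AF_INET, here is a small sketch of masked matching on a single field. Everything in it is an assumption made for illustration: struct mini_key and masked_match() are hypothetical and much simpler than the real sw_flow_key handling, and the two constants use the Linux values of AF_INET (2) and AF_INET6 (10).

```c
#include <stdint.h>

/* Linux values of AF_INET and AF_INET6, fixed here so the example is portable. */
enum { MINI_AF_INET = 2, MINI_AF_INET6 = 10 };

/* Hypothetical one-field flow key; not the real struct sw_flow_key. */
struct mini_key {
	uint8_t tun_proto;
};

/* A packet matches when it agrees with the flow on every bit set in the mask. */
static int masked_match(const struct mini_key *flow,
			const struct mini_key *mask,
			const struct mini_key *pkt)
{
	return ((flow->tun_proto ^ pkt->tun_proto) & mask->tun_proto) == 0;
}
```

With the mask set to MINI_AF_INET (0x02), only bit 1 is compared, so an IPv6 key (10, binary 1010) would match an IPv4 flow (2, binary 0010). With the mask set to 0xff, all bits are compared and the two are told apart.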
Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
On Wed, Sep 30, 2015 at 07:54:29AM +0200, Mathias Krause wrote:
> On 29 September 2015 at 21:09, Jason Baron wrote:
> > However, if we call connect on socket 's', to connect to a new socket 'o2', we
> > drop the reference on the original socket 'o'. Thus, we can now close socket
> > 'o' without unregistering from epoll. Then, when we either close the ep
> > or unregister 'o', we end up with this list corruption. Thus, this is not a
> > race per se, but can be triggered sequentially.
>
> Sounds profound, but the reproducers calls connect only once per
> socket. So there is no "connect to a new socket", no?

I believe there is another scenario: 'o' becomes SOCK_DEAD while 's' is
still connected to it. This is detected by 's' in unix_dgram_sendmsg()
so that 's' releases its reference on 'o' and 'o' can be freed. If this
happens before 's' is unregistered, we get use-after-free as 'o' has
never been unregistered. And as the interval between freeing 'o' and
unregistering 's' can be quite long, there is a chance for the memory to
be reused. This is what one of our customers has seen:

[exception RIP: _raw_spin_lock_irqsave+156]
RIP: 8040f5bc RSP: 8800e929de78 RFLAGS: 00010082
RAX: a32c RBX: 88003954ab80 RCX: 1000
RDX: f232 RSI: f232 RDI: 88003954ab80
RBP: 5220 R8: dead00100100 R9:
R10: 7fff1a284960 R11: 0246 R12:
R13: 8800e929de8c R14: 000e R15:
ORIG_RAX: CS: 1e030 SS: e02b
 #8 [8800e929de70] _raw_spin_lock_irqsave at 8040f5a9
 #9 [8800e929deb0] remove_wait_queue at 8006ad09
#10 [8800e929ded0] ep_unregister_pollwait at 80170043
#11 [8800e929def0] ep_remove at 80170073
#12 [8800e929df10] sys_epoll_ctl at 80171453
#13 [8800e929df80] system_call_fastpath at 80417553

In this case, the crash happened on unregistering 's' which had a null
peer (i.e. not reconnected but rather disconnected) but there were still
two items in the list, the other pointing to an unallocated page which
has apparently been modified in between.

IMHO unix_dgram_disconnected() could be the place to handle this issue:
it is called from both places where we disconnect from a peer (dead peer
detection in unix_dgram_sendmsg() and reconnect in unix_dgram_connect())
just before the reference to peer is released. I'm not familiar with the
epoll implementation so I'm still trying to find what exactly needs to
be done to unregister the peer at this moment.

> That bug triggers since commit 3c73419c09 "af_unix: fix 'poll for
> write'/ connected DGRAM sockets". That's v2.6.26-rc7, as noted in the
> reproducer.

Sounds likely as this is the commit that introduced unix_dgram_poll()
with the code which adds the "asymmetric peer" to monitor its queue
state. More precisely, the asymmetricity check has been added by
ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
shortly after that.

Michal Kubecek
Re: [PATCH net-next 2/2] openvswitch: netlink attributes for IPv6 tunneling
On Tue, 29 Sep 2015 20:05:00 -0700, Jesse Gross wrote:
> This appears to me to be a bug in the existing code.
> ovs_tunnel_get_egress_info() as a general mechanism is still in use
> and should work with both the old and new configuration methods.

It's currently used only from the compat layer (the API used by user
space that is unaware of lwtunnels).

I don't understand what it would be good for with lwtunnel based
tunnels. The metadata_dst is created in the validate_and_copy_set_tun
function (net/openvswitch/flow_netlink.c) and used to specify egress
encapsulation metadata. The ovs_tunnel_get_egress_info function is not
needed.

> However, I agree that it doesn't look like it will work currently with
> tunnel devices. I think we need to fix this rather than making it more
> broken.

I'm not making it more broken. We currently (i.e. right now, in the
current net.git) have two APIs for tunnel specification in the ovs
kernel datapath: the old one, which is translated by the compat layer to
create a net_device, and the lwtunnel one, which requires user space to
create a (metadata) tunnel net_device and add it to the datapath.

I'm simply not adding more code to the first, legacy interface, which
seems to be the correct thing to do.

 Jiri

--
Jiri Benc
Re: [PATCH 0/5] y2038 conversion for ntp/pps and sfc driver
On Tue, 29 Sep 2015, David Miller wrote:
> From: Arnd Bergmann
> Date: Mon, 28 Sep 2015 22:21:27 +0200
>
> > When trying to build a kernel with time_t commented out, I found that
> > the ntp subsystem still relies on timespec for its pps handling.
> >
> > This series addresses this and converts all the code to use timespec64
> > instead, step by step. There is one device driver that interacts with
> > this code directly (rather than only through the ptp subsystem), so
> > I have to convert that driver at the same time.
> >
> > The patches should ideally stay together as a series, but they do
> > span multiple subsystems, so I'm also looking for the right person
> > to merge them.
>
> I'm happy with this going via a tree other than mine, and for the

I think it should go via John Stultz timekeeping tree.

Thanks,

	tglx
Re: [PATCH 5/5] net: sfc: avoid using timespec
On Mon, 28 Sep 2015, Arnd Bergmann wrote:
> The sfc driver internally uses a time format based on 32-bit (unsigned)
> seconds and 32-bit nanoseconds. This means it will overflow in 2106,
> but the value we pass into it is a signed 32-bit tv_sec that already
> overflows in 2038 to a negative value.
>
> This patch changes the logic to use the lower 32 bits of the timespec64
> tv_sec in efx_ptp_ns_to_s_ns, which will have the correct value beyond the
> overflow.
> While this does not change any of the register values, it lets us
> keep using the driver after we deprecate the use of the timespec type
> in the kernel.
>
> In the efx_ptp_process_times function, the change to use timespec64
> is similar, in that the tv_sec portion is ignored anyway and we only
> care about the nanosecond portion that remains unchanged.
>
> Signed-off-by: Arnd Bergmann

Reviewed-by: Thomas Gleixner
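The 2038 versus 2106 distinction in the quoted description can be shown with plain integer conversions. This is only a sketch: sec_as_s32() and sec_lower32() are hypothetical helper names, not functions from the sfc driver, and the signed conversion relies on the wrapping behaviour that common compilers implement.

```c
#include <stdint.h>

/* What a signed 32-bit tv_sec holds after truncation. The out-of-range
 * conversion to int32_t is implementation-defined in C; common compilers
 * wrap around in two's complement.
 */
static int32_t sec_as_s32(int64_t sec)
{
	return (int32_t)(uint32_t)sec;
}

/* Lower 32 bits of a 64-bit tv_sec, read as unsigned seconds. */
static uint32_t sec_lower32(int64_t sec)
{
	return (uint32_t)sec;
}
```

At 2^31 seconds after the epoch (January 2038) sec_as_s32() turns negative, while sec_lower32() keeps counting up and only wraps to 0 at 2^32 seconds (2106).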
Re: [PATCH 4/5] ntp/pps: use y2038 safe types in pps_event_time
On Mon, 28 Sep 2015, Arnd Bergmann wrote:
> The pps_event_time uses two 'timespec' structures internally, which
> suffer from the y2038 problem. The uses of this structure are
> fairly self-contained in the pps code, so this replaces them all at
> once.
>
> Unfortunately, this includes the sfc ethernet driver aside from the
> pps subsystem, so we change that one as well. Both touch the
> same data structure, and there probably is no good way to split
> the patch into smaller units.
>
> Signed-off-by: Arnd Bergmann

Reviewed-by: Thomas Gleixner
Re: [PATCH RFC 3/7] netfilter: add NF_INET_LOCAL_SOCKET_IN chain type
On Wednesday 2015-09-30 09:24, Daniel Mack wrote:
>
>> Drop? Makes no sense, else application would not be running in the first
>> place.
>
>Of course you can drop certain packets at this point, depending on other
>details. Say, for instance, you want to match all packets that are
>received by a certain task [...]
>Another use case is accounting. If you want to know how much traffic a
>certain service or application in your system has caused

But the sk info would be available in INPUT already, would it not?
Re: [PATCH net-next 1/2] openvswitch: add tunnel protocol to sw_flow_key
On Wed, Sep 30, 2015 at 12:09 AM, Jiri Benc wrote:
> On Tue, 29 Sep 2015 13:41:34 -0700, Pravin Shelar wrote:
>> We can add rather add TUNNEL_IPV6 flag to distinguish IPv4 and IPv6
>> tunnel keys. This can be stored in ip_tunnel_key.tun_flags.
>
> Not really. This was my original approach, too, but openvswitch is not
> the only user of struct ip_tunnel_key, and in the lwtunnel core,
> tun_flags are handled in the way that makes this impractical. Most
> importantly, the tun_flags value is directly taken from/stored to
> LWTUNNEL_IP_FLAGS/LWTUNNEL_IP6_FLAGS netlink attributes in
> net/ipv4/ip_tunnel_core.c. This would mean complicated masking, etc.
>
How is it impractical? Userspace can set the flag for IPv6 tunnel info.
That should be easy. The IPv6 bit cannot be masked anyway, so I do not
see a problem with masking this flag due to the new bit.
Since this field is exposed to userspace, the TUNNEL_* flags need to be
moved to a uapi header.

>> That also saves space in flow key.
>
> The field was added to a 2 byte hole in the struct sw_flow_key (leaving
> still 1 byte free), thus there's no additional space used.
>
>  Jiri
>
> --
> Jiri Benc
[PATCH v2 net-next 2/3] RDS-TCP: Do not bloat sndbuf/rcvbuf in rds_tcp_tune
Using the value of RDS_TCP_DEFAULT_BUFSIZE (128K) clobbers efficient use of
TSO because it inflates the size_goal that is computed in
tcp_sendmsg/tcp_sendpage and skews packet latency, and the default values
for these parameters actually result in significantly better performance.

With this patch, request-response tests using rds-stress with a packet size
of 100K and 16 threads (test parameters -q 10 -a 256 -t16 -d16) between a
single pair of IP addresses achieve a throughput of 6-8 Gbps. Without this
patch, throughput maxes at 2-3 Gbps under equivalent conditions on these
platforms.

Signed-off-by: Sowmini Varadhan
---
 net/rds/tcp.c |   16 ++++------------
 1 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index c42b60b..9d6ddba 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -67,21 +67,13 @@ void rds_tcp_nonagle(struct socket *sock)
 	set_fs(oldfs);
 }
 
+/* All module specific customizations to the RDS-TCP socket should be done in
+ * rds_tcp_tune() and applied after socket creation. In general these
+ * customizations should be tunable via module_param()
+ */
 void rds_tcp_tune(struct socket *sock)
 {
-	struct sock *sk = sock->sk;
-
 	rds_tcp_nonagle(sock);
-
-	/*
-	 * We're trying to saturate gigabit with the default,
-	 * see svc_sock_setbufsize().
-	 */
-	lock_sock(sk);
-	sk->sk_sndbuf = RDS_TCP_DEFAULT_BUFSIZE;
-	sk->sk_rcvbuf = RDS_TCP_DEFAULT_BUFSIZE;
-	sk->sk_userlocks |= SOCK_SNDBUF_LOCK|SOCK_RCVBUF_LOCK;
-	release_sock(sk);
 }
 
 u32 rds_tcp_snd_nxt(struct rds_tcp_connection *tc)
-- 
1.7.1
[PATCH v2 net-next 0/3] RDS: RDS-TCP perf enhancements
A 3-part patchset that (a) improves current RDS-TCP perf by 2X-3X and
(b) refactors earlier robustness code for better observability/scaling.

Patch 1 is an enhancement of earlier robustness fixes that had used
separate sockets for client and server endpoints to resolve race
conditions. It is possible to have an equivalent solution that does not
use 2 sockets. The benefit of a single socket solution is that it
results in more predictable and observable behavior for the underlying
TCP pipe of an RDS connection.

Patches 2 and 3 are simple, straightforward perf bug fixes that align
the RDS TCP socket with other parts of the kernel stack.

v2: fix kbuild-test-robot warnings, comments from Sergei Shtylyov
    and Santosh Shilimkar.

Sowmini Varadhan (3):
  Use a single TCP socket for both send and receive.
  Do not bloat sndbuf/rcvbuf in rds_tcp_tune
  Set up MSG_MORE and MSG_SENDPAGE_NOTLAST as appropriate in
    rds_tcp_xmit

 net/rds/connection.c | 22
 net/rds/rds.h        |  4
 net/rds/tcp.c        | 16
 net/rds/tcp_listen.c | 22
 net/rds/tcp_send.c   |  8
 5 files changed, 29 insertions(+), 43 deletions(-)
[PATCH v2 net-next 1/3] RDS: Use a single TCP socket for both send and receive.
Commit f711a6ae062c ("net/rds: RDS-TCP: Always create a new rds_sock for an incoming connection.") modified rds-tcp so that an incoming SYN would ignore an existing "client" TCP connection which had the local port set to the transient port. The motivation for ignoring the existing "client" connection in f711a6ae was to avoid race conditions and an endless duel of reconnect attempts triggered by a restart/abort of one of the nodes in the TCP connection. However, having separate sockets for active and passive sides is avoidable, and the simpler model of a single TCP socket for both send and receives of all RDS connections associated with that tcp socket makes for easier observability. We avoid the race conditions from f711a6ae by attempting reconnects in rds_conn_shutdown if, and only if, the (new) c_outgoing bit is set for RDS_TRANS_TCP. The c_outgoing bit is initialized in __rds_conn_create(). A side-effect of re-using the client rds_connection for an incoming SYN is the potential of encountering duelling SYNs, i.e., we have an outgoing RDS_CONN_CONNECTING socket when we get the incoming SYN. The logic to arbitrate this criss-crossing SYN exchange in rds_tcp_accept_one() has been modified to emulate the BGP state machine: the smaller IP address should back off from the connection attempt. 
Signed-off-by: Sowmini Varadhan--- v2: kbuild-test-robot warning around __be32, modify subject line per Santosh Shilimkar net/rds/connection.c | 22 ++ net/rds/rds.h|4 +++- net/rds/tcp_listen.c | 22 +- 3 files changed, 18 insertions(+), 30 deletions(-) diff --git a/net/rds/connection.c b/net/rds/connection.c index 49adeef..d456403 100644 --- a/net/rds/connection.c +++ b/net/rds/connection.c @@ -128,10 +128,7 @@ static struct rds_connection *__rds_conn_create(struct net *net, struct rds_transport *loop_trans; unsigned long flags; int ret; - struct rds_transport *otrans = trans; - if (!is_outgoing && otrans->t_type == RDS_TRANS_TCP) - goto new_conn; rcu_read_lock(); conn = rds_conn_lookup(net, head, laddr, faddr, trans); if (conn && conn->c_loopback && conn->c_trans != _loop_transport && @@ -147,7 +144,6 @@ static struct rds_connection *__rds_conn_create(struct net *net, if (conn) goto out; -new_conn: conn = kmem_cache_zalloc(rds_conn_slab, gfp); if (!conn) { conn = ERR_PTR(-ENOMEM); @@ -207,6 +203,7 @@ static struct rds_connection *__rds_conn_create(struct net *net, atomic_set(>c_state, RDS_CONN_DOWN); conn->c_send_gen = 0; + conn->c_outgoing = (is_outgoing ? 1 : 0); conn->c_reconnect_jiffies = 0; INIT_DELAYED_WORK(>c_send_w, rds_send_worker); INIT_DELAYED_WORK(>c_recv_w, rds_recv_worker); @@ -243,22 +240,13 @@ static struct rds_connection *__rds_conn_create(struct net *net, /* Creating normal conn */ struct rds_connection *found; - if (!is_outgoing && otrans->t_type == RDS_TRANS_TCP) - found = NULL; - else - found = rds_conn_lookup(net, head, laddr, faddr, trans); + found = rds_conn_lookup(net, head, laddr, faddr, trans); if (found) { trans->conn_free(conn->c_transport_data); kmem_cache_free(rds_conn_slab, conn); conn = found; } else { - if ((is_outgoing && otrans->t_type == RDS_TRANS_TCP) || - (otrans->t_type != RDS_TRANS_TCP)) { - /* Only the active side should be added to -* reconnect list for TCP. 
-*/ - hlist_add_head_rcu(>c_hash_node, head); - } + hlist_add_head_rcu(>c_hash_node, head); rds_cong_add_conn(conn); rds_conn_count++; } @@ -337,7 +325,9 @@ void rds_conn_shutdown(struct rds_connection *conn) rcu_read_lock(); if (!hlist_unhashed(>c_hash_node)) { rcu_read_unlock(); - rds_queue_reconnect(conn); + if (conn->c_trans->t_type != RDS_TRANS_TCP || + conn->c_outgoing == 1) + rds_queue_reconnect(conn); } else { rcu_read_unlock(); } diff --git a/net/rds/rds.h b/net/rds/rds.h index afb4048..b4c7ac0 100644 --- a/net/rds/rds.h +++ b/net/rds/rds.h @@ -86,7 +86,9 @@ struct rds_connection { struct hlist_node c_hash_node; __be32 c_laddr; __be32 c_faddr; - unsigned intc_loopback:1; + unsigned intc_loopback:1, +
[PATCH v2 net-next 3/3] RDS-TCP: Set up MSG_MORE and MSG_SENDPAGE_NOTLAST as appropriate in rds_tcp_xmit
For the same reasons as commit 2f5338442425 ("tcp: allow splice() to build full TSO packets") and commit 35f9c09fe9c7 ("tcp: tcp_sendpages() should call tcp_push() once"), rds_tcp_xmit may have multiple pages to send, so use the MSG_MORE and MSG_SENDPAGE_NOTLAST as hints to tcp_sendpage() Signed-off-by: Sowmini Varadhan--- v2: Sergei Shtylov, Santosh Shilimkar comments (some parens retained for readability) net/rds/tcp_send.c |8 +++- 1 files changed, 7 insertions(+), 1 deletions(-) diff --git a/net/rds/tcp_send.c b/net/rds/tcp_send.c index 53b17ca..2894e60 100644 --- a/net/rds/tcp_send.c +++ b/net/rds/tcp_send.c @@ -83,6 +83,7 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm, struct rds_tcp_connection *tc = conn->c_transport_data; int done = 0; int ret = 0; + int more; if (hdr_off == 0) { /* @@ -116,12 +117,15 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm, goto out; } + more = rm->data.op_nents > 1 ? (MSG_MORE | MSG_SENDPAGE_NOTLAST) : 0; while (sg < rm->data.op_nents) { + int flags = MSG_DONTWAIT | MSG_NOSIGNAL | more; + ret = tc->t_sock->ops->sendpage(tc->t_sock, sg_page(>data.op_sg[sg]), rm->data.op_sg[sg].offset + off, rm->data.op_sg[sg].length - off, - MSG_DONTWAIT|MSG_NOSIGNAL); + flags); rdsdebug("tcp sendpage %p:%u:%u ret %d\n", (void *)sg_page(>data.op_sg[sg]), rm->data.op_sg[sg].offset + off, rm->data.op_sg[sg].length - off, ret); @@ -134,6 +138,8 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm, off = 0; sg++; } + if (sg == rm->data.op_nents - 1) + more = 0; } out: -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
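The intent of the flag handling in this patch, as read from the diff, is that every page except the last one is sent with MSG_MORE and MSG_SENDPAGE_NOTLAST. A minimal sketch of that per-page decision follows. The helper page_flags() is hypothetical and not part of rds_tcp_xmit(), and the X_MSG_* values are defined locally for illustration only, since MSG_SENDPAGE_NOTLAST is a kernel-internal flag.

```c
/* Illustrative flag values; the real MSG_* constants live in kernel headers. */
enum {
	X_MSG_DONTWAIT         = 0x40,
	X_MSG_NOSIGNAL         = 0x4000,
	X_MSG_MORE             = 0x8000,
	X_MSG_SENDPAGE_NOTLAST = 0x20000
};

/* Flags for sending scatterlist entry sg out of nents entries in total. */
static int page_flags(unsigned int sg, unsigned int nents)
{
	int flags = X_MSG_DONTWAIT | X_MSG_NOSIGNAL;

	/* Hint that more data follows for every entry except the last one. */
	if (sg + 1 < nents)
		flags |= X_MSG_MORE | X_MSG_SENDPAGE_NOTLAST;
	return flags;
}
```

For a single-page message the hints are never set, which matches the patch initializing more to 0 when op_nents is 1.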
[PATCH net-next 0/6] net: Pass net through ip fragmention
This is the next installment of my work to pass struct net through the
output path so the code does not need to guess how to figure out which
network namespace it is in, and ultimately routes can have output
devices in another network namespace.

This round focuses on passing net through ip fragmentation, which we seem
to call from about everywhere: the main ip output paths, the bridge
netfilter code, and openvswitch. This has to happen at once across the
tree as function pointers are involved.

First some prep work is done, then ipv4 and ipv6 are converted, and then
temporary helper functions are removed.

The changes are also available against nf-next at:
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/net-next.git master

Eric

Eric W. Biederman (6):
      openvswitch: Pass net into ovs_vport_output
      openvswitch: Pass net into ovs_fragment
      ipv4: Pass struct net through ip_fragment
      ipv6: Pass struct net through ip6_fragment
      bridge: Remove br_nf_push_frag_xmit_sk
      openvswitch: Remove ovs_vport_output_sk

 include/linux/netfilter_ipv6.h  |  4 ++--
 include/net/ip.h                |  4 ++--
 include/net/ip6_route.h         |  4 ++--
 net/bridge/br_netfilter_hooks.c | 13
 net/ipv4/ip_output.c            | 44 +++--
 net/ipv6/ip6_output.c           | 16 +++
 net/ipv6/xfrm6_output.c         | 10 --
 net/openvswitch/actions.c       | 13 ++--
 8 files changed, 52 insertions(+), 56 deletions(-)
Re: [PATCH net-next 2/2] openvswitch: netlink attributes for IPv6 tunneling
On Wed, Sep 30, 2015 at 12:28 AM, Jiri Benc wrote:
> On Tue, 29 Sep 2015 20:05:00 -0700, Jesse Gross wrote:
>> This appears to me to be a bug in the existing code.
>> ovs_tunnel_get_egress_info() as a general mechanism is still in use
>> and should work with both the old and new configuration methods.
>
> It's currently used only from the compat layer (the API that the user
> space that is unaware of lwtunnels use).

Yes but that is a bug. From the perspective of the intended use of
this function, I don't think there is any difference between compat
and non-compat users.

> I don't understand what it would be good for with lwtunnel based
> tunnels. The metadata_dst is created in the validate_and_copy_set_tun
> function (net/openvswitch/flow_netlink.c) and used to specify egress
> encapsulation metadata. The ovs_tunnel_get_egress_info function is not
> needed.

This function is used to report back information that is the result of
the encapsulation process, such as the UDP source port chosen. Take a
look at net/openvswitch/actions.c:output_userspace(), particularly the
OVS_USERSPACE_ATTR_EGRESS_TUN_PORT case.
Re: [ovs-dev] [PATCH net-next 1/2] openvswitch: add tunnel protocol to sw_flow_key
On Wed, Sep 30, 2015 at 1:13 PM, Pravin Shelar wrote:
> On Wed, Sep 30, 2015 at 12:09 AM, Jiri Benc wrote:
>> On Tue, 29 Sep 2015 13:41:34 -0700, Pravin Shelar wrote:
>>> We can add rather add TUNNEL_IPV6 flag to distinguish IPv4 and IPv6
>>> tunnel keys. This can be stored in ip_tunnel_key.tun_flags.
>>
>> Not really. This was my original approach, too, but openvswitch is not
>> the only user of struct ip_tunnel_key, and in the lwtunnel core,
>> tun_flags are handled in the way that makes this impractical. Most
>> importantly, the tun_flags value is directly taken from/stored to
>> LWTUNNEL_IP_FLAGS/LWTUNNEL_IP6_FLAGS netlink attributes in
>> net/ipv4/ip_tunnel_core.c. This would mean complicated masking, etc.
>>
> How is it impractical ? Userspace can set flag for IPv6 tunnel info.
> That should be easy.
>
> IPv6 bit can not be masked anyways so I do not see problem with
> masking this flag due to the new bit.

I think he meant for non-OVS users.

> Since this field is exposed to userspace. TUNNEL_* flags needs to be
> moved to uapi header.

This doesn't really seem all that desirable to me. It's nice to be
able to change these as necessary and in the particular case of IPv6,
it seems like something that the kernel can manage by itself (as is
done in this patch and I think the same strategy would apply
regardless of the particular representation).
Re: [ovs-dev] [PATCH net-next 1/2] openvswitch: add tunnel protocol to sw_flow_key
On Wed, 30 Sep 2015 13:25:12 -0700, Jesse Gross wrote:
> On Wed, Sep 30, 2015 at 1:13 PM, Pravin Shelar wrote:
> > On Wed, Sep 30, 2015 at 12:09 AM, Jiri Benc wrote:
> >> On Tue, 29 Sep 2015 13:41:34 -0700, Pravin Shelar wrote:
> >>> We can add rather add TUNNEL_IPV6 flag to distinguish IPv4 and IPv6
> >>> tunnel keys. This can be stored in ip_tunnel_key.tun_flags.
> >>
> >> Not really. This was my original approach, too, but openvswitch is not
> >> the only user of struct ip_tunnel_key, and in the lwtunnel core,
> >> tun_flags are handled in the way that makes this impractical. Most
> >> importantly, the tun_flags value is directly taken from/stored to
> >> LWTUNNEL_IP_FLAGS/LWTUNNEL_IP6_FLAGS netlink attributes in
> >> net/ipv4/ip_tunnel_core.c. This would mean complicated masking, etc.
> >>
> > How is it impractical ? Userspace can set flag for IPv6 tunnel info.
> > That should be easy.
> >
> > IPv6 bit can not be masked anyways so I do not see problem with
> > masking this flag due to the new bit.
>
> I think he meant for non-OVS users.

Yes, I didn't mean masking in ovs, I meant that we'd need to hide the
bit from other users, for example in net/ipv4/ip_tunnel_core.c.

Currently, the information about ip_tunnel_key protocol is stored
outside the structure. Changing this would mean quite big changes in
the lwtunnel code (or, rather, IP users of lwtunnel) which doesn't seem
worth it just because of ovs. Especially when ovs can store the
information just fine without impact on memory footprint.

I don't see any real advantage in storing the protocol inside
ip_tunnel_key, this looks like it would be just a change for the
change.

> > Since this field is exposed to userspace. TUNNEL_* flags needs to be
> > moved to uapi header.
>
> This doesn't really seem all that desirable to me. It's nice to be
> able to change these as necessary and in the particular case of IPv6,
> it seems like something that the kernel can manage by itself (as is
> done in this patch and I think the same strategy would apply
> regardless of the particular representation).

User space can set and get those bits in LWTUNNEL_IP_FLAGS netlink
attribute when using lwtunnel+routing rules. It would make sense to
move them to uapi but that's for a different patch(set).

 Jiri

--
Jiri Benc
Re: [PATCH net-next 2/2] openvswitch: netlink attributes for IPv6 tunneling
On Wed, 30 Sep 2015 13:18:40 -0700, Jesse Gross wrote:
> This function is used to report back information that is the result of
> the encapsulation process, such as the UDP source port chosen. Take a
> look at net/openvswitch/actions.c:output_userspace(), particularly the
> OVS_USERSPACE_ATTR_EGRESS_TUN_PORT case.

I see. I think it should be addressed separately from this patchset,
though, as the function needs to be completely rewritten even for IPv4
and IPv6 can be handled alongside it.

I'll change the patch description in v2, the current wording is not
correct. I don't think that fixing the bug should be a prerequisite for
this patchset, the problem is already there and this patchset doesn't
change that.

 Jiri

--
Jiri Benc
Re: [PATCH net-next 2/2] openvswitch: netlink attributes for IPv6 tunneling
On Wed, Sep 30, 2015 at 2:05 PM, Jiri Benc wrote:
> On Wed, 30 Sep 2015 13:18:40 -0700, Jesse Gross wrote:
>> This function is used to report back information that is the result of
>> the encapsulation process, such as the UDP source port chosen. Take a
>> look at net/openvswitch/actions.c:output_userspace(), particularly the
>> OVS_USERSPACE_ATTR_EGRESS_TUN_PORT case.
>
> I see. I think it should be addressed separately from this patchset,
> though, as the function needs to be completely rewritten even for IPv4
> and IPv6 can be handled alongside it.
>
> I'll change the patch description in v2, the current wording is not
> correct. I don't think that fixing the bug should be a prerequisite for
> this patchset, the problem is already there and this patchset doesn't
> change that.

Can you at least update the existing code for IPv6 so that this
doesn't introduce another lurking issue when the bug is fixed?
[PATCH] net: usb: asix: Fix crash on skb alloc failure
If asix_rx_fixup_internal() fails to allocate rx->ax_skb, it returns
without clearing rx->size. rx points to driver private data, so the
stale size survives until the next call. That later call assumes a
nonzero size means ax_skb was allocated, and passes a null ax_skb to
skb_put.

Change the allocation failure path to clear size before returning.

Found while testing a board with AX88772B devices.

Signed-off-by: David B. Robins
---
 drivers/net/usb/asix_common.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/usb/asix_common.c b/drivers/net/usb/asix_common.c
index 75d6f26..079069a 100644
--- a/drivers/net/usb/asix_common.c
+++ b/drivers/net/usb/asix_common.c
@@ -91,8 +91,10 @@ int asix_rx_fixup_internal(struct usbnet *dev, struct sk_buff *skb,
 			}

 			rx->ax_skb = netdev_alloc_skb_ip_align(dev->net, rx->size);
-			if (!rx->ax_skb)
+			if (!rx->ax_skb) {
+				rx->size = 0;
 				return 0;
+			}
 		}

 		if (rx->size > dev->net->mtu + ETH_HLEN + VLAN_HLEN) {
--
1.9.1
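To make the failure mode concrete, here is a small user-space sketch of the same pattern, not the driver code itself. The struct rx_state, rx_begin() and rx_has_pending() names are invented for illustration, and malloc() stands in for the skb allocation; the fail_alloc argument only exists to simulate an out-of-memory condition.

```c
#include <stdlib.h>

/* Stand-in for the driver's private receive state. */
struct rx_state {
	unsigned char *buf;	/* plays the role of rx->ax_skb */
	size_t size;		/* plays the role of rx->size */
};

/*
 * Begin receiving a packet of pkt_len bytes.  Returns 1 on success and
 * 0 on allocation failure.  On failure, size is reset so that a later
 * call cannot mistake "size != 0" for "buffer present".
 */
static int rx_begin(struct rx_state *rx, size_t pkt_len, int fail_alloc)
{
	rx->size = pkt_len;
	rx->buf = fail_alloc ? NULL : malloc(pkt_len);
	if (!rx->buf) {
		rx->size = 0;	/* the step the fix adds */
		return 0;
	}
	return 1;
}

/* Later calls use a nonzero size as "a partial packet is pending". */
static int rx_has_pending(const struct rx_state *rx)
{
	return rx->size != 0;
}
```

Without the reset, rx_has_pending() would report a pending packet after a failed allocation while buf is still NULL, which is the situation that leads to the null pointer being handed to skb_put in the driver.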
[PATCH 0/4] Add support for Broadcom's iProc MDIO and Cygnus Ethernet PHY
Hi,

This patchset adds support for the iProc MDIO interface and the Broadcom
Cygnus SoC's internal Ethernet PHY. The internal Ethernet PHY(s) in the
Cygnus SoCs are accessed via the MDIO interface found in most of the
iProc based chips.

The patchset also consolidates the common APIs used by the Broadcom PHYs
into a common library. Existing Broadcom PHY drivers have been modified
to use the common library APIs.

The Ethernet driver for the iProc family will be submitted soon, as will
the device tree configurations for the different iProc family SoCs.

Arun Parameswaran (4):
  dt-bindings: net: Broadcom iProc MDIO bus driver device tree binding
  net: phy: Broadcom iProc MDIO bus driver
  net: phy: Add Broadcom phy library for common interfaces
  net: phy: Broadcom Cygnus internal Etherent PHY driver

 .../devicetree/bindings/net/brcm,iproc-mdio.txt |  23 +++
 drivers/net/phy/Kconfig                         |  28 +++
 drivers/net/phy/Makefile                        |   3 +
 drivers/net/phy/bcm-cygnus.c                    | 162
 drivers/net/phy/bcm-phy-lib.c                   | 209
 drivers/net/phy/bcm-phy-lib.h                   |  37
 drivers/net/phy/bcm63xx.c                       |  38 +---
 drivers/net/phy/bcm7xxx.c                       | 127 +++-
 drivers/net/phy/broadcom.c                      | 149 +-
 drivers/net/phy/mdio-bcm-iproc.c                | 213 +
 include/linux/brcmphy.h                         |  24 +--
 11 files changed, 757 insertions(+), 256 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/brcm,iproc-mdio.txt
 create mode 100644 drivers/net/phy/bcm-cygnus.c
 create mode 100644 drivers/net/phy/bcm-phy-lib.c
 create mode 100644 drivers/net/phy/bcm-phy-lib.h
 create mode 100644 drivers/net/phy/mdio-bcm-iproc.c

--
2.5.2
[PATCH 2/4] net: phy: Broadcom iProc MDIO bus driver
This patch adds support for the Broadcom iProc MDIO bus interface. The MDIO interface can be found in the Broadcom iProc family Soc's. The MDIO bus is accessed using a combination of command and data registers. This MDIO driver provides access to the Etherent GPHY's connected to the MDIO bus. Signed-off-by: Arun Parameswaran--- drivers/net/phy/Kconfig | 9 ++ drivers/net/phy/Makefile | 1 + drivers/net/phy/mdio-bcm-iproc.c | 213 +++ 3 files changed, 223 insertions(+) create mode 100644 drivers/net/phy/mdio-bcm-iproc.c diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig index c5ad98a..b57f6c2 100644 --- a/drivers/net/phy/Kconfig +++ b/drivers/net/phy/Kconfig @@ -225,6 +225,15 @@ config MDIO_BCM_UNIMAC This hardware can be found in the Broadcom GENET Ethernet MAC controllers as well as some Broadcom Ethernet switches such as the Starfighter 2 switches. + +config MDIO_BCM_IPROC + tristate "Broadcom iProc MDIO bus controller" + depends on ARCH_BCM_IPROC || COMPILE_TEST + depends on HAS_IOMEM && OF_MDIO + help + This module provides a driver for the MDIO busses found in the + Broadcom iProc SoC's. + endif # PHYLIB config MICREL_KS8995MA diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile index 87f079c..f4e6eb9 100644 --- a/drivers/net/phy/Makefile +++ b/drivers/net/phy/Makefile @@ -38,3 +38,4 @@ obj-$(CONFIG_MDIO_SUN4I) += mdio-sun4i.o obj-$(CONFIG_MDIO_MOXART) += mdio-moxart.o obj-$(CONFIG_MDIO_BCM_UNIMAC) += mdio-bcm-unimac.o obj-$(CONFIG_MICROCHIP_PHY)+= microchip.o +obj-$(CONFIG_MDIO_BCM_IPROC) += mdio-bcm-iproc.o diff --git a/drivers/net/phy/mdio-bcm-iproc.c b/drivers/net/phy/mdio-bcm-iproc.c new file mode 100644 index 000..c0b4e65 --- /dev/null +++ b/drivers/net/phy/mdio-bcm-iproc.c @@ -0,0 +1,213 @@ +/* + * Copyright (C) 2015 Broadcom Corporation + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation version 2. 
+ * + * This program is distributed "as is" WITHOUT ANY WARRANTY of any + * kind, whether express or implied; without even the implied warranty + * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define IPROC_GPHY_MDCDIV0x1a + +#define MII_CTRL_OFFSET 0x000 + +#define MII_CTRL_DIV_SHIFT 0 +#define MII_CTRL_PRE_SHIFT 7 +#define MII_CTRL_BUSY_SHIFT 8 + +#define MII_DATA_OFFSET 0x004 +#define MII_DATA_MASK0x +#define MII_DATA_TA_SHIFT16 +#define MII_DATA_TA_VAL 2 +#define MII_DATA_RA_SHIFT18 +#define MII_DATA_PA_SHIFT23 +#define MII_DATA_OP_SHIFT28 +#define MII_DATA_OP_WRITE1 +#define MII_DATA_OP_READ 2 +#define MII_DATA_SB_SHIFT30 + +struct iproc_mdio_priv { + struct mii_bus *mii_bus; + void __iomem *base; +}; + +static inline int iproc_mdio_wait_for_idle(void __iomem *base) +{ + u32 val; + unsigned int timeout = 1000; /* loop for 1s */ + + do { + val = readl(base + MII_CTRL_OFFSET); + if ((val & BIT(MII_CTRL_BUSY_SHIFT)) == 0) + return 0; + + usleep_range(1000, 2000); + } while (timeout--); + + return -ETIMEDOUT; +} + +static inline void iproc_mdio_config_clk(void __iomem *base) +{ + u32 val; + + val = (IPROC_GPHY_MDCDIV << MII_CTRL_DIV_SHIFT) | + BIT(MII_CTRL_PRE_SHIFT); + writel(val, base + MII_CTRL_OFFSET); +} + +static int iproc_mdio_read(struct mii_bus *bus, int phy_id, int reg) +{ + struct iproc_mdio_priv *priv = bus->priv; + u32 cmd; + int rc; + + rc = iproc_mdio_wait_for_idle(priv->base); + if (rc) + return rc; + + iproc_mdio_config_clk(priv->base); + + /* Prepare the read operation */ + cmd = (MII_DATA_TA_VAL << MII_DATA_TA_SHIFT) | + (reg << MII_DATA_RA_SHIFT) | + (phy_id << MII_DATA_PA_SHIFT) | + BIT(MII_DATA_SB_SHIFT) | + (MII_DATA_OP_READ << MII_DATA_OP_SHIFT); + + writel(cmd, priv->base + MII_DATA_OFFSET); + + rc = iproc_mdio_wait_for_idle(priv->base); + if (rc) + return rc; + + 
cmd = readl(priv->base + MII_DATA_OFFSET) & MII_DATA_MASK; + + return cmd; +} + +static int iproc_mdio_write(struct mii_bus *bus, int phy_id, + int reg, u16 val) +{ + struct iproc_mdio_priv *priv = bus->priv; + u32 cmd; + int rc; + + rc = iproc_mdio_wait_for_idle(priv->base); + if (rc) + return rc; + + iproc_mdio_config_clk(priv->base); + + /* Prepare the
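As a reading aid for the command word built in iproc_mdio_read(), here is a stand-alone sketch that packs the same fields in user space. The shift and opcode values are copied from the patch; the helper name mdio_read_cmd() and the 5-bit masking of the PHY and register addresses are my additions (Clause 22 MDIO addresses are 5 bits wide), so treat them as assumptions rather than what the driver does.

```c
#include <stdint.h>

/* Field positions and values as defined in the patch. */
#define MII_DATA_TA_SHIFT	16
#define MII_DATA_TA_VAL		2
#define MII_DATA_RA_SHIFT	18
#define MII_DATA_PA_SHIFT	23
#define MII_DATA_OP_SHIFT	28
#define MII_DATA_OP_READ	2
#define MII_DATA_SB_SHIFT	30

/*
 * Hypothetical helper: build the 32-bit word written to the data
 * register to start a read of register reg on PHY phy_id.
 * Masking to 5 bits is an added safety assumption.
 */
static uint32_t mdio_read_cmd(unsigned int phy_id, unsigned int reg)
{
	return ((uint32_t)MII_DATA_TA_VAL << MII_DATA_TA_SHIFT) |
	       ((uint32_t)(reg & 0x1f) << MII_DATA_RA_SHIFT) |
	       ((uint32_t)(phy_id & 0x1f) << MII_DATA_PA_SHIFT) |
	       ((uint32_t)1 << MII_DATA_SB_SHIFT) |
	       ((uint32_t)MII_DATA_OP_READ << MII_DATA_OP_SHIFT);
}
```

For PHY address 1 and register 2 this packs to 0x608a0000: start bits in 31:30, opcode in 29:28, PHY address in 27:23, register address in 22:18 and turnaround in 17:16, leaving the low 16 bits for the data returned by the read.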
Re: [PATCH 3/4] net: phy: Add Broadcom phy library for common interfaces
Hi Arun,

[auto build test results on v4.3-rc3 -- if it's inappropriate base, please ignore]

config: i386-randconfig-i1-201539 (attached as .config)
reproduce:
        git checkout 25a633b2114806a7ce7d4f171c4714880e2c721b
        # save the attached .config to linux build tree
        make ARCH=i386

All error/warnings (new ones prefixed by >>):

>> ERROR: "bcm_phy_config_intr" [drivers/net/phy/broadcom.ko] undefined!
>> ERROR: "bcm_phy_ack_intr" [drivers/net/phy/broadcom.ko] undefined!
>> ERROR: "bcm_phy_read_exp" [drivers/net/phy/broadcom.ko] undefined!
>> ERROR: "bcm_phy_write_exp" [drivers/net/phy/broadcom.ko] undefined!
>> ERROR: "bcm_phy_read_shadow" [drivers/net/phy/broadcom.ko] undefined!
>> ERROR: "bcm_phy_write_shadow" [drivers/net/phy/broadcom.ko] undefined!
>> ERROR: "bcm_phy_enable_apd" [drivers/net/phy/bcm7xxx.ko] undefined!
>> ERROR: "bcm_phy_enable_eee" [drivers/net/phy/bcm7xxx.ko] undefined!
>> ERROR: "bcm_phy_write_misc" [drivers/net/phy/bcm7xxx.ko] undefined!
>> ERROR: "bcm_phy_write_exp" [drivers/net/phy/bcm7xxx.ko] undefined!

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Re: [RFC PATCH 1/3] net: dsa: Use devm_ prefixed allocations
Hi Neil

I tested all three patches on a board with three switches.

1) Normal boot
2) Bad address set for the 3rd switch so that it was not found, so
   causing the probe to fail.

No regressions observed.

Tested-by: Andrew Lunn

As Florian said, this is going in the right direction for modular DSA,
but still quite a way to go...

Thanks
	Andrew

On Wed, Sep 30, 2015 at 10:21:08AM +0200, Neil Armstrong wrote:
> To simplify and prevent memory leakage when unbinding, use
> the devm_ memory allocation calls.
>
> Signed-off-by: Neil Armstrong
> ---
>  net/dsa/dsa.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
> index c59fa5d..98f94c2 100644
> --- a/net/dsa/dsa.c
> +++ b/net/dsa/dsa.c
> @@ -305,7 +305,7 @@ static int dsa_switch_setup_one(struct dsa_switch *ds, struct device *parent)
>  	if (ret < 0)
>  		goto out;
>
> -	ds->slave_mii_bus = mdiobus_alloc();
> +	ds->slave_mii_bus = devm_mdiobus_alloc(parent);
>  	if (ds->slave_mii_bus == NULL) {
>  		ret = -ENOMEM;
>  		goto out;
> @@ -400,7 +400,7 @@ dsa_switch_setup(struct dsa_switch_tree *dst, int index,
>  	/*
>  	 * Allocate and initialise switch state.
>  	 */
> -	ds = kzalloc(sizeof(*ds) + drv->priv_size, GFP_KERNEL);
> +	ds = devm_kzalloc(parent, sizeof(*ds) + drv->priv_size, GFP_KERNEL);
>  	if (ds == NULL)
>  		return ERR_PTR(-ENOMEM);
>
> @@ -883,7 +883,7 @@ static int dsa_probe(struct platform_device *pdev)
>  		goto out;
>  	}
>
> -	dst = kzalloc(sizeof(*dst), GFP_KERNEL);
> +	dst = devm_kzalloc(&pdev->dev, sizeof(*dst), GFP_KERNEL);
>  	if (dst == NULL) {
>  		dev_put(dev);
>  		ret = -ENOMEM;
> --
> 1.9.1
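For anyone unfamiliar with why device-managed allocations help on unbind, here is a toy user-space model of the idea, not the kernel's devres implementation. All names (toy_dev, toy_res, toy_devm_zalloc, toy_dev_release) are invented for this sketch: each allocation is linked to its owning device, and releasing the device frees everything at once, so error paths and remove paths need no per-pointer cleanup.

```c
#include <stddef.h>
#include <stdlib.h>

/* Header placed in front of every managed block; the union keeps the
 * payload that follows it suitably aligned for any object type. */
union toy_res {
	union toy_res *next;
	max_align_t align;
};

/* Toy device: just the head of its list of managed allocations. */
struct toy_dev {
	union toy_res *res;
};

/* Allocate zeroed memory whose lifetime is tied to dev. */
static void *toy_devm_zalloc(struct toy_dev *dev, size_t size)
{
	union toy_res *r = calloc(1, sizeof(*r) + size);

	if (!r)
		return NULL;
	r->next = dev->res;
	dev->res = r;
	return r + 1;	/* payload starts right after the header */
}

/* Free every allocation attached to dev; returns how many were freed. */
static int toy_dev_release(struct toy_dev *dev)
{
	int freed = 0;

	while (dev->res) {
		union toy_res *r = dev->res;

		dev->res = r->next;
		free(r);
		freed++;
	}
	return freed;
}
```

In this model a probe routine that fails halfway can simply return an error: whatever it had already allocated against the device is reclaimed by the single release call, which is the property the patch relies on.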
[PATCH 1/4] dt-bindings: net: Broadcom iProc MDIO bus driver device tree binding
Add device tree binding documentation for the Broadcom iProc MDIO bus
driver.

Signed-off-by: Arun Parameswaran
---
 .../devicetree/bindings/net/brcm,iproc-mdio.txt | 23 ++
 1 file changed, 23 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/brcm,iproc-mdio.txt

diff --git a/Documentation/devicetree/bindings/net/brcm,iproc-mdio.txt b/Documentation/devicetree/bindings/net/brcm,iproc-mdio.txt
new file mode 100644
index 000..689f97c
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/brcm,iproc-mdio.txt
@@ -0,0 +1,23 @@
+* Broadcom iProc MDIO bus controller
+
+Required properties:
+- compatible: should be "brcm,iproc-mdio"
+- reg: address and length of the register set for the MDIO interface
+- #size-cells: must be 1
+- #address-cells: must be 0
+
+Child nodes of this MDIO bus controller node are standard Ethernet PHY device
+nodes as described in Documentation/devicetree/bindings/net/phy.txt
+
+Example:
+
+mdio@0x18002000 {
+	compatible = "brcm,iproc-mdio";
+	reg = <0x18002000 0x8>;
+	#size-cells = <1>;
+	#address-cells = <0>;
+
+	enet-gphy@0 {
+		reg = <0>;
+	};
+};
--
2.5.2
DSA driver - how to glue to a PCI based NIC's mdio?
Greetings,

I'm working on adding DSA support for a PCIe expansion card (designed
by us) that has a common PCIe NIC connected via its mii-bus to a Marvell
MV88E6171. Because the NIC is a PCIe device, it has no device-tree
representation of its NIC or its mdio bus, but does register its mdio
bus with Linux.

Does anyone know how I would be able to specify the Ethernet device
and MDIO bus for this add-in card if it has no device-tree handle?

I can write a phy driver that gets probed by matching the MV88E6171
PHY ID when the NIC's mdio bus gets registered, yet even then I'm not
clear how to get hold of the netdev pointer given the mdio bus so as to
build a dsa_chip_data struct to register as a platform device. I'm also
not clear in this case how to verify whether the MV88E6171 is from 'my'
add-in card vs another card.

Perhaps the right approach is to program the NIC's EEPROM on our board
with a PCI_ID/DEVICE_ID of ours, add support for those IDs to the
NIC's driver, and within the NIC's driver create and register a dsa
platform device when our ID is encountered?

Thanks for any advice,

Tim

Tim Harvey - Principal Software Engineer
Gateworks Corporation - http://www.gateworks.com/
3026 S. Higuera St. San Luis Obispo CA 93401
805-781-2000
Re: DSA driver - how to glue to a PCI based NIC's mdio?
On Wed, Sep 30, 2015 at 01:44:52PM -0700, Tim Harvey wrote:
> Greetings,
>
> I'm working on adding DSA support for a PCIe expansion card (designed
> by us) that has common PCIe NIC connected via its mii-bus to a Marvell
> MV88E6171. Because the NIC is a PCIe device, it has no device-tree
> representation of its NIC or its mdio bus, but does register its mdio
> bus with Linux.

It is possible to represent PCIe devices in device tree. Take a look
at ePAPR. Is the PCIe host in DT?

> Perhaps the right approach is to program the NIC's EEPROM on our board
> with a PCI_ID/DEVICE_ID of ours, add support for those ID's to the
> NIC's driver, and within the NIC's driver create and register dsa
> platform device when our ID is encountered?

This sounds sensible. But I doubt you can add your DSA platform
information to the NIC's device driver. Better would be to have a
small shim driver which is loaded on your PCI_ID/DEVICE_ID. That would
instantiate the NIC driver, and insert a DSA platform device.

	Andrew
Re: DSA driver - how to glue to a PCI based NIC's mdio?
On Wed, Sep 30, 2015 at 2:12 PM, Andrew Lunn wrote:
> On Wed, Sep 30, 2015 at 01:44:52PM -0700, Tim Harvey wrote:
>> Greetings,
>>
>> I'm working on adding DSA support for a PCIe expansion card (designed
>> by us) that has common PCIe NIC connected via its mii-bus to a Marvell
>> MV88E6171. Because the NIC is a PCIe device, it has no device-tree
>> representation of its NIC or its mdio bus, but does register its mdio
>> bus with Linux.
>
> It is possible to represent PCIe devices in device tree. Take a look
> at ePAPR. Is the PCIe host in DT?

It is possible to represent PCI devices in device-tree, however not in
a dynamic or pluggable fashion - they have to be nested per bus/slot,
which defeats the purpose of dynamic enumeration.

>
>> Perhaps the right approach is to program the NIC's EEPROM on our board
>> with a PCI_ID/DEVICE_ID of ours, add support for those ID's to the
>> NIC's driver, and within the NIC's driver create and register dsa
>> platform device when our ID is encountered?
>
> This sounds sensible. But i doubt you can add your DSA platform
> information to the NIC's device driver. Better would be to have a
> small shim driver which is loaded on your PCI_ID/DEVICE_ID. That would
> instantiate the NIC driver, and insert a DSA platform device.

I was thinking of this as well, but then I would still need that shim
to know the netdevice that the driver I'm shimming creates, so I can't
figure out a way to do it without touching the PCI driver.

Tim
[PATCH 4/4] net: phy: Broadcom Cygnus internal Etherent PHY driver
Add support for the Broadcom Cygnus SoCs internal PHY's. The PHYs are 1000M/100M/10M capable with support for 'EEE' and 'APD' (Auto Power Down). This driver supports the following Broadcom Cygnus SoCs: - BCM583XX (BCM58300, BCM58302, BCM58303, BCM58305) - BCM113XX (BCM11300, BCM11320, BCM11350, BCM11360) The PHY's on these SoC's require some workarounds for stable operation, both during configuration time and during suspend/resume. This driver handles the application of the workarounds. Signed-off-by: Arun Parameswaran--- drivers/net/phy/Kconfig | 13 drivers/net/phy/Makefile | 1 + drivers/net/phy/bcm-cygnus.c | 162 +++ include/linux/brcmphy.h | 2 + 4 files changed, 178 insertions(+) create mode 100644 drivers/net/phy/bcm-cygnus.c diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig index 0c85a01..da979ec 100644 --- a/drivers/net/phy/Kconfig +++ b/drivers/net/phy/Kconfig @@ -79,6 +79,19 @@ config BROADCOM_PHY Currently supports the BCM5411, BCM5421, BCM5461, BCM54616S, BCM5464, BCM5481 and BCM5482 PHYs. +config BCM_CYGNUS_PHY + tristate "Drivers for Broadcom Cygnus SoC internal PHY" + depends on ARCH_BCM_CYGNUS || COMPILE_TEST + select BCM_NET_PHYLIB + select MDIO_BCM_IPROC + ---help--- + This PHY driver is for the 1G internal PHYs of the Broadcom + Cygnus Family SoC. + + Currently supports internal PHY's used in the BCM11300, + BCM11320, BCM11350, BCM11360, BCM58300, BCM58302, + BCM58303 & BCM58305 Broadcom Cygnus SoCs. 
+ config BCM63XX_PHY tristate "Drivers for Broadcom 63xx SOCs internal PHY" depends on BCM63XX diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile index 6932475..7655d47 100644 --- a/drivers/net/phy/Makefile +++ b/drivers/net/phy/Makefile @@ -17,6 +17,7 @@ obj-$(CONFIG_BROADCOM_PHY)+= broadcom.o obj-$(CONFIG_BCM63XX_PHY) += bcm63xx.o obj-$(CONFIG_BCM7XXX_PHY) += bcm7xxx.o obj-$(CONFIG_BCM87XX_PHY) += bcm87xx.o +obj-$(CONFIG_BCM_CYGNUS_PHY) += bcm-cygnus.o obj-$(CONFIG_ICPLUS_PHY) += icplus.o obj-$(CONFIG_REALTEK_PHY) += realtek.o obj-$(CONFIG_LSI_ET1011C_PHY) += et1011c.o diff --git a/drivers/net/phy/bcm-cygnus.c b/drivers/net/phy/bcm-cygnus.c new file mode 100644 index 000..28bab20 --- /dev/null +++ b/drivers/net/phy/bcm-cygnus.c @@ -0,0 +1,162 @@ +/* + * Copyright (C) 2015 Broadcom Corporation + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation version 2. + * + * This program is distributed "as is" WITHOUT ANY WARRANTY of any + * kind, whether express or implied; without even the implied warranty + * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +/* Broadcom Cygnus SoC internal transceivers support. 
*/ +#include "bcm-phy-lib.h" +#include +#include +#include +#include + +/* Broadcom Cygnus Phy specific registers */ +#define MII_BCM_CORE_BASE1E 0x1E /* Core BASE1E register */ +#define MII_BCM_EXPB0 0xB0 /* EXPB0 register */ +#define MII_BCM_EXPB1 0xB1 /* EXPB1 register */ + +#define MII_BCM_CYGNUS_AFE_VDAC_ICTRL_0 0x91E5 /* VDAL Control register */ + +static int bcm_cygnus_afe_config(struct phy_device *phydev) +{ + int rc; + + /* ensure smdspclk is enabled */ + rc = phy_write(phydev, MII_BCM54XX_AUX_CTL, 0x0c30); + if (rc < 0) + return rc; + + /* AFE_VDAC_ICTRL_0 bit 7:4 Iq=1100 for 1g 10bt, normal modes */ + rc = bcm_phy_write_misc(phydev, 0x39, 0x01, 0xA7C8); + if (rc < 0) + return rc; + + /* AFE_HPF_TRIM_OTHERS bit11=1, short cascode enable for all modes*/ + rc = bcm_phy_write_misc(phydev, 0x3A, 0x00, 0x0803); + if (rc < 0) + return rc; + + /* AFE_TX_CONFIG_1 bit 7:4 Iq=1100 for test modes */ + rc = bcm_phy_write_misc(phydev, 0x3A, 0x01, 0xA740); + if (rc < 0) + return rc; + + /* AFE TEMPSEN_OTHERS rcal_HT, rcal_LT 1 */ + rc = bcm_phy_write_misc(phydev, 0x3A, 0x03, 0x8400); + if (rc < 0) + return rc; + + /* AFE_FUTURE_RSV bit 2:0 rccal <2:0>=100 */ + rc = bcm_phy_write_misc(phydev, 0x3B, 0x00, 0x0004); + if (rc < 0) + return rc; + + /* Adjust bias current trim to overcome digital offSet */ + rc = phy_write(phydev, MII_BCM_CORE_BASE1E, 0x02); + if (rc < 0) + return rc; + + /* make rcal=100, since rdb default is 000 */ + rc = bcm_phy_write_exp(phydev, MII_BCM_EXPB1, 0x10); + if (rc < 0) + return rc; + + /* CORE_EXPB0, Reset R_CAL/RC_CAL Engine */ + rc = bcm_phy_write_exp(phydev, MII_BCM_EXPB0, 0x10); + if
[PATCH 3/4] net: phy: Add Broadcom phy library for common interfaces
This patch adds the Broadcom phy library to consolidate common interfaces shared by Broadcom phy's. Moved the common interfaces to the 'bcm-phy-lib.c' and updated the Broadcom PHY drivers to use the new APIs. Signed-off-by: Arun Parameswaran--- drivers/net/phy/Kconfig | 6 ++ drivers/net/phy/Makefile | 1 + drivers/net/phy/bcm-phy-lib.c | 209 ++ drivers/net/phy/bcm-phy-lib.h | 37 drivers/net/phy/bcm63xx.c | 38 +--- drivers/net/phy/bcm7xxx.c | 127 ++--- drivers/net/phy/broadcom.c| 149 +- include/linux/brcmphy.h | 22 + 8 files changed, 333 insertions(+), 256 deletions(-) create mode 100644 drivers/net/phy/bcm-phy-lib.c create mode 100644 drivers/net/phy/bcm-phy-lib.h diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig index b57f6c2..0c85a01 100644 --- a/drivers/net/phy/Kconfig +++ b/drivers/net/phy/Kconfig @@ -69,8 +69,12 @@ config SMSC_PHY ---help--- Currently supports the LAN83C185, LAN8187 and LAN8700 PHYs +config BCM_NET_PHYLIB + bool + config BROADCOM_PHY tristate "Drivers for Broadcom PHYs" + select BCM_NET_PHYLIB ---help--- Currently supports the BCM5411, BCM5421, BCM5461, BCM54616S, BCM5464, BCM5481 and BCM5482 PHYs. @@ -78,11 +82,13 @@ config BROADCOM_PHY config BCM63XX_PHY tristate "Drivers for Broadcom 63xx SOCs internal PHY" depends on BCM63XX + select BCM_NET_PHYLIB ---help--- Currently supports the 6348 and 6358 PHYs. config BCM7XXX_PHY tristate "Drivers for Broadcom 7xxx SOCs internal PHYs" + select BCM_NET_PHYLIB ---help--- Currently supports the BCM7366, BCM7439, BCM7445, and 40nm and 65nm generation of BCM7xxx Set Top Box SoCs. 
diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
index f4e6eb9..6932475 100644
--- a/drivers/net/phy/Makefile
+++ b/drivers/net/phy/Makefile
@@ -12,6 +12,7 @@ obj-$(CONFIG_QSEMI_PHY) += qsemi.o
 obj-$(CONFIG_SMSC_PHY)		+= smsc.o
 obj-$(CONFIG_TERANETICS_PHY)	+= teranetics.o
 obj-$(CONFIG_VITESSE_PHY)	+= vitesse.o
+obj-$(CONFIG_BCM_NET_PHYLIB)	+= bcm-phy-lib.o
 obj-$(CONFIG_BROADCOM_PHY)	+= broadcom.o
 obj-$(CONFIG_BCM63XX_PHY)	+= bcm63xx.o
 obj-$(CONFIG_BCM7XXX_PHY)	+= bcm7xxx.o
diff --git a/drivers/net/phy/bcm-phy-lib.c b/drivers/net/phy/bcm-phy-lib.c
new file mode 100644
index 000..13e161e
--- /dev/null
+++ b/drivers/net/phy/bcm-phy-lib.c
@@ -0,0 +1,209 @@
+/*
+ * Copyright (C) 2015 Broadcom Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation version 2.
+ *
+ * This program is distributed "as is" WITHOUT ANY WARRANTY of any
+ * kind, whether express or implied; without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include "bcm-phy-lib.h"
+#include
+#include
+#include
+#include
+
+#define MII_BCM_CHANNEL_WIDTH	0x2000
+#define BCM_CL45VEN_EEE_ADV	0x3c
+
+int bcm_phy_write_exp(struct phy_device *phydev, u16 reg, u16 val)
+{
+	int rc;
+
+	rc = phy_write(phydev, MII_BCM54XX_EXP_SEL, reg);
+	if (rc < 0)
+		return rc;
+
+	return phy_write(phydev, MII_BCM54XX_EXP_DATA, val);
+}
+EXPORT_SYMBOL_GPL(bcm_phy_write_exp);
+
+int bcm_phy_read_exp(struct phy_device *phydev, u16 reg)
+{
+	int val;
+
+	val = phy_write(phydev, MII_BCM54XX_EXP_SEL, reg);
+	if (val < 0)
+		return val;
+
+	val = phy_read(phydev, MII_BCM54XX_EXP_DATA);
+
+	/* Restore default value. It's O.K. if this write fails.
*/
+	phy_write(phydev, MII_BCM54XX_EXP_SEL, 0);
+
+	return val;
+}
+EXPORT_SYMBOL_GPL(bcm_phy_read_exp);
+
+int bcm_phy_write_misc(struct phy_device *phydev,
+		       u16 reg, u16 chl, u16 val)
+{
+	int rc;
+	int tmp;
+
+	rc = phy_write(phydev, MII_BCM54XX_AUX_CTL,
+		       MII_BCM54XX_AUXCTL_SHDWSEL_MISC);
+	if (rc < 0)
+		return rc;
+
+	tmp = phy_read(phydev, MII_BCM54XX_AUX_CTL);
+	tmp |= MII_BCM54XX_AUXCTL_ACTL_SMDSP_ENA;
+	rc = phy_write(phydev, MII_BCM54XX_AUX_CTL, tmp);
+	if (rc < 0)
+		return rc;
+
+	tmp = (chl * MII_BCM_CHANNEL_WIDTH) | reg;
+	rc = bcm_phy_write_exp(phydev, tmp, val);
+
+	return rc;
+}
+EXPORT_SYMBOL_GPL(bcm_phy_write_misc);
+
+int bcm_phy_read_misc(struct phy_device *phydev,
+		      u16 reg, u16 chl)
+{
+	int rc;
+	int tmp;
+
+	rc = phy_write(phydev, MII_BCM54XX_AUX_CTL,
+		       MII_BCM54XX_AUXCTL_SHDWSEL_MISC);
+	if (rc < 0)
+		return rc;
Re: Problem with ICMP rate limiting and redirects
On Wed, 2015-09-30 at 15:13 -0300, Hugo Vasconcelos Saldanha wrote:
> Hi Eric,
>
> On Wed, Sep 30, 2015 at 1:42 PM, Eric Dumazet wrote:
> > On Wed, 2015-09-30 at 13:10 -0300, Hugo Vasconcelos Saldanha wrote:
> >> Hi,
> >>
> >> While updating the kernel from v3.2 to v3.14, I started to see a
> >> different behavior concerning ICMP redirects sent by this updated
> >> server. The network is somewhat configured like this:
> >>
> >>            ---|firewall|{Internet}
> >> |client|--|
> >>           |
> >>            ---|router|--|172.16/12 network|
> >>
> >> The client's default gateway is 'firewall', which is the updated
> >> server. It has a static route to 172.16 network by 'router'. If
> >> 'client' wants to talk to a server in that network, 'firewall' sends a
> >> ICMP redirect pointing to router as the gateway.
> >>
> >> This worked fine with v3.2. But after the upgrade, if an ICMP message
> >> that is rate-limited (by the sysctl_icmp_ratelimit mask) is sent to
> >> 'client', ICMP redirects stop being sent to the same client. This
> >> happens, for example, when traceroute'ing from the client to the
> >> server inside the mentioned network. In this situation, a ICMP Time
> >> Exceeded message is sent in response to traceroute's first packet, but
> >> then the following packets never generate any ICMP redirect messages
> >> in 'firewall'.
> >>
> >> Debugging the code, I was able to see that the problem is being caused
> >> by the fact that ip_rt_send_redirect() started to use the inetpeer
> >> cache and the fields used to rate limit ICMP redirects (rate_tokens
> >> and rate_last) are now being shared with the algorithm applied in
> >> inet_peer_xrlim_allow(). This never happened with v3.2 because
> >> apparently inet_peer_xrlim_allow() and ip_rt_send_redirect() used
> >> different inetpeer objects.
> >>
> >> The reason why this breaks the functionality is that, while
> >> inet_peer_xrlim_allow() uses a time bucket, ip_rt_send_redirect() uses
> >> rate_tokens as a packet counter.
> >> Not to mention the fact that these
> >> are two completely different policies which should be controlled by
> >> different buckets, counters, flags, etc. Because of this,
> >> ip_rt_redirect_silence, ip_rt_redirect_number and ip_rt_redirect_load
> >> /proc files are broken also.
> >>
> >> The easiest solution would be to create new fields in 'struct
> >> inetpeer' to control ICMP redirects only, but I'm not able to measure
> >> its convenience.
> >>
> >> Any thoughts?
> >>
> >> PS: Apparently, a similar problem was reported here:
> >> http://marc.info/?l=linux-netdev=139696540600985
> >>
> >> PS2: I could try to reproduce the problem with the latest code if this
> >> is really necessary.
> >
> > Hmm... Do you have commit
> >
> > 4cdf507d54525842dfd9f6313fdafba039084046
> > ("icmp: add a global rate limitation")
> > in your kernel ?
> >
>
> No, but i just tested it and problem continues. AFAICT, ICMP redirects
> shouldn't be limited by the logic implemented by that patch, at least
> with default icmp_ratemask. And the algorithm in ip_rt_send_redirect()
> has a different purpose, too.

OK thanks.

I guess I also gave the commit to give a hint why relying on inetpeer
might open doors for DDOS.

Note that if we really want to send millions of ICMP messages per
second, we might extend idea and infra added in commit
04ca6973f7c1a ("ip: make IP identifiers less predictable") :
add a token bucket in the ip_idents hash and no longer rely on inetpeer.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
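The interference described in this thread can be reproduced in a small
user-space model. The sketch below is not kernel code: the struct, the
model_* helper names and the constants (tick rate, burst factor, redirect
limits) are assumptions chosen for illustration, loosely following the
description of inet_peer_xrlim_allow() and ip_rt_send_redirect() given in
the report. It only aims to show how a time-based token bucket and a
packet counter disturb each other once they share rate_tokens and
rate_last.

```c
#include <stdbool.h>

/* Simplified stand-in for the two rate-limit fields of an inetpeer entry. */
struct peer_model {
	unsigned long rate_tokens;
	unsigned long rate_last;
};

/* Assumed values for illustration only; ticks play the role of jiffies. */
enum {
	MODEL_HZ = 1000,
	MODEL_BURST_FACTOR = 6,
	MODEL_REDIRECT_NUMBER = 9,
	MODEL_REDIRECT_LOAD = MODEL_HZ / 50,
	MODEL_REDIRECT_SILENCE = (MODEL_HZ / 50) << 10,
};

/* Time-based token bucket: rate_tokens accumulates elapsed ticks. */
static bool model_xrlim_allow(struct peer_model *p, unsigned long now,
			      unsigned long timeout)
{
	unsigned long token = p->rate_tokens + (now - p->rate_last);
	bool allowed = false;

	p->rate_last = now;
	if (token > MODEL_BURST_FACTOR * timeout)
		token = MODEL_BURST_FACTOR * timeout;
	if (token >= timeout) {
		token -= timeout;
		allowed = true;
	}
	p->rate_tokens = token;
	return allowed;
}

/* Redirect limiter: rate_tokens counts the redirects already sent. */
static bool model_redirect_allow(struct peer_model *p, unsigned long now)
{
	/* A long enough quiet period resets the counter. */
	if (now - p->rate_last > MODEL_REDIRECT_SILENCE)
		p->rate_tokens = 0;

	/* Too many redirects were ignored: stay silent, refresh the timer. */
	if (p->rate_tokens >= MODEL_REDIRECT_NUMBER) {
		p->rate_last = now;
		return false;
	}

	/* Exponential back-off between successive redirects. */
	if (p->rate_tokens == 0 ||
	    now - p->rate_last >
	    ((unsigned long)MODEL_REDIRECT_LOAD << p->rate_tokens)) {
		p->rate_last = now;
		p->rate_tokens++;
		return true;
	}
	return false;
}
```

In this model, a single rate-limited ICMP reply on a shared entry leaves
rate_tokens at a few thousand ticks, which the redirect path then reads as
"too many redirects already sent" and keeps refreshing rate_last for as
long as traffic flows. With a separate entry for redirects, the first
redirect goes out as expected.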
Re: DSA driver - how to glue to a PCI based NIC's mdio?
On 30/09/15 14:27, Tim Harvey wrote:
> On Wed, Sep 30, 2015 at 2:12 PM, Andrew Lunn wrote:
>> On Wed, Sep 30, 2015 at 01:44:52PM -0700, Tim Harvey wrote:
>>> Greetings,
>>>
>>> I'm working on adding DSA support for a PCIe expansion card (designed
>>> by us) that has common PCIe NIC connected via its mii-bus to a Marvell
>>> MV88E6171. Because the NIC is a PCIe device, it has no device-tree
>>> representation of its NIC or its mdio bus, but does register its mdio
>>> bus with Linux.
>>
>> It is possible to represent PCIe devices in device tree. Take a look
>> at ePAPR. Is the PCIe host in DT?
>
> It is possible to represent PCI devices in device-tree however not in
> a dynamic or plug-able fashion - they have to be nested per bus/slot
> which defeats the purpose of dynamic enumeration.

Even though a bus is completely auto-discoverable, if additional
information is needed to supplement that topology, representing it in
Device Tree is typically accepted.

>
>>
>>> Perhaps the right approach is to program the NIC's EEPROM on our board
>>> with a PCI_ID/DEVICE_ID of ours, add support for those ID's to the
>>> NIC's driver, and within the NIC's driver create and register dsa
>>> platform device when our ID is encountered?
>>
>> This sounds sensible. But i doubt you can add your DSA platform
>> information to the NIC's device driver. Better would be to have a
>> small shim driver which is loaded on your PCI_ID/DEVICE_ID. That would
>> instantiate the NIC driver, and insert a DSA platform device.
>
> I was thinking of this as well, but then I would still need that shim
> to know the netdevice that the driver I'm shimming creates so I can't
> figure a way to do it without touching the PCI driver.

You can register a network device notifier and extract the information
you need about that network device once you see it being registered.
As an example, there is a loopback/fake DSA switch driver here which uses the loopback interface as a parent network device (NB: this is using the network device name, which is pretty lame, but that does the job): https://github.com/ffainelli/linux/commit/67d1db45d17f8cc3b32d7a46c49d5df736cee56c -- Florian
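The notifier approach suggested in this reply can be sketched outside the
kernel as a plain callback chain. Everything below is a hypothetical
user-space model (the model_* types, event codes and functions are made
up for illustration); in a real driver the corresponding pieces would be
register_netdevice_notifier() and the NETDEV_REGISTER event. The shim
matches on the device name, as the linked example is said to do, and
remembers the parent device once it shows up.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical event codes, standing in for the kernel's netdev events. */
enum model_event { MODEL_DEV_REGISTER, MODEL_DEV_UNREGISTER };

/* Hypothetical, minimal stand-in for a network device. */
struct model_netdev {
	const char *name;
};

struct model_notifier {
	int (*call)(struct model_notifier *nb, enum model_event ev,
		    struct model_netdev *dev);
	struct model_notifier *next;
};

static struct model_notifier *model_chain;

/* Add a notifier to the head of the chain. */
static void model_register_notifier(struct model_notifier *nb)
{
	nb->next = model_chain;
	model_chain = nb;
}

/* Called by the modelled core whenever a device appears or goes away. */
static void model_notify(enum model_event ev, struct model_netdev *dev)
{
	struct model_notifier *nb;

	for (nb = model_chain; nb; nb = nb->next)
		nb->call(nb, ev, dev);
}

/* Shim state: the parent device the switch would attach to, once seen. */
struct model_shim {
	struct model_notifier nb;	/* first member, so nb casts back to the shim */
	const char *wanted_name;
	struct model_netdev *parent;
};

static int model_shim_event(struct model_notifier *nb, enum model_event ev,
			    struct model_netdev *dev)
{
	struct model_shim *shim = (struct model_shim *)nb;

	/* Ignore devices other than the one the shim is waiting for. */
	if (strcmp(dev->name, shim->wanted_name) != 0)
		return 0;

	if (ev == MODEL_DEV_REGISTER)
		shim->parent = dev;
	else if (ev == MODEL_DEV_UNREGISTER)
		shim->parent = NULL;
	return 0;
}
```

Matching on the name keeps the shim independent of the NIC driver, at the
cost of being fragile if interfaces are renamed; that trade-off is the
"pretty lame" part acknowledged in the reply.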
Re: [PATCH v2 net-next 4/6] net: switchdev: pass callback to dump operation
Tue, Sep 29, 2015 at 06:07:16PM CEST, vivien.dide...@savoirfairelinux.com wrote:
>Similar to the notifier_call callback of a notifier_block, change the
>function signature of switchdev dump operation to:
>
>int switchdev_port_obj_dump(struct net_device *dev,
>                            enum switchdev_obj_id id, void *obj,
>                            int (*cb)(void *obj));
>
>This allows the caller to pass and expect back a specific
>switchdev_obj_* structure instead of the generic switchdev_obj one.
>
>Drivers implementation of dump operation can now expect this specific
>structure and call the callback with it. Drivers have been changed
>accordingly.
>
>Signed-off-by: Vivien Didelot
>---
> drivers/net/ethernet/rocker/rocker.c | 21 +
> include/net/switchdev.h              |  9 +---
> net/dsa/slave.c                      | 26 +++--
> net/switchdev/switchdev.c            | 45 ++--
> 4 files changed, 53 insertions(+), 48 deletions(-)
>
>diff --git a/drivers/net/ethernet/rocker/rocker.c
>b/drivers/net/ethernet/rocker/rocker.c
>index 78fd443..107adb6 100644
>--- a/drivers/net/ethernet/rocker/rocker.c
>+++ b/drivers/net/ethernet/rocker/rocker.c
>@@ -4538,10 +4538,10 @@ static int rocker_port_obj_del(struct net_device *dev,
> }
>
> static int rocker_port_fdb_dump(const struct rocker_port *rocker_port,
>-				struct switchdev_obj *obj)
>+				struct switchdev_obj_fdb *fdb,
>+				int (*cb)(void *obj))

we should have some typedef for this.
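The typedef asked for in this review could look roughly like the sketch
below. It is a user-space illustration only: the type name obj_dump_cb_t,
the fdb_entry_model struct and the fdb_dump_model() helper are
hypothetical and not taken from the patch. The point is that a named
function type replaces the repeated "int (*cb)(void *obj)" in every dump
signature.

```c
#include <stddef.h>

/* Hypothetical name for the callback type; not taken from the patch. */
typedef int obj_dump_cb_t(void *obj);

/* Illustrative stand-in for a specific object such as an FDB entry. */
struct fdb_entry_model {
	unsigned char addr[6];
	unsigned short vid;
};

/*
 * Walk a table, copy each entry into the caller-provided object and hand
 * it to cb. Stops and returns the callback's value on the first error.
 */
static int fdb_dump_model(const struct fdb_entry_model *table, size_t n,
			  struct fdb_entry_model *fdb, obj_dump_cb_t *cb)
{
	size_t i;
	int err;

	for (i = 0; i < n; i++) {
		*fdb = table[i];
		err = cb(fdb);
		if (err)
			return err;
	}
	return 0;
}

/* Example callback: counts entries and remembers the last VLAN id seen. */
static int seen_count;
static unsigned short last_vid;

static int count_cb(void *obj)
{
	struct fdb_entry_model *fdb = obj;

	seen_count++;
	last_vid = fdb->vid;
	return 0;
}
```

Declaring the typedef as a function type rather than a pointer type keeps
the "*" visible at the use site (obj_dump_cb_t *cb), which is a common
convention in kernel code; a pointer typedef would work equally well here.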
[PATCH v5 01/22] net/xen-netback: xenvif_gop_frag_copy: move GSO check out of the loop
The skb doesn't change within the function. Therefore it's only necessary
to check if we need GSO once at the beginning.

Signed-off-by: Julien Grall
Acked-by: Wei Liu

---
Cc: Ian Campbell
Cc: netdev@vger.kernel.org

    Changes in v4:
        - Add Wei's acked

    Changes in v2:
        - Patch added
---
 drivers/net/xen-netback/netback.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index ec98d43..c4e6c02 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -288,6 +288,13 @@ static void xenvif_gop_frag_copy(struct xenvif_queue *queue, struct sk_buff *skb
 	unsigned long bytes;
 	int gso_type = XEN_NETIF_GSO_TYPE_NONE;
 
+	if (skb_is_gso(skb)) {
+		if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4)
+			gso_type = XEN_NETIF_GSO_TYPE_TCPV4;
+		else if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6)
+			gso_type = XEN_NETIF_GSO_TYPE_TCPV6;
+	}
+
 	/* Data must not cross a page boundary. */
 	BUG_ON(size + offset > PAGE_SIZE<[...]
[...]
-		if (skb_is_gso(skb)) {
-			if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4)
-				gso_type = XEN_NETIF_GSO_TYPE_TCPV4;
-			else if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6)
-				gso_type = XEN_NETIF_GSO_TYPE_TCPV6;
-		}
-
 		if (*head && ((1 << gso_type) & queue->vif->gso_mask))
 			queue->rx.req_cons++;
-- 
2.1.4
[PATCH] amd-xgbe: fix potential memory leak in xgbe-debugfs
Added kfree() to avoid the memory leak when debugfs_create_dir() fails.

Signed-off-by: Geliang Tang
---
 drivers/net/ethernet/amd/xgbe/xgbe-debugfs.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-debugfs.c b/drivers/net/ethernet/amd/xgbe/xgbe-debugfs.c
index 2c063b6..66137ff 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-debugfs.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-debugfs.c
@@ -330,6 +330,7 @@ void xgbe_debugfs_init(struct xgbe_prv_data *pdata)
 	pdata->xgbe_debugfs = debugfs_create_dir(buf, NULL);
 	if (!pdata->xgbe_debugfs) {
 		netdev_err(pdata->netdev, "debugfs_create_dir failed\n");
+		kfree(buf);
 		return;
 	}
 
-- 
1.9.1
Re: [PATCH] net ipv4: use preferred log methods
Hi,

On 09/30/2015 11:20 AM, Bastian Stender wrote:
> Replace printk calls with preferred unconditional log method calls to keep
> kernel messages clean.
>
> Signed-off-by: Bastian Stender
> ---
>  net/ipv4/ipconfig.c                            | 53 +-
>  net/ipv4/netfilter/arp_tables.c                | 17 +
>  net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c |  2 +-
>  net/ipv4/netfilter/nf_nat_snmp_basic.c         | 31 ---
>  4 files changed, 36 insertions(+), 67 deletions(-)

Please ignore my previous patch. I'll test it again and resubmit it.

Thanks for your patience.

Regards,
Bastian Stender