Re: linux-next: build failure after merge of the rcu tree
Hi Paul, On Sun, 12 Feb 2017 20:37:48 -0800 "Paul E. McKenney" wrote: > > I chickened out on that commit for this merge window, so it will come > back at -rc1. But I will cover that when I rebase to -rc1. OK, thanks. -- Cheers, Stephen Rothwell
Re: [PATCH] Make EN2 pin optional in the TRF7970A driver
Hello Rob,

On 10.02.2017 at 16:51, Rob Herring wrote:
> On Tue, Feb 07, 2017 at 06:22:04AM +0100, Heiko Schocher wrote:
>> From: Guan Ben
>>
>> Make the EN2 pin optional. This is useful for boards, which
>> have this pin fix wired, for example to ground.
>>
>> Signed-off-by: Guan Ben
>> Signed-off-by: Mark Jonas
>> Signed-off-by: Heiko Schocher
>> ---
>>  .../devicetree/bindings/net/nfc/trf7970a.txt | 4 ++--
>>  drivers/nfc/trf7970a.c | 26 --
>>  2 files changed, 16 insertions(+), 14 deletions(-)
>>
>> diff --git a/Documentation/devicetree/bindings/net/nfc/trf7970a.txt b/Documentation/devicetree/bindings/net/nfc/trf7970a.txt
>> index 32b35a0..5889a3d 100644
>> --- a/Documentation/devicetree/bindings/net/nfc/trf7970a.txt
>> +++ b/Documentation/devicetree/bindings/net/nfc/trf7970a.txt
>> @@ -5,8 +5,8 @@ Required properties:
>>  - spi-max-frequency: Maximum SPI frequency (<= 200).
>>  - interrupt-parent: phandle of parent interrupt handler.
>>  - interrupts: A single interrupt specifier.
>> -- ti,enable-gpios: Two GPIO entries used for 'EN' and 'EN2' pins on the
>> -  TRF7970A.
>> +- ti,enable-gpios: One or two GPIO entries used for 'EN' and 'EN2' pins on the
>> +  TRF7970A. EN2 is optional.
>
> Could EN ever be optional/fixed? If so, perhaps deprecate this property
> and do 2 properties, one for each pin.

The hardware I have has the EN2 pin hard-wired to ground.

Looking into http://www.ti.com/lit/ds/slos743k/slos743k.pdf, page 19,
tables 6-3 and 6-4, the EN2 pin is a don't care if EN = 1. If EN = 0,
the EN2 pin selects between Power Down and Sleep Mode ...

I see no reason why this is not possible/allowed ...

Hmm.. I do not like the idea of deprecating the "ti,enable-gpios" property
in favor of 2 separate properties ... but if this would be a reason for not
accepting this patch, I can do this ...

How should I name the 2 new properties? "ti,pin-enable" and "ti,pin-enable2" ?

bye,
Heiko
--
DENX Software Engineering GmbH, Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
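For illustration only: the "one or two GPIO entries" wording in the binding could translate into a check like the following. This is a minimal userspace sketch in C, not code from drivers/nfc/trf7970a.c; `struct en_pins` and `en_pins_from_count()` are hypothetical names, and the assumption is simply that the first entry is EN and an optional second entry is EN2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result of looking at the "ti,enable-gpios" list. */
struct en_pins {
	bool has_en;	/* first entry: EN, always expected */
	bool has_en2;	/* second entry: EN2, present only if listed */
};

/*
 * Hypothetical helper: decide which pins the driver controls from the
 * number of GPIO entries. Returns 0 on success, -1 for an unsupported
 * count (none, or more than two).
 */
int en_pins_from_count(size_t count, struct en_pins *out)
{
	assert(out != NULL);

	if (count < 1 || count > 2)
		return -1;

	out->has_en = true;
	out->has_en2 = (count == 2);
	return 0;
}
```

With a single entry the sketch reports that only EN is driven, which matches a board where EN2 is tied to ground and never toggled by software.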
Re: [PATCH V5 for-next 16/21] RDMA/bnxt_re: Support poll_cq verb
On Mon, Feb 13, 2017 at 10:47:10AM +0530, Selvin Xavier wrote:
> On Sun, Feb 12, 2017 at 8:00 PM, Leon Romanovsky wrote:
> >> +static u8 __rc_to_ib_wc_status(u8 qstatus)
> >> +{
> >> +	switch (qstatus) {
> >> +	case CQ_RES_RC_STATUS_OK:
> >> +		return IB_WC_SUCCESS;
> >> +	case CQ_RES_RC_STATUS_LOCAL_ACCESS_ERROR:
> >> +		return IB_WC_LOC_ACCESS_ERR;
> >> +	case CQ_RES_RC_STATUS_LOCAL_LENGTH_ERR:
> >> +		return IB_WC_LOC_LEN_ERR;
> >> +	case CQ_RES_RC_STATUS_LOCAL_PROTECTION_ERR:
> >> +		return IB_WC_LOC_PROT_ERR;
> >> +	case CQ_RES_RC_STATUS_LOCAL_QP_OPERATION_ERR:
> >> +		return IB_WC_LOC_QP_OP_ERR;
> >> +	case CQ_RES_RC_STATUS_MEMORY_MGT_OPERATION_ERR:
> >> +		return IB_WC_GENERAL_ERR;
> >> +	case CQ_RES_RC_STATUS_REMOTE_INVALID_REQUEST_ERR:
> >> +		return IB_WC_REM_INV_REQ_ERR;
> >> +	case CQ_RES_RC_STATUS_WORK_REQUEST_FLUSHED_ERR:
> >> +		return IB_WC_WR_FLUSH_ERR;
> >> +	case CQ_RES_RC_STATUS_HW_FLUSH_ERR:
> >> +		return IB_WC_WR_FLUSH_ERR;
> >> +	default:
> >> +		return IB_WC_GENERAL_ERR;
> >> +	}
> >> +}
> >> +
> >
> > Why don't you use these defines directly?
>
> CQ_RES* values are returned by the HW and these values are
> different from the corresponding IB_WC status values. For example,
> CQ_RES_RC_STATUS_HW_FLUSH_ERR is 8 whereas
> IB_WC_WR_FLUSH_ERR is 5.
> So we thought it is better to map these values in a function rather
> than having a switch/case in the calling function.
>
> Let me know if you meant something different in your query.

Thanks,

This from_u8 -> to_u8 conversion confused me, because of our similar function
mlx5_handle_error_cqe() which updates wc->status at the same time as it is
called. So I expected to see something similar in your code where you fill wc.

Reviewed-by: Leon Romanovsky
Re: [PATCH V5 for-next 16/21] RDMA/bnxt_re: Support poll_cq verb
On Sun, Feb 12, 2017 at 8:00 PM, Leon Romanovsky wrote:
>> +static u8 __rc_to_ib_wc_status(u8 qstatus)
>> +{
>> +	switch (qstatus) {
>> +	case CQ_RES_RC_STATUS_OK:
>> +		return IB_WC_SUCCESS;
>> +	case CQ_RES_RC_STATUS_LOCAL_ACCESS_ERROR:
>> +		return IB_WC_LOC_ACCESS_ERR;
>> +	case CQ_RES_RC_STATUS_LOCAL_LENGTH_ERR:
>> +		return IB_WC_LOC_LEN_ERR;
>> +	case CQ_RES_RC_STATUS_LOCAL_PROTECTION_ERR:
>> +		return IB_WC_LOC_PROT_ERR;
>> +	case CQ_RES_RC_STATUS_LOCAL_QP_OPERATION_ERR:
>> +		return IB_WC_LOC_QP_OP_ERR;
>> +	case CQ_RES_RC_STATUS_MEMORY_MGT_OPERATION_ERR:
>> +		return IB_WC_GENERAL_ERR;
>> +	case CQ_RES_RC_STATUS_REMOTE_INVALID_REQUEST_ERR:
>> +		return IB_WC_REM_INV_REQ_ERR;
>> +	case CQ_RES_RC_STATUS_WORK_REQUEST_FLUSHED_ERR:
>> +		return IB_WC_WR_FLUSH_ERR;
>> +	case CQ_RES_RC_STATUS_HW_FLUSH_ERR:
>> +		return IB_WC_WR_FLUSH_ERR;
>> +	default:
>> +		return IB_WC_GENERAL_ERR;
>> +	}
>> +}
>> +
>
> Why don't you use these defines directly?

CQ_RES* values are returned by the HW and these values are
different from the corresponding IB_WC status values. For example,
CQ_RES_RC_STATUS_HW_FLUSH_ERR is 8 whereas
IB_WC_WR_FLUSH_ERR is 5.
So we thought it is better to map these values in a function rather
than having a switch/case in the calling function.

Let me know if you meant something different in your query.
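To make the point about differing numeric values concrete, here is a small standalone C sketch, assuming a stripped-down pair of enums. Only the two values quoted in this thread (hardware flush error 8, verbs flush error 5) come from the discussion; every name below is hypothetical and the "general error" value is a placeholder, not the kernel's definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hardware completion codes; only the value 8 is from the thread. */
enum hw_status {
	HW_STATUS_OK = 0,
	HW_STATUS_HW_FLUSH_ERR = 8,
};

/* Hypothetical verbs-level codes; only the value 5 is from the thread. */
enum wc_status {
	WC_SUCCESS = 0,
	WC_WR_FLUSH_ERR = 5,
	WC_GENERAL_ERR = 0xff,	/* placeholder value for this sketch */
};

/* Hypothetical work completion holding just the status field. */
struct wc_model {
	uint8_t status;
};

/* Translate a hardware code; passing it through unchanged would be wrong. */
uint8_t hw_to_wc_status(uint8_t hw)
{
	switch (hw) {
	case HW_STATUS_OK:
		return WC_SUCCESS;
	case HW_STATUS_HW_FLUSH_ERR:
		return WC_WR_FLUSH_ERR;
	default:
		return WC_GENERAL_ERR;
	}
}

/* The caller stays free of switch/case and only stores the result. */
void fill_wc(struct wc_model *wc, uint8_t hw)
{
	assert(wc != NULL);
	wc->status = hw_to_wc_status(hw);
}
```

Keeping the translation in its own function means the polling path only assigns the returned value, which is the design choice described above.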
Re: linux-next: build failure after merge of the rcu tree
On Mon, Feb 13, 2017 at 01:21:33PM +1100, Stephen Rothwell wrote: > Hi Paul, > > On Thu, 19 Jan 2017 13:54:37 -0800 Paul McKenney wrote: > > > > On Wed, Jan 18, 2017 at 7:34 PM, Stephen Rothwell > > wrote: > > > Hi Paul, > > > > > > After merging the rcu tree, today's linux-next build (x86_64 allmodconfig) > > > failed like this: > > > > > > net/smc/af_smc.c:102:16: error: 'SLAB_DESTROY_BY_RCU' undeclared here > > > (not in a function) > > > .slab_flags = SLAB_DESTROY_BY_RCU, > > > ^ > > > > > > Caused by commit > > > > > > c7a545924ca1 ("mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU") > > > > > > interacting with commit > > > > > > ac7138746e14 ("smc: establish new socket family") > > > > > > from the net-next tree. > > > > > > I have applied the following merge fix patch (someone will need to > > > remember to mention this to Linus): > > > > Thank you, Stephen! I expect that there might be a bit more > > bikeshedding on the name, but here is hoping... :-/ > > The need for this merge fix patch has gone away today. Is that a > permanent situation, or will it come back? I chickened out on that commit for this merge window, so it will come back at -rc1. But I will cover that when I rebase to -rc1. Thanx, Paul
Re: [PATCH 3/3] Bluetooth: hidp: fix possible might sleep error in hidp_session_thread
Hi Brian,

On 02/11/2017 09:26 AM, Brian Norris wrote:
> Hi Jeffy,
>
> I'm really not an expert on bluetooth or HIDP, but I can't bring myself
> to say that this is correct. I still think you have a problem.
>
> On Tue, Jan 24, 2017 at 12:07:51PM +0800, Jeffy Chen wrote:
>> It looks like hidp_session_thread has same pattern as the issue reported in
>> old rfcomm:
>>
>> 	while (1) {
>> 		set_current_state(TASK_INTERRUPTIBLE);
>> 		if (condition)
>> 			break;
>> 		// may call might_sleep here
>> 		schedule();
>> 	}
>> 	__set_current_state(TASK_RUNNING);
>>
>> Which fixed at:
>> 	dfb2fae Bluetooth: Fix nested sleeps
>>
>> So let's fix it at the same way, also follow the suggestion of:
>> https://lwn.net/Articles/628628/
>>
>> Signed-off-by: Jeffy Chen
>> ---
>>  net/bluetooth/hidp/core.c | 23 +++
>>  1 file changed, 15 insertions(+), 8 deletions(-)
>>
>> diff --git a/net/bluetooth/hidp/core.c b/net/bluetooth/hidp/core.c
>> index 0bec458..43d6e6a 100644
>> --- a/net/bluetooth/hidp/core.c
>> +++ b/net/bluetooth/hidp/core.c
>> @@ -36,6 +36,7 @@
>>  #define VERSION "1.2"
>>
>>  static DECLARE_RWSEM(hidp_session_sem);
>> +static DECLARE_WAIT_QUEUE_HEAD(hidp_session_wq);
>>  static LIST_HEAD(hidp_session_list);
>>
>>  static unsigned char hidp_keycode[256] = {
>> @@ -1068,12 +1069,15 @@ static int hidp_session_start_sync(struct hidp_session *session)
>>   * Wake up session thread and notify it to stop. This is asynchronous and
>>   * returns immediately. Call this whenever a runtime error occurs and you want
>>   * the session to stop.
>> - * Note: wake_up_process() performs any necessary memory-barriers for us.
>>   */
>>  static void hidp_session_terminate(struct hidp_session *session)
>>  {
>>  	atomic_inc(&session->terminate);
>> -	wake_up_process(session->task);
>> +
>> +	/* Ensure session->terminate is updated */
>> +	smp_mb__after_atomic();
>> +
>> +	wake_up_interruptible(&hidp_session_wq);
>
> So, you're adding a whole new wait queue here.
>
>>  }
>>
>>  /*
>> @@ -1180,7 +1184,9 @@ static void hidp_session_run(struct hidp_session *session)
>>  	struct sock *ctrl_sk = session->ctrl_sock->sk;
>>  	struct sock *intr_sk = session->intr_sock->sk;
>>  	struct sk_buff *skb;
>> +	DEFINE_WAIT_FUNC(wait, woken_wake_function);
>>
>> +	add_wait_queue(&hidp_session_wq, &wait);
>>  	for (;;) {
>>  		/*
>>  		 * This thread can be woken up two ways:
>> @@ -1188,12 +1194,10 @@ static void hidp_session_run(struct hidp_session *session)
>>  		 *    session->terminate flag and wakes this thread up.
>>  		 *  - Via modifying the socket state of ctrl/intr_sock. This
>>  		 *    thread is woken up by ->sk_state_changed().
>> -		 *
>> -		 * Note: set_current_state() performs any necessary
>> -		 * memory-barriers for us.
>>  		 */
>> -		set_current_state(TASK_INTERRUPTIBLE);
>>
>> +		/* Ensure session->terminate is updated */
>> +		smp_mb__before_atomic();
>>  		if (atomic_read(&session->terminate))
>>  			break;
>>
>> @@ -1227,11 +1231,14 @@ static void hidp_session_run(struct hidp_session *session)
>>  		hidp_process_transmit(session, &session->ctrl_transmit,
>>  				      session->ctrl_sock);
>>
>> -		schedule();
>> +		wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
>
> And you're waiting on it here. But you're already on two other wait
> queues (hidp_session_thread()). So the nice WQ_FLAG_WOKEN handling will
> only happen if you get woken via the new hidp_session_wq queue. But what
> about the other two? Seems like again you might have a race condition
> that would lead you to (temporarily, at least?) missing a wake-up
> attempt.

Thanks for pointing that out.

> I'm not really sure what the best way to resolve this would be. My best
> guess would be to either consolidate the use of these wait queues, or
> else roll a version of wait_woken() to handle 2 or more wait heads...
>
> Am I wrong? I easily could be.
>
> Brian
>
>>  	}
>>
>> +	remove_wait_queue(&hidp_session_wq, &wait);
>>  	atomic_inc(&session->terminate);
>> -	set_current_state(TASK_RUNNING);
>> +
>> +	/* Ensure session->terminate is updated */
>> +	smp_mb__after_atomic();
>>  }
>>
>>  /*
>> --
>> 2.1.4
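The concern raised in this review can be pictured with a deliberately simplified, single-threaded model in C. It is not kernel code: `struct model_wait` and the `model_*` functions are hypothetical stand-ins, and the only assumption is the behavior described above, namely that a wake-up is remembered (the WQ_FLAG_WOKEN role) only when it arrives through the entry registered with woken_wake_function().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the wait entry used with wait_woken(). */
struct model_wait {
	bool woken;	/* plays the role of WQ_FLAG_WOKEN */
};

/* Wake-up delivered through the woken-style wake function: it is recorded. */
void model_woken_wake(struct model_wait *w)
{
	assert(w != NULL);
	w->woken = true;
}

/*
 * Wake-up delivered through some other wait entry with the default wake
 * function while the thread is already running: nothing is recorded on w.
 */
void model_default_wake(struct model_wait *w)
{
	assert(w != NULL);
	(void)w;	/* intentionally leaves w->woken untouched */
}

/*
 * Models the wait step: returns true if the thread would go to sleep,
 * false if a recorded wake-up lets it continue. The flag is consumed.
 */
bool model_wait_would_sleep(struct model_wait *w)
{
	bool sleeps;

	assert(w != NULL);
	sleeps = !w->woken;
	w->woken = false;
	return sleeps;
}
```

In this model a wake-up that arrives only via `model_default_wake()` before the wait step leaves the thread sleeping, which is the missed wake-up scenario; routing every wake source through the recording path avoids it.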
Re: [PATCH 2/3] Bluetooth: cmtp: fix possible might sleep error in cmtp_session
Hi brian, On 02/11/2017 09:43 AM, Brian Norris wrote: Hi, On Tue, Jan 24, 2017 at 12:07:50PM +0800, Jeffy Chen wrote: It looks like cmtp_session has same pattern as the issue reported in old rfcomm: while (1) { set_current_state(TASK_INTERRUPTIBLE); if (condition) break; // may call might_sleep here schedule(); } __set_current_state(TASK_RUNNING); Which fixed at: dfb2fae Bluetooth: Fix nested sleeps So let's fix it at the same way, also follow the suggestion of: https://lwn.net/Articles/628628/ Signed-off-by: Jeffy Chen --- net/bluetooth/cmtp/core.c | 21 ++--- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/net/bluetooth/cmtp/core.c b/net/bluetooth/cmtp/core.c index 9e59b66..6b03f2b 100644 --- a/net/bluetooth/cmtp/core.c +++ b/net/bluetooth/cmtp/core.c @@ -280,16 +280,16 @@ static int cmtp_session(void *arg) struct cmtp_session *session = arg; struct sock *sk = session->sock->sk; struct sk_buff *skb; - wait_queue_t wait; + DEFINE_WAIT_FUNC(wait, woken_wake_function); BT_DBG("session %p", session); set_user_nice(current, -15); - init_waitqueue_entry(&wait, current); add_wait_queue(sk_sleep(sk), &wait); while (1) { - set_current_state(TASK_INTERRUPTIBLE); + /* Ensure session->terminate is updated */ + smp_mb__before_atomic(); if (atomic_read(&session->terminate)) break; @@ -306,9 +306,8 @@ static int cmtp_session(void *arg) cmtp_process_transmit(session); - schedule(); + wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT); } - __set_current_state(TASK_RUNNING); remove_wait_queue(sk_sleep(sk), &wait); down_write(&cmtp_session_sem); @@ -393,7 +392,11 @@ int cmtp_add_connection(struct cmtp_connadd_req *req, struct socket *sock) err = cmtp_attach_device(session); if (err < 0) { atomic_inc(&session->terminate); - wake_up_process(session->task); + + /* Ensure session->terminate is updated */ + smp_mb__after_atomic(); + Same comment about the barrier. Done, there are barriers in wake functions indeed, thanx! 
+ wake_up_interruptible(sk_sleep(session->sock->sk)); up_write(&cmtp_session_sem); return err; } @@ -431,7 +434,11 @@ int cmtp_del_connection(struct cmtp_conndel_req *req) /* Stop session thread */ atomic_inc(&session->terminate); - wake_up_process(session->task); + + /* Ensure session->terminate is updated */ + smp_mb__after_atomic(); And again. But otherwise I think this looks OK, again with the caveat that I don't know Bluetooth/CMTP that well: Reviewed-by: Brian Norris + + wake_up_interruptible(sk_sleep(session->sock->sk)); } else err = -ENOENT; -- 2.1.4
Re: [PATCH 1/3] Bluetooth: bnep: fix possible might sleep error in bnep_session
Hi Brian,

On 02/11/2017 09:40 AM, Brian Norris wrote:
> Hi,
>
> On Tue, Jan 24, 2017 at 12:07:49PM +0800, Jeffy Chen wrote:
>> It looks like bnep_session has same pattern as the issue reported in
>> old rfcomm:
>>
>> 	while (1) {
>> 		set_current_state(TASK_INTERRUPTIBLE);
>> 		if (condition)
>> 			break;
>> 		// may call might_sleep here
>> 		schedule();
>> 	}
>> 	__set_current_state(TASK_RUNNING);
>>
>> Which fixed at:
>> 	dfb2fae Bluetooth: Fix nested sleeps
>>
>> So let's fix it at the same way, also follow the suggestion of:
>> https://lwn.net/Articles/628628/
>>
>> Signed-off-by: Jeffy Chen
>> ---
>>  net/bluetooth/bnep/core.c | 15 +--
>>  1 file changed, 9 insertions(+), 6 deletions(-)
>>
>> diff --git a/net/bluetooth/bnep/core.c b/net/bluetooth/bnep/core.c
>> index fbf251f..da04d51 100644
>> --- a/net/bluetooth/bnep/core.c
>> +++ b/net/bluetooth/bnep/core.c
>> @@ -484,16 +484,16 @@ static int bnep_session(void *arg)
>>  	struct net_device *dev = s->dev;
>>  	struct sock *sk = s->sock->sk;
>>  	struct sk_buff *skb;
>> -	wait_queue_t wait;
>> +	DEFINE_WAIT_FUNC(wait, woken_wake_function);
>>
>>  	BT_DBG("");
>>
>>  	set_user_nice(current, -15);
>>
>> -	init_waitqueue_entry(&wait, current);
>>  	add_wait_queue(sk_sleep(sk), &wait);
>>  	while (1) {
>> -		set_current_state(TASK_INTERRUPTIBLE);
>> +		/* Ensure session->terminate is updated */
>> +		smp_mb__before_atomic();
>>
>>  		if (atomic_read(&s->terminate))
>>  			break;
>> @@ -515,9 +515,8 @@ static int bnep_session(void *arg)
>>  			break;
>>  		netif_wake_queue(dev);
>>
>> -		schedule();
>> +		wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
>>  	}
>> -	__set_current_state(TASK_RUNNING);
>>  	remove_wait_queue(sk_sleep(sk), &wait);
>>
>>  	/* Cleanup session */
>> @@ -666,7 +665,11 @@ int bnep_del_connection(struct bnep_conndel_req *req)
>>  	s = __bnep_get_session(req->dst);
>>  	if (s) {
>>  		atomic_inc(&s->terminate);
>> -		wake_up_process(s->task);
>> +
>> +		/* Ensure session->terminate is updated */
>> +		smp_mb__after_atomic();
>> +
>
> __wake_up() suggests:
>
>  * It may be assumed that this function implies a write memory barrier before
>  * changing the task state if and only if any tasks are woken up.
>
> so the above barrier is probably unnecessary. I'm not so sure about the
> one before atomic_read(); seems fine.

Got it, thanx!

> Other than that, I think this looks ok:
>
> Reviewed-by: Brian Norris
>
> But I haven't been testing BNEP.
>
> Brian
>
>> +		wake_up_interruptible(sk_sleep(s->sock->sk));
>>  	} else
>>  		err = -ENOENT;
>> --
>> 2.1.4
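As a loose userspace analogy for the "set terminate, then wake" ordering discussed here, the following C11 sketch publishes a stop request with release ordering and reads it with acquire ordering. It is only an analogy: C11 memory orders are not the kernel's smp_mb__*() or wake-up primitives, and `struct session_model` and both functions are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a session with a terminate counter. */
struct session_model {
	atomic_int terminate;
};

/*
 * Writer side: publish the stop request. The release ordering here stands
 * in for the ordering that the wake-up call is said to provide in the kernel.
 */
void session_request_stop(struct session_model *s)
{
	assert(s != NULL);
	atomic_fetch_add_explicit(&s->terminate, 1, memory_order_release);
	/* a real program would wake the worker thread at this point */
}

/* Reader side: the worker checks the counter before deciding to wait. */
bool session_should_stop(struct session_model *s)
{
	assert(s != NULL);
	return atomic_load_explicit(&s->terminate, memory_order_acquire) != 0;
}
```

The pairing shows why an extra explicit barrier between the increment and the wake-up adds nothing when the wake-up path already orders the store.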
[PATCH v3 3/3] Bluetooth: hidp: fix possible might sleep error in hidp_session_thread
It looks like hidp_session_thread has same pattern as the issue reported in old rfcomm: while (1) { set_current_state(TASK_INTERRUPTIBLE); if (condition) break; // may call might_sleep here schedule(); } __set_current_state(TASK_RUNNING); Which fixed at: dfb2fae Bluetooth: Fix nested sleeps So let's fix it at the same way, also follow the suggestion of: https://lwn.net/Articles/628628/ Signed-off-by: Jeffy Chen 1/ Fix could not wake up by wake attempts on original wait queues. 2/ Remove unnecessary memory barrier before wake_up_* functions. --- Changes in v3: None Changes in v2: None net/bluetooth/hidp/core.c | 33 ++--- 1 file changed, 22 insertions(+), 11 deletions(-) diff --git a/net/bluetooth/hidp/core.c b/net/bluetooth/hidp/core.c index 0bec458..076bc50 100644 --- a/net/bluetooth/hidp/core.c +++ b/net/bluetooth/hidp/core.c @@ -36,6 +36,7 @@ #define VERSION "1.2" static DECLARE_RWSEM(hidp_session_sem); +static DECLARE_WAIT_QUEUE_HEAD(hidp_session_wq); static LIST_HEAD(hidp_session_list); static unsigned char hidp_keycode[256] = { @@ -1068,12 +1069,12 @@ static int hidp_session_start_sync(struct hidp_session *session) * Wake up session thread and notify it to stop. This is asynchronous and * returns immediately. Call this whenever a runtime error occurs and you want * the session to stop. - * Note: wake_up_process() performs any necessary memory-barriers for us. + * Note: wake_up_interruptible() performs any necessary memory-barriers for us. 
*/ static void hidp_session_terminate(struct hidp_session *session) { atomic_inc(&session->terminate); - wake_up_process(session->task); + wake_up_interruptible(&hidp_session_wq); } /* @@ -1180,7 +1181,9 @@ static void hidp_session_run(struct hidp_session *session) struct sock *ctrl_sk = session->ctrl_sock->sk; struct sock *intr_sk = session->intr_sock->sk; struct sk_buff *skb; + DEFINE_WAIT_FUNC(wait, woken_wake_function); + add_wait_queue(&hidp_session_wq, &wait); for (;;) { /* * This thread can be woken up two ways: @@ -1188,12 +1191,10 @@ static void hidp_session_run(struct hidp_session *session) *session->terminate flag and wakes this thread up. * - Via modifying the socket state of ctrl/intr_sock. This *thread is woken up by ->sk_state_changed(). -* -* Note: set_current_state() performs any necessary -* memory-barriers for us. */ - set_current_state(TASK_INTERRUPTIBLE); + /* Ensure session->terminate is updated */ + smp_mb__before_atomic(); if (atomic_read(&session->terminate)) break; @@ -1227,11 +1228,22 @@ static void hidp_session_run(struct hidp_session *session) hidp_process_transmit(session, &session->ctrl_transmit, session->ctrl_sock); - schedule(); + wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT); } + remove_wait_queue(&hidp_session_wq, &wait); atomic_inc(&session->terminate); - set_current_state(TASK_RUNNING); + + /* Ensure session->terminate is updated */ + smp_mb__after_atomic(); +} + +int hidp_session_wake_function(wait_queue_t *wait, unsigned int mode, + int sync, void *key) +{ + wake_up_interruptible(&hidp_session_wq); + + return default_wake_function(wait, mode, sync, key); } /* @@ -1244,7 +1256,8 @@ static void hidp_session_run(struct hidp_session *session) static int hidp_session_thread(void *arg) { struct hidp_session *session = arg; - wait_queue_t ctrl_wait, intr_wait; + DEFINE_WAIT_FUNC(ctrl_wait, hidp_session_wake_function); + DEFINE_WAIT_FUNC(intr_wait, hidp_session_wake_function); BT_DBG("session %p", session); @@ -1254,8 
+1267,6 @@ static int hidp_session_thread(void *arg) set_user_nice(current, -15); hidp_set_timer(session); - init_waitqueue_entry(&ctrl_wait, current); - init_waitqueue_entry(&intr_wait, current); add_wait_queue(sk_sleep(session->ctrl_sock->sk), &ctrl_wait); add_wait_queue(sk_sleep(session->intr_sock->sk), &intr_wait); /* This memory barrier is paired with wq_has_sleeper(). See -- 2.1.4
[PATCH v3 1/3] Bluetooth: bnep: fix possible might sleep error in bnep_session
It looks like bnep_session has same pattern as the issue reported in old rfcomm: while (1) { set_current_state(TASK_INTERRUPTIBLE); if (condition) break; // may call might_sleep here schedule(); } __set_current_state(TASK_RUNNING); Which fixed at: dfb2fae Bluetooth: Fix nested sleeps So let's fix it at the same way, also follow the suggestion of: https://lwn.net/Articles/628628/ Signed-off-by: Jeffy Chen Reviewed-by: Brian Norris --- Changes in v3: Add brian's Reviewed-by. Changes in v2: Remove unnecessary memory barrier before wake_up_* functions. net/bluetooth/bnep/core.c | 11 +-- 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/net/bluetooth/bnep/core.c b/net/bluetooth/bnep/core.c index fbf251f..4d6b94d 100644 --- a/net/bluetooth/bnep/core.c +++ b/net/bluetooth/bnep/core.c @@ -484,16 +484,16 @@ static int bnep_session(void *arg) struct net_device *dev = s->dev; struct sock *sk = s->sock->sk; struct sk_buff *skb; - wait_queue_t wait; + DEFINE_WAIT_FUNC(wait, woken_wake_function); BT_DBG(""); set_user_nice(current, -15); - init_waitqueue_entry(&wait, current); add_wait_queue(sk_sleep(sk), &wait); while (1) { - set_current_state(TASK_INTERRUPTIBLE); + /* Ensure session->terminate is updated */ + smp_mb__before_atomic(); if (atomic_read(&s->terminate)) break; @@ -515,9 +515,8 @@ static int bnep_session(void *arg) break; netif_wake_queue(dev); - schedule(); + wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT); } - __set_current_state(TASK_RUNNING); remove_wait_queue(sk_sleep(sk), &wait); /* Cleanup session */ @@ -666,7 +665,7 @@ int bnep_del_connection(struct bnep_conndel_req *req) s = __bnep_get_session(req->dst); if (s) { atomic_inc(&s->terminate); - wake_up_process(s->task); + wake_up_interruptible(sk_sleep(s->sock->sk)); } else err = -ENOENT; -- 2.1.4
[PATCH v3 2/3] Bluetooth: cmtp: fix possible might sleep error in cmtp_session
It looks like cmtp_session has same pattern as the issue reported in old rfcomm: while (1) { set_current_state(TASK_INTERRUPTIBLE); if (condition) break; // may call might_sleep here schedule(); } __set_current_state(TASK_RUNNING); Which fixed at: dfb2fae Bluetooth: Fix nested sleeps So let's fix it at the same way, also follow the suggestion of: https://lwn.net/Articles/628628/ Signed-off-by: Jeffy Chen Reviewed-by: Brian Norris Remove unnecessary memory barrier before wake_up_* functions. --- Changes in v3: Add brian's Reviewed-by. Changes in v2: None net/bluetooth/cmtp/core.c | 17 ++--- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/net/bluetooth/cmtp/core.c b/net/bluetooth/cmtp/core.c index 9e59b66..1152ce3 100644 --- a/net/bluetooth/cmtp/core.c +++ b/net/bluetooth/cmtp/core.c @@ -280,16 +280,16 @@ static int cmtp_session(void *arg) struct cmtp_session *session = arg; struct sock *sk = session->sock->sk; struct sk_buff *skb; - wait_queue_t wait; + DEFINE_WAIT_FUNC(wait, woken_wake_function); BT_DBG("session %p", session); set_user_nice(current, -15); - init_waitqueue_entry(&wait, current); add_wait_queue(sk_sleep(sk), &wait); while (1) { - set_current_state(TASK_INTERRUPTIBLE); + /* Ensure session->terminate is updated */ + smp_mb__before_atomic(); if (atomic_read(&session->terminate)) break; @@ -306,9 +306,8 @@ static int cmtp_session(void *arg) cmtp_process_transmit(session); - schedule(); + wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT); } - __set_current_state(TASK_RUNNING); remove_wait_queue(sk_sleep(sk), &wait); down_write(&cmtp_session_sem); @@ -393,7 +392,7 @@ int cmtp_add_connection(struct cmtp_connadd_req *req, struct socket *sock) err = cmtp_attach_device(session); if (err < 0) { atomic_inc(&session->terminate); - wake_up_process(session->task); + wake_up_interruptible(sk_sleep(session->sock->sk)); up_write(&cmtp_session_sem); return err; } @@ -431,7 +430,11 @@ int cmtp_del_connection(struct cmtp_conndel_req *req) /* Stop 
session thread */ atomic_inc(&session->terminate); - wake_up_process(session->task); + + /* Ensure session->terminate is updated */ + smp_mb__after_atomic(); + + wake_up_interruptible(sk_sleep(session->sock->sk)); } else err = -ENOENT; -- 2.1.4
[PATCH v2 3/3] Bluetooth: hidp: fix possible might sleep error in hidp_session_thread
It looks like hidp_session_thread has same pattern as the issue reported in old rfcomm: while (1) { set_current_state(TASK_INTERRUPTIBLE); if (condition) break; // may call might_sleep here schedule(); } __set_current_state(TASK_RUNNING); Which fixed at: dfb2fae Bluetooth: Fix nested sleeps So let's fix it at the same way, also follow the suggestion of: https://lwn.net/Articles/628628/ Signed-off-by: Jeffy Chen 1/ Fix could not wake up by wake attempts on original wait queues. 2/ Remove unnecessary memory barrier before wake_up_* functions. --- Changes in v2: None net/bluetooth/hidp/core.c | 33 ++--- 1 file changed, 22 insertions(+), 11 deletions(-) diff --git a/net/bluetooth/hidp/core.c b/net/bluetooth/hidp/core.c index 0bec458..076bc50 100644 --- a/net/bluetooth/hidp/core.c +++ b/net/bluetooth/hidp/core.c @@ -36,6 +36,7 @@ #define VERSION "1.2" static DECLARE_RWSEM(hidp_session_sem); +static DECLARE_WAIT_QUEUE_HEAD(hidp_session_wq); static LIST_HEAD(hidp_session_list); static unsigned char hidp_keycode[256] = { @@ -1068,12 +1069,12 @@ static int hidp_session_start_sync(struct hidp_session *session) * Wake up session thread and notify it to stop. This is asynchronous and * returns immediately. Call this whenever a runtime error occurs and you want * the session to stop. - * Note: wake_up_process() performs any necessary memory-barriers for us. + * Note: wake_up_interruptible() performs any necessary memory-barriers for us. 
*/ static void hidp_session_terminate(struct hidp_session *session) { atomic_inc(&session->terminate); - wake_up_process(session->task); + wake_up_interruptible(&hidp_session_wq); } /* @@ -1180,7 +1181,9 @@ static void hidp_session_run(struct hidp_session *session) struct sock *ctrl_sk = session->ctrl_sock->sk; struct sock *intr_sk = session->intr_sock->sk; struct sk_buff *skb; + DEFINE_WAIT_FUNC(wait, woken_wake_function); + add_wait_queue(&hidp_session_wq, &wait); for (;;) { /* * This thread can be woken up two ways: @@ -1188,12 +1191,10 @@ static void hidp_session_run(struct hidp_session *session) *session->terminate flag and wakes this thread up. * - Via modifying the socket state of ctrl/intr_sock. This *thread is woken up by ->sk_state_changed(). -* -* Note: set_current_state() performs any necessary -* memory-barriers for us. */ - set_current_state(TASK_INTERRUPTIBLE); + /* Ensure session->terminate is updated */ + smp_mb__before_atomic(); if (atomic_read(&session->terminate)) break; @@ -1227,11 +1228,22 @@ static void hidp_session_run(struct hidp_session *session) hidp_process_transmit(session, &session->ctrl_transmit, session->ctrl_sock); - schedule(); + wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT); } + remove_wait_queue(&hidp_session_wq, &wait); atomic_inc(&session->terminate); - set_current_state(TASK_RUNNING); + + /* Ensure session->terminate is updated */ + smp_mb__after_atomic(); +} + +int hidp_session_wake_function(wait_queue_t *wait, unsigned int mode, + int sync, void *key) +{ + wake_up_interruptible(&hidp_session_wq); + + return default_wake_function(wait, mode, sync, key); } /* @@ -1244,7 +1256,8 @@ static void hidp_session_run(struct hidp_session *session) static int hidp_session_thread(void *arg) { struct hidp_session *session = arg; - wait_queue_t ctrl_wait, intr_wait; + DEFINE_WAIT_FUNC(ctrl_wait, hidp_session_wake_function); + DEFINE_WAIT_FUNC(intr_wait, hidp_session_wake_function); BT_DBG("session %p", session); @@ -1254,8 
+1267,6 @@ static int hidp_session_thread(void *arg) set_user_nice(current, -15); hidp_set_timer(session); - init_waitqueue_entry(&ctrl_wait, current); - init_waitqueue_entry(&intr_wait, current); add_wait_queue(sk_sleep(session->ctrl_sock->sk), &ctrl_wait); add_wait_queue(sk_sleep(session->intr_sock->sk), &intr_wait); /* This memory barrier is paired with wq_has_sleeper(). See -- 2.1.4
[PATCH v2 1/3] Bluetooth: bnep: fix possible might sleep error in bnep_session
It looks like bnep_session has same pattern as the issue reported in old rfcomm: while (1) { set_current_state(TASK_INTERRUPTIBLE); if (condition) break; // may call might_sleep here schedule(); } __set_current_state(TASK_RUNNING); Which fixed at: dfb2fae Bluetooth: Fix nested sleeps So let's fix it at the same way, also follow the suggestion of: https://lwn.net/Articles/628628/ Signed-off-by: Jeffy Chen --- Changes in v2: Remove unnecessary memory barrier before wake_up_* functions. net/bluetooth/bnep/core.c | 11 +-- 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/net/bluetooth/bnep/core.c b/net/bluetooth/bnep/core.c index fbf251f..4d6b94d 100644 --- a/net/bluetooth/bnep/core.c +++ b/net/bluetooth/bnep/core.c @@ -484,16 +484,16 @@ static int bnep_session(void *arg) struct net_device *dev = s->dev; struct sock *sk = s->sock->sk; struct sk_buff *skb; - wait_queue_t wait; + DEFINE_WAIT_FUNC(wait, woken_wake_function); BT_DBG(""); set_user_nice(current, -15); - init_waitqueue_entry(&wait, current); add_wait_queue(sk_sleep(sk), &wait); while (1) { - set_current_state(TASK_INTERRUPTIBLE); + /* Ensure session->terminate is updated */ + smp_mb__before_atomic(); if (atomic_read(&s->terminate)) break; @@ -515,9 +515,8 @@ static int bnep_session(void *arg) break; netif_wake_queue(dev); - schedule(); + wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT); } - __set_current_state(TASK_RUNNING); remove_wait_queue(sk_sleep(sk), &wait); /* Cleanup session */ @@ -666,7 +665,7 @@ int bnep_del_connection(struct bnep_conndel_req *req) s = __bnep_get_session(req->dst); if (s) { atomic_inc(&s->terminate); - wake_up_process(s->task); + wake_up_interruptible(sk_sleep(s->sock->sk)); } else err = -ENOENT; -- 2.1.4
[PATCH v2 2/3] Bluetooth: cmtp: fix possible might sleep error in cmtp_session
It looks like cmtp_session has same pattern as the issue reported in old rfcomm: while (1) { set_current_state(TASK_INTERRUPTIBLE); if (condition) break; // may call might_sleep here schedule(); } __set_current_state(TASK_RUNNING); Which fixed at: dfb2fae Bluetooth: Fix nested sleeps So let's fix it at the same way, also follow the suggestion of: https://lwn.net/Articles/628628/ Signed-off-by: Jeffy Chen Remove unnecessary memory barrier before wake_up_* functions. --- Changes in v2: None net/bluetooth/cmtp/core.c | 17 ++--- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/net/bluetooth/cmtp/core.c b/net/bluetooth/cmtp/core.c index 9e59b66..1152ce3 100644 --- a/net/bluetooth/cmtp/core.c +++ b/net/bluetooth/cmtp/core.c @@ -280,16 +280,16 @@ static int cmtp_session(void *arg) struct cmtp_session *session = arg; struct sock *sk = session->sock->sk; struct sk_buff *skb; - wait_queue_t wait; + DEFINE_WAIT_FUNC(wait, woken_wake_function); BT_DBG("session %p", session); set_user_nice(current, -15); - init_waitqueue_entry(&wait, current); add_wait_queue(sk_sleep(sk), &wait); while (1) { - set_current_state(TASK_INTERRUPTIBLE); + /* Ensure session->terminate is updated */ + smp_mb__before_atomic(); if (atomic_read(&session->terminate)) break; @@ -306,9 +306,8 @@ static int cmtp_session(void *arg) cmtp_process_transmit(session); - schedule(); + wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT); } - __set_current_state(TASK_RUNNING); remove_wait_queue(sk_sleep(sk), &wait); down_write(&cmtp_session_sem); @@ -393,7 +392,7 @@ int cmtp_add_connection(struct cmtp_connadd_req *req, struct socket *sock) err = cmtp_attach_device(session); if (err < 0) { atomic_inc(&session->terminate); - wake_up_process(session->task); + wake_up_interruptible(sk_sleep(session->sock->sk)); up_write(&cmtp_session_sem); return err; } @@ -431,7 +430,11 @@ int cmtp_del_connection(struct cmtp_conndel_req *req) /* Stop session thread */ atomic_inc(&session->terminate); - 
wake_up_process(session->task); + + /* Ensure session->terminate is updated */ + smp_mb__after_atomic(); + + wake_up_interruptible(sk_sleep(session->sock->sk)); } else err = -ENOENT; -- 2.1.4
Re: [net-next 00/14][pull request] 40GbE Intel Wired LAN Driver Updates 2017-02-11
From: Jeff Kirsher Date: Sat, 11 Feb 2017 21:30:14 -0800 > This series contains updates to i40e and i40evf only. Pulled, thanks Jeff. Please address Sergei's feedback for patch #12 in followup changes, if necessary, thank you.
Re: [PATCH] net: micrel: ks8851: use new api ethtool_{get|set}_link_ksettings
From: Philippe Reynes Date: Thu, 9 Feb 2017 09:57:47 +0100 > The ethtool api {get|set}_settings is deprecated. > We move this driver to new api {get|set}_link_ksettings. > > As I don't have the hardware, I'd be very pleased if > someone may test this patch. > > Signed-off-by: Philippe Reynes Applied.
Re: [PATCH] net: micrel: ks8695net: use new api ethtool_{get|set}_link_ksettings
From: Philippe Reynes Date: Wed, 8 Feb 2017 23:54:45 +0100 > The ethtool api {get|set}_settings is deprecated. > We move this driver to new api {get|set}_link_ksettings. > > As I don't have the hardware, I'd be very pleased if > someone may test this patch. > > Signed-off-by: Philippe Reynes Applied.
Re: [PATCH] net: micrel: ks8851_mll: use new api ethtool_{get|set}_link_ksettings
From: Philippe Reynes Date: Thu, 9 Feb 2017 11:28:25 +0100 > The ethtool api {get|set}_settings is deprecated. > We move this driver to new api {get|set}_link_ksettings. > > As I don't have the hardware, I'd be very pleased if > someone may test this patch. > > Signed-off-by: Philippe Reynes Applied.
Re: [PATCH] net: microchip: encx24j600: use new api ethtool_{get|set}_link_ksettings
From: Philippe Reynes Date: Thu, 9 Feb 2017 22:42:18 +0100 > The ethtool api {get|set}_settings is deprecated. > We move this driver to new api {get|set}_link_ksettings. > > As I don't have the hardware, I'd be very pleased if > someone may test this patch. > > Signed-off-by: Philippe Reynes Applied.
Re: [PATCH] net: nuvoton: w90p910: use new api ethtool_{get|set}_link_ksettings
From: Philippe Reynes Date: Sun, 12 Feb 2017 21:38:29 +0100 > The ethtool api {get|set}_settings is deprecated. > We move this driver to new api {get|set}_link_ksettings. > > As I don't have the hardware, I'd be very pleased if > someone may test this patch. > > Signed-off-by: Philippe Reynes Applied.
Re: [PATCH] net: natsemi: use new api ethtool_{get|set}_link_ksettings
From: Philippe Reynes Date: Thu, 9 Feb 2017 23:58:25 +0100 > The ethtool api {get|set}_settings is deprecated. > We move this driver to new api {get|set}_link_ksettings. > > As I don't have the hardware, I'd be very pleased if > someone may test this patch. > > Signed-off-by: Philippe Reynes Applied.
Re: [PATCH] net: neterion: s2io: use new api ethtool_{get|set}_link_ksettings
From: Philippe Reynes Date: Sun, 12 Feb 2017 11:44:36 +0100 > The ethtool api {get|set}_settings is deprecated. > We move this driver to new api {get|set}_link_ksettings. > > As I don't have the hardware, I'd be very pleased if > someone may test this patch. > > Signed-off-by: Philippe Reynes Applied.
Re: [PATCH] net: micrel: ksz884x: use new api ethtool_{get|set}_link_ksettings
From: Philippe Reynes Date: Thu, 9 Feb 2017 20:25:06 +0100 > The ethtool api {get|set}_settings is deprecated. > We move this driver to new api {get|set}_link_ksettings. > > As I don't have the hardware, I'd be very pleased if > someone may test this patch. > > Signed-off-by: Philippe Reynes Applied.
Re: [PATCH] net: myricom: myri10ge: use new api ethtool_{get|set}_link_ksettings
From: Philippe Reynes Date: Thu, 9 Feb 2017 23:17:23 +0100 > The ethtool api {get|set}_settings is deprecated. > We move this driver to new api {get|set}_link_ksettings. > > As I don't have the hardware, I'd be very pleased if > someone may test this patch. > > Signed-off-by: Philippe Reynes Applied.
Re: [PATCH] net: natsemi: ns83820: use new api ethtool_{get|set}_link_ksettings
From: Philippe Reynes Date: Fri, 10 Feb 2017 23:57:48 +0100 > The ethtool api {get|set}_settings is deprecated. > We move this driver to new api {get|set}_link_ksettings. > > As I don't have the hardware, I'd be very pleased if > someone may test this patch. > > Signed-off-by: Philippe Reynes Applied.
Re: [PATCH] net: neterion: vxge: use new api ethtool_{get|set}_link_ksettings
From: Philippe Reynes Date: Sun, 12 Feb 2017 17:33:13 +0100 > The ethtool api {get|set}_settings is deprecated. > We move this driver to new api {get|set}_link_ksettings. > > As I don't have the hardware, I'd be very pleased if > someone may test this patch. > > Signed-off-by: Philippe Reynes Applied.
Re: [PATCH] net: microchip: enc28j60: use new api ethtool_{get|set}_link_ksettings
From: Philippe Reynes Date: Thu, 9 Feb 2017 22:02:47 +0100 > The ethtool api {get|set}_settings is deprecated. > We move this driver to new api {get|set}_link_ksettings. > > As I don't have the hardware, I'd be very pleased if > someone may test this patch. > > Signed-off-by: Philippe Reynes Applied.
Re: [PATCH net-next 0/9] bnxt_en: Misc updates.
From: Michael Chan Date: Sun, 12 Feb 2017 19:18:09 -0500 > Miscellaneous updates include update of the firmware spec, ethtool flash > enhancement, ethtool -l minor fix, NTUPLE support enhancements, FEC > link settings message during link up, and new PCI IDs. Please review. > Thanks. This all looks pretty straightforward to me, series applied, thanks Michael.
Re: [PATCH net] net/llc: avoid BUG_ON() in skb_orphan()
From: Eric Dumazet Date: Sun, 12 Feb 2017 14:03:52 -0800 > From: Eric Dumazet > > It seems nobody used LLC since linux-3.12. > > Fortunately fuzzers like syzkaller still know how to run this code, > otherwise it would be no fun. > > Setting skb->sk without skb->destructor leads to all kinds of > bugs, we now prefer to be very strict about it. > > Ideally here we would use skb_set_owner() but this helper does not exist yet, > only CAN seems to have a private helper for that. > > Fixes: 376c7311bdb6 ("net: add a temporary sanity check in skb_orphan()") > Signed-off-by: Eric Dumazet > Reported-by: Andrey Konovalov Applied and queued up for -stable, thanks Eric.
Re: [PATCH 00/21] Netfilter updates for net-next
From: Pablo Neira Ayuso Date: Sun, 12 Feb 2017 20:42:32 +0100 > The following patchset contains Netfilter updates for your net-next > tree, most relevantly they are: .. > You can pull these changes from: > > git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git Pulled, I really like the RULE_ID generation count stuff for userspace. Thanks.
Re: [PATCH v2 net] bpf: introduce BPF_F_ALLOW_OVERRIDE flag
From: Alexei Starovoitov Date: Fri, 10 Feb 2017 20:28:24 -0800 > If BPF_F_ALLOW_OVERRIDE flag is used in BPF_PROG_ATTACH command > to the given cgroup the descendent cgroup will be able to override > effective bpf program that was inherited from this cgroup. > By default it's not passed, therefore override is disallowed. > > Examples: > 1. > prog X attached to /A with default > prog Y fails to attach to /A/B and /A/B/C > Everything under /A runs prog X > > 2. > prog X attached to /A with allow_override. > prog Y fails to attach to /A/B with default (non-override) > prog M attached to /A/B with allow_override. > Everything under /A/B runs prog M only. > > 3. > prog X attached to /A with allow_override. > prog Y fails to attach to /A with default. > The user has to detach first to switch the mode. > > In the future this behavior may be extended with a chain of > non-overridable programs. > > Also fix the bug where detach from cgroup where nothing is attached > was not throwing error. Return ENOENT in such case. > > Add several testcases and adjust libbpf. > > Fixes: 3007098494be ("cgroup: add support for eBPF programs") > Signed-off-by: Alexei Starovoitov Applied.
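The three examples in the commit message can be replayed against a small user-space model. This is only a sketch of the rules as the message describes them, not the kernel implementation: the `toy_` types and functions are invented for illustration, programs are plain integer ids (X = 1, Y = 2, M = 3), and inheritance is approximated by walking up to the nearest ancestor that has a program attached.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one cgroup node; NOT the kernel's struct cgroup. */
struct toy_cgroup {
	struct toy_cgroup *parent;	/* NULL for the root */
	int prog;			/* attached program id, 0 = none */
	bool allow_override;		/* mode the program was attached with */
};

/* Nearest cgroup at or above cg that has a program attached, or NULL. */
static struct toy_cgroup *toy_nearest_attached(struct toy_cgroup *cg)
{
	for (; cg; cg = cg->parent)
		if (cg->prog)
			return cg;
	return NULL;
}

/* Program that effectively runs for cg: its own, else the inherited one. */
static int toy_effective_prog(struct toy_cgroup *cg)
{
	struct toy_cgroup *owner = toy_nearest_attached(cg);

	return owner ? owner->prog : 0;
}

/* Attach prog to cg; returns 0 or a negative errno value. */
static int toy_attach(struct toy_cgroup *cg, int prog, bool allow_override)
{
	struct toy_cgroup *anc = toy_nearest_attached(cg->parent);

	/* An inherited non-overridable program blocks any attach below it. */
	if (anc && !anc->allow_override)
		return -EPERM;
	/* Below an overridable program, only overridable attaches are allowed. */
	if (anc && !allow_override)
		return -EPERM;
	/* Switching mode on the same cgroup requires a detach first. */
	if (cg->prog && cg->allow_override != allow_override)
		return -EPERM;

	cg->prog = prog;
	cg->allow_override = allow_override;
	return 0;
}

/* Detach; fails with -ENOENT when nothing is attached. */
static int toy_detach(struct toy_cgroup *cg)
{
	if (!cg->prog)
		return -ENOENT;
	cg->prog = 0;
	cg->allow_override = false;
	return 0;
}
```

With /A, /A/B and /A/B/C built as three linked nodes, attaching X to /A in default mode makes every later attach below it fail, which matches example 1; attaching X with allow_override set lets M replace it for /A/B and below, which matches example 2.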
[lkp-robot] [xdp] 543d41bf78: INFO:suspicious_RCU_usage
FYI, we noticed the following commit:

commit: 543d41bf78792e858e6f6598945d307ff808b7fc ("xdp: Infrastructure to generalize XDP")
url: https://github.com/0day-ci/linux/commits/Tom-Herbert/xdp-Generalize-XDP/20170209-092238

in testcase: trinity
with following parameters:

	runtime: 300s

test-description: Trinity is a linux system call fuzz tester.
test-url: http://codemonkey.org.uk/projects/trinity/

on test machine: qemu-system-i386 -enable-kvm -smp 2 -m 320M

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):

+-----------------------------------------------------+------------+------------+
|                                                     | df6dd79be8 | 543d41bf78 |
+-----------------------------------------------------+------------+------------+
| boot_successes                                      | 10         | 0          |
| boot_failures                                       | 2          | 12         |
| WARNING:at_arch/x86/mm/dump_pagetables.c:#note_page | 2          | 2          |
| INFO:suspicious_RCU_usage                           | 0          | 12         |
+-----------------------------------------------------+------------+------------+

[6.814497] [ INFO: suspicious RCU usage. ]
[6.814990] 4.10.0-rc7-01379-g543d41b #1 Not tainted
[6.815618] ---
[6.816107] net/core/xdp.c:201 suspicious rcu_dereference_check() usage!
[6.817090]
[6.817090] other info that might help us debug this:
[6.817090]
[6.818000]
[6.818000] rcu_scheduler_active = 2, debug_locks = 0
[6.818778] 1 lock held by swapper/1:
[6.819213] #0: (xdp_hook_mutex){+.+...}, at: [] __xdp_unregister_hooks+0x1c/0x185
[6.820199]
[6.820199] stack backtrace:
[6.820710] CPU: 0 PID: 1 Comm: swapper Not tainted 4.10.0-rc7-01379-g543d41b #1
[6.821530] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
[6.822747] Call Trace:
[6.823052] dump_stack+0x16/0x18
[6.823434] lockdep_rcu_suspicious+0xdb/0xee
[6.823908] __xdp_unregister_hooks+0x171/0x185
[6.824421] ? __might_sleep+0x2d/0x86
[6.824848] xdp_unregister_all_hooks+0x3a/0x3f
[6.825398] free_netdev+0x25/0xca
[6.825801] lance_probe+0x115/0x122
[6.826191] probe_list2+0x20/0x41
[6.826586] net_olddevs_init+0x42/0x4e
[6.827037] ? probe_list2+0x41/0x41
[6.827448] do_one_initcall+0x3c/0x184
[6.827866] ? repair_env_string+0x12/0x54
[6.828326] ? parse_args+0x24e/0x402
[6.828785] ? trace_hardirqs_on+0xb/0xd
[6.829235] kernel_init_freeable+0xe1/0x15c
[6.829729] ? rest_init+0x10e/0x10e
[6.830134] kernel_init+0xb/0xe5
[6.830515] ? schedule_tail+0xc/0x4a
[6.830925] ? rest_init+0x10e/0x10e
[6.831343] ret_from_fork+0x21/0x2c
[6.832026] libphy: Fixed MDIO Bus: probed
[6.832650] arcnet: arcnet loaded
[6.833011] arcnet:rfc1201: RFC1201 "standard" (`a') encapsulation support loaded
[6.833856] arcnet:arc_rawmode: raw mode (`r') encapsulation support loaded
Re: linux-next: build failure after merge of the rcu tree
Hi Paul, On Thu, 19 Jan 2017 13:54:37 -0800 Paul McKenney wrote: > > On Wed, Jan 18, 2017 at 7:34 PM, Stephen Rothwell > wrote: > > Hi Paul, > > > > After merging the rcu tree, today's linux-next build (x86_64 allmodconfig) > > failed like this: > > > > net/smc/af_smc.c:102:16: error: 'SLAB_DESTROY_BY_RCU' undeclared here (not > > in a function) > > .slab_flags = SLAB_DESTROY_BY_RCU, > > ^ > > > > Caused by commit > > > > c7a545924ca1 ("mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU") > > > > interacting with commit > > > > ac7138746e14 ("smc: establish new socket family") > > > > from the net-next tree. > > > > I have applied the following merge fix patch (someone will need to > > remember to mention this to Linus): > > Thank you, Stephen! I expect that there might be a bit more > bikeshedding on the name, but here is hoping... :-/ The need for this merge fix patch has gone away today. Is that a permanent situation, or will it come back? -- Cheers, Stephen Rothwell
[PATCH net 1/1] net: fec: fix multicast filtering hardware setup
From: Rui Sousa Fix hardware setup of multicast address hash: - Never clear the hardware hash (to avoid packet loss) - Construct the hash register values in software and then write once to hardware Signed-off-by: Rui Sousa Signed-off-by: Fugang Duan --- drivers/net/ethernet/freescale/fec_main.c | 23 +-- 1 file changed, 9 insertions(+), 14 deletions(-) diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c index 2cc552d..91a1664 100644 --- a/drivers/net/ethernet/freescale/fec_main.c +++ b/drivers/net/ethernet/freescale/fec_main.c @@ -2910,6 +2910,7 @@ static void set_multicast_list(struct net_device *ndev) struct netdev_hw_addr *ha; unsigned int i, bit, data, crc, tmp; unsigned char hash; + unsigned int hash_high = 0, hash_low = 0; if (ndev->flags & IFF_PROMISC) { tmp = readl(fep->hwp + FEC_R_CNTRL); @@ -2932,11 +2933,7 @@ static void set_multicast_list(struct net_device *ndev) return; } - /* Clear filter and add the addresses in hash register -*/ - writel(0, fep->hwp + FEC_GRP_HASH_TABLE_HIGH); - writel(0, fep->hwp + FEC_GRP_HASH_TABLE_LOW); - + /* Add the addresses in hash register */ netdev_for_each_mc_addr(ha, ndev) { /* calculate crc32 value of mac address */ crc = 0x; @@ -2954,16 +2951,14 @@ static void set_multicast_list(struct net_device *ndev) */ hash = (crc >> (32 - FEC_HASH_BITS)) & 0x3f; - if (hash > 31) { - tmp = readl(fep->hwp + FEC_GRP_HASH_TABLE_HIGH); - tmp |= 1 << (hash - 32); - writel(tmp, fep->hwp + FEC_GRP_HASH_TABLE_HIGH); - } else { - tmp = readl(fep->hwp + FEC_GRP_HASH_TABLE_LOW); - tmp |= 1 << hash; - writel(tmp, fep->hwp + FEC_GRP_HASH_TABLE_LOW); - } + if (hash > 31) + hash_high |= 1 << (hash - 32); + else + hash_low |= 1 << hash; } + + writel(hash_high, fep->hwp + FEC_GRP_HASH_TABLE_HIGH); + writel(hash_low, fep->hwp + FEC_GRP_HASH_TABLE_LOW); } /* Set a MAC change in hardware. */ -- 1.9.1
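The approach described in the patch above (accumulate both hash words in software, then write each register once) can be sketched in plain user-space C. This is an illustration only: the `toy_` names are invented, the register writes are left to the caller, and the CRC polynomial, initial value and bit order are assumptions rather than values taken from the patch text.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_CRC32_POLY	0xEDB88320u	/* assumed: reflected CRC-32 polynomial */
#define TOY_HASH_BITS	6		/* 64 hash bits = two 32-bit registers */
#define TOY_ETH_ALEN	6

struct toy_hash_regs {
	uint32_t high;	/* value for the "hash table high" register */
	uint32_t low;	/* value for the "hash table low" register */
};

/* Bitwise CRC-32 over one MAC address, least significant bit first. */
static uint32_t toy_mac_crc(const uint8_t *mac)
{
	uint32_t crc = 0xffffffffu;	/* assumed initial value */

	for (size_t i = 0; i < TOY_ETH_ALEN; i++) {
		uint8_t data = mac[i];

		for (int bit = 0; bit < 8; bit++, data >>= 1)
			crc = (crc >> 1) ^ (((crc ^ data) & 1) ? TOY_CRC32_POLY : 0);
	}
	return crc;
}

/*
 * macs points at n addresses stored back to back (n * 6 bytes).
 * All bits are collected here; nothing touches hardware until the
 * caller writes regs.high and regs.low, once each.
 */
static struct toy_hash_regs toy_build_hash(const uint8_t *macs, size_t n)
{
	struct toy_hash_regs regs = { 0, 0 };

	for (size_t i = 0; i < n; i++) {
		uint32_t crc = toy_mac_crc(macs + i * TOY_ETH_ALEN);
		unsigned int hash = (crc >> (32 - TOY_HASH_BITS)) & 0x3f;

		if (hash > 31)
			regs.high |= 1u << (hash - 32);
		else
			regs.low |= 1u << hash;
	}
	return regs;
}
```

Because the registers are never zeroed in between, the filter stays valid for already subscribed groups while the list is being rebuilt, which is the packet-loss window the commit message refers to.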
Re: [PATCH net-next v1] bpf: Remove redundant ifdef
On 2017/2/12 3:37, Mickaël Salaün wrote: Remove a useless ifdef __NR_bpf as requested by Wang Nan. Inline one-line static functions as it was in the bpf_sys.h file. Signed-off-by: Mickaël Salaün Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: David S. Miller Cc: Wang Nan Link: https://lkml.kernel.org/r/828ab1ff-4dcf-53ff-c97b-074adb895...@huawei.com --- tools/lib/bpf/bpf.c | 12 +++- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c index 50e04cc5..2de9c386989a 100644 --- a/tools/lib/bpf/bpf.c +++ b/tools/lib/bpf/bpf.c @@ -42,21 +42,15 @@ # endif #endif -static __u64 ptr_to_u64(const void *ptr) +static inline __u64 ptr_to_u64(const void *ptr) { return (__u64) (unsigned long) ptr; } -static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr, - unsigned int size) +static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr, + unsigned int size) { -#ifdef __NR_bpf return syscall(__NR_bpf, cmd, attr, size); -#else - fprintf(stderr, "No bpf syscall, kernel headers too old?\n"); - errno = ENOSYS; - return -1; -#endif } int bpf_create_map(enum bpf_map_type map_type, int key_size, Acked-by: Wang Nan However, it is better to merge this patch with commit 702498a1426bc95b6f49f9c5fba616110cbd3947. Thank you.
RE: [PATCH net 1/1] net: fec: fix multicast filtering hardware setup
From: Fabio Estevam
Sent: Saturday, February 11, 2017 5:20 AM
>To: Andy Duan
>Cc: David S. Miller ; netdev@vger.kernel.org;
>Stephen Hemminger
>Subject: Re: [PATCH net 1/1] net: fec: fix multicast filtering hardware setup
>
>On Fri, Feb 10, 2017 at 3:54 AM, Andy Duan wrote:
>> Fix hardware setup of multicast address hash:
>> - Never clear the hardware hash (to avoid packet loss)
>> - Construct the hash register values in software and then write once
>> to hardware
>>
>> Signed-off-by: Fugang Duan
>> Signed-off-by: Rui Sousa
>
>It seems you missed to put Rui's name in the From: field.

I made some changes based on the original patch and merged it into the net tree. I forgot to change the From field, so I am sending it again, not as a V2 version.

Regards,
Andy
Re: [PATCH v1] samples/bpf: Add a .gitignore for binaries
On 2/12/17 2:23 PM, Mickaël Salaün wrote: > diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore > new file mode 100644 > index ..a7562a5ef4c2 > --- /dev/null > +++ b/samples/bpf/.gitignore > @@ -0,0 +1,32 @@ > +fds_example > +lathist ... Listing each target is going to be a PITA to maintain. It would be better to put targets into a build directory (bin?) and ignore the directory.
Re: net/packet: use-after-free in packet_rcv_fanout
On (02/10/17 10:02), Eric Dumazet wrote: > At least, Anoob patch is making a step into the right direction ;) > https://patchwork.ozlabs.org/patch/726532/ I've not been able to reproduce Dmitry's panic (though I did not try very hard either) but there's a call to fanout_release from packet_release before the synchronize_net() - I wonder if this could end up kfree'ing f when there are threads in the middle of dev_queue_xmit_nit(). --Sowmini
Re: [PATCH v4 0/3] Miscellaneous fixes for BPF (perf tree)
On 2017/2/9 4:27, Mickaël Salaün wrote: This series brings some fixes and small improvements to the BPF samples. This is intended for the perf tree and apply on 7a5980f9c006 ("tools lib bpf: Add missing header to the library"). Changes since v3: * remove applied patch 1/5 * remove patch 2/5 on bpf_load_program() as requested by Wang Nan Changes since v2: * add this cover letter Changes since v1: * exclude patches not intended for the perf tree Regards, Mickaël Salaün (3): samples/bpf: Ignore already processed ELF sections samples/bpf: Reset global variables samples/bpf: Add missing header samples/bpf/bpf_load.c | 7 +++ samples/bpf/tracex5_kern.c | 1 + 2 files changed, 8 insertions(+) Looks good to me. Thank you.
Re: net: hix5hd2_gmac uninitialized net_device
On 2017/2/11 8:51, Marty Plummer wrote:
> On Fri, Feb 10, 2017 at 06:21:35PM +0800, Dongpo Li wrote:
>> I think the error "No irq resource" happened for some other reason, has no relation with
>> the info "(unnamed net_device) (uninitialized):".
>> You can add more debug info to find bug.
> Do you have any particular suggestions as to what to check out, or is
> this just a general 'debug more' instruction?

I haven't encountered such a problem, so you will need to debug what happens.

>> Yes, I agree with you that the ndev has not been initialized completely,
>> because the function "register_netdev" has not been called yet.
>> It's better to use the "dev_err" to replace the "netdev_err".
>>
> Ah, I see. So, prior to line 1266's call to register_netdev, it will
> always be uninitialized and unnamed, regardless of what is or isn't
> right elsewhere. Good to know. So, I could replace these netdev_err
> with dev_err for now, up until that point, so I can get a bit more info,
> yes?

Yes.

Regards,
Dongpo
Re: [PATCH v2 net-next 00/14] mlx4: order-0 allocations and page recycling
On Sun, 2017-02-12 at 23:38 +0100, Jesper Dangaard Brouer wrote: > Just so others understand this: The number of RX queue slots is > indirectly the size of the page-recycle "cache" in this scheme (that > depend on refcnt tricks to see if page can be reused). Note that the page recycle tricks only work on some occasions. To provision correctly hosts dealing with TCP flows, one should not rely on page recycling or any opportunistic (non guaranteed) behavior. Page recycling, _if_ possible, will help to reduce system load and thus lower latencies. > > > > A single TCP flow easily can have more than 1024 MSS waiting in its > > receive queue (typical receive window on linux is 6MB/2 ) > > So, you do need to increase the page-"cache" size, and need this for > real-life cases, interesting. I believe this sizing was done mostly to cope with normal system scheduling constraints [1], reducing packet losses under incast blasts. Sizing happened before I did my patches to switch to order-0 pages anyway. The fact that it allowed page-recycling to happen more often was nice of course. [1] - One can not really assume host will always have the ability to process the RX ring in time, unless maybe CPU are fully dedicated to the napi polling logic. - Recent work to shift softirqs to ksoftirqd is potentially magnifying the problem.
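The "more than 1024 MSS" remark can be checked with a quick calculation. The sketch below assumes an MSS of 1460 bytes and reads "6MB/2" as 3 MiB of queued data; both figures are assumptions made for illustration, and the helper name is invented.

```c
#include <stdint.h>

/* Number of full MSS-sized segments that fit in `bytes` of receive queue. */
static uint32_t toy_segments_in_queue(uint32_t bytes, uint32_t mss)
{
	return bytes / mss;
}
```

Under those assumptions, `toy_segments_in_queue(6 * 1024 * 1024 / 2, 1460)` gives 2154 segments, roughly twice what a 1024-slot RX ring can hold for a single flow.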
[PATCH net-next 5/9] bnxt_en: Add hardware NTUPLE filter for encapsulated packets.
If skb_flow_dissect_flow_keys() returns with the encapsulation flag set, pass the information to the firmware to setup the NTUPLE filter accordingly. Signed-off-by: Michael Chan --- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 17 ++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c index 516c5d7..f3d829f 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -3456,6 +3456,9 @@ static int bnxt_hwrm_cfa_ntuple_filter_free(struct bnxt *bp, CFA_NTUPLE_FILTER_ALLOC_REQ_ENABLES_DST_PORT_MASK |\ CFA_NTUPLE_FILTER_ALLOC_REQ_ENABLES_DST_ID) +#define BNXT_NTP_TUNNEL_FLTR_FLAG \ + CFA_NTUPLE_FILTER_ALLOC_REQ_ENABLES_TUNNEL_TYPE + static int bnxt_hwrm_cfa_ntuple_filter_alloc(struct bnxt *bp, struct bnxt_ntuple_filter *fltr) { @@ -3496,6 +3499,11 @@ static int bnxt_hwrm_cfa_ntuple_filter_alloc(struct bnxt *bp, req.dst_ipaddr[0] = keys->addrs.v4addrs.dst; req.dst_ipaddr_mask[0] = cpu_to_be32(0x); } + if (keys->control.flags & FLOW_DIS_ENCAPSULATION) { + req.enables |= cpu_to_le32(BNXT_NTP_TUNNEL_FLTR_FLAG); + req.tunnel_type = + CFA_NTUPLE_FILTER_ALLOC_REQ_TUNNEL_TYPE_ANYTUNNEL; + } req.src_port = keys->ports.src; req.src_port_mask = cpu_to_be16(0x); @@ -6869,6 +6877,7 @@ static bool bnxt_fltr_match(struct bnxt_ntuple_filter *f1, keys1->ports.ports == keys2->ports.ports && keys1->basic.ip_proto == keys2->basic.ip_proto && keys1->basic.n_proto == keys2->basic.n_proto && + keys1->control.flags == keys2->control.flags && ether_addr_equal(f1->src_mac_addr, f2->src_mac_addr) && ether_addr_equal(f1->dst_mac_addr, f2->dst_mac_addr)) return true; @@ -6886,9 +6895,6 @@ static int bnxt_rx_flow_steer(struct net_device *dev, const struct sk_buff *skb, int rc = 0, idx, bit_id, l2_idx = 0; struct hlist_head *head; - if (skb->encapsulation) - return -EPROTONOSUPPORT; - if (!ether_addr_equal(dev->dev_addr, eth->h_dest)) { struct bnxt_vnic_info 
*vnic = &bp->vnic_info[0]; int off = 0, j; @@ -6927,6 +6933,11 @@ static int bnxt_rx_flow_steer(struct net_device *dev, const struct sk_buff *skb, rc = -EPROTONOSUPPORT; goto err_free; } + if ((fkeys->control.flags & FLOW_DIS_ENCAPSULATION) && + bp->hwrm_spec_code < 0x10601) { + rc = -EPROTONOSUPPORT; + goto err_free; + } memcpy(new_fltr->dst_mac_addr, eth->h_dest, ETH_ALEN); memcpy(new_fltr->src_mac_addr, eth->h_source, ETH_ALEN); -- 1.8.3.1
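The `bp->hwrm_spec_code < 0x10601` test in the patch above reads more easily once the constant is decoded as a version number. The sketch below assumes the code packs major, minor and update as one byte each (so 0x10601 is 1.6.1); that layout is an assumption, and the `toy_` helpers are invented for illustration.

```c
#include <stdint.h>

/* Pack major.minor.update into one comparable integer (assumed layout). */
static uint32_t toy_spec_code(uint8_t major, uint8_t minor, uint8_t update)
{
	return ((uint32_t)major << 16) | ((uint32_t)minor << 8) | update;
}

/* Mirrors the patch's gate: tunnel filters are refused below 0x10601. */
static int toy_tunnel_filter_supported(uint32_t spec_code)
{
	return spec_code >= 0x10601;
}
```

Packing the fields this way lets a single integer comparison stand in for a three-level version check, so firmware at 1.6.0 is rejected while 1.6.1 and 1.7.0 pass.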
[PATCH net-next 6/9] bnxt_en: Do not setup PHY unless driving a single PF.
If it is a VF or an NPAR function, the firmware call to setup the PHY will fail. Adding this check will prevent unnecessary firmware calls to setup the PHY unless calling from the PF. This will also eliminate many unnecessary warning messages when the call from a VF or NPAR fails. Signed-off-by: Michael Chan --- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c index f3d829f..afd1190 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -5853,6 +5853,9 @@ static int bnxt_update_phy_setting(struct bnxt *bp) rc); return rc; } + if (!BNXT_SINGLE_PF(bp)) + return 0; + if ((link_info->autoneg & BNXT_AUTONEG_FLOW_CTRL) && (link_info->auto_pause_setting & BNXT_LINK_PAUSE_BOTH) != link_info->req_flow_ctrl) -- 1.8.3.1
[PATCH net-next 3/9] bnxt_en: Fix ethtool -l pre-set max combined channel.
With commit d1e7925e6d80 ("bnxt_en: Centralize logic to reserve rings."), ring allocation for combined rings has become stricter. A combined ring must now have an rx-tx ring pair. The pre-set max. for combined rings should now be min(rx, tx). Fixes: d1e7925e6d80 ("bnxt_en: Centralize logic to reserve rings.") Signed-off-by: Michael Chan --- drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c index 4b45b88..6903a87 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c @@ -357,7 +357,7 @@ static void bnxt_get_channels(struct net_device *dev, int max_rx_rings, max_tx_rings, tcs; bnxt_get_max_rings(bp, &max_rx_rings, &max_tx_rings, true); - channel->max_combined = max_t(int, max_rx_rings, max_tx_rings); + channel->max_combined = min_t(int, max_rx_rings, max_tx_rings); if (bnxt_get_max_rings(bp, &max_rx_rings, &max_tx_rings, false)) { max_rx_rings = 0; -- 1.8.3.1
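A small worked example of why the limit has to be the minimum: a combined channel consumes one rx ring and one tx ring, so the scarcer resource bounds the count. The helper name below is invented for illustration.

```c
/* A combined channel needs one rx ring and one tx ring, so the smaller side limits it. */
static int toy_max_combined(int max_rx_rings, int max_tx_rings)
{
	return max_rx_rings < max_tx_rings ? max_rx_rings : max_tx_rings;
}
```

With 8 rx rings and 4 tx rings only 4 pairs can be formed, whereas the old max_t() expression would have advertised 8.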
[PATCH net-next 7/9] bnxt_en: Print FEC settings as part of the linkup dmesg.
Print FEC (Forward Error Correction) autoneg and encoding settings during
link up.

Signed-off-by: Michael Chan
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 13 -
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |  4
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index afd1190..9f1dfbe 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -5437,7 +5437,7 @@ static void bnxt_report_link(struct bnxt *bp)
 	if (bp->link_info.link_up) {
 		const char *duplex;
 		const char *flow_ctrl;
-		u16 speed;
+		u16 speed, fec;

 		netif_carrier_on(bp->dev);
 		if (bp->link_info.duplex == BNXT_LINK_DUPLEX_FULL)
@@ -5459,6 +5459,12 @@ static void bnxt_report_link(struct bnxt *bp)
 			netdev_info(bp->dev, "EEE is %s\n",
 				    bp->eee.eee_active ? "active" :
 							 "not active");
+		fec = bp->link_info.fec_cfg;
+		if (!(fec & PORT_PHY_QCFG_RESP_FEC_CFG_FEC_NONE_SUPPORTED))
+			netdev_info(bp->dev, "FEC autoneg %s encodings: %s\n",
+				    (fec & BNXT_FEC_AUTONEG) ? "on" : "off",
+				    (fec & BNXT_FEC_ENC_BASE_R) ? "BaseR" :
+				    (fec & BNXT_FEC_ENC_RS) ? "RS" : "None");
 	} else {
 		netif_carrier_off(bp->dev);
 		netdev_err(bp->dev, "NIC Link is Down\n");
@@ -5583,6 +5589,11 @@ static int bnxt_update_link(struct bnxt *bp, bool chng_link_state)
 			}
 		}
 	}
+
+	link_info->fec_cfg = PORT_PHY_QCFG_RESP_FEC_CFG_FEC_NONE_SUPPORTED;
+	if (bp->hwrm_spec_code >= 0x10504)
+		link_info->fec_cfg = le16_to_cpu(resp->fec_cfg);
+
 	/* TODO: need to add more logic to report VF link */
 	if (chng_link_state) {
 		if (link_info->phy_link_status == BNXT_LINK_LINK)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index eaed700..faf26a2 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -867,6 +867,10 @@ struct bnxt_link_info {
 	u16			force_link_speed;
 	u32			preemphasis;
 	u8			module_status;
+	u16			fec_cfg;
+#define BNXT_FEC_AUTONEG	PORT_PHY_QCFG_RESP_FEC_CFG_FEC_AUTONEG_ENABLED
+#define BNXT_FEC_ENC_BASE_R	PORT_PHY_QCFG_RESP_FEC_CFG_FEC_CLAUSE74_ENABLED
+#define BNXT_FEC_ENC_RS		PORT_PHY_QCFG_RESP_FEC_CFG_FEC_CLAUSE91_ENABLED

 	/* copy of requested setting from ethtool cmd */
 	u8			autoneg;
-- 
1.8.3.1
[PATCH net-next 1/9] bnxt_en: Update to firmware interface spec 1.7.0.
The new spec has NVRAM defragmentation support which will be used in the
next patch to improve ethtool flash operation.

Signed-off-by: Michael Chan
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c     |   9 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h     |   5 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_hsi.h | 437 +-
 3 files changed, 363 insertions(+), 88 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index cda1c78..8ac5987 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -1,6 +1,7 @@
 /* Broadcom NetXtreme-C/E network driver.
  *
  * Copyright (c) 2014-2016 Broadcom Corporation
+ * Copyright (c) 2016-2017 Broadcom Limited
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -3974,7 +3975,7 @@ static int hwrm_ring_alloc_send_msg(struct bnxt *bp,
 		req.length = cpu_to_le32(bp->rx_agg_ring_mask + 1);
 		break;
 	case HWRM_RING_ALLOC_CMPL:
-		req.ring_type = RING_ALLOC_REQ_RING_TYPE_CMPL;
+		req.ring_type = RING_ALLOC_REQ_RING_TYPE_L2_CMPL;
 		req.length = cpu_to_le32(bp->cp_ring_mask + 1);
 		if (bp->flags & BNXT_FLAG_USING_MSIX)
 			req.int_mode = RING_ALLOC_REQ_INT_MODE_MSIX;
@@ -3993,7 +3994,7 @@ static int hwrm_ring_alloc_send_msg(struct bnxt *bp,

 	if (rc || err) {
 		switch (ring_type) {
-		case RING_FREE_REQ_RING_TYPE_CMPL:
+		case RING_FREE_REQ_RING_TYPE_L2_CMPL:
 			netdev_err(bp->dev, "hwrm_ring_alloc cp failed. rc:%x err:%x\n",
 				   rc, err);
 			return -1;
@@ -4137,7 +4138,7 @@ static int hwrm_ring_free_send_msg(struct bnxt *bp,

 	if (rc || error_code) {
 		switch (ring_type) {
-		case RING_FREE_REQ_RING_TYPE_CMPL:
+		case RING_FREE_REQ_RING_TYPE_L2_CMPL:
 			netdev_err(bp->dev, "hwrm_ring_free cp failed. rc:%d\n",
 				   rc);
 			return rc;
@@ -4226,7 +4227,7 @@ static void bnxt_hwrm_ring_free(struct bnxt *bp, bool close_path)

 		if (ring->fw_ring_id != INVALID_HW_RING_ID) {
 			hwrm_ring_free_send_msg(bp, ring,
-						RING_FREE_REQ_RING_TYPE_CMPL,
+						RING_FREE_REQ_RING_TYPE_L2_CMPL,
 						INVALID_HW_RING_ID);
 			ring->fw_ring_id = INVALID_HW_RING_ID;
 			bp->grp_info[i].cp_fw_ring_id = INVALID_HW_RING_ID;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 9f07b9c..eaed700 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -1,6 +1,7 @@
 /* Broadcom NetXtreme-C/E network driver.
  *
  * Copyright (c) 2014-2016 Broadcom Corporation
+ * Copyright (c) 2016-2017 Broadcom Limited
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -11,10 +12,10 @@
 #define BNXT_H

 #define DRV_MODULE_NAME		"bnxt_en"
-#define DRV_MODULE_VERSION	"1.6.0"
+#define DRV_MODULE_VERSION	"1.7.0"

 #define DRV_VER_MAJ	1
-#define DRV_VER_MIN	6
+#define DRV_VER_MIN	7
 #define DRV_VER_UPD	0

 struct tx_bd {
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_hsi.h b/drivers/net/ethernet/broadcom/bnxt/bnxt_hsi.h
index 5df32ab..6e275c2 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_hsi.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_hsi.h
@@ -11,12 +11,12 @@
 #ifndef BNXT_HSI_H
 #define BNXT_HSI_H

-/* HSI and HWRM Specification 1.6.1 */
+/* HSI and HWRM Specification 1.7.0 */
 #define HWRM_VERSION_MAJOR	1
-#define HWRM_VERSION_MINOR	6
-#define HWRM_VERSION_UPDATE	1
+#define HWRM_VERSION_MINOR	7
+#define HWRM_VERSION_UPDATE	0

-#define HWRM_VERSION_STR	"1.6.1"
+#define HWRM_VERSION_STR	"1.7.0"
 /*
  * Following is the signature for HWRM message field that indicates not
  * applicable (All F's). Need to cast it the size of the field if needed.
@@ -834,20 +834,32 @@ struct hwrm_func_qcfg_output {
 	__le32 min_bw;
 	#define FUNC_QCFG_RESP_MIN_BW_BW_VALUE_MASK 0xfffUL
 	#define FUNC_QCFG_RESP_MIN_BW_BW_VALUE_SFT 0
-	#define FUNC_QCFG_RESP_MIN_BW_RSVD 0x1000UL
+	#define FUNC_QCFG_RESP_MIN_BW_SCALE 0x1000UL
+	#define FUNC_QCFG_RESP_MIN_BW_SCALE_BITS (0x0UL << 28)
+	#define FUNC_QCFG_RESP_MIN_BW_SCALE_BYTES (0
[PATCH net-next 2/9] bnxt_en: Retry failed NVM_INSTALL_UPDATE with defragmentation flag.
From: Kshitij Soni If the HWRM_NVM_INSTALL_UPDATE command fails with the error code NVM_INSTALL_UPDATE_CMD_ERR_CODE_FRAG_ERR, retry the command with a new flag to allow defragmentation. Since we are checking the response for error code, we also need to take the mutex until we finish reading the response. Signed-off-by: Kshitij Soni Signed-off-by: Michael Chan --- drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 32 ++- 1 file changed, 26 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c index 7aa248d..4b45b88 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c @@ -1578,17 +1578,37 @@ static int bnxt_flash_package_from_file(struct net_device *dev, bnxt_hwrm_cmd_hdr_init(bp, &install, HWRM_NVM_INSTALL_UPDATE, -1, -1); install.install_type = cpu_to_le32(install_type); - rc = hwrm_send_message(bp, &install, sizeof(install), - INSTALL_PACKAGE_TIMEOUT); - if (rc) - return -EOPNOTSUPP; + mutex_lock(&bp->hwrm_cmd_lock); + rc = _hwrm_send_message(bp, &install, sizeof(install), + INSTALL_PACKAGE_TIMEOUT); + if (rc) { + rc = -EOPNOTSUPP; + goto flash_pkg_exit; + } + + if (resp->error_code) { + u8 error_code = ((struct hwrm_err_output *)resp)->cmd_err; + + if (error_code == NVM_INSTALL_UPDATE_CMD_ERR_CODE_FRAG_ERR) { + install.flags |= cpu_to_le16( + NVM_INSTALL_UPDATE_REQ_FLAGS_ALLOWED_TO_DEFRAG); + rc = _hwrm_send_message(bp, &install, sizeof(install), + INSTALL_PACKAGE_TIMEOUT); + if (rc) { + rc = -EOPNOTSUPP; + goto flash_pkg_exit; + } + } + } if (resp->result) { netdev_err(dev, "PKG install error = %d, problem_item = %d\n", (s8)resp->result, (int)resp->problem_item); - return -ENOPKG; + rc = -ENOPKG; } - return 0; +flash_pkg_exit: + mutex_unlock(&bp->hwrm_cmd_lock); + return rc; } static int bnxt_flash_device(struct net_device *dev, -- 1.8.3.1
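The retry logic in the patch above boils down to "send, and if the firmware reports fragmentation, set the defrag flag and send once more". The sketch below shows that control flow in user-space C under stated assumptions: every `toy_` name is invented, the transport is a hypothetical callback, and the locking the patch adds around the exchange (holding the command mutex until the response has been read) is deliberately left out.

```c
#include <stdbool.h>

#define TOY_ERR_NONE	0
#define TOY_ERR_FRAG	1	/* stand-in for NVM_INSTALL_UPDATE_CMD_ERR_CODE_FRAG_ERR */

struct toy_install_req {
	bool allow_defrag;	/* stand-in for the ALLOWED_TO_DEFRAG request flag */
};

/*
 * Hypothetical transport callback: returns nonzero if the message could not
 * be delivered, otherwise stores the firmware's error code in *fw_err.
 */
typedef int (*toy_send_fn)(const struct toy_install_req *req, int *fw_err, void *ctx);

/* Send once; if firmware reports fragmentation, retry once with defrag allowed. */
static int toy_install_update(toy_send_fn send, void *ctx)
{
	struct toy_install_req req = { .allow_defrag = false };
	int fw_err = TOY_ERR_NONE;

	if (send(&req, &fw_err, ctx))
		return -1;		/* transport failure */

	if (fw_err == TOY_ERR_FRAG) {
		req.allow_defrag = true;
		if (send(&req, &fw_err, ctx))
			return -1;
	}
	return fw_err == TOY_ERR_NONE ? 0 : -2;	/* -2: firmware still reports an error */
}

/* Fake firmware for demonstration: rejects the install until defrag is allowed. */
static int toy_fragmented_fw(const struct toy_install_req *req, int *fw_err, void *ctx)
{
	int *calls = ctx;

	(*calls)++;
	*fw_err = req->allow_defrag ? TOY_ERR_NONE : TOY_ERR_FRAG;
	return 0;
}
```

Calling `toy_install_update(toy_fragmented_fw, &calls)` with `calls` starting at zero succeeds after exactly two sends. Retrying only once keeps the behaviour bounded if the firmware keeps reporting the same error.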
[PATCH net-next 9/9] bnxt_en: Added PCI IDs for BCM57452 and BCM57454 ASICs
From: Deepak Khungar Signed-off-by: Deepak Khungar Signed-off-by: Michael Chan --- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c index c899d61..71f9a18 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -99,6 +99,8 @@ enum board_idx { BCM57407_NPAR, BCM57414_NPAR, BCM57416_NPAR, + BCM57452, + BCM57454, NETXTREME_E_VF, NETXTREME_C_VF, }; @@ -133,6 +135,8 @@ enum board_idx { { "Broadcom BCM57407 NetXtreme-E Ethernet Partition" }, { "Broadcom BCM57414 NetXtreme-E Ethernet Partition" }, { "Broadcom BCM57416 NetXtreme-E Ethernet Partition" }, + { "Broadcom BCM57452 NetXtreme-E 10Gb/25Gb/40Gb/50Gb Ethernet" }, + { "Broadcom BCM57454 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb Ethernet" }, { "Broadcom NetXtreme-E Ethernet Virtual Function" }, { "Broadcom NetXtreme-C Ethernet Virtual Function" }, }; @@ -168,6 +172,8 @@ enum board_idx { { PCI_VDEVICE(BROADCOM, 0x16ed), .driver_data = BCM57414_NPAR }, { PCI_VDEVICE(BROADCOM, 0x16ee), .driver_data = BCM57416_NPAR }, { PCI_VDEVICE(BROADCOM, 0x16ef), .driver_data = BCM57416_NPAR }, + { PCI_VDEVICE(BROADCOM, 0x16f1), .driver_data = BCM57452 }, + { PCI_VDEVICE(BROADCOM, 0x1614), .driver_data = BCM57454 }, #ifdef CONFIG_BNXT_SRIOV { PCI_VDEVICE(BROADCOM, 0x16c1), .driver_data = NETXTREME_E_VF }, { PCI_VDEVICE(BROADCOM, 0x16cb), .driver_data = NETXTREME_C_VF }, -- 1.8.3.1
[PATCH net-next 4/9] bnxt_en: Allow NETIF_F_NTUPLE to be enabled on VFs.
Commit ae10ae740ad2 ("bnxt_en: Add new hardware RFS mode.") has added code to allow NTUPLE to be enabled on VFs. So we now remove the BNXT_VF() check in rfs_capable() to allow NTUPLE on VFs. Signed-off-by: Michael Chan --- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c index 8ac5987..516c5d7 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -6291,7 +6291,7 @@ static bool bnxt_rfs_capable(struct bnxt *bp) #ifdef CONFIG_RFS_ACCEL int vnics, max_vnics, max_rss_ctxs; - if (BNXT_VF(bp) || !(bp->flags & BNXT_FLAG_MSIX_CAP)) + if (!(bp->flags & BNXT_FLAG_MSIX_CAP)) return false; vnics = 1 + bp->rx_nr_rings; -- 1.8.3.1
[PATCH net-next 8/9] bnxt_en: Fix bnxt_setup_tc() error message.
Add proper punctuation to make the message more clear.

Signed-off-by: Michael Chan
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 9f1dfbe..c899d61 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -6833,7 +6833,7 @@ int bnxt_setup_mq_tc(struct net_device *dev, u8 tc)
 	int rc;
 
 	if (tc > bp->max_tc) {
-		netdev_err(dev, "too many traffic classes requested: %d Max supported is %d\n",
+		netdev_err(dev, "Too many traffic classes requested: %d. Max supported is %d.\n",
 			   tc, bp->max_tc);
 		return -EINVAL;
 	}
-- 
1.8.3.1
[PATCH net-next 0/9] bnxt_en: Misc updates.
Miscellaneous updates include update of the firmware spec, ethtool flash enhancement, ethtool -l minor fix, NTUPLE support enhancements, FEC link settings message during link up, and new PCI IDs. Please review. Thanks. Deepak Khungar (1): bnxt_en: Added PCI IDs for BCM57452 and BCM57454 ASICs Kshitij Soni (1): bnxt_en: Retry failed NVM_INSTALL_UPDATE with defragmentation flag. Michael Chan (7): bnxt_en: Update to firmware interface spec 1.7.0. bnxt_en: Fix ethtool -l pre-set max combined channel. bnxt_en: Allow NETIF_F_NTUPLE to be enabled on VFs. bnxt_en: Add hardware NTUPLE filter for encapsulated packets. bnxt_en: Do not setup PHY unless driving a single PF. bnxt_en: Print FEC settings as part of the linkup dmesg. bnxt_en: Fix bnxt_setup_tc() error message. drivers/net/ethernet/broadcom/bnxt/bnxt.c | 52 ++- drivers/net/ethernet/broadcom/bnxt/bnxt.h | 9 +- drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 34 +- drivers/net/ethernet/broadcom/bnxt/bnxt_hsi.h | 437 ++ 4 files changed, 431 insertions(+), 101 deletions(-) -- 1.8.3.1
linux-next: manual merge of the ipsec-next tree with the net-next tree
Hi Steffen, Today's linux-next merge of the ipsec-next tree got a conflict in: net/xfrm/xfrm_policy.c between commit: 63fca65d0863 ("net: add confirm_neigh method to dst_ops") from the net-next tree and commits: 3d7d25a68ea5 ("xfrm: policy: remove garbage_collect callback") a2817d8b279b ("xfrm: policy: remove family field") from the ipsec-next tree. I fixed it up (see below) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts. -- Cheers, Stephen Rothwell diff --cc net/xfrm/xfrm_policy.c index f68d75766d51,04ed1a1ae019.. --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@@ -2856,32 -2843,15 +2843,32 @@@ static struct neighbour *xfrm_neigh_loo return dst->path->ops->neigh_lookup(dst, skb, daddr); } +static void xfrm_confirm_neigh(const struct dst_entry *dst, const void *daddr) +{ + const struct dst_entry *path = dst->path; + + for (; dst != path; dst = dst->child) { + const struct xfrm_state *xfrm = dst->xfrm; + + if (xfrm->props.mode == XFRM_MODE_TRANSPORT) + continue; + if (xfrm->type->flags & XFRM_TYPE_REMOTE_COADDR) + daddr = xfrm->coaddr; + else if (!(xfrm->type->flags & XFRM_TYPE_LOCAL_COADDR)) + daddr = &xfrm->id.daddr; + } + path->ops->confirm_neigh(path, daddr); +} + - int xfrm_policy_register_afinfo(struct xfrm_policy_afinfo *afinfo) + int xfrm_policy_register_afinfo(const struct xfrm_policy_afinfo *afinfo, int family) { int err = 0; - if (unlikely(afinfo == NULL)) - return -EINVAL; - if (unlikely(afinfo->family >= NPROTO)) + + if (WARN_ON(family >= ARRAY_SIZE(xfrm_policy_afinfo))) return -EAFNOSUPPORT; + spin_lock(&xfrm_policy_afinfo_lock); - if (unlikely(xfrm_policy_afinfo[afinfo->family] != NULL)) + if (unlikely(xfrm_policy_afinfo[family] != NULL)) err = 
-EEXIST; else { struct dst_ops *dst_ops = afinfo->dst_ops; @@@ -2899,11 -2869,7 +2886,9 @@@ dst_ops->link_failure = xfrm_link_failure; if (likely(dst_ops->neigh_lookup == NULL)) dst_ops->neigh_lookup = xfrm_neigh_lookup; + if (likely(!dst_ops->confirm_neigh)) + dst_ops->confirm_neigh = xfrm_confirm_neigh; - if (likely(afinfo->garbage_collect == NULL)) - afinfo->garbage_collect = xfrm_garbage_collect_deferred; - rcu_assign_pointer(xfrm_policy_afinfo[afinfo->family], afinfo); + rcu_assign_pointer(xfrm_policy_afinfo[family], afinfo); } spin_unlock(&xfrm_policy_afinfo_lock);
Re: [PATCH v2 net] bpf: introduce BPF_F_ALLOW_OVERRIDE flag
On Sun, Feb 12, 2017 at 12:01 AM, Daniel Mack wrote: > On 02/11/2017 05:28 AM, Alexei Starovoitov wrote: >> If BPF_F_ALLOW_OVERRIDE flag is used in BPF_PROG_ATTACH command >> to the given cgroup the descendent cgroup will be able to override >> effective bpf program that was inherited from this cgroup. >> By default it's not passed, therefore override is disallowed. >> >> Examples: >> 1. >> prog X attached to /A with default >> prog Y fails to attach to /A/B and /A/B/C >> Everything under /A runs prog X >> >> 2. >> prog X attached to /A with allow_override. >> prog Y fails to attach to /A/B with default (non-override) >> prog M attached to /A/B with allow_override. >> Everything under /A/B runs prog M only. >> >> 3. >> prog X attached to /A with allow_override. >> prog Y fails to attach to /A with default. >> The user has to detach first to switch the mode. >> >> In the future this behavior may be extended with a chain of >> non-overridable programs. >> >> Also fix the bug where detach from cgroup where nothing is attached >> was not throwing error. Return ENOENT in such case. >> >> Add several testcases and adjust libbpf. >> >> Fixes: 3007098494be ("cgroup: add support for eBPF programs") >> Signed-off-by: Alexei Starovoitov > > Looks good to me. > > Acked-by: Daniel Mack > > Let's get this into 4.10! Agreed. --Andy > > > Thanks, > Daniel > > > >> --- >> v1->v2: disallowed overridable->non_override transition as suggested by Andy >> added tests and fixed double detach bug >> >> Andy, Daniel, >> please review and ack quickly, so it can land into 4.10. 
>> --- >> include/linux/bpf-cgroup.h | 13 >> include/uapi/linux/bpf.h | 7 + >> kernel/bpf/cgroup.c | 59 +++--- >> kernel/bpf/syscall.c | 20 >> kernel/cgroup.c | 9 +++--- >> samples/bpf/test_cgrp2_attach.c | 2 +- >> samples/bpf/test_cgrp2_attach2.c | 68 >> +--- >> samples/bpf/test_cgrp2_sock.c| 2 +- >> samples/bpf/test_cgrp2_sock2.c | 2 +- >> tools/lib/bpf/bpf.c | 4 ++- >> tools/lib/bpf/bpf.h | 3 +- >> 11 files changed, 151 insertions(+), 38 deletions(-) >> >> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h >> index 92bc89ae7e20..c970a25d2a49 100644 >> --- a/include/linux/bpf-cgroup.h >> +++ b/include/linux/bpf-cgroup.h >> @@ -21,20 +21,19 @@ struct cgroup_bpf { >>*/ >> struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE]; >> struct bpf_prog __rcu *effective[MAX_BPF_ATTACH_TYPE]; >> + bool disallow_override[MAX_BPF_ATTACH_TYPE]; >> }; >> >> void cgroup_bpf_put(struct cgroup *cgrp); >> void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent); >> >> -void __cgroup_bpf_update(struct cgroup *cgrp, >> - struct cgroup *parent, >> - struct bpf_prog *prog, >> - enum bpf_attach_type type); >> +int __cgroup_bpf_update(struct cgroup *cgrp, struct cgroup *parent, >> + struct bpf_prog *prog, enum bpf_attach_type type, >> + bool overridable); >> >> /* Wrapper for __cgroup_bpf_update() protected by cgroup_mutex */ >> -void cgroup_bpf_update(struct cgroup *cgrp, >> -struct bpf_prog *prog, >> -enum bpf_attach_type type); >> +int cgroup_bpf_update(struct cgroup *cgrp, struct bpf_prog *prog, >> + enum bpf_attach_type type, bool overridable); >> >> int __cgroup_bpf_run_filter_skb(struct sock *sk, >> struct sk_buff *skb, >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >> index e5b8cf16cbaf..69f65b710b10 100644 >> --- a/include/uapi/linux/bpf.h >> +++ b/include/uapi/linux/bpf.h >> @@ -116,6 +116,12 @@ enum bpf_attach_type { >> >> #define MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE >> >> +/* If BPF_F_ALLOW_OVERRIDE flag is used in BPF_PROG_ATTACH 
command >> + * to the given target_fd cgroup the descendent cgroup will be able to >> + * override effective bpf program that was inherited from this cgroup >> + */ >> +#define BPF_F_ALLOW_OVERRIDE (1U << 0) >> + >> #define BPF_PSEUDO_MAP_FD1 >> >> /* flags for BPF_MAP_UPDATE_ELEM command */ >> @@ -171,6 +177,7 @@ union bpf_attr { >> __u32 target_fd; /* container object to attach >> to */ >> __u32 attach_bpf_fd; /* eBPF program to attach */ >> __u32 attach_type; >> + __u32 attach_flags; >> }; >> } __attribute__((aligned(8))); >> >> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c >> index a515f7b007c6..da0f53690295 100644 >> --- a/kernel/bpf/cgroup.c >> +++ b/kernel/bpf/cgroup.c >> @@ -52,6 +52,7 @@ void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup >> *parent) >> e = rcu_dereference_protected(parent->bpf.effective[type], >>
Re: [PATCH v2 net-next 00/14] mlx4: order-0 allocations and page recycling
On Sun, 12 Feb 2017 12:57:46 -0800 Eric Dumazet wrote: > On Sun, 2017-02-12 at 18:31 +0200, Tariq Toukan wrote: > > On 09/02/2017 6:56 PM, Eric Dumazet wrote: > > >> Default, out of box. > > > Well. Please report : > > > > > > ethtool -l eth0 > > > ethtool -g eth0 > > $ ethtool -g p1p1 > > Ring parameters for p1p1: > > Pre-set maximums: > > RX: 8192 > > RX Mini:0 > > RX Jumbo: 0 > > TX: 8192 > > Current hardware settings: > > RX: 1024 > > RX Mini:0 > > RX Jumbo: 0 > > TX: 512 > > We are using 4096 slots per RX queue, this is why I could not reproduce > your results. Just so others understand this: The number of RX queue slots is indirectly the size of the page-recycle "cache" in this scheme (that depend on refcnt tricks to see if page can be reused). > A single TCP flow easily can have more than 1024 MSS waiting in its > receive queue (typical receive window on linux is 6MB/2 ) So, you do need to increase the page-"cache" size, and need this for real-life cases, interesting. > I mentioned that having a slightly inflated skb->truesize might have an > impact in some workloads. (charging for 2048 bytes per MSS instead of > 1536), but this is not related to mlx4 and should be tweaked in TCP > stack instead, since this 2048 bytes (half a page on x86) strategy is > now well spread. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
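To put rough numbers on the point about ring size and truesize, here is a small arithmetic sketch. The 3 MB figure is the "6MB/2" receive window quoted above; the 1460-byte MSS is an assumption (it is not stated in the thread), and both helper names are made up for illustration.

```c
#include <stddef.h>

/* Number of buffers needed to hold `bytes` of payload at `mss` bytes each. */
size_t buffers_needed(size_t bytes, size_t mss)
{
	return (bytes + mss - 1) / mss;
}

/* Memory charged when each buffer is accounted at `truesize` bytes. */
size_t charged_bytes(size_t nbufs, size_t truesize)
{
	return nbufs * truesize;
}
```

Under these assumptions, buffers_needed(3 * 1024 * 1024, 1460) is 2155, which is already more than a 1024-slot RX ring can recycle, and charging 2048 instead of 1536 bytes per buffer raises the accounted memory from 3310080 to 4413440 bytes.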
[PATCH 0/2] IPv4-mapped on wire, :: dst address issue
Under some circumstances IPv6 datagrams are sent with IPv4-mapped IPv6 addresses as the source. Given an IPv6 socket bound to an IPv4-mapped IPv6 address, and an IPv6 destination address, both TCP and UDP will send packets using the IPv4-mapped IPv6 address as the source. Per RFC 6890 (Table 20), IPv4-mapped IPv6 source addresses are not allowed in an IP datagram. The problem can be observed by attempting to connect() either a TCP or UDP socket, or by using sendmsg() with a UDP socket. The patch is intended to correct this issue for all socket types.

Linux follows the BSD convention that an IPv6 destination address specified as in6addr_any is converted to the loopback address. Currently, neither TCP nor UDP consider the possibility that the source address is an IPv4-mapped IPv6 address, and assume that the appropriate loopback address is ::1. The patch adds a check on whether or not the source address is an IPv4-mapped IPv6 address and then sets the destination address to either ::ffff:127.0.0.1 or ::1, as appropriate.

Jon

Jonathan T. Leighton (2):
  ipv6: Inhibit IPv4-mapped src address on the wire.
  ipv6: Handle IPv4-mapped src to in6addr_any dst.

 net/ipv6/datagram.c   | 14 +++++++++-----
 net/ipv6/ip6_output.c |  3 +++
 net/ipv6/tcp_ipv6.c   | 11 ++++++++---
 net/ipv6/udp.c        |  4 ++++
 4 files changed, 24 insertions(+), 8 deletions(-)

-- 
2.7.4
[PATCH 1/2] ipv6: Inhibit IPv4-mapped src address on the wire.
This patch adds a check for the problematic case of an IPv4-mapped IPv6 source address and a destination address that is neither an IPv4-mapped IPv6 address nor in6addr_any, and returns an appropriate error. The check is done before returning from looking up the route.

Signed-off-by: Jonathan T. Leighton
---
 net/ipv6/ip6_output.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index a75871c..d0f51b4 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1022,6 +1022,9 @@ static int ip6_dst_lookup_tail(struct net *net, const struct sock *sk,
 		}
 	}
 #endif
+	if (ipv6_addr_v4mapped(&fl6->saddr) &&
+	    !(ipv6_addr_v4mapped(&fl6->daddr) || ipv6_addr_any(&fl6->daddr)))
+		return -EAFNOSUPPORT;
 
 	return 0;
 
-- 
2.7.4
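For readers who want to try the rule outside the kernel, the same predicate can be sketched in userspace with the POSIX address macros. This is only a model of the check in the patch: `check_src_dst` and `check_src_dst_str` are hypothetical names, and the example addresses come from the documentation ranges.

```c
#include <errno.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Userspace model of the check: reject an IPv4-mapped source unless the
 * destination is also IPv4-mapped or is the unspecified address (::).
 */
int check_src_dst(const struct in6_addr *saddr, const struct in6_addr *daddr)
{
	if (IN6_IS_ADDR_V4MAPPED(saddr) &&
	    !(IN6_IS_ADDR_V4MAPPED(daddr) || IN6_IS_ADDR_UNSPECIFIED(daddr)))
		return -EAFNOSUPPORT;
	return 0;
}

/* Convenience wrapper for textual addresses; -EINVAL on a parse error. */
int check_src_dst_str(const char *src, const char *dst)
{
	struct in6_addr s, d;

	if (inet_pton(AF_INET6, src, &s) != 1 ||
	    inet_pton(AF_INET6, dst, &d) != 1)
		return -EINVAL;
	return check_src_dst(&s, &d);
}
```

With this model, a source of ::ffff:192.0.2.1 is rejected for the destination 2001:db8::1 but accepted for ::ffff:192.0.2.2 and for ::.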
[PATCH 2/2] ipv6: Handle IPv4-mapped src to in6addr_any dst.
This patch adds a check on the type of the source address for the case where the destination address is in6addr_any. If the source is an IPv4-mapped IPv6 source address, the destination is changed to ::ffff:127.0.0.1, and otherwise the destination is changed to ::1. This is done in three locations to handle UDP calls to either connect() or sendmsg() and TCP calls to connect(). Note that udpv6_sendmsg() delays handling an in6addr_any destination until very late, so the patch only needs to handle the case where the source is an IPv4-mapped IPv6 address.

Signed-off-by: Jonathan T. Leighton
---
 net/ipv6/datagram.c | 14 +++++++++-----
 net/ipv6/tcp_ipv6.c | 11 ++++++++---
 net/ipv6/udp.c      |  4 ++++
 3 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index a3eaafd..eec27f8 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -167,18 +167,22 @@ int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr,
 	if (np->sndflow)
 		fl6_flowlabel = usin->sin6_flowinfo & IPV6_FLOWINFO_MASK;
 
-	addr_type = ipv6_addr_type(&usin->sin6_addr);
-
-	if (addr_type == IPV6_ADDR_ANY) {
+	if (ipv6_addr_any(&usin->sin6_addr)) {
 		/*
 		 *	connect to self
 		 */
-		usin->sin6_addr.s6_addr[15] = 0x01;
+		if (ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr))
+			ipv6_addr_set_v4mapped(htonl(INADDR_LOOPBACK),
+					       &usin->sin6_addr);
+		else
+			usin->sin6_addr = in6addr_loopback;
 	}
 
+	addr_type = ipv6_addr_type(&usin->sin6_addr);
+
 	daddr = &usin->sin6_addr;
 
-	if (addr_type == IPV6_ADDR_MAPPED) {
+	if (addr_type & IPV6_ADDR_MAPPED) {
 		struct sockaddr_in sin;
 
 		if (__ipv6_only_sock(sk)) {
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index b5d2721..21c7199 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -149,8 +149,13 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	 *	connect() to INADDR_ANY means loopback (BSD'ism).
 	 */
 
-	if (ipv6_addr_any(&usin->sin6_addr))
-		usin->sin6_addr.s6_addr[15] = 0x1;
+	if (ipv6_addr_any(&usin->sin6_addr)) {
+		if (ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr))
+			ipv6_addr_set_v4mapped(htonl(INADDR_LOOPBACK),
+					       &usin->sin6_addr);
+		else
+			usin->sin6_addr = in6addr_loopback;
+	}
 
 	addr_type = ipv6_addr_type(&usin->sin6_addr);
 
@@ -189,7 +194,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	 *	TCP over IPv4
 	 */
 
-	if (addr_type == IPV6_ADDR_MAPPED) {
+	if (addr_type & IPV6_ADDR_MAPPED) {
 		u32 exthdrlen = icsk->icsk_ext_hdr_len;
 		struct sockaddr_in sin;
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index df71ba0..4e4c401 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1046,6 +1046,10 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 			if (addr_len < SIN6_LEN_RFC2133)
 				return -EINVAL;
 			daddr = &sin6->sin6_addr;
+			if (ipv6_addr_any(daddr) &&
+			    ipv6_addr_v4mapped(&np->saddr))
+				ipv6_addr_set_v4mapped(htonl(INADDR_LOOPBACK),
+						       daddr);
 			break;
 		case AF_INET:
 			goto do_udp_sendmsg;
-- 
2.7.4
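The destination rewrite can be modelled in userspace as follows. This is a sketch under the assumption that only the unspecified destination is touched; `fixup_any_dst` is a hypothetical name, and the byte array spells out the IPv4-mapped form of 127.0.0.1 by hand instead of using the kernel helper ipv6_addr_set_v4mapped().

```c
#include <string.h>
#include <netinet/in.h>

/*
 * If *daddr is the unspecified address, rewrite it to the loopback address
 * that matches the bound source: the IPv4-mapped form of 127.0.0.1 for an
 * IPv4-mapped source, ::1 otherwise. Other destinations are left untouched.
 */
void fixup_any_dst(const struct in6_addr *saddr, struct in6_addr *daddr)
{
	static const unsigned char mapped_loopback[16] = {
		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xff, 0xff, 127, 0, 0, 1
	};

	if (!IN6_IS_ADDR_UNSPECIFIED(daddr))
		return;

	if (IN6_IS_ADDR_V4MAPPED(saddr))
		memcpy(daddr->s6_addr, mapped_loopback, sizeof(mapped_loopback));
	else
		*daddr = in6addr_loopback;
}
```

Keeping the source and destination in the same address flavour is what lets the later IPv4-mapped branch in connect() take over for the mapped case.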
[PATCH net] net/llc: avoid BUG_ON() in skb_orphan()
From: Eric Dumazet It seems nobody used LLC since linux-3.12. Fortunately fuzzers like syzkaller still know how to run this code, otherwise it would be no fun. Setting skb->sk without skb->destructor leads to all kinds of bugs, we now prefer to be very strict about it. Ideally here we would use skb_set_owner() but this helper does not exist yet, only CAN seems to have a private helper for that. Fixes: 376c7311bdb6 ("net: add a temporary sanity check in skb_orphan()") Signed-off-by: Eric Dumazet Reported-by: Andrey Konovalov --- net/llc/llc_conn.c |3 +++ net/llc/llc_sap.c |3 +++ 2 files changed, 6 insertions(+) diff --git a/net/llc/llc_conn.c b/net/llc/llc_conn.c index 3e821daf9dd4a2fbf00550591e92b153efd4a73a..8bc5a1bd2d453542df31506f543feb64b64cdd96 100644 --- a/net/llc/llc_conn.c +++ b/net/llc/llc_conn.c @@ -821,7 +821,10 @@ void llc_conn_handler(struct llc_sap *sap, struct sk_buff *skb) * another trick required to cope with how the PROCOM state * machine works. -acme */ + skb_orphan(skb); + sock_hold(sk); skb->sk = sk; + skb->destructor = sock_efree; } if (!sock_owned_by_user(sk)) llc_conn_rcv(sk, skb); diff --git a/net/llc/llc_sap.c b/net/llc/llc_sap.c index d0e1e804ebd73dcebcf2f930b921233a49b0f454..5404d0d195cc581613e356b75bd70321e617673e 100644 --- a/net/llc/llc_sap.c +++ b/net/llc/llc_sap.c @@ -290,7 +290,10 @@ static void llc_sap_rcv(struct llc_sap *sap, struct sk_buff *skb, ev->type = LLC_SAP_EV_TYPE_PDU; ev->reason = 0; + skb_orphan(skb); + sock_hold(sk); skb->sk = sk; + skb->destructor = sock_efree; llc_sap_state_process(sap, skb); }
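The invariant behind this fix (a buffer that points at a socket must also carry a destructor that drops the reference) can be illustrated with a tiny userspace model. None of these types or functions are kernel APIs: `sock_model`, `skb_model` and the `*_model` helpers are hypothetical, and the BUG_ON() in the real skb_orphan() is only mentioned in a comment.

```c
#include <stddef.h>

struct sock_model {
	int refcnt;
};

struct skb_model {
	struct sock_model *sk;
	void (*destructor)(struct skb_model *skb);
};

/* Destructor that drops the reference taken when the owner was set. */
static void sock_put_model(struct skb_model *skb)
{
	skb->sk->refcnt--;
}

/*
 * Detach the buffer from its owner by running the destructor. In the kernel,
 * a buffer with sk set but no destructor is the case that trips the BUG_ON().
 */
void skb_orphan_model(struct skb_model *skb)
{
	if (skb->destructor)
		skb->destructor(skb);
	skb->destructor = NULL;
	skb->sk = NULL;
}

/* Take a reference and install the matching destructor in one step. */
void skb_set_owner_model(struct skb_model *skb, struct sock_model *sk)
{
	skb_orphan_model(skb);
	sk->refcnt++;
	skb->sk = sk;
	skb->destructor = sock_put_model;
}
```

Setting the owner and the destructor together is the pairing the patch adds with sock_hold() and sock_efree.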
Re: [PATCH net-next v2] net: phy: Allow splitting MDIO bus/device support from PHYs
Hi Florian, [auto build test ERROR on net-next/master] url: https://github.com/0day-ci/linux/commits/Florian-Fainelli/net-phy-Allow-splitting-MDIO-bus-device-support-from-PHYs/20170210-115834 config: i386-randconfig-h1-02130126 (attached as .config) compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901 reproduce: # save the attached .config to linux build tree make ARCH=i386 All errors (new ones prefixed by >>): drivers/built-in.o: In function `unimac_mdio_remove': >> mdio-bcm-unimac.c:(.text+0x2f6d9b): undefined reference to >> `mdiobus_unregister' >> mdio-bcm-unimac.c:(.text+0x2f6db0): undefined reference to `mdiobus_free' drivers/built-in.o: In function `unimac_mdio_reset': >> mdio-bcm-unimac.c:(.text+0x2f6e6a): undefined reference to >> `of_mdio_parse_addr' >> mdio-bcm-unimac.c:(.text+0x2f6ef7): undefined reference to `mdiobus_read' drivers/built-in.o: In function `unimac_mdio_probe': >> mdio-bcm-unimac.c:(.text+0x2f7297): undefined reference to >> `mdiobus_alloc_size' >> mdio-bcm-unimac.c:(.text+0x2f7320): undefined reference to >> `of_mdiobus_register' mdio-bcm-unimac.c:(.text+0x2f73aa): undefined reference to `mdiobus_free' drivers/built-in.o: In function `alloc_mdio_bitbang': >> (.text+0x2f7b28): undefined reference to `mdiobus_alloc_size' drivers/built-in.o: In function `free_mdio_bitbang': >> (.text+0x2f7bc1): undefined reference to `mdiobus_free' --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: application/gzip
[PATCH v1] samples/bpf: Add a .gitignore for binaries
Signed-off-by: Mickaël Salaün
Cc: Alexei Starovoitov
Cc: Arnaldo Carvalho de Melo
Cc: Daniel Borkmann
Cc: Wang Nan
---
 samples/bpf/.gitignore | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)
 create mode 100644 samples/bpf/.gitignore

diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore
new file mode 100644
index 000000000000..a7562a5ef4c2
--- /dev/null
+++ b/samples/bpf/.gitignore
@@ -0,0 +1,32 @@
+fds_example
+lathist
+lwt_len_hist
+map_perf_test
+offwaketime
+sampleip
+sockex1
+sockex2
+sockex3
+sock_example
+spintest
+tc_l2_redirect
+test_cgrp2_array_pin
+test_cgrp2_attach
+test_cgrp2_attach2
+test_cgrp2_sock
+test_cgrp2_sock2
+test_current_task_under_cgroup
+test_lru_dist
+test_overhead
+test_probe_write_user
+trace_event
+trace_output
+tracex1
+tracex2
+tracex3
+tracex4
+tracex5
+tracex6
+xdp1
+xdp2
+xdp_tx_iptunnel
-- 
2.11.0
Re: [PATCH v2 net-next 00/14] mlx4: order-0 allocations and page recycling
On Sun, 2017-02-12 at 18:31 +0200, Tariq Toukan wrote: > On 09/02/2017 6:56 PM, Eric Dumazet wrote: > >> Default, out of box. > > Well. Please report : > > > > ethtool -l eth0 > > ethtool -g eth0 > $ ethtool -g p1p1 > Ring parameters for p1p1: > Pre-set maximums: > RX: 8192 > RX Mini:0 > RX Jumbo: 0 > TX: 8192 > Current hardware settings: > RX: 1024 > RX Mini:0 > RX Jumbo: 0 > TX: 512 We are using 4096 slots per RX queue, this is why I could not reproduce your results. A single TCP flow easily can have more than 1024 MSS waiting in its receive queue (typical receive window on linux is 6MB/2 ) I mentioned that having a slightly inflated skb->truesize might have an impact in some workloads. (charging for 2048 bytes per MSS instead of 1536), but this is not related to mlx4 and should be tweaked in TCP stack instead, since this 2048 bytes (half a page on x86) strategy is now well spread.
[PATCH] net: nuvoton: w90p910: use new api ethtool_{get|set}_link_ksettings
The ethtool api {get|set}_settings is deprecated. We move this driver to the new api {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if someone could test this patch.

Signed-off-by: Philippe Reynes
---
 drivers/net/ethernet/nuvoton/w90p910_ether.c | 14 ++++++++------
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/nuvoton/w90p910_ether.c b/drivers/net/ethernet/nuvoton/w90p910_ether.c
index 119f6dc..9709c8c 100644
--- a/drivers/net/ethernet/nuvoton/w90p910_ether.c
+++ b/drivers/net/ethernet/nuvoton/w90p910_ether.c
@@ -874,16 +874,18 @@ static void w90p910_get_drvinfo(struct net_device *dev,
 	strlcpy(info->version, DRV_MODULE_VERSION, sizeof(info->version));
 }
 
-static int w90p910_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int w90p910_get_link_ksettings(struct net_device *dev,
+				      struct ethtool_link_ksettings *cmd)
 {
 	struct w90p910_ether *ether = netdev_priv(dev);
-	return mii_ethtool_gset(&ether->mii, cmd);
+	return mii_ethtool_get_link_ksettings(&ether->mii, cmd);
 }
 
-static int w90p910_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int w90p910_set_link_ksettings(struct net_device *dev,
+				      const struct ethtool_link_ksettings *cmd)
 {
 	struct w90p910_ether *ether = netdev_priv(dev);
-	return mii_ethtool_sset(&ether->mii, cmd);
+	return mii_ethtool_set_link_ksettings(&ether->mii, cmd);
 }
 
 static int w90p910_nway_reset(struct net_device *dev)
@@ -899,11 +901,11 @@ static u32 w90p910_get_link(struct net_device *dev)
 }
 
 static const struct ethtool_ops w90p910_ether_ethtool_ops = {
-	.get_settings	= w90p910_get_settings,
-	.set_settings	= w90p910_set_settings,
 	.get_drvinfo	= w90p910_get_drvinfo,
 	.nway_reset	= w90p910_nway_reset,
 	.get_link	= w90p910_get_link,
+	.get_link_ksettings = w90p910_get_link_ksettings,
+	.set_link_ksettings = w90p910_set_link_ksettings,
 };
 
 static const struct net_device_ops w90p910_ether_netdev_ops = {
-- 
1.7.4.4
Re: [PATCH net-next v4 1/2] qed: Add infrastructure for PTP support.
On Sun, Feb 12, 2017 at 03:07:34PM +0000, Mintz, Yuval wrote:
> Your algorithm ignores the HW limitation. Consider (ppb == 1):
> your logic would output N == 7, *M == 7000000000,
> Which has perfect accuracy [N / *M is 1 / 10^9].
> But the solution for
>   'period' * 16 + 8 == 7 * 10^9
> isn't a whole number, so this result doesn't really reflect the actual
> approximation error since we couldn't configure it to HW.

Ok, so change my code to read:

	/*truncate to HW resolution*/
	reg = (m - 8) / 16;
	m = reg * 16 + 8;

Your HW will happily accept the value of 'reg', right?

> The original would return val == 1, period == 62499999; While this
> does have some error [val / (period * 16 + 8) is slightly bigger
> than 1 / 10^9, error at 18[?] digit after dot], it's the best we can
> configure for the HW.

That is *not* the best you can do:

	Perfect: 1 / 1000000000 = .000000001
	Yours:   1 / 999999992  = .000000001000000008
	Mine*:   7 / 6999999992 = .00000000100000000114

	[ * revised with the above change ]

Not a huge difference, but yours is not "the best we can".

Let's try another: ppb = 40831

	Perfect: 40831 / 1000000000 = .000040831
	Yours:   4 / 97960  = .00004083299305839118
	Mine:    5 / 122456 = .00004083099235643823

See the difference?

Please, try the two algorithms and plot the RMS error over the interval ppb = 1 ... 10. The result may surprise you.

> No. In an ideal world, I would have liked optimizing everything.
> But in this world if I do find time to spend on optimizations
> I rather do that for the stuff that matters. I.e., datapath.

As the PTP maintainer, I look after the PTP drivers. They should be as good as we can make them (even when the HW is as broken as yours is). That is why I bothered to review and to spend time thinking about your problem.

I especially care about having good examples in the tree, since this stuff will inevitably get copied by new driver authors. It is wonderful that your data path is so very optimized, but that is no excuse for poor PTP code.

Thanks,
Richard
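The truncation step suggested in the reply can be checked in isolation. This sketch assumes, as the thread states, that the hardware can only express divisors of the form reg * 16 + 8; `truncate_to_hw` is a hypothetical helper name and the 64-bit types are an assumption made so that 7 * 10^9 fits.

```c
#include <stdint.h>

/*
 * Truncate a desired divisor m to the largest value not above it that the
 * hardware can express as reg * 16 + 8. Assumes m >= 8. The register value
 * is returned through *reg.
 */
uint64_t truncate_to_hw(uint64_t m, uint64_t *reg)
{
	*reg = (m - 8) / 16;
	return *reg * 16 + 8;
}
```

For the ppb == 1 case discussed above, m = 7 * 10^9 truncates to 6999999992 with reg = 437499999, while m = 10^9 (the val == 1 case) truncates to 999999992 with reg = 62499999.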
[PATCH 05/21] netfilter: nf_tables: add flush field to struct nft_set_iter
This provides context to walk callback iterator, thus, we know if the walk happens from the set flush path. This is required by the new bitmap set type coming in a follow up patch which has no real struct nft_set_ext, so it has to allocate it based on the two bit compact element representation. Signed-off-by: Pablo Neira Ayuso --- include/net/netfilter/nf_tables.h | 1 + net/netfilter/nf_tables_api.c | 4 2 files changed, 5 insertions(+) diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h index ab155644d489..5830f594842e 100644 --- a/include/net/netfilter/nf_tables.h +++ b/include/net/netfilter/nf_tables.h @@ -203,6 +203,7 @@ struct nft_set_elem { struct nft_set; struct nft_set_iter { u8 genmask; + boolflush; unsigned intcount; unsigned intskip; int err; diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index c09b11eb36fc..7ae810b03462 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -3121,6 +3121,7 @@ int nf_tables_bind_set(const struct nft_ctx *ctx, struct nft_set *set, iter.count = 0; iter.err= 0; iter.fn = nf_tables_bind_check_setelem; + iter.flush = false; set->ops->walk(ctx, set, &iter); if (iter.err < 0) @@ -3374,6 +3375,7 @@ static int nf_tables_dump_set(struct sk_buff *skb, struct netlink_callback *cb) args.iter.count = 0; args.iter.err = 0; args.iter.fn= nf_tables_dump_setelem; + args.iter.flush = false; set->ops->walk(&ctx, set, &args.iter); nla_nest_end(skb, nest); @@ -3939,6 +3941,7 @@ static int nf_tables_delsetelem(struct net *net, struct sock *nlsk, struct nft_set_iter iter = { .genmask= genmask, .fn = nft_flush_set, + .flush = true, }; set->ops->walk(&ctx, set, &iter); @@ -5089,6 +5092,7 @@ static int nf_tables_check_loops(const struct nft_ctx *ctx, iter.count = 0; iter.err= 0; iter.fn = nf_tables_loop_check_setelem; + iter.flush = false; set->ops->walk(ctx, set, &iter); if (iter.err < 0) -- 2.1.4
[PATCH 01/21] netfilter: nft_exthdr: Add support for existence check
From: Phil Sutter If NFT_EXTHDR_F_PRESENT is set, exthdr will not copy any header field data into *dest, but instead set it to 1 if the header is found and 0 otherwise. Signed-off-by: Phil Sutter Signed-off-by: Pablo Neira Ayuso --- include/uapi/linux/netfilter/nf_tables.h | 6 ++ net/netfilter/nft_exthdr.c | 22 -- 2 files changed, 26 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h index 7b730cab99bd..53aac8b8ed6b 100644 --- a/include/uapi/linux/netfilter/nf_tables.h +++ b/include/uapi/linux/netfilter/nf_tables.h @@ -704,6 +704,10 @@ enum nft_payload_attributes { }; #define NFTA_PAYLOAD_MAX (__NFTA_PAYLOAD_MAX - 1) +enum nft_exthdr_flags { + NFT_EXTHDR_F_PRESENT = (1 << 0), +}; + /** * enum nft_exthdr_attributes - nf_tables IPv6 extension header expression netlink attributes * @@ -711,6 +715,7 @@ enum nft_payload_attributes { * @NFTA_EXTHDR_TYPE: extension header type (NLA_U8) * @NFTA_EXTHDR_OFFSET: extension header offset (NLA_U32) * @NFTA_EXTHDR_LEN: extension header length (NLA_U32) + * @NFTA_EXTHDR_FLAGS: extension header flags (NLA_U32) */ enum nft_exthdr_attributes { NFTA_EXTHDR_UNSPEC, @@ -718,6 +723,7 @@ enum nft_exthdr_attributes { NFTA_EXTHDR_TYPE, NFTA_EXTHDR_OFFSET, NFTA_EXTHDR_LEN, + NFTA_EXTHDR_FLAGS, __NFTA_EXTHDR_MAX }; #define NFTA_EXTHDR_MAX(__NFTA_EXTHDR_MAX - 1) diff --git a/net/netfilter/nft_exthdr.c b/net/netfilter/nft_exthdr.c index 47beb3abcc9d..a89e5ab150db 100644 --- a/net/netfilter/nft_exthdr.c +++ b/net/netfilter/nft_exthdr.c @@ -23,6 +23,7 @@ struct nft_exthdr { u8 offset; u8 len; enum nft_registers dreg:8; + u8 flags; }; static void nft_exthdr_eval(const struct nft_expr *expr, @@ -35,8 +36,12 @@ static void nft_exthdr_eval(const struct nft_expr *expr, int err; err = ipv6_find_hdr(pkt->skb, &offset, priv->type, NULL, NULL); - if (err < 0) + if (priv->flags & NFT_EXTHDR_F_PRESENT) { + *dest = (err >= 0); + return; + } else if (err < 0) { goto err; + } offset += 
priv->offset; dest[priv->len / NFT_REG32_SIZE] = 0; @@ -52,6 +57,7 @@ static const struct nla_policy nft_exthdr_policy[NFTA_EXTHDR_MAX + 1] = { [NFTA_EXTHDR_TYPE] = { .type = NLA_U8 }, [NFTA_EXTHDR_OFFSET]= { .type = NLA_U32 }, [NFTA_EXTHDR_LEN] = { .type = NLA_U32 }, + [NFTA_EXTHDR_FLAGS] = { .type = NLA_U32 }, }; static int nft_exthdr_init(const struct nft_ctx *ctx, @@ -59,7 +65,7 @@ static int nft_exthdr_init(const struct nft_ctx *ctx, const struct nlattr * const tb[]) { struct nft_exthdr *priv = nft_expr_priv(expr); - u32 offset, len; + u32 offset, len, flags = 0; int err; if (tb[NFTA_EXTHDR_DREG] == NULL || @@ -76,10 +82,20 @@ static int nft_exthdr_init(const struct nft_ctx *ctx, if (err < 0) return err; + if (tb[NFTA_EXTHDR_FLAGS]) { + err = nft_parse_u32_check(tb[NFTA_EXTHDR_FLAGS], U8_MAX, &flags); + if (err < 0) + return err; + + if (flags & ~NFT_EXTHDR_F_PRESENT) + return -EINVAL; + } + priv->type = nla_get_u8(tb[NFTA_EXTHDR_TYPE]); priv->offset = offset; priv->len= len; priv->dreg = nft_parse_register(tb[NFTA_EXTHDR_DREG]); + priv->flags = flags; return nft_validate_register_store(ctx, priv->dreg, NULL, NFT_DATA_VALUE, priv->len); @@ -97,6 +113,8 @@ static int nft_exthdr_dump(struct sk_buff *skb, const struct nft_expr *expr) goto nla_put_failure; if (nla_put_be32(skb, NFTA_EXTHDR_LEN, htonl(priv->len))) goto nla_put_failure; + if (nla_put_be32(skb, NFTA_EXTHDR_FLAGS, htonl(priv->flags))) + goto nla_put_failure; return 0; nla_put_failure: -- 2.1.4
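The effect of the new flag on evaluation can be summarised with a small model. This is a sketch of the control flow only, not the nft_exthdr code: `F_PRESENT` stands in for NFT_EXTHDR_F_PRESENT, `found` and `field` replace the ipv6_find_hdr() lookup and the copied header bytes, and the boolean return value is an assumed way of representing the error path.

```c
#include <stdbool.h>
#include <stdint.h>

#define F_PRESENT (1u << 0)	/* hypothetical stand-in for NFT_EXTHDR_F_PRESENT */

/*
 * Model of the evaluation: `found` says whether the header lookup succeeded
 * and `field` is the value that would be copied from the header. Returns
 * false when evaluation takes the error path (header missing and the
 * presence flag not set); *dest is left untouched in that case.
 */
bool exthdr_eval(uint8_t flags, bool found, uint32_t field, uint32_t *dest)
{
	if (flags & F_PRESENT) {
		*dest = found ? 1 : 0;
		return true;
	}
	if (!found)
		return false;
	*dest = field;
	return true;
}
```

With the flag set, a missing header is no longer an error: the register simply receives 0, which is what allows rules to match on absence.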
[PATCH 04/21] netfilter: nf_tables: rename deactivate_one() to flush()
Although semantics are similar to deactivate() with no implicit element lookup, this is only called from the set flush path, so better rename this to flush(). Signed-off-by: Pablo Neira Ayuso --- include/net/netfilter/nf_tables.h | 8 net/netfilter/nf_tables_api.c | 2 +- net/netfilter/nft_set_hash.c | 8 net/netfilter/nft_set_rbtree.c| 8 4 files changed, 13 insertions(+), 13 deletions(-) diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h index a721bcb1210c..ab155644d489 100644 --- a/include/net/netfilter/nf_tables.h +++ b/include/net/netfilter/nf_tables.h @@ -260,7 +260,7 @@ struct nft_expr; * @insert: insert new element into set * @activate: activate new element in the next generation * @deactivate: lookup for element and deactivate it in the next generation - * @deactivate_one: deactivate element in the next generation + * @flush: deactivate element in the next generation * @remove: remove element from set * @walk: iterate over all set elemeennts * @privsize: function to return size of set private data @@ -295,9 +295,9 @@ struct nft_set_ops { void * (*deactivate)(const struct net *net, const struct nft_set *set, const struct nft_set_elem *elem); - bool(*deactivate_one)(const struct net *net, - const struct nft_set *set, - void *priv); + bool(*flush)(const struct net *net, +const struct nft_set *set, +void *priv); void(*remove)(const struct net *net, const struct nft_set *set, const struct nft_set_elem *elem); diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index 790ffed82930..c09b11eb36fc 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -3898,7 +3898,7 @@ static int nft_flush_set(const struct nft_ctx *ctx, if (!trans) return -ENOMEM; - if (!set->ops->deactivate_one(ctx->net, set, elem->priv)) { + if (!set->ops->flush(ctx->net, set, elem->priv)) { err = -ENOENT; goto err1; } diff --git a/net/netfilter/nft_set_hash.c b/net/netfilter/nft_set_hash.c index 
bb157bd47fe8..2f10ac3b1b10 100644 --- a/net/netfilter/nft_set_hash.c +++ b/net/netfilter/nft_set_hash.c @@ -167,8 +167,8 @@ static void nft_hash_activate(const struct net *net, const struct nft_set *set, nft_set_elem_clear_busy(&he->ext); } -static bool nft_hash_deactivate_one(const struct net *net, - const struct nft_set *set, void *priv) +static bool nft_hash_flush(const struct net *net, + const struct nft_set *set, void *priv) { struct nft_hash_elem *he = priv; @@ -195,7 +195,7 @@ static void *nft_hash_deactivate(const struct net *net, rcu_read_lock(); he = rhashtable_lookup_fast(&priv->ht, &arg, nft_hash_params); if (he != NULL && - !nft_hash_deactivate_one(net, set, he)) + !nft_hash_flush(net, set, he)) he = NULL; rcu_read_unlock(); @@ -398,7 +398,7 @@ static struct nft_set_ops nft_hash_ops __read_mostly = { .insert = nft_hash_insert, .activate = nft_hash_activate, .deactivate = nft_hash_deactivate, - .deactivate_one = nft_hash_deactivate_one, + .flush = nft_hash_flush, .remove = nft_hash_remove, .lookup = nft_hash_lookup, .update = nft_hash_update, diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c index 9fbd70da1633..81b8a4c2c061 100644 --- a/net/netfilter/nft_set_rbtree.c +++ b/net/netfilter/nft_set_rbtree.c @@ -172,8 +172,8 @@ static void nft_rbtree_activate(const struct net *net, nft_set_elem_change_active(net, set, &rbe->ext); } -static bool nft_rbtree_deactivate_one(const struct net *net, - const struct nft_set *set, void *priv) +static bool nft_rbtree_flush(const struct net *net, +const struct nft_set *set, void *priv) { struct nft_rbtree_elem *rbe = priv; @@ -214,7 +214,7 @@ static void *nft_rbtree_deactivate(const struct net *net, parent = parent->rb_right; continue; } - nft_rbtree_deactivate_one(net, set, rbe); + nft_rbtree_flush(net, set, rbe
[PATCH 07/21] netfilter: nf_tables: add space notation to sets
The space notation allows us to classify the set backend implementation based on the amount of required memory. This provides an order of the set representation scalability in terms of memory. The size field is still left in place so use this if the userspace provides no explicit number of elements, so we cannot calculate the real memory that this set needs. This also helps us break ties in the set backend selection routine, eg. two backend implementations provide the same performance. Signed-off-by: Pablo Neira Ayuso --- include/net/netfilter/nf_tables.h | 2 ++ net/netfilter/nf_tables_api.c | 22 +- net/netfilter/nft_set_hash.c | 1 + net/netfilter/nft_set_rbtree.c| 1 + 4 files changed, 21 insertions(+), 5 deletions(-) diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h index d76ac2f80a40..21ce50e6d0c5 100644 --- a/include/net/netfilter/nf_tables.h +++ b/include/net/netfilter/nf_tables.h @@ -245,10 +245,12 @@ enum nft_set_class { * * @size: required memory * @lookup: lookup performance class + * @space: memory class */ struct nft_set_estimate { unsigned intsize; enum nft_set_class lookup; + enum nft_set_class space; }; struct nft_set_ext; diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index fa7cd1679079..cb6ae46f6c48 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -2404,6 +2404,7 @@ nft_select_set_ops(const struct nlattr * const nla[], bops= NULL; best.size = ~0; best.lookup = ~0; + best.space = ~0; list_for_each_entry(ops, &nf_tables_set_ops, list) { if ((ops->features & features) != features) @@ -2415,14 +2416,25 @@ nft_select_set_ops(const struct nlattr * const nla[], case NFT_SET_POL_PERFORMANCE: if (est.lookup < best.lookup) break; - if (est.lookup == best.lookup && est.size < best.size) - break; + if (est.lookup == best.lookup) { + if (!desc->size) { + if (est.space < best.space) + break; + } else if (est.size < best.size) { + break; + } + } continue; case 
NFT_SET_POL_MEMORY: - if (est.size < best.size) - break; - if (est.size == best.size && est.lookup < best.lookup) + if (!desc->size) { + if (est.space < best.space) + break; + if (est.space == best.space && + est.lookup < best.lookup) + break; + } else if (est.size < best.size) { break; + } continue; default: break; diff --git a/net/netfilter/nft_set_hash.c b/net/netfilter/nft_set_hash.c index e58e7f02138b..6938bc890f31 100644 --- a/net/netfilter/nft_set_hash.c +++ b/net/netfilter/nft_set_hash.c @@ -385,6 +385,7 @@ static bool nft_hash_estimate(const struct nft_set_desc *desc, u32 features, } est->lookup = NFT_SET_CLASS_O_1; + est->space = NFT_SET_CLASS_O_N; return true; } diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c index 2b6ea10c4bbd..3387ed7dd231 100644 --- a/net/netfilter/nft_set_rbtree.c +++ b/net/netfilter/nft_set_rbtree.c @@ -292,6 +292,7 @@ static bool nft_rbtree_estimate(const struct nft_set_desc *desc, u32 features, est->size = nsize; est->lookup = NFT_SET_CLASS_O_LOG_N; + est->space = NFT_SET_CLASS_O_N; return true; } -- 2.1.4
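The comparison logic that nft_select_set_ops() ends up with after this patch can be tried out in userspace. The following is a rough sketch under stated assumptions: all names are hypothetical stand-ins for the kernel ones, desc_size models desc->size, and the loop accepts the first candidate unconditionally where the kernel instead initialises best.* to ~0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace copies of the kernel enums; lower is better. */
enum set_class { CLASS_O_1, CLASS_O_LOG_N, CLASS_O_N };
enum set_policy { POL_PERFORMANCE, POL_MEMORY };

struct set_estimate {
    unsigned int size;      /* required memory, meaningful if desc_size != 0 */
    enum set_class lookup;  /* lookup performance class */
    enum set_class space;   /* memory class */
};

/* Returns nonzero if est should replace best, mirroring the patched switch. */
static int estimate_is_better(enum set_policy policy, unsigned int desc_size,
                              const struct set_estimate *est,
                              const struct set_estimate *best)
{
    switch (policy) {
    case POL_PERFORMANCE:
        if (est->lookup < best->lookup)
            return 1;
        if (est->lookup == best->lookup) {
            /* Tie on lookup: fall back to space class or exact size. */
            if (!desc_size)
                return est->space < best->space;
            return est->size < best->size;
        }
        return 0;
    case POL_MEMORY:
        if (!desc_size) {
            if (est->space < best->space)
                return 1;
            return est->space == best->space && est->lookup < best->lookup;
        }
        return est->size < best->size;
    }
    return 0;
}

/* Returns the index of the selected candidate, or -1 if there is none. */
static int select_backend(enum set_policy policy, unsigned int desc_size,
                          const struct set_estimate *cands, size_t n)
{
    int best = -1;

    assert(cands != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (best < 0 ||
            estimate_is_better(policy, desc_size, &cands[i], &cands[best]))
            best = (int)i;
    }
    return best;
}
```

With a hash-like candidate (O(1) lookup, O(n) space) and an rbtree-like one (O(log n) lookup, O(n) space), the memory policy without a size hint keeps the hash candidate, because the space classes tie and lookup breaks the tie. A backend advertising a better space class would win under that policy regardless of its size estimate.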
[PATCH 13/21] netfilter: nf_ct_sip: Use mod_timer_pending()
From: Gao Feng

del_timer() followed by add_timer() can be replaced by mod_timer_pending().

Signed-off-by: Gao Feng
Signed-off-by: Pablo Neira Ayuso
---
 net/netfilter/nf_conntrack_sip.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/net/netfilter/nf_conntrack_sip.c b/net/netfilter/nf_conntrack_sip.c
index c3fc14e021ec..24174c520239 100644
--- a/net/netfilter/nf_conntrack_sip.c
+++ b/net/netfilter/nf_conntrack_sip.c
@@ -809,13 +809,11 @@ static int refresh_signalling_expectation(struct nf_conn *ct,
 		    exp->tuple.dst.protonum != proto ||
 		    exp->tuple.dst.u.udp.port != port)
 			continue;
-		if (!del_timer(&exp->timeout))
-			continue;
-		exp->flags &= ~NF_CT_EXPECT_INACTIVE;
-		exp->timeout.expires = jiffies + expires * HZ;
-		add_timer(&exp->timeout);
-		found = 1;
-		break;
+		if (mod_timer_pending(&exp->timeout, jiffies + expires * HZ)) {
+			exp->flags &= ~NF_CT_EXPECT_INACTIVE;
+			found = 1;
+			break;
+		}
 	}
 	spin_unlock_bh(&nf_conntrack_expect_lock);
 	return found;
-- 
2.1.4
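To see why the two forms are interchangeable here, a toy userspace model of the timer helps. This is a sketch under loud assumptions: struct fake_timer and the fake_* helpers are invented for illustration and only mimic the documented behaviour that mod_timer_pending() updates a timer that is still pending and leaves an inactive one alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Invented toy timer: only tracks whether it is queued and when it fires. */
struct fake_timer {
    bool pending;
    unsigned long expires;
};

/* Mimics del_timer(): dequeue, report whether the timer was pending. */
static bool fake_del_timer(struct fake_timer *t)
{
    bool was_pending = t->pending;

    t->pending = false;
    return was_pending;
}

/* Mimics add_timer(): queue the timer with its current expiry. */
static void fake_add_timer(struct fake_timer *t)
{
    t->pending = true;
}

/* Mimics mod_timer_pending(): touch the expiry only if still pending. */
static bool fake_mod_timer_pending(struct fake_timer *t, unsigned long expires)
{
    if (!t->pending)
        return false;
    t->expires = expires;
    return true;
}

/* Old code path: del_timer() + manual expiry update + add_timer(). */
static bool refresh_old(struct fake_timer *t, unsigned long expires)
{
    assert(t != NULL);
    if (!fake_del_timer(t))
        return false;
    t->expires = expires;
    fake_add_timer(t);
    return true;
}

/* New code path: a single mod_timer_pending() call. */
static bool refresh_new(struct fake_timer *t, unsigned long expires)
{
    assert(t != NULL);
    return fake_mod_timer_pending(t, expires);
}
```

In both variants an expectation whose timer has already fired is skipped and left untouched, while a pending one gets its expiry pushed out. The new form does this without the window in which the timer is dequeued.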
[PATCH 15/21] netfilter: nfnetlink: get rid of u_intX_t types
Use uX types instead. Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nfnetlink.c | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c index a09fa9fd8f3d..586212ebba9e 100644 --- a/net/netfilter/nfnetlink.c +++ b/net/netfilter/nfnetlink.c @@ -100,9 +100,9 @@ int nfnetlink_subsys_unregister(const struct nfnetlink_subsystem *n) } EXPORT_SYMBOL_GPL(nfnetlink_subsys_unregister); -static inline const struct nfnetlink_subsystem *nfnetlink_get_subsys(u_int16_t type) +static inline const struct nfnetlink_subsystem *nfnetlink_get_subsys(u16 type) { - u_int8_t subsys_id = NFNL_SUBSYS_ID(type); + u8 subsys_id = NFNL_SUBSYS_ID(type); if (subsys_id >= NFNL_SUBSYS_COUNT) return NULL; @@ -111,9 +111,9 @@ static inline const struct nfnetlink_subsystem *nfnetlink_get_subsys(u_int16_t t } static inline const struct nfnl_callback * -nfnetlink_find_client(u_int16_t type, const struct nfnetlink_subsystem *ss) +nfnetlink_find_client(u16 type, const struct nfnetlink_subsystem *ss) { - u_int8_t cb_id = NFNL_MSG_TYPE(type); + u8 cb_id = NFNL_MSG_TYPE(type); if (cb_id >= ss->cb_count) return NULL; @@ -185,7 +185,7 @@ static int nfnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh) { int min_len = nlmsg_total_size(sizeof(struct nfgenmsg)); - u_int8_t cb_id = NFNL_MSG_TYPE(nlh->nlmsg_type); + u8 cb_id = NFNL_MSG_TYPE(nlh->nlmsg_type); struct nlattr *cda[ss->cb[cb_id].attr_count + 1]; struct nlattr *attr = (void *)nlh + min_len; int attrlen = nlh->nlmsg_len - min_len; @@ -273,7 +273,7 @@ enum { }; static void nfnetlink_rcv_batch(struct sk_buff *skb, struct nlmsghdr *nlh, - u_int16_t subsys_id) + u16 subsys_id) { struct sk_buff *oskb = skb; struct net *net = sock_net(skb->sk); @@ -365,7 +365,7 @@ static void nfnetlink_rcv_batch(struct sk_buff *skb, struct nlmsghdr *nlh, { int min_len = nlmsg_total_size(sizeof(struct nfgenmsg)); - u_int8_t cb_id = NFNL_MSG_TYPE(nlh->nlmsg_type); + u8 cb_id = 
NFNL_MSG_TYPE(nlh->nlmsg_type); struct nlattr *cda[ss->cb[cb_id].attr_count + 1]; struct nlattr *attr = (void *)nlh + min_len; int attrlen = nlh->nlmsg_len - min_len; @@ -439,7 +439,7 @@ static void nfnetlink_rcv_batch(struct sk_buff *skb, struct nlmsghdr *nlh, static void nfnetlink_rcv(struct sk_buff *skb) { struct nlmsghdr *nlh = nlmsg_hdr(skb); - u_int16_t res_id; + u16 res_id; int msglen; if (nlh->nlmsg_len < NLMSG_HDRLEN || -- 2.1.4
[PATCH 03/21] netfilter: nf_tables: use struct nft_set_iter in set element flush
Use struct nft_set_iter directly instead of struct nft_set_dump_args; this removes an unnecessary wrapper structure.

Signed-off-by: Pablo Neira Ayuso
---
 net/netfilter/nf_tables_api.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 3643ce345b59..790ffed82930 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -3936,15 +3936,13 @@ static int nf_tables_delsetelem(struct net *net, struct sock *nlsk,
 		return -EBUSY;
 
 	if (nla[NFTA_SET_ELEM_LIST_ELEMENTS] == NULL) {
-		struct nft_set_dump_args args = {
-			.iter	= {
-				.genmask	= genmask,
-				.fn		= nft_flush_set,
-			},
+		struct nft_set_iter iter = {
+			.genmask	= genmask,
+			.fn		= nft_flush_set,
 		};
-		set->ops->walk(&ctx, set, &args.iter);
+		set->ops->walk(&ctx, set, &iter);
 
-		return args.iter.err;
+		return iter.err;
 	}
 
 	nla_for_each_nested(attr, nla[NFTA_SET_ELEM_LIST_ELEMENTS], rem) {
-- 
2.1.4
[PATCH 09/21] netfilter: nft_ct: add zone id get support
From: Florian Westphal Just like with counters the direction attribute is optional. We set priv->dir to MAX unconditionally to avoid duplicating the assignment for all keys with optional direction. For keys where direction is mandatory, existing code already returns an error. Signed-off-by: Florian Westphal Signed-off-by: Pablo Neira Ayuso --- include/uapi/linux/netfilter/nf_tables.h | 2 ++ net/netfilter/nft_ct.c | 22 +++--- 2 files changed, 21 insertions(+), 3 deletions(-) diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h index 53aac8b8ed6b..3e60ed78c538 100644 --- a/include/uapi/linux/netfilter/nf_tables.h +++ b/include/uapi/linux/netfilter/nf_tables.h @@ -870,6 +870,7 @@ enum nft_rt_attributes { * @NFT_CT_PKTS: conntrack packets * @NFT_CT_BYTES: conntrack bytes * @NFT_CT_AVGPKT: conntrack average bytes per packet + * @NFT_CT_ZONE: conntrack zone */ enum nft_ct_keys { NFT_CT_STATE, @@ -889,6 +890,7 @@ enum nft_ct_keys { NFT_CT_PKTS, NFT_CT_BYTES, NFT_CT_AVGPKT, + NFT_CT_ZONE, }; /** diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c index 66a2377510e1..5bd4cdfdcda5 100644 --- a/net/netfilter/nft_ct.c +++ b/net/netfilter/nft_ct.c @@ -151,6 +151,18 @@ static void nft_ct_get_eval(const struct nft_expr *expr, case NFT_CT_PROTOCOL: *dest = nf_ct_protonum(ct); return; +#ifdef CONFIG_NF_CONNTRACK_ZONES + case NFT_CT_ZONE: { + const struct nf_conntrack_zone *zone = nf_ct_zone(ct); + + if (priv->dir < IP_CT_DIR_MAX) + *dest = nf_ct_zone_id(zone, priv->dir); + else + *dest = zone->id; + + return; + } +#endif default: break; } @@ -266,6 +278,7 @@ static int nft_ct_get_init(const struct nft_ctx *ctx, int err; priv->key = ntohl(nla_get_be32(tb[NFTA_CT_KEY])); + priv->dir = IP_CT_DIR_MAX; switch (priv->key) { case NFT_CT_DIRECTION: if (tb[NFTA_CT_DIRECTION] != NULL) @@ -333,11 +346,13 @@ static int nft_ct_get_init(const struct nft_ctx *ctx, case NFT_CT_BYTES: case NFT_CT_PKTS: case NFT_CT_AVGPKT: - /* no direction? 
return sum of original + reply */ - if (tb[NFTA_CT_DIRECTION] == NULL) - priv->dir = IP_CT_DIR_MAX; len = sizeof(u64); break; +#ifdef CONFIG_NF_CONNTRACK_ZONES + case NFT_CT_ZONE: + len = sizeof(u16); + break; +#endif default: return -EOPNOTSUPP; } @@ -465,6 +480,7 @@ static int nft_ct_get_dump(struct sk_buff *skb, const struct nft_expr *expr) case NFT_CT_BYTES: case NFT_CT_PKTS: case NFT_CT_AVGPKT: + case NFT_CT_ZONE: if (priv->dir < IP_CT_DIR_MAX && nla_put_u8(skb, NFTA_CT_DIRECTION, priv->dir)) goto nla_put_failure; -- 2.1.4
[PATCH 02/21] netfilter: nf_tables: pass netns to set->ops->remove()
This new parameter is required by the new bitmap set type that comes in a follow up patch. Signed-off-by: Pablo Neira Ayuso --- include/net/netfilter/nf_tables.h | 3 ++- net/netfilter/nf_tables_api.c | 6 +++--- net/netfilter/nft_set_hash.c | 3 ++- net/netfilter/nft_set_rbtree.c| 3 ++- 4 files changed, 9 insertions(+), 6 deletions(-) diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h index 7dfdb517f0be..a721bcb1210c 100644 --- a/include/net/netfilter/nf_tables.h +++ b/include/net/netfilter/nf_tables.h @@ -298,7 +298,8 @@ struct nft_set_ops { bool(*deactivate_one)(const struct net *net, const struct nft_set *set, void *priv); - void(*remove)(const struct nft_set *set, + void(*remove)(const struct net *net, + const struct nft_set *set, const struct nft_set_elem *elem); void(*walk)(const struct nft_ctx *ctx, struct nft_set *set, diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index 57eeae63f597..3643ce345b59 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -3752,7 +3752,7 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set, return 0; err6: - set->ops->remove(set, &elem); + set->ops->remove(ctx->net, set, &elem); err5: kfree(trans); err4: @@ -4804,7 +4804,7 @@ static int nf_tables_commit(struct net *net, struct sk_buff *skb) nf_tables_setelem_notify(&trans->ctx, te->set, &te->elem, NFT_MSG_DELSETELEM, 0); - te->set->ops->remove(te->set, &te->elem); + te->set->ops->remove(net, te->set, &te->elem); atomic_dec(&te->set->nelems); te->set->ndeact--; break; @@ -4925,7 +4925,7 @@ static int nf_tables_abort(struct net *net, struct sk_buff *skb) case NFT_MSG_NEWSETELEM: te = (struct nft_trans_elem *)trans->data; - te->set->ops->remove(te->set, &te->elem); + te->set->ops->remove(net, te->set, &te->elem); atomic_dec(&te->set->nelems); break; case NFT_MSG_DELSETELEM: diff --git a/net/netfilter/nft_set_hash.c b/net/netfilter/nft_set_hash.c index e36069fb76ae..bb157bd47fe8 
100644 --- a/net/netfilter/nft_set_hash.c +++ b/net/netfilter/nft_set_hash.c @@ -203,7 +203,8 @@ static void *nft_hash_deactivate(const struct net *net, return he; } -static void nft_hash_remove(const struct nft_set *set, +static void nft_hash_remove(const struct net *net, + const struct nft_set *set, const struct nft_set_elem *elem) { struct nft_hash *priv = nft_set_priv(set); diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c index f06f55ee516d..9fbd70da1633 100644 --- a/net/netfilter/nft_set_rbtree.c +++ b/net/netfilter/nft_set_rbtree.c @@ -151,7 +151,8 @@ static int nft_rbtree_insert(const struct net *net, const struct nft_set *set, return err; } -static void nft_rbtree_remove(const struct nft_set *set, +static void nft_rbtree_remove(const struct net *net, + const struct nft_set *set, const struct nft_set_elem *elem) { struct nft_rbtree *priv = nft_set_priv(set); -- 2.1.4
[PATCH 06/21] netfilter: nf_tables: rename struct nft_set_estimate class field
Use lookup as field name instead, to prepare the introduction of the memory class in a follow up patch. Signed-off-by: Pablo Neira Ayuso --- include/net/netfilter/nf_tables.h | 4 ++-- net/netfilter/nf_tables_api.c | 12 ++-- net/netfilter/nft_set_hash.c | 2 +- net/netfilter/nft_set_rbtree.c| 2 +- 4 files changed, 10 insertions(+), 10 deletions(-) diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h index 5830f594842e..d76ac2f80a40 100644 --- a/include/net/netfilter/nf_tables.h +++ b/include/net/netfilter/nf_tables.h @@ -244,11 +244,11 @@ enum nft_set_class { * characteristics * * @size: required memory - * @class: lookup performance class + * @lookup: lookup performance class */ struct nft_set_estimate { unsigned intsize; - enum nft_set_class class; + enum nft_set_class lookup; }; struct nft_set_ext; diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index 7ae810b03462..fa7cd1679079 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -2401,9 +2401,9 @@ nft_select_set_ops(const struct nlattr * const nla[], features &= NFT_SET_INTERVAL | NFT_SET_MAP | NFT_SET_TIMEOUT; } - bops = NULL; - best.size = ~0; - best.class = ~0; + bops= NULL; + best.size = ~0; + best.lookup = ~0; list_for_each_entry(ops, &nf_tables_set_ops, list) { if ((ops->features & features) != features) @@ -2413,15 +2413,15 @@ nft_select_set_ops(const struct nlattr * const nla[], switch (policy) { case NFT_SET_POL_PERFORMANCE: - if (est.class < best.class) + if (est.lookup < best.lookup) break; - if (est.class == best.class && est.size < best.size) + if (est.lookup == best.lookup && est.size < best.size) break; continue; case NFT_SET_POL_MEMORY: if (est.size < best.size) break; - if (est.size == best.size && est.class < best.class) + if (est.size == best.size && est.lookup < best.lookup) break; continue; default: diff --git a/net/netfilter/nft_set_hash.c b/net/netfilter/nft_set_hash.c index 2f10ac3b1b10..e58e7f02138b 
100644 --- a/net/netfilter/nft_set_hash.c +++ b/net/netfilter/nft_set_hash.c @@ -384,7 +384,7 @@ static bool nft_hash_estimate(const struct nft_set_desc *desc, u32 features, est->size = esize + 2 * sizeof(struct nft_hash_elem *); } - est->class = NFT_SET_CLASS_O_1; + est->lookup = NFT_SET_CLASS_O_1; return true; } diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c index 81b8a4c2c061..2b6ea10c4bbd 100644 --- a/net/netfilter/nft_set_rbtree.c +++ b/net/netfilter/nft_set_rbtree.c @@ -291,7 +291,7 @@ static bool nft_rbtree_estimate(const struct nft_set_desc *desc, u32 features, else est->size = nsize; - est->class = NFT_SET_CLASS_O_LOG_N; + est->lookup = NFT_SET_CLASS_O_LOG_N; return true; } -- 2.1.4
[PATCH 10/21] netfilter: nft_ct: prepare for key-dependent error unwind
From: Florian Westphal

Next patch will add ZONE_ID set support which will need similar error unwind (put operation) as conntrack labels.

Prepare for this: remove the 'label_got' boolean in favor of a switch statement that can be extended in next patch.

As we already have that in the set_destroy function place that in a separate function and call it from the set init function.

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
---
 net/netfilter/nft_ct.c | 29 +++--
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c
index 5bd4cdfdcda5..2d82df2737da 100644
--- a/net/netfilter/nft_ct.c
+++ b/net/netfilter/nft_ct.c
@@ -386,12 +386,24 @@ static int nft_ct_get_init(const struct nft_ctx *ctx,
 	return 0;
 }
 
+static void __nft_ct_set_destroy(const struct nft_ctx *ctx, struct nft_ct *priv)
+{
+	switch (priv->key) {
+#ifdef CONFIG_NF_CONNTRACK_LABELS
+	case NFT_CT_LABELS:
+		nf_connlabels_put(ctx->net);
+		break;
+#endif
+	default:
+		break;
+	}
+}
+
 static int nft_ct_set_init(const struct nft_ctx *ctx,
 			   const struct nft_expr *expr,
 			   const struct nlattr * const tb[])
 {
 	struct nft_ct *priv = nft_expr_priv(expr);
-	bool label_got = false;
 	unsigned int len;
 	int err;
 
@@ -412,7 +424,6 @@ static int nft_ct_set_init(const struct nft_ctx *ctx,
 		err = nf_connlabels_get(ctx->net, (len * BITS_PER_BYTE) - 1);
 		if (err)
 			return err;
-		label_got = true;
 		break;
 #endif
 	default:
@@ -431,8 +442,7 @@ static int nft_ct_set_init(const struct nft_ctx *ctx,
 	return 0;
 
 err1:
-	if (label_got)
-		nf_connlabels_put(ctx->net);
+	__nft_ct_set_destroy(ctx, priv);
 	return err;
 }
 
@@ -447,16 +457,7 @@ static void nft_ct_set_destroy(const struct nft_ctx *ctx,
 {
 	struct nft_ct *priv = nft_expr_priv(expr);
 
-	switch (priv->key) {
-#ifdef CONFIG_NF_CONNTRACK_LABELS
-	case NFT_CT_LABELS:
-		nf_connlabels_put(ctx->net);
-		break;
-#endif
-	default:
-		break;
-	}
-
+	__nft_ct_set_destroy(ctx, priv);
 	nft_ct_netns_put(ctx->net, ctx->afi->family);
 }
 
-- 
2.1.4
[PATCH 08/21] netfilter: nf_tables: add bitmap set type
This patch adds a new bitmap set type. This bitmap uses two bits to represent one element. These two bits determine the element state in the current and the future generation that fits into the nf_tables commit protocol. When dumping elements back to userspace, the two bits are expanded into a struct nft_set_ext object.

If no NFTA_SET_DESC_SIZE is specified, the existing automatic set backend selection prefers bitmap over hash for keys whose size is <= 16 bits. If the set size is known, the bitmap set type is selected for 16-bit keys when the set holds more than 390 elements; otherwise the hash table set implementation is used.

For 8-bit keys, the bitmap consumes 66 bytes. For 16-bit keys, the bitmap takes 16388 bytes.

Signed-off-by: Pablo Neira Ayuso
---
 net/netfilter/Kconfig          |   6 +
 net/netfilter/Makefile         |   1 +
 net/netfilter/nft_set_bitmap.c | 314 +
 3 files changed, 321 insertions(+)
 create mode 100644 net/netfilter/nft_set_bitmap.c

diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index dfbe9deeb8c4..ea479ed43373 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -509,6 +509,12 @@ config NFT_SET_HASH
 	  This option adds the "hash" set type that is used to build one-way
 	  mappings between matchings and actions.
 
+config NFT_SET_BITMAP
+	tristate "Netfilter nf_tables bitmap set module"
+	help
+	  This option adds the "bitmap" set type that is used to build sets
+	  whose keys are smaller or equal to 16 bits.
+ config NFT_COUNTER tristate "Netfilter nf_tables counter module" help diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile index 6b3034f12661..c9b78e7b342f 100644 --- a/net/netfilter/Makefile +++ b/net/netfilter/Makefile @@ -93,6 +93,7 @@ obj-$(CONFIG_NFT_REJECT) += nft_reject.o obj-$(CONFIG_NFT_REJECT_INET) += nft_reject_inet.o obj-$(CONFIG_NFT_SET_RBTREE) += nft_set_rbtree.o obj-$(CONFIG_NFT_SET_HASH) += nft_set_hash.o +obj-$(CONFIG_NFT_SET_BITMAP) += nft_set_bitmap.o obj-$(CONFIG_NFT_COUNTER) += nft_counter.o obj-$(CONFIG_NFT_LOG) += nft_log.o obj-$(CONFIG_NFT_MASQ) += nft_masq.o diff --git a/net/netfilter/nft_set_bitmap.c b/net/netfilter/nft_set_bitmap.c new file mode 100644 index ..97f9649bcc7e --- /dev/null +++ b/net/netfilter/nft_set_bitmap.c @@ -0,0 +1,314 @@ +/* + * Copyright (c) 2017 Pablo Neira Ayuso + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +/* This bitmap uses two bits to represent one element. These two bits determine + * the element state in the current and the future generation. + * + * An element can be in three states. The generation cursor is represented using + * the ^ character, note that this cursor shifts on every succesful transaction. + * If no transaction is going on, we observe all elements are in the following + * state: + * + * 11 = this element is active in the current generation. In case of no updates, + * ^it stays active in the next generation. + * 00 = this element is inactive in the current generation. In case of no + * ^updates, it stays inactive in the next generation. + * + * On transaction handling, we observe these two temporary states: + * + * 01 = this element is inactive in the current generation and it becomes active + * ^in the next one. 
This happens when the element is inserted but commit + * path has not yet been executed yet, so activation is still pending. On + * transaction abortion, the element is removed. + * 10 = this element is active in the current generation and it becomes inactive + * ^in the next one. This happens when the element is deactivated but commit + * path has not yet been executed yet, so removal is still pending. On + * transation abortion, the next generation bit is reset to go back to + * restore its previous state. + */ +struct nft_bitmap { + u16 bitmap_size; + u8 bitmap[]; +}; + +static inline void nft_bitmap_location(u32 key, u32 *idx, u32 *off) +{ + u32 k = (key << 1); + + *idx = k / BITS_PER_BYTE; + *off = k % BITS_PER_BYTE; +} + +/* Fetch the two bits that represent the element and check if it is active based + * on the generation mask. + */ +static inline bool +nft_bitmap_active(const u8 *bitmap, u32 idx, u32 off, u8 genmask) +{ + return (bitmap[idx] & (0x3 << off)) & (genmask << off); +} + +static bool nft_bitmap_lookup(const struct net *net, const struct nft_set *set, + const u32 *key, const struct nft_set_ext **ext) +{ + const struct nft_bitmap *priv = nft_set_priv(set); + u8 genmask = nft_genmask_cur(net); + u32 idx, off; +
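The two-bits-per-element layout described in the comment above can be exercised outside the kernel. What follows is a minimal sketch, not the patch's code: bitmap_location() and bitmap_active() follow the helpers quoted above, while bitmap_set_state() and bitmap_payload_bytes() are hypothetical additions for illustration. It also assumes the generation mask passed in is either 0x1 or 0x2, selecting one of the two bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BITS_PER_BYTE 8

/* Each key owns two adjacent bits: find the byte index and bit offset. */
static void bitmap_location(uint32_t key, uint32_t *idx, uint32_t *off)
{
    uint32_t k = key << 1;

    *idx = k / BITS_PER_BYTE;
    *off = k % BITS_PER_BYTE;
}

/* True if the element's bit for the given generation mask is set. */
static bool bitmap_active(const uint8_t *bitmap, uint32_t key, uint8_t genmask)
{
    uint32_t idx, off;

    bitmap_location(key, &idx, &off);
    return (bitmap[idx] & (0x3 << off)) & (genmask << off);
}

/* Hypothetical helper: overwrite both state bits of one element. */
static void bitmap_set_state(uint8_t *bitmap, uint32_t key, uint8_t state)
{
    uint32_t idx, off;

    assert(state <= 0x3);
    bitmap_location(key, &idx, &off);
    bitmap[idx] = (uint8_t)((bitmap[idx] & ~(0x3u << off)) |
                            ((unsigned int)state << off));
}

/* Hypothetical helper: payload bytes for 2 bits per possible key value. */
static uint32_t bitmap_payload_bytes(unsigned int key_bits)
{
    return (2u << key_bits) / BITS_PER_BYTE;
}
```

For 8-bit keys the payload works out to 64 bytes and for 16-bit keys to 16384 bytes. The figures quoted in the commit message are slightly larger, presumably because they cover the whole allocation rather than the bit array alone.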
[PATCH 14/21] netfilter: nf_ct_expect: nf_ct_expect_insert() returns void
From: Gao Feng

Because nf_ct_expect_insert() always succeeds now, it can return void instead of int. Also remove the code that checks its return value.

Signed-off-by: Gao Feng
Signed-off-by: Pablo Neira Ayuso
---
 net/netfilter/nf_conntrack_expect.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c
index f8dbacf66795..e19a69787d99 100644
--- a/net/netfilter/nf_conntrack_expect.c
+++ b/net/netfilter/nf_conntrack_expect.c
@@ -353,7 +353,7 @@ void nf_ct_expect_put(struct nf_conntrack_expect *exp)
 }
 EXPORT_SYMBOL_GPL(nf_ct_expect_put);
 
-static int nf_ct_expect_insert(struct nf_conntrack_expect *exp)
+static void nf_ct_expect_insert(struct nf_conntrack_expect *exp)
 {
 	struct nf_conn_help *master_help = nfct_help(exp->master);
 	struct nf_conntrack_helper *helper;
@@ -380,7 +380,6 @@ static int nf_ct_expect_insert(struct nf_conntrack_expect *exp)
 	add_timer(&exp->timeout);
 
 	NF_CT_STAT_INC(net, expect_create);
-	return 0;
 }
 
 /* Race with expectations being used means we could have none to find; OK. */
@@ -464,9 +463,8 @@ int nf_ct_expect_related_report(struct nf_conntrack_expect *expect,
 	if (ret <= 0)
 		goto out;
 
-	ret = nf_ct_expect_insert(expect);
-	if (ret < 0)
-		goto out;
+	nf_ct_expect_insert(expect);
+
 	spin_unlock_bh(&nf_conntrack_expect_lock);
 	nf_ct_expect_event_report(IPEXP_NEW, expect, portid, report);
 	return ret;
-- 
2.1.4
[PATCH 00/21] Netfilter updates for net-next
Hi David,

The following patchset contains Netfilter updates for your net-next tree, most notably:

1) Extend nft_exthdr to allow matching on TCP option bitfields, from Manuel Messner.

2) Allow checking whether an IPv6 extension header is present in nf_tables, from Phil Sutter.

3) Allow setting and matching the conntrack zone in nf_tables, patches from Florian Westphal.

4) Several patches for the nf_tables set infrastructure; this includes cleanup and preparatory patches to add the new bitmap set type.

5) Add an optional ruleset generation ID check to nf_tables and allow deleting rules that have no public handle yet via NFTA_RULE_ID. These patches add the missing kernel infrastructure to support rule deletion by description from userspace.

6) Add the missing NFT_SET_OBJECT flag to select the right backend when a set stores an object map.

7) A couple of cleanups for the expectation and SIP helper, from Gao Feng.

You can pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git

Thanks!

The following changes since commit 6e7bc478c9a006c701c14476ec9d389a484b4864:

  net: skb_needs_check() accepts CHECKSUM_NONE for tx (2017-02-03 17:33:01 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git HEAD

for you to fetch changes up to 7286ff7fde9f963736c7e575572899d8e16b06b7:

  netfilter: nf_tables: honor NFT_SET_OBJECT in set backend selection (2017-02-12 14:45:14 +0100)

Florian Westphal (3):
      netfilter: nft_ct: add zone id get support
      netfilter: nft_ct: prepare for key-dependent error unwind
      netfilter: nft_ct: add zone id set support

Gao Feng (2):
      netfilter: nf_ct_sip: Use mod_timer_pending()
      netfilter: nf_ct_expect: nf_ct_expect_insert() returns void

Manuel Messner (1):
      netfilter: nft_exthdr: add TCP option matching

Pablo Neira Ayuso (14):
      netfilter: nf_tables: pass netns to set->ops->remove()
      netfilter: nf_tables: use struct nft_set_iter in set element flush
      netfilter: nf_tables: rename deactivate_one() to flush()
      netfilter: nf_tables: add flush field to struct nft_set_iter
      netfilter: nf_tables: rename struct nft_set_estimate class field
      netfilter: nf_tables: add space notation to sets
      netfilter: nf_tables: add bitmap set type
      netfilter: nfnetlink: get rid of u_intX_t types
      netfilter: nfnetlink: add nfnetlink_rcv_skb_batch()
      netfilter: nfnetlink: allow to check for generation ID
      netfilter: nf_tables: add check_genid to the nfnetlink subsystem
      netfilter: nf_tables: add NFTA_RULE_ID attribute
      netfilter: update MAINTAINERS
      netfilter: nf_tables: honor NFT_SET_OBJECT in set backend selection

Phil Sutter (1):
      netfilter: nft_exthdr: Add support for existence check

 MAINTAINERS                              |   3 +-
 include/linux/netfilter/nfnetlink.h      |   1 +
 include/net/netfilter/nf_tables.h        |  21 ++-
 include/uapi/linux/netfilter/nf_tables.h |  27 ++-
 include/uapi/linux/netfilter/nfnetlink.h |  12 ++
 net/netfilter/Kconfig                    |  10 +-
 net/netfilter/Makefile                   |   1 +
 net/netfilter/nf_conntrack_expect.c      |   8 +-
 net/netfilter/nf_conntrack_sip.c         |  12 +-
 net/netfilter/nf_tables_api.c            |  89 ++---
 net/netfilter/nfnetlink.c                |  90 ++---
 net/netfilter/nft_ct.c                   | 195 +--
 net/netfilter/nft_exthdr.c               | 139 --
 net/netfilter/nft_set_bitmap.c           | 314 +++
 net/netfilter/nft_set_hash.c             |  16 +-
 net/netfilter/nft_set_rbtree.c           |  16 +-
 16 files changed, 832 insertions(+), 122 deletions(-)
 create mode 100644 net/netfilter/nft_set_bitmap.c
[PATCH 12/21] netfilter: nft_exthdr: add TCP option matching
From: Manuel Messner This patch implements the kernel side of the TCP option patch. Signed-off-by: Manuel Messner Reviewed-by: Florian Westphal Acked-by: Phil Sutter Signed-off-by: Pablo Neira Ayuso --- include/uapi/linux/netfilter/nf_tables.h | 17 - net/netfilter/Kconfig| 4 +- net/netfilter/nft_exthdr.c | 119 +++ 3 files changed, 124 insertions(+), 16 deletions(-) diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h index 3e60ed78c538..207951516ede 100644 --- a/include/uapi/linux/netfilter/nf_tables.h +++ b/include/uapi/linux/netfilter/nf_tables.h @@ -709,13 +709,27 @@ enum nft_exthdr_flags { }; /** - * enum nft_exthdr_attributes - nf_tables IPv6 extension header expression netlink attributes + * enum nft_exthdr_op - nf_tables match options + * + * @NFT_EXTHDR_OP_IPV6: match against ipv6 extension headers + * @NFT_EXTHDR_OP_TCP: match against tcp options + */ +enum nft_exthdr_op { + NFT_EXTHDR_OP_IPV6, + NFT_EXTHDR_OP_TCPOPT, + __NFT_EXTHDR_OP_MAX +}; +#define NFT_EXTHDR_OP_MAX (__NFT_EXTHDR_OP_MAX - 1) + +/** + * enum nft_exthdr_attributes - nf_tables extension header expression netlink attributes * * @NFTA_EXTHDR_DREG: destination register (NLA_U32: nft_registers) * @NFTA_EXTHDR_TYPE: extension header type (NLA_U8) * @NFTA_EXTHDR_OFFSET: extension header offset (NLA_U32) * @NFTA_EXTHDR_LEN: extension header length (NLA_U32) * @NFTA_EXTHDR_FLAGS: extension header flags (NLA_U32) + * @NFTA_EXTHDR_OP: option match type (NLA_U8) */ enum nft_exthdr_attributes { NFTA_EXTHDR_UNSPEC, @@ -724,6 +738,7 @@ enum nft_exthdr_attributes { NFTA_EXTHDR_OFFSET, NFTA_EXTHDR_LEN, NFTA_EXTHDR_FLAGS, + NFTA_EXTHDR_OP, __NFTA_EXTHDR_MAX }; #define NFTA_EXTHDR_MAX(__NFTA_EXTHDR_MAX - 1) diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig index ea479ed43373..9b28864cc36a 100644 --- a/net/netfilter/Kconfig +++ b/net/netfilter/Kconfig @@ -467,10 +467,10 @@ config NF_TABLES_NETDEV This option enables support for the "netdev" table. 
config NFT_EXTHDR - tristate "Netfilter nf_tables IPv6 exthdr module" + tristate "Netfilter nf_tables exthdr module" help This option adds the "exthdr" expression that you can use to match - IPv6 extension headers. + IPv6 extension headers and tcp options. config NFT_META tristate "Netfilter nf_tables meta module" diff --git a/net/netfilter/nft_exthdr.c b/net/netfilter/nft_exthdr.c index a89e5ab150db..c308920b194c 100644 --- a/net/netfilter/nft_exthdr.c +++ b/net/netfilter/nft_exthdr.c @@ -15,20 +15,29 @@ #include #include #include -// FIXME: -#include +#include struct nft_exthdr { u8 type; u8 offset; u8 len; + u8 op; enum nft_registers dreg:8; u8 flags; }; -static void nft_exthdr_eval(const struct nft_expr *expr, - struct nft_regs *regs, - const struct nft_pktinfo *pkt) +static unsigned int optlen(const u8 *opt, unsigned int offset) +{ + /* Beware zero-length options: make finite progress */ + if (opt[offset] <= TCPOPT_NOP || opt[offset + 1] == 0) + return 1; + else + return opt[offset + 1]; +} + +static void nft_exthdr_ipv6_eval(const struct nft_expr *expr, +struct nft_regs *regs, +const struct nft_pktinfo *pkt) { struct nft_exthdr *priv = nft_expr_priv(expr); u32 *dest = ®s->data[priv->dreg]; @@ -52,6 +61,53 @@ static void nft_exthdr_eval(const struct nft_expr *expr, regs->verdict.code = NFT_BREAK; } +static void nft_exthdr_tcp_eval(const struct nft_expr *expr, + struct nft_regs *regs, + const struct nft_pktinfo *pkt) +{ + u8 buff[sizeof(struct tcphdr) + MAX_TCP_OPTION_SPACE]; + struct nft_exthdr *priv = nft_expr_priv(expr); + unsigned int i, optl, tcphdr_len, offset; + u32 *dest = ®s->data[priv->dreg]; + struct tcphdr *tcph; + u8 *opt; + + if (!pkt->tprot_set || pkt->tprot != IPPROTO_TCP) + goto err; + + tcph = skb_header_pointer(pkt->skb, pkt->xt.thoff, sizeof(*tcph), buff); + if (!tcph) + goto err; + + tcphdr_len = __tcp_hdrlen(tcph); + if (tcphdr_len < sizeof(*tcph)) + goto err; + + tcph = skb_header_pointer(pkt->skb, pkt->xt.thoff, tcphdr_len, buff); + if 
(!tcph) + goto err; + + opt = (u8 *)tcph; + for (i = sizeof(*tcph); i < tcphdr_len - 1; i += optl) { + optl = optlen(opt, i); + + if (priv->type != opt[i])
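
The option walk above depends on optlen() never returning zero, so that the loop always advances. A minimal userspace sketch of the same idea follows, assuming a buffer that holds only the TCP options (no fixed header). The names tcp_optlen() and find_tcp_option() are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TCPOPT_NOP 1	/* one-byte padding option; kind 0 is end-of-list */

/*
 * Size of the option starting at opt[offset]. Never returns 0, so a loop
 * that advances by this value always makes progress. The caller must
 * guarantee that opt[offset + 1] is readable.
 */
static unsigned int tcp_optlen(const uint8_t *opt, unsigned int offset)
{
	if (opt[offset] <= TCPOPT_NOP || opt[offset + 1] == 0)
		return 1;
	return opt[offset + 1];
}

/* Offset of the first option of the given kind, or -1 if absent/truncated. */
static int find_tcp_option(const uint8_t *opt, unsigned int len, uint8_t kind)
{
	unsigned int i, optl;

	assert(opt != NULL);
	for (i = 0; i + 1 < len; i += optl) {
		optl = tcp_optlen(opt, i);
		if (opt[i] != kind)
			continue;
		if (i + optl > len)
			return -1;	/* option claims more bytes than remain */
		return (int)i;
	}
	return -1;
}
```

With the options NOP, MSS (kind 2, length 4), NOP, window scale (kind 3, length 3), a search for kind 2 stops at offset 1. A malformed buffer with zero-length options is still walked one byte at a time and terminates.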
[PATCH 19/21] netfilter: nf_tables: add NFTA_RULE_ID attribute
This new attribute allows us to uniquely identify a rule in a transaction.
Robots may trigger an insertion followed by deletion in a batch; in that
scenario we still don't have a public rule handle that we can use to
delete the rule. This is similar to the NFTA_SET_ID attribute that allows
us to refer to an anonymous set from a batch.

Signed-off-by: Pablo Neira Ayuso
---
 include/net/netfilter/nf_tables.h        |  3 +++
 include/uapi/linux/netfilter/nf_tables.h |  2 ++
 net/netfilter/nf_tables_api.c            | 26 ++++++++++++++++++++++++++
 3 files changed, 31 insertions(+)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 21ce50e6d0c5..ac84686aaafb 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -1202,10 +1202,13 @@ struct nft_trans {
 
 struct nft_trans_rule {
 	struct nft_rule			*rule;
+	u32				rule_id;
 };
 
 #define nft_trans_rule(trans)	\
 	(((struct nft_trans_rule *)trans->data)->rule)
+#define nft_trans_rule_id(trans)	\
+	(((struct nft_trans_rule *)trans->data)->rule_id)
 
 struct nft_trans_set {
 	struct nft_set			*set;
diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
index 207951516ede..05215d30fe5c 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -207,6 +207,7 @@ enum nft_chain_attributes {
  * @NFTA_RULE_COMPAT: compatibility specifications of the rule (NLA_NESTED: nft_rule_compat_attributes)
  * @NFTA_RULE_POSITION: numeric handle of the previous rule (NLA_U64)
  * @NFTA_RULE_USERDATA: user data (NLA_BINARY, NFT_USERDATA_MAXLEN)
+ * @NFTA_RULE_ID: uniquely identifies a rule in a transaction (NLA_U32)
  */
 enum nft_rule_attributes {
 	NFTA_RULE_UNSPEC,
@@ -218,6 +219,7 @@ enum nft_rule_attributes {
 	NFTA_RULE_POSITION,
 	NFTA_RULE_USERDATA,
 	NFTA_RULE_PAD,
+	NFTA_RULE_ID,
 	__NFTA_RULE_MAX
 };
 #define NFTA_RULE_MAX		(__NFTA_RULE_MAX - 1)
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 71c60a04b66b..6c782532615f 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -240,6 +240,10 @@ static struct nft_trans *nft_trans_rule_add(struct nft_ctx *ctx, int msg_type,
 	if (trans == NULL)
 		return NULL;
 
+	if (msg_type == NFT_MSG_NEWRULE && ctx->nla[NFTA_RULE_ID] != NULL) {
+		nft_trans_rule_id(trans) =
+			ntohl(nla_get_be32(ctx->nla[NFTA_RULE_ID]));
+	}
 	nft_trans_rule(trans) = rule;
 	list_add_tail(&trans->list, &ctx->net->nft.commit_list);
 
@@ -2293,6 +2297,22 @@ static int nf_tables_newrule(struct net *net, struct sock *nlsk,
 	return err;
 }
 
+static struct nft_rule *nft_rule_lookup_byid(const struct net *net,
+					     const struct nlattr *nla)
+{
+	u32 id = ntohl(nla_get_be32(nla));
+	struct nft_trans *trans;
+
+	list_for_each_entry(trans, &net->nft.commit_list, list) {
+		struct nft_rule *rule = nft_trans_rule(trans);
+
+		if (trans->msg_type == NFT_MSG_NEWRULE &&
+		    id == nft_trans_rule_id(trans))
+			return rule;
+	}
+	return ERR_PTR(-ENOENT);
+}
+
 static int nf_tables_delrule(struct net *net, struct sock *nlsk,
 			     struct sk_buff *skb, const struct nlmsghdr *nlh,
 			     const struct nlattr * const nla[])
@@ -2331,6 +2351,12 @@ static int nf_tables_delrule(struct net *net, struct sock *nlsk,
 				return PTR_ERR(rule);
 
 			err = nft_delrule(&ctx, rule);
+		} else if (nla[NFTA_RULE_ID]) {
+			rule = nft_rule_lookup_byid(net, nla[NFTA_RULE_ID]);
+			if (IS_ERR(rule))
+				return PTR_ERR(rule);
+
+			err = nft_delrule(&ctx, rule);
 		} else {
 			err = nft_delrule_by_chain(&ctx);
 		}
--
2.1.4
[PATCH 20/21] netfilter: update MAINTAINERS
It's been a while since Patrick has been suspended as coreteam member [1].
Update this file to remove him. While at this, remove references to all
foo-tables variants, given the project hosts more than just that, eg.
ipset, conntrack, ...

[1] https://marc.info/?l=netfilter-devel&m=146887464512702

Signed-off-by: Pablo Neira Ayuso
---
 MAINTAINERS | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index a9368bba9b37..5864bbd99f8f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8579,9 +8579,8 @@ F:	Documentation/networking/s2io.txt
 F:	Documentation/networking/vxge.txt
 F:	drivers/net/ethernet/neterion/
 
-NETFILTER ({IP,IP6,ARP,EB,NF}TABLES)
+NETFILTER
 M:	Pablo Neira Ayuso
-M:	Patrick McHardy
 M:	Jozsef Kadlecsik
 L:	netfilter-de...@vger.kernel.org
 L:	coret...@netfilter.org
--
2.1.4
[PATCH 16/21] netfilter: nfnetlink: add nfnetlink_rcv_skb_batch()
Add new nfnetlink_rcv_skb_batch() to wrap initial nfnetlink batch handling. Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nfnetlink.c | 51 ++- 1 file changed, 28 insertions(+), 23 deletions(-) diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c index 586212ebba9e..ca645a3b1375 100644 --- a/net/netfilter/nfnetlink.c +++ b/net/netfilter/nfnetlink.c @@ -436,12 +436,35 @@ static void nfnetlink_rcv_batch(struct sk_buff *skb, struct nlmsghdr *nlh, kfree_skb(skb); } -static void nfnetlink_rcv(struct sk_buff *skb) +static void nfnetlink_rcv_skb_batch(struct sk_buff *skb, struct nlmsghdr *nlh) { - struct nlmsghdr *nlh = nlmsg_hdr(skb); + struct nfgenmsg *nfgenmsg; u16 res_id; int msglen; + msglen = NLMSG_ALIGN(nlh->nlmsg_len); + if (msglen > skb->len) + msglen = skb->len; + + if (nlh->nlmsg_len < NLMSG_HDRLEN || + skb->len < NLMSG_HDRLEN + sizeof(struct nfgenmsg)) + return; + + nfgenmsg = nlmsg_data(nlh); + skb_pull(skb, msglen); + /* Work around old nft using host byte order */ + if (nfgenmsg->res_id == NFNL_SUBSYS_NFTABLES) + res_id = NFNL_SUBSYS_NFTABLES; + else + res_id = ntohs(nfgenmsg->res_id); + + nfnetlink_rcv_batch(skb, nlh, res_id); +} + +static void nfnetlink_rcv(struct sk_buff *skb) +{ + struct nlmsghdr *nlh = nlmsg_hdr(skb); + if (nlh->nlmsg_len < NLMSG_HDRLEN || skb->len < nlh->nlmsg_len) return; @@ -451,28 +474,10 @@ static void nfnetlink_rcv(struct sk_buff *skb) return; } - if (nlh->nlmsg_type == NFNL_MSG_BATCH_BEGIN) { - struct nfgenmsg *nfgenmsg; - - msglen = NLMSG_ALIGN(nlh->nlmsg_len); - if (msglen > skb->len) - msglen = skb->len; - - if (nlh->nlmsg_len < NLMSG_HDRLEN || - skb->len < NLMSG_HDRLEN + sizeof(struct nfgenmsg)) - return; - - nfgenmsg = nlmsg_data(nlh); - skb_pull(skb, msglen); - /* Work around old nft using host byte order */ - if (nfgenmsg->res_id == NFNL_SUBSYS_NFTABLES) - res_id = NFNL_SUBSYS_NFTABLES; - else - res_id = ntohs(nfgenmsg->res_id); - nfnetlink_rcv_batch(skb, nlh, res_id); - } else { + if (nlh->nlmsg_type == 
NFNL_MSG_BATCH_BEGIN) + nfnetlink_rcv_skb_batch(skb, nlh); + else netlink_rcv_skb(skb, &nfnetlink_rcv_msg); - } } #ifdef CONFIG_MODULES -- 2.1.4
[PATCH 11/21] netfilter: nft_ct: add zone id set support
From: Florian Westphal zones allow tracking multiple connections sharing identical tuples, this is needed e.g. when tracking distinct vlans with overlapping ip addresses (conntrack is l2 agnostic). Thus the zone has to be set before the packet is picked up by the connection tracker. This is done by means of 'conntrack templates' which are conntrack structures used solely to pass this info from one netfilter hook to the next. The iptables CT target instantiates these connection tracking templates once per rule, i.e. the template is fixed/tied to particular zone, can be read-only and therefore be re-used by as many skbs simultaneously as needed. We can't follow this model because we want to take the zone id from an sreg at rule eval time so we could e.g. fill in the zone id from the packets vlan id or a e.g. nftables key : value maps. To avoid cost of per packet alloc/free of the template, use a percpu template 'scratch' object and use the refcount to detect the (unlikely) case where the template is still attached to another skb (i.e., previous skb was nfqueued ...). 
Signed-off-by: Florian Westphal Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nft_ct.c | 144 - 1 file changed, 143 insertions(+), 1 deletion(-) diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c index 2d82df2737da..c6b8022c0e47 100644 --- a/net/netfilter/nft_ct.c +++ b/net/netfilter/nft_ct.c @@ -32,6 +32,11 @@ struct nft_ct { }; }; +#ifdef CONFIG_NF_CONNTRACK_ZONES +static DEFINE_PER_CPU(struct nf_conn *, nft_ct_pcpu_template); +static unsigned int nft_ct_pcpu_template_refcnt __read_mostly; +#endif + static u64 nft_ct_get_eval_counter(const struct nf_conn_counter *c, enum nft_ct_keys k, enum ip_conntrack_dir d) @@ -191,6 +196,53 @@ static void nft_ct_get_eval(const struct nft_expr *expr, regs->verdict.code = NFT_BREAK; } +#ifdef CONFIG_NF_CONNTRACK_ZONES +static void nft_ct_set_zone_eval(const struct nft_expr *expr, +struct nft_regs *regs, +const struct nft_pktinfo *pkt) +{ + struct nf_conntrack_zone zone = { .dir = NF_CT_DEFAULT_ZONE_DIR }; + const struct nft_ct *priv = nft_expr_priv(expr); + struct sk_buff *skb = pkt->skb; + enum ip_conntrack_info ctinfo; + u16 value = regs->data[priv->sreg]; + struct nf_conn *ct; + + ct = nf_ct_get(skb, &ctinfo); + if (ct) /* already tracked */ + return; + + zone.id = value; + + switch (priv->dir) { + case IP_CT_DIR_ORIGINAL: + zone.dir = NF_CT_ZONE_DIR_ORIG; + break; + case IP_CT_DIR_REPLY: + zone.dir = NF_CT_ZONE_DIR_REPL; + break; + default: + break; + } + + ct = this_cpu_read(nft_ct_pcpu_template); + + if (likely(atomic_read(&ct->ct_general.use) == 1)) { + nf_ct_zone_add(ct, &zone); + } else { + /* previous skb got queued to userspace */ + ct = nf_ct_tmpl_alloc(nft_net(pkt), &zone, GFP_ATOMIC); + if (!ct) { + regs->verdict.code = NF_DROP; + return; + } + } + + atomic_inc(&ct->ct_general.use); + nf_ct_set(skb, ct, IP_CT_NEW); +} +#endif + static void nft_ct_set_eval(const struct nft_expr *expr, struct nft_regs *regs, const struct nft_pktinfo *pkt) @@ -269,6 +321,45 @@ static void nft_ct_netns_put(struct net 
*net, uint8_t family) nf_ct_netns_put(net, family); } +#ifdef CONFIG_NF_CONNTRACK_ZONES +static void nft_ct_tmpl_put_pcpu(void) +{ + struct nf_conn *ct; + int cpu; + + for_each_possible_cpu(cpu) { + ct = per_cpu(nft_ct_pcpu_template, cpu); + if (!ct) + break; + nf_ct_put(ct); + per_cpu(nft_ct_pcpu_template, cpu) = NULL; + } +} + +static bool nft_ct_tmpl_alloc_pcpu(void) +{ + struct nf_conntrack_zone zone = { .id = 0 }; + struct nf_conn *tmp; + int cpu; + + if (nft_ct_pcpu_template_refcnt) + return true; + + for_each_possible_cpu(cpu) { + tmp = nf_ct_tmpl_alloc(&init_net, &zone, GFP_KERNEL); + if (!tmp) { + nft_ct_tmpl_put_pcpu(); + return false; + } + + atomic_set(&tmp->ct_general.use, 1); + per_cpu(nft_ct_pcpu_template, cpu) = tmp; + } + + return true; +} +#endif + static int nft_ct_get_init(const struct nft_ctx *ctx, const struct nft_expr *expr, const struct nlattr * const tb[]) @@ -394,6 +485,11 @@ static void __nft_ct_set_destroy(const struct nft_ctx *ctx, struc
[PATCH 18/21] netfilter: nf_tables: add check_genid to the nfnetlink subsystem
This patch implements the generation ID check as provided by nfnetlink.
This allows us to reject ruleset updates against a stale baseline, so
userspace can retry the update with a fresh ruleset cache.

Signed-off-by: Pablo Neira Ayuso
---
 net/netfilter/nf_tables_api.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index cb6ae46f6c48..71c60a04b66b 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -4972,6 +4972,11 @@ static int nf_tables_abort(struct net *net, struct sk_buff *skb)
 	return 0;
 }
 
+static bool nf_tables_valid_genid(struct net *net, u32 genid)
+{
+	return net->nft.base_seq == genid;
+}
+
 static const struct nfnetlink_subsystem nf_tables_subsys = {
 	.name		= "nf_tables",
 	.subsys_id	= NFNL_SUBSYS_NFTABLES,
@@ -4979,6 +4984,7 @@ static const struct nfnetlink_subsystem nf_tables_subsys = {
 	.cb		= nf_tables_cb,
 	.commit		= nf_tables_commit,
 	.abort		= nf_tables_abort,
+	.valid_genid	= nf_tables_valid_genid,
 };
 
 int nft_chain_validate_dependency(const struct nft_chain *chain,
--
2.1.4
[PATCH 21/21] netfilter: nf_tables: honor NFT_SET_OBJECT in set backend selection
Check for NFT_SET_OBJECT feature flag, otherwise we may end up selecting
the wrong set backend.

Signed-off-by: Pablo Neira Ayuso
---
 net/netfilter/nf_tables_api.c  | 3 ++-
 net/netfilter/nft_set_hash.c   | 2 +-
 net/netfilter/nft_set_rbtree.c | 2 +-
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 6c782532615f..ff7304ae58ac 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -2424,7 +2424,8 @@ nft_select_set_ops(const struct nlattr * const nla[],
 	features = 0;
 	if (nla[NFTA_SET_FLAGS] != NULL) {
 		features = ntohl(nla_get_be32(nla[NFTA_SET_FLAGS]));
-		features &= NFT_SET_INTERVAL | NFT_SET_MAP | NFT_SET_TIMEOUT;
+		features &= NFT_SET_INTERVAL | NFT_SET_MAP | NFT_SET_TIMEOUT |
+			    NFT_SET_OBJECT;
 	}
 
 	bops = NULL;
diff --git a/net/netfilter/nft_set_hash.c b/net/netfilter/nft_set_hash.c
index 6938bc890f31..5f652720fc78 100644
--- a/net/netfilter/nft_set_hash.c
+++ b/net/netfilter/nft_set_hash.c
@@ -404,7 +404,7 @@ static struct nft_set_ops nft_hash_ops __read_mostly = {
 	.lookup		= nft_hash_lookup,
 	.update		= nft_hash_update,
 	.walk		= nft_hash_walk,
-	.features	= NFT_SET_MAP | NFT_SET_TIMEOUT,
+	.features	= NFT_SET_MAP | NFT_SET_OBJECT | NFT_SET_TIMEOUT,
 	.owner		= THIS_MODULE,
 };
 
diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
index 3387ed7dd231..71e8fb886a73 100644
--- a/net/netfilter/nft_set_rbtree.c
+++ b/net/netfilter/nft_set_rbtree.c
@@ -310,7 +310,7 @@ static struct nft_set_ops nft_rbtree_ops __read_mostly = {
 	.activate	= nft_rbtree_activate,
 	.lookup		= nft_rbtree_lookup,
 	.walk		= nft_rbtree_walk,
-	.features	= NFT_SET_INTERVAL | NFT_SET_MAP,
+	.features	= NFT_SET_INTERVAL | NFT_SET_MAP | NFT_SET_OBJECT,
 	.owner		= THIS_MODULE,
 };
 
--
2.1.4
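
To illustrate why the mask matters, here is a toy model of feature-based backend selection. It assumes a backend is eligible only if it advertises every requested feature that survives the mask. The flag values, struct backend and select_backend() are local to this sketch and hypothetical, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values local to this sketch, not the uapi ones. */
enum {
	SET_INTERVAL	= 1u << 0,
	SET_MAP		= 1u << 1,
	SET_TIMEOUT	= 1u << 2,
	SET_OBJECT	= 1u << 3,
};

struct backend {
	const char *name;
	uint32_t features;
};

/*
 * Return the first backend supporting all requested features that survive
 * honored_mask. A feature missing from the mask is silently dropped, so it
 * cannot influence the choice.
 */
static const struct backend *select_backend(const struct backend *b, size_t n,
					    uint32_t requested,
					    uint32_t honored_mask)
{
	uint32_t features = requested & honored_mask;

	assert(b != NULL);
	for (size_t i = 0; i < n; i++) {
		if ((b[i].features & features) == features)
			return &b[i];
	}
	return NULL;
}
```

With a mask that omits SET_OBJECT, a request for an object map degrades to "no features" and the first registered backend wins even if it cannot hold objects. With SET_OBJECT in the mask, only backends that advertise it are considered.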
[PATCH 17/21] netfilter: nfnetlink: allow to check for generation ID
This patch allows userspace to specify the generation ID that has been used to build an incremental batch update. If userspace specifies the generation ID in the batch message as attribute, then nfnetlink compares it to the current generation ID so you make sure that you work against the right baseline. Otherwise, bail out with ERESTART so userspace knows that its changeset is stale and needs to respin. Userspace can do this transparently at the cost of taking slightly more time to refresh caches and rework the changeset. This check is optional, if there is no NFNL_BATCH_GENID attribute in the batch begin message, then no check is performed. Signed-off-by: Pablo Neira Ayuso --- include/linux/netfilter/nfnetlink.h | 1 + include/uapi/linux/netfilter/nfnetlink.h | 12 net/netfilter/nfnetlink.c| 31 +++ 3 files changed, 40 insertions(+), 4 deletions(-) diff --git a/include/linux/netfilter/nfnetlink.h b/include/linux/netfilter/nfnetlink.h index 1d82dd5e9a08..1b49209dd5c7 100644 --- a/include/linux/netfilter/nfnetlink.h +++ b/include/linux/netfilter/nfnetlink.h @@ -28,6 +28,7 @@ struct nfnetlink_subsystem { const struct nfnl_callback *cb; /* callback for individual types */ int (*commit)(struct net *net, struct sk_buff *skb); int (*abort)(struct net *net, struct sk_buff *skb); + bool (*valid_genid)(struct net *net, u32 genid); }; int nfnetlink_subsys_register(const struct nfnetlink_subsystem *n); diff --git a/include/uapi/linux/netfilter/nfnetlink.h b/include/uapi/linux/netfilter/nfnetlink.h index 4bb8cb7730e7..a09906a30d77 100644 --- a/include/uapi/linux/netfilter/nfnetlink.h +++ b/include/uapi/linux/netfilter/nfnetlink.h @@ -65,4 +65,16 @@ struct nfgenmsg { #define NFNL_MSG_BATCH_BEGIN NLMSG_MIN_TYPE #define NFNL_MSG_BATCH_END NLMSG_MIN_TYPE+1 +/** + * enum nfnl_batch_attributes - nfnetlink batch netlink attributes + * + * @NFNL_BATCH_GENID: generation ID for this changeset (NLA_U32) + */ +enum nfnl_batch_attributes { +NFNL_BATCH_UNSPEC, +NFNL_BATCH_GENID, 
+__NFNL_BATCH_MAX +}; +#define NFNL_BATCH_MAX (__NFNL_BATCH_MAX - 1) + #endif /* _UAPI_NFNETLINK_H */ diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c index ca645a3b1375..a2148d0bc50e 100644 --- a/net/netfilter/nfnetlink.c +++ b/net/netfilter/nfnetlink.c @@ -3,7 +3,7 @@ * * (C) 2001 by Jay Schulist , * (C) 2002-2005 by Harald Welte - * (C) 2005,2007 by Pablo Neira Ayuso + * (C) 2005-2017 by Pablo Neira Ayuso * * Initial netfilter messages via netlink development funded and * generally made possible by Network Robots, Inc. (www.networkrobots.com) @@ -273,7 +273,7 @@ enum { }; static void nfnetlink_rcv_batch(struct sk_buff *skb, struct nlmsghdr *nlh, - u16 subsys_id) + u16 subsys_id, u32 genid) { struct sk_buff *oskb = skb; struct net *net = sock_net(skb->sk); @@ -315,6 +315,12 @@ static void nfnetlink_rcv_batch(struct sk_buff *skb, struct nlmsghdr *nlh, return kfree_skb(skb); } + if (genid && ss->valid_genid && !ss->valid_genid(net, genid)) { + nfnl_unlock(subsys_id); + netlink_ack(oskb, nlh, -ERESTART); + return kfree_skb(skb); + } + while (skb->len >= nlmsg_total_size(0)) { int msglen, type; @@ -436,11 +442,20 @@ static void nfnetlink_rcv_batch(struct sk_buff *skb, struct nlmsghdr *nlh, kfree_skb(skb); } +static const struct nla_policy nfnl_batch_policy[NFNL_BATCH_MAX + 1] = { + [NFNL_BATCH_GENID] = { .type = NLA_U32 }, +}; + static void nfnetlink_rcv_skb_batch(struct sk_buff *skb, struct nlmsghdr *nlh) { + int min_len = nlmsg_total_size(sizeof(struct nfgenmsg)); + struct nlattr *attr = (void *)nlh + min_len; + struct nlattr *cda[NFNL_BATCH_MAX + 1]; + int attrlen = nlh->nlmsg_len - min_len; struct nfgenmsg *nfgenmsg; + int msglen, err; + u32 gen_id = 0; u16 res_id; - int msglen; msglen = NLMSG_ALIGN(nlh->nlmsg_len); if (msglen > skb->len) @@ -450,6 +465,14 @@ static void nfnetlink_rcv_skb_batch(struct sk_buff *skb, struct nlmsghdr *nlh) skb->len < NLMSG_HDRLEN + sizeof(struct nfgenmsg)) return; + err = nla_parse(cda, NFNL_BATCH_MAX, attr, 
attrlen, nfnl_batch_policy); + if (err < 0) { + netlink_ack(skb, nlh, err); + return; + } + if (cda[NFNL_BATCH_GENID]) + gen_id = ntohl(nla_get_be32(cda[NFNL_BATCH_GENID])); + nfgenmsg = nlmsg_data(nlh); skb_pull(skb, msglen); /* Work around old nft using host byte order */ @@ -458,7 +481,7 @@ static void nfnetlink_rcv_skb_batch(struct sk_buff *skb, struct nlmsghdr *nlh) else res_id = ntohs(nf
Re: Fw: [Bug 193911] New: net_prio.ifpriomap is not aware of the network namespace, and discloses all network interface
Tejun Heo writes:

> Hello,
>
> On Sun, Feb 05, 2017 at 11:05:36PM -0800, Cong Wang wrote:
>> > To be more specific, the read operation of net_prio.ifpriomap is handled
>> > by the function read_priomap. Tracing from this function, we can find it
>> > invokes for_each_netdev_rcu and set the first parameter as the address of
>> > init_net. It iterates all network devices of the host regardless of the
>> > network namespace. Thus, from the view of a container, it can read the
>> > names of all network devices of the host.
>>
>> I think that is probably because cgroup files don't provide a net pointer
>> for the context, if so we probably need some API similar to
>> class_create_file_ns().
>
> Yeah, the whole thing never considered netns or delegation. Maybe the
> read function itself should probably filter on the namespace of the
> reader? I'm not completely sure whether trying to fix it won't cause
> some of existing use cases to break. Eric, what do you think?

Apologies for the delay, I just made it back from vacation.

There are cases where we do look at the reader/opener of the file, and it is a pain; almost always the best policy is to have the context fixed at mount time. I don't see an obvious answer of what better semantics for this file should be.

Perhaps Docker can mount over this file on older kernels?

The namespace primitives that people build containers out of were never guaranteed not to leak the fact that you are in a container. So a small, essentially harmless information leak is not something I panic about. It is the setting up of the container itself that must know what the primitives do to ensure that leaks don't happen, if you want to avoid leaks.

That said, if this controller/file does not consider netns and delegation, I suspect the right thing to do is put it under CONFIG_BROKEN or possibly CONFIG_I_REALLY_NEED_THIS_SILLY_CODE_FOR_BACKWARDS_COMPATIBILITY aka CONFIG_STAGING and let the code age out of the kernel there.

If someone actually cares about this code, wants to fix it to do something reasonable, and is willing to dig through all of the subtleties, I can help with that. I may be wrong, but the code feels like something that just isn't interesting enough to make it worth fixing.

Eric
Re: [PATCH net-next v4 1/2] qed: Add infrastructure for PTP support.
On Sun, Feb 12, 2017 at 11:52:23AM +, Mintz, Yuval wrote:
> Just to clarify [since it's bit a meaningless otherwise] -
> this +8 is a HW-bug workaround.

Can you please explain exactly what the problem is?

Your code does

	period1 = div_s64(val * 10, ppb);
	period1 -= 8;
	period1 >>= 4;

But correct rounding would be

	period1 = div_s64(val * 10, ppb);
	period1 += 8;
	period1 >>= 4;

Thanks,
Richard
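
The arithmetic behind the rounding remark can be checked in isolation. The sketch below only illustrates what adding or subtracting 8 does before a right shift by 4 (a division by 16); it says nothing about what the hardware workaround actually requires. Both helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Divide a non-negative value by 16, rounding to the nearest integer. */
static int64_t shift4_round_nearest(int64_t v)
{
	assert(v >= 0);
	return (v + 8) >> 4;
}

/* The variant that subtracts 8 first; it lands below the true quotient. */
static int64_t shift4_minus8(int64_t v)
{
	assert(v >= 8);
	return (v - 8) >> 4;
}
```

For an input of 40 (40 / 16 = 2.5), the first helper yields 3 and the second yields 2. For 39, they yield 2 and 1 respectively.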
Re: [PATCH v2 net-next 00/14] mlx4: order-0 allocations and page recycling
On 12/02/2017 5:32 PM, Eric Dumazet wrote:

On Sun, Feb 12, 2017 at 7:04 AM, Tariq Toukan wrote:

We consistently see this behavior: the higher the BW, the sharper the degradation. This is because the page-cache is of a fixed-size. Any fixed-size page-cache will always meet one of the following:
1) Too small to keep the pace when load is high.
2) Too big (in terms of memory footprint) when load is low.

So, we had the order-0 allocations for years at Google, then made the horrible mistake to rebase mlx4 driver from the upstream one, and we had all these issues under load. I decided to redo the work I did years ago and upstream it.

Thanks for that. I really appreciate and like your re-factorization.

I have warned Mellanox in the past (for cx-5 driver) that _any_ high order allocation strategy was nice in benchmarks, but terrible in face of real server workloads. ( And I am not even referring to malicious attacks )

In mlx5, we fully completed the transition to order-0 allocations in Striding RQ.

Think about what happens on real servers : In the order of 100,000 TCP sockets opened. Then some incast or outcast problem (Mapreduce jobs are fond of this) make thousands of TCP socket accumulate _millions_ of TCP messages in their out of order queue per second. There is no way you can hold millions of pages in mlx4 driver. A "dynamic" page pool is going to fail very badly.

I understand your point. Today I am totally aware of the advantages in using order-0 pages, I am just trying to have the bread buttered on both sides, by reducing the allocation overhead. Even though the iperf benchmarks are less realistic than the ones you described, I think it is still nice if we could find solutions for the page allocator in order to keep the high rates we had before. As a common bottleneck, we will always gain by improving the page allocator, no matter what is the pages order.

Just two points regarding the dynamic page-cache I implemented:
1) We define an upper limit for the size of the dynamic page-cache, so the metadata do not grow too much.
2) When load is high, our dynamic page-cache _does not exclusively hold too many pages_, it just keeps track of pages that are being anyway processed in stack. In memory footprints accounting, I would not account such page into the "driver's footprint", as it is being used by the stack.

Sure, your iperf bench will look great. But who cares ? Do you really have customers dedicating hosts to run 1 iperf full time ? Make sure you run tests with 100,000 TCP sockets, and add networking small flaps, with 5% packet losses. This is what we really care here.

I definitely agree that benchmarks should improve to reflect more realistic use cases.

I will send the v3 of the patch series, I really hope that it will go in, because we at Google very much need it ASAP, and I would rather not have to keep it private in our tree. Do not focus on your benchmarks, that is marketing only. Focus on ability of the servers to _survive_ and continue their work.

You did not answer to my questions by the way.

ethtool -g eth0
ethtool -l eth0

Yes, sorry for the delayed reply, it was sent separately.

Thanks.
Re: net/llc: BUG in llc_sap_state_process/skb_set_owner_r
On Sun, Feb 12, 2017 at 8:44 AM, Andrey Konovalov wrote:
> Hi,
>
> I've got the following error report while fuzzing the kernel with syzkaller.
>
> On commit 926af6273fc683cd98cd0ce7bf0d04a02eed6742.
>
> A reproducer and .config are attached

Thanks for the report.

llc sets skb->sk without corresponding skb->destructor.

This is considered invalid by our current standards.

As I added the sanity check in skb_destructor() back in linux-3.12 (!!!), I will send the corresponding LLC fix.

( commit 376c7311bdb6efea3322310333576a04d73fbe4c )
Re: net/ipv6: use-after-free in sock_wfree
On Mon, Jan 9, 2017 at 6:21 PM, Eric Dumazet wrote: > On Mon, Jan 9, 2017 at 9:11 AM, Andrey Konovalov > wrote: >> On Mon, Jan 9, 2017 at 6:08 PM, Andrey Konovalov >> wrote: >>> Hi! >>> >>> I've got the following error report while running the syzkaller fuzzer. >>> >>> On commit a121103c922847ba5010819a3f250f1f7fc84ab8 (4.10-rc3). >>> >>> A reproducer is attached. >>> >>> == >>> BUG: KASAN: use-after-free in sock_wfree+0x118/0x120 >>> Read of size 8 at addr 880062da0060 by task a.out/4140 >>> >>> page:ea00018b6800 count:1 mapcount:0 mapping: (null) >>> index:0x0 compound_mapcount: 0 >>> flags: 0x1008100(slab|head) >>> raw: 01008100 000180130013 >>> raw: dead0100 dead0200 88006741f140 >>> page dumped because: kasan: bad access detected >>> >>> CPU: 0 PID: 4140 Comm: a.out Not tainted 4.10.0-rc3+ #59 >>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 >>> Call Trace: >>> __dump_stack lib/dump_stack.c:15 >>> dump_stack+0x292/0x398 lib/dump_stack.c:51 >>> describe_address mm/kasan/report.c:262 >>> kasan_report_error+0x121/0x560 mm/kasan/report.c:370 >>> kasan_report mm/kasan/report.c:392 >>> __asan_report_load8_noabort+0x3e/0x40 mm/kasan/report.c:413 >>> sock_flag ./arch/x86/include/asm/bitops.h:324 >>> sock_wfree+0x118/0x120 net/core/sock.c:1631 >>> skb_release_head_state+0xfc/0x250 net/core/skbuff.c:655 >>> skb_release_all+0x15/0x60 net/core/skbuff.c:668 >>> __kfree_skb+0x15/0x20 net/core/skbuff.c:684 >>> kfree_skb+0x16e/0x4e0 net/core/skbuff.c:705 >>> inet_frag_destroy+0x121/0x290 net/ipv4/inet_fragment.c:304 >>> inet_frag_put ./include/net/inet_frag.h:133 >>> nf_ct_frag6_gather+0x1125/0x38b0 >>> net/ipv6/netfilter/nf_conntrack_reasm.c:617 >>> ipv6_defrag+0x21b/0x350 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c:68 >>> nf_hook_entry_hookfn ./include/linux/netfilter.h:102 >>> nf_hook_slow+0xc3/0x290 net/netfilter/core.c:310 >>> nf_hook ./include/linux/netfilter.h:212 >>> __ip6_local_out+0x52c/0xaf0 net/ipv6/output_core.c:160 >>> 
ip6_local_out+0x2d/0x170 net/ipv6/output_core.c:170 >>> ip6_send_skb+0xa1/0x340 net/ipv6/ip6_output.c:1722 >>> ip6_push_pending_frames+0xb3/0xe0 net/ipv6/ip6_output.c:1742 >>> rawv6_push_pending_frames net/ipv6/raw.c:613 >>> rawv6_sendmsg+0x2cff/0x4130 net/ipv6/raw.c:927 >>> inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:744 >>> sock_sendmsg_nosec net/socket.c:635 >>> sock_sendmsg+0xca/0x110 net/socket.c:645 >>> sock_write_iter+0x326/0x620 net/socket.c:848 >>> new_sync_write fs/read_write.c:499 >>> __vfs_write+0x483/0x760 fs/read_write.c:512 >>> vfs_write+0x187/0x530 fs/read_write.c:560 >>> SYSC_write fs/read_write.c:607 >>> SyS_write+0xfb/0x230 fs/read_write.c:599 >>> entry_SYSCALL_64_fastpath+0x1f/0xc2 arch/x86/entry/entry_64.S:203 >>> RIP: 0033:0x7ff26e6f5b79 >>> RSP: 002b:7ff268e0ed98 EFLAGS: 0206 ORIG_RAX: 0001 >>> RAX: ffda RBX: 7ff268e0f9c0 RCX: 7ff26e6f5b79 >>> RDX: 0010 RSI: 20f50fe1 RDI: 0003 >>> RBP: 7ff26ebc1220 R08: R09: >>> R10: R11: 0206 R12: >>> R13: 7ff268e0f9c0 R14: 7ff26efec040 R15: 0003 >>> >>> The buggy address belongs to the object at 880062da >>> which belongs to the cache RAWv6 of size 1504 >>> The buggy address 880062da0060 is located 96 bytes inside >>> of 1504-byte region [880062da, 880062da05e0) >>> >>> Freed by task 4113: >>> save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57 >>> save_stack+0x43/0xd0 mm/kasan/kasan.c:502 >>> set_track mm/kasan/kasan.c:514 >>> kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:578 >>> slab_free_hook mm/slub.c:1352 >>> slab_free_freelist_hook mm/slub.c:1374 >>> slab_free mm/slub.c:2951 >>> kmem_cache_free+0xb2/0x2c0 mm/slub.c:2973 >>> sk_prot_free net/core/sock.c:1377 >>> __sk_destruct+0x49c/0x6e0 net/core/sock.c:1452 >>> sk_destruct+0x47/0x80 net/core/sock.c:1460 >>> __sk_free+0x57/0x230 net/core/sock.c:1468 >>> sk_free+0x23/0x30 net/core/sock.c:1479 >>> sock_put ./include/net/sock.h:1638 >>> sk_common_release+0x31e/0x4e0 net/core/sock.c:2782 >>> rawv6_close+0x54/0x80 net/ipv6/raw.c:1214 >>> 
inet_release+0xed/0x1c0 net/ipv4/af_inet.c:425 >>> inet6_release+0x50/0x70 net/ipv6/af_inet6.c:431 >>> sock_release+0x8d/0x1e0 net/socket.c:599 >>> sock_close+0x16/0x20 net/socket.c:1063 >>> __fput+0x332/0x7f0 fs/file_table.c:208 >>> fput+0x15/0x20 fs/file_table.c:244 >>> task_work_run+0x19b/0x270 kernel/task_work.c:116 >>> exit_task_work ./include/linux/task_work.h:21 >>> do_exit+0x186b/0x2800 kernel/exit.c:839 >>> do_group_exit+0x149/0x420 kernel/exit.c:943 >>> SYSC_exit_group kernel/exit.c:954 >>> SyS_exit_group+0x1d/0x20 kernel/exit.c:952 >>> entry_SYSCALL_64_fastpath+0x
Re: [PATCH net-next 0/4] net/sched: Use TC skip flags to reflect HW offload status
From: Or Gerlitz
Date: Sun, 12 Feb 2017 11:54:25 +0200

> Re the old kernel argument, these patches are small and pointish,
> would it make sense to you to consider them as fixes and push them
> back to the relevant stable kernels?

Sorry, it doesn't work that way.
Re: [PATCH v2 net-next 00/14] mlx4: order-0 allocations and page recycling
Tariq, please do not send HTML messages; they do not make it to the netdev mailing list.

On Sun, Feb 12, 2017 at 7:55 AM, Tariq Toukan wrote:
>
> On 09/02/2017 6:43 PM, Tariq Toukan wrote:
>
> We need to test this series again in our functional and performance
> regression systems.
> It will be running during the weekend, so we can analyze the results and
> update you on Sunday.
>
> Both setups running functional regression hanged, on two different issues.
> Both repros don't seem to be immediate, they do not simply happen by running
> the exact case that caused the hang, but by a series of cases.
> I'm analyzing the issue, looking for a minimal repro.
> For now, you can find the traces copied below.
>
> Regards,
> Tariq
>
>
> Setup 1: x86
>
> [ 8646.869516] [ cut here ]
> [ 8646.870970] WARNING: CPU: 4 PID: 0 at net/ipv4/af_inet.c:1498
> inet_gro_complete+0xa6/0xb0

So by the time inet_gro_complete() is called, iph->protocol has become mangled.

This does not make sense to me, my patch does not change skb->head allocations ...

> > > > Setup 2: PowerPC > > [10586.623028] Unable to handle kernel paging request for data at address > 0x80251f9001c > [10586.623072] Faulting instruction address: 0xc0236fa8 > [10586.623081] Oops: Kernel access of bad area, sig: 11 [#1] > [10586.623087] SMP NR_CPUS=2048 > [10586.623087] NUMA > [10586.623093] pSeries > [10586.623103] Modules linked in: rdma_ucm ib_ucm rdma_cm iw_cm ib_ipoib > ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core mlx4_en ptp pps_core mlx4_ib > ib_core mlx4_core devlink netconsole 8021q garp mrp stp llc nfsv3 nfs > fscache sg pseries_rng nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables > ext4 mbcache jbd2 sd_mod ibmvscsi ibmveth scsi_transport_srp [last unloaded: > devlink] > [10586.623137] CPU: 8 PID: 30175 Comm: ifconfig Not tainted > 4.10.0-rc6-eric_v2 #1 > [10586.623144] task: cb1e4480 task.stack: ca3cc000 > [10586.623151] NIP: c0236fa8 LR: d4f738c4 CTR: > c0236fa0 > [10586.623156] REGS: ca3cf360 TRAP: 0380 Not tainted > (4.10.0-rc6-eric_v2) > [10586.623162] MSR: 8280b032 > [10586.623167] CR: 28002048 XER: 2000 > [10586.623178] CFAR: d4f87ab0 SOFTE: 1 > [10586.623178] GPR00: d4f739d0 ca3cf5e0 c121da00 > 080251f9 > [10586.623178] GPR04: 0001 0002 > > [10586.623178] GPR08: c11a3218 c0026320 080251f9001c > d4f87a98 > [10586.623178] GPR12: c0236fa0 ce834800 3fffd7c08bcc > > [10586.623178] GPR16: 3fffd7c08bd8 3fffd7c08c18 > 3fffd7c08bd0 > [10586.623178] GPR20: c002b37f1438 c00275b5b400 c002b37f1438 > 0046 > [10586.623178] GPR24: 5deadbeef200 c002b37e0900 > d4fd0020 > [10586.623178] GPR28: c002b37f0900 > d4fd0020 > [10586.623223] NIP [c0236fa8] .__free_pages+0x8/0x50 > [10586.623236] LR [d4f738c4] > .mlx4_en_free_rx_desc.isra.21+0xd4/0x180 [mlx4_en] > [10586.623243] Call Trace: > [10586.623248] [ca3cf5e0] [c002b37ed770] 0xc002b37ed770 > (unreliable) > [10586.623260] [ca3cf690] [d4f739d0] > .mlx4_en_free_rx_buf+0x60/0x130 [mlx4_en] > [10586.623274] [ca3cf720] [d4f74658] > .mlx4_en_deactivate_rx_ring+0x128/0x180 [mlx4_en] > [10586.623286] 
[ca3cf7c0] [d4f815c4] > .mlx4_en_stop_port+0x614/0x950 [mlx4_en] > [10586.623297] [ca3cf8a0] [d4f81abc] > .mlx4_en_change_mtu+0x1bc/0x210 [mlx4_en] > [10586.623307] [ca3cf940] [c0736f50] > .dev_set_mtu+0x190/0x270 > [10586.623316] [ca3cf9e0] [c07644c8] .dev_ifsioc+0x348/0x3f0 > [10586.623323] [ca3cfa80] [c0764920] .dev_ioctl+0x3b0/0x880 > [10586.623331] [ca3cfb70] [c0712880] > .sock_do_ioctl+0x90/0xb0 > [10586.623337] [ca3cfc00] [c0713380] .sock_ioctl+0x2b0/0x390 > [10586.623345] [ca3cfca0] [c03059b4] > .do_vfs_ioctl+0xc4/0x8b0 > [10586.623352] [ca3cfd90] [c0306264] .SyS_ioctl+0xc4/0xe0 > [10586.623360] [ca3cfe30] [c000b184] system_call+0x38/0xe0 > [10586.623367] Instruction dump: > [10586.623372] fadf0028 7f1cd92a 4bfffe70 7f43d378 7fe4fb78 7fa5eb78 > 38c0 38e5 > [10586.623383] 4bffd689 4bfffe6c 7c0004ac 3943001c <7d005028> 3108 > 7d00512d 40c2fff4 > [10586.623397] ---[ end trace 97ff7bd173bea34a ]--- > [10586.623403] > [10588.623447] Kernel panic - not syncing: Fatal exception Yeah, changing MTU seems to be problematic because of the log_rx_info trick that you already mentioned. Can you tell me what was the old MTU and what is the new one ? Thanks
Re: net/llc: bug in llc_pdu_init_as_xid_cmd/skb_over_panic
On Fri, Feb 10, 2017 at 4:12 PM, Andrey Konovalov wrote: > Hi, > > I've got the following error report while fuzzing the kernel with syzkaller. > > On commit 926af6273fc683cd98cd0ce7bf0d04a02eed6742. > > A reproducer and .config are attached > > kernel BUG at net/core/skbuff.c:105! > invalid opcode: [#1] SMP KASAN > Dumping ftrace buffer: >(ftrace buffer empty) > Modules linked in: > CPU: 2 PID: 6558 Comm: syz-executor4 Not tainted 4.10.0-rc7+ #126 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > task: 88003c49c480 task.stack: 88003a5c > RIP: 0010:skb_panic+0x16f/0x200 net/core/skbuff.c:101 > RSP: 0018:88003a5c77d0 EFLAGS: 00010286 > RAX: 0082 RBX: 88006be991c0 RCX: > RDX: 0082 RSI: 814567fc RDI: ed00074b8eec > RBP: 88003a5c7838 R08: 0001 R09: > R10: 0002 R11: 0001 R12: 85231ee0 > R13: 834a6722 R14: 0003 R15: 88006c81c580 > FS: 7f89298c7700() GS:88006de0() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 20ee5000 CR3: 58697000 CR4: 06e0 > Call Trace: > skb_over_panic net/core/skbuff.c:110 [inline] > skb_put+0x18d/0x1d0 net/core/skbuff.c:1437 > llc_pdu_init_as_xid_cmd include/net/llc_pdu.h:377 [inline] > llc_sap_action_send_xid_c+0x2a2/0x3b0 net/llc/llc_s_ac.c:82 > llc_exec_sap_trans_actions net/llc/llc_sap.c:152 [inline] > llc_sap_next_state net/llc/llc_sap.c:181 [inline] > llc_sap_state_process+0x26b/0x4e0 net/llc/llc_sap.c:212 > llc_build_and_send_xid_pkt+0x19f/0x200 net/llc/llc_sap.c:276 > llc_ui_sendmsg+0xad9/0x1430 net/llc/af_llc.c:938 > sock_sendmsg_nosec net/socket.c:635 [inline] > sock_sendmsg+0xca/0x110 net/socket.c:645 > ___sys_sendmsg+0x9d2/0xae0 net/socket.c:1985 > __sys_sendmsg+0x138/0x320 net/socket.c:2019 > SYSC_sendmsg net/socket.c:2030 [inline] > SyS_sendmsg+0x2d/0x50 net/socket.c:2026 > entry_SYSCALL_64_fastpath+0x1f/0xc2 > RIP: 0033:0x4458b9 > RSP: 002b:7f89298c6b58 EFLAGS: 0286 ORIG_RAX: 002e > RAX: ffda RBX: 0005 RCX: 004458b9 > RDX: 00040085 RSI: 20001fc8 RDI: 0005 > RBP: 006e1ae0 R08: R09: > R10: R11: 0286 R12: 
00708000 > R13: R14: c0206434 R15: 201fcfe0 > Code: 00 00 00 48 89 54 24 10 48 c7 c7 60 19 23 85 48 89 74 24 08 4c > 89 04 24 4c 89 ea 4c 89 7c 24 18 45 89 f0 4c 89 e6 e8 1e c0 38 fe <0f> > 0b 4c 89 4d b8 4c 89 45 c0 48 89 75 c8 48 89 55 d0 e8 6a 5e > RIP: skb_panic+0x16f/0x200 net/core/skbuff.c:101 RSP: 88003a5c77d0 > ---[ end trace 89f0ca2ea5bc3ead ]--- > Kernel panic - not syncing: Fatal exception > Dumping ftrace buffer: >(ftrace buffer empty) > Kernel Offset: disabled > Rebooting in 86400 seconds.. +a...@redhat.com
net/llc: BUG in llc_sap_state_process/skb_set_owner_r
Hi, I've got the following error report while fuzzing the kernel with syzkaller. On commit 926af6273fc683cd98cd0ce7bf0d04a02eed6742. A reproducer and .config are attached kernel BUG at ./include/linux/skbuff.h:2389! invalid opcode: [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 0 PID: 9315 Comm: syz-executor2 Not tainted 4.10.0-rc7+ #126 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 task: 88006861c480 task.stack: 88006a988000 RIP: 0010:skb_set_owner_r include/linux/skbuff.h:2389 [inline] RIP: 0010:__sock_queue_rcv_skb+0x8c0/0xda0 net/core/sock.c:425 RSP: 0018:88003ec06b58 EFLAGS: 00010206 RAX: 88006861c480 RBX: 8800371c2568 RCX: RDX: 0100 RSI: 110006ba08ab RDI: 880035d04560 RBP: 88003ec06dc0 R08: 0002 R09: 0001 R10: R11: dc00 R12: 880035d04540 R13: 88003ec06d98 R14: 8800371c2590 R15: 880035d045a0 FS: 7fa8005ac700() GS:88003ec0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 004a6f68 CR3: 38e25000 CR4: 06f0 Call Trace: sock_queue_rcv_skb+0x3a/0x50 net/core/sock.c:451 llc_sap_state_process+0x3e3/0x4e0 net/llc/llc_sap.c:220 llc_sap_rcv net/llc/llc_sap.c:294 [inline] llc_sap_handler+0x695/0x1320 net/llc/llc_sap.c:434 llc_rcv+0x6da/0xed0 net/llc/llc_input.c:208 __netif_receive_skb_core+0x1ae5/0x3400 net/core/dev.c:4190 __netif_receive_skb+0x2a/0x170 net/core/dev.c:4228 process_backlog+0xe5/0x6c0 net/core/dev.c:4839 napi_poll net/core/dev.c:5202 [inline] net_rx_action+0xe70/0x1900 net/core/dev.c:5267 __do_softirq+0x2fb/0xb7d kernel/softirq.c:284 do_softirq_own_stack+0x1c/0x30 arch/x86/entry/entry_64.S:902 do_softirq.part.17+0x1e8/0x230 kernel/softirq.c:328 do_softirq kernel/softirq.c:176 [inline] __local_bh_enable_ip+0x1f2/0x200 kernel/softirq.c:181 local_bh_enable include/linux/bottom_half.h:31 [inline] rcu_read_unlock_bh include/linux/rcupdate.h:971 [inline] __dev_queue_xmit+0xd87/0x2860 net/core/dev.c:3399 dev_queue_xmit+0x17/0x20 net/core/dev.c:3405 llc_build_and_send_ui_pkt+0x240/0x330 
net/llc/llc_output.c:74 llc_ui_sendmsg+0x98d/0x1430 net/llc/af_llc.c:928 sock_sendmsg_nosec net/socket.c:635 [inline] sock_sendmsg+0xca/0x110 net/socket.c:645 ___sys_sendmsg+0x9d2/0xae0 net/socket.c:1985 __sys_sendmsg+0x138/0x320 net/socket.c:2019 SYSC_sendmsg net/socket.c:2030 [inline] SyS_sendmsg+0x2d/0x50 net/socket.c:2026 entry_SYSCALL_64_fastpath+0x1f/0xc2 RIP: 0033:0x4458b9 RSP: 002b:7fa8005abb58 EFLAGS: 0286 ORIG_RAX: 002e RAX: ffda RBX: 0006 RCX: 004458b9 RDX: 00040880 RSI: 20003000 RDI: 0006 RBP: 006e1b00 R08: R09: R10: R11: 0286 R12: 00708000 R13: 0082 R14: 2000 R15: Code: 4b 50 fe e9 b1 f8 ff ff e8 3e 4a 50 fe e9 78 f8 ff ff e8 34 4a 50 fe e9 6d f9 ff ff e8 2a 4a 50 fe e9 93 f9 ff ff e8 20 0a 26 fe <0f> 0b e8 19 0a 26 fe be 3c 01 00 00 48 c7 c7 e0 e9 22 85 e8 b8 RIP: skb_set_owner_r include/linux/skbuff.h:2389 [inline] RSP: 88003ec06b58 RIP: __sock_queue_rcv_skb+0x8c0/0xda0 net/core/sock.c:425 RSP: 88003ec06b58 ---[ end trace 58af2d02ad7f84f0 ]--- Kernel panic - not syncing: Fatal exception in interrupt Dumping ftrace buffer: (ftrace buffer empty) Kernel Offset: disabled Rebooting in 86400 seconds.. // autogenerated by syzkaller (http://github.com/google/syzkaller) #ifndef __NR_sendmsg #define __NR_sendmsg 46 #endif #ifndef __NR_mmap #define __NR_mmap 9 #endif #ifndef __NR_socket #define __NR_socket 41 #endif #ifndef __NR_connect #define __NR_connect 42 #endif #define _GNU_SOURCE #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include const int kFailStatus = 67; const int kErrorStatus = 68; const int kRetryStatus = 69; __attribute__((noreturn)) void doexit(int status) { volatile unsigned i; syscall(__NR_exit_group, status); for (i = 0;; i++) { } } __attribute__((noreturn)) void fail(const char* msg, ...) 
{ int e = errno; fflush(stdout); va_list args; va_start(args, msg); vfprintf(stderr, msg, args); va_end(args); fprintf(stderr, " (errno %d)\n", e); doexit(e == ENOMEM ? kRetryStatus : kFailStatus); } __attribute__((noreturn)) void exitf(const char* msg, ...) { int e = errno; fflush(stdout); va_list args; va_start(args, msg); vfprintf(stderr, msg, args); va_end(args); fprintf(stderr, " (errno %d)\n", e); doexit(kRetryStatus); } static int flag_debug; void debug(const char*
[PATCH] net: neterion: vxge: use new api ethtool_{get|set}_link_ksettings
The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone may test this patch.

Signed-off-by: Philippe Reynes
---
 drivers/net/ethernet/neterion/vxge/vxge-ethtool.c | 47 -
 1 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/neterion/vxge/vxge-ethtool.c b/drivers/net/ethernet/neterion/vxge/vxge-ethtool.c
index 9a29670..db55e6d 100644
--- a/drivers/net/ethernet/neterion/vxge/vxge-ethtool.c
+++ b/drivers/net/ethernet/neterion/vxge/vxge-ethtool.c
@@ -38,9 +38,9 @@
 };

 /**
- * vxge_ethtool_sset - Sets different link parameters.
+ * vxge_ethtool_set_link_ksettings - Sets different link parameters.
  * @dev: device pointer.
- * @info: pointer to the structure with parameters given by ethtool to set
+ * @cmd: pointer to the structure with parameters given by ethtool to set
  * link information.
  *
  * The function sets different link parameters provided by the user onto
@@ -48,44 +48,51 @@
  * Return value:
  * 0 on success.
  */
-static int vxge_ethtool_sset(struct net_device *dev, struct ethtool_cmd *info)
+static int
+vxge_ethtool_set_link_ksettings(struct net_device *dev,
+				const struct ethtool_link_ksettings *cmd)
 {
 	/* We currently only support 10Gb/FULL */
-	if ((info->autoneg == AUTONEG_ENABLE) ||
-	    (ethtool_cmd_speed(info) != SPEED_10000) ||
-	    (info->duplex != DUPLEX_FULL))
+	if ((cmd->base.autoneg == AUTONEG_ENABLE) ||
+	    (cmd->base.speed != SPEED_10000) ||
+	    (cmd->base.duplex != DUPLEX_FULL))
 		return -EINVAL;

 	return 0;
 }

 /**
- * vxge_ethtool_gset - Return link specific information.
+ * vxge_ethtool_get_link_ksettings - Return link specific information.
  * @dev: device pointer.
- * @info: pointer to the structure with parameters given by ethtool
+ * @cmd: pointer to the structure with parameters given by ethtool
  * to return link information.
  *
  * Returns link specific information like speed, duplex etc.. to ethtool.
  * Return value :
  * return 0 on success.
  */
-static int vxge_ethtool_gset(struct net_device *dev, struct ethtool_cmd *info)
+static int vxge_ethtool_get_link_ksettings(struct net_device *dev,
+					   struct ethtool_link_ksettings *cmd)
 {
-	info->supported = (SUPPORTED_10000baseT_Full | SUPPORTED_FIBRE);
-	info->advertising = (ADVERTISED_10000baseT_Full | ADVERTISED_FIBRE);
-	info->port = PORT_FIBRE;
+	ethtool_link_ksettings_zero_link_mode(cmd, supported);
+	ethtool_link_ksettings_add_link_mode(cmd, supported, 10000baseT_Full);
+	ethtool_link_ksettings_add_link_mode(cmd, supported, FIBRE);

-	info->transceiver = XCVR_EXTERNAL;
+	ethtool_link_ksettings_zero_link_mode(cmd, advertising);
+	ethtool_link_ksettings_add_link_mode(cmd, advertising, 10000baseT_Full);
+	ethtool_link_ksettings_add_link_mode(cmd, advertising, FIBRE);
+
+	cmd->base.port = PORT_FIBRE;

 	if (netif_carrier_ok(dev)) {
-		ethtool_cmd_speed_set(info, SPEED_10000);
-		info->duplex = DUPLEX_FULL;
+		cmd->base.speed = SPEED_10000;
+		cmd->base.duplex = DUPLEX_FULL;
 	} else {
-		ethtool_cmd_speed_set(info, SPEED_UNKNOWN);
-		info->duplex = DUPLEX_UNKNOWN;
+		cmd->base.speed = SPEED_UNKNOWN;
+		cmd->base.duplex = DUPLEX_UNKNOWN;
 	}

-	info->autoneg = AUTONEG_DISABLE;
+	cmd->base.autoneg = AUTONEG_DISABLE;
 	return 0;
 }

@@ -1126,8 +1133,6 @@ static int vxge_fw_flash(struct net_device *dev, struct ethtool_flash *parms)
 }

 static const struct ethtool_ops vxge_ethtool_ops = {
-	.get_settings = vxge_ethtool_gset,
-	.set_settings = vxge_ethtool_sset,
 	.get_drvinfo= vxge_ethtool_gdrvinfo,
 	.get_regs_len = vxge_ethtool_get_regs_len,
 	.get_regs = vxge_ethtool_gregs,
@@ -1139,6 +1144,8 @@ static int vxge_fw_flash(struct net_device *dev, struct ethtool_flash *parms)
 	.get_sset_count = vxge_ethtool_get_sset_count,
 	.get_ethtool_stats = vxge_get_ethtool_stats,
 	.flash_device = vxge_fw_flash,
+	.get_link_ksettings = vxge_ethtool_get_link_ksettings,
+	.set_link_ksettings = vxge_ethtool_set_link_ksettings,
 };

 void vxge_initialize_ethtool_ops(struct net_device *ndev)
--
1.7.4.4
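
For readers unfamiliar with the ksettings helpers, here is a minimal userspace sketch of what the zero/add link-mode calls in the patch amount to: clearing a bitmap and setting one bit per mode. It assumes the driver's single mode is the 10 Gb/s one (10000baseT_Full). Everything prefixed `mock_`/`MOCK_` is a hypothetical stand-in, not the kernel API; only the two bit positions are taken from the ethtool link-mode numbering.

```c
#include <stdint.h>

/* Bit positions following the ethtool link-mode numbering. */
enum {
	MOCK_LINK_MODE_FIBRE_BIT = 10,
	MOCK_LINK_MODE_10000baseT_Full_BIT = 12,
};

/* Hypothetical stand-in for struct ethtool_link_ksettings:
 * the real structure holds multi-word bitmaps, one word is enough here. */
struct mock_ksettings {
	uint64_t supported;
	uint64_t advertising;
};

/* Stand-ins for the zero/add link-mode helper macros. */
#define mock_zero_link_mode(ptr, name) ((ptr)->name = 0)
#define mock_add_link_mode(ptr, name, mode) \
	((ptr)->name |= UINT64_C(1) << MOCK_LINK_MODE_##mode##_BIT)

/* Same sequence of calls as the get_link_ksettings hunk, in userspace. */
static void mock_get_link_ksettings(struct mock_ksettings *cmd)
{
	mock_zero_link_mode(cmd, supported);
	mock_add_link_mode(cmd, supported, 10000baseT_Full);
	mock_add_link_mode(cmd, supported, FIBRE);

	mock_zero_link_mode(cmd, advertising);
	mock_add_link_mode(cmd, advertising, 10000baseT_Full);
	mock_add_link_mode(cmd, advertising, FIBRE);
}
```

With these two modes the resulting bitmap has the same bits set as the legacy 32-bit mask it replaces, which is why the conversion is mechanical for this driver.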
Re: [PATCH v2 net-next 00/14] mlx4: order-0 allocations and page recycling
On 09/02/2017 6:56 PM, Eric Dumazet wrote:
>> Default, out of box.
>
> Well. Please report :
>
> ethtool -l eth0
> ethtool -g eth0

$ ethtool -g p1p1
Ring parameters for p1p1:
Pre-set maximums:
RX:             8192
RX Mini:        0
RX Jumbo:       0
TX:             8192
Current hardware settings:
RX:             1024
RX Mini:        0
RX Jumbo:       0
TX:             512

$ ethtool -l p1p1
Channel parameters for p1p1:
Pre-set maximums:
RX:             128
TX:             32
Other:          0
Combined:       0
Current hardware settings:
RX:             8
TX:             32
Other:          0
Combined:       0
RE: [PATCH net-next v4 1/2] qed: Add infrastructure for PTP support.
> The original would return val == 1, period == 62499999; While this does have
> some error [val / (period * 16 + 8) is slightly bigger than 1 / 10^9, error at
> 18[?] digit after dot], it's the best we can configure for the HW.

Correction. That's actually not *the best* we could configure - due to
stopping at the first value between equal differences.
[you've already commented on that in the past, mentioning that we should
use >= and not >].
But that doesn't change the fact that your approximation can choose numbers
which can't be configured to the HW, and as a result incorrectly pick some
that will not minimize the approximation error.

> One simple adjustment we could do is simply break from the loop if 'diff ==
> 0'. At least for small PPB value this would be hit relatively quickly.

Given the previous correction, the suggestion would also include reversing the
order of the iteration [7 -> 1 instead of 1 -> 7].
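
The search discussed in this thread can be sketched in userspace as follows. This is a rough illustration under stated assumptions, not the driver code: it only considers values the register form val / (period * 16 + 8) can express, iterates val from 7 down to 1, and stops on an exact match. The names `drift_cfg`, `drift_diff` and `find_drift_cfg` are hypothetical, the upper bound on period is passed in as a parameter because the thread does not state it, and floating point is used for the comparison only because this is not kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC_LL 1000000000LL

struct drift_cfg {
	int64_t val;    /* adjustment in ns, 1..7 */
	int64_t period; /* register value; interval is period * 16 + 8 */
};

/* |ppb * (period * 16 + 8) - val * 10^9|; zero means an exact match. */
static int64_t drift_diff(int64_t ppb, int64_t val, int64_t period)
{
	int64_t d = ppb * (period * 16 + 8) - val * NSEC_PER_SEC_LL;

	return d < 0 ? -d : d;
}

/* Pick (val, period) minimising |val / (period * 16 + 8) - ppb / 10^9|,
 * restricted to periods in [1, max_period] (max_period is an assumption). */
static struct drift_cfg find_drift_cfg(int64_t ppb, int64_t max_period)
{
	struct drift_cfg best = { 0, 0 };
	double best_err = 0.0;
	int64_t val;
	int i;

	/* Bounds keep the 64-bit products below from overflowing. */
	assert(ppb > 0 && ppb <= NSEC_PER_SEC_LL);
	assert(max_period >= 1 && max_period <= 0x0FFFFFFF);

	for (val = 7; val >= 1; val--) {
		/* Ideal interval is val * 10^9 / ppb; solve period * 16 + 8 for it. */
		int64_t lo = (val * NSEC_PER_SEC_LL / ppb - 8) / 16;

		/* Try both rounding directions, clamped to the allowed range. */
		for (i = 0; i < 2; i++) {
			int64_t period = lo + i;
			int64_t diff;
			double err;

			if (period < 1)
				period = 1;
			if (period > max_period)
				period = max_period;

			diff = drift_diff(ppb, val, period);
			/* Proportional to the rate error of this candidate. */
			err = (double)diff / (double)(period * 16 + 8);
			if (best.val == 0 || err < best_err) {
				best.val = val;
				best.period = period;
				best_err = err;
			}
			if (diff == 0)
				return best; /* exact match, nothing can beat it */
		}
	}
	return best;
}
```

Because every candidate is first rounded and clamped to a period the register can hold, the error being compared is the error of something that can actually be programmed, which is the point raised above.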
Re: [PATCH v2 net-next 00/14] mlx4: order-0 allocations and page recycling
On Sun, Feb 12, 2017 at 7:04 AM, Tariq Toukan wrote:
>
> We consistently see this behavior: the higher the BW, the sharper the
> degradation.
>
> This is because the page-cache is of a fixed-size. Any fixed-size page-cache
> will always meet one of the following:
> 1) Too small to keep the pace when load is high.
> 2) Too big (in terms of memory footprint) when load is low.
>

So, we had the order-0 allocations for years at Google, then made the
horrible mistake of rebasing the mlx4 driver on the upstream one, and we had
all these issues under load.

I decided to redo the work I did years ago and upstream it.

I have warned Mellanox in the past (for cx-5 driver) that _any_ high
order allocation strategy was nice in benchmarks, but terrible in face
of real server workloads.
( And I am not even referring to malicious attacks )

Think about what happens on real servers :

In the order of 100,000 TCP sockets opened.

Then some incast or outcast problem (Mapreduce jobs are fond of this)
makes thousands of TCP sockets accumulate _millions_ of TCP messages in
their out of order queue per second.

There is no way you can hold millions of pages in mlx4 driver.
A "dynamic" page pool is going to fail very badly.

Sure, your iperf bench will look great. But who cares ?
Do you really have customers dedicating hosts to run 1 iperf full time ?

Make sure you run tests with 100,000 TCP sockets, and add small
networking flaps, with 5% packet losses.
This is what we really care about here.

I will send the v3 of the patch series, I really hope that it will go
in, because we at Google very much need it ASAP, and I would rather
not have to keep it private in our tree.

Do not focus on your benchmarks, that is marketing only.
Focus on the ability of the servers to _survive_ and continue their work.

You did not answer my questions by the way.

ethtool -g eth0
ethtool -l eth0

Thanks.
RE: [PATCH net-next v4 1/2] qed: Add infrastructure for PTP support.
> > Your suggestion seems to:
> >   a. Assume that the required period should be in ns, not in
> >      16*ns units.
> >   b. mishandles the +8/-8 in the calculation.
> >   c. Doesn't seem to consider the upper bound on period.
>
> Duh, you would have to convert the result into the proper form for the HW
> register and add bounds checking. I mean, that goes without saying.
> The important fact is that your algorithm is not optimal for ppm < 60.

Your algorithm ignores the HW limitation.
Consider (ppb == 1): your logic would output N == 7, *M == 7000000000,
which has perfect accuracy [N / *M is 1 / 10^9].
But the solution for 'period' * 16 + 8 == 7 * 10^9 isn't a whole number,
so this result doesn't really reflect the actual approximation error since
we couldn't configure it to HW.

The original would return val == 1, period == 62499999; While this does have
some error [val / (period * 16 + 8) is slightly bigger than 1 / 10^9, error at
18[?] digit after dot], it's the best we can configure for the HW.

One simple adjustment we could do is simply break from the loop if 'diff ==
0'. At least for small PPB value this would be hit relatively quickly.

> > One thing I still don't get is *why* we're trying to optimize this
> > area of the code -
>
> So you prefer using 21 64-bit divisions when using 8 produces better results?

No. In an ideal world, I would have liked optimizing everything.
But in this world if I do find time to spend on optimizations I'd rather do that
for the stuff that matters. I.e., datapath.
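
The two numeric claims for ppb == 1 can be checked by hand. The working below assumes the val == 1 case uses period = 62499999, the largest whole period for which period * 16 + 8 stays below 10^9; that figure is an inference from the arithmetic, not something read off the hardware.

```latex
% N = 7 would need an interval of exactly 7 * 10^9, but the register form gives
\[
  \mathrm{period}\cdot 16 + 8 = 7\cdot 10^{9}
  \;\Longrightarrow\;
  \mathrm{period} = \frac{7\cdot 10^{9} - 8}{16} = 437\,499\,999.5 ,
\]
% which is not a whole number, so that ratio cannot be programmed.

% For val = 1, period = 62 499 999 the interval is 999 999 992, and
\[
  \frac{1}{999\,999\,992} - \frac{1}{10^{9}}
  = \frac{10^{9} - 999\,999\,992}{999\,999\,992 \cdot 10^{9}}
  = \frac{8}{999\,999\,992 \cdot 10^{9}}
  \approx 8.0\times 10^{-18} .
\]
```

An error of about 8e-18 first shows up at the 18th digit after the decimal point, which matches the "18[?] digit" estimate.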
Re: [PATCH V5 for-next 16/21] RDMA/bnxt_re: Support poll_cq verb
On Fri, Feb 10, 2017 at 03:19:48AM -0800, Selvin Xavier wrote:
> Enables the fastpath ib_poll_cq verb.
>
> v2: Fixed sparse warnings
> v3: Fixes endianness related warnings reported by sparse. Also, fixes
>     smatch and checkpatch warnings
> v5: Uses ETH_P_IBOE macro for RoCE ethertype
>
> Signed-off-by: Eddie Wai
> Signed-off-by: Devesh Sharma
> Signed-off-by: Somnath Kotur
> Signed-off-by: Sriharsha Basavapatna
> Signed-off-by: Selvin Xavier
> ---
>  drivers/infiniband/hw/bnxt_re/ib_verbs.c | 522
>  drivers/infiniband/hw/bnxt_re/ib_verbs.h |   1 +
>  drivers/infiniband/hw/bnxt_re/main.c     |  22 +-
>  drivers/infiniband/hw/bnxt_re/qplib_fp.c | 560 ++-
>  drivers/infiniband/hw/bnxt_re/qplib_fp.h |   7 +-
>  5 files changed, 1107 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
> index 54d85bc..33af2e3 100644
> --- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
> +++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
> @@ -2230,6 +2230,528 @@ struct ib_cq *bnxt_re_create_cq(struct ib_device *ibdev,
>  	return ERR_PTR(rc);
>  }
>
> +static u8 __req_to_ib_wc_status(u8 qstatus)
> +{
> +	switch (qstatus) {
> +	case CQ_REQ_STATUS_OK:
> +		return IB_WC_SUCCESS;
> +	case CQ_REQ_STATUS_BAD_RESPONSE_ERR:
> +		return IB_WC_BAD_RESP_ERR;
> +	case CQ_REQ_STATUS_LOCAL_LENGTH_ERR:
> +		return IB_WC_LOC_LEN_ERR;
> +	case CQ_REQ_STATUS_LOCAL_QP_OPERATION_ERR:
> +		return IB_WC_LOC_QP_OP_ERR;
> +	case CQ_REQ_STATUS_LOCAL_PROTECTION_ERR:
> +		return IB_WC_LOC_PROT_ERR;
> +	case CQ_REQ_STATUS_MEMORY_MGT_OPERATION_ERR:
> +		return IB_WC_GENERAL_ERR;
> +	case CQ_REQ_STATUS_REMOTE_INVALID_REQUEST_ERR:
> +		return IB_WC_REM_INV_REQ_ERR;
> +	case CQ_REQ_STATUS_REMOTE_ACCESS_ERR:
> +		return IB_WC_REM_ACCESS_ERR;
> +	case CQ_REQ_STATUS_REMOTE_OPERATION_ERR:
> +		return IB_WC_REM_OP_ERR;
> +	case CQ_REQ_STATUS_RNR_NAK_RETRY_CNT_ERR:
> +		return IB_WC_RNR_RETRY_EXC_ERR;
> +	case CQ_REQ_STATUS_TRANSPORT_RETRY_CNT_ERR:
> +		return IB_WC_RETRY_EXC_ERR;
> +	case CQ_REQ_STATUS_WORK_REQUEST_FLUSHED_ERR:
> +		return IB_WC_WR_FLUSH_ERR;
> +	default:
> +		return IB_WC_GENERAL_ERR;
> +	}
> +	return 0;
> +}
> +
> +static u8 __rawqp1_to_ib_wc_status(u8 qstatus)
> +{
> +	switch (qstatus) {
> +	case CQ_RES_RAWETH_QP1_STATUS_OK:
> +		return IB_WC_SUCCESS;
> +	case CQ_RES_RAWETH_QP1_STATUS_LOCAL_ACCESS_ERROR:
> +		return IB_WC_LOC_ACCESS_ERR;
> +	case CQ_RES_RAWETH_QP1_STATUS_HW_LOCAL_LENGTH_ERR:
> +		return IB_WC_LOC_LEN_ERR;
> +	case CQ_RES_RAWETH_QP1_STATUS_LOCAL_PROTECTION_ERR:
> +		return IB_WC_LOC_PROT_ERR;
> +	case CQ_RES_RAWETH_QP1_STATUS_LOCAL_QP_OPERATION_ERR:
> +		return IB_WC_LOC_QP_OP_ERR;
> +	case CQ_RES_RAWETH_QP1_STATUS_MEMORY_MGT_OPERATION_ERR:
> +		return IB_WC_GENERAL_ERR;
> +	case CQ_RES_RAWETH_QP1_STATUS_WORK_REQUEST_FLUSHED_ERR:
> +		return IB_WC_WR_FLUSH_ERR;
> +	case CQ_RES_RAWETH_QP1_STATUS_HW_FLUSH_ERR:
> +		return IB_WC_WR_FLUSH_ERR;
> +	default:
> +		return IB_WC_GENERAL_ERR;
> +	}
> +}
> +
> +static u8 __rc_to_ib_wc_status(u8 qstatus)
> +{
> +	switch (qstatus) {
> +	case CQ_RES_RC_STATUS_OK:
> +		return IB_WC_SUCCESS;
> +	case CQ_RES_RC_STATUS_LOCAL_ACCESS_ERROR:
> +		return IB_WC_LOC_ACCESS_ERR;
> +	case CQ_RES_RC_STATUS_LOCAL_LENGTH_ERR:
> +		return IB_WC_LOC_LEN_ERR;
> +	case CQ_RES_RC_STATUS_LOCAL_PROTECTION_ERR:
> +		return IB_WC_LOC_PROT_ERR;
> +	case CQ_RES_RC_STATUS_LOCAL_QP_OPERATION_ERR:
> +		return IB_WC_LOC_QP_OP_ERR;
> +	case CQ_RES_RC_STATUS_MEMORY_MGT_OPERATION_ERR:
> +		return IB_WC_GENERAL_ERR;
> +	case CQ_RES_RC_STATUS_REMOTE_INVALID_REQUEST_ERR:
> +		return IB_WC_REM_INV_REQ_ERR;
> +	case CQ_RES_RC_STATUS_WORK_REQUEST_FLUSHED_ERR:
> +		return IB_WC_WR_FLUSH_ERR;
> +	case CQ_RES_RC_STATUS_HW_FLUSH_ERR:
> +		return IB_WC_WR_FLUSH_ERR;
> +	default:
> +		return IB_WC_GENERAL_ERR;
> +	}
> +}
> +

Why don't you use these defines directly?

Thanks