Re: linux-next: build failure after merge of the rcu tree

2017-02-12 Thread Stephen Rothwell
Hi Paul,

On Sun, 12 Feb 2017 20:37:48 -0800 "Paul E. McKenney" wrote:
>
> I chickened out on that commit for this merge window, so it will come
> back at -rc1.  But I will cover that when I rebase to -rc1.

OK, thanks.

-- 
Cheers,
Stephen Rothwell


Re: [PATCH] Make EN2 pin optional in the TRF7970A driver

2017-02-12 Thread Heiko Schocher

Hello Rob,

On 10.02.2017 at 16:51, Rob Herring wrote:

On Tue, Feb 07, 2017 at 06:22:04AM +0100, Heiko Schocher wrote:

From: Guan Ben 

Make the EN2 pin optional. This is useful for boards
that have this pin hard-wired, for example to ground.

Signed-off-by: Guan Ben 
Signed-off-by: Mark Jonas 
Signed-off-by: Heiko Schocher 

---

  .../devicetree/bindings/net/nfc/trf7970a.txt   |  4 ++--
  drivers/nfc/trf7970a.c | 26 --
  2 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/nfc/trf7970a.txt 
b/Documentation/devicetree/bindings/net/nfc/trf7970a.txt
index 32b35a0..5889a3d 100644
--- a/Documentation/devicetree/bindings/net/nfc/trf7970a.txt
+++ b/Documentation/devicetree/bindings/net/nfc/trf7970a.txt
@@ -5,8 +5,8 @@ Required properties:
  - spi-max-frequency: Maximum SPI frequency (<= 200).
  - interrupt-parent: phandle of parent interrupt handler.
  - interrupts: A single interrupt specifier.
-- ti,enable-gpios: Two GPIO entries used for 'EN' and 'EN2' pins on the
-  TRF7970A.
+- ti,enable-gpios: One or two GPIO entries used for 'EN' and 'EN2' pins on the
+  TRF7970A. EN2 is optional.


Could EN ever be optional/fixed? If so, perhaps deprecate this property
and do 2 properties, one for each pin.


The hardware I have has the EN2 pin hard-wired to ground. According to
http://www.ti.com/lit/ds/slos743k/slos743k.pdf, page 19, tables 6-3
and 6-4, the EN2 pin is a don't-care if EN = 1. If EN = 0, the EN2 pin
selects between Power Down and Sleep Mode ... I see no reason why
this is not possible/allowed ...

Hmm.. I do not like the idea of deprecating the "ti,enable-gpios"
property in favor of 2 separate properties ... but if this would be a
reason for not accepting this patch, I can do this ... How should I name
the 2 new properties?

"ti,pin-enable"  and "ti,pin-enable2" ?

bye,
Heiko
--
DENX Software Engineering GmbH,  Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
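
The "one or two GPIO entries" rule from the proposed binding text can be sketched in plain userspace C. This is only an illustration under stated assumptions: the struct and the demo_* function names are hypothetical and are not taken from the TRF7970A driver, and GPIOs are represented as bare integers.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container for the parsed pins; not the driver's real struct. */
struct demo_enable_pins {
    int en;        /* GPIO number for EN (always required) */
    int en2;       /* GPIO number for EN2, valid only if has_en2 is true */
    bool has_en2;  /* false when EN2 is hard-wired on the board */
};

/*
 * Accept one or two GPIO entries, as the proposed binding text allows.
 * Returns 0 on success and -1 when the entry count is not 1 or 2.
 */
static int demo_parse_enable_gpios(const int *gpios, size_t count,
                                   struct demo_enable_pins *pins)
{
    if (gpios == NULL || pins == NULL || count < 1 || count > 2)
        return -1;

    pins->en = gpios[0];
    pins->has_en2 = (count == 2);
    pins->en2 = pins->has_en2 ? gpios[1] : -1;
    return 0;
}

/* EN2 should only be driven when the board actually routes it to a GPIO. */
static bool demo_should_drive_en2(const struct demo_enable_pins *pins)
{
    return pins->has_en2;
}
```

With a single entry the sketch records EN only and reports that EN2 must not be driven; with two entries both pins are recorded.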


Re: [PATCH V5 for-next 16/21] RDMA/bnxt_re: Support poll_cq verb

2017-02-12 Thread Leon Romanovsky
On Mon, Feb 13, 2017 at 10:47:10AM +0530, Selvin Xavier wrote:
> On Sun, Feb 12, 2017 at 8:00 PM, Leon Romanovsky  wrote:
> >> +static u8 __rc_to_ib_wc_status(u8 qstatus)
> >> +{
> >> + switch (qstatus) {
> >> + case CQ_RES_RC_STATUS_OK:
> >> + return IB_WC_SUCCESS;
> >> + case CQ_RES_RC_STATUS_LOCAL_ACCESS_ERROR:
> >> + return IB_WC_LOC_ACCESS_ERR;
> >> + case CQ_RES_RC_STATUS_LOCAL_LENGTH_ERR:
> >> + return IB_WC_LOC_LEN_ERR;
> >> + case CQ_RES_RC_STATUS_LOCAL_PROTECTION_ERR:
> >> + return IB_WC_LOC_PROT_ERR;
> >> + case CQ_RES_RC_STATUS_LOCAL_QP_OPERATION_ERR:
> >> + return IB_WC_LOC_QP_OP_ERR;
> >> + case CQ_RES_RC_STATUS_MEMORY_MGT_OPERATION_ERR:
> >> + return IB_WC_GENERAL_ERR;
> >> + case CQ_RES_RC_STATUS_REMOTE_INVALID_REQUEST_ERR:
> >> + return IB_WC_REM_INV_REQ_ERR;
> >> + case CQ_RES_RC_STATUS_WORK_REQUEST_FLUSHED_ERR:
> >> + return IB_WC_WR_FLUSH_ERR;
> >> + case CQ_RES_RC_STATUS_HW_FLUSH_ERR:
> >> + return IB_WC_WR_FLUSH_ERR;
> >> + default:
> >> + return IB_WC_GENERAL_ERR;
> >> + }
> >> +}
> >> +
> >
> > Why don't you use these defines directly?
>
> CQ_RES* values are returned by the HW and these values are
> different from the corresponding IB_WC status values. For example,
> CQ_RES_RC_STATUS_HW_FLUSH_ERR is 8, whereas
> IB_WC_WR_FLUSH_ERR is 5.
> So we thought it is better to map these values in a function rather
> than having a switch/case in the calling function.
>
> Let me know if you meant something different in your query.

Thanks,
This from_u8 -> to_u8 conversion confused me, because of our similar
function mlx5_handle_error_cqe() which updates wc->status at the same
time as it is called. So I expected to see something similar in your code
where you fill wc.

Reviewed-by: Leon Romanovsky 


Re: [PATCH V5 for-next 16/21] RDMA/bnxt_re: Support poll_cq verb

2017-02-12 Thread Selvin Xavier
On Sun, Feb 12, 2017 at 8:00 PM, Leon Romanovsky  wrote:
>> +static u8 __rc_to_ib_wc_status(u8 qstatus)
>> +{
>> + switch (qstatus) {
>> + case CQ_RES_RC_STATUS_OK:
>> + return IB_WC_SUCCESS;
>> + case CQ_RES_RC_STATUS_LOCAL_ACCESS_ERROR:
>> + return IB_WC_LOC_ACCESS_ERR;
>> + case CQ_RES_RC_STATUS_LOCAL_LENGTH_ERR:
>> + return IB_WC_LOC_LEN_ERR;
>> + case CQ_RES_RC_STATUS_LOCAL_PROTECTION_ERR:
>> + return IB_WC_LOC_PROT_ERR;
>> + case CQ_RES_RC_STATUS_LOCAL_QP_OPERATION_ERR:
>> + return IB_WC_LOC_QP_OP_ERR;
>> + case CQ_RES_RC_STATUS_MEMORY_MGT_OPERATION_ERR:
>> + return IB_WC_GENERAL_ERR;
>> + case CQ_RES_RC_STATUS_REMOTE_INVALID_REQUEST_ERR:
>> + return IB_WC_REM_INV_REQ_ERR;
>> + case CQ_RES_RC_STATUS_WORK_REQUEST_FLUSHED_ERR:
>> + return IB_WC_WR_FLUSH_ERR;
>> + case CQ_RES_RC_STATUS_HW_FLUSH_ERR:
>> + return IB_WC_WR_FLUSH_ERR;
>> + default:
>> + return IB_WC_GENERAL_ERR;
>> + }
>> +}
>> +
>
> Why don't you use these defines directly?

CQ_RES* values are returned by the HW and these values are
different from the corresponding IB_WC status values. For example,
CQ_RES_RC_STATUS_HW_FLUSH_ERR is 8, whereas
IB_WC_WR_FLUSH_ERR is 5.
So we thought it is better to map these values in a function rather
than having a switch/case in the calling function.

Let me know if you meant something different in your query.
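
The mapping discussed above, together with the "fill wc at the call site" shape Leon expected, can be sketched as standalone C. Treat this as a rough illustration only: every demo_* name is hypothetical, and of the numeric values only the hardware flush code 8 and the verbs flush code 5 come from this thread; the others are placeholders.

```c
#include <stdint.h>

/* Stand-in hardware codes; only the value 8 is taken from the thread. */
enum demo_hw_status {
    DEMO_HW_OK = 0,
    DEMO_HW_LOCAL_LENGTH_ERR = 2,   /* placeholder value */
    DEMO_HW_FLUSH_ERR = 8,
};

/* Stand-in verbs codes; only the value 5 is taken from the thread. */
enum demo_wc_status {
    DEMO_WC_SUCCESS = 0,
    DEMO_WC_LOC_LEN_ERR = 1,        /* placeholder value */
    DEMO_WC_WR_FLUSH_ERR = 5,
    DEMO_WC_GENERAL_ERR = 21,       /* placeholder value */
};

/* Hypothetical, cut-down work completion holding just the status. */
struct demo_wc {
    uint8_t status;
};

/* Translate a hardware status into a verbs status; unknown codes map
 * to the general error, mirroring the default: branch in the patch. */
static uint8_t demo_to_wc_status(uint8_t hw_status)
{
    switch (hw_status) {
    case DEMO_HW_OK:
        return DEMO_WC_SUCCESS;
    case DEMO_HW_LOCAL_LENGTH_ERR:
        return DEMO_WC_LOC_LEN_ERR;
    case DEMO_HW_FLUSH_ERR:
        return DEMO_WC_WR_FLUSH_ERR;
    default:
        return DEMO_WC_GENERAL_ERR;
    }
}

/* Caller side: the translation happens where the completion is filled. */
static void demo_fill_wc(struct demo_wc *wc, uint8_t hw_status)
{
    wc->status = demo_to_wc_status(hw_status);
}
```

Keeping the switch in its own helper leaves the polling path free of per-status branches, which is the design choice described in the reply above.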


Re: linux-next: build failure after merge of the rcu tree

2017-02-12 Thread Paul E. McKenney
On Mon, Feb 13, 2017 at 01:21:33PM +1100, Stephen Rothwell wrote:
> Hi Paul,
> 
> On Thu, 19 Jan 2017 13:54:37 -0800 Paul McKenney  wrote:
> >
> > On Wed, Jan 18, 2017 at 7:34 PM, Stephen Rothwell wrote:
> > > Hi Paul,
> > >
> > > After merging the rcu tree, today's linux-next build (x86_64 allmodconfig)
> > > failed like this:
> > >
> > > > net/smc/af_smc.c:102:16: error: 'SLAB_DESTROY_BY_RCU' undeclared here (not in a function)
> > >   .slab_flags = SLAB_DESTROY_BY_RCU,
> > > ^
> > >
> > > Caused by commit
> > >
> > >   c7a545924ca1 ("mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU")
> > >
> > > interacting with commit
> > >
> > >   ac7138746e14 ("smc: establish new socket family")
> > >
> > > from the net-next tree.
> > >
> > > I have applied the following merge fix patch (someone will need to
> > > remember to mention this to Linus):  
> > 
> > Thank you, Stephen!  I expect that there might be a bit more
> > bikeshedding on the name, but here is hoping...  :-/
> 
> The need for this merge fix patch has gone away today.  Is that a
> permanent situation, or will it come back?

I chickened out on that commit for this merge window, so it will come
back at -rc1.  But I will cover that when I rebase to -rc1.

Thanx, Paul



Re: [PATCH 3/3] Bluetooth: hidp: fix possible might sleep error in hidp_session_thread

2017-02-12 Thread jeffy

Hi Brian,

On 02/11/2017 09:26 AM, Brian Norris wrote:

Hi Jeffy,

I'm really not an expert on bluetooth or HIDP, but I can't bring myself
to say that this is correct. I still think you have a problem.

On Tue, Jan 24, 2017 at 12:07:51PM +0800, Jeffy Chen wrote:

It looks like hidp_session_thread has the same pattern as the issue
reported in old rfcomm:

    while (1) {
        set_current_state(TASK_INTERRUPTIBLE);
        if (condition)
            break;
        // may call might_sleep here
        schedule();
    }
    __set_current_state(TASK_RUNNING);

Which was fixed in:
    dfb2fae Bluetooth: Fix nested sleeps

So let's fix it in the same way, also following the suggestion of:
https://lwn.net/Articles/628628/

Signed-off-by: Jeffy Chen 
---

  net/bluetooth/hidp/core.c | 23 +++
  1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/net/bluetooth/hidp/core.c b/net/bluetooth/hidp/core.c
index 0bec458..43d6e6a 100644
--- a/net/bluetooth/hidp/core.c
+++ b/net/bluetooth/hidp/core.c
@@ -36,6 +36,7 @@
  #define VERSION "1.2"
  
  static DECLARE_RWSEM(hidp_session_sem);

+static DECLARE_WAIT_QUEUE_HEAD(hidp_session_wq);
  static LIST_HEAD(hidp_session_list);
  
  static unsigned char hidp_keycode[256] = {

@@ -1068,12 +1069,15 @@ static int hidp_session_start_sync(struct hidp_session 
*session)
   * Wake up session thread and notify it to stop. This is asynchronous and
   * returns immediately. Call this whenever a runtime error occurs and you want
   * the session to stop.
- * Note: wake_up_process() performs any necessary memory-barriers for us.
   */
  static void hidp_session_terminate(struct hidp_session *session)
  {
atomic_inc(&session->terminate);
-   wake_up_process(session->task);
+
+   /* Ensure session->terminate is updated */
+   smp_mb__after_atomic();
+
+   wake_up_interruptible(&hidp_session_wq);

So, you're adding a whole new wait queue here.


  }
  
  /*

@@ -1180,7 +1184,9 @@ static void hidp_session_run(struct hidp_session *session)
struct sock *ctrl_sk = session->ctrl_sock->sk;
struct sock *intr_sk = session->intr_sock->sk;
struct sk_buff *skb;
+   DEFINE_WAIT_FUNC(wait, woken_wake_function);
  
+	add_wait_queue(&hidp_session_wq, &wait);

for (;;) {
/*
 * This thread can be woken up two ways:
@@ -1188,12 +1194,10 @@ static void hidp_session_run(struct hidp_session 
*session)
 *session->terminate flag and wakes this thread up.
 *  - Via modifying the socket state of ctrl/intr_sock. This
 *thread is woken up by ->sk_state_changed().
-*
-* Note: set_current_state() performs any necessary
-* memory-barriers for us.
 */
-   set_current_state(TASK_INTERRUPTIBLE);
  
+		/* Ensure session->terminate is updated */

+   smp_mb__before_atomic();
if (atomic_read(&session->terminate))
break;
  
@@ -1227,11 +1231,14 @@ static void hidp_session_run(struct hidp_session *session)

hidp_process_transmit(session, &session->ctrl_transmit,
  session->ctrl_sock);
  
-		schedule();

+   wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);

And you're waiting on it here.

But you're already on two other wait queues (hidp_session_thread()). So
the nice WQ_FLAG_WOKEN handling will only happen if you get woken via
the new hidp_session_wq queue. But what about the other two? Seems like
again you might have a race condition that would lead you to
(temporarily, at least?) missing a wake-up attempt.

Thanks for pointing that out.


I'm not really sure what the best way to resolve this would be. My best
guess would be to either consolidate the use of these wait queues, or
else roll a version of wait_woken() to handle 2 or more wait heads...

Am I wrong? I easily could be.

Brian


}
+   remove_wait_queue(&hidp_session_wq, &wait);
  
  	atomic_inc(&session->terminate);

-   set_current_state(TASK_RUNNING);
+
+   /* Ensure session->terminate is updated */
+   smp_mb__after_atomic();
  }
  
  /*

--
2.1.4
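
The lost wake-up concern in this thread comes down to whether a wake-up that arrives while the thread is still running is remembered. Below is a deliberately simplified, single-threaded C model of that idea. It is not kernel code: the demo_* names are hypothetical, the "woken" field merely stands in for WQ_FLAG_WOKEN, and real wait_woken() involves task states and memory barriers that this sketch ignores.

```c
#include <stdbool.h>

/* Hypothetical model of one waiter; these names are not kernel symbols. */
struct demo_waiter {
    bool woken;     /* stands in for WQ_FLAG_WOKEN */
    bool sleeping;  /* true while the waiter is blocked */
};

/* Waker: resume a sleeping waiter, or remember the wake-up for later. */
static void demo_wake(struct demo_waiter *w)
{
    if (w->sleeping)
        w->sleeping = false;
    else
        w->woken = true;
}

/*
 * Waiter: consume a pending wake-up if there is one, otherwise block.
 * Returns true when the waiter goes to sleep.
 */
static bool demo_wait_woken(struct demo_waiter *w)
{
    if (w->woken) {
        w->woken = false;
        return false;
    }
    w->sleeping = true;
    return true;
}
```

In this model a wake-up issued before the waiter reaches demo_wait_woken() is not lost: the next wait returns immediately instead of blocking. A wake-up delivered through a path that never sets the flag would not have that property, which is the situation described for the two socket wait queues.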


Re: [PATCH 2/3] Bluetooth: cmtp: fix possible might sleep error in cmtp_session

2017-02-12 Thread jeffy

Hi Brian,

On 02/11/2017 09:43 AM, Brian Norris wrote:

Hi,

On Tue, Jan 24, 2017 at 12:07:50PM +0800, Jeffy Chen wrote:

It looks like cmtp_session has the same pattern as the issue reported in
old rfcomm:

    while (1) {
        set_current_state(TASK_INTERRUPTIBLE);
        if (condition)
            break;
        // may call might_sleep here
        schedule();
    }
    __set_current_state(TASK_RUNNING);

Which was fixed in:
    dfb2fae Bluetooth: Fix nested sleeps

So let's fix it in the same way, also following the suggestion of:
https://lwn.net/Articles/628628/

Signed-off-by: Jeffy Chen 
---

  net/bluetooth/cmtp/core.c | 21 ++---
  1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/net/bluetooth/cmtp/core.c b/net/bluetooth/cmtp/core.c
index 9e59b66..6b03f2b 100644
--- a/net/bluetooth/cmtp/core.c
+++ b/net/bluetooth/cmtp/core.c
@@ -280,16 +280,16 @@ static int cmtp_session(void *arg)
struct cmtp_session *session = arg;
struct sock *sk = session->sock->sk;
struct sk_buff *skb;
-   wait_queue_t wait;
+   DEFINE_WAIT_FUNC(wait, woken_wake_function);
  
  	BT_DBG("session %p", session);
  
  	set_user_nice(current, -15);
  
-	init_waitqueue_entry(&wait, current);

add_wait_queue(sk_sleep(sk), &wait);
while (1) {
-   set_current_state(TASK_INTERRUPTIBLE);
+   /* Ensure session->terminate is updated */
+   smp_mb__before_atomic();
  
  		if (atomic_read(&session->terminate))

break;
@@ -306,9 +306,8 @@ static int cmtp_session(void *arg)
  
  		cmtp_process_transmit(session);
  
-		schedule();

+   wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
}
-   __set_current_state(TASK_RUNNING);
remove_wait_queue(sk_sleep(sk), &wait);
  
  	down_write(&cmtp_session_sem);

@@ -393,7 +392,11 @@ int cmtp_add_connection(struct cmtp_connadd_req *req, 
struct socket *sock)
err = cmtp_attach_device(session);
if (err < 0) {
atomic_inc(&session->terminate);
-   wake_up_process(session->task);
+
+   /* Ensure session->terminate is updated */
+   smp_mb__after_atomic();
+

Same comment about the barrier.

Done, there are barriers in wake functions indeed, thanx!



+   wake_up_interruptible(sk_sleep(session->sock->sk));
up_write(&cmtp_session_sem);
return err;
}
@@ -431,7 +434,11 @@ int cmtp_del_connection(struct cmtp_conndel_req *req)
  
  		/* Stop session thread */

atomic_inc(&session->terminate);
-   wake_up_process(session->task);
+
+   /* Ensure session->terminate is updated */
+   smp_mb__after_atomic();

And again.

But otherwise I think this looks OK, again with the caveat that I don't
know Bluetooth/CMTP that well:

Reviewed-by: Brian Norris 


+
+   wake_up_interruptible(sk_sleep(session->sock->sk));
} else
err = -ENOENT;
  
--

2.1.4


Re: [PATCH 1/3] Bluetooth: bnep: fix possible might sleep error in bnep_session

2017-02-12 Thread jeffy

Hi Brian,

On 02/11/2017 09:40 AM, Brian Norris wrote:

Hi,

On Tue, Jan 24, 2017 at 12:07:49PM +0800, Jeffy Chen wrote:

It looks like bnep_session has the same pattern as the issue reported in
old rfcomm:

    while (1) {
        set_current_state(TASK_INTERRUPTIBLE);
        if (condition)
            break;
        // may call might_sleep here
        schedule();
    }
    __set_current_state(TASK_RUNNING);

Which was fixed in:
    dfb2fae Bluetooth: Fix nested sleeps

So let's fix it in the same way, also following the suggestion of:
https://lwn.net/Articles/628628/

Signed-off-by: Jeffy Chen 
---

  net/bluetooth/bnep/core.c | 15 +--
  1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/net/bluetooth/bnep/core.c b/net/bluetooth/bnep/core.c
index fbf251f..da04d51 100644
--- a/net/bluetooth/bnep/core.c
+++ b/net/bluetooth/bnep/core.c
@@ -484,16 +484,16 @@ static int bnep_session(void *arg)
struct net_device *dev = s->dev;
struct sock *sk = s->sock->sk;
struct sk_buff *skb;
-   wait_queue_t wait;
+   DEFINE_WAIT_FUNC(wait, woken_wake_function);
  
  	BT_DBG("");
  
  	set_user_nice(current, -15);
  
-	init_waitqueue_entry(&wait, current);

add_wait_queue(sk_sleep(sk), &wait);
while (1) {
-   set_current_state(TASK_INTERRUPTIBLE);
+   /* Ensure session->terminate is updated */
+   smp_mb__before_atomic();
  
  		if (atomic_read(&s->terminate))

break;
@@ -515,9 +515,8 @@ static int bnep_session(void *arg)
break;
netif_wake_queue(dev);
  
-		schedule();

+   wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
}
-   __set_current_state(TASK_RUNNING);
remove_wait_queue(sk_sleep(sk), &wait);
  
  	/* Cleanup session */

@@ -666,7 +665,11 @@ int bnep_del_connection(struct bnep_conndel_req *req)
s = __bnep_get_session(req->dst);
if (s) {
atomic_inc(&s->terminate);
-   wake_up_process(s->task);
+
+   /* Ensure session->terminate is updated */
+   smp_mb__after_atomic();
+

__wake_up() suggests:

  * It may be assumed that this function implies a write memory barrier before
  * changing the task state if and only if any tasks are woken up.

so the above barrier is probably unnecessary. I'm not so sure about the
one before atomic_read(); seems fine.

Got it, thanx!


Other than that, I think this looks OK:

Reviewed-by: Brian Norris 

But I haven't been testing BNEP.

Brian


+   wake_up_interruptible(sk_sleep(s->sock->sk));
} else
err = -ENOENT;
  
--

2.1.4


[PATCH v3 3/3] Bluetooth: hidp: fix possible might sleep error in hidp_session_thread

2017-02-12 Thread Jeffy Chen
It looks like hidp_session_thread has the same pattern as the issue
reported in old rfcomm:

    while (1) {
        set_current_state(TASK_INTERRUPTIBLE);
        if (condition)
            break;
        // may call might_sleep here
        schedule();
    }
    __set_current_state(TASK_RUNNING);

Which was fixed in:
    dfb2fae Bluetooth: Fix nested sleeps

So let's fix it in the same way, also following the suggestion of:
https://lwn.net/Articles/628628/

Signed-off-by: Jeffy Chen 
1/ Fix wake-ups being missed for wake attempts on the original wait queues.
2/ Remove unnecessary memory barrier before wake_up_* functions.

---

Changes in v3: None
Changes in v2: None

 net/bluetooth/hidp/core.c | 33 ++---
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/net/bluetooth/hidp/core.c b/net/bluetooth/hidp/core.c
index 0bec458..076bc50 100644
--- a/net/bluetooth/hidp/core.c
+++ b/net/bluetooth/hidp/core.c
@@ -36,6 +36,7 @@
 #define VERSION "1.2"
 
 static DECLARE_RWSEM(hidp_session_sem);
+static DECLARE_WAIT_QUEUE_HEAD(hidp_session_wq);
 static LIST_HEAD(hidp_session_list);
 
 static unsigned char hidp_keycode[256] = {
@@ -1068,12 +1069,12 @@ static int hidp_session_start_sync(struct hidp_session 
*session)
  * Wake up session thread and notify it to stop. This is asynchronous and
  * returns immediately. Call this whenever a runtime error occurs and you want
  * the session to stop.
- * Note: wake_up_process() performs any necessary memory-barriers for us.
+ * Note: wake_up_interruptible() performs any necessary memory-barriers for us.
  */
 static void hidp_session_terminate(struct hidp_session *session)
 {
atomic_inc(&session->terminate);
-   wake_up_process(session->task);
+   wake_up_interruptible(&hidp_session_wq);
 }
 
 /*
@@ -1180,7 +1181,9 @@ static void hidp_session_run(struct hidp_session *session)
struct sock *ctrl_sk = session->ctrl_sock->sk;
struct sock *intr_sk = session->intr_sock->sk;
struct sk_buff *skb;
+   DEFINE_WAIT_FUNC(wait, woken_wake_function);
 
+   add_wait_queue(&hidp_session_wq, &wait);
for (;;) {
/*
 * This thread can be woken up two ways:
@@ -1188,12 +1191,10 @@ static void hidp_session_run(struct hidp_session 
*session)
 *session->terminate flag and wakes this thread up.
 *  - Via modifying the socket state of ctrl/intr_sock. This
 *thread is woken up by ->sk_state_changed().
-*
-* Note: set_current_state() performs any necessary
-* memory-barriers for us.
 */
-   set_current_state(TASK_INTERRUPTIBLE);
 
+   /* Ensure session->terminate is updated */
+   smp_mb__before_atomic();
if (atomic_read(&session->terminate))
break;
 
@@ -1227,11 +1228,22 @@ static void hidp_session_run(struct hidp_session 
*session)
hidp_process_transmit(session, &session->ctrl_transmit,
  session->ctrl_sock);
 
-   schedule();
+   wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
}
+   remove_wait_queue(&hidp_session_wq, &wait);
 
atomic_inc(&session->terminate);
-   set_current_state(TASK_RUNNING);
+
+   /* Ensure session->terminate is updated */
+   smp_mb__after_atomic();
+}
+
+int hidp_session_wake_function(wait_queue_t *wait, unsigned int mode,
+  int sync, void *key)
+{
+   wake_up_interruptible(&hidp_session_wq);
+
+   return default_wake_function(wait, mode, sync, key);
 }
 
 /*
@@ -1244,7 +1256,8 @@ static void hidp_session_run(struct hidp_session *session)
 static int hidp_session_thread(void *arg)
 {
struct hidp_session *session = arg;
-   wait_queue_t ctrl_wait, intr_wait;
+   DEFINE_WAIT_FUNC(ctrl_wait, hidp_session_wake_function);
+   DEFINE_WAIT_FUNC(intr_wait, hidp_session_wake_function);
 
BT_DBG("session %p", session);
 
@@ -1254,8 +1267,6 @@ static int hidp_session_thread(void *arg)
set_user_nice(current, -15);
hidp_set_timer(session);
 
-   init_waitqueue_entry(&ctrl_wait, current);
-   init_waitqueue_entry(&intr_wait, current);
add_wait_queue(sk_sleep(session->ctrl_sock->sk), &ctrl_wait);
add_wait_queue(sk_sleep(session->intr_sock->sk), &intr_wait);
/* This memory barrier is paired with wq_has_sleeper(). See
-- 
2.1.4
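
The forwarding wake function in this patch can be pictured with a small standalone C model: entries on the socket queues carry a callback that re-issues the wake-up on the session queue, where the entry that records wake-ups lives. This is a sketch under loud assumptions: all demo_* names are hypothetical, the queue is a fixed array rather than a kernel wait queue, and the call to default_wake_function() made by the real patch is left out.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_ENTRIES 4

struct demo_entry;
typedef void (*demo_wake_fn)(struct demo_entry *entry);

/* Hypothetical wait-queue entry: a callback plus a "woken" marker. */
struct demo_entry {
    demo_wake_fn func;
    bool woken;
};

/* Hypothetical wait queue: a small fixed array of entries. */
struct demo_queue {
    struct demo_entry *entries[DEMO_MAX_ENTRIES];
    size_t count;
};

/* Models hidp_session_wq from the patch. */
static struct demo_queue demo_session_wq;

static bool demo_add_wait_queue(struct demo_queue *q, struct demo_entry *e)
{
    if (q->count == DEMO_MAX_ENTRIES)
        return false;
    q->entries[q->count++] = e;
    return true;
}

/* Waking a queue runs the callback of every entry queued on it. */
static void demo_wake_up(struct demo_queue *q)
{
    for (size_t i = 0; i < q->count; i++)
        q->entries[i]->func(q->entries[i]);
}

/* Callback for the session entry: remember that a wake-up happened. */
static void demo_woken_wake(struct demo_entry *e)
{
    e->woken = true;
}

/* Callback for the socket entries: forward the wake-up to the session queue. */
static void demo_forwarding_wake(struct demo_entry *e)
{
    (void)e;
    demo_wake_up(&demo_session_wq);
}
```

In this model, a wake-up on a socket queue ends up marking the session entry as woken, so a later wait on the session queue would see it.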




[PATCH v3 1/3] Bluetooth: bnep: fix possible might sleep error in bnep_session

2017-02-12 Thread Jeffy Chen
It looks like bnep_session has the same pattern as the issue reported in
old rfcomm:

    while (1) {
        set_current_state(TASK_INTERRUPTIBLE);
        if (condition)
            break;
        // may call might_sleep here
        schedule();
    }
    __set_current_state(TASK_RUNNING);

Which was fixed in:
    dfb2fae Bluetooth: Fix nested sleeps

So let's fix it in the same way, also following the suggestion of:
https://lwn.net/Articles/628628/

Signed-off-by: Jeffy Chen 
Reviewed-by: Brian Norris 
---

Changes in v3:
Add Brian's Reviewed-by.

Changes in v2:
Remove unnecessary memory barrier before wake_up_* functions.

 net/bluetooth/bnep/core.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/net/bluetooth/bnep/core.c b/net/bluetooth/bnep/core.c
index fbf251f..4d6b94d 100644
--- a/net/bluetooth/bnep/core.c
+++ b/net/bluetooth/bnep/core.c
@@ -484,16 +484,16 @@ static int bnep_session(void *arg)
struct net_device *dev = s->dev;
struct sock *sk = s->sock->sk;
struct sk_buff *skb;
-   wait_queue_t wait;
+   DEFINE_WAIT_FUNC(wait, woken_wake_function);
 
BT_DBG("");
 
set_user_nice(current, -15);
 
-   init_waitqueue_entry(&wait, current);
add_wait_queue(sk_sleep(sk), &wait);
while (1) {
-   set_current_state(TASK_INTERRUPTIBLE);
+   /* Ensure session->terminate is updated */
+   smp_mb__before_atomic();
 
if (atomic_read(&s->terminate))
break;
@@ -515,9 +515,8 @@ static int bnep_session(void *arg)
break;
netif_wake_queue(dev);
 
-   schedule();
+   wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
}
-   __set_current_state(TASK_RUNNING);
remove_wait_queue(sk_sleep(sk), &wait);
 
/* Cleanup session */
@@ -666,7 +665,7 @@ int bnep_del_connection(struct bnep_conndel_req *req)
s = __bnep_get_session(req->dst);
if (s) {
atomic_inc(&s->terminate);
-   wake_up_process(s->task);
+   wake_up_interruptible(sk_sleep(s->sock->sk));
} else
err = -ENOENT;
 
-- 
2.1.4




[PATCH v3 2/3] Bluetooth: cmtp: fix possible might sleep error in cmtp_session

2017-02-12 Thread Jeffy Chen
It looks like cmtp_session has the same pattern as the issue reported in
old rfcomm:

    while (1) {
        set_current_state(TASK_INTERRUPTIBLE);
        if (condition)
            break;
        // may call might_sleep here
        schedule();
    }
    __set_current_state(TASK_RUNNING);

Which was fixed in:
    dfb2fae Bluetooth: Fix nested sleeps

So let's fix it in the same way, also following the suggestion of:
https://lwn.net/Articles/628628/

Signed-off-by: Jeffy Chen 
Reviewed-by: Brian Norris 
Remove unnecessary memory barrier before wake_up_* functions.

---

Changes in v3:
Add Brian's Reviewed-by.

Changes in v2: None

 net/bluetooth/cmtp/core.c | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/net/bluetooth/cmtp/core.c b/net/bluetooth/cmtp/core.c
index 9e59b66..1152ce3 100644
--- a/net/bluetooth/cmtp/core.c
+++ b/net/bluetooth/cmtp/core.c
@@ -280,16 +280,16 @@ static int cmtp_session(void *arg)
struct cmtp_session *session = arg;
struct sock *sk = session->sock->sk;
struct sk_buff *skb;
-   wait_queue_t wait;
+   DEFINE_WAIT_FUNC(wait, woken_wake_function);
 
BT_DBG("session %p", session);
 
set_user_nice(current, -15);
 
-   init_waitqueue_entry(&wait, current);
add_wait_queue(sk_sleep(sk), &wait);
while (1) {
-   set_current_state(TASK_INTERRUPTIBLE);
+   /* Ensure session->terminate is updated */
+   smp_mb__before_atomic();
 
if (atomic_read(&session->terminate))
break;
@@ -306,9 +306,8 @@ static int cmtp_session(void *arg)
 
cmtp_process_transmit(session);
 
-   schedule();
+   wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
}
-   __set_current_state(TASK_RUNNING);
remove_wait_queue(sk_sleep(sk), &wait);
 
down_write(&cmtp_session_sem);
@@ -393,7 +392,7 @@ int cmtp_add_connection(struct cmtp_connadd_req *req, 
struct socket *sock)
err = cmtp_attach_device(session);
if (err < 0) {
atomic_inc(&session->terminate);
-   wake_up_process(session->task);
+   wake_up_interruptible(sk_sleep(session->sock->sk));
up_write(&cmtp_session_sem);
return err;
}
@@ -431,7 +430,11 @@ int cmtp_del_connection(struct cmtp_conndel_req *req)
 
/* Stop session thread */
atomic_inc(&session->terminate);
-   wake_up_process(session->task);
+
+   /* Ensure session->terminate is updated */
+   smp_mb__after_atomic();
+
+   wake_up_interruptible(sk_sleep(session->sock->sk));
} else
err = -ENOENT;
 
-- 
2.1.4




[PATCH v2 3/3] Bluetooth: hidp: fix possible might sleep error in hidp_session_thread

2017-02-12 Thread Jeffy Chen
It looks like hidp_session_thread has the same pattern as the issue
reported in old rfcomm:

    while (1) {
        set_current_state(TASK_INTERRUPTIBLE);
        if (condition)
            break;
        // may call might_sleep here
        schedule();
    }
    __set_current_state(TASK_RUNNING);

Which was fixed in:
    dfb2fae Bluetooth: Fix nested sleeps

So let's fix it in the same way, also following the suggestion of:
https://lwn.net/Articles/628628/

Signed-off-by: Jeffy Chen 
1/ Fix wake-ups being missed for wake attempts on the original wait queues.
2/ Remove unnecessary memory barrier before wake_up_* functions.

---

Changes in v2: None

 net/bluetooth/hidp/core.c | 33 ++---
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/net/bluetooth/hidp/core.c b/net/bluetooth/hidp/core.c
index 0bec458..076bc50 100644
--- a/net/bluetooth/hidp/core.c
+++ b/net/bluetooth/hidp/core.c
@@ -36,6 +36,7 @@
 #define VERSION "1.2"
 
 static DECLARE_RWSEM(hidp_session_sem);
+static DECLARE_WAIT_QUEUE_HEAD(hidp_session_wq);
 static LIST_HEAD(hidp_session_list);
 
 static unsigned char hidp_keycode[256] = {
@@ -1068,12 +1069,12 @@ static int hidp_session_start_sync(struct hidp_session 
*session)
  * Wake up session thread and notify it to stop. This is asynchronous and
  * returns immediately. Call this whenever a runtime error occurs and you want
  * the session to stop.
- * Note: wake_up_process() performs any necessary memory-barriers for us.
+ * Note: wake_up_interruptible() performs any necessary memory-barriers for us.
  */
 static void hidp_session_terminate(struct hidp_session *session)
 {
atomic_inc(&session->terminate);
-   wake_up_process(session->task);
+   wake_up_interruptible(&hidp_session_wq);
 }
 
 /*
@@ -1180,7 +1181,9 @@ static void hidp_session_run(struct hidp_session *session)
struct sock *ctrl_sk = session->ctrl_sock->sk;
struct sock *intr_sk = session->intr_sock->sk;
struct sk_buff *skb;
+   DEFINE_WAIT_FUNC(wait, woken_wake_function);
 
+   add_wait_queue(&hidp_session_wq, &wait);
for (;;) {
/*
 * This thread can be woken up two ways:
@@ -1188,12 +1191,10 @@ static void hidp_session_run(struct hidp_session 
*session)
 *session->terminate flag and wakes this thread up.
 *  - Via modifying the socket state of ctrl/intr_sock. This
 *thread is woken up by ->sk_state_changed().
-*
-* Note: set_current_state() performs any necessary
-* memory-barriers for us.
 */
-   set_current_state(TASK_INTERRUPTIBLE);
 
+   /* Ensure session->terminate is updated */
+   smp_mb__before_atomic();
if (atomic_read(&session->terminate))
break;
 
@@ -1227,11 +1228,22 @@ static void hidp_session_run(struct hidp_session 
*session)
hidp_process_transmit(session, &session->ctrl_transmit,
  session->ctrl_sock);
 
-   schedule();
+   wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
}
+   remove_wait_queue(&hidp_session_wq, &wait);
 
atomic_inc(&session->terminate);
-   set_current_state(TASK_RUNNING);
+
+   /* Ensure session->terminate is updated */
+   smp_mb__after_atomic();
+}
+
+int hidp_session_wake_function(wait_queue_t *wait, unsigned int mode,
+  int sync, void *key)
+{
+   wake_up_interruptible(&hidp_session_wq);
+
+   return default_wake_function(wait, mode, sync, key);
 }
 
 /*
@@ -1244,7 +1256,8 @@ static void hidp_session_run(struct hidp_session *session)
 static int hidp_session_thread(void *arg)
 {
struct hidp_session *session = arg;
-   wait_queue_t ctrl_wait, intr_wait;
+   DEFINE_WAIT_FUNC(ctrl_wait, hidp_session_wake_function);
+   DEFINE_WAIT_FUNC(intr_wait, hidp_session_wake_function);
 
BT_DBG("session %p", session);
 
@@ -1254,8 +1267,6 @@ static int hidp_session_thread(void *arg)
set_user_nice(current, -15);
hidp_set_timer(session);
 
-   init_waitqueue_entry(&ctrl_wait, current);
-   init_waitqueue_entry(&intr_wait, current);
add_wait_queue(sk_sleep(session->ctrl_sock->sk), &ctrl_wait);
add_wait_queue(sk_sleep(session->intr_sock->sk), &intr_wait);
/* This memory barrier is paired with wq_has_sleeper(). See
-- 
2.1.4




[PATCH v2 1/3] Bluetooth: bnep: fix possible might sleep error in bnep_session

2017-02-12 Thread Jeffy Chen
It looks like bnep_session has the same pattern as the issue reported in
old rfcomm:

    while (1) {
        set_current_state(TASK_INTERRUPTIBLE);
        if (condition)
            break;
        // may call might_sleep here
        schedule();
    }
    __set_current_state(TASK_RUNNING);

Which was fixed in:
    dfb2fae Bluetooth: Fix nested sleeps

So let's fix it in the same way, also following the suggestion of:
https://lwn.net/Articles/628628/

Signed-off-by: Jeffy Chen 
---

Changes in v2:
Remove unnecessary memory barrier before wake_up_* functions.

 net/bluetooth/bnep/core.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/net/bluetooth/bnep/core.c b/net/bluetooth/bnep/core.c
index fbf251f..4d6b94d 100644
--- a/net/bluetooth/bnep/core.c
+++ b/net/bluetooth/bnep/core.c
@@ -484,16 +484,16 @@ static int bnep_session(void *arg)
struct net_device *dev = s->dev;
struct sock *sk = s->sock->sk;
struct sk_buff *skb;
-   wait_queue_t wait;
+   DEFINE_WAIT_FUNC(wait, woken_wake_function);
 
BT_DBG("");
 
set_user_nice(current, -15);
 
-   init_waitqueue_entry(&wait, current);
add_wait_queue(sk_sleep(sk), &wait);
while (1) {
-   set_current_state(TASK_INTERRUPTIBLE);
+   /* Ensure session->terminate is updated */
+   smp_mb__before_atomic();
 
if (atomic_read(&s->terminate))
break;
@@ -515,9 +515,8 @@ static int bnep_session(void *arg)
break;
netif_wake_queue(dev);
 
-   schedule();
+   wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
}
-   __set_current_state(TASK_RUNNING);
remove_wait_queue(sk_sleep(sk), &wait);
 
/* Cleanup session */
@@ -666,7 +665,7 @@ int bnep_del_connection(struct bnep_conndel_req *req)
s = __bnep_get_session(req->dst);
if (s) {
atomic_inc(&s->terminate);
-   wake_up_process(s->task);
+   wake_up_interruptible(sk_sleep(s->sock->sk));
} else
err = -ENOENT;
 
-- 
2.1.4




[PATCH v2 2/3] Bluetooth: cmtp: fix possible might sleep error in cmtp_session

2017-02-12 Thread Jeffy Chen
It looks like cmtp_session has the same pattern as the issue reported in
old rfcomm:

    while (1) {
        set_current_state(TASK_INTERRUPTIBLE);
        if (condition)
            break;
        // may call might_sleep here
        schedule();
    }
    __set_current_state(TASK_RUNNING);

Which was fixed in:
    dfb2fae Bluetooth: Fix nested sleeps

So let's fix it in the same way, also following the suggestion of:
https://lwn.net/Articles/628628/

Signed-off-by: Jeffy Chen 
Remove unnecessary memory barrier before wake_up_* functions.

---

Changes in v2: None

 net/bluetooth/cmtp/core.c | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/net/bluetooth/cmtp/core.c b/net/bluetooth/cmtp/core.c
index 9e59b66..1152ce3 100644
--- a/net/bluetooth/cmtp/core.c
+++ b/net/bluetooth/cmtp/core.c
@@ -280,16 +280,16 @@ static int cmtp_session(void *arg)
struct cmtp_session *session = arg;
struct sock *sk = session->sock->sk;
struct sk_buff *skb;
-   wait_queue_t wait;
+   DEFINE_WAIT_FUNC(wait, woken_wake_function);
 
BT_DBG("session %p", session);
 
set_user_nice(current, -15);
 
-   init_waitqueue_entry(&wait, current);
add_wait_queue(sk_sleep(sk), &wait);
while (1) {
-   set_current_state(TASK_INTERRUPTIBLE);
+   /* Ensure session->terminate is updated */
+   smp_mb__before_atomic();
 
if (atomic_read(&session->terminate))
break;
@@ -306,9 +306,8 @@ static int cmtp_session(void *arg)
 
cmtp_process_transmit(session);
 
-   schedule();
+   wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
}
-   __set_current_state(TASK_RUNNING);
remove_wait_queue(sk_sleep(sk), &wait);
 
down_write(&cmtp_session_sem);
@@ -393,7 +392,7 @@ int cmtp_add_connection(struct cmtp_connadd_req *req, struct socket *sock)
err = cmtp_attach_device(session);
if (err < 0) {
atomic_inc(&session->terminate);
-   wake_up_process(session->task);
+   wake_up_interruptible(sk_sleep(session->sock->sk));
up_write(&cmtp_session_sem);
return err;
}
@@ -431,7 +430,11 @@ int cmtp_del_connection(struct cmtp_conndel_req *req)
 
/* Stop session thread */
atomic_inc(&session->terminate);
-   wake_up_process(session->task);
+
+   /* Ensure session->terminate is updated */
+   smp_mb__after_atomic();
+
+   wake_up_interruptible(sk_sleep(session->sock->sk));
} else
err = -ENOENT;
 
-- 
2.1.4
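
The reason wait_woken() avoids the lost wake-up can be shown with a small single-threaded model. This is only a sketch under assumptions: toy_wait, toy_wake() and toy_wait_woken() are hypothetical userspace names, and the model keeps nothing but the "woken" flag that the waker sets and the sleeper consumes. It does not reproduce the kernel's scheduling or memory ordering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a wait queue entry; only the woken flag is modeled. */
struct toy_wait {
	bool woken;
};

/* Waker side: record the wake-up in the entry itself. */
static void toy_wake(struct toy_wait *w)
{
	assert(w != NULL);
	w->woken = true;
}

/*
 * Sleeper side: returns true if the caller would have gone to sleep.
 * A pending wake-up is consumed instead of sleeping, so a wake-up that
 * arrives between the condition check and the wait is not lost.
 */
static bool toy_wait_woken(struct toy_wait *w)
{
	bool would_sleep;

	assert(w != NULL);
	would_sleep = !w->woken;
	w->woken = false;
	return would_sleep;
}
```

In this model a toy_wake() issued before toy_wait_woken() makes the next wait return immediately, and only the wait after that would sleep.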




Re: [net-next 00/14][pull request] 40GbE Intel Wired LAN Driver Updates 2017-02-11

2017-02-12 Thread David Miller
From: Jeff Kirsher 
Date: Sat, 11 Feb 2017 21:30:14 -0800

> This series contains updates to i40e and i40evf only.

Pulled, thanks Jeff.

Please address Sergei's feedback for patch #12 in followup changes, if
necessary, thank you.


Re: [PATCH] net: micrel: ks8851: use new api ethtool_{get|set}_link_ksettings

2017-02-12 Thread David Miller
From: Philippe Reynes 
Date: Thu,  9 Feb 2017 09:57:47 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> As I don't have the hardware, I'd be very pleased if
> someone may test this patch.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH] net: micrel: ks8695net: use new api ethtool_{get|set}_link_ksettings

2017-02-12 Thread David Miller
From: Philippe Reynes 
Date: Wed,  8 Feb 2017 23:54:45 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> As I don't have the hardware, I'd be very pleased if
> someone may test this patch.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH] net: micrel: ks8851_mll: use new api ethtool_{get|set}_link_ksettings

2017-02-12 Thread David Miller
From: Philippe Reynes 
Date: Thu,  9 Feb 2017 11:28:25 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> As I don't have the hardware, I'd be very pleased if
> someone may test this patch.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH] net: microchip: encx24j600: use new api ethtool_{get|set}_link_ksettings

2017-02-12 Thread David Miller
From: Philippe Reynes 
Date: Thu,  9 Feb 2017 22:42:18 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> As I don't have the hardware, I'd be very pleased if
> someone may test this patch.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH] net: nuvoton: w90p910: use new api ethtool_{get|set}_link_ksettings

2017-02-12 Thread David Miller
From: Philippe Reynes 
Date: Sun, 12 Feb 2017 21:38:29 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> As I don't have the hardware, I'd be very pleased if
> someone may test this patch.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH] net: natsemi: use new api ethtool_{get|set}_link_ksettings

2017-02-12 Thread David Miller
From: Philippe Reynes 
Date: Thu,  9 Feb 2017 23:58:25 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> As I don't have the hardware, I'd be very pleased if
> someone may test this patch.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH] net: neterion: s2io: use new api ethtool_{get|set}_link_ksettings

2017-02-12 Thread David Miller
From: Philippe Reynes 
Date: Sun, 12 Feb 2017 11:44:36 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> As I don't have the hardware, I'd be very pleased if
> someone may test this patch.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH] net: micrel: ksz884x: use new api ethtool_{get|set}_link_ksettings

2017-02-12 Thread David Miller
From: Philippe Reynes 
Date: Thu,  9 Feb 2017 20:25:06 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> As I don't have the hardware, I'd be very pleased if
> someone may test this patch.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH] net: myricom: myri10ge: use new api ethtool_{get|set}_link_ksettings

2017-02-12 Thread David Miller
From: Philippe Reynes 
Date: Thu,  9 Feb 2017 23:17:23 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> As I don't have the hardware, I'd be very pleased if
> someone may test this patch.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH] net: natsemi: ns83820: use new api ethtool_{get|set}_link_ksettings

2017-02-12 Thread David Miller
From: Philippe Reynes 
Date: Fri, 10 Feb 2017 23:57:48 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> As I don't have the hardware, I'd be very pleased if
> someone may test this patch.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH] net: neterion: vxge: use new api ethtool_{get|set}_link_ksettings

2017-02-12 Thread David Miller
From: Philippe Reynes 
Date: Sun, 12 Feb 2017 17:33:13 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> As I don't have the hardware, I'd be very pleased if
> someone may test this patch.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH] net: microchip: enc28j60: use new api ethtool_{get|set}_link_ksettings

2017-02-12 Thread David Miller
From: Philippe Reynes 
Date: Thu,  9 Feb 2017 22:02:47 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> As I don't have the hardware, I'd be very pleased if
> someone may test this patch.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH net-next 0/9] bnxt_en: Misc updates.

2017-02-12 Thread David Miller
From: Michael Chan 
Date: Sun, 12 Feb 2017 19:18:09 -0500

> Miscellaneous updates include update of the firmware spec, ethtool flash
> enhancement, ethtool -l minor fix, NTUPLE support enhancements, FEC
> link settings message during link up, and new PCI IDs.  Please review.
> Thanks.

This all looks pretty straightforward to me, series applied, thanks
Michael.


Re: [PATCH net] net/llc: avoid BUG_ON() in skb_orphan()

2017-02-12 Thread David Miller
From: Eric Dumazet 
Date: Sun, 12 Feb 2017 14:03:52 -0800

> From: Eric Dumazet 
> 
> It seems nobody used LLC since linux-3.12.
> 
> Fortunately fuzzers like syzkaller still know how to run this code,
> otherwise it would be no fun.
> 
> Setting skb->sk without skb->destructor leads to all kinds of
> bugs, we now prefer to be very strict about it.
> 
> Ideally here we would use skb_set_owner() but this helper does not exist yet,
> only CAN seems to have a private helper for that.
> 
> Fixes: 376c7311bdb6 ("net: add a temporary sanity check in skb_orphan()")
> Signed-off-by: Eric Dumazet 
> Reported-by: Andrey Konovalov 

Applied and queued up for -stable, thanks Eric.


Re: [PATCH 00/21] Netfilter updates for net-next

2017-02-12 Thread David Miller
From: Pablo Neira Ayuso 
Date: Sun, 12 Feb 2017 20:42:32 +0100

> The following patchset contains Netfilter updates for your net-next
> tree, most relevantly they are:
 ..
> You can pull these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git

Pulled, I really like the RULE_ID generation count stuff for
userspace.

Thanks.


Re: [PATCH v2 net] bpf: introduce BPF_F_ALLOW_OVERRIDE flag

2017-02-12 Thread David Miller
From: Alexei Starovoitov 
Date: Fri, 10 Feb 2017 20:28:24 -0800

> If BPF_F_ALLOW_OVERRIDE flag is used in BPF_PROG_ATTACH command
> to the given cgroup the descendent cgroup will be able to override
> effective bpf program that was inherited from this cgroup.
> By default it's not passed, therefore override is disallowed.
> 
> Examples:
> 1.
> prog X attached to /A with default
> prog Y fails to attach to /A/B and /A/B/C
> Everything under /A runs prog X
> 
> 2.
> prog X attached to /A with allow_override.
> prog Y fails to attach to /A/B with default (non-override)
> prog M attached to /A/B with allow_override.
> Everything under /A/B runs prog M only.
> 
> 3.
> prog X attached to /A with allow_override.
> prog Y fails to attach to /A with default.
> The user has to detach first to switch the mode.
> 
> In the future this behavior may be extended with a chain of
> non-overridable programs.
> 
> Also fix the bug where detach from cgroup where nothing is attached
> was not throwing error. Return ENOENT in such case.
> 
> Add several testcases and adjust libbpf.
> 
> Fixes: 3007098494be ("cgroup: add support for eBPF programs")
> Signed-off-by: Alexei Starovoitov 

Applied.
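
The three quoted examples can be replayed against a toy userspace model. This is a sketch under assumptions, not the kernel implementation: toy_cgroup, toy_attach(), toy_detach() and toy_effective() are hypothetical names, programs are plain integers (0 means nothing attached), and only the rules stated in the quoted commit message are modeled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one cgroup and the program attached directly to it. */
struct toy_cgroup {
	struct toy_cgroup *parent;
	int prog;             /* 0 = nothing attached here */
	bool allow_override;  /* mode the program was attached with */
};

/* Walk upwards and return the nearest cgroup that has a program attached. */
static const struct toy_cgroup *toy_nearest_attached(const struct toy_cgroup *cg)
{
	for (; cg; cg = cg->parent)
		if (cg->prog)
			return cg;
	return NULL;
}

/* Program that would run for this cgroup, or 0 if none. */
static int toy_effective(const struct toy_cgroup *cg)
{
	const struct toy_cgroup *owner = toy_nearest_attached(cg);

	return owner ? owner->prog : 0;
}

static int toy_attach(struct toy_cgroup *cg, int prog, bool allow_override)
{
	const struct toy_cgroup *anc;

	assert(cg != NULL && prog != 0);
	anc = toy_nearest_attached(cg->parent);

	/* Example 1: an ancestor attached without allow_override blocks descendants. */
	if (anc && !anc->allow_override)
		return -EPERM;
	/* Example 2: under an overridable ancestor, only overridable programs attach. */
	if (anc && !allow_override)
		return -EPERM;
	/* Example 3: switching mode on the same cgroup requires a detach first. */
	if (cg->prog && cg->allow_override != allow_override)
		return -EPERM;

	cg->prog = prog;
	cg->allow_override = allow_override;
	return 0;
}

static int toy_detach(struct toy_cgroup *cg)
{
	assert(cg != NULL);
	if (!cg->prog)
		return -ENOENT;  /* nothing attached, as in the quoted fix */
	cg->prog = 0;
	return 0;
}
```

With /A, /A/B and /A/B/C chained through parent pointers, attaching X to /A in default mode makes every later attach below /A fail with -EPERM, which matches example 1.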


[lkp-robot] [xdp] 543d41bf78: INFO:suspicious_RCU_usage

2017-02-12 Thread kernel test robot

FYI, we noticed the following commit:

commit: 543d41bf78792e858e6f6598945d307ff808b7fc ("xdp: Infrastructure to generalize XDP")
url: https://github.com/0day-ci/linux/commits/Tom-Herbert/xdp-Generalize-XDP/20170209-092238


in testcase: trinity
with following parameters:

runtime: 300s

test-description: Trinity is a linux system call fuzz tester.
test-url: http://codemonkey.org.uk/projects/trinity/


on test machine: qemu-system-i386 -enable-kvm -smp 2 -m 320M

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


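+-----------------------------------------------------+------------+------------+
|                                                     | df6dd79be8 | 543d41bf78 |
+-----------------------------------------------------+------------+------------+
| boot_successes                                      | 10         | 0          |
| boot_failures                                       | 2          | 12         |
| WARNING:at_arch/x86/mm/dump_pagetables.c:#note_page | 2          | 2          |
| INFO:suspicious_RCU_usage                           | 0          | 12         |
+-----------------------------------------------------+------------+------------+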



[6.814497] [ INFO: suspicious RCU usage. ]
[6.814990] 4.10.0-rc7-01379-g543d41b #1 Not tainted
[6.815618] ---
[6.816107] net/core/xdp.c:201 suspicious rcu_dereference_check() usage!
[6.817090] 
[6.817090] other info that might help us debug this:
[6.817090] 
[6.818000] 
[6.818000] rcu_scheduler_active = 2, debug_locks = 0
[6.818778] 1 lock held by swapper/1:
[6.819213]  #0:  (xdp_hook_mutex){+.+...}, at: [] __xdp_unregister_hooks+0x1c/0x185
[6.820199] 
[6.820199] stack backtrace:
[6.820710] CPU: 0 PID: 1 Comm: swapper Not tainted 4.10.0-rc7-01379-g543d41b #1
[6.821530] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
[6.822747] Call Trace:
[6.823052]  dump_stack+0x16/0x18
[6.823434]  lockdep_rcu_suspicious+0xdb/0xee
[6.823908]  __xdp_unregister_hooks+0x171/0x185
[6.824421]  ? __might_sleep+0x2d/0x86
[6.824848]  xdp_unregister_all_hooks+0x3a/0x3f
[6.825398]  free_netdev+0x25/0xca
[6.825801]  lance_probe+0x115/0x122
[6.826191]  probe_list2+0x20/0x41
[6.826586]  net_olddevs_init+0x42/0x4e
[6.827037]  ? probe_list2+0x41/0x41
[6.827448]  do_one_initcall+0x3c/0x184
[6.827866]  ? repair_env_string+0x12/0x54
[6.828326]  ? parse_args+0x24e/0x402
[6.828785]  ? trace_hardirqs_on+0xb/0xd
[6.829235]  kernel_init_freeable+0xe1/0x15c
[6.829729]  ? rest_init+0x10e/0x10e
[6.830134]  kernel_init+0xb/0xe5
[6.830515]  ? schedule_tail+0xc/0x4a
[6.830925]  ? rest_init+0x10e/0x10e
[6.831343]  ret_from_fork+0x21/0x2c
[6.832026] libphy: Fixed MDIO Bus: probed
[6.832650] arcnet: arcnet loaded
[6.833011] arcnet:rfc1201: RFC1201 "standard" (`a') encapsulation support loaded
[6.833856] arcnet:arc_rawmode: raw mode (`r') encapsulation support loaded

Re: linux-next: build failure after merge of the rcu tree

2017-02-12 Thread Stephen Rothwell
Hi Paul,

On Thu, 19 Jan 2017 13:54:37 -0800 Paul McKenney  wrote:
>
> On Wed, Jan 18, 2017 at 7:34 PM, Stephen Rothwell wrote:
> > Hi Paul,
> >
> > After merging the rcu tree, today's linux-next build (x86_64 allmodconfig)
> > failed like this:
> >
> > net/smc/af_smc.c:102:16: error: 'SLAB_DESTROY_BY_RCU' undeclared here (not in a function)
> >   .slab_flags = SLAB_DESTROY_BY_RCU,
> > ^
> >
> > Caused by commit
> >
> >   c7a545924ca1 ("mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU")
> >
> > interacting with commit
> >
> >   ac7138746e14 ("smc: establish new socket family")
> >
> > from the net-next tree.
> >
> > I have applied the following merge fix patch (someone will need to
> > remember to mention this to Linus):  
> 
> Thank you, Stephen!  I expect that there might be a bit more
> bikeshedding on the name, but here is hoping...  :-/

The need for this merge fix patch has gone away today.  Is that a
permanent situation, or will it come back?
-- 
Cheers,
Stephen Rothwell


[PATCH net 1/1] net: fec: fix multicast filtering hardware setup

2017-02-12 Thread Andy Duan
From: Rui Sousa 

Fix hardware setup of multicast address hash:
- Never clear the hardware hash (to avoid packet loss)
- Construct the hash register values in software and then write once
to hardware

Signed-off-by: Rui Sousa 
Signed-off-by: Fugang Duan 
---
 drivers/net/ethernet/freescale/fec_main.c | 23 +--
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 2cc552d..91a1664 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -2910,6 +2910,7 @@ static void set_multicast_list(struct net_device *ndev)
struct netdev_hw_addr *ha;
unsigned int i, bit, data, crc, tmp;
unsigned char hash;
+   unsigned int hash_high = 0, hash_low = 0;
 
if (ndev->flags & IFF_PROMISC) {
tmp = readl(fep->hwp + FEC_R_CNTRL);
@@ -2932,11 +2933,7 @@ static void set_multicast_list(struct net_device *ndev)
return;
}
 
-   /* Clear filter and add the addresses in hash register
-*/
-   writel(0, fep->hwp + FEC_GRP_HASH_TABLE_HIGH);
-   writel(0, fep->hwp + FEC_GRP_HASH_TABLE_LOW);
-
+   /* Add the addresses in hash register */
netdev_for_each_mc_addr(ha, ndev) {
/* calculate crc32 value of mac address */
crc = 0xffffffff;
@@ -2954,16 +2951,14 @@ static void set_multicast_list(struct net_device *ndev)
 */
hash = (crc >> (32 - FEC_HASH_BITS)) & 0x3f;
 
-   if (hash > 31) {
-   tmp = readl(fep->hwp + FEC_GRP_HASH_TABLE_HIGH);
-   tmp |= 1 << (hash - 32);
-   writel(tmp, fep->hwp + FEC_GRP_HASH_TABLE_HIGH);
-   } else {
-   tmp = readl(fep->hwp + FEC_GRP_HASH_TABLE_LOW);
-   tmp |= 1 << hash;
-   writel(tmp, fep->hwp + FEC_GRP_HASH_TABLE_LOW);
-   }
+   if (hash > 31)
+   hash_high |= 1 << (hash - 32);
+   else
+   hash_low |= 1 << hash;
}
+
+   writel(hash_high, fep->hwp + FEC_GRP_HASH_TABLE_HIGH);
+   writel(hash_low, fep->hwp + FEC_GRP_HASH_TABLE_LOW);
 }
 
 /* Set a MAC change in hardware. */
-- 
1.9.1
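
The "construct in software, then write once" idea from the patch above can be sketched in userspace. The names toy_mac, toy_hash_regs, toy_mc_hash() and toy_build_hash() are hypothetical, and the polynomial and the 6-bit index width are assumptions made for illustration rather than values taken from the patch text. The sketch also uses an unsigned 1 for the shifts so that bit 31 is well defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CRC32_POLY 0xEDB88320u  /* assumed reflected CRC-32 polynomial */
#define TOY_HASH_BITS  6            /* assumed: 64-entry table, 6-bit index */

struct toy_mac {
	uint8_t addr[6];
};

/* Software copy of the two 32-bit group hash registers. */
struct toy_hash_regs {
	uint32_t high;  /* would be written to FEC_GRP_HASH_TABLE_HIGH */
	uint32_t low;   /* would be written to FEC_GRP_HASH_TABLE_LOW */
};

/* Bitwise CRC over one MAC address, reduced to a table index in 0..63. */
static unsigned int toy_mc_hash(const uint8_t addr[6])
{
	uint32_t crc = 0xffffffffu;

	for (size_t i = 0; i < 6; i++) {
		uint8_t data = addr[i];

		for (int bit = 0; bit < 8; bit++, data >>= 1)
			crc = (crc >> 1) ^ (((crc ^ data) & 1u) ? TOY_CRC32_POLY : 0u);
	}
	return (crc >> (32 - TOY_HASH_BITS)) & 0x3fu;
}

/* Accumulate all addresses in local variables; the caller writes them out once. */
static struct toy_hash_regs toy_build_hash(const struct toy_mac *macs, size_t n)
{
	struct toy_hash_regs regs = { 0, 0 };

	for (size_t i = 0; i < n; i++) {
		unsigned int hash = toy_mc_hash(macs[i].addr);

		assert(hash < 64);
		if (hash > 31)
			regs.high |= UINT32_C(1) << (hash - 32);
		else
			regs.low |= UINT32_C(1) << hash;
	}
	return regs;
}
```

Because the registers are never zeroed while the list is being walked, the filter never passes through a state in which a still-subscribed group is dropped.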



Re: [PATCH net-next v1] bpf: Remove redundant ifdef

2017-02-12 Thread Wangnan (F)



On 2017/2/12 3:37, Mickaël Salaün wrote:

Remove a useless ifdef __NR_bpf as requested by Wang Nan.

Inline one-line static functions as it was in the bpf_sys.h file.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: David S. Miller 
Cc: Wang Nan 
Link: https://lkml.kernel.org/r/828ab1ff-4dcf-53ff-c97b-074adb895...@huawei.com
---
  tools/lib/bpf/bpf.c | 12 +++-
  1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 50e04cc5..2de9c386989a 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -42,21 +42,15 @@
  # endif
  #endif
  
-static __u64 ptr_to_u64(const void *ptr)
+static inline __u64 ptr_to_u64(const void *ptr)
  {
return (__u64) (unsigned long) ptr;
  }
  
-static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
-  unsigned int size)
+static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
+ unsigned int size)
  {
-#ifdef __NR_bpf
return syscall(__NR_bpf, cmd, attr, size);
-#else
-   fprintf(stderr, "No bpf syscall, kernel headers too old?\n");
-   errno = ENOSYS;
-   return -1;
-#endif
  }
  
  int bpf_create_map(enum bpf_map_type map_type, int key_size,


Acked-by: Wang Nan 

However, it is better to merge this patch with commit 
702498a1426bc95b6f49f9c5fba616110cbd3947.


Thank you.



RE: [PATCH net 1/1] net: fec: fix multicast filtering hardware setup

2017-02-12 Thread Andy Duan
From: Fabio Estevam  Sent: Saturday, February 11, 2017 5:20 AM
>To: Andy Duan 
>Cc: David S. Miller ; netdev@vger.kernel.org;
>Stephen Hemminger 
>Subject: Re: [PATCH net 1/1] net: fec: fix multicast filtering hardware setup
>
>On Fri, Feb 10, 2017 at 3:54 AM, Andy Duan  wrote:
>> Fix hardware setup of multicast address hash:
>> - Never clear the hardware hash (to avoid packet loss)
>> - Construct the hash register values in software and then write once
>> to hardware
>>
>> Signed-off-by: Fugang Duan 
>> Signed-off-by: Rui Sousa 
>
>It seems you missed to put Rui's name in the From: field.

I made some changes based on the original patch and merged it into the net tree.
I forgot to change the From: field, so I sent it again; it is not a v2.

Regards,
Andy


Re: [PATCH v1] samples/bpf: Add a .gitignore for binaries

2017-02-12 Thread David Ahern
On 2/12/17 2:23 PM, Mickaël Salaün wrote:
> diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore
> new file mode 100644
> index ..a7562a5ef4c2
> --- /dev/null
> +++ b/samples/bpf/.gitignore
> @@ -0,0 +1,32 @@
> +fds_example
> +lathist

...

Listing each target is going to be a PITA to maintain. It would be
better to put targets into a build directory (bin?) and ignore the
directory.


Re: net/packet: use-after-free in packet_rcv_fanout

2017-02-12 Thread Sowmini Varadhan
On (02/10/17 10:02), Eric Dumazet wrote:
> At least, Anoob patch is making a step into the right direction ;)
> https://patchwork.ozlabs.org/patch/726532/

I've not been able to reproduce Dmitry's panic (though I did not try
very hard either) but there's a call to fanout_release from packet_release
before the synchronize_net() - I wonder if this could end up kfree'ing f
when there are threads in the middle of dev_queue_xmit_nit().

--Sowmini
 


Re: [PATCH v4 0/3] Miscellaneous fixes for BPF (perf tree)

2017-02-12 Thread Wangnan (F)



On 2017/2/9 4:27, Mickaël Salaün wrote:

This series brings some fixes and small improvements to the BPF samples.

This is intended for the perf tree and apply on 7a5980f9c006 ("tools lib bpf:
Add missing header to the library").

Changes since v3:
* remove applied patch 1/5
* remove patch 2/5 on bpf_load_program() as requested by Wang Nan

Changes since v2:
* add this cover letter

Changes since v1:
* exclude patches not intended for the perf tree

Regards,

Mickaël Salaün (3):
   samples/bpf: Ignore already processed ELF sections
   samples/bpf: Reset global variables
   samples/bpf: Add missing header

  samples/bpf/bpf_load.c | 7 +++
  samples/bpf/tracex5_kern.c | 1 +
  2 files changed, 8 insertions(+)


Looks good to me.

Thank you.



Re: net: hix5hd2_gmac uninitialized net_device

2017-02-12 Thread Dongpo Li


On 2017/2/11 8:51, Marty Plummer wrote:
> On Fri, Feb 10, 2017 at 06:21:35PM +0800, Dongpo Li wrote:
>> I think the error "No irq resource" happened for some other reason and has
>> no relation to the info "(unnamed net_device) (uninitialized):".
>> You can add more debug info to find bug.
> Do you have any particular suggestions as to what to check out, or is
> this just a general 'debug more' instruction?
I haven't encountered such a problem, so you will need to debug what happens.

>> Yes, I agree with you that the ndev has not been initialized completely,
>> because the function "register_netdev" has not been called yet.
>> It's better to use the "dev_err" to replace the "netdev_err".
>>
> Ah, I see. So, prior to line 1266's call to register_netdev, it will
> always be uninitialized and unnamed, regardless of what is or isn't
> right elsewhere. Good to know. So, I could replace these netdev_err
> with dev_err for now, up until that point, so I can get a bit more info,
> yes?
Yes.


Regards,
Dongpo




Re: [PATCH v2 net-next 00/14] mlx4: order-0 allocations and page recycling

2017-02-12 Thread Eric Dumazet
On Sun, 2017-02-12 at 23:38 +0100, Jesper Dangaard Brouer wrote:

> Just so others understand this: The number of RX queue slots is
> indirectly the size of the page-recycle "cache" in this scheme (that
> depend on refcnt tricks to see if page can be reused).

Note that the page recycle tricks only work on some occasions.

To correctly provision hosts dealing with TCP flows, one should not rely
on page recycling or any opportunistic (non-guaranteed) behavior.

Page recycling, _if_ possible, will help to reduce system load
and thus lower latencies.

> 
> 
> > A single TCP flow easily can have more than 1024 MSS waiting in its
> > receive queue (typical receive window on linux is 6MB/2 )
> 
> So, you do need to increase the page-"cache" size, and need this for
> real-life cases, interesting.

I believe this sizing was done mostly to cope with normal system
scheduling constraints [1], reducing packet losses under incast blasts.

Sizing happened before I did my patches to switch to order-0 pages
anyway.

The fact that it allowed page-recycling to happen more often was nice of
course.


[1]
- One cannot really assume the host will always be able to process
the RX ring in time, unless perhaps CPUs are fully dedicated to the napi
polling logic.
- Recent work to shift softirqs to ksoftirqd is potentially magnifying
the problem.
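
The "more than 1024 MSS" remark quoted above can be checked with quick arithmetic. This sketch assumes a 1460-byte MSS and reads "6MB" as 6 MiB; neither value is stated in the thread, and toy_segments_in_window() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Number of full MSS-sized segments that fit in a receive window. */
static size_t toy_segments_in_window(size_t window_bytes, size_t mss)
{
	assert(mss > 0);
	return window_bytes / mss;
}
```

Under those assumptions, toy_segments_in_window(6 * 1024 * 1024 / 2, 1460) gives 2154 segments, roughly twice a 1024-slot RX ring.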






[PATCH net-next 5/9] bnxt_en: Add hardware NTUPLE filter for encapsulated packets.

2017-02-12 Thread Michael Chan
If skb_flow_dissect_flow_keys() returns with the encapsulation flag
set, pass the information to the firmware to setup the NTUPLE filter
accordingly.

Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 516c5d7..f3d829f 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -3456,6 +3456,9 @@ static int bnxt_hwrm_cfa_ntuple_filter_free(struct bnxt 
*bp,
 CFA_NTUPLE_FILTER_ALLOC_REQ_ENABLES_DST_PORT_MASK |\
 CFA_NTUPLE_FILTER_ALLOC_REQ_ENABLES_DST_ID)
 
+#define BNXT_NTP_TUNNEL_FLTR_FLAG  \
+   CFA_NTUPLE_FILTER_ALLOC_REQ_ENABLES_TUNNEL_TYPE
+
 static int bnxt_hwrm_cfa_ntuple_filter_alloc(struct bnxt *bp,
 struct bnxt_ntuple_filter *fltr)
 {
@@ -3496,6 +3499,11 @@ static int bnxt_hwrm_cfa_ntuple_filter_alloc(struct bnxt *bp,
req.dst_ipaddr[0] = keys->addrs.v4addrs.dst;
req.dst_ipaddr_mask[0] = cpu_to_be32(0xffffffff);
}
+   if (keys->control.flags & FLOW_DIS_ENCAPSULATION) {
+   req.enables |= cpu_to_le32(BNXT_NTP_TUNNEL_FLTR_FLAG);
+   req.tunnel_type =
+   CFA_NTUPLE_FILTER_ALLOC_REQ_TUNNEL_TYPE_ANYTUNNEL;
+   }
 
req.src_port = keys->ports.src;
req.src_port_mask = cpu_to_be16(0xffff);
@@ -6869,6 +6877,7 @@ static bool bnxt_fltr_match(struct bnxt_ntuple_filter *f1,
keys1->ports.ports == keys2->ports.ports &&
keys1->basic.ip_proto == keys2->basic.ip_proto &&
keys1->basic.n_proto == keys2->basic.n_proto &&
+   keys1->control.flags == keys2->control.flags &&
ether_addr_equal(f1->src_mac_addr, f2->src_mac_addr) &&
ether_addr_equal(f1->dst_mac_addr, f2->dst_mac_addr))
return true;
@@ -6886,9 +6895,6 @@ static int bnxt_rx_flow_steer(struct net_device *dev, const struct sk_buff *skb,
int rc = 0, idx, bit_id, l2_idx = 0;
struct hlist_head *head;
 
-   if (skb->encapsulation)
-   return -EPROTONOSUPPORT;
-
if (!ether_addr_equal(dev->dev_addr, eth->h_dest)) {
struct bnxt_vnic_info *vnic = &bp->vnic_info[0];
int off = 0, j;
@@ -6927,6 +6933,11 @@ static int bnxt_rx_flow_steer(struct net_device *dev, const struct sk_buff *skb,
rc = -EPROTONOSUPPORT;
goto err_free;
}
+   if ((fkeys->control.flags & FLOW_DIS_ENCAPSULATION) &&
+   bp->hwrm_spec_code < 0x10601) {
+   rc = -EPROTONOSUPPORT;
+   goto err_free;
+   }
 
memcpy(new_fltr->dst_mac_addr, eth->h_dest, ETH_ALEN);
memcpy(new_fltr->src_mac_addr, eth->h_source, ETH_ALEN);
-- 
1.8.3.1
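
The diff above rejects encapsulated flows when bp->hwrm_spec_code is below 0x10601. Assuming the spec code packs major, minor and update numbers into one integer as (major << 16) | (minor << 8) | update, which is an assumption not stated in this patch, that threshold would read as version 1.6.1. The helper names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a three-part interface version the way the comparison suggests. */
static uint32_t toy_spec_code(unsigned int maj, unsigned int min, unsigned int upd)
{
	assert(maj <= 0xff && min <= 0xff && upd <= 0xff);
	return ((uint32_t)maj << 16) | ((uint32_t)min << 8) | (uint32_t)upd;
}

/* Mirrors the check in the diff: tunnel filters need at least 0x10601. */
static int toy_tunnel_filter_supported(uint32_t spec_code)
{
	return spec_code >= 0x10601u;
}
```

A packed integer keeps the ordering of versions, so a single comparison is enough to gate the feature.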



[PATCH net-next 6/9] bnxt_en: Do not setup PHY unless driving a single PF.

2017-02-12 Thread Michael Chan
If it is a VF or an NPAR function, the firmware call to setup the PHY
will fail.  Adding this check will prevent unnecessary firmware calls
to setup the PHY unless calling from the PF.  This will also eliminate
many unnecessary warning messages when the call from a VF or NPAR fails.

Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index f3d829f..afd1190 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -5853,6 +5853,9 @@ static int bnxt_update_phy_setting(struct bnxt *bp)
   rc);
return rc;
}
+   if (!BNXT_SINGLE_PF(bp))
+   return 0;
+
if ((link_info->autoneg & BNXT_AUTONEG_FLOW_CTRL) &&
(link_info->auto_pause_setting & BNXT_LINK_PAUSE_BOTH) !=
link_info->req_flow_ctrl)
-- 
1.8.3.1



[PATCH net-next 3/9] bnxt_en: Fix ethtool -l pre-set max combined channel.

2017-02-12 Thread Michael Chan
With commit d1e7925e6d80 ("bnxt_en: Centralize logic to reserve rings."),
ring allocation for combined rings has become stricter.  A combined
ring must now have an rx-tx ring pair.  The pre-set max. for combined
rings should now be min(rx, tx).

Fixes: d1e7925e6d80 ("bnxt_en: Centralize logic to reserve rings.")
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index 4b45b88..6903a87 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -357,7 +357,7 @@ static void bnxt_get_channels(struct net_device *dev,
int max_rx_rings, max_tx_rings, tcs;
 
bnxt_get_max_rings(bp, &max_rx_rings, &max_tx_rings, true);
-   channel->max_combined = max_t(int, max_rx_rings, max_tx_rings);
+   channel->max_combined = min_t(int, max_rx_rings, max_tx_rings);
 
if (bnxt_get_max_rings(bp, &max_rx_rings, &max_tx_rings, false)) {
max_rx_rings = 0;
-- 
1.8.3.1
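
The reasoning behind the one-line change can be sketched as follows: if every combined channel needs one rx ring and one tx ring, the scarcer resource sets the limit. toy_max_combined() is a hypothetical helper, not a driver function.

```c
#include <assert.h>

/* Upper bound on combined channels when each one consumes an rx-tx pair. */
static int toy_max_combined(int max_rx_rings, int max_tx_rings)
{
	assert(max_rx_rings >= 0 && max_tx_rings >= 0);
	return max_rx_rings < max_tx_rings ? max_rx_rings : max_tx_rings;
}
```

With 8 rx rings and 4 tx rings only 4 pairs can be formed, whereas the previous max_t() would have advertised 8.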



[PATCH net-next 7/9] bnxt_en: Print FEC settings as part of the linkup dmesg.

2017-02-12 Thread Michael Chan
Print FEC (Forward Error Correction) autoneg and encoding settings during
link up.

Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 13 -
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |  4 
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index afd1190..9f1dfbe 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -5437,7 +5437,7 @@ static void bnxt_report_link(struct bnxt *bp)
if (bp->link_info.link_up) {
const char *duplex;
const char *flow_ctrl;
-   u16 speed;
+   u16 speed, fec;
 
netif_carrier_on(bp->dev);
if (bp->link_info.duplex == BNXT_LINK_DUPLEX_FULL)
@@ -5459,6 +5459,12 @@ static void bnxt_report_link(struct bnxt *bp)
netdev_info(bp->dev, "EEE is %s\n",
bp->eee.eee_active ? "active" :
 "not active");
+   fec = bp->link_info.fec_cfg;
+   if (!(fec & PORT_PHY_QCFG_RESP_FEC_CFG_FEC_NONE_SUPPORTED))
+   netdev_info(bp->dev, "FEC autoneg %s encodings: %s\n",
+   (fec & BNXT_FEC_AUTONEG) ? "on" : "off",
+   (fec & BNXT_FEC_ENC_BASE_R) ? "BaseR" :
+(fec & BNXT_FEC_ENC_RS) ? "RS" : "None");
} else {
netif_carrier_off(bp->dev);
netdev_err(bp->dev, "NIC Link is Down\n");
@@ -5583,6 +5589,11 @@ static int bnxt_update_link(struct bnxt *bp, bool chng_link_state)
}
}
}
+
+   link_info->fec_cfg = PORT_PHY_QCFG_RESP_FEC_CFG_FEC_NONE_SUPPORTED;
+   if (bp->hwrm_spec_code >= 0x10504)
+   link_info->fec_cfg = le16_to_cpu(resp->fec_cfg);
+
/* TODO: need to add more logic to report VF link */
if (chng_link_state) {
if (link_info->phy_link_status == BNXT_LINK_LINK)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index eaed700..faf26a2 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -867,6 +867,10 @@ struct bnxt_link_info {
u16 force_link_speed;
u32 preemphasis;
u8  module_status;
+   u16 fec_cfg;
+#define BNXT_FEC_AUTONEG   PORT_PHY_QCFG_RESP_FEC_CFG_FEC_AUTONEG_ENABLED
+#define BNXT_FEC_ENC_BASE_R   PORT_PHY_QCFG_RESP_FEC_CFG_FEC_CLAUSE74_ENABLED
+#define BNXT_FEC_ENC_RS   PORT_PHY_QCFG_RESP_FEC_CFG_FEC_CLAUSE91_ENABLED
 
/* copy of requested setting from ethtool cmd */
u8  autoneg;
-- 
1.8.3.1
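
The nested conditional in the netdev_info() call above picks one encoding name from the fec_cfg bits, with BaseR taking precedence over RS. A userspace sketch of that selection follows. The TOY_* bit values are placeholders chosen for illustration (the real definitions live in bnxt_hsi.h and are not shown in this patch), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder bit values, assumed for this sketch only. */
#define TOY_FEC_NONE_SUPPORTED 0x01u
#define TOY_FEC_AUTONEG        0x04u
#define TOY_FEC_ENC_BASE_R     0x10u
#define TOY_FEC_ENC_RS         0x40u

enum toy_fec_enc { TOY_ENC_NONE, TOY_ENC_BASE_R, TOY_ENC_RS };

/* The FEC line is only printed when FEC is supported at all. */
static int toy_fec_reported(uint16_t fec)
{
	return !(fec & TOY_FEC_NONE_SUPPORTED);
}

/* Same precedence as the diff: BaseR first, then RS, else None. */
static enum toy_fec_enc toy_fec_encoding(uint16_t fec)
{
	if (fec & TOY_FEC_ENC_BASE_R)
		return TOY_ENC_BASE_R;
	if (fec & TOY_FEC_ENC_RS)
		return TOY_ENC_RS;
	return TOY_ENC_NONE;
}

static const char *toy_fec_enc_name(enum toy_fec_enc enc)
{
	static const char *const names[] = { "None", "BaseR", "RS" };

	assert((unsigned int)enc <= TOY_ENC_RS);
	return names[enc];
}
```

Note that when both encoding bits are set, only "BaseR" is reported, which follows from the order of the conditional operators in the diff.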



[PATCH net-next 1/9] bnxt_en: Update to firmware interface spec 1.7.0.

2017-02-12 Thread Michael Chan
The new spec has NVRAM defragmentation support which will be used in
the next patch to improve ethtool flash operation.

Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |   9 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |   5 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_hsi.h | 437 +-
 3 files changed, 363 insertions(+), 88 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index cda1c78..8ac5987 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -1,6 +1,7 @@
 /* Broadcom NetXtreme-C/E network driver.
  *
  * Copyright (c) 2014-2016 Broadcom Corporation
+ * Copyright (c) 2016-2017 Broadcom Limited
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -3974,7 +3975,7 @@ static int hwrm_ring_alloc_send_msg(struct bnxt *bp,
req.length = cpu_to_le32(bp->rx_agg_ring_mask + 1);
break;
case HWRM_RING_ALLOC_CMPL:
-   req.ring_type = RING_ALLOC_REQ_RING_TYPE_CMPL;
+   req.ring_type = RING_ALLOC_REQ_RING_TYPE_L2_CMPL;
req.length = cpu_to_le32(bp->cp_ring_mask + 1);
if (bp->flags & BNXT_FLAG_USING_MSIX)
req.int_mode = RING_ALLOC_REQ_INT_MODE_MSIX;
@@ -3993,7 +3994,7 @@ static int hwrm_ring_alloc_send_msg(struct bnxt *bp,
 
if (rc || err) {
switch (ring_type) {
-   case RING_FREE_REQ_RING_TYPE_CMPL:
+   case RING_FREE_REQ_RING_TYPE_L2_CMPL:
netdev_err(bp->dev, "hwrm_ring_alloc cp failed. rc:%x 
err:%x\n",
   rc, err);
return -1;
@@ -4137,7 +4138,7 @@ static int hwrm_ring_free_send_msg(struct bnxt *bp,
 
if (rc || error_code) {
switch (ring_type) {
-   case RING_FREE_REQ_RING_TYPE_CMPL:
+   case RING_FREE_REQ_RING_TYPE_L2_CMPL:
netdev_err(bp->dev, "hwrm_ring_free cp failed. rc:%d\n",
   rc);
return rc;
@@ -4226,7 +4227,7 @@ static void bnxt_hwrm_ring_free(struct bnxt *bp, bool 
close_path)
 
if (ring->fw_ring_id != INVALID_HW_RING_ID) {
hwrm_ring_free_send_msg(bp, ring,
-   RING_FREE_REQ_RING_TYPE_CMPL,
+   RING_FREE_REQ_RING_TYPE_L2_CMPL,
INVALID_HW_RING_ID);
ring->fw_ring_id = INVALID_HW_RING_ID;
bp->grp_info[i].cp_fw_ring_id = INVALID_HW_RING_ID;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 9f07b9c..eaed700 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -1,6 +1,7 @@
 /* Broadcom NetXtreme-C/E network driver.
  *
  * Copyright (c) 2014-2016 Broadcom Corporation
+ * Copyright (c) 2016-2017 Broadcom Limited
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -11,10 +12,10 @@
 #define BNXT_H
 
 #define DRV_MODULE_NAME    "bnxt_en"
-#define DRV_MODULE_VERSION "1.6.0"
+#define DRV_MODULE_VERSION "1.7.0"
 
 #define DRV_VER_MAJ    1
-#define DRV_VER_MIN    6
+#define DRV_VER_MIN    7
 #define DRV_VER_UPD    0
 
 struct tx_bd {
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_hsi.h 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_hsi.h
index 5df32ab..6e275c2 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_hsi.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_hsi.h
@@ -11,12 +11,12 @@
 #ifndef BNXT_HSI_H
 #define BNXT_HSI_H
 
-/* HSI and HWRM Specification 1.6.1 */
+/* HSI and HWRM Specification 1.7.0 */
 #define HWRM_VERSION_MAJOR 1
-#define HWRM_VERSION_MINOR 6
-#define HWRM_VERSION_UPDATE    1
+#define HWRM_VERSION_MINOR 7
+#define HWRM_VERSION_UPDATE    0
 
-#define HWRM_VERSION_STR   "1.6.1"
+#define HWRM_VERSION_STR   "1.7.0"
 /*
  * Following is the signature for HWRM message field that indicates not
  * applicable (All F's). Need to cast it the size of the field if needed.
@@ -834,20 +834,32 @@ struct hwrm_func_qcfg_output {
__le32 min_bw;
#define FUNC_QCFG_RESP_MIN_BW_BW_VALUE_MASK 0xfffUL
#define FUNC_QCFG_RESP_MIN_BW_BW_VALUE_SFT  0
-   #define FUNC_QCFG_RESP_MIN_BW_RSVD  0x1000UL
+   #define FUNC_QCFG_RESP_MIN_BW_SCALE 0x1000UL
+   #define FUNC_QCFG_RESP_MIN_BW_SCALE_BITS   (0x0UL << 28)
+   #define FUNC_QCFG_RESP_MIN_BW_SCALE_BYTES  (0

[PATCH net-next 2/9] bnxt_en: Retry failed NVM_INSTALL_UPDATE with defragmentation flag.

2017-02-12 Thread Michael Chan
From: Kshitij Soni 

If the HWRM_NVM_INSTALL_UPDATE command fails with the error code
NVM_INSTALL_UPDATE_CMD_ERR_CODE_FRAG_ERR, retry the command with
a new flag to allow defragmentation.  Since we are checking the
response for error code, we also need to take the mutex until
we finish reading the response.

Signed-off-by: Kshitij Soni 
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 32 ++-
 1 file changed, 26 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index 7aa248d..4b45b88 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -1578,17 +1578,37 @@ static int bnxt_flash_package_from_file(struct 
net_device *dev,
bnxt_hwrm_cmd_hdr_init(bp, &install, HWRM_NVM_INSTALL_UPDATE, -1, -1);
install.install_type = cpu_to_le32(install_type);
 
-   rc = hwrm_send_message(bp, &install, sizeof(install),
-  INSTALL_PACKAGE_TIMEOUT);
-   if (rc)
-   return -EOPNOTSUPP;
+   mutex_lock(&bp->hwrm_cmd_lock);
+   rc = _hwrm_send_message(bp, &install, sizeof(install),
+   INSTALL_PACKAGE_TIMEOUT);
+   if (rc) {
+   rc = -EOPNOTSUPP;
+   goto flash_pkg_exit;
+   }
+
+   if (resp->error_code) {
+   u8 error_code = ((struct hwrm_err_output *)resp)->cmd_err;
+
+   if (error_code == NVM_INSTALL_UPDATE_CMD_ERR_CODE_FRAG_ERR) {
+   install.flags |= cpu_to_le16(
+  NVM_INSTALL_UPDATE_REQ_FLAGS_ALLOWED_TO_DEFRAG);
+   rc = _hwrm_send_message(bp, &install, sizeof(install),
+   INSTALL_PACKAGE_TIMEOUT);
+   if (rc) {
+   rc = -EOPNOTSUPP;
+   goto flash_pkg_exit;
+   }
+   }
+   }
 
if (resp->result) {
netdev_err(dev, "PKG install error = %d, problem_item = %d\n",
   (s8)resp->result, (int)resp->problem_item);
-   return -ENOPKG;
+   rc = -ENOPKG;
}
-   return 0;
+flash_pkg_exit:
+   mutex_unlock(&bp->hwrm_cmd_lock);
+   return rc;
 }
 
 static int bnxt_flash_device(struct net_device *dev,
-- 
1.8.3.1
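
The retry logic described in the commit message above can be sketched outside the kernel as follows. This is a minimal model under stated assumptions, not the bnxt_en code: the enum values, the `send_fn` callback, the toy lock, and `fake_send` are all hypothetical stand-ins. What it shows is the shape of the change: the lock is held across both the send and the inspection of the response, and a fragmentation error triggers exactly one retry with the defragmentation flag set.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the bnxt_en definitions. */
enum { ERR_NONE = 0, ERR_FRAG = 1 };            /* firmware command error codes */
enum { FLAG_ALLOW_DEFRAG = 1u << 0 };           /* request flag */
enum { INSTALL_OK = 0, INSTALL_SEND_FAILED = -1, INSTALL_PKG_ERROR = -2 };

struct install_req  { uint16_t flags; };
struct install_resp { uint8_t cmd_err; int8_t result; };

/* Transport callback: returns 0 when a response was stored in *resp. */
typedef int (*send_fn)(const struct install_req *req, struct install_resp *resp);

/* Toy lock so the sketch has no threading dependency. */
int lock_depth;
void cmd_lock(void)   { assert(lock_depth == 0); lock_depth++; }
void cmd_unlock(void) { assert(lock_depth == 1); lock_depth--; }

int install_with_defrag_retry(send_fn xmit)
{
	struct install_req req = { 0 };
	struct install_resp resp = { 0, 0 };
	int rc = INSTALL_OK;

	/* Keep the lock until the response has been fully read. */
	cmd_lock();
	if (xmit(&req, &resp)) {
		rc = INSTALL_SEND_FAILED;
		goto out;
	}

	if (resp.cmd_err == ERR_FRAG) {
		/* Retry once, this time allowing defragmentation. */
		req.flags |= FLAG_ALLOW_DEFRAG;
		if (xmit(&req, &resp)) {
			rc = INSTALL_SEND_FAILED;
			goto out;
		}
	}

	if (resp.result)
		rc = INSTALL_PKG_ERROR;
out:
	cmd_unlock();
	return rc;
}

/* Toy firmware: reports fragmentation unless defragmentation is allowed. */
int send_calls;
int fake_send(const struct install_req *req, struct install_resp *resp)
{
	send_calls++;
	resp->cmd_err = (req->flags & FLAG_ALLOW_DEFRAG) ? ERR_NONE : ERR_FRAG;
	resp->result = resp->cmd_err ? 1 : 0;
	return 0;
}
```

With `fake_send` as the transport, the first attempt reports fragmentation and the second succeeds, so the function returns `INSTALL_OK` after two sends and leaves the lock released.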



[PATCH net-next 9/9] bnxt_en: Added PCI IDs for BCM57452 and BCM57454 ASICs

2017-02-12 Thread Michael Chan
From: Deepak Khungar 

Signed-off-by: Deepak Khungar 
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index c899d61..71f9a18 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -99,6 +99,8 @@ enum board_idx {
BCM57407_NPAR,
BCM57414_NPAR,
BCM57416_NPAR,
+   BCM57452,
+   BCM57454,
NETXTREME_E_VF,
NETXTREME_C_VF,
 };
@@ -133,6 +135,8 @@ enum board_idx {
{ "Broadcom BCM57407 NetXtreme-E Ethernet Partition" },
{ "Broadcom BCM57414 NetXtreme-E Ethernet Partition" },
{ "Broadcom BCM57416 NetXtreme-E Ethernet Partition" },
+   { "Broadcom BCM57452 NetXtreme-E 10Gb/25Gb/40Gb/50Gb Ethernet" },
+   { "Broadcom BCM57454 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb Ethernet" },
{ "Broadcom NetXtreme-E Ethernet Virtual Function" },
{ "Broadcom NetXtreme-C Ethernet Virtual Function" },
 };
@@ -168,6 +172,8 @@ enum board_idx {
{ PCI_VDEVICE(BROADCOM, 0x16ed), .driver_data = BCM57414_NPAR },
{ PCI_VDEVICE(BROADCOM, 0x16ee), .driver_data = BCM57416_NPAR },
{ PCI_VDEVICE(BROADCOM, 0x16ef), .driver_data = BCM57416_NPAR },
+   { PCI_VDEVICE(BROADCOM, 0x16f1), .driver_data = BCM57452 },
+   { PCI_VDEVICE(BROADCOM, 0x1614), .driver_data = BCM57454 },
 #ifdef CONFIG_BNXT_SRIOV
{ PCI_VDEVICE(BROADCOM, 0x16c1), .driver_data = NETXTREME_E_VF },
{ PCI_VDEVICE(BROADCOM, 0x16cb), .driver_data = NETXTREME_C_VF },
-- 
1.8.3.1



[PATCH net-next 4/9] bnxt_en: Allow NETIF_F_NTUPLE to be enabled on VFs.

2017-02-12 Thread Michael Chan
Commit ae10ae740ad2 ("bnxt_en: Add new hardware RFS mode.") has added
code to allow NTUPLE to be enabled on VFs.  So we now remove the
BNXT_VF() check in rfs_capable() to allow NTUPLE on VFs.

Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 8ac5987..516c5d7 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -6291,7 +6291,7 @@ static bool bnxt_rfs_capable(struct bnxt *bp)
 #ifdef CONFIG_RFS_ACCEL
int vnics, max_vnics, max_rss_ctxs;
 
-   if (BNXT_VF(bp) || !(bp->flags & BNXT_FLAG_MSIX_CAP))
+   if (!(bp->flags & BNXT_FLAG_MSIX_CAP))
return false;
 
vnics = 1 + bp->rx_nr_rings;
-- 
1.8.3.1



[PATCH net-next 8/9] bnxt_en: Fix bnxt_setup_tc() error message.

2017-02-12 Thread Michael Chan
Add proper punctuation to make the message clearer.

Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 9f1dfbe..c899d61 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -6833,7 +6833,7 @@ int bnxt_setup_mq_tc(struct net_device *dev, u8 tc)
int rc;
 
if (tc > bp->max_tc) {
-   netdev_err(dev, "too many traffic classes requested: %d Max 
supported is %d\n",
+   netdev_err(dev, "Too many traffic classes requested: %d. Max 
supported is %d.\n",
   tc, bp->max_tc);
return -EINVAL;
}
-- 
1.8.3.1



[PATCH net-next 0/9] bnxt_en: Misc updates.

2017-02-12 Thread Michael Chan
Miscellaneous updates include update of the firmware spec, ethtool flash
enhancement, ethtool -l minor fix, NTUPLE support enhancements, FEC
link settings message during link up, and new PCI IDs.  Please review.
Thanks.

Deepak Khungar (1):
  bnxt_en: Added PCI IDs for BCM57452 and BCM57454 ASICs

Kshitij Soni (1):
  bnxt_en: Retry failed NVM_INSTALL_UPDATE with defragmentation flag.

Michael Chan (7):
  bnxt_en: Update to firmware interface spec 1.7.0.
  bnxt_en: Fix ethtool -l pre-set max combined channel.
  bnxt_en: Allow NETIF_F_NTUPLE to be enabled on VFs.
  bnxt_en: Add hardware NTUPLE filter for encapsulated packets.
  bnxt_en: Do not setup PHY unless driving a single PF.
  bnxt_en: Print FEC settings as part of the linkup dmesg.
  bnxt_en: Fix bnxt_setup_tc() error message.

 drivers/net/ethernet/broadcom/bnxt/bnxt.c |  52 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |   9 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c |  34 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_hsi.h | 437 ++
 4 files changed, 431 insertions(+), 101 deletions(-)

-- 
1.8.3.1



linux-next: manual merge of the ipsec-next tree with the net-next tree

2017-02-12 Thread Stephen Rothwell
Hi Steffen,

Today's linux-next merge of the ipsec-next tree got a conflict in:

  net/xfrm/xfrm_policy.c

between commit:

  63fca65d0863 ("net: add confirm_neigh method to dst_ops")

from the net-next tree and commits:

  3d7d25a68ea5 ("xfrm: policy: remove garbage_collect callback")
  a2817d8b279b ("xfrm: policy: remove family field")

from the ipsec-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc net/xfrm/xfrm_policy.c
index f68d75766d51,04ed1a1ae019..
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@@ -2856,32 -2843,15 +2843,32 @@@ static struct neighbour *xfrm_neigh_loo
return dst->path->ops->neigh_lookup(dst, skb, daddr);
  }
  
 +static void xfrm_confirm_neigh(const struct dst_entry *dst, const void *daddr)
 +{
 +  const struct dst_entry *path = dst->path;
 +
 +  for (; dst != path; dst = dst->child) {
 +  const struct xfrm_state *xfrm = dst->xfrm;
 +
 +  if (xfrm->props.mode == XFRM_MODE_TRANSPORT)
 +  continue;
 +  if (xfrm->type->flags & XFRM_TYPE_REMOTE_COADDR)
 +  daddr = xfrm->coaddr;
 +  else if (!(xfrm->type->flags & XFRM_TYPE_LOCAL_COADDR))
 +  daddr = &xfrm->id.daddr;
 +  }
 +  path->ops->confirm_neigh(path, daddr);
 +}
 +
- int xfrm_policy_register_afinfo(struct xfrm_policy_afinfo *afinfo)
+ int xfrm_policy_register_afinfo(const struct xfrm_policy_afinfo *afinfo, int 
family)
  {
int err = 0;
-   if (unlikely(afinfo == NULL))
-   return -EINVAL;
-   if (unlikely(afinfo->family >= NPROTO))
+ 
+   if (WARN_ON(family >= ARRAY_SIZE(xfrm_policy_afinfo)))
return -EAFNOSUPPORT;
+ 
spin_lock(&xfrm_policy_afinfo_lock);
-   if (unlikely(xfrm_policy_afinfo[afinfo->family] != NULL))
+   if (unlikely(xfrm_policy_afinfo[family] != NULL))
err = -EEXIST;
else {
struct dst_ops *dst_ops = afinfo->dst_ops;
@@@ -2899,11 -2869,7 +2886,9 @@@
dst_ops->link_failure = xfrm_link_failure;
if (likely(dst_ops->neigh_lookup == NULL))
dst_ops->neigh_lookup = xfrm_neigh_lookup;
 +  if (likely(!dst_ops->confirm_neigh))
 +  dst_ops->confirm_neigh = xfrm_confirm_neigh;
-   if (likely(afinfo->garbage_collect == NULL))
-   afinfo->garbage_collect = xfrm_garbage_collect_deferred;
-   rcu_assign_pointer(xfrm_policy_afinfo[afinfo->family], afinfo);
+   rcu_assign_pointer(xfrm_policy_afinfo[family], afinfo);
}
spin_unlock(&xfrm_policy_afinfo_lock);
  


Re: [PATCH v2 net] bpf: introduce BPF_F_ALLOW_OVERRIDE flag

2017-02-12 Thread Andy Lutomirski
On Sun, Feb 12, 2017 at 12:01 AM, Daniel Mack  wrote:
> On 02/11/2017 05:28 AM, Alexei Starovoitov wrote:
>> If BPF_F_ALLOW_OVERRIDE flag is used in BPF_PROG_ATTACH command
>> to the given cgroup the descendent cgroup will be able to override
>> effective bpf program that was inherited from this cgroup.
>> By default it's not passed, therefore override is disallowed.
>>
>> Examples:
>> 1.
>> prog X attached to /A with default
>> prog Y fails to attach to /A/B and /A/B/C
>> Everything under /A runs prog X
>>
>> 2.
>> prog X attached to /A with allow_override.
>> prog Y fails to attach to /A/B with default (non-override)
>> prog M attached to /A/B with allow_override.
>> Everything under /A/B runs prog M only.
>>
>> 3.
>> prog X attached to /A with allow_override.
>> prog Y fails to attach to /A with default.
>> The user has to detach first to switch the mode.
>>
>> In the future this behavior may be extended with a chain of
>> non-overridable programs.
>>
>> Also fix the bug where detach from cgroup where nothing is attached
>> was not throwing error. Return ENOENT in such case.
>>
>> Add several testcases and adjust libbpf.
>>
>> Fixes: 3007098494be ("cgroup: add support for eBPF programs")
>> Signed-off-by: Alexei Starovoitov 
>
> Looks good to me.
>
> Acked-by: Daniel Mack 
>
> Let's get this into 4.10!


Agreed.

--Andy
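
The three attach examples quoted in this message can be played through with a small userspace model. This is only a sketch of the rules as the commit message states them, assuming a simplified tree walk; the struct, the function names, and the return codes are hypothetical and are not the kernel's `__cgroup_bpf_update()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model; field and function names are hypothetical, not kernel API. */
struct cg {
	struct cg *parent;
	int prog;              /* 0 means nothing is attached here */
	bool allow_override;   /* mode the local program was attached with */
};

enum { ATTACH_OK = 0, ATTACH_DENIED = -1, DETACH_NOTHING = -2 };

/* Nearest ancestor (excluding c) that has a program attached, or NULL. */
const struct cg *inherited_from(const struct cg *c)
{
	for (c = c->parent; c; c = c->parent)
		if (c->prog)
			return c;
	return NULL;
}

int cg_attach(struct cg *c, int prog, bool allow_override)
{
	const struct cg *anc = inherited_from(c);

	/* An inherited non-overridable program blocks any descendant attach. */
	if (anc && !anc->allow_override)
		return ATTACH_DENIED;
	/* Below an overridable program, only overridable attaches are allowed. */
	if (anc && !allow_override)
		return ATTACH_DENIED;
	/* Switching mode on the same cgroup requires a detach first. */
	if (c->prog && c->allow_override != allow_override)
		return ATTACH_DENIED;

	c->prog = prog;
	c->allow_override = allow_override;
	return ATTACH_OK;
}

int cg_detach(struct cg *c)
{
	if (!c->prog)
		return DETACH_NOTHING;   /* models the new ENOENT behaviour */
	c->prog = 0;
	return ATTACH_OK;
}

/* Program that actually runs for c: its own, else the nearest ancestor's. */
int cg_effective(const struct cg *c)
{
	for (; c; c = c->parent)
		if (c->prog)
			return c->prog;
	return 0;
}
```

Building a chain /A, /A/B, /A/B/C and replaying examples 1 to 3 against `cg_attach()` gives the same allow and deny outcomes as listed in the quoted text, and a second `cg_detach()` on an empty cgroup reports an error instead of silently succeeding.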



>
>
> Thanks,
> Daniel
>
>
>
>> ---
>> v1->v2: disallowed overridable->non_override transition as suggested by Andy
>> added tests and fixed double detach bug
>>
>> Andy, Daniel,
>> please review and ack quickly, so it can land into 4.10.
>> ---
>>  include/linux/bpf-cgroup.h   | 13 
>>  include/uapi/linux/bpf.h |  7 +
>>  kernel/bpf/cgroup.c  | 59 +++---
>>  kernel/bpf/syscall.c | 20 
>>  kernel/cgroup.c  |  9 +++---
>>  samples/bpf/test_cgrp2_attach.c  |  2 +-
>>  samples/bpf/test_cgrp2_attach2.c | 68 
>> +---
>>  samples/bpf/test_cgrp2_sock.c|  2 +-
>>  samples/bpf/test_cgrp2_sock2.c   |  2 +-
>>  tools/lib/bpf/bpf.c  |  4 ++-
>>  tools/lib/bpf/bpf.h  |  3 +-
>>  11 files changed, 151 insertions(+), 38 deletions(-)
>>
>> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
>> index 92bc89ae7e20..c970a25d2a49 100644
>> --- a/include/linux/bpf-cgroup.h
>> +++ b/include/linux/bpf-cgroup.h
>> @@ -21,20 +21,19 @@ struct cgroup_bpf {
>>*/
>>   struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
>>   struct bpf_prog __rcu *effective[MAX_BPF_ATTACH_TYPE];
>> + bool disallow_override[MAX_BPF_ATTACH_TYPE];
>>  };
>>
>>  void cgroup_bpf_put(struct cgroup *cgrp);
>>  void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent);
>>
>> -void __cgroup_bpf_update(struct cgroup *cgrp,
>> -  struct cgroup *parent,
>> -  struct bpf_prog *prog,
>> -  enum bpf_attach_type type);
>> +int __cgroup_bpf_update(struct cgroup *cgrp, struct cgroup *parent,
>> + struct bpf_prog *prog, enum bpf_attach_type type,
>> + bool overridable);
>>
>>  /* Wrapper for __cgroup_bpf_update() protected by cgroup_mutex */
>> -void cgroup_bpf_update(struct cgroup *cgrp,
>> -struct bpf_prog *prog,
>> -enum bpf_attach_type type);
>> +int cgroup_bpf_update(struct cgroup *cgrp, struct bpf_prog *prog,
>> +   enum bpf_attach_type type, bool overridable);
>>
>>  int __cgroup_bpf_run_filter_skb(struct sock *sk,
>>   struct sk_buff *skb,
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index e5b8cf16cbaf..69f65b710b10 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -116,6 +116,12 @@ enum bpf_attach_type {
>>
>>  #define MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE
>>
>> +/* If BPF_F_ALLOW_OVERRIDE flag is used in BPF_PROG_ATTACH command
>> + * to the given target_fd cgroup the descendent cgroup will be able to
>> + * override effective bpf program that was inherited from this cgroup
>> + */
>> +#define BPF_F_ALLOW_OVERRIDE (1U << 0)
>> +
>>  #define BPF_PSEUDO_MAP_FD1
>>
>>  /* flags for BPF_MAP_UPDATE_ELEM command */
>> @@ -171,6 +177,7 @@ union bpf_attr {
>>   __u32   target_fd;  /* container object to attach 
>> to */
>>   __u32   attach_bpf_fd;  /* eBPF program to attach */
>>   __u32   attach_type;
>> + __u32   attach_flags;
>>   };
>>  } __attribute__((aligned(8)));
>>
>> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
>> index a515f7b007c6..da0f53690295 100644
>> --- a/kernel/bpf/cgroup.c
>> +++ b/kernel/bpf/cgroup.c
>> @@ -52,6 +52,7 @@ void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup 
>> *parent)
>>   e = rcu_dereference_protected(parent->bpf.effective[type],
>>

Re: [PATCH v2 net-next 00/14] mlx4: order-0 allocations and page recycling

2017-02-12 Thread Jesper Dangaard Brouer
On Sun, 12 Feb 2017 12:57:46 -0800
Eric Dumazet  wrote:

> On Sun, 2017-02-12 at 18:31 +0200, Tariq Toukan wrote:
> > On 09/02/2017 6:56 PM, Eric Dumazet wrote:  
> > >> Default, out of box.  
> > > Well. Please report :
> > >
> > > ethtool  -l eth0
> > > ethtool -g eth0  
> > $ ethtool -g p1p1
> > Ring parameters for p1p1:
> > Pre-set maximums:
> > RX:             8192
> > RX Mini:        0
> > RX Jumbo:       0
> > TX:             8192
> > Current hardware settings:
> > RX:             1024
> > RX Mini:        0
> > RX Jumbo:       0
> > TX:             512
> 
> We are using 4096 slots per RX queue, this is why I could not reproduce
> your results.

Just so others understand this: The number of RX queue slots is
indirectly the size of the page-recycle "cache" in this scheme (that
depend on refcnt tricks to see if page can be reused).  


> A single TCP flow easily can have more than 1024 MSS waiting in its
> receive queue (typical receive window on linux is 6MB/2 )

So, you do need to increase the page-"cache" size, and need this for
real-life cases, interesting.


> I mentioned that having a slightly inflated skb->truesize might have an
> impact in some workloads. (charging for 2048 bytes per MSS instead of
> 1536), but this is not related to mlx4 and should be tweaked in TCP
> stack instead, since this 2048 bytes (half a page on x86) strategy is
> now well spread.


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


[PATCH 0/2] IPv4-mapped on wire, :: dst address issue

2017-02-12 Thread Jonathan T. Leighton
Under some circumstances IPv6 datagrams are sent with IPv4-mapped IPv6
addresses as the source. Given an IPv6 socket bound to an IPv4-mapped
IPv6 address, and an IPv6 destination address, both TCP and UDP will
send packets using the IPv4-mapped IPv6 address as the source. Per
RFC 6890 (Table 20), IPv4-mapped IPv6 source addresses are not allowed
in an IP datagram. The problem can be observed by attempting to
connect() either a TCP or UDP socket, or by using sendmsg() with a UDP
socket. The patch is intended to correct this issue for all socket
types.

Linux follows the BSD convention that an IPv6 destination address
specified as in6addr_any is converted to the loopback address.
Currently, neither TCP nor UDP considers the possibility that the source
address is an IPv4-mapped IPv6 address, and both assume that the appropriate
loopback address is ::1. The patch adds a check on whether or not the
source address is an IPv4-mapped IPv6 address and then sets the
destination address to either ::ffff:127.0.0.1 or ::1, as appropriate.

Jon

Jonathan T. Leighton (2):
  ipv6: Inhibit IPv4-mapped src address on the wire.
  ipv6: Handle IPv4-mapped src to in6addr_any dst.

 net/ipv6/datagram.c   | 14 +-
 net/ipv6/ip6_output.c |  3 +++
 net/ipv6/tcp_ipv6.c   | 11 ---
 net/ipv6/udp.c|  4 
 4 files changed, 24 insertions(+), 8 deletions(-)

-- 
2.7.4



[PATCH 1/2] ipv6: Inhibit IPv4-mapped src address on the wire.

2017-02-12 Thread Jonathan T. Leighton
This patch adds a check for the problematic case of an IPv4-mapped IPv6
source address and a destination address that is neither an IPv4-mapped
IPv6 address nor in6addr_any, and returns an appropriate error. The
check is done before returning from looking up the route.

Signed-off-by: Jonathan T. Leighton 
---
 net/ipv6/ip6_output.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index a75871c..d0f51b4 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1022,6 +1022,9 @@ static int ip6_dst_lookup_tail(struct net *net, const 
struct sock *sk,
}
}
 #endif
+   if (ipv6_addr_v4mapped(&fl6->saddr) &&
+   !(ipv6_addr_v4mapped(&fl6->daddr) || ipv6_addr_any(&fl6->daddr)))
+   return -EAFNOSUPPORT;
 
return 0;
 
-- 
2.7.4
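
The condition added to ip6_dst_lookup_tail() can be tried from userspace with the standard `IN6_IS_ADDR_*` macros. The sketch below is an illustration only: `check_src_dst()` and `check_src_dst_str()` are hypothetical helpers, and the documentation-range addresses in the usage note are arbitrary examples.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Hypothetical userspace mirror of the check in the patch: reject an
 * IPv4-mapped source unless the destination is also IPv4-mapped or is
 * the unspecified address (::).
 */
int check_src_dst(const struct in6_addr *saddr, const struct in6_addr *daddr)
{
	assert(saddr != NULL && daddr != NULL);

	if (IN6_IS_ADDR_V4MAPPED(saddr) &&
	    !(IN6_IS_ADDR_V4MAPPED(daddr) || IN6_IS_ADDR_UNSPECIFIED(daddr)))
		return -EAFNOSUPPORT;

	return 0;
}

/* Convenience wrapper for textual addresses; -EINVAL on unparsable input. */
int check_src_dst_str(const char *src, const char *dst)
{
	struct in6_addr s, d;

	if (inet_pton(AF_INET6, src, &s) != 1 ||
	    inet_pton(AF_INET6, dst, &d) != 1)
		return -EINVAL;

	return check_src_dst(&s, &d);
}
```

For instance, a source of `::ffff:192.0.2.1` with destination `2001:db8::1` is rejected with `-EAFNOSUPPORT`, while the same source paired with `::ffff:198.51.100.7` or with `::` passes.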



[PATCH 2/2] ipv6: Handle IPv4-mapped src to in6addr_any dst.

2017-02-12 Thread Jonathan T. Leighton
This patch adds a check on the type of the source address for the case
where the destination address is in6addr_any. If the source is an
IPv4-mapped IPv6 source address, the destination is changed to
::ffff:127.0.0.1, and otherwise the destination is changed to ::1. This
is done in three locations to handle UDP calls to either connect() or
sendmsg() and TCP calls to connect(). Note that udpv6_sendmsg() delays
handling an in6addr_any destination until very late, so the patch only
needs to handle the case where the source is an IPv4-mapped IPv6
address.

Signed-off-by: Jonathan T. Leighton 
---
 net/ipv6/datagram.c | 14 +-
 net/ipv6/tcp_ipv6.c | 11 ---
 net/ipv6/udp.c  |  4 
 3 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index a3eaafd..eec27f8 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -167,18 +167,22 @@ int __ip6_datagram_connect(struct sock *sk, struct 
sockaddr *uaddr,
if (np->sndflow)
fl6_flowlabel = usin->sin6_flowinfo & IPV6_FLOWINFO_MASK;
 
-   addr_type = ipv6_addr_type(&usin->sin6_addr);
-
-   if (addr_type == IPV6_ADDR_ANY) {
+   if (ipv6_addr_any(&usin->sin6_addr)) {
/*
 *  connect to self
 */
-   usin->sin6_addr.s6_addr[15] = 0x01;
+   if (ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr))
+   ipv6_addr_set_v4mapped(htonl(INADDR_LOOPBACK),
+  &usin->sin6_addr);
+   else
+   usin->sin6_addr = in6addr_loopback;
}
 
+   addr_type = ipv6_addr_type(&usin->sin6_addr);
+
daddr = &usin->sin6_addr;
 
-   if (addr_type == IPV6_ADDR_MAPPED) {
+   if (addr_type & IPV6_ADDR_MAPPED) {
struct sockaddr_in sin;
 
if (__ipv6_only_sock(sk)) {
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index b5d2721..21c7199 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -149,8 +149,13 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr 
*uaddr,
 *  connect() to INADDR_ANY means loopback (BSD'ism).
 */
 
-   if (ipv6_addr_any(&usin->sin6_addr))
-   usin->sin6_addr.s6_addr[15] = 0x1;
+   if (ipv6_addr_any(&usin->sin6_addr)) {
+   if (ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr))
+   ipv6_addr_set_v4mapped(htonl(INADDR_LOOPBACK),
+  &usin->sin6_addr);
+   else
+   usin->sin6_addr = in6addr_loopback;
+   }
 
addr_type = ipv6_addr_type(&usin->sin6_addr);
 
@@ -189,7 +194,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr 
*uaddr,
 *  TCP over IPv4
 */
 
-   if (addr_type == IPV6_ADDR_MAPPED) {
+   if (addr_type & IPV6_ADDR_MAPPED) {
u32 exthdrlen = icsk->icsk_ext_hdr_len;
struct sockaddr_in sin;
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index df71ba0..4e4c401 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1046,6 +1046,10 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
if (addr_len < SIN6_LEN_RFC2133)
return -EINVAL;
daddr = &sin6->sin6_addr;
+   if (ipv6_addr_any(daddr) &&
+   ipv6_addr_v4mapped(&np->saddr))
+   ipv6_addr_set_v4mapped(htonl(INADDR_LOOPBACK),
+  daddr);
break;
case AF_INET:
goto do_udp_sendmsg;
-- 
2.7.4
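
The destination rewrite that this patch applies in three places follows one rule, which can be sketched in userspace. The helper names below are hypothetical and the code only models the address selection, not the socket paths touched by the diff.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Hypothetical helper: if the destination is the unspecified address,
 * replace it with a loopback address of the same flavour as the source.
 */
void fixup_any_dst(const struct in6_addr *saddr, struct in6_addr *daddr)
{
	assert(saddr != NULL && daddr != NULL);

	if (!IN6_IS_ADDR_UNSPECIFIED(daddr))
		return;

	if (IN6_IS_ADDR_V4MAPPED(saddr)) {
		/* Build ::ffff:127.0.0.1 byte by byte. */
		memset(daddr, 0, sizeof(*daddr));
		daddr->s6_addr[10] = 0xff;
		daddr->s6_addr[11] = 0xff;
		daddr->s6_addr[12] = 127;
		daddr->s6_addr[15] = 1;
	} else {
		*daddr = in6addr_loopback;   /* ::1 */
	}
}

/* Text-in, text-out wrapper for experimenting; returns 0 on success. */
int fixup_any_dst_str(const char *src, const char *dst, char *out, size_t outlen)
{
	struct in6_addr s, d;

	if (inet_pton(AF_INET6, src, &s) != 1 ||
	    inet_pton(AF_INET6, dst, &d) != 1)
		return -1;

	fixup_any_dst(&s, &d);
	return inet_ntop(AF_INET6, &d, out, (socklen_t)outlen) ? 0 : -1;
}
```

A destination that is not `::` is left untouched, which matches the patch: only the in6addr_any case is rewritten.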



[PATCH net] net/llc: avoid BUG_ON() in skb_orphan()

2017-02-12 Thread Eric Dumazet
From: Eric Dumazet 

It seems nobody used LLC since linux-3.12.

Fortunately fuzzers like syzkaller still know how to run this code,
otherwise it would be no fun.

Setting skb->sk without skb->destructor leads to all kinds of
bugs, we now prefer to be very strict about it.

Ideally here we would use skb_set_owner() but this helper does not exist yet,
only CAN seems to have a private helper for that.

Fixes: 376c7311bdb6 ("net: add a temporary sanity check in skb_orphan()")
Signed-off-by: Eric Dumazet 
Reported-by: Andrey Konovalov 
---
 net/llc/llc_conn.c |3 +++
 net/llc/llc_sap.c  |3 +++
 2 files changed, 6 insertions(+)

diff --git a/net/llc/llc_conn.c b/net/llc/llc_conn.c
index 
3e821daf9dd4a2fbf00550591e92b153efd4a73a..8bc5a1bd2d453542df31506f543feb64b64cdd96
 100644
--- a/net/llc/llc_conn.c
+++ b/net/llc/llc_conn.c
@@ -821,7 +821,10 @@ void llc_conn_handler(struct llc_sap *sap, struct sk_buff 
*skb)
 * another trick required to cope with how the PROCOM state
 * machine works. -acme
 */
+   skb_orphan(skb);
+   sock_hold(sk);
skb->sk = sk;
+   skb->destructor = sock_efree;
}
if (!sock_owned_by_user(sk))
llc_conn_rcv(sk, skb);
diff --git a/net/llc/llc_sap.c b/net/llc/llc_sap.c
index 
d0e1e804ebd73dcebcf2f930b921233a49b0f454..5404d0d195cc581613e356b75bd70321e617673e
 100644
--- a/net/llc/llc_sap.c
+++ b/net/llc/llc_sap.c
@@ -290,7 +290,10 @@ static void llc_sap_rcv(struct llc_sap *sap, struct 
sk_buff *skb,
 
ev->type   = LLC_SAP_EV_TYPE_PDU;
ev->reason = 0;
+   skb_orphan(skb);
+   sock_hold(sk);
skb->sk = sk;
+   skb->destructor = sock_efree;
llc_sap_state_process(sap, skb);
 }
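
The pairing that both hunks establish (orphan the buffer, take a socket reference, set the owner, set a destructor that drops the reference) can be modelled in a few lines of plain C. Everything below is a hypothetical toy, not the sk_buff API; it only illustrates why an owner without a destructor is treated as a bug.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of socket/buffer ownership; all names are hypothetical. */
struct sock_model { int refcnt; };

struct buf_model {
	struct sock_model *sk;
	void (*destructor)(struct buf_model *);
};

/* Destructor that gives back the reference taken when the owner was set. */
void sock_release_ref(struct buf_model *b)
{
	b->sk->refcnt--;
}

/* Drop the current owner, if any, by running its destructor. */
void buf_orphan(struct buf_model *b)
{
	if (b->destructor)
		b->destructor(b);
	else
		assert(b->sk == NULL);   /* owner without destructor is the bug */

	b->sk = NULL;
	b->destructor = NULL;
}

/* The owner and its destructor are always installed together. */
void buf_set_owner(struct buf_model *b, struct sock_model *sk)
{
	buf_orphan(b);
	sk->refcnt++;
	b->sk = sk;
	b->destructor = sock_release_ref;
}
```

Because `buf_set_owner()` never sets `sk` without a destructor, a later `buf_orphan()` always has a way to release the reference, and the assertion standing in for the sanity check cannot fire.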
 




Re: [PATCH net-next v2] net: phy: Allow splitting MDIO bus/device support from PHYs

2017-02-12 Thread kbuild test robot
Hi Florian,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Florian-Fainelli/net-phy-Allow-splitting-MDIO-bus-device-support-from-PHYs/20170210-115834
config: i386-randconfig-h1-02130126 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/built-in.o: In function `unimac_mdio_remove':
>> mdio-bcm-unimac.c:(.text+0x2f6d9b): undefined reference to 
>> `mdiobus_unregister'
>> mdio-bcm-unimac.c:(.text+0x2f6db0): undefined reference to `mdiobus_free'
   drivers/built-in.o: In function `unimac_mdio_reset':
>> mdio-bcm-unimac.c:(.text+0x2f6e6a): undefined reference to 
>> `of_mdio_parse_addr'
>> mdio-bcm-unimac.c:(.text+0x2f6ef7): undefined reference to `mdiobus_read'
   drivers/built-in.o: In function `unimac_mdio_probe':
>> mdio-bcm-unimac.c:(.text+0x2f7297): undefined reference to 
>> `mdiobus_alloc_size'
>> mdio-bcm-unimac.c:(.text+0x2f7320): undefined reference to 
>> `of_mdiobus_register'
   mdio-bcm-unimac.c:(.text+0x2f73aa): undefined reference to `mdiobus_free'
   drivers/built-in.o: In function `alloc_mdio_bitbang':
>> (.text+0x2f7b28): undefined reference to `mdiobus_alloc_size'
   drivers/built-in.o: In function `free_mdio_bitbang':
>> (.text+0x2f7bc1): undefined reference to `mdiobus_free'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


[PATCH v1] samples/bpf: Add a .gitignore for binaries

2017-02-12 Thread Mickaël Salaün
Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Arnaldo Carvalho de Melo 
Cc: Daniel Borkmann 
Cc: Wang Nan 
---
 samples/bpf/.gitignore | 32 
 1 file changed, 32 insertions(+)
 create mode 100644 samples/bpf/.gitignore

diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore
new file mode 100644
index ..a7562a5ef4c2
--- /dev/null
+++ b/samples/bpf/.gitignore
@@ -0,0 +1,32 @@
+fds_example
+lathist
+lwt_len_hist
+map_perf_test
+offwaketime
+sampleip
+sockex1
+sockex2
+sockex3
+sock_example
+spintest
+tc_l2_redirect
+test_cgrp2_array_pin
+test_cgrp2_attach
+test_cgrp2_attach2
+test_cgrp2_sock
+test_cgrp2_sock2
+test_current_task_under_cgroup
+test_lru_dist
+test_overhead
+test_probe_write_user
+trace_event
+trace_output
+tracex1
+tracex2
+tracex3
+tracex4
+tracex5
+tracex6
+xdp1
+xdp2
+xdp_tx_iptunnel
-- 
2.11.0



Re: [PATCH v2 net-next 00/14] mlx4: order-0 allocations and page recycling

2017-02-12 Thread Eric Dumazet
On Sun, 2017-02-12 at 18:31 +0200, Tariq Toukan wrote:
> On 09/02/2017 6:56 PM, Eric Dumazet wrote:
> >> Default, out of box.
> > Well. Please report :
> >
> > ethtool  -l eth0
> > ethtool -g eth0
> $ ethtool -g p1p1
> Ring parameters for p1p1:
> Pre-set maximums:
> RX:             8192
> RX Mini:        0
> RX Jumbo:       0
> TX:             8192
> Current hardware settings:
> RX:             1024
> RX Mini:        0
> RX Jumbo:       0
> TX:             512

We are using 4096 slots per RX queue, this is why I could not reproduce
your results.

A single TCP flow easily can have more than 1024 MSS waiting in its
receive queue (typical receive window on linux is 6MB/2 )

I mentioned that having a slightly inflated skb->truesize might have an
impact in some workloads. (charging for 2048 bytes per MSS instead of
1536), but this is not related to mlx4 and should be tweaked in TCP
stack instead, since this 2048 bytes (half a page on x86) strategy is
now well spread.
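
The arithmetic behind the two points above can be checked with a short calculation. This is a rough sketch under assumptions that are not in the message: "6MB/2" is read as 3 MiB, and the MSS is taken as 1460 bytes. The function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Number of full-sized segments that fit in a receive window. */
uint32_t segments_in_window(uint32_t window_bytes, uint32_t mss)
{
	assert(mss > 0);
	return window_bytes / mss;
}

/* Memory charged when every segment is accounted at a fixed truesize. */
uint64_t charged_bytes(uint32_t segments, uint32_t truesize)
{
	return (uint64_t)segments * truesize;
}
```

Under those assumptions, `segments_in_window(3 * 1024 * 1024, 1460)` is 2154, which is more than the 1024 RX slots in the reported ring configuration. Charging 2048 instead of 1536 bytes per segment raises the accounted memory for those segments from 3308544 to 4411392 bytes.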







[PATCH] net: nuvoton: w90p910: use new api ethtool_{get|set}_link_ksettings

2017-02-12 Thread Philippe Reynes
The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone may test this patch.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/nuvoton/w90p910_ether.c |   14 --
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/nuvoton/w90p910_ether.c 
b/drivers/net/ethernet/nuvoton/w90p910_ether.c
index 119f6dc..9709c8c 100644
--- a/drivers/net/ethernet/nuvoton/w90p910_ether.c
+++ b/drivers/net/ethernet/nuvoton/w90p910_ether.c
@@ -874,16 +874,18 @@ static void w90p910_get_drvinfo(struct net_device *dev,
strlcpy(info->version, DRV_MODULE_VERSION, sizeof(info->version));
 }
 
-static int w90p910_get_settings(struct net_device *dev, struct ethtool_cmd 
*cmd)
+static int w90p910_get_link_ksettings(struct net_device *dev,
+ struct ethtool_link_ksettings *cmd)
 {
struct w90p910_ether *ether = netdev_priv(dev);
-   return mii_ethtool_gset(&ether->mii, cmd);
+   return mii_ethtool_get_link_ksettings(&ether->mii, cmd);
 }
 
-static int w90p910_set_settings(struct net_device *dev, struct ethtool_cmd 
*cmd)
+static int w90p910_set_link_ksettings(struct net_device *dev,
+ const struct ethtool_link_ksettings *cmd)
 {
struct w90p910_ether *ether = netdev_priv(dev);
-   return mii_ethtool_sset(&ether->mii, cmd);
+   return mii_ethtool_set_link_ksettings(&ether->mii, cmd);
 }
 
 static int w90p910_nway_reset(struct net_device *dev)
@@ -899,11 +901,11 @@ static u32 w90p910_get_link(struct net_device *dev)
 }
 
 static const struct ethtool_ops w90p910_ether_ethtool_ops = {
-   .get_settings   = w90p910_get_settings,
-   .set_settings   = w90p910_set_settings,
.get_drvinfo= w90p910_get_drvinfo,
.nway_reset = w90p910_nway_reset,
.get_link   = w90p910_get_link,
+   .get_link_ksettings = w90p910_get_link_ksettings,
+   .set_link_ksettings = w90p910_set_link_ksettings,
 };
 
 static const struct net_device_ops w90p910_ether_netdev_ops = {
-- 
1.7.4.4



Re: [PATCH net-next v4 1/2] qed: Add infrastructure for PTP support.

2017-02-12 Thread Richard Cochran
On Sun, Feb 12, 2017 at 03:07:34PM +, Mintz, Yuval wrote:
> Your algorithm ignores the HW limitation. Consider (ppb == 1): 
> your logic would output N == 7, *M == 70,
>Which has perfect accuracy [N / *M is 1 / 10^9].
> But the solution for
>'period' * 16 + 8 == 7 * 10^9
> isn't a whole number, so this result doesn't really reflect the actual
> approximation error since we couldn't configure it to HW.

Ok, so change my code to read:

/*truncate to HW resolution*/
reg = (m - 8) / 16;
m = reg * 16 + 8;

Your HW will happily accept the value of 'reg', right?

> The original would return val == 1, period == 6249; While this
> does have some error [val / (period * 16 + 8) is slightly bigger
> than 1 / 10^9, error at 18[?] digit after dot], it's the best we can
> configure for the HW.

That is *not* the best you can do:

Perfect:  1 / 10 = .1
Yours:1 / 2  = .18
Mine*:7 / 62 = .1114

[ * revised with the above change ]

Not a huge difference, but yours is not "the best we can".

Let's try another:

ppb = 40831

Perfect:  40831 / 10 = .40831
Yours:4 / 97960  = .4083299305839118
Mine: 5 / 122456 = .4083099235643823

See the difference?

Please, try the two algorithms and plot the RMS error over the
interval ppb = 1 ... 10.  The result may surprise you.

> No. In an ideal world, I would have liked optimizing everything.
> But in this world if I do find time to spend on optimizations 
> I rather do that for the stuff that matters. I.e., datapath.

As the PTP maintainer, I care about the PTP drivers.  They
should be as good as we can make them (even when the HW is as broken as
yours is).  That is why I bothered to review and to spend time
thinking about your problem.  I especially care about having good
examples in the tree, since this stuff will inevitably get copied by
new driver authors.  It is wonderful that your data path is so very
optimized, but that is no excuse for poor PTP code.

Thanks,
Richard
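
The truncation step proposed above can be tried outside the kernel. What
follows is only a minimal userspace sketch, assuming nothing beyond what
the thread states: the hardware takes a register value 'reg', and the
effective denominator is reg * 16 + 8. The names truncate_to_hw() and
rate_error() are hypothetical and are not taken from the qed driver.

```c
#include <assert.h>
#include <stdint.h>

/* Truncate a candidate denominator m to the largest value not above it
 * that the hardware can represent, i.e. reg * 16 + 8 as described in the
 * thread. The register value is returned through *reg. */
static uint64_t truncate_to_hw(uint64_t m, uint64_t *reg)
{
	assert(m >= 8);		/* smaller values have no HW encoding */
	*reg = (m - 8) / 16;
	return *reg * 16 + 8;
}

/* Absolute error of the rate n / m against the target ppb / 10^9. */
static double rate_error(uint64_t n, uint64_t m, uint64_t ppb)
{
	double target = (double)ppb / 1e9;
	double actual = (double)n / (double)m;

	return actual > target ? actual - target : target - actual;
}
```

For ppb == 1, passing 7 * 10^9 through truncate_to_hw() gives
reg == 437499999 and m == 6999999992, and rate_error(7, m, 1) comes out
smaller than rate_error(1, 999999992, 1). Running both algorithms
through rate_error() over a range of ppb values is one way to produce
the comparison suggested above.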


[PATCH 05/21] netfilter: nf_tables: add flush field to struct nft_set_iter

2017-02-12 Thread Pablo Neira Ayuso
This provides context to the walk callback iterator, so we know whether
the walk happens from the set flush path. This is required by the new
bitmap set type coming in a follow-up patch, which has no real struct
nft_set_ext and therefore has to allocate one based on the two-bit
compact element representation.

Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_tables.h | 1 +
 net/netfilter/nf_tables_api.c | 4 
 2 files changed, 5 insertions(+)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index ab155644d489..5830f594842e 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -203,6 +203,7 @@ struct nft_set_elem {
 struct nft_set;
 struct nft_set_iter {
u8  genmask;
+   boolflush;
unsigned intcount;
unsigned intskip;
int err;
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index c09b11eb36fc..7ae810b03462 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -3121,6 +3121,7 @@ int nf_tables_bind_set(const struct nft_ctx *ctx, struct nft_set *set,
iter.count  = 0;
iter.err= 0;
iter.fn = nf_tables_bind_check_setelem;
+   iter.flush  = false;
 
set->ops->walk(ctx, set, &iter);
if (iter.err < 0)
@@ -3374,6 +3375,7 @@ static int nf_tables_dump_set(struct sk_buff *skb, struct netlink_callback *cb)
args.iter.count = 0;
args.iter.err   = 0;
args.iter.fn= nf_tables_dump_setelem;
+   args.iter.flush = false;
set->ops->walk(&ctx, set, &args.iter);
 
nla_nest_end(skb, nest);
@@ -3939,6 +3941,7 @@ static int nf_tables_delsetelem(struct net *net, struct sock *nlsk,
struct nft_set_iter iter = {
.genmask= genmask,
.fn = nft_flush_set,
+   .flush  = true,
};
set->ops->walk(&ctx, set, &iter);
 
@@ -5089,6 +5092,7 @@ static int nf_tables_check_loops(const struct nft_ctx *ctx,
iter.count  = 0;
iter.err= 0;
iter.fn = nf_tables_loop_check_setelem;
+   iter.flush  = false;
 
set->ops->walk(ctx, set, &iter);
if (iter.err < 0)
-- 
2.1.4



[PATCH 01/21] netfilter: nft_exthdr: Add support for existence check

2017-02-12 Thread Pablo Neira Ayuso
From: Phil Sutter 

If NFT_EXTHDR_F_PRESENT is set, exthdr will not copy any header field
data into *dest, but instead set it to 1 if the header is found and 0
otherwise.

Signed-off-by: Phil Sutter 
Signed-off-by: Pablo Neira Ayuso 
---
 include/uapi/linux/netfilter/nf_tables.h |  6 ++
 net/netfilter/nft_exthdr.c   | 22 --
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
index 7b730cab99bd..53aac8b8ed6b 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -704,6 +704,10 @@ enum nft_payload_attributes {
 };
 #define NFTA_PAYLOAD_MAX   (__NFTA_PAYLOAD_MAX - 1)
 
+enum nft_exthdr_flags {
+   NFT_EXTHDR_F_PRESENT = (1 << 0),
+};
+
 /**
  * enum nft_exthdr_attributes - nf_tables IPv6 extension header expression netlink attributes
  *
@@ -711,6 +715,7 @@ enum nft_payload_attributes {
  * @NFTA_EXTHDR_TYPE: extension header type (NLA_U8)
  * @NFTA_EXTHDR_OFFSET: extension header offset (NLA_U32)
  * @NFTA_EXTHDR_LEN: extension header length (NLA_U32)
+ * @NFTA_EXTHDR_FLAGS: extension header flags (NLA_U32)
  */
 enum nft_exthdr_attributes {
NFTA_EXTHDR_UNSPEC,
@@ -718,6 +723,7 @@ enum nft_exthdr_attributes {
NFTA_EXTHDR_TYPE,
NFTA_EXTHDR_OFFSET,
NFTA_EXTHDR_LEN,
+   NFTA_EXTHDR_FLAGS,
__NFTA_EXTHDR_MAX
 };
 #define NFTA_EXTHDR_MAX(__NFTA_EXTHDR_MAX - 1)
diff --git a/net/netfilter/nft_exthdr.c b/net/netfilter/nft_exthdr.c
index 47beb3abcc9d..a89e5ab150db 100644
--- a/net/netfilter/nft_exthdr.c
+++ b/net/netfilter/nft_exthdr.c
@@ -23,6 +23,7 @@ struct nft_exthdr {
u8  offset;
u8  len;
enum nft_registers  dreg:8;
+   u8  flags;
 };
 
 static void nft_exthdr_eval(const struct nft_expr *expr,
@@ -35,8 +36,12 @@ static void nft_exthdr_eval(const struct nft_expr *expr,
int err;
 
err = ipv6_find_hdr(pkt->skb, &offset, priv->type, NULL, NULL);
-   if (err < 0)
+   if (priv->flags & NFT_EXTHDR_F_PRESENT) {
+   *dest = (err >= 0);
+   return;
+   } else if (err < 0) {
goto err;
+   }
offset += priv->offset;
 
dest[priv->len / NFT_REG32_SIZE] = 0;
@@ -52,6 +57,7 @@ static const struct nla_policy nft_exthdr_policy[NFTA_EXTHDR_MAX + 1] = {
[NFTA_EXTHDR_TYPE]  = { .type = NLA_U8 },
[NFTA_EXTHDR_OFFSET]= { .type = NLA_U32 },
[NFTA_EXTHDR_LEN]   = { .type = NLA_U32 },
+   [NFTA_EXTHDR_FLAGS] = { .type = NLA_U32 },
 };
 
 static int nft_exthdr_init(const struct nft_ctx *ctx,
@@ -59,7 +65,7 @@ static int nft_exthdr_init(const struct nft_ctx *ctx,
   const struct nlattr * const tb[])
 {
struct nft_exthdr *priv = nft_expr_priv(expr);
-   u32 offset, len;
+   u32 offset, len, flags = 0;
int err;
 
if (tb[NFTA_EXTHDR_DREG] == NULL ||
@@ -76,10 +82,20 @@ static int nft_exthdr_init(const struct nft_ctx *ctx,
if (err < 0)
return err;
 
+   if (tb[NFTA_EXTHDR_FLAGS]) {
+   err = nft_parse_u32_check(tb[NFTA_EXTHDR_FLAGS], U8_MAX, &flags);
+   if (err < 0)
+   return err;
+
+   if (flags & ~NFT_EXTHDR_F_PRESENT)
+   return -EINVAL;
+   }
+
priv->type   = nla_get_u8(tb[NFTA_EXTHDR_TYPE]);
priv->offset = offset;
priv->len= len;
priv->dreg   = nft_parse_register(tb[NFTA_EXTHDR_DREG]);
+   priv->flags  = flags;
 
return nft_validate_register_store(ctx, priv->dreg, NULL,
   NFT_DATA_VALUE, priv->len);
@@ -97,6 +113,8 @@ static int nft_exthdr_dump(struct sk_buff *skb, const struct nft_expr *expr)
goto nla_put_failure;
if (nla_put_be32(skb, NFTA_EXTHDR_LEN, htonl(priv->len)))
goto nla_put_failure;
+   if (nla_put_be32(skb, NFTA_EXTHDR_FLAGS, htonl(priv->flags)))
+   goto nla_put_failure;
return 0;
 
 nla_put_failure:
-- 
2.1.4
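
The branch added to nft_exthdr_eval() can be modelled in userspace. The
following is only a sketch: exthdr_presence_check() is a hypothetical
helper, and its find_err argument stands in for the return value of
ipv6_find_hdr(). Only the flag value is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NFT_EXTHDR_F_PRESENT	(1 << 0)	/* same value as in the patch */

/* Userspace model of the new branch. Returns true if the destination
 * register was written by the existence check, false if the caller would
 * carry on with the normal path (copy header data on success, bail out
 * on error). */
static bool exthdr_presence_check(uint8_t flags, int find_err, uint32_t *dest)
{
	assert(dest);

	if (flags & NFT_EXTHDR_F_PRESENT) {
		*dest = (find_err >= 0);	/* 1 if found, 0 otherwise */
		return true;
	}
	return false;
}
```

With the flag set, a failed lookup is no longer an error for the
expression: it simply stores 0 in the register.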



[PATCH 04/21] netfilter: nf_tables: rename deactivate_one() to flush()

2017-02-12 Thread Pablo Neira Ayuso
Although semantics are similar to deactivate() with no implicit element
lookup, this is only called from the set flush path, so better rename
this to flush().

Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_tables.h | 8 
 net/netfilter/nf_tables_api.c | 2 +-
 net/netfilter/nft_set_hash.c  | 8 
 net/netfilter/nft_set_rbtree.c| 8 
 4 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index a721bcb1210c..ab155644d489 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -260,7 +260,7 @@ struct nft_expr;
  * @insert: insert new element into set
  * @activate: activate new element in the next generation
  * @deactivate: lookup for element and deactivate it in the next generation
- * @deactivate_one: deactivate element in the next generation
+ * @flush: deactivate element in the next generation
  * @remove: remove element from set
  * @walk: iterate over all set elemeennts
  * @privsize: function to return size of set private data
@@ -295,9 +295,9 @@ struct nft_set_ops {
void *  (*deactivate)(const struct net *net,
  const struct nft_set *set,
  const struct nft_set_elem *elem);
-   bool(*deactivate_one)(const struct net *net,
- const struct nft_set *set,
- void *priv);
+   bool(*flush)(const struct net *net,
+const struct nft_set *set,
+void *priv);
void(*remove)(const struct net *net,
  const struct nft_set *set,
  const struct nft_set_elem *elem);
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 790ffed82930..c09b11eb36fc 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -3898,7 +3898,7 @@ static int nft_flush_set(const struct nft_ctx *ctx,
if (!trans)
return -ENOMEM;
 
-   if (!set->ops->deactivate_one(ctx->net, set, elem->priv)) {
+   if (!set->ops->flush(ctx->net, set, elem->priv)) {
err = -ENOENT;
goto err1;
}
diff --git a/net/netfilter/nft_set_hash.c b/net/netfilter/nft_set_hash.c
index bb157bd47fe8..2f10ac3b1b10 100644
--- a/net/netfilter/nft_set_hash.c
+++ b/net/netfilter/nft_set_hash.c
@@ -167,8 +167,8 @@ static void nft_hash_activate(const struct net *net, const struct nft_set *set,
nft_set_elem_clear_busy(&he->ext);
 }
 
-static bool nft_hash_deactivate_one(const struct net *net,
-   const struct nft_set *set, void *priv)
+static bool nft_hash_flush(const struct net *net,
+  const struct nft_set *set, void *priv)
 {
struct nft_hash_elem *he = priv;
 
@@ -195,7 +195,7 @@ static void *nft_hash_deactivate(const struct net *net,
rcu_read_lock();
he = rhashtable_lookup_fast(&priv->ht, &arg, nft_hash_params);
if (he != NULL &&
-   !nft_hash_deactivate_one(net, set, he))
+   !nft_hash_flush(net, set, he))
he = NULL;
 
rcu_read_unlock();
@@ -398,7 +398,7 @@ static struct nft_set_ops nft_hash_ops __read_mostly = {
.insert = nft_hash_insert,
.activate   = nft_hash_activate,
.deactivate = nft_hash_deactivate,
-   .deactivate_one = nft_hash_deactivate_one,
+   .flush  = nft_hash_flush,
.remove = nft_hash_remove,
.lookup = nft_hash_lookup,
.update = nft_hash_update,
diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
index 9fbd70da1633..81b8a4c2c061 100644
--- a/net/netfilter/nft_set_rbtree.c
+++ b/net/netfilter/nft_set_rbtree.c
@@ -172,8 +172,8 @@ static void nft_rbtree_activate(const struct net *net,
nft_set_elem_change_active(net, set, &rbe->ext);
 }
 
-static bool nft_rbtree_deactivate_one(const struct net *net,
- const struct nft_set *set, void *priv)
+static bool nft_rbtree_flush(const struct net *net,
+const struct nft_set *set, void *priv)
 {
struct nft_rbtree_elem *rbe = priv;
 
@@ -214,7 +214,7 @@ static void *nft_rbtree_deactivate(const struct net *net,
parent = parent->rb_right;
continue;
}
-   nft_rbtree_deactivate_one(net, set, rbe);
+   nft_rbtree_flush(net, set, rbe

[PATCH 07/21] netfilter: nf_tables: add space notation to sets

2017-02-12 Thread Pablo Neira Ayuso
The space notation allows us to classify the set backend implementation
based on the amount of required memory. This provides an order of the
set representation scalability in terms of memory. The size field is
still left in place; the space class is used instead when userspace
provides no explicit number of elements, in which case we cannot
calculate the real memory that this set needs. This also helps us break
ties in the set backend selection routine, e.g. when two backend
implementations provide the same performance.

Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_tables.h |  2 ++
 net/netfilter/nf_tables_api.c | 22 +-
 net/netfilter/nft_set_hash.c  |  1 +
 net/netfilter/nft_set_rbtree.c|  1 +
 4 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index d76ac2f80a40..21ce50e6d0c5 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -245,10 +245,12 @@ enum nft_set_class {
  *
  * @size: required memory
  * @lookup: lookup performance class
+ * @space: memory class
  */
 struct nft_set_estimate {
unsigned intsize;
enum nft_set_class  lookup;
+   enum nft_set_class  space;
 };
 
 struct nft_set_ext;
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index fa7cd1679079..cb6ae46f6c48 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -2404,6 +2404,7 @@ nft_select_set_ops(const struct nlattr * const nla[],
bops= NULL;
best.size   = ~0;
best.lookup = ~0;
+   best.space  = ~0;
 
list_for_each_entry(ops, &nf_tables_set_ops, list) {
if ((ops->features & features) != features)
@@ -2415,14 +2416,25 @@ nft_select_set_ops(const struct nlattr * const nla[],
case NFT_SET_POL_PERFORMANCE:
if (est.lookup < best.lookup)
break;
-   if (est.lookup == best.lookup && est.size < best.size)
-   break;
+   if (est.lookup == best.lookup) {
+   if (!desc->size) {
+   if (est.space < best.space)
+   break;
+   } else if (est.size < best.size) {
+   break;
+   }
+   }
continue;
case NFT_SET_POL_MEMORY:
-   if (est.size < best.size)
-   break;
-   if (est.size == best.size && est.lookup < best.lookup)
+   if (!desc->size) {
+   if (est.space < best.space)
+   break;
+   if (est.space == best.space &&
+   est.lookup < best.lookup)
+   break;
+   } else if (est.size < best.size) {
break;
+   }
continue;
default:
break;
diff --git a/net/netfilter/nft_set_hash.c b/net/netfilter/nft_set_hash.c
index e58e7f02138b..6938bc890f31 100644
--- a/net/netfilter/nft_set_hash.c
+++ b/net/netfilter/nft_set_hash.c
@@ -385,6 +385,7 @@ static bool nft_hash_estimate(const struct nft_set_desc *desc, u32 features,
}
 
est->lookup = NFT_SET_CLASS_O_1;
+   est->space  = NFT_SET_CLASS_O_N;
 
return true;
 }
diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
index 2b6ea10c4bbd..3387ed7dd231 100644
--- a/net/netfilter/nft_set_rbtree.c
+++ b/net/netfilter/nft_set_rbtree.c
@@ -292,6 +292,7 @@ static bool nft_rbtree_estimate(const struct nft_set_desc *desc, u32 features,
est->size = nsize;
 
est->lookup = NFT_SET_CLASS_O_LOG_N;
+   est->space  = NFT_SET_CLASS_O_N;
 
return true;
 }
-- 
2.1.4
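
The comparison that nft_select_set_ops() performs after this patch can
be restated as a standalone predicate. This is a sketch only: all names
below are hypothetical, the enum ordering O(1) < O(log n) < O(n) is
assumed to mirror enum nft_set_class, and the feature check and the
default policy branch of the real function are left out.

```c
#include <assert.h>
#include <stdbool.h>

enum set_class { CLASS_O_1, CLASS_O_LOG_N, CLASS_O_N };
enum set_policy { POL_PERFORMANCE, POL_MEMORY };

struct estimate {
	unsigned int size;		/* required memory, if computable */
	enum set_class lookup;		/* lookup performance class */
	enum set_class space;		/* memory class */
};

/* Returns true if 'est' should replace the current 'best' candidate.
 * desc_size is the element count given by userspace, 0 if none. */
static bool better(enum set_policy policy, unsigned int desc_size,
		   const struct estimate *est, const struct estimate *best)
{
	assert(est && best);

	switch (policy) {
	case POL_PERFORMANCE:
		if (est->lookup < best->lookup)
			return true;
		if (est->lookup == best->lookup) {
			/* tie: fall back to memory, by class or by size */
			if (!desc_size)
				return est->space < best->space;
			return est->size < best->size;
		}
		return false;
	case POL_MEMORY:
		if (!desc_size) {
			if (est->space < best->space)
				return true;
			return est->space == best->space &&
			       est->lookup < best->lookup;
		}
		return est->size < best->size;
	}
	return false;
}
```

Since hash and rbtree both report O(n) space in this patch, the memory
policy without a set size ends up being decided by the lookup class.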



[PATCH 13/21] netfilter: nf_ct_sip: Use mod_timer_pending()

2017-02-12 Thread Pablo Neira Ayuso
From: Gao Feng 

del_timer() followed by add_timer() can be replaced by
mod_timer_pending().

Signed-off-by: Gao Feng 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_conntrack_sip.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/net/netfilter/nf_conntrack_sip.c b/net/netfilter/nf_conntrack_sip.c
index c3fc14e021ec..24174c520239 100644
--- a/net/netfilter/nf_conntrack_sip.c
+++ b/net/netfilter/nf_conntrack_sip.c
@@ -809,13 +809,11 @@ static int refresh_signalling_expectation(struct nf_conn *ct,
exp->tuple.dst.protonum != proto ||
exp->tuple.dst.u.udp.port != port)
continue;
-   if (!del_timer(&exp->timeout))
-   continue;
-   exp->flags &= ~NF_CT_EXPECT_INACTIVE;
-   exp->timeout.expires = jiffies + expires * HZ;
-   add_timer(&exp->timeout);
-   found = 1;
-   break;
+   if (mod_timer_pending(&exp->timeout, jiffies + expires * HZ)) {
+   exp->flags &= ~NF_CT_EXPECT_INACTIVE;
+   found = 1;
+   break;
+   }
}
spin_unlock_bh(&nf_conntrack_expect_lock);
return found;
-- 
2.1.4



[PATCH 15/21] netfilter: nfnetlink: get rid of u_intX_t types

2017-02-12 Thread Pablo Neira Ayuso
Use uX types instead.

Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nfnetlink.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c
index a09fa9fd8f3d..586212ebba9e 100644
--- a/net/netfilter/nfnetlink.c
+++ b/net/netfilter/nfnetlink.c
@@ -100,9 +100,9 @@ int nfnetlink_subsys_unregister(const struct nfnetlink_subsystem *n)
 }
 EXPORT_SYMBOL_GPL(nfnetlink_subsys_unregister);
 
-static inline const struct nfnetlink_subsystem *nfnetlink_get_subsys(u_int16_t type)
+static inline const struct nfnetlink_subsystem *nfnetlink_get_subsys(u16 type)
 {
-   u_int8_t subsys_id = NFNL_SUBSYS_ID(type);
+   u8 subsys_id = NFNL_SUBSYS_ID(type);
 
if (subsys_id >= NFNL_SUBSYS_COUNT)
return NULL;
@@ -111,9 +111,9 @@ static inline const struct nfnetlink_subsystem *nfnetlink_get_subsys(u_int16_t t
 }
 
 static inline const struct nfnl_callback *
-nfnetlink_find_client(u_int16_t type, const struct nfnetlink_subsystem *ss)
+nfnetlink_find_client(u16 type, const struct nfnetlink_subsystem *ss)
 {
-   u_int8_t cb_id = NFNL_MSG_TYPE(type);
+   u8 cb_id = NFNL_MSG_TYPE(type);
 
if (cb_id >= ss->cb_count)
return NULL;
@@ -185,7 +185,7 @@ static int nfnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 
{
int min_len = nlmsg_total_size(sizeof(struct nfgenmsg));
-   u_int8_t cb_id = NFNL_MSG_TYPE(nlh->nlmsg_type);
+   u8 cb_id = NFNL_MSG_TYPE(nlh->nlmsg_type);
struct nlattr *cda[ss->cb[cb_id].attr_count + 1];
struct nlattr *attr = (void *)nlh + min_len;
int attrlen = nlh->nlmsg_len - min_len;
@@ -273,7 +273,7 @@ enum {
 };
 
 static void nfnetlink_rcv_batch(struct sk_buff *skb, struct nlmsghdr *nlh,
-   u_int16_t subsys_id)
+   u16 subsys_id)
 {
struct sk_buff *oskb = skb;
struct net *net = sock_net(skb->sk);
@@ -365,7 +365,7 @@ static void nfnetlink_rcv_batch(struct sk_buff *skb, struct nlmsghdr *nlh,
 
{
int min_len = nlmsg_total_size(sizeof(struct nfgenmsg));
-   u_int8_t cb_id = NFNL_MSG_TYPE(nlh->nlmsg_type);
+   u8 cb_id = NFNL_MSG_TYPE(nlh->nlmsg_type);
struct nlattr *cda[ss->cb[cb_id].attr_count + 1];
struct nlattr *attr = (void *)nlh + min_len;
int attrlen = nlh->nlmsg_len - min_len;
@@ -439,7 +439,7 @@ static void nfnetlink_rcv_batch(struct sk_buff *skb, struct nlmsghdr *nlh,
 static void nfnetlink_rcv(struct sk_buff *skb)
 {
struct nlmsghdr *nlh = nlmsg_hdr(skb);
-   u_int16_t res_id;
+   u16 res_id;
int msglen;
 
if (nlh->nlmsg_len < NLMSG_HDRLEN ||
-- 
2.1.4



[PATCH 03/21] netfilter: nf_tables: use struct nft_set_iter in set element flush

2017-02-12 Thread Pablo Neira Ayuso
Use struct nft_set_iter directly instead of struct nft_set_dump_args,
removing the unnecessary wrapper structure.

Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_tables_api.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 3643ce345b59..790ffed82930 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -3936,15 +3936,13 @@ static int nf_tables_delsetelem(struct net *net, struct sock *nlsk,
return -EBUSY;
 
if (nla[NFTA_SET_ELEM_LIST_ELEMENTS] == NULL) {
-   struct nft_set_dump_args args = {
-   .iter   = {
-   .genmask= genmask,
-   .fn = nft_flush_set,
-   },
+   struct nft_set_iter iter = {
+   .genmask= genmask,
+   .fn = nft_flush_set,
};
-   set->ops->walk(&ctx, set, &args.iter);
+   set->ops->walk(&ctx, set, &iter);
 
-   return args.iter.err;
+   return iter.err;
}
 
nla_for_each_nested(attr, nla[NFTA_SET_ELEM_LIST_ELEMENTS], rem) {
-- 
2.1.4



[PATCH 09/21] netfilter: nft_ct: add zone id get support

2017-02-12 Thread Pablo Neira Ayuso
From: Florian Westphal 

Just like with counters the direction attribute is optional.
We set priv->dir to MAX unconditionally to avoid duplicating the assignment
for all keys with optional direction.

For keys where direction is mandatory, existing code already returns
an error.

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 include/uapi/linux/netfilter/nf_tables.h |  2 ++
 net/netfilter/nft_ct.c   | 22 +++---
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
index 53aac8b8ed6b..3e60ed78c538 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -870,6 +870,7 @@ enum nft_rt_attributes {
  * @NFT_CT_PKTS: conntrack packets
  * @NFT_CT_BYTES: conntrack bytes
  * @NFT_CT_AVGPKT: conntrack average bytes per packet
+ * @NFT_CT_ZONE: conntrack zone
  */
 enum nft_ct_keys {
NFT_CT_STATE,
@@ -889,6 +890,7 @@ enum nft_ct_keys {
NFT_CT_PKTS,
NFT_CT_BYTES,
NFT_CT_AVGPKT,
+   NFT_CT_ZONE,
 };
 
 /**
diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c
index 66a2377510e1..5bd4cdfdcda5 100644
--- a/net/netfilter/nft_ct.c
+++ b/net/netfilter/nft_ct.c
@@ -151,6 +151,18 @@ static void nft_ct_get_eval(const struct nft_expr *expr,
case NFT_CT_PROTOCOL:
*dest = nf_ct_protonum(ct);
return;
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+   case NFT_CT_ZONE: {
+   const struct nf_conntrack_zone *zone = nf_ct_zone(ct);
+
+   if (priv->dir < IP_CT_DIR_MAX)
+   *dest = nf_ct_zone_id(zone, priv->dir);
+   else
+   *dest = zone->id;
+
+   return;
+   }
+#endif
default:
break;
}
@@ -266,6 +278,7 @@ static int nft_ct_get_init(const struct nft_ctx *ctx,
int err;
 
priv->key = ntohl(nla_get_be32(tb[NFTA_CT_KEY]));
+   priv->dir = IP_CT_DIR_MAX;
switch (priv->key) {
case NFT_CT_DIRECTION:
if (tb[NFTA_CT_DIRECTION] != NULL)
@@ -333,11 +346,13 @@ static int nft_ct_get_init(const struct nft_ctx *ctx,
case NFT_CT_BYTES:
case NFT_CT_PKTS:
case NFT_CT_AVGPKT:
-   /* no direction? return sum of original + reply */
-   if (tb[NFTA_CT_DIRECTION] == NULL)
-   priv->dir = IP_CT_DIR_MAX;
len = sizeof(u64);
break;
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+   case NFT_CT_ZONE:
+   len = sizeof(u16);
+   break;
+#endif
default:
return -EOPNOTSUPP;
}
@@ -465,6 +480,7 @@ static int nft_ct_get_dump(struct sk_buff *skb, const struct nft_expr *expr)
case NFT_CT_BYTES:
case NFT_CT_PKTS:
case NFT_CT_AVGPKT:
+   case NFT_CT_ZONE:
if (priv->dir < IP_CT_DIR_MAX &&
nla_put_u8(skb, NFTA_CT_DIRECTION, priv->dir))
goto nla_put_failure;
-- 
2.1.4



[PATCH 02/21] netfilter: nf_tables: pass netns to set->ops->remove()

2017-02-12 Thread Pablo Neira Ayuso
This new parameter is required by the new bitmap set type that comes in a
follow up patch.

Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_tables.h | 3 ++-
 net/netfilter/nf_tables_api.c | 6 +++---
 net/netfilter/nft_set_hash.c  | 3 ++-
 net/netfilter/nft_set_rbtree.c| 3 ++-
 4 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 7dfdb517f0be..a721bcb1210c 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -298,7 +298,8 @@ struct nft_set_ops {
bool(*deactivate_one)(const struct net *net,
  const struct nft_set *set,
  void *priv);
-   void(*remove)(const struct nft_set *set,
+   void(*remove)(const struct net *net,
+ const struct nft_set *set,
  const struct nft_set_elem *elem);
void(*walk)(const struct nft_ctx *ctx,
struct nft_set *set,
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 57eeae63f597..3643ce345b59 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -3752,7 +3752,7 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
return 0;
 
 err6:
-   set->ops->remove(set, &elem);
+   set->ops->remove(ctx->net, set, &elem);
 err5:
kfree(trans);
 err4:
@@ -4804,7 +4804,7 @@ static int nf_tables_commit(struct net *net, struct sk_buff *skb)
nf_tables_setelem_notify(&trans->ctx, te->set,
 &te->elem,
 NFT_MSG_DELSETELEM, 0);
-   te->set->ops->remove(te->set, &te->elem);
+   te->set->ops->remove(net, te->set, &te->elem);
atomic_dec(&te->set->nelems);
te->set->ndeact--;
break;
@@ -4925,7 +4925,7 @@ static int nf_tables_abort(struct net *net, struct sk_buff *skb)
case NFT_MSG_NEWSETELEM:
te = (struct nft_trans_elem *)trans->data;
 
-   te->set->ops->remove(te->set, &te->elem);
+   te->set->ops->remove(net, te->set, &te->elem);
atomic_dec(&te->set->nelems);
break;
case NFT_MSG_DELSETELEM:
diff --git a/net/netfilter/nft_set_hash.c b/net/netfilter/nft_set_hash.c
index e36069fb76ae..bb157bd47fe8 100644
--- a/net/netfilter/nft_set_hash.c
+++ b/net/netfilter/nft_set_hash.c
@@ -203,7 +203,8 @@ static void *nft_hash_deactivate(const struct net *net,
return he;
 }
 
-static void nft_hash_remove(const struct nft_set *set,
+static void nft_hash_remove(const struct net *net,
+   const struct nft_set *set,
const struct nft_set_elem *elem)
 {
struct nft_hash *priv = nft_set_priv(set);
diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
index f06f55ee516d..9fbd70da1633 100644
--- a/net/netfilter/nft_set_rbtree.c
+++ b/net/netfilter/nft_set_rbtree.c
@@ -151,7 +151,8 @@ static int nft_rbtree_insert(const struct net *net, const struct nft_set *set,
return err;
 }
 
-static void nft_rbtree_remove(const struct nft_set *set,
+static void nft_rbtree_remove(const struct net *net,
+ const struct nft_set *set,
  const struct nft_set_elem *elem)
 {
struct nft_rbtree *priv = nft_set_priv(set);
-- 
2.1.4



[PATCH 06/21] netfilter: nf_tables: rename struct nft_set_estimate class field

2017-02-12 Thread Pablo Neira Ayuso
Use lookup as field name instead, to prepare the introduction of the
memory class in a follow up patch.

Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_tables.h |  4 ++--
 net/netfilter/nf_tables_api.c | 12 ++--
 net/netfilter/nft_set_hash.c  |  2 +-
 net/netfilter/nft_set_rbtree.c|  2 +-
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 5830f594842e..d76ac2f80a40 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -244,11 +244,11 @@ enum nft_set_class {
  *   characteristics
  *
  * @size: required memory
- * @class: lookup performance class
+ * @lookup: lookup performance class
  */
 struct nft_set_estimate {
unsigned intsize;
-   enum nft_set_class  class;
+   enum nft_set_class  lookup;
 };
 
 struct nft_set_ext;
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 7ae810b03462..fa7cd1679079 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -2401,9 +2401,9 @@ nft_select_set_ops(const struct nlattr * const nla[],
features &= NFT_SET_INTERVAL | NFT_SET_MAP | NFT_SET_TIMEOUT;
}
 
-   bops   = NULL;
-   best.size  = ~0;
-   best.class = ~0;
+   bops= NULL;
+   best.size   = ~0;
+   best.lookup = ~0;
 
list_for_each_entry(ops, &nf_tables_set_ops, list) {
if ((ops->features & features) != features)
@@ -2413,15 +2413,15 @@ nft_select_set_ops(const struct nlattr * const nla[],
 
switch (policy) {
case NFT_SET_POL_PERFORMANCE:
-   if (est.class < best.class)
+   if (est.lookup < best.lookup)
break;
-   if (est.class == best.class && est.size < best.size)
+   if (est.lookup == best.lookup && est.size < best.size)
break;
continue;
case NFT_SET_POL_MEMORY:
if (est.size < best.size)
break;
-   if (est.size == best.size && est.class < best.class)
+   if (est.size == best.size && est.lookup < best.lookup)
break;
continue;
default:
diff --git a/net/netfilter/nft_set_hash.c b/net/netfilter/nft_set_hash.c
index 2f10ac3b1b10..e58e7f02138b 100644
--- a/net/netfilter/nft_set_hash.c
+++ b/net/netfilter/nft_set_hash.c
@@ -384,7 +384,7 @@ static bool nft_hash_estimate(const struct nft_set_desc *desc, u32 features,
est->size = esize + 2 * sizeof(struct nft_hash_elem *);
}
 
-   est->class = NFT_SET_CLASS_O_1;
+   est->lookup = NFT_SET_CLASS_O_1;
 
return true;
 }
diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
index 81b8a4c2c061..2b6ea10c4bbd 100644
--- a/net/netfilter/nft_set_rbtree.c
+++ b/net/netfilter/nft_set_rbtree.c
@@ -291,7 +291,7 @@ static bool nft_rbtree_estimate(const struct nft_set_desc *desc, u32 features,
else
est->size = nsize;
 
-   est->class = NFT_SET_CLASS_O_LOG_N;
+   est->lookup = NFT_SET_CLASS_O_LOG_N;
 
return true;
 }
-- 
2.1.4



[PATCH 10/21] netfilter: nft_ct: prepare for key-dependent error unwind

2017-02-12 Thread Pablo Neira Ayuso
From: Florian Westphal 

The next patch will add ZONE_ID set support, which will need a similar
error unwind (put operation) to the one used for conntrack labels.

Prepare for this: remove the 'label_got' boolean in favor
of a switch statement that can be extended in the next patch.

As we already have that switch in the set_destroy function, move it into
a separate function and call it from the set init function as well.

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nft_ct.c | 29 +++--
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c
index 5bd4cdfdcda5..2d82df2737da 100644
--- a/net/netfilter/nft_ct.c
+++ b/net/netfilter/nft_ct.c
@@ -386,12 +386,24 @@ static int nft_ct_get_init(const struct nft_ctx *ctx,
return 0;
 }
 
+static void __nft_ct_set_destroy(const struct nft_ctx *ctx, struct nft_ct *priv)
+{
+   switch (priv->key) {
+#ifdef CONFIG_NF_CONNTRACK_LABELS
+   case NFT_CT_LABELS:
+   nf_connlabels_put(ctx->net);
+   break;
+#endif
+   default:
+   break;
+   }
+}
+
 static int nft_ct_set_init(const struct nft_ctx *ctx,
   const struct nft_expr *expr,
   const struct nlattr * const tb[])
 {
struct nft_ct *priv = nft_expr_priv(expr);
-   bool label_got = false;
unsigned int len;
int err;
 
@@ -412,7 +424,6 @@ static int nft_ct_set_init(const struct nft_ctx *ctx,
err = nf_connlabels_get(ctx->net, (len * BITS_PER_BYTE) - 1);
if (err)
return err;
-   label_got = true;
break;
 #endif
default:
@@ -431,8 +442,7 @@ static int nft_ct_set_init(const struct nft_ctx *ctx,
return 0;
 
 err1:
-   if (label_got)
-   nf_connlabels_put(ctx->net);
+   __nft_ct_set_destroy(ctx, priv);
return err;
 }
 
@@ -447,16 +457,7 @@ static void nft_ct_set_destroy(const struct nft_ctx *ctx,
 {
struct nft_ct *priv = nft_expr_priv(expr);
 
-   switch (priv->key) {
-#ifdef CONFIG_NF_CONNTRACK_LABELS
-   case NFT_CT_LABELS:
-   nf_connlabels_put(ctx->net);
-   break;
-#endif
-   default:
-   break;
-   }
-
+   __nft_ct_set_destroy(ctx, priv);
nft_ct_netns_put(ctx->net, ctx->afi->family);
 }
 
-- 
2.1.4



[PATCH 08/21] netfilter: nf_tables: add bitmap set type

2017-02-12 Thread Pablo Neira Ayuso
This patch adds a new bitmap set type. This bitmap uses two bits to
represent one element. These two bits determine the element state in the
current and the future generation that fits into the nf_tables commit
protocol. When dumping elements back to userspace, the two bits are
expanded into a struct nft_set_ext object.

If no NFTA_SET_DESC_SIZE is specified, the existing automatic set
backend selection prefers bitmap over hash in case of keys whose size is
<= 16 bit. If the set size is known, the bitmap set type is selected for
16 bit keys with more than 390 elements in the set; otherwise the
hash table set implementation is used.

For 8 bit keys, the bitmap consumes 66 bytes. For 16 bit keys, the
bitmap takes 16388 bytes.
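
As a rough check of these figures, the raw bitmap alone needs two bits
per possible key value. The sketch below computes just that part;
bitmap_bytes() is a hypothetical helper, and the few extra bytes in the
totals quoted above are presumably per-set bookkeeping that is not
modelled here.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for the raw bitmap: two bits per possible key value.
 * klen is the key length in bytes (1 or 2 for this set type). Any
 * per-set header is not counted. */
static size_t bitmap_bytes(unsigned int klen)
{
	size_t nkeys;

	assert(klen == 1 || klen == 2);
	nkeys = (size_t)1 << (klen * 8);	/* number of distinct keys */
	return nkeys * 2 / 8;			/* 2 bits each, 8 bits per byte */
}
```

This gives 64 bytes for 8 bit keys and 16384 bytes for 16 bit keys.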

Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/Kconfig  |   6 +
 net/netfilter/Makefile |   1 +
 net/netfilter/nft_set_bitmap.c | 314 +
 3 files changed, 321 insertions(+)
 create mode 100644 net/netfilter/nft_set_bitmap.c

diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index dfbe9deeb8c4..ea479ed43373 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -509,6 +509,12 @@ config NFT_SET_HASH
  This option adds the "hash" set type that is used to build one-way
  mappings between matchings and actions.
 
+config NFT_SET_BITMAP
+   tristate "Netfilter nf_tables bitmap set module"
+   help
+ This option adds the "bitmap" set type that is used to build sets
+ whose keys are smaller or equal to 16 bits.
+
 config NFT_COUNTER
tristate "Netfilter nf_tables counter module"
help
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 6b3034f12661..c9b78e7b342f 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -93,6 +93,7 @@ obj-$(CONFIG_NFT_REJECT)  += nft_reject.o
 obj-$(CONFIG_NFT_REJECT_INET)  += nft_reject_inet.o
 obj-$(CONFIG_NFT_SET_RBTREE)   += nft_set_rbtree.o
 obj-$(CONFIG_NFT_SET_HASH) += nft_set_hash.o
+obj-$(CONFIG_NFT_SET_BITMAP)   += nft_set_bitmap.o
 obj-$(CONFIG_NFT_COUNTER)  += nft_counter.o
 obj-$(CONFIG_NFT_LOG)  += nft_log.o
 obj-$(CONFIG_NFT_MASQ) += nft_masq.o
diff --git a/net/netfilter/nft_set_bitmap.c b/net/netfilter/nft_set_bitmap.c
new file mode 100644
index ..97f9649bcc7e
--- /dev/null
+++ b/net/netfilter/nft_set_bitmap.c
@@ -0,0 +1,314 @@
+/*
+ * Copyright (c) 2017 Pablo Neira Ayuso 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* This bitmap uses two bits to represent one element. These two bits determine
+ * the element state in the current and the future generation.
+ *
+ * An element can be in three states. The generation cursor is represented using
+ * the ^ character, note that this cursor shifts on every successful transaction.
+ * If no transaction is going on, we observe all elements are in the following
+ * state:
+ *
+ * 11 = this element is active in the current generation. In case of no updates,
+ * ^    it stays active in the next generation.
+ * 00 = this element is inactive in the current generation. In case of no
+ * ^    updates, it stays inactive in the next generation.
+ *
+ * On transaction handling, we observe these two temporary states:
+ *
+ * 01 = this element is inactive in the current generation and it becomes active
+ * ^    in the next one. This happens when the element is inserted but commit
+ *      path has not been executed yet, so activation is still pending. On
+ *      transaction abortion, the element is removed.
+ * 10 = this element is active in the current generation and it becomes inactive
+ * ^    in the next one. This happens when the element is deactivated but commit
+ *      path has not been executed yet, so removal is still pending. On
+ *      transaction abortion, the next generation bit is reset to restore its
+ *      previous state.
+ */
+struct nft_bitmap {
+   u16 bitmap_size;
+   u8  bitmap[];
+};
+
+static inline void nft_bitmap_location(u32 key, u32 *idx, u32 *off)
+{
+   u32 k = (key << 1);
+
+   *idx = k / BITS_PER_BYTE;
+   *off = k % BITS_PER_BYTE;
+}
+
+/* Fetch the two bits that represent the element and check if it is active based
+ * on the generation mask.
+ */
+static inline bool
+nft_bitmap_active(const u8 *bitmap, u32 idx, u32 off, u8 genmask)
+{
+   return (bitmap[idx] & (0x3 << off)) & (genmask << off);
+}
+
+static bool nft_bitmap_lookup(const struct net *net, const struct nft_set *set,
+ const u32 *key, const struct nft_set_ext **ext)
+{
+   const struct nft_bitmap *priv = nft_set_priv(set);
+   u8 genmask = nft_genmask_cur(net);
+   u32 idx, off;
+
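
The two-bit encoding described in the comment above can be tried out in
userspace with a small sketch. bitmap_location() and bitmap_active() mirror
the helpers shown in the diff; bitmap_mark() and the convention that the
generation mask is a single bit (0x1 or 0x2) are assumptions made for this
illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

#define BITS_PER_BYTE 8

/* Map a key to the byte index and bit offset of its two-bit slot. */
void bitmap_location(uint32_t key, uint32_t *idx, uint32_t *off)
{
	uint32_t k = key << 1;	/* two bits per element */

	*idx = k / BITS_PER_BYTE;
	*off = k % BITS_PER_BYTE;
}

/* Check the bit selected by genmask within the element's two-bit slot.
 * Assumption of this sketch: genmask is 0x1 or 0x2.
 */
bool bitmap_active(const uint8_t *bitmap, uint32_t idx, uint32_t off,
		   uint8_t genmask)
{
	return ((bitmap[idx] & (0x3 << off)) & (genmask << off)) != 0;
}

/* Hypothetical helper: overwrite the two state bits of a key. */
void bitmap_mark(uint8_t *bitmap, uint32_t key, uint8_t bits)
{
	uint32_t idx, off;

	bitmap_location(key, &idx, &off);
	bitmap[idx] &= (uint8_t)~(0x3 << off);
	bitmap[idx] |= (uint8_t)((bits & 0x3) << off);
}
```

Marking a key with 0x3 makes it visible under either mask, while marking it
with only one bit makes it visible under just that generation mask, which is
how the temporary 01 and 10 states behave.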

[PATCH 14/21] netfilter: nf_ct_expect: nf_ct_expect_insert() returns void

2017-02-12 Thread Pablo Neira Ayuso
From: Gao Feng 

Because nf_ct_expect_insert() always succeeds now, its return value can
be just void instead of int. And remove code that checks for its return
value.

Signed-off-by: Gao Feng 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_conntrack_expect.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c
index f8dbacf66795..e19a69787d99 100644
--- a/net/netfilter/nf_conntrack_expect.c
+++ b/net/netfilter/nf_conntrack_expect.c
@@ -353,7 +353,7 @@ void nf_ct_expect_put(struct nf_conntrack_expect *exp)
 }
 EXPORT_SYMBOL_GPL(nf_ct_expect_put);
 
-static int nf_ct_expect_insert(struct nf_conntrack_expect *exp)
+static void nf_ct_expect_insert(struct nf_conntrack_expect *exp)
 {
struct nf_conn_help *master_help = nfct_help(exp->master);
struct nf_conntrack_helper *helper;
@@ -380,7 +380,6 @@ static int nf_ct_expect_insert(struct nf_conntrack_expect *exp)
add_timer(&exp->timeout);
 
NF_CT_STAT_INC(net, expect_create);
-   return 0;
 }
 
 /* Race with expectations being used means we could have none to find; OK. */
@@ -464,9 +463,8 @@ int nf_ct_expect_related_report(struct nf_conntrack_expect *expect,
if (ret <= 0)
goto out;
 
-   ret = nf_ct_expect_insert(expect);
-   if (ret < 0)
-   goto out;
+   nf_ct_expect_insert(expect);
+
spin_unlock_bh(&nf_conntrack_expect_lock);
nf_ct_expect_event_report(IPEXP_NEW, expect, portid, report);
return ret;
-- 
2.1.4



[PATCH 00/21] Netfilter updates for net-next

2017-02-12 Thread Pablo Neira Ayuso
Hi David,

The following patchset contains Netfilter updates for your net-next
tree, most relevantly they are:

1) Extend nft_exthdr to allow to match TCP options bitfields, from
   Manuel Messner.

2) Allow to check if IPv6 extension header is present in nf_tables,
   from Phil Sutter.

3) Allow to set and match conntrack zone in nf_tables, patches from
   Florian Westphal.

4) Several patches for the nf_tables set infrastructure, this includes
   cleanup and preparatory patches to add the new bitmap set type.

5) Add optional ruleset generation ID check to nf_tables and allow to
   delete rules that got no public handle yet via NFTA_RULE_ID. These
   patches add the missing kernel infrastructure to support rule
   deletion by description from userspace.

6) Missing NFT_SET_OBJECT flag to select the right backend when a set
   stores an object map.

7) A couple of cleanups for the expectation and SIP helper, from Gao
   Feng.

You can pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git

Thanks!



The following changes since commit 6e7bc478c9a006c701c14476ec9d389a484b4864:

  net: skb_needs_check() accepts CHECKSUM_NONE for tx (2017-02-03 17:33:01 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git HEAD

for you to fetch changes up to 7286ff7fde9f963736c7e575572899d8e16b06b7:

  netfilter: nf_tables: honor NFT_SET_OBJECT in set backend selection (2017-02-12 14:45:14 +0100)


Florian Westphal (3):
  netfilter: nft_ct: add zone id get support
  netfilter: nft_ct: prepare for key-dependent error unwind
  netfilter: nft_ct: add zone id set support

Gao Feng (2):
  netfilter: nf_ct_sip: Use mod_timer_pending()
  netfilter: nf_ct_expect: nf_ct_expect_insert() returns void

Manuel Messner (1):
  netfilter: nft_exthdr: add TCP option matching

Pablo Neira Ayuso (14):
  netfilter: nf_tables: pass netns to set->ops->remove()
  netfilter: nf_tables: use struct nft_set_iter in set element flush
  netfilter: nf_tables: rename deactivate_one() to flush()
  netfilter: nf_tables: add flush field to struct nft_set_iter
  netfilter: nf_tables: rename struct nft_set_estimate class field
  netfilter: nf_tables: add space notation to sets
  netfilter: nf_tables: add bitmap set type
  netfilter: nfnetlink: get rid of u_intX_t types
  netfilter: nfnetlink: add nfnetlink_rcv_skb_batch()
  netfilter: nfnetlink: allow to check for generation ID
  netfilter: nf_tables: add check_genid to the nfnetlink subsystem
  netfilter: nf_tables: add NFTA_RULE_ID attribute
  netfilter: update MAINTAINERS
  netfilter: nf_tables: honor NFT_SET_OBJECT in set backend selection

Phil Sutter (1):
  netfilter: nft_exthdr: Add support for existence check

 MAINTAINERS  |   3 +-
 include/linux/netfilter/nfnetlink.h  |   1 +
 include/net/netfilter/nf_tables.h|  21 ++-
 include/uapi/linux/netfilter/nf_tables.h |  27 ++-
 include/uapi/linux/netfilter/nfnetlink.h |  12 ++
 net/netfilter/Kconfig|  10 +-
 net/netfilter/Makefile   |   1 +
 net/netfilter/nf_conntrack_expect.c  |   8 +-
 net/netfilter/nf_conntrack_sip.c |  12 +-
 net/netfilter/nf_tables_api.c|  89 ++---
 net/netfilter/nfnetlink.c|  90 ++---
 net/netfilter/nft_ct.c   | 195 +--
 net/netfilter/nft_exthdr.c   | 139 --
 net/netfilter/nft_set_bitmap.c   | 314 +++
 net/netfilter/nft_set_hash.c |  16 +-
 net/netfilter/nft_set_rbtree.c   |  16 +-
 16 files changed, 832 insertions(+), 122 deletions(-)
 create mode 100644 net/netfilter/nft_set_bitmap.c


[PATCH 12/21] netfilter: nft_exthdr: add TCP option matching

2017-02-12 Thread Pablo Neira Ayuso
From: Manuel Messner 

This patch implements the kernel side of the TCP option patch.
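
The option walk at the heart of this change can be sketched in userspace as
follows. optlen() follows the helper added by the patch; find_tcp_option()
and its return convention are hypothetical and only illustrate how the loop
makes progress over the option bytes, they are not the kernel's evaluation
function.

```c
#include <stdint.h>

#define TCPOPT_NOP 1	/* option kinds 0 (EOL) and 1 (NOP) are one byte long */

/* Length of the option at offset; always returns at least 1 so that the
 * caller makes finite progress even on malformed zero-length options.
 */
static unsigned int optlen(const uint8_t *opt, unsigned int offset)
{
	if (opt[offset] <= TCPOPT_NOP || opt[offset + 1] == 0)
		return 1;
	return opt[offset + 1];
}

/* Hypothetical helper: return the offset of the first option of the given
 * kind inside opt[0..len), or -1 if it is absent or truncated.
 */
int find_tcp_option(const uint8_t *opt, unsigned int len, uint8_t type)
{
	unsigned int i, optl;

	/* i + 1 < len keeps the length byte read by optlen() in bounds */
	for (i = 0; i + 1 < len; i += optl) {
		optl = optlen(opt, i);

		if (opt[i] != type)
			continue;
		if (i + optl > len)	/* option runs past the buffer */
			return -1;
		return (int)i;
	}
	return -1;
}
```

Given the option bytes NOP, NOP, MSS (kind 2, length 4), SACK-permitted
(kind 4, length 2), a search for kind 2 stops at offset 2 and a search for
kind 4 at offset 6.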

Signed-off-by: Manuel Messner 
Reviewed-by: Florian Westphal 
Acked-by: Phil Sutter 
Signed-off-by: Pablo Neira Ayuso 
---
 include/uapi/linux/netfilter/nf_tables.h |  17 -
 net/netfilter/Kconfig|   4 +-
 net/netfilter/nft_exthdr.c   | 119 +++
 3 files changed, 124 insertions(+), 16 deletions(-)

diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
index 3e60ed78c538..207951516ede 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -709,13 +709,27 @@ enum nft_exthdr_flags {
 };
 
 /**
- * enum nft_exthdr_attributes - nf_tables IPv6 extension header expression netlink attributes
+ * enum nft_exthdr_op - nf_tables match options
+ *
+ * @NFT_EXTHDR_OP_IPV6: match against ipv6 extension headers
+ * @NFT_EXTHDR_OP_TCPOPT: match against tcp options
+ */
+enum nft_exthdr_op {
+   NFT_EXTHDR_OP_IPV6,
+   NFT_EXTHDR_OP_TCPOPT,
+   __NFT_EXTHDR_OP_MAX
+};
+#define NFT_EXTHDR_OP_MAX  (__NFT_EXTHDR_OP_MAX - 1)
+
+/**
+ * enum nft_exthdr_attributes - nf_tables extension header expression netlink attributes
  *
  * @NFTA_EXTHDR_DREG: destination register (NLA_U32: nft_registers)
  * @NFTA_EXTHDR_TYPE: extension header type (NLA_U8)
  * @NFTA_EXTHDR_OFFSET: extension header offset (NLA_U32)
  * @NFTA_EXTHDR_LEN: extension header length (NLA_U32)
  * @NFTA_EXTHDR_FLAGS: extension header flags (NLA_U32)
+ * @NFTA_EXTHDR_OP: option match type (NLA_U8)
  */
 enum nft_exthdr_attributes {
NFTA_EXTHDR_UNSPEC,
@@ -724,6 +738,7 @@ enum nft_exthdr_attributes {
NFTA_EXTHDR_OFFSET,
NFTA_EXTHDR_LEN,
NFTA_EXTHDR_FLAGS,
+   NFTA_EXTHDR_OP,
__NFTA_EXTHDR_MAX
 };
 #define NFTA_EXTHDR_MAX(__NFTA_EXTHDR_MAX - 1)
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index ea479ed43373..9b28864cc36a 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -467,10 +467,10 @@ config NF_TABLES_NETDEV
  This option enables support for the "netdev" table.
 
 config NFT_EXTHDR
-   tristate "Netfilter nf_tables IPv6 exthdr module"
+   tristate "Netfilter nf_tables exthdr module"
help
  This option adds the "exthdr" expression that you can use to match
- IPv6 extension headers.
+ IPv6 extension headers and tcp options.
 
 config NFT_META
tristate "Netfilter nf_tables meta module"
diff --git a/net/netfilter/nft_exthdr.c b/net/netfilter/nft_exthdr.c
index a89e5ab150db..c308920b194c 100644
--- a/net/netfilter/nft_exthdr.c
+++ b/net/netfilter/nft_exthdr.c
@@ -15,20 +15,29 @@
 #include 
 #include 
 #include 
-// FIXME:
-#include 
+#include 
 
 struct nft_exthdr {
u8  type;
u8  offset;
u8  len;
+   u8  op;
enum nft_registers  dreg:8;
u8  flags;
 };
 
-static void nft_exthdr_eval(const struct nft_expr *expr,
-   struct nft_regs *regs,
-   const struct nft_pktinfo *pkt)
+static unsigned int optlen(const u8 *opt, unsigned int offset)
+{
+   /* Beware zero-length options: make finite progress */
+   if (opt[offset] <= TCPOPT_NOP || opt[offset + 1] == 0)
+   return 1;
+   else
+   return opt[offset + 1];
+}
+
+static void nft_exthdr_ipv6_eval(const struct nft_expr *expr,
+struct nft_regs *regs,
+const struct nft_pktinfo *pkt)
 {
struct nft_exthdr *priv = nft_expr_priv(expr);
u32 *dest = &regs->data[priv->dreg];
@@ -52,6 +61,53 @@ static void nft_exthdr_eval(const struct nft_expr *expr,
regs->verdict.code = NFT_BREAK;
 }
 
+static void nft_exthdr_tcp_eval(const struct nft_expr *expr,
+   struct nft_regs *regs,
+   const struct nft_pktinfo *pkt)
+{
+   u8 buff[sizeof(struct tcphdr) + MAX_TCP_OPTION_SPACE];
+   struct nft_exthdr *priv = nft_expr_priv(expr);
+   unsigned int i, optl, tcphdr_len, offset;
+   u32 *dest = &regs->data[priv->dreg];
+   struct tcphdr *tcph;
+   u8 *opt;
+
+   if (!pkt->tprot_set || pkt->tprot != IPPROTO_TCP)
+   goto err;
+
+   tcph = skb_header_pointer(pkt->skb, pkt->xt.thoff, sizeof(*tcph), buff);
+   if (!tcph)
+   goto err;
+
+   tcphdr_len = __tcp_hdrlen(tcph);
+   if (tcphdr_len < sizeof(*tcph))
+   goto err;
+
+   tcph = skb_header_pointer(pkt->skb, pkt->xt.thoff, tcphdr_len, buff);
+   if (!tcph)
+   goto err;
+
+   opt = (u8 *)tcph;
+   for (i = sizeof(*tcph); i < tcphdr_len - 1; i += optl) {
+   optl = optlen(opt, i);
+
+   if (priv->type != opt[i])

[PATCH 19/21] netfilter: nf_tables: add NFTA_RULE_ID attribute

2017-02-12 Thread Pablo Neira Ayuso
This new attribute allows us to uniquely identify a rule in a transaction.
Robots may trigger an insertion followed by deletion in a batch, in that
scenario we still don't have a public rule handle that we can use to
delete the rule. This is similar to the NFTA_SET_ID attribute that
allows us to refer to an anonymous set from a batch.
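
The lookup this enables can be modelled in a few lines of userspace C. This
is only a sketch under stated assumptions: struct pending_rule, the
msg_type enum and lookup_byid() are hypothetical stand-ins for the
transaction list walked by the kernel, and a plain array replaces the
commit list.

```c
#include <stddef.h>
#include <stdint.h>

enum msg_type { MSG_NEWRULE, MSG_DELRULE };	/* hypothetical stand-ins */

struct pending_rule {
	enum msg_type type;
	uint32_t rule_id;	/* chosen by userspace, valid within the batch */
	const char *desc;
};

/* Find a rule that was added earlier in the same batch by its ID. Only
 * pending additions are considered; anything else is skipped.
 */
const struct pending_rule *lookup_byid(const struct pending_rule *txn,
				       size_t n, uint32_t id)
{
	for (size_t i = 0; i < n; i++) {
		if (txn[i].type == MSG_NEWRULE && txn[i].rule_id == id)
			return &txn[i];
	}
	return NULL;
}
```

A delete request carrying the same ID can then resolve the rule even though
no public handle has been assigned yet.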

Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_tables.h|  3 +++
 include/uapi/linux/netfilter/nf_tables.h |  2 ++
 net/netfilter/nf_tables_api.c| 26 ++
 3 files changed, 31 insertions(+)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 21ce50e6d0c5..ac84686aaafb 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -1202,10 +1202,13 @@ struct nft_trans {
 
 struct nft_trans_rule {
struct nft_rule *rule;
+   u32 rule_id;
 };
 
 #define nft_trans_rule(trans)  \
(((struct nft_trans_rule *)trans->data)->rule)
+#define nft_trans_rule_id(trans)   \
+   (((struct nft_trans_rule *)trans->data)->rule_id)
 
 struct nft_trans_set {
struct nft_set  *set;
diff --git a/include/uapi/linux/netfilter/nf_tables.h 
b/include/uapi/linux/netfilter/nf_tables.h
index 207951516ede..05215d30fe5c 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -207,6 +207,7 @@ enum nft_chain_attributes {
  * @NFTA_RULE_COMPAT: compatibility specifications of the rule (NLA_NESTED: nft_rule_compat_attributes)
  * @NFTA_RULE_POSITION: numeric handle of the previous rule (NLA_U64)
  * @NFTA_RULE_USERDATA: user data (NLA_BINARY, NFT_USERDATA_MAXLEN)
+ * @NFTA_RULE_ID: uniquely identifies a rule in a transaction (NLA_U32)
  */
 enum nft_rule_attributes {
NFTA_RULE_UNSPEC,
@@ -218,6 +219,7 @@ enum nft_rule_attributes {
NFTA_RULE_POSITION,
NFTA_RULE_USERDATA,
NFTA_RULE_PAD,
+   NFTA_RULE_ID,
__NFTA_RULE_MAX
 };
 #define NFTA_RULE_MAX  (__NFTA_RULE_MAX - 1)
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 71c60a04b66b..6c782532615f 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -240,6 +240,10 @@ static struct nft_trans *nft_trans_rule_add(struct nft_ctx *ctx, int msg_type,
if (trans == NULL)
return NULL;
 
+   if (msg_type == NFT_MSG_NEWRULE && ctx->nla[NFTA_RULE_ID] != NULL) {
+   nft_trans_rule_id(trans) =
+   ntohl(nla_get_be32(ctx->nla[NFTA_RULE_ID]));
+   }
nft_trans_rule(trans) = rule;
list_add_tail(&trans->list, &ctx->net->nft.commit_list);
 
@@ -2293,6 +2297,22 @@ static int nf_tables_newrule(struct net *net, struct sock *nlsk,
return err;
 }
 
+static struct nft_rule *nft_rule_lookup_byid(const struct net *net,
+const struct nlattr *nla)
+{
+   u32 id = ntohl(nla_get_be32(nla));
+   struct nft_trans *trans;
+
+   list_for_each_entry(trans, &net->nft.commit_list, list) {
+   struct nft_rule *rule = nft_trans_rule(trans);
+
+   if (trans->msg_type == NFT_MSG_NEWRULE &&
+   id == nft_trans_rule_id(trans))
+   return rule;
+   }
+   return ERR_PTR(-ENOENT);
+}
+
 static int nf_tables_delrule(struct net *net, struct sock *nlsk,
 struct sk_buff *skb, const struct nlmsghdr *nlh,
 const struct nlattr * const nla[])
@@ -2331,6 +2351,12 @@ static int nf_tables_delrule(struct net *net, struct sock *nlsk,
return PTR_ERR(rule);
 
err = nft_delrule(&ctx, rule);
+   } else if (nla[NFTA_RULE_ID]) {
+   rule = nft_rule_lookup_byid(net, nla[NFTA_RULE_ID]);
+   if (IS_ERR(rule))
+   return PTR_ERR(rule);
+
+   err = nft_delrule(&ctx, rule);
} else {
err = nft_delrule_by_chain(&ctx);
}
-- 
2.1.4



[PATCH 20/21] netfilter: update MAINTAINERS

2017-02-12 Thread Pablo Neira Ayuso
It's been a while since Patrick has been suspended as coreteam member [1].
Update this file to remove him.

While at it, remove references to all foo-tables variants, given the
project hosts more than just that, e.g. ipset, conntrack, ...

[1] https://marc.info/?l=netfilter-devel&m=146887464512702

Signed-off-by: Pablo Neira Ayuso 
---
 MAINTAINERS | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index a9368bba9b37..5864bbd99f8f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8579,9 +8579,8 @@ F:Documentation/networking/s2io.txt
 F: Documentation/networking/vxge.txt
 F: drivers/net/ethernet/neterion/
 
-NETFILTER ({IP,IP6,ARP,EB,NF}TABLES)
+NETFILTER
 M: Pablo Neira Ayuso 
-M: Patrick McHardy 
 M: Jozsef Kadlecsik 
 L: netfilter-de...@vger.kernel.org
 L: coret...@netfilter.org
-- 
2.1.4



[PATCH 16/21] netfilter: nfnetlink: add nfnetlink_rcv_skb_batch()

2017-02-12 Thread Pablo Neira Ayuso
Add new nfnetlink_rcv_skb_batch() to wrap initial nfnetlink batch
handling.

Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nfnetlink.c | 51 ++-
 1 file changed, 28 insertions(+), 23 deletions(-)

diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c
index 586212ebba9e..ca645a3b1375 100644
--- a/net/netfilter/nfnetlink.c
+++ b/net/netfilter/nfnetlink.c
@@ -436,12 +436,35 @@ static void nfnetlink_rcv_batch(struct sk_buff *skb, struct nlmsghdr *nlh,
kfree_skb(skb);
 }
 
-static void nfnetlink_rcv(struct sk_buff *skb)
+static void nfnetlink_rcv_skb_batch(struct sk_buff *skb, struct nlmsghdr *nlh)
 {
-   struct nlmsghdr *nlh = nlmsg_hdr(skb);
+   struct nfgenmsg *nfgenmsg;
u16 res_id;
int msglen;
 
+   msglen = NLMSG_ALIGN(nlh->nlmsg_len);
+   if (msglen > skb->len)
+   msglen = skb->len;
+
+   if (nlh->nlmsg_len < NLMSG_HDRLEN ||
+   skb->len < NLMSG_HDRLEN + sizeof(struct nfgenmsg))
+   return;
+
+   nfgenmsg = nlmsg_data(nlh);
+   skb_pull(skb, msglen);
+   /* Work around old nft using host byte order */
+   if (nfgenmsg->res_id == NFNL_SUBSYS_NFTABLES)
+   res_id = NFNL_SUBSYS_NFTABLES;
+   else
+   res_id = ntohs(nfgenmsg->res_id);
+
+   nfnetlink_rcv_batch(skb, nlh, res_id);
+}
+
+static void nfnetlink_rcv(struct sk_buff *skb)
+{
+   struct nlmsghdr *nlh = nlmsg_hdr(skb);
+
if (nlh->nlmsg_len < NLMSG_HDRLEN ||
skb->len < nlh->nlmsg_len)
return;
@@ -451,28 +474,10 @@ static void nfnetlink_rcv(struct sk_buff *skb)
return;
}
 
-   if (nlh->nlmsg_type == NFNL_MSG_BATCH_BEGIN) {
-   struct nfgenmsg *nfgenmsg;
-
-   msglen = NLMSG_ALIGN(nlh->nlmsg_len);
-   if (msglen > skb->len)
-   msglen = skb->len;
-
-   if (nlh->nlmsg_len < NLMSG_HDRLEN ||
-   skb->len < NLMSG_HDRLEN + sizeof(struct nfgenmsg))
-   return;
-
-   nfgenmsg = nlmsg_data(nlh);
-   skb_pull(skb, msglen);
-   /* Work around old nft using host byte order */
-   if (nfgenmsg->res_id == NFNL_SUBSYS_NFTABLES)
-   res_id = NFNL_SUBSYS_NFTABLES;
-   else
-   res_id = ntohs(nfgenmsg->res_id);
-   nfnetlink_rcv_batch(skb, nlh, res_id);
-   } else {
+   if (nlh->nlmsg_type == NFNL_MSG_BATCH_BEGIN)
+   nfnetlink_rcv_skb_batch(skb, nlh);
+   else
netlink_rcv_skb(skb, &nfnetlink_rcv_msg);
-   }
 }
 
 #ifdef CONFIG_MODULES
-- 
2.1.4



[PATCH 11/21] netfilter: nft_ct: add zone id set support

2017-02-12 Thread Pablo Neira Ayuso
From: Florian Westphal 

Zones allow tracking multiple connections sharing identical tuples,
this is needed e.g. when tracking distinct vlans with overlapping ip
addresses (conntrack is l2 agnostic).

Thus the zone has to be set before the packet is picked up by the
connection tracker.  This is done by means of 'conntrack templates' which
are conntrack structures used solely to pass this info from one netfilter
hook to the next.

The iptables CT target instantiates these connection tracking templates
once per rule, i.e. the template is fixed/tied to a particular zone, can
be read-only and therefore be re-used by as many skbs simultaneously as
needed.

We can't follow this model because we want to take the zone id from
an sreg at rule eval time so we could e.g. fill in the zone id from
the packet's vlan id or from e.g. an nftables key : value map.

To avoid cost of per packet alloc/free of the template, use a percpu
template 'scratch' object and use the refcount to detect the (unlikely)
case where the template is still attached to another skb (i.e., previous
skb was nfqueued ...).
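
The reuse-or-allocate decision can be illustrated with a simplified,
single-threaded model. Everything here is an assumption made for the
sketch: struct tmpl and tmpl_get() are hypothetical, a plain int replaces
the atomic reference count, and one scratch object stands in for the
per-cpu template.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a conntrack template. */
struct tmpl {
	int use;		/* reference count, non-atomic in this sketch */
	uint16_t zone_id;
};

/* Return a template carrying zone_id with one extra reference taken for
 * the packet. The scratch object is reused when its owner holds the only
 * reference; otherwise a previous packet still points at it and a fresh
 * template is allocated instead. Returns NULL on allocation failure.
 */
struct tmpl *tmpl_get(struct tmpl *scratch, uint16_t zone_id)
{
	struct tmpl *t = scratch;

	if (scratch->use == 1) {
		scratch->zone_id = zone_id;
	} else {
		t = calloc(1, sizeof(*t));
		if (!t)
			return NULL;
		t->zone_id = zone_id;
	}

	t->use++;
	return t;
}
```

With the scratch object idle the first call hands it out directly; a second
call made while that reference is still held falls back to allocation and
leaves the scratch object untouched.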

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nft_ct.c | 144 -
 1 file changed, 143 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c
index 2d82df2737da..c6b8022c0e47 100644
--- a/net/netfilter/nft_ct.c
+++ b/net/netfilter/nft_ct.c
@@ -32,6 +32,11 @@ struct nft_ct {
};
 };
 
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+static DEFINE_PER_CPU(struct nf_conn *, nft_ct_pcpu_template);
+static unsigned int nft_ct_pcpu_template_refcnt __read_mostly;
+#endif
+
 static u64 nft_ct_get_eval_counter(const struct nf_conn_counter *c,
   enum nft_ct_keys k,
   enum ip_conntrack_dir d)
@@ -191,6 +196,53 @@ static void nft_ct_get_eval(const struct nft_expr *expr,
regs->verdict.code = NFT_BREAK;
 }
 
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+static void nft_ct_set_zone_eval(const struct nft_expr *expr,
+struct nft_regs *regs,
+const struct nft_pktinfo *pkt)
+{
+   struct nf_conntrack_zone zone = { .dir = NF_CT_DEFAULT_ZONE_DIR };
+   const struct nft_ct *priv = nft_expr_priv(expr);
+   struct sk_buff *skb = pkt->skb;
+   enum ip_conntrack_info ctinfo;
+   u16 value = regs->data[priv->sreg];
+   struct nf_conn *ct;
+
+   ct = nf_ct_get(skb, &ctinfo);
+   if (ct) /* already tracked */
+   return;
+
+   zone.id = value;
+
+   switch (priv->dir) {
+   case IP_CT_DIR_ORIGINAL:
+   zone.dir = NF_CT_ZONE_DIR_ORIG;
+   break;
+   case IP_CT_DIR_REPLY:
+   zone.dir = NF_CT_ZONE_DIR_REPL;
+   break;
+   default:
+   break;
+   }
+
+   ct = this_cpu_read(nft_ct_pcpu_template);
+
+   if (likely(atomic_read(&ct->ct_general.use) == 1)) {
+   nf_ct_zone_add(ct, &zone);
+   } else {
+   /* previous skb got queued to userspace */
+   ct = nf_ct_tmpl_alloc(nft_net(pkt), &zone, GFP_ATOMIC);
+   if (!ct) {
+   regs->verdict.code = NF_DROP;
+   return;
+   }
+   }
+
+   atomic_inc(&ct->ct_general.use);
+   nf_ct_set(skb, ct, IP_CT_NEW);
+}
+#endif
+
 static void nft_ct_set_eval(const struct nft_expr *expr,
struct nft_regs *regs,
const struct nft_pktinfo *pkt)
@@ -269,6 +321,45 @@ static void nft_ct_netns_put(struct net *net, uint8_t family)
nf_ct_netns_put(net, family);
 }
 
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+static void nft_ct_tmpl_put_pcpu(void)
+{
+   struct nf_conn *ct;
+   int cpu;
+
+   for_each_possible_cpu(cpu) {
+   ct = per_cpu(nft_ct_pcpu_template, cpu);
+   if (!ct)
+   break;
+   nf_ct_put(ct);
+   per_cpu(nft_ct_pcpu_template, cpu) = NULL;
+   }
+}
+
+static bool nft_ct_tmpl_alloc_pcpu(void)
+{
+   struct nf_conntrack_zone zone = { .id = 0 };
+   struct nf_conn *tmp;
+   int cpu;
+
+   if (nft_ct_pcpu_template_refcnt)
+   return true;
+
+   for_each_possible_cpu(cpu) {
+   tmp = nf_ct_tmpl_alloc(&init_net, &zone, GFP_KERNEL);
+   if (!tmp) {
+   nft_ct_tmpl_put_pcpu();
+   return false;
+   }
+
+   atomic_set(&tmp->ct_general.use, 1);
+   per_cpu(nft_ct_pcpu_template, cpu) = tmp;
+   }
+
+   return true;
+}
+#endif
+
 static int nft_ct_get_init(const struct nft_ctx *ctx,
   const struct nft_expr *expr,
   const struct nlattr * const tb[])
@@ -394,6 +485,11 @@ static void __nft_ct_set_destroy(const struct nft_ctx 
*ctx, struc

[PATCH 18/21] netfilter: nf_tables: add check_genid to the nfnetlink subsystem

2017-02-12 Thread Pablo Neira Ayuso
This patch implements the generation ID check as provided by nfnetlink.
This allows us to reject ruleset updates against a stale baseline, so
userspace can retry the update with a fresh ruleset cache.

Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_tables_api.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index cb6ae46f6c48..71c60a04b66b 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -4972,6 +4972,11 @@ static int nf_tables_abort(struct net *net, struct sk_buff *skb)
return 0;
 }
 
+static bool nf_tables_valid_genid(struct net *net, u32 genid)
+{
+   return net->nft.base_seq == genid;
+}
+
 static const struct nfnetlink_subsystem nf_tables_subsys = {
.name   = "nf_tables",
.subsys_id  = NFNL_SUBSYS_NFTABLES,
@@ -4979,6 +4984,7 @@ static const struct nfnetlink_subsystem nf_tables_subsys = {
.cb = nf_tables_cb,
.commit = nf_tables_commit,
.abort  = nf_tables_abort,
+   .valid_genid= nf_tables_valid_genid,
 };
 
 int nft_chain_validate_dependency(const struct nft_chain *chain,
-- 
2.1.4



[PATCH 21/21] netfilter: nf_tables: honor NFT_SET_OBJECT in set backend selection

2017-02-12 Thread Pablo Neira Ayuso
Check for NFT_SET_OBJECT feature flag, otherwise we may end up selecting
the wrong set backend.

Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_tables_api.c  | 3 ++-
 net/netfilter/nft_set_hash.c   | 2 +-
 net/netfilter/nft_set_rbtree.c | 2 +-
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 6c782532615f..ff7304ae58ac 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -2424,7 +2424,8 @@ nft_select_set_ops(const struct nlattr * const nla[],
features = 0;
if (nla[NFTA_SET_FLAGS] != NULL) {
features = ntohl(nla_get_be32(nla[NFTA_SET_FLAGS]));
-   features &= NFT_SET_INTERVAL | NFT_SET_MAP | NFT_SET_TIMEOUT;
+   features &= NFT_SET_INTERVAL | NFT_SET_MAP | NFT_SET_TIMEOUT |
+   NFT_SET_OBJECT;
}
 
bops= NULL;
diff --git a/net/netfilter/nft_set_hash.c b/net/netfilter/nft_set_hash.c
index 6938bc890f31..5f652720fc78 100644
--- a/net/netfilter/nft_set_hash.c
+++ b/net/netfilter/nft_set_hash.c
@@ -404,7 +404,7 @@ static struct nft_set_ops nft_hash_ops __read_mostly = {
.lookup = nft_hash_lookup,
.update = nft_hash_update,
.walk   = nft_hash_walk,
-   .features   = NFT_SET_MAP | NFT_SET_TIMEOUT,
+   .features   = NFT_SET_MAP | NFT_SET_OBJECT | NFT_SET_TIMEOUT,
.owner  = THIS_MODULE,
 };
 
diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
index 3387ed7dd231..71e8fb886a73 100644
--- a/net/netfilter/nft_set_rbtree.c
+++ b/net/netfilter/nft_set_rbtree.c
@@ -310,7 +310,7 @@ static struct nft_set_ops nft_rbtree_ops __read_mostly = {
.activate   = nft_rbtree_activate,
.lookup = nft_rbtree_lookup,
.walk   = nft_rbtree_walk,
-   .features   = NFT_SET_INTERVAL | NFT_SET_MAP,
+   .features   = NFT_SET_INTERVAL | NFT_SET_MAP | NFT_SET_OBJECT,
.owner  = THIS_MODULE,
 };
 
-- 
2.1.4



[PATCH 17/21] netfilter: nfnetlink: allow to check for generation ID

2017-02-12 Thread Pablo Neira Ayuso
This patch allows userspace to specify the generation ID that has been
used to build an incremental batch update.

If userspace specifies the generation ID in the batch message as an
attribute, then nfnetlink compares it to the current generation ID to
make sure that the batch works against the right baseline. If the IDs do
not match, bail out with ERESTART so userspace knows that its changeset is
stale and needs to respin. Userspace can do this transparently at the cost
of taking slightly more time to refresh caches and rework the changeset.

This check is optional, if there is no NFNL_BATCH_GENID attribute in the
batch begin message, then no check is performed.
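
The decision described above can be condensed into a small sketch.
check_batch_genid() and SKETCH_ERESTART are hypothetical names used only
for this illustration; as in the diff below, a generation ID of zero is
taken to mean that the attribute was not supplied.

```c
#include <stdint.h>

#define SKETCH_ERESTART 85	/* assumption: stand-in for the errno value */

/* Return 0 if the batch may proceed, or a negative error if it was built
 * against a stale generation. batch_genid == 0 means "no check requested".
 */
int check_batch_genid(uint32_t batch_genid, uint32_t current_genid)
{
	if (batch_genid == 0)
		return 0;
	if (batch_genid != current_genid)
		return -SKETCH_ERESTART;
	return 0;
}
```

On the error path userspace would be expected to refresh its cache, rebuild
the changeset and resend the batch.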

Signed-off-by: Pablo Neira Ayuso 
---
 include/linux/netfilter/nfnetlink.h  |  1 +
 include/uapi/linux/netfilter/nfnetlink.h | 12 
 net/netfilter/nfnetlink.c| 31 +++
 3 files changed, 40 insertions(+), 4 deletions(-)

diff --git a/include/linux/netfilter/nfnetlink.h b/include/linux/netfilter/nfnetlink.h
index 1d82dd5e9a08..1b49209dd5c7 100644
--- a/include/linux/netfilter/nfnetlink.h
+++ b/include/linux/netfilter/nfnetlink.h
@@ -28,6 +28,7 @@ struct nfnetlink_subsystem {
const struct nfnl_callback *cb; /* callback for individual types */
int (*commit)(struct net *net, struct sk_buff *skb);
int (*abort)(struct net *net, struct sk_buff *skb);
+   bool (*valid_genid)(struct net *net, u32 genid);
 };
 
 int nfnetlink_subsys_register(const struct nfnetlink_subsystem *n);
diff --git a/include/uapi/linux/netfilter/nfnetlink.h 
b/include/uapi/linux/netfilter/nfnetlink.h
index 4bb8cb7730e7..a09906a30d77 100644
--- a/include/uapi/linux/netfilter/nfnetlink.h
+++ b/include/uapi/linux/netfilter/nfnetlink.h
@@ -65,4 +65,16 @@ struct nfgenmsg {
 #define NFNL_MSG_BATCH_BEGIN   NLMSG_MIN_TYPE
 #define NFNL_MSG_BATCH_END NLMSG_MIN_TYPE+1
 
+/**
+ * enum nfnl_batch_attributes - nfnetlink batch netlink attributes
+ *
+ * @NFNL_BATCH_GENID: generation ID for this changeset (NLA_U32)
+ */
+enum nfnl_batch_attributes {
+NFNL_BATCH_UNSPEC,
+NFNL_BATCH_GENID,
+__NFNL_BATCH_MAX
+};
+#define NFNL_BATCH_MAX (__NFNL_BATCH_MAX - 1)
+
 #endif /* _UAPI_NFNETLINK_H */
diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c
index ca645a3b1375..a2148d0bc50e 100644
--- a/net/netfilter/nfnetlink.c
+++ b/net/netfilter/nfnetlink.c
@@ -3,7 +3,7 @@
  *
  * (C) 2001 by Jay Schulist ,
  * (C) 2002-2005 by Harald Welte 
- * (C) 2005,2007 by Pablo Neira Ayuso 
+ * (C) 2005-2017 by Pablo Neira Ayuso 
  *
  * Initial netfilter messages via netlink development funded and
  * generally made possible by Network Robots, Inc. (www.networkrobots.com)
@@ -273,7 +273,7 @@ enum {
 };
 
 static void nfnetlink_rcv_batch(struct sk_buff *skb, struct nlmsghdr *nlh,
-   u16 subsys_id)
+   u16 subsys_id, u32 genid)
 {
struct sk_buff *oskb = skb;
struct net *net = sock_net(skb->sk);
@@ -315,6 +315,12 @@ static void nfnetlink_rcv_batch(struct sk_buff *skb, struct nlmsghdr *nlh,
return kfree_skb(skb);
}
 
+   if (genid && ss->valid_genid && !ss->valid_genid(net, genid)) {
+   nfnl_unlock(subsys_id);
+   netlink_ack(oskb, nlh, -ERESTART);
+   return kfree_skb(skb);
+   }
+
while (skb->len >= nlmsg_total_size(0)) {
int msglen, type;
 
@@ -436,11 +442,20 @@ static void nfnetlink_rcv_batch(struct sk_buff *skb, struct nlmsghdr *nlh,
kfree_skb(skb);
 }
 
+static const struct nla_policy nfnl_batch_policy[NFNL_BATCH_MAX + 1] = {
+   [NFNL_BATCH_GENID]  = { .type = NLA_U32 },
+};
+
 static void nfnetlink_rcv_skb_batch(struct sk_buff *skb, struct nlmsghdr *nlh)
 {
+   int min_len = nlmsg_total_size(sizeof(struct nfgenmsg));
+   struct nlattr *attr = (void *)nlh + min_len;
+   struct nlattr *cda[NFNL_BATCH_MAX + 1];
+   int attrlen = nlh->nlmsg_len - min_len;
struct nfgenmsg *nfgenmsg;
+   int msglen, err;
+   u32 gen_id = 0;
u16 res_id;
-   int msglen;
 
msglen = NLMSG_ALIGN(nlh->nlmsg_len);
if (msglen > skb->len)
@@ -450,6 +465,14 @@ static void nfnetlink_rcv_skb_batch(struct sk_buff *skb, struct nlmsghdr *nlh)
skb->len < NLMSG_HDRLEN + sizeof(struct nfgenmsg))
return;
 
+   err = nla_parse(cda, NFNL_BATCH_MAX, attr, attrlen, nfnl_batch_policy);
+   if (err < 0) {
+   netlink_ack(skb, nlh, err);
+   return;
+   }
+   if (cda[NFNL_BATCH_GENID])
+   gen_id = ntohl(nla_get_be32(cda[NFNL_BATCH_GENID]));
+
nfgenmsg = nlmsg_data(nlh);
skb_pull(skb, msglen);
/* Work around old nft using host byte order */
@@ -458,7 +481,7 @@ static void nfnetlink_rcv_skb_batch(struct sk_buff *skb, struct nlmsghdr *nlh)
else
res_id = ntohs(nf

Re: Fw: [Bug 193911] New: net_prio.ifpriomap is not aware of the network namespace, and discloses all network interface

2017-02-12 Thread Eric W. Biederman
Tejun Heo  writes:

> Hello,
>
> On Sun, Feb 05, 2017 at 11:05:36PM -0800, Cong Wang wrote:
>> > To be more specific, the read operation of net_prio.ifpriomap is handled
>> > by the function read_priomap. Tracing from this function, we can find it
>> > invokes for_each_netdev_rcu and set the first parameter as the address of
>> > init_net. It iterates all network devices of the host regardless of the
>> > network namespace. Thus, from the view of a container, it can read the
>> > names of all network devices of the host.
>> 
>> I think that is probably because cgroup files don't provide a net pointer
>> for the context, if so we probably need some API similar to
>> class_create_file_ns().
>
> Yeah, the whole thing never considered netns or delegation.  Maybe the
> read function itself should probably filter on the namespace of the
> reader?  I'm not completely sure whether trying to fix it won't cause
> some of existing use cases to break.  Eric, what do you think?

Apologies for the delay; I just made it back from vacation.

There are cases where we do look at the reader/opener of the file, and
it is a pain; almost always the best policy is to have the context fixed
at mount time.

I don't see an obvious answer of what better semantics for this file
should be.  Perhaps Docker can mount over this file on older kernels?

The namespace primitives that people build containers out of were never
guaranteed not to leak the fact that you are in a container.  So a small
essentially harmless information leak is not something I panic about.
It is the setting up of the container itself that must know what the
primitives do to ensure that leaks don't happen, if you want to avoid leaks.

That said if this controller/file does not consider netns and delegation
I suspect the right thing to do is put it under CONFIG_BROKEN or
possibly
CONFIG_I_REALLY_NEED_THIS_SILLY_CODE_FOR_BACKWARDS_COMPATIBILITY
aka CONFIG_STAGING and let the code age out of the kernel there.

If someone actually cares about this code and wants to fix it to do
something reasonable and is willing to dig through all of the subtleties,
I can help with that.  I may be wrong but the code feels like something
that just isn't interesting enough to make it worth fixing.

Eric


Re: [PATCH net-next v4 1/2] qed: Add infrastructure for PTP support.

2017-02-12 Thread Richard Cochran
On Sun, Feb 12, 2017 at 11:52:23AM +, Mintz, Yuval wrote:
> Just to clarify [since it's a bit meaningless otherwise] -
> this +8 is a HW-bug workaround.

Can you please explain exactly what the problem is?

Your code does

period1 = div_s64(val * 10, ppb);
period1 -= 8;
period1 >>= 4;

But correct rounding would be

period1 = div_s64(val * 10, ppb);
period1 += 8;
period1 >>= 4;
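
To make the arithmetic difference concrete, here is a minimal standalone
sketch in plain C. It is not the qed code: both helper names are made up
for illustration, and the inputs are assumed to be unsigned and at least 8.
It only shows what the two forms compute; it says nothing about what the
hardware needs.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; not taken from the qed driver. */

/* Divide by 16 and round to the nearest integer (ties round up):
 * adding half the divisor before the shift is the usual idiom. */
static inline uint64_t shift4_round_nearest(uint64_t x)
{
	return (x + 8) >> 4;
}

/* The form quoted from the patch: subtract 8, then shift.
 * Assumes x >= 8 so the subtraction cannot wrap around. */
static inline uint64_t shift4_minus8(uint64_t x)
{
	return (x - 8) >> 4;
}
```

For x = 24 (24/16 = 1.5) the first helper gives 2 and the second gives 1;
for x = 16 they give 1 and 0, so the subtracting form lands below even an
exact quotient.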

Thanks,
Richard


Re: [PATCH v2 net-next 00/14] mlx4: order-0 allocations and page recycling

2017-02-12 Thread Tariq Toukan



On 12/02/2017 5:32 PM, Eric Dumazet wrote:
> On Sun, Feb 12, 2017 at 7:04 AM, Tariq Toukan  wrote:
>
>> We consistently see this behavior: the higher the BW, the sharper the
>> degradation.
>>
>> This is because the page-cache is of a fixed-size. Any fixed-size page-cache
>> will always meet one of the following:
>> 1) Too small to keep the pace when load is high.
>> 2) Too big (in terms of memory footprint) when load is low.
>
> So, we had the order-0 allocations for years at Google, then made the
> horrible mistake to rebase mlx4 driver from the upstream one,
> and we had all these issues under load.
>
> I decided to redo the work I did years ago and upstream it.

Thanks for that. I really appreciate and like your re-factorization.

> I have warned Mellanox in the past (for cx-5 driver) that _any_ high
> order allocation strategy was nice in benchmarks, but terrible in face
> of real server workloads.
> ( And I am not even referring to malicious attacks )

In mlx5, we fully completed the transition to order-0 allocations in
Striding RQ.

> Think about what happens on real servers : In the order of 100,000 TCP
> sockets opened.
>
> Then some incast or outcast problem (Mapreduce jobs are fond of this)
> make thousands of TCP socket accumulate _millions_ of TCP messages in
> their out of order queue per second.
>
> There is no way you can hold millions of pages in mlx4 driver.
> A "dynamic" page pool is going to fail very badly.

I understand your point. Today I am totally aware of the advantages of
using order-0 pages; I am just trying to have the bread buttered on both
sides, by reducing the allocation overhead.
Even though the iperf benchmarks are less realistic than the ones you
described, I think it would still be nice if we could find solutions for
the page allocator in order to keep the high rates we had before.
As it is a common bottleneck, we will always gain by improving the page
allocator, no matter what the page order is.

Just two points regarding the dynamic page-cache I implemented:
1) We define an upper limit for the size of the dynamic page-cache, so
   the metadata does not grow too much.
2) When load is high, our dynamic page-cache _does not exclusively hold
   too many pages_; it just keeps track of pages that are being processed
   in the stack anyway. In memory footprint accounting, I would not count
   such a page in the "driver's footprint", as it is being used by the
   stack.

> Sure, your iperf bench will look great. But who cares ? Do you really
> have customers dedicating hosts to run 1 iperf full time ?
>
> Make sure you run tests with 100,000 TCP sockets, and add networking
> small flaps, with 5% packet losses.
> This is what we really care here.

I definitely agree that benchmarks should improve to reflect more
realistic use cases.

> I will send the v3 of the patch series, I really hope that it will go
> in, because we at Google very much need it ASAP, and I would rather
> not have to keep it private in our tree.
>
> Do not focus on your benchmarks, that is marketing only
> Focus on ability of the servers to _survive_ and continue their work.
>
> You did not answer to my questions by the way.
>
> ethtool -g eth0
> ethtool -l eth0

Yes, sorry for the delayed reply; it was sent separately.

> Thanks.




Re: net/llc: BUG in llc_sap_state_process/skb_set_owner_r

2017-02-12 Thread Eric Dumazet
On Sun, Feb 12, 2017 at 8:44 AM, Andrey Konovalov  wrote:
> Hi,
>
> I've got the following error report while fuzzing the kernel with syzkaller.
>
> On commit 926af6273fc683cd98cd0ce7bf0d04a02eed6742.
>
> A reproducer and .config are attached

Thanks for the report.

llc sets skb->sk without a corresponding skb->destructor.

This is considered invalid by our current standards.

As I added the sanity check in skb_destructor() back in linux-3.12
(!!!), I will send the corresponding LLC fix.

( commit 376c7311bdb6efea3322310333576a04d73fbe4c )
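
The rule described above (an owner pointer must always come with a
destructor) can be modeled in a few lines of userspace C. This is only a
sketch under stated assumptions: the structs and every `model_*` function
are hypothetical stand-ins, not the kernel's `struct sk_buff`, `struct sock`
or any real kernel helper.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: hypothetical stand-ins for the kernel types. */
struct model_sock {
	int refcnt;
};

struct model_skb {
	struct model_sock *sk;
	void (*destructor)(struct model_skb *skb);
};

/* Destructor used by the model: releases the reference on the owner. */
static void model_sock_put(struct model_skb *skb)
{
	skb->sk->refcnt--;
}

/* Valid pairing: the owner and its destructor are set together. */
static void model_set_owner(struct model_skb *skb, struct model_sock *sk)
{
	sk->refcnt++;
	skb->sk = sk;
	skb->destructor = model_sock_put;
}

/* The invariant: an owner without a destructor is considered invalid. */
static bool model_owner_is_valid(const struct model_skb *skb)
{
	return skb->sk == NULL || skb->destructor != NULL;
}

/* Drop ownership: run the destructor if present, then clear both fields. */
static void model_orphan(struct model_skb *skb)
{
	if (skb->destructor)
		skb->destructor(skb);
	skb->destructor = NULL;
	skb->sk = NULL;
}
```

A buffer whose `sk` field is assigned directly, without going through
`model_set_owner()`, fails `model_owner_is_valid()`; that is the state a
sanity check of this kind is meant to catch.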


Re: net/ipv6: use-after-free in sock_wfree

2017-02-12 Thread Andrey Konovalov
On Mon, Jan 9, 2017 at 6:21 PM, Eric Dumazet  wrote:
> On Mon, Jan 9, 2017 at 9:11 AM, Andrey Konovalov  
> wrote:
>> On Mon, Jan 9, 2017 at 6:08 PM, Andrey Konovalov  
>> wrote:
>>> Hi!
>>>
>>> I've got the following error report while running the syzkaller fuzzer.
>>>
>>> On commit a121103c922847ba5010819a3f250f1f7fc84ab8 (4.10-rc3).
>>>
>>> A reproducer is attached.
>>>
>>> ==
>>> BUG: KASAN: use-after-free in sock_wfree+0x118/0x120
>>> Read of size 8 at addr 880062da0060 by task a.out/4140
>>>
>>> page:ea00018b6800 count:1 mapcount:0 mapping:  (null)
>>> index:0x0 compound_mapcount: 0
>>> flags: 0x1008100(slab|head)
>>> raw: 01008100   000180130013
>>> raw: dead0100 dead0200 88006741f140 
>>> page dumped because: kasan: bad access detected
>>>
>>> CPU: 0 PID: 4140 Comm: a.out Not tainted 4.10.0-rc3+ #59
>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>>> Call Trace:
>>>  __dump_stack lib/dump_stack.c:15
>>>  dump_stack+0x292/0x398 lib/dump_stack.c:51
>>>  describe_address mm/kasan/report.c:262
>>>  kasan_report_error+0x121/0x560 mm/kasan/report.c:370
>>>  kasan_report mm/kasan/report.c:392
>>>  __asan_report_load8_noabort+0x3e/0x40 mm/kasan/report.c:413
>>>  sock_flag ./arch/x86/include/asm/bitops.h:324
>>>  sock_wfree+0x118/0x120 net/core/sock.c:1631
>>>  skb_release_head_state+0xfc/0x250 net/core/skbuff.c:655
>>>  skb_release_all+0x15/0x60 net/core/skbuff.c:668
>>>  __kfree_skb+0x15/0x20 net/core/skbuff.c:684
>>>  kfree_skb+0x16e/0x4e0 net/core/skbuff.c:705
>>>  inet_frag_destroy+0x121/0x290 net/ipv4/inet_fragment.c:304
>>>  inet_frag_put ./include/net/inet_frag.h:133
>>>  nf_ct_frag6_gather+0x1125/0x38b0 
>>> net/ipv6/netfilter/nf_conntrack_reasm.c:617
>>>  ipv6_defrag+0x21b/0x350 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c:68
>>>  nf_hook_entry_hookfn ./include/linux/netfilter.h:102
>>>  nf_hook_slow+0xc3/0x290 net/netfilter/core.c:310
>>>  nf_hook ./include/linux/netfilter.h:212
>>>  __ip6_local_out+0x52c/0xaf0 net/ipv6/output_core.c:160
>>>  ip6_local_out+0x2d/0x170 net/ipv6/output_core.c:170
>>>  ip6_send_skb+0xa1/0x340 net/ipv6/ip6_output.c:1722
>>>  ip6_push_pending_frames+0xb3/0xe0 net/ipv6/ip6_output.c:1742
>>>  rawv6_push_pending_frames net/ipv6/raw.c:613
>>>  rawv6_sendmsg+0x2cff/0x4130 net/ipv6/raw.c:927
>>>  inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:744
>>>  sock_sendmsg_nosec net/socket.c:635
>>>  sock_sendmsg+0xca/0x110 net/socket.c:645
>>>  sock_write_iter+0x326/0x620 net/socket.c:848
>>>  new_sync_write fs/read_write.c:499
>>>  __vfs_write+0x483/0x760 fs/read_write.c:512
>>>  vfs_write+0x187/0x530 fs/read_write.c:560
>>>  SYSC_write fs/read_write.c:607
>>>  SyS_write+0xfb/0x230 fs/read_write.c:599
>>>  entry_SYSCALL_64_fastpath+0x1f/0xc2 arch/x86/entry/entry_64.S:203
>>> RIP: 0033:0x7ff26e6f5b79
>>> RSP: 002b:7ff268e0ed98 EFLAGS: 0206 ORIG_RAX: 0001
>>> RAX: ffda RBX: 7ff268e0f9c0 RCX: 7ff26e6f5b79
>>> RDX: 0010 RSI: 20f50fe1 RDI: 0003
>>> RBP: 7ff26ebc1220 R08:  R09: 
>>> R10:  R11: 0206 R12: 
>>> R13: 7ff268e0f9c0 R14: 7ff26efec040 R15: 0003
>>>
>>> The buggy address belongs to the object at 880062da
>>>  which belongs to the cache RAWv6 of size 1504
>>> The buggy address 880062da0060 is located 96 bytes inside
>>>  of 1504-byte region [880062da, 880062da05e0)
>>>
>>> Freed by task 4113:
>>>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
>>>  save_stack+0x43/0xd0 mm/kasan/kasan.c:502
>>>  set_track mm/kasan/kasan.c:514
>>>  kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:578
>>>  slab_free_hook mm/slub.c:1352
>>>  slab_free_freelist_hook mm/slub.c:1374
>>>  slab_free mm/slub.c:2951
>>>  kmem_cache_free+0xb2/0x2c0 mm/slub.c:2973
>>>  sk_prot_free net/core/sock.c:1377
>>>  __sk_destruct+0x49c/0x6e0 net/core/sock.c:1452
>>>  sk_destruct+0x47/0x80 net/core/sock.c:1460
>>>  __sk_free+0x57/0x230 net/core/sock.c:1468
>>>  sk_free+0x23/0x30 net/core/sock.c:1479
>>>  sock_put ./include/net/sock.h:1638
>>>  sk_common_release+0x31e/0x4e0 net/core/sock.c:2782
>>>  rawv6_close+0x54/0x80 net/ipv6/raw.c:1214
>>>  inet_release+0xed/0x1c0 net/ipv4/af_inet.c:425
>>>  inet6_release+0x50/0x70 net/ipv6/af_inet6.c:431
>>>  sock_release+0x8d/0x1e0 net/socket.c:599
>>>  sock_close+0x16/0x20 net/socket.c:1063
>>>  __fput+0x332/0x7f0 fs/file_table.c:208
>>>  fput+0x15/0x20 fs/file_table.c:244
>>>  task_work_run+0x19b/0x270 kernel/task_work.c:116
>>>  exit_task_work ./include/linux/task_work.h:21
>>>  do_exit+0x186b/0x2800 kernel/exit.c:839
>>>  do_group_exit+0x149/0x420 kernel/exit.c:943
>>>  SYSC_exit_group kernel/exit.c:954
>>>  SyS_exit_group+0x1d/0x20 kernel/exit.c:952
>>>  entry_SYSCALL_64_fastpath+0x

Re: [PATCH net-next 0/4] net/sched: Use TC skip flags to reflect HW offload status

2017-02-12 Thread David Miller
From: Or Gerlitz 
Date: Sun, 12 Feb 2017 11:54:25 +0200

> Re the old kernel argument, these patches are small and pointish,
> would it make sense to you to consider them as fixes and push them
> back to the relevant stable kernels?

Sorry, it doesn't work that way.


Re: [PATCH v2 net-next 00/14] mlx4: order-0 allocations and page recycling

2017-02-12 Thread Eric Dumazet
Tariq, please do not send HTML messages; they are not making it to the
netdev mailing list.

On Sun, Feb 12, 2017 at 7:55 AM, Tariq Toukan  wrote:
>
> On 09/02/2017 6:43 PM, Tariq Toukan wrote:
>
> We need to test this series again in our functional and performance
> regression systems.
> It will be running during the weekend, so we can analyze the results and
> update you on Sunday.
>
> Both setups running functional regression hanged, on two different issues.
> Both repros don't seem to be immediate, they do not simply happen by running
> the exact case that caused the hang, but by a series of cases.
> I'm analyzing the issue, looking for a minimal repro.
> For now, you can find the traces copied below.
>
> Regards,
> Tariq
>
>
> Setup 1: x86
>
> [ 8646.869516] [ cut here ]
> [ 8646.870970] WARNING: CPU: 4 PID: 0 at net/ipv4/af_inet.c:1498
> inet_gro_complete+0xa6/0xb0


So by the time inet_gro_complete() is called, iph->protocol has become mangled.

This does not make sense to me; my patches do not change skb->head allocations ...
>
>
>
> Setup 2: PowerPC
>
> [10586.623028] Unable to handle kernel paging request for data at address
> 0x80251f9001c
> [10586.623072] Faulting instruction address: 0xc0236fa8
> [10586.623081] Oops: Kernel access of bad area, sig: 11 [#1]
> [10586.623087] SMP NR_CPUS=2048
> [10586.623087] NUMA
> [10586.623093] pSeries
> [10586.623103] Modules linked in: rdma_ucm ib_ucm rdma_cm iw_cm ib_ipoib
> ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core mlx4_en ptp pps_core mlx4_ib
> ib_core mlx4_core devlink netconsole 8021q garp mrp stp llc nfsv3 nfs
> fscache sg pseries_rng nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables
> ext4 mbcache jbd2 sd_mod ibmvscsi ibmveth scsi_transport_srp [last unloaded:
> devlink]
> [10586.623137] CPU: 8 PID: 30175 Comm: ifconfig Not tainted
> 4.10.0-rc6-eric_v2 #1
> [10586.623144] task: cb1e4480 task.stack: ca3cc000
> [10586.623151] NIP: c0236fa8 LR: d4f738c4 CTR:
> c0236fa0
> [10586.623156] REGS: ca3cf360 TRAP: 0380   Not tainted
> (4.10.0-rc6-eric_v2)
> [10586.623162] MSR: 8280b032 
> [10586.623167]   CR: 28002048  XER: 2000
> [10586.623178] CFAR: d4f87ab0 SOFTE: 1
> [10586.623178] GPR00: d4f739d0 ca3cf5e0 c121da00
> 080251f9
> [10586.623178] GPR04:  0001 0002
> 
> [10586.623178] GPR08: c11a3218 c0026320 080251f9001c
> d4f87a98
> [10586.623178] GPR12: c0236fa0 ce834800 3fffd7c08bcc
> 
> [10586.623178] GPR16:  3fffd7c08bd8 3fffd7c08c18
> 3fffd7c08bd0
> [10586.623178] GPR20: c002b37f1438 c00275b5b400 c002b37f1438
> 0046
> [10586.623178] GPR24: 5deadbeef200 c002b37e0900 
> d4fd0020
> [10586.623178] GPR28: c002b37f0900  
> d4fd0020
> [10586.623223] NIP [c0236fa8] .__free_pages+0x8/0x50
> [10586.623236] LR [d4f738c4]
> .mlx4_en_free_rx_desc.isra.21+0xd4/0x180 [mlx4_en]
> [10586.623243] Call Trace:
> [10586.623248] [ca3cf5e0] [c002b37ed770] 0xc002b37ed770
> (unreliable)
> [10586.623260] [ca3cf690] [d4f739d0]
> .mlx4_en_free_rx_buf+0x60/0x130 [mlx4_en]
> [10586.623274] [ca3cf720] [d4f74658]
> .mlx4_en_deactivate_rx_ring+0x128/0x180 [mlx4_en]
> [10586.623286] [ca3cf7c0] [d4f815c4]
> .mlx4_en_stop_port+0x614/0x950 [mlx4_en]
> [10586.623297] [ca3cf8a0] [d4f81abc]
> .mlx4_en_change_mtu+0x1bc/0x210 [mlx4_en]
> [10586.623307] [ca3cf940] [c0736f50]
> .dev_set_mtu+0x190/0x270
> [10586.623316] [ca3cf9e0] [c07644c8] .dev_ifsioc+0x348/0x3f0
> [10586.623323] [ca3cfa80] [c0764920] .dev_ioctl+0x3b0/0x880
> [10586.623331] [ca3cfb70] [c0712880]
> .sock_do_ioctl+0x90/0xb0
> [10586.623337] [ca3cfc00] [c0713380] .sock_ioctl+0x2b0/0x390
> [10586.623345] [ca3cfca0] [c03059b4]
> .do_vfs_ioctl+0xc4/0x8b0
> [10586.623352] [ca3cfd90] [c0306264] .SyS_ioctl+0xc4/0xe0
> [10586.623360] [ca3cfe30] [c000b184] system_call+0x38/0xe0
> [10586.623367] Instruction dump:
> [10586.623372] fadf0028 7f1cd92a 4bfffe70 7f43d378 7fe4fb78 7fa5eb78
> 38c0 38e5
> [10586.623383] 4bffd689 4bfffe6c 7c0004ac 3943001c <7d005028> 3108
> 7d00512d 40c2fff4
> [10586.623397] ---[ end trace 97ff7bd173bea34a ]---
> [10586.623403]
> [10588.623447] Kernel panic - not syncing: Fatal exception


Yeah, changing MTU seems to be problematic because of the log_rx_info
trick that you already mentioned.

Can you tell me what was the old MTU and what is the new one ?

Thanks


Re: net/llc: bug in llc_pdu_init_as_xid_cmd/skb_over_panic

2017-02-12 Thread Andrey Konovalov
On Fri, Feb 10, 2017 at 4:12 PM, Andrey Konovalov  wrote:
> Hi,
>
> I've got the following error report while fuzzing the kernel with syzkaller.
>
> On commit 926af6273fc683cd98cd0ce7bf0d04a02eed6742.
>
> A reproducer and .config are attached
>
> kernel BUG at net/core/skbuff.c:105!
> invalid opcode:  [#1] SMP KASAN
> Dumping ftrace buffer:
>(ftrace buffer empty)
> Modules linked in:
> CPU: 2 PID: 6558 Comm: syz-executor4 Not tainted 4.10.0-rc7+ #126
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> task: 88003c49c480 task.stack: 88003a5c
> RIP: 0010:skb_panic+0x16f/0x200 net/core/skbuff.c:101
> RSP: 0018:88003a5c77d0 EFLAGS: 00010286
> RAX: 0082 RBX: 88006be991c0 RCX: 
> RDX: 0082 RSI: 814567fc RDI: ed00074b8eec
> RBP: 88003a5c7838 R08: 0001 R09: 
> R10: 0002 R11: 0001 R12: 85231ee0
> R13: 834a6722 R14: 0003 R15: 88006c81c580
> FS:  7f89298c7700() GS:88006de0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 20ee5000 CR3: 58697000 CR4: 06e0
> Call Trace:
>  skb_over_panic net/core/skbuff.c:110 [inline]
>  skb_put+0x18d/0x1d0 net/core/skbuff.c:1437
>  llc_pdu_init_as_xid_cmd include/net/llc_pdu.h:377 [inline]
>  llc_sap_action_send_xid_c+0x2a2/0x3b0 net/llc/llc_s_ac.c:82
>  llc_exec_sap_trans_actions net/llc/llc_sap.c:152 [inline]
>  llc_sap_next_state net/llc/llc_sap.c:181 [inline]
>  llc_sap_state_process+0x26b/0x4e0 net/llc/llc_sap.c:212
>  llc_build_and_send_xid_pkt+0x19f/0x200 net/llc/llc_sap.c:276
>  llc_ui_sendmsg+0xad9/0x1430 net/llc/af_llc.c:938
>  sock_sendmsg_nosec net/socket.c:635 [inline]
>  sock_sendmsg+0xca/0x110 net/socket.c:645
>  ___sys_sendmsg+0x9d2/0xae0 net/socket.c:1985
>  __sys_sendmsg+0x138/0x320 net/socket.c:2019
>  SYSC_sendmsg net/socket.c:2030 [inline]
>  SyS_sendmsg+0x2d/0x50 net/socket.c:2026
>  entry_SYSCALL_64_fastpath+0x1f/0xc2
> RIP: 0033:0x4458b9
> RSP: 002b:7f89298c6b58 EFLAGS: 0286 ORIG_RAX: 002e
> RAX: ffda RBX: 0005 RCX: 004458b9
> RDX: 00040085 RSI: 20001fc8 RDI: 0005
> RBP: 006e1ae0 R08:  R09: 
> R10:  R11: 0286 R12: 00708000
> R13:  R14: c0206434 R15: 201fcfe0
> Code: 00 00 00 48 89 54 24 10 48 c7 c7 60 19 23 85 48 89 74 24 08 4c
> 89 04 24 4c 89 ea 4c 89 7c 24 18 45 89 f0 4c 89 e6 e8 1e c0 38 fe <0f>
> 0b 4c 89 4d b8 4c 89 45 c0 48 89 75 c8 48 89 55 d0 e8 6a 5e
> RIP: skb_panic+0x16f/0x200 net/core/skbuff.c:101 RSP: 88003a5c77d0
> ---[ end trace 89f0ca2ea5bc3ead ]---
> Kernel panic - not syncing: Fatal exception
> Dumping ftrace buffer:
>(ftrace buffer empty)
> Kernel Offset: disabled
> Rebooting in 86400 seconds..

+a...@redhat.com


net/llc: BUG in llc_sap_state_process/skb_set_owner_r

2017-02-12 Thread Andrey Konovalov
Hi,

I've got the following error report while fuzzing the kernel with syzkaller.

On commit 926af6273fc683cd98cd0ce7bf0d04a02eed6742.

A reproducer and .config are attached

kernel BUG at ./include/linux/skbuff.h:2389!
invalid opcode:  [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 9315 Comm: syz-executor2 Not tainted 4.10.0-rc7+ #126
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: 88006861c480 task.stack: 88006a988000
RIP: 0010:skb_set_owner_r include/linux/skbuff.h:2389 [inline]
RIP: 0010:__sock_queue_rcv_skb+0x8c0/0xda0 net/core/sock.c:425
RSP: 0018:88003ec06b58 EFLAGS: 00010206
RAX: 88006861c480 RBX: 8800371c2568 RCX: 
RDX: 0100 RSI: 110006ba08ab RDI: 880035d04560
RBP: 88003ec06dc0 R08: 0002 R09: 0001
R10:  R11: dc00 R12: 880035d04540
R13: 88003ec06d98 R14: 8800371c2590 R15: 880035d045a0
FS:  7fa8005ac700() GS:88003ec0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 004a6f68 CR3: 38e25000 CR4: 06f0
Call Trace:
 
 sock_queue_rcv_skb+0x3a/0x50 net/core/sock.c:451
 llc_sap_state_process+0x3e3/0x4e0 net/llc/llc_sap.c:220
 llc_sap_rcv net/llc/llc_sap.c:294 [inline]
 llc_sap_handler+0x695/0x1320 net/llc/llc_sap.c:434
 llc_rcv+0x6da/0xed0 net/llc/llc_input.c:208
 __netif_receive_skb_core+0x1ae5/0x3400 net/core/dev.c:4190
 __netif_receive_skb+0x2a/0x170 net/core/dev.c:4228
 process_backlog+0xe5/0x6c0 net/core/dev.c:4839
 napi_poll net/core/dev.c:5202 [inline]
 net_rx_action+0xe70/0x1900 net/core/dev.c:5267
 __do_softirq+0x2fb/0xb7d kernel/softirq.c:284
 do_softirq_own_stack+0x1c/0x30 arch/x86/entry/entry_64.S:902
 
 do_softirq.part.17+0x1e8/0x230 kernel/softirq.c:328
 do_softirq kernel/softirq.c:176 [inline]
 __local_bh_enable_ip+0x1f2/0x200 kernel/softirq.c:181
 local_bh_enable include/linux/bottom_half.h:31 [inline]
 rcu_read_unlock_bh include/linux/rcupdate.h:971 [inline]
 __dev_queue_xmit+0xd87/0x2860 net/core/dev.c:3399
 dev_queue_xmit+0x17/0x20 net/core/dev.c:3405
 llc_build_and_send_ui_pkt+0x240/0x330 net/llc/llc_output.c:74
 llc_ui_sendmsg+0x98d/0x1430 net/llc/af_llc.c:928
 sock_sendmsg_nosec net/socket.c:635 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:645
 ___sys_sendmsg+0x9d2/0xae0 net/socket.c:1985
 __sys_sendmsg+0x138/0x320 net/socket.c:2019
 SYSC_sendmsg net/socket.c:2030 [inline]
 SyS_sendmsg+0x2d/0x50 net/socket.c:2026
 entry_SYSCALL_64_fastpath+0x1f/0xc2
RIP: 0033:0x4458b9
RSP: 002b:7fa8005abb58 EFLAGS: 0286 ORIG_RAX: 002e
RAX: ffda RBX: 0006 RCX: 004458b9
RDX: 00040880 RSI: 20003000 RDI: 0006
RBP: 006e1b00 R08:  R09: 
R10:  R11: 0286 R12: 00708000
R13: 0082 R14: 2000 R15: 
Code: 4b 50 fe e9 b1 f8 ff ff e8 3e 4a 50 fe e9 78 f8 ff ff e8 34 4a
50 fe e9 6d f9 ff ff e8 2a 4a 50 fe e9 93 f9 ff ff e8 20 0a 26 fe <0f>
0b e8 19 0a 26 fe be 3c 01 00 00 48 c7 c7 e0 e9 22 85 e8 b8
RIP: skb_set_owner_r include/linux/skbuff.h:2389 [inline] RSP: 88003ec06b58
RIP: __sock_queue_rcv_skb+0x8c0/0xda0 net/core/sock.c:425 RSP: 88003ec06b58
---[ end trace 58af2d02ad7f84f0 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..
// autogenerated by syzkaller (http://github.com/google/syzkaller)

#ifndef __NR_sendmsg
#define __NR_sendmsg 46
#endif
#ifndef __NR_mmap
#define __NR_mmap 9
#endif
#ifndef __NR_socket
#define __NR_socket 41
#endif
#ifndef __NR_connect
#define __NR_connect 42
#endif

#define _GNU_SOURCE

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#include 
#include 
#include 
#include 
#include 
#include 

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

const int kFailStatus = 67;
const int kErrorStatus = 68;
const int kRetryStatus = 69;

__attribute__((noreturn)) void doexit(int status)
{
  volatile unsigned i;
  syscall(__NR_exit_group, status);
  for (i = 0;; i++) {
  }
}

__attribute__((noreturn)) void fail(const char* msg, ...)
{
  int e = errno;
  fflush(stdout);
  va_list args;
  va_start(args, msg);
  vfprintf(stderr, msg, args);
  va_end(args);
  fprintf(stderr, " (errno %d)\n", e);
  doexit(e == ENOMEM ? kRetryStatus : kFailStatus);
}

__attribute__((noreturn)) void exitf(const char* msg, ...)
{
  int e = errno;
  fflush(stdout);
  va_list args;
  va_start(args, msg);
  vfprintf(stderr, msg, args);
  va_end(args);
  fprintf(stderr, " (errno %d)\n", e);
  doexit(kRetryStatus);
}

static int flag_debug;

void debug(const char* 

[PATCH] net: neterion: vxge: use new api ethtool_{get|set}_link_ksettings

2017-02-12 Thread Philippe Reynes
The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone may test this patch.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/neterion/vxge/vxge-ethtool.c |   47 -
 1 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/neterion/vxge/vxge-ethtool.c 
b/drivers/net/ethernet/neterion/vxge/vxge-ethtool.c
index 9a29670..db55e6d 100644
--- a/drivers/net/ethernet/neterion/vxge/vxge-ethtool.c
+++ b/drivers/net/ethernet/neterion/vxge/vxge-ethtool.c
@@ -38,9 +38,9 @@
 };
 
 /**
- * vxge_ethtool_sset - Sets different link parameters.
+ * vxge_ethtool_set_link_ksettings - Sets different link parameters.
  * @dev: device pointer.
- * @info: pointer to the structure with parameters given by ethtool to set
+ * @cmd: pointer to the structure with parameters given by ethtool to set
  * link information.
  *
  * The function sets different link parameters provided by the user onto
@@ -48,44 +48,51 @@
  * Return value:
  * 0 on success.
  */
-static int vxge_ethtool_sset(struct net_device *dev, struct ethtool_cmd *info)
+static int
+vxge_ethtool_set_link_ksettings(struct net_device *dev,
+   const struct ethtool_link_ksettings *cmd)
 {
/* We currently only support 10Gb/FULL */
-   if ((info->autoneg == AUTONEG_ENABLE) ||
-   (ethtool_cmd_speed(info) != SPEED_10000) ||
-   (info->duplex != DUPLEX_FULL))
+   if ((cmd->base.autoneg == AUTONEG_ENABLE) ||
+   (cmd->base.speed != SPEED_10000) ||
+   (cmd->base.duplex != DUPLEX_FULL))
return -EINVAL;
 
return 0;
 }
 
 /**
- * vxge_ethtool_gset - Return link specific information.
+ * vxge_ethtool_get_link_ksettings - Return link specific information.
  * @dev: device pointer.
- * @info: pointer to the structure with parameters given by ethtool
+ * @cmd: pointer to the structure with parameters given by ethtool
  * to return link information.
  *
  * Returns link specific information like speed, duplex etc.. to ethtool.
  * Return value :
  * return 0 on success.
  */
-static int vxge_ethtool_gset(struct net_device *dev, struct ethtool_cmd *info)
+static int vxge_ethtool_get_link_ksettings(struct net_device *dev,
+  struct ethtool_link_ksettings *cmd)
 {
-   info->supported = (SUPPORTED_10000baseT_Full | SUPPORTED_FIBRE);
-   info->advertising = (ADVERTISED_10000baseT_Full | ADVERTISED_FIBRE);
-   info->port = PORT_FIBRE;
+   ethtool_link_ksettings_zero_link_mode(cmd, supported);
+   ethtool_link_ksettings_add_link_mode(cmd, supported, 10000baseT_Full);
+   ethtool_link_ksettings_add_link_mode(cmd, supported, FIBRE);
 
-   info->transceiver = XCVR_EXTERNAL;
+   ethtool_link_ksettings_zero_link_mode(cmd, advertising);
+   ethtool_link_ksettings_add_link_mode(cmd, advertising, 10000baseT_Full);
+   ethtool_link_ksettings_add_link_mode(cmd, advertising, FIBRE);
+
+   cmd->base.port = PORT_FIBRE;
 
if (netif_carrier_ok(dev)) {
-   ethtool_cmd_speed_set(info, SPEED_10000);
-   info->duplex = DUPLEX_FULL;
+   cmd->base.speed = SPEED_10000;
+   cmd->base.duplex = DUPLEX_FULL;
} else {
-   ethtool_cmd_speed_set(info, SPEED_UNKNOWN);
-   info->duplex = DUPLEX_UNKNOWN;
+   cmd->base.speed = SPEED_UNKNOWN;
+   cmd->base.duplex = DUPLEX_UNKNOWN;
}
 
-   info->autoneg = AUTONEG_DISABLE;
+   cmd->base.autoneg = AUTONEG_DISABLE;
return 0;
 }
 
@@ -1126,8 +1133,6 @@ static int vxge_fw_flash(struct net_device *dev, struct 
ethtool_flash *parms)
 }
 
 static const struct ethtool_ops vxge_ethtool_ops = {
-   .get_settings   = vxge_ethtool_gset,
-   .set_settings   = vxge_ethtool_sset,
.get_drvinfo= vxge_ethtool_gdrvinfo,
.get_regs_len   = vxge_ethtool_get_regs_len,
.get_regs   = vxge_ethtool_gregs,
@@ -1139,6 +1144,8 @@ static int vxge_fw_flash(struct net_device *dev, struct 
ethtool_flash *parms)
.get_sset_count = vxge_ethtool_get_sset_count,
.get_ethtool_stats  = vxge_get_ethtool_stats,
.flash_device   = vxge_fw_flash,
+   .get_link_ksettings = vxge_ethtool_get_link_ksettings,
+   .set_link_ksettings = vxge_ethtool_set_link_ksettings,
 };
 
 void vxge_initialize_ethtool_ops(struct net_device *ndev)
-- 
1.7.4.4



Re: [PATCH v2 net-next 00/14] mlx4: order-0 allocations and page recycling

2017-02-12 Thread Tariq Toukan


On 09/02/2017 6:56 PM, Eric Dumazet wrote:

Default, out of box.

Well. Please report :

ethtool  -l eth0
ethtool -g eth0

$ ethtool -g p1p1
Ring parameters for p1p1:
Pre-set maximums:
RX: 8192
RX Mini:0
RX Jumbo:   0
TX: 8192
Current hardware settings:
RX: 1024
RX Mini:0
RX Jumbo:   0
TX: 512

$ ethtool -l p1p1
Channel parameters for p1p1:
Pre-set maximums:
RX: 128
TX: 32
Other:  0
Combined:   0
Current hardware settings:
RX: 8
TX: 32
Other:  0
Combined:   0



RE: [PATCH net-next v4 1/2] qed: Add infrastructure for PTP support.

2017-02-12 Thread Mintz, Yuval
> The original would return val == 1, period == 6249; While this does have
> some error [val / (period * 16 + 8) is slightly bigger than 1 / 10^9, error at
> 18[?] digit after dot], it's the best we can configure for the HW.

Correction. That's actually not *the best* we could configure -
 due to stopping at the first value between equal differences.
[you've already commented on that in the past, mentioning  that we
should use >= and not >].

But that doesn't change the fact that your approximation can choose
numbers which can't be configured to the HW, and as a result incorrectly
pick some that will not minimize the approximation error.

> One simple adjustment we could do is simply break from the loop if 'diff ==
> 0'. At least for small PPB value this would be hit relatively quickly.

Given the previous correction, the suggestion would also include
reversing the order of the iteration [7 -> 1 instead of 1 -> 7].


Re: [PATCH v2 net-next 00/14] mlx4: order-0 allocations and page recycling

2017-02-12 Thread Eric Dumazet
On Sun, Feb 12, 2017 at 7:04 AM, Tariq Toukan  wrote:

>
> We consistently see this behavior: the higher the BW, the sharper the
> degradation.
>
> This is because the page-cache is of a fixed-size. Any fixed-size page-cache
> will always meet one of the following:
> 1) Too small to keep the pace when load is high.
> 2) Too big (in terms of memory footprint) when load is low.
>

So, we had the order-0 allocations for years at Google, then made the
horrible mistake to rebase mlx4 driver from the upstream one,
and we had all these issues under load.

I decided to redo the work I did years ago and upstream it.

I have warned Mellanox in the past (for cx-5 driver) that _any_ high
order allocation strategy was nice in benchmarks, but terrible in face
of real server workloads.
( And I am not even referring to malicious attacks )

Think about what happens on real servers : In the order of 100,000 TCP
sockets opened.

Then some incast or outcast problem (Mapreduce jobs are fond of this)
make thousands of TCP socket accumulate _millions_ of TCP messages in
their out of order queue per second.

There is no way you can hold millions of pages in mlx4 driver.
A "dynamic" page pool is going to fail very badly.

Sure, your iperf bench will look great. But who cares ? Do you really
have customers dedicating hosts to run 1 iperf full time ?

Make sure you run tests with 100,000 TCP sockets, and add networking
small flaps, with 5% packet losses.
This is what we really care here.

I will send the v3 of the patch series, I really hope that it will go
in, because we at Google very much need it ASAP, and I would rather
not have to keep it private in our tree.

Do not focus on your benchmarks, that is marketing only
Focus on ability of the servers to _survive_ and continue their work.

You did not answer to my questions by the way.

ethtool -g eth0
ethtool -l eth0

Thanks.


RE: [PATCH net-next v4 1/2] qed: Add infrastructure for PTP support.

2017-02-12 Thread Mintz, Yuval
> > Your suggestion seems to:
> >   a. Assume that the required period should be in ns, not in
> >   16*ns units.
> >   b. mishandles the +8/-8 in the calculation.
> >   c. Doesn't seem to consider the upper bound on period.
> 
> Duh, you would have to convert the result into the proper form for the HW
> register and add bounds checking.  I mean, that goes without saying.
> The important fact is that your algorithm is not optimal for ppm < 60.

Your algorithm ignores the HW limitation. Consider (ppb == 1): 
your logic would output N == 7, *M == 70,
   Which has perfect accuracy [N / *M is 1 / 10^9].
But the solution for
   'period' * 16 + 8 == 7 * 10^9
isn't a whole number, so this result doesn't really reflect the actual
approximation error since we couldn't configure it to HW.

The original would return val == 1, period == 62499999; while this
does have some error [val / (period * 16 + 8) is slightly bigger
than 1 / 10^9, error at the 18th[?] digit after the dot], it's the best
we can configure for the HW.

One simple adjustment we could make is to break from the loop
if 'diff == 0'. At least for small PPB values this would be hit relatively
quickly.
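
To make the idea concrete, here is a minimal user-space sketch of such a
search with the early exit. It is not the qed code: the names ptp_adj,
adj_error and ptp_best_adj are hypothetical, and the upper bound on
'period' is passed in as a parameter because its real value is not stated
in this thread. It only assumes what is said above: val is 1..7 ns and the
HW interval is period * 16 + 8 ns.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Hypothetical result type: val ns are applied every (period * 16 + 8) ns. */
struct ptp_adj {
	int64_t val;
	int64_t period;
};

/*
 * Error of val / (period * 16 + 8) against ppb / 10^9, scaled by a
 * constant factor of 10^9. *dif receives the exact integer difference.
 */
static double adj_error(int64_t ppb, int64_t val, int64_t period, int64_t *dif)
{
	int64_t interval = period * 16 + 8;
	int64_t d = ppb * interval - val * NSEC_PER_SEC;

	*dif = d < 0 ? -d : d;
	return (double)*dif / (double)interval;
}

/* Assumes ppb > 0; period_max is an assumed HW limit supplied by the caller. */
static struct ptp_adj ptp_best_adj(int64_t ppb, int64_t period_max)
{
	struct ptp_adj best = { 0, 0 };
	double best_err = 0.0;

	assert(ppb > 0 && period_max >= 1);

	for (int64_t val = 1; val <= 7; val++) {
		/* Ideal interval is val * 10^9 / ppb; convert to register units. */
		int64_t period = (val * NSEC_PER_SEC / ppb - 8) / 16;

		if (period < 1)
			period = 1;
		if (period > period_max)
			period = period_max;

		/* Check both rounding ends of the truncated division. */
		for (int64_t p = period; p <= period + 1 && p <= period_max; p++) {
			int64_t dif;
			double err = adj_error(ppb, val, p, &dif);

			if (best.val == 0 || err < best_err) {
				best.val = val;
				best.period = p;
				best_err = err;
			}
			if (dif == 0)
				return best;	/* exact match, stop searching */
		}
	}
	return best;
}
```

For instance, ptp_best_adj(25000000, 1LL << 40) stops at the very first
candidate with val == 1 and period == 2, since 2 * 16 + 8 == 40 is exactly
10^9 / 25000000.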

> > One thing I still don't get is *why* we're trying to optimize this
> > area of the code -
> 
> So you prefer using 21 64-bit divisions when using 8 produces better results?

No. In an ideal world, I would have liked optimizing everything.
But in this world, if I do find time to spend on optimizations,
I'd rather do that for the stuff that matters, i.e., the datapath.


Re: [PATCH V5 for-next 16/21] RDMA/bnxt_re: Support poll_cq verb

2017-02-12 Thread Leon Romanovsky
On Fri, Feb 10, 2017 at 03:19:48AM -0800, Selvin Xavier wrote:
> Enables the fastpath ib_poll_cq verb.
>
> v2: Fixed sparse warnings
> v3: Fixes endianness related warnings reported by sparse. Also, fixes
> smatch and checkpatch warnings
> v5: Uses ETH_P_IBOE macro for RoCE ethertype
>
> Signed-off-by: Eddie Wai 
> Signed-off-by: Devesh Sharma 
> Signed-off-by: Somnath Kotur 
> Signed-off-by: Sriharsha Basavapatna 
> Signed-off-by: Selvin Xavier 
> ---
>  drivers/infiniband/hw/bnxt_re/ib_verbs.c | 522 
>  drivers/infiniband/hw/bnxt_re/ib_verbs.h |   1 +
>  drivers/infiniband/hw/bnxt_re/main.c |  22 +-
>  drivers/infiniband/hw/bnxt_re/qplib_fp.c | 560 ++-
>  drivers/infiniband/hw/bnxt_re/qplib_fp.h |   7 +-
>  5 files changed, 1107 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
> index 54d85bc..33af2e3 100644
> --- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
> +++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
> @@ -2230,6 +2230,528 @@ struct ib_cq *bnxt_re_create_cq(struct ib_device *ibdev,
>   return ERR_PTR(rc);
>  }
>
> +static u8 __req_to_ib_wc_status(u8 qstatus)
> +{
> + switch (qstatus) {
> + case CQ_REQ_STATUS_OK:
> + return IB_WC_SUCCESS;
> + case CQ_REQ_STATUS_BAD_RESPONSE_ERR:
> + return IB_WC_BAD_RESP_ERR;
> + case CQ_REQ_STATUS_LOCAL_LENGTH_ERR:
> + return IB_WC_LOC_LEN_ERR;
> + case CQ_REQ_STATUS_LOCAL_QP_OPERATION_ERR:
> + return IB_WC_LOC_QP_OP_ERR;
> + case CQ_REQ_STATUS_LOCAL_PROTECTION_ERR:
> + return IB_WC_LOC_PROT_ERR;
> + case CQ_REQ_STATUS_MEMORY_MGT_OPERATION_ERR:
> + return IB_WC_GENERAL_ERR;
> + case CQ_REQ_STATUS_REMOTE_INVALID_REQUEST_ERR:
> + return IB_WC_REM_INV_REQ_ERR;
> + case CQ_REQ_STATUS_REMOTE_ACCESS_ERR:
> + return IB_WC_REM_ACCESS_ERR;
> + case CQ_REQ_STATUS_REMOTE_OPERATION_ERR:
> + return IB_WC_REM_OP_ERR;
> + case CQ_REQ_STATUS_RNR_NAK_RETRY_CNT_ERR:
> + return IB_WC_RNR_RETRY_EXC_ERR;
> + case CQ_REQ_STATUS_TRANSPORT_RETRY_CNT_ERR:
> + return IB_WC_RETRY_EXC_ERR;
> + case CQ_REQ_STATUS_WORK_REQUEST_FLUSHED_ERR:
> + return IB_WC_WR_FLUSH_ERR;
> + default:
> + return IB_WC_GENERAL_ERR;
> + }
> + return 0;
> +}
> +
> +static u8 __rawqp1_to_ib_wc_status(u8 qstatus)
> +{
> + switch (qstatus) {
> + case CQ_RES_RAWETH_QP1_STATUS_OK:
> + return IB_WC_SUCCESS;
> + case CQ_RES_RAWETH_QP1_STATUS_LOCAL_ACCESS_ERROR:
> + return IB_WC_LOC_ACCESS_ERR;
> + case CQ_RES_RAWETH_QP1_STATUS_HW_LOCAL_LENGTH_ERR:
> + return IB_WC_LOC_LEN_ERR;
> + case CQ_RES_RAWETH_QP1_STATUS_LOCAL_PROTECTION_ERR:
> + return IB_WC_LOC_PROT_ERR;
> + case CQ_RES_RAWETH_QP1_STATUS_LOCAL_QP_OPERATION_ERR:
> + return IB_WC_LOC_QP_OP_ERR;
> + case CQ_RES_RAWETH_QP1_STATUS_MEMORY_MGT_OPERATION_ERR:
> + return IB_WC_GENERAL_ERR;
> + case CQ_RES_RAWETH_QP1_STATUS_WORK_REQUEST_FLUSHED_ERR:
> + return IB_WC_WR_FLUSH_ERR;
> + case CQ_RES_RAWETH_QP1_STATUS_HW_FLUSH_ERR:
> + return IB_WC_WR_FLUSH_ERR;
> + default:
> + return IB_WC_GENERAL_ERR;
> + }
> +}
> +
> +static u8 __rc_to_ib_wc_status(u8 qstatus)
> +{
> + switch (qstatus) {
> + case CQ_RES_RC_STATUS_OK:
> + return IB_WC_SUCCESS;
> + case CQ_RES_RC_STATUS_LOCAL_ACCESS_ERROR:
> + return IB_WC_LOC_ACCESS_ERR;
> + case CQ_RES_RC_STATUS_LOCAL_LENGTH_ERR:
> + return IB_WC_LOC_LEN_ERR;
> + case CQ_RES_RC_STATUS_LOCAL_PROTECTION_ERR:
> + return IB_WC_LOC_PROT_ERR;
> + case CQ_RES_RC_STATUS_LOCAL_QP_OPERATION_ERR:
> + return IB_WC_LOC_QP_OP_ERR;
> + case CQ_RES_RC_STATUS_MEMORY_MGT_OPERATION_ERR:
> + return IB_WC_GENERAL_ERR;
> + case CQ_RES_RC_STATUS_REMOTE_INVALID_REQUEST_ERR:
> + return IB_WC_REM_INV_REQ_ERR;
> + case CQ_RES_RC_STATUS_WORK_REQUEST_FLUSHED_ERR:
> + return IB_WC_WR_FLUSH_ERR;
> + case CQ_RES_RC_STATUS_HW_FLUSH_ERR:
> + return IB_WC_WR_FLUSH_ERR;
> + default:
> + return IB_WC_GENERAL_ERR;
> + }
> +}
> +

Why don't you use these defines directly?

Thanks

